nlu.text_processing

This module is used for preprocessing user inputs before further analysis.

The user utterance is broken into tokens that carry additional information, such as character offsets, a lemma, and a stop-word flag.

Module Contents

Classes

Span

Token

Tokenizer

class nlu.text_processing.Span(text: str, start: int, end: int | None = None, lemma: str | None = None)
overlaps(other: Span) -> bool

Checks whether two spans overlap in the original utterance.

Parameters:

other – Span to compare against.

Returns:

True if there is overlap.
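
The overlap check can be illustrated with a small stand-in class. This is a minimal sketch, not the library's implementation: SketchSpan is a hypothetical name, and it assumes that offsets form a half-open interval [start, end) and that a missing end defaults to start + len(text). Neither assumption is confirmed by this reference.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class SketchSpan:
    """Hypothetical stand-in for Span; not the library's class."""

    text: str
    start: int
    end: Optional[int] = None

    def __post_init__(self) -> None:
        # Assumption: when end is omitted, derive it from the text length.
        if self.end is None:
            self.end = self.start + len(self.text)

    def overlaps(self, other: "SketchSpan") -> bool:
        # Two half-open intervals overlap when each one starts
        # before the other one ends.
        return self.start < other.end and other.start < self.end


new_york = SketchSpan("new york", 10, 18)
york = SketchSpan("york", 14)
city = SketchSpan("city", 19)
```

Under these assumptions, new_york.overlaps(york) is true because both cover characters 14 to 17, while new_york.overlaps(city) is false.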

class nlu.text_processing.Token(text: str, start: int, end: int | None = None, lemma: str | None = None, is_stop: bool | None = False)

Bases: Span

class nlu.text_processing.Tokenizer(additional_stop_words: List[str] = None)
process_text(text: str) -> List[Token]

Processes given text.

The text is split into tokens which can be mapped back to the original text.

Parameters:

text – A piece of text.

Returns:

List of tokens.
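
Mapping tokens back to the original text usually relies on character offsets. The following is a minimal sketch of that idea, assuming a plain regex word split; split_with_offsets is a hypothetical helper and not part of nlu.text_processing, whose actual splitting rules are not described here.

```python
import re
from typing import List, Tuple


def split_with_offsets(text: str) -> List[Tuple[str, int, int]]:
    """Hypothetical helper: returns (token_text, start, end) triples."""
    # \w+ matches runs of word characters; start() and end() give the
    # half-open character range of each match in the original string.
    return [(m.group(), m.start(), m.end()) for m in re.finditer(r"\w+", text)]


utterance = "Book a table, please!"
triples = split_with_offsets(utterance)
```

Slicing the original string with a triple's offsets, for example utterance[start:end], returns that token's text, which is what makes the tokens traceable to the utterance.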

remove_punctuation(text: str) -> str

Removes punctuation marks from the utterance based on predefined patterns.

Parameters:

text – A piece of text.

Returns:

A piece of text without punctuation.
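
One common way to strip punctuation is a compiled regular expression. The sketch below assumes that "punctuation" means the ASCII characters in string.punctuation and that each mark is replaced by a space so character offsets stay aligned with the original text. Both are assumptions for illustration; strip_punctuation is a hypothetical helper, and the patterns the library actually uses may differ.

```python
import re
import string

# Assumption: the ASCII punctuation set. re.escape keeps characters such
# as ']' and '\\' from being read as regex syntax inside the class.
_PUNCT_PATTERN = re.compile(f"[{re.escape(string.punctuation)}]")


def strip_punctuation(text: str) -> str:
    """Hypothetical helper; not the library's remove_punctuation."""
    # Substituting a space (rather than an empty string) preserves the
    # length of the text, so offsets computed later still line up.
    return _PUNCT_PATTERN.sub(" ", text)


cleaned = strip_punctuation("Hi, there!")
```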

lemmatize_text(text: str) -> str

Returns the lemmatized form of the text.

Parameters:

text – A piece of text.

Returns:

Lemmatized piece of text.

tokenize(word_tokens: List[str], text: str) -> List[Token]

Returns a tokenized copy of text.

Parameters:

word_tokens – List of word strings from the text.

text – A piece of text.

Returns:

List of tokens.
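
Because tokenize receives both the word strings and the text they came from, one plausible approach is to locate each word in the text from left to right. The sketch below illustrates that alignment under the assumption that the words appear in the text in order and unmodified; align_tokens is a hypothetical helper and is not taken from the library.

```python
from typing import List, Tuple


def align_tokens(word_tokens: List[str], text: str) -> List[Tuple[str, int, int]]:
    """Hypothetical helper: finds each word's (start, end) offsets in text."""
    aligned = []
    cursor = 0  # Search resumes here so repeated words get distinct offsets.
    for word in word_tokens:
        start = text.find(word, cursor)
        if start == -1:
            raise ValueError(f"{word!r} not found after position {cursor}")
        end = start + len(word)
        aligned.append((word, start, end))
        cursor = end
    return aligned


aligned = align_tokens(["to", "go", "to"], "to go to")
```

Advancing the cursor past each match is what keeps the second "to" from being mapped to the same offsets as the first.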