`nlu.text_processing`

This module is used for preprocessing user inputs before further analysis.

The user utterance is split into tokens, each of which carries additional information such as its position in the utterance, its lemma, and whether it is a stop word.

Module Contents

class nlu.text_processing.Span(text: str, start: int, end: int | None = None, lemma: str | None = None)

overlaps(other: Span) → bool

Checks whether two spans overlap in the original utterance.
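The overlap check can be illustrated with a minimal sketch. `SpanSketch` is a hypothetical stand-in, not the module's `Span` class. It assumes that `start` and `end` are character offsets, that `end` is exclusive, and that a missing `end` defaults to `start + len(text)`; the source confirms none of these conventions.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class SpanSketch:
    """Hypothetical stand-in for Span; not the module's actual class."""

    text: str
    start: int
    end: Optional[int] = None

    def __post_init__(self) -> None:
        # Assumption: a missing end defaults to start + len(text).
        if self.end is None:
            self.end = self.start + len(self.text)

    def overlaps(self, other: "SpanSketch") -> bool:
        # Assumption: end offsets are exclusive, so spans that merely
        # touch (one ends where the other starts) do not overlap.
        return self.start < other.end and other.start < self.end


utterance = "I like action movies"
like = SpanSketch("like", 2)                # offsets 2 to 6
like_action = SpanSketch("like action", 2)  # offsets 2 to 13
movies = SpanSketch("movies", 14)           # offsets 14 to 20
```

Under these assumptions, `like.overlaps(like_action)` is true because both spans cover characters 2 to 6, while `like.overlaps(movies)` is false.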

class nlu.text_processing.Token(text: str, start: int, end: int | None = None, lemma: str | None = None, is_stop: bool | None = False)

Bases: Span

class nlu.text_processing.Tokenizer(additional_stop_words: List[str] = None)

process_text(text: str) → List[Token]

Processes the given text.

The text is split into tokens which can be mapped back to the original text.
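To make "mapped back to the original text" concrete, here is a minimal sketch of splitting text into tokens that keep their offsets. `TokenSketch`, `process_text_sketch`, and `DEFAULT_STOP_WORDS` are hypothetical names introduced for illustration. The regular expression and the stop-word list are assumptions, and the real `Tokenizer` may work quite differently.

```python
import re
from typing import List, NamedTuple, Optional


class TokenSketch(NamedTuple):
    """Hypothetical stand-in for Token; not the module's actual class."""

    text: str
    start: int
    end: int
    is_stop: bool


# Assumption: a tiny illustrative stop-word list, not the module's own.
DEFAULT_STOP_WORDS = {"a", "the", "of"}


def process_text_sketch(
    text: str, additional_stop_words: Optional[List[str]] = None
) -> List[TokenSketch]:
    stop_words = DEFAULT_STOP_WORDS | set(additional_stop_words or [])
    # Each regex match carries its own offsets, so every token can be
    # sliced back out of the original text.
    return [
        TokenSketch(m.group(), m.start(), m.end(), m.group().lower() in stop_words)
        for m in re.finditer(r"\w+", text)
    ]


text = "The plot of Alien!"
tokens = process_text_sketch(text, additional_stop_words=["plot"])
```

For every token in this sketch, `text[token.start:token.end] == token.text` holds, which is the property that lets later stages highlight or replace parts of the original utterance.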

remove_punctuation(text: str) → str

Removes punctuation marks from the utterance using a set of predefined patterns.
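One way to strip punctuation with a pattern is sketched below. `PUNCTUATION_PATTERN` and `remove_punctuation_sketch` are hypothetical names, and the pattern is an assumption, not the one the module defines. The sketch replaces each mark with a space instead of deleting it, which is a design choice made here so that character offsets in the cleaned string still line up with the original.

```python
import re

# Assumption: treat every character that is neither a word character
# nor whitespace as punctuation. The module's real patterns may differ.
PUNCTUATION_PATTERN = re.compile(r"[^\w\s]")


def remove_punctuation_sketch(text: str) -> str:
    # Substitute a single space for each mark so the string length,
    # and therefore every character offset, is preserved.
    return PUNCTUATION_PATTERN.sub(" ", text)


original = "Hello, world!"
cleaned = remove_punctuation_sketch(original)
```

Note that this simple pattern also replaces apostrophes, so a contraction such as "don't" would be split in two; a production pattern would likely need to handle such cases.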

lemmatize_text(text: str) → str

Returns the lemma of the given text.

tokenize(word_tokens: List[str], text: str) → List[Token]

Returns a tokenized copy of text.
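Since the method receives both the word tokens and the text, one plausible reading is that it aligns each word with its position in the text. The sketch below illustrates that idea with a left-to-right search. `align_tokens_sketch` is a hypothetical helper, and it returns plain tuples instead of `Token` objects; the module's actual alignment logic is not documented in the source.

```python
from typing import List, Tuple


def align_tokens_sketch(
    word_tokens: List[str], text: str
) -> List[Tuple[str, int, int]]:
    """Pair each word with its (start, end) offsets in text."""
    aligned = []
    cursor = 0
    for word in word_tokens:
        # Search from the end of the previous match so that repeated
        # words map to successive occurrences, not always the first.
        start = text.find(word, cursor)
        if start == -1:
            raise ValueError(f"token {word!r} not found after offset {cursor}")
        end = start + len(word)
        aligned.append((word, start, end))
        cursor = end
    return aligned


text = "to be, or not to be"
aligned = align_tokens_sketch(["to", "be", "or", "not", "to", "be"], text)
```

Moving the cursor forward after each match is what keeps the second "to" and "be" from being mapped onto the first occurrences.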