nlu.text_processing

This module preprocesses user input before further analysis.

The user utterance is broken into tokens, each of which carries additional information about it.

Module Contents

Classes

Span

Token

Tokenizer

class nlu.text_processing.Span(text: str, start: int, end: Optional[int] = None, lemma: Optional[str] = None)
overlaps(other: Span) → bool

Checks whether two spans overlap in the original utterance.

Parameters

other – Span to compare against.

Returns

True if there is overlap.
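
As an illustration of the overlap check, here is a minimal sketch. `SpanSketch` is a hypothetical stand-in, not the module's `Span`. It assumes offsets are half-open (`end` is exclusive) and that a missing `end` defaults to `start + len(text)`; the source confirms neither.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class SpanSketch:
    """Hypothetical stand-in for Span; not the library's implementation."""

    text: str
    start: int
    end: Optional[int] = None

    def __post_init__(self) -> None:
        # Assumption: a missing end defaults to start + len(text).
        if self.end is None:
            self.end = self.start + len(self.text)

    def overlaps(self, other: "SpanSketch") -> bool:
        # Assumption: offsets are half-open [start, end), so two spans
        # that merely touch at a boundary do not count as overlapping.
        return self.start < other.end and other.start < self.end


utterance = "book a table for two"
book_a = SpanSketch("book a", 0)    # covers [0, 6)
a_table = SpanSketch("a table", 5)  # covers [5, 12)
two = SpanSketch("two", 17)         # covers [17, 20)

print(book_a.overlaps(a_table))  # the two spans share the character "a"
print(book_a.overlaps(two))      # the two spans are disjoint
```

With inclusive end offsets, the comparison would need `<=` instead of `<`.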

class nlu.text_processing.Token(text: str, start: int, end: Optional[int] = None, lemma: Optional[str] = None, is_stop: Optional[bool] = False)

Bases: Span

class nlu.text_processing.Tokenizer(additional_stop_words: List[str] = None)
process_text(text: str) → List[Token]

Processes the given text.

The text is split into tokens which can be mapped back to the original text.

Parameters

text – A piece of text.

Returns

List of tokens.

remove_punctuation(text: str) → str

Removes punctuation marks from the utterance, based on a defined set of punctuation patterns.

Parameters

text – A piece of text.

Returns

A piece of text without punctuation.
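
One way such a step could work is sketched below. The helper name `remove_punctuation_sketch` and the choice of `string.punctuation` as the pattern are assumptions for illustration; the patterns the module actually uses are not documented here.

```python
import re
import string

# Assumption: "punctuation" means the ASCII marks in string.punctuation.
_PUNCTUATION = re.compile(f"[{re.escape(string.punctuation)}]")


def remove_punctuation_sketch(text: str) -> str:
    """Hypothetical helper: replace each punctuation mark with a space."""
    # Substituting a space instead of deleting the character keeps the
    # string length, and therefore every character offset, unchanged.
    return _PUNCTUATION.sub(" ", text)


original = "Hi, there!"
cleaned = remove_punctuation_sketch(original)
print(repr(cleaned))
```

Replacing with a space rather than an empty string is a design choice in this sketch: it keeps offsets computed on the cleaned text valid for the original utterance.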

lemmatize_text(text: str) → str

Returns the lemmatized form of the given text.

Parameters

text – A piece of text.

Returns

Lemmatized piece of text.

tokenize(word_tokens: List[str], text: str) → List[Token]

Returns a tokenized copy of text.

Parameters

word_tokens – Word strings extracted from the text.

text – A piece of text.

Returns

List of tokens.
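
The sketch below shows how word strings might be aligned with character offsets in the original text. `TokenSketch` and `tokenize_sketch` are hypothetical names, and the left-to-right search strategy is an assumption rather than the module's documented behavior.

```python
from typing import List, NamedTuple


class TokenSketch(NamedTuple):
    """Hypothetical stand-in for Token, holding only text and offsets."""

    text: str
    start: int
    end: int


def tokenize_sketch(word_tokens: List[str], text: str) -> List[TokenSketch]:
    """Map each word to its character offsets in text, left to right."""
    tokens = []
    cursor = 0
    for word in word_tokens:
        # Search from the cursor so that repeated words are matched to
        # successive occurrences instead of the first one every time.
        start = text.find(word, cursor)
        if start == -1:
            # Assumption: words that cannot be located are skipped.
            continue
        end = start + len(word)
        tokens.append(TokenSketch(word, start, end))
        cursor = end
    return tokens


text = "to be or not to be"
tokens = tokenize_sketch(text.split(), text)
for token in tokens:
    print(token.text, token.start, token.end)
```

Because each token keeps its `start` and `end`, slicing `text[token.start:token.end]` recovers the token's surface form from the original text.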