nlpretext.token module

nlpretext.token.preprocess.remove_smallwords(tokens, smallwords_threshold: int) → list

Remove words whose length is below a threshold, e.g. ["hello", "my", "name", "is", "John", "Doe"] -> ["hello", "name", "John", "Doe"]

Parameters

tokens (list) – list of strings
smallwords_threshold (int) – length threshold below which a word is removed

Returns

tokens without the small words

Return type

list
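
The filtering described above can be sketched in a few lines. This is a minimal illustration, not the library's implementation: the function name `remove_smallwords_sketch` is hypothetical, and the strict comparison (a token is kept only when it is longer than the threshold) is an assumption chosen because it reproduces the documented example with a threshold of 2.

```python
def remove_smallwords_sketch(tokens, smallwords_threshold):
    """Keep only the tokens that are longer than the threshold.

    Assumption: the comparison is strict (len(token) > threshold).
    """
    return [token for token in tokens if len(token) > smallwords_threshold]


tokens = ["hello", "my", "name", "is", "John", "Doe"]
result = remove_smallwords_sketch(tokens, 2)
print(result)
```

With a threshold of 2, the two-character tokens "my" and "is" are dropped and the rest are kept in their original order.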

nlpretext.token.preprocess.remove_special_caracters_from_tokenslist(tokens) → list

Remove tokens that do not contain any letter or digit, e.g. ['foo', 'bar', '---', "'s", '#'] -> ['foo', 'bar', "'s"]

Parameters: tokens (list) – list of tokens to be cleaned
Returns: list of tokens, excluding those that contain only special characters
Return type: list
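
One way to express this rule is a regular-expression search for at least one alphanumeric character. The sketch below is illustrative only: the name `remove_special_characters_sketch` is hypothetical, and treating "letter or digit" as ASCII `[a-zA-Z0-9]` is an assumption, so accented or non-Latin letters would not count under this version.

```python
import re

# Assumption: "letter or digit" means an ASCII letter or digit.
_ALPHANUMERIC = re.compile(r"[a-zA-Z0-9]")


def remove_special_characters_sketch(tokens):
    """Keep the tokens that contain at least one letter or digit."""
    return [token for token in tokens if _ALPHANUMERIC.search(token)]


cleaned = remove_special_characters_sketch(["foo", "bar", "---", "'s", "#"])
print(cleaned)
```

Note that "'s" survives because it contains a letter, even though it also contains a special character.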

nlpretext.token.preprocess.remove_stopwords(tokens: list, lang: str, custom_stopwords: Optional[list] = None) → str

Remove stopwords from a list of tokens, e.g. the tokens of 'I like when you move your body !' -> the tokens of 'I move body !'

Parameters

tokens (list(str)) – list of tokens
lang (str) – ISO language code (e.g. "en")
custom_stopwords (list(str)|None) – list of custom stopwords to add. None by default

Returns

tokens without stopwords

Return type

list

Raises

ValueError – when the input is not a list
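
The behaviour described above (language lookup, optional extra stopwords, and a `ValueError` on non-list input) can be sketched as follows. Everything here is a stand-in: `remove_stopwords_sketch` and `_DEMO_STOPWORDS` are hypothetical names, the stopword table is a tiny hand-written set chosen to reproduce the documented example, and the case-sensitive comparison is an assumption. The real function presumably draws on a full per-language stopword list.

```python
from typing import List, Optional

# Hypothetical, deliberately tiny stopword table used only for this sketch.
_DEMO_STOPWORDS = {"en": {"like", "when", "you", "your"}}


def remove_stopwords_sketch(
    tokens: List[str], lang: str, custom_stopwords: Optional[List[str]] = None
) -> List[str]:
    """Drop the tokens that appear in the stopword set for `lang`."""
    if not isinstance(tokens, list):
        raise ValueError("tokens must be a list")
    stopwords = set(_DEMO_STOPWORDS.get(lang, set()))
    if custom_stopwords:
        # Custom stopwords extend the language's set rather than replace it.
        stopwords.update(custom_stopwords)
    # Assumption: the comparison is case-sensitive.
    return [token for token in tokens if token not in stopwords]


sentence = ["I", "like", "when", "you", "move", "your", "body", "!"]
kept = remove_stopwords_sketch(sentence, "en")
kept_custom = remove_stopwords_sketch(sentence, "en", custom_stopwords=["body"])
print(kept)
print(kept_custom)
```

Passing `custom_stopwords=["body"]` removes "body" in addition to the language's own stopwords.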

nlpretext.token.preprocess.remove_tokens_with_nonletters(tokens) → list

Take a list of tokens and return it without the tokens that include numbers or special characters, e.g. ['foo', 'bar', '124', '34euros'] -> ['foo', 'bar']

Parameters: tokens (list) – list of tokens to be cleaned
Returns: list of tokens, excluding those that contain numbers or special characters
Return type: list
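
A letters-only filter of this kind can be sketched with a full-match regular expression. The name `remove_tokens_with_nonletters_sketch` is hypothetical, and restricting "letters" to ASCII `[a-zA-Z]` is an assumption; under it, a token such as "café" would also be removed.

```python
import re

# Assumption: a "letter" is an ASCII letter, so any other character disqualifies a token.
_LETTERS_ONLY = re.compile(r"[a-zA-Z]+")


def remove_tokens_with_nonletters_sketch(tokens):
    """Keep the tokens made up entirely of letters."""
    return [token for token in tokens if _LETTERS_ONLY.fullmatch(token)]


letters_only = remove_tokens_with_nonletters_sketch(["foo", "bar", "124", "34euros"])
print(letters_only)
```

This filter is stricter than one that only requires a letter or digit somewhere in the token: a single digit or symbol anywhere in the token is enough to drop it.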