nlpretext.token module

nlpretext.token.preprocess.remove_smallwords(tokens, smallwords_threshold: int) → list

Removes words whose length is below a threshold, e.g. ["hello", "my", "name", "is", "John", "Doe"] -> ["hello", "name", "John", "Doe"]

Parameters
  • tokens (list) – list of strings

  • smallwords_threshold (int) – threshold of small word

Returns

Return type

list
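The filtering described above can be sketched in plain Python. This is a minimal illustration and not the library's implementation: `drop_small_words` is a hypothetical helper, and it assumes "below a threshold" means strictly shorter than the threshold. The library may treat the boundary differently, so check its behaviour before relying on this.

```python
def drop_small_words(tokens, threshold):
    """Hypothetical helper: keep tokens that are at least `threshold` characters long.

    Assumption: a token whose length equals the threshold is kept.
    """
    return [token for token in tokens if len(token) >= threshold]


tokens = ["hello", "my", "name", "is", "John", "Doe"]
print(drop_small_words(tokens, 3))  # ['hello', 'name', 'John', 'Doe']
```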

nlpretext.token.preprocess.remove_special_caracters_from_tokenslist(tokens) → list

Removes tokens that do not contain any number or letter, e.g. ['foo', 'bar', '---', "'s", '#'] -> ['foo', 'bar', "'s"]

Parameters

tokens (list) – list of tokens to be cleaned

Returns

list of tokens, excluding those that contain only special characters

Return type

list
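A rough equivalent of this rule, written as a standalone sketch: a token is kept if at least one of its characters is a letter or a digit. `drop_symbol_only_tokens` is a hypothetical name, and the use of `str.isalnum` (which also accepts non-ASCII letters and digits) is an assumption, not necessarily how the library decides.

```python
def drop_symbol_only_tokens(tokens):
    """Hypothetical helper: keep tokens containing at least one letter or digit."""
    return [token for token in tokens if any(ch.isalnum() for ch in token)]


tokens = ["foo", "bar", "---", "'s", "#"]
print(drop_symbol_only_tokens(tokens))  # ['foo', 'bar', "'s"]
```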

nlpretext.token.preprocess.remove_stopwords(tokens: list, lang: str, custom_stopwords: Optional[list] = None) → str

Removes stopwords from a tokenized text, e.g. 'I like when you move your body !' -> 'I move body !'

Parameters
  • tokens (list(str)) – list of tokens

  • lang (str) – language ISO code (e.g. "en")

  • custom_stopwords (list(str)|None) – list of custom stopwords to add. None by default

Returns

tokens without stopwords

Return type

list

Raises

ValueError – when the input is not a list
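The behaviour documented above (per-language stopword list, optional custom additions, `ValueError` on non-list input) might look like the sketch below. Everything in it is an assumption for illustration: `drop_stopwords` and `SAMPLE_STOPWORDS` are hypothetical, the four-word list stands in for the library's real stopword data, and matching is assumed to be case-sensitive so that "I" survives as in the documented example.

```python
# Hypothetical, deliberately tiny stopword table; the real library ships its own lists.
SAMPLE_STOPWORDS = {"en": ["like", "when", "you", "your"]}


def drop_stopwords(tokens, lang, custom_stopwords=None):
    """Hypothetical helper: return the tokens that are not stopwords for `lang`."""
    if not isinstance(tokens, list):
        raise ValueError("tokens must be a list")
    stopwords = set(SAMPLE_STOPWORDS.get(lang, []))
    if custom_stopwords:
        # Custom stopwords extend the language list rather than replace it.
        stopwords.update(custom_stopwords)
    return [token for token in tokens if token not in stopwords]


tokens = "I like when you move your body !".split()
print(drop_stopwords(tokens, "en"))                          # ['I', 'move', 'body', '!']
print(drop_stopwords(tokens, "en", custom_stopwords=["!"]))  # ['I', 'move', 'body']
```

Passing a plain string instead of a list raises `ValueError` in this sketch, mirroring the documented error condition.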

nlpretext.token.preprocess.remove_tokens_with_nonletters(tokens) → list

Takes a list of tokens and returns it without the tokens that include numbers or special characters, e.g. ['foo', 'bar', '124', '34euros'] -> ['foo', 'bar']

Parameters

tokens (list) – list of tokens to be cleaned

Returns

list of tokens, excluding any token that contains numbers

Return type

list
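As a standalone illustration of the rule above, the sketch below keeps a token only when every character in it is a letter. `keep_letter_only_tokens` is a hypothetical name, and relying on `str.isalpha` (which accepts non-ASCII letters and rejects the empty string) is an assumption rather than the library's confirmed approach.

```python
def keep_letter_only_tokens(tokens):
    """Hypothetical helper: keep tokens made up exclusively of letters."""
    return [token for token in tokens if token.isalpha()]


tokens = ["foo", "bar", "124", "34euros"]
print(keep_letter_only_tokens(tokens))  # ['foo', 'bar']
```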