nlpretext.token module
-
nlpretext.token.preprocess.remove_smallwords(tokens, smallwords_threshold: int) → list
Remove words whose length is below a threshold, e.g. ["hello", "my", "name", "is", "John", "Doe"] -> ["hello", "name", "John", "Doe"]
- Parameters
tokens (list) – list of strings
smallwords_threshold (int) – length threshold below which a word is removed
- Returns
- Return type
list
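The filtering described above can be sketched in a few lines. This is a minimal stand-in, not the library's code: `drop_short_tokens` is a hypothetical name, and the source does not say whether a word of exactly the threshold length is kept. The sketch assumes such a word is dropped, so a threshold of 2 reproduces the documented example.

```python
from typing import List


def drop_short_tokens(tokens: List[str], threshold: int) -> List[str]:
    """Hypothetical stand-in: keep tokens strictly longer than `threshold`.

    Assumption: a token whose length equals the threshold is removed.
    """
    return [token for token in tokens if len(token) > threshold]


example = ["hello", "my", "name", "is", "John", "Doe"]
result = drop_short_tokens(example, 2)
print(result)
```

If the real function keeps words of exactly the threshold length, the comparison would be `>=` and the same example would need a threshold of 3.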
-
nlpretext.token.preprocess.remove_special_caracters_from_tokenslist(tokens) → list
Remove tokens that do not contain any number or letter, e.g. ['foo', 'bar', '---', "'s", '#'] -> ['foo', 'bar', "'s"]
- Parameters
tokens (list) – list of tokens to be cleaned
- Returns
list of tokens, excluding those that contain only special characters
- Return type
list
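A rough equivalent of this rule is "keep a token if at least one of its characters is a letter or a digit". The sketch below assumes Python's Unicode-aware `str.isalnum` as the test. The library may use a different definition (for example an ASCII-only pattern), and `drop_symbol_only_tokens` is a hypothetical name.

```python
from typing import List


def drop_symbol_only_tokens(tokens: List[str]) -> List[str]:
    """Hypothetical stand-in: keep tokens with at least one alphanumeric character.

    Assumption: "number or letter" is approximated by str.isalnum,
    which also accepts non-ASCII letters and digits.
    """
    return [token for token in tokens if any(ch.isalnum() for ch in token)]


cleaned = drop_symbol_only_tokens(["foo", "bar", "---", "'s", "#"])
print(cleaned)
```

Note that "'s" survives because it contains a letter, even though it starts with a special character.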
-
nlpretext.token.preprocess.remove_stopwords(tokens: list, lang: str, custom_stopwords: Optional[list] = None) → str
Remove stopwords from a tokenized text, e.g. 'I like when you move your body !' -> 'I move body !'
- Parameters
tokens (list(str)) – list of tokens
lang (str) – language iso code (e.g : “en”)
custom_stopwords (list(str)|None) – list of custom stopwords to add. None by default
- Returns
tokens without stopwords
- Return type
list
- Raises
ValueError – When the input is not a list
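The documented behaviour (per-language stopwords, optional custom additions, and a ValueError for non-list input) can be illustrated with a small sketch. Everything here is an assumption made for illustration: `DEMO_STOPWORDS` is a deliberately tiny, hypothetical table rather than the library's stopword source, `filter_stopwords` is a hypothetical name, matching is assumed to be case-sensitive, and an unknown language code is assumed to mean "no built-in stopwords".

```python
from typing import Dict, List, Optional, Set

# Hypothetical, deliberately tiny stopword table; real lists are far larger.
DEMO_STOPWORDS: Dict[str, Set[str]] = {"en": {"like", "when", "you", "your"}}


def filter_stopwords(
    tokens: List[str], lang: str, custom_stopwords: Optional[List[str]] = None
) -> List[str]:
    """Hypothetical stand-in: drop tokens found in the stopword set for `lang`."""
    if not isinstance(tokens, list):
        raise ValueError("tokens must be a list")
    # Copy the set so custom additions do not mutate the shared table.
    stopwords = set(DEMO_STOPWORDS.get(lang, set()))
    if custom_stopwords:
        stopwords.update(custom_stopwords)
    return [token for token in tokens if token not in stopwords]


tokens = ["I", "like", "when", "you", "move", "your", "body", "!"]
kept = filter_stopwords(tokens, "en")
kept_custom = filter_stopwords(tokens, "en", custom_stopwords=["!"])
print(kept)
print(kept_custom)
```

Passing a raw string instead of a token list raises ValueError in this sketch, mirroring the Raises entry above.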
-
nlpretext.token.preprocess.remove_tokens_with_nonletters(tokens) → list
Take a list of tokens and return it without the tokens that include numbers or special characters, e.g. ['foo', 'bar', '124', '34euros'] -> ['foo', 'bar']
- Parameters
tokens (list) – list of tokens to be cleaned
- Returns
list of tokens, excluding those that contain numbers or special characters
- Return type
list
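One way to picture this filter is "keep a token only if every character is a letter". The sketch below assumes Python's `str.isalpha` for that test, which accepts non-ASCII letters and rejects the empty string. The library may define "letter" differently, and `keep_alphabetic_tokens` is a hypothetical name.

```python
from typing import List


def keep_alphabetic_tokens(tokens: List[str]) -> List[str]:
    """Hypothetical stand-in: keep tokens made up of letters only.

    Assumption: "letter" is approximated by str.isalpha (Unicode-aware).
    """
    return [token for token in tokens if token.isalpha()]


letters_only = keep_alphabetic_tokens(["foo", "bar", "124", "34euros"])
print(letters_only)
```

Unlike the special-character filter, this rule is strict: a single digit or symbol anywhere in the token is enough to drop it.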