Welcome to NLPretext’s documentation!¶
NLPretext is a meta-library that helps you get started on preprocessing for your NLP use cases.
# Installation
This package has been tested on Python 3.6, 3.7, and 3.8. It will not work under Python 2.7, which reached end of life in January 2020.
To install this library, run:
pip install nlpretext
This library uses spaCy as its tokenizer. The models currently supported are en_core_web_sm and fr_core_news_sm. If they are not installed, run the following command:
pip install https://github.com/explosion/spacy-models/releases/download/en_core_web_sm-2.3.1/en_core_web_sm-2.3.1.tar.gz
nlpretext¶
nlpretext.preprocessor module¶
class nlpretext.preprocessor.Preprocessor[source]¶
Bases: object
-
static build_pipeline(operation_list: List[dict]) → sklearn.pipeline.Pipeline[source]¶
Build an sklearn pipeline from an operation list.
- Parameters
operation_list (iterable) – list of preprocessing operations
- Returns
- Return type
sklearn.pipeline.Pipeline
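Conceptually, the operation list is chained so that each step’s output feeds the next. Below is a plain-Python sketch of that idea, not the library’s sklearn-based implementation; `run_pipeline` is a hypothetical helper for illustration only.

```python
# A minimal sketch of chaining preprocessing operations, assuming each
# operation is a callable that takes and returns a string. Illustrative
# only; NLPretext's build_pipeline returns an sklearn Pipeline instead.
def run_pipeline(text, operations):
    """Apply each operation to the text, in order."""
    for op in operations:
        text = op(text)
    return text

steps = [str.lower, str.strip]
print(run_pipeline("  Hello WORLD  ", steps))  # hello world
```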
nlpretext.basic module¶
-
nlpretext.basic.preprocess.filter_non_latin_characters(text) → str[source]¶
Filter non-Latin characters out of a text.
- Parameters
text (string) –
- Returns
- Return type
string
-
nlpretext.basic.preprocess.fix_bad_unicode(text, normalization: str = 'NFC') → str[source]¶
Fix unicode text that’s “broken” using ftfy; this includes mojibake, HTML entities and other code cruft, and non-standard forms for display purposes.
- Parameters
text (string) –
normalization ({'NFC', 'NFKC', 'NFD', 'NFKD'}) – if ‘NFC’, combines characters and diacritics written using separate code points, e.g. converting “e” plus an acute accent modifier into “é”; unicode can be converted to NFC form without any change in its meaning. If ‘NFKC’, additional normalizations are applied that can change the meanings of characters, e.g. ellipsis characters will be replaced with three periods.
- Returns
- Return type
string
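The normalization forms mentioned above can be seen with the standard library’s unicodedata module (ftfy, which fix_bad_unicode relies on, goes further than plain normalization):

```python
import unicodedata

# NFC composes a base character and its combining diacritic into one
# code point; NFKC additionally applies compatibility mappings that can
# change meaning, e.g. the ellipsis character becomes three periods.
decomposed = "e\u0301"                            # 'e' + combining acute accent
nfc = unicodedata.normalize("NFC", decomposed)    # single 'é' code point
nfkc = unicodedata.normalize("NFKC", "\u2026")    # '…' -> '...'

print(nfc, len(nfc))  # é 1
print(nfkc)           # ...
```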
-
nlpretext.basic.preprocess.lower_text(text: str)[source]¶
Given text str, transform it into lowercase.
- Parameters
text (string) –
- Returns
- Return type
string
-
nlpretext.basic.preprocess.normalize_whitespace(text) → str[source]¶
Given text str, replace one or more spacings with a single space, and one or more linebreaks with a single newline. Also strip leading/trailing whitespace. e.g. ” foo bar ” -> “foo bar”
- Parameters
text (string) –
- Returns
- Return type
string
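The behaviour described above can be sketched with two regular expressions (a minimal illustration, not the library’s code):

```python
import re

# Collapse runs of spaces/tabs to one space (keeping newlines intact),
# collapse runs of linebreaks to one newline, then strip the ends.
def normalize_whitespace_sketch(text: str) -> str:
    text = re.sub(r"[^\S\n]+", " ", text)  # whitespace except newlines
    text = re.sub(r"\n+", "\n", text)      # multiple linebreaks
    return text.strip()

print(normalize_whitespace_sketch("  foo   bar  "))  # foo bar
```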
-
nlpretext.basic.preprocess.remove_accents(text, method: str = 'unicode') → str[source]¶
Remove accents from any accented unicode characters in text str, either by transforming them into ascii equivalents or removing them entirely.
- Parameters
text (str) – raw text
method ({'unicode', 'ascii'}) – if ‘unicode’, remove accents for any unicode symbol with a direct ASCII equivalent; if ‘ascii’, remove accents for any unicode symbol. NB: the ‘ascii’ method is notably faster than ‘unicode’, but less accurate.
- Returns
- Return type
string
- Raises
ValueError – if method is not in {‘unicode’, ‘ascii’}
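A common way to implement the ‘unicode’ method described above is to decompose characters and drop the combining marks; this sketch uses the standard library and is not the library’s exact code:

```python
import unicodedata

# NFKD splits an accented character into its base letter plus combining
# marks; filtering out the combining marks leaves the plain letter.
def remove_accents_sketch(text: str) -> str:
    decomposed = unicodedata.normalize("NFKD", text)
    return "".join(c for c in decomposed if not unicodedata.combining(c))

print(remove_accents_sketch("déjà vu"))  # deja vu
```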
-
nlpretext.basic.preprocess.remove_eol_characters(text) → str[source]¶
Remove end-of-line (\n) characters.
- Parameters
text (str) –
- Returns
- Return type
string
-
nlpretext.basic.preprocess.remove_multiple_spaces_and_strip_text(text) → str[source]¶
Remove multiple spaces, strip text, and remove ‘-’, ‘*’ characters.
- Parameters
text (str) – the text to be processed
- Returns
the stripped text with multiple spaces removed
- Return type
string
-
nlpretext.basic.preprocess.remove_punct(text, marks=None) → str[source]¶
Remove punctuation from text by replacing all instances of marks with whitespace.
- Parameters
text (str) – raw text
marks (str or None) – If specified, remove only the characters in this string, e.g. marks=',;:' removes commas, semi-colons, and colons. Otherwise, all punctuation marks are removed.
- Returns
- Return type
string
Note
When marks=None, Python’s built-in str.translate() is used to remove punctuation; otherwise, a regular expression is used instead. The former is about 5-10x faster.
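The two strategies from the note can be sketched as follows (illustrative only; the library’s punctuation set and regex may differ):

```python
import re
import string

# marks=None: fast path via str.translate, which deletes all ASCII
# punctuation. marks given: regex path, replacing those marks with a space.
def remove_punct_sketch(text: str, marks=None) -> str:
    if marks is None:
        return text.translate(str.maketrans("", "", string.punctuation))
    return re.sub(f"[{re.escape(marks)}]+", " ", text)

print(remove_punct_sketch("Hello, world!"))      # Hello world
print(remove_punct_sketch("a,b;c", marks=",;"))  # a b c
```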
-
nlpretext.basic.preprocess.remove_stopwords(text: str, lang: str, custom_stopwords: Optional[list] = None) → str[source]¶
Given text str, remove classic stopwords for a given language, plus any custom stopwords given as a list.
- Parameters
text (string) –
lang (string) –
custom_stopwords (list of strings) –
- Returns
- Return type
string
-
nlpretext.basic.preprocess.replace_currency_symbols(text, replace_with=None) → str[source]¶
Replace all currency symbols in text str with the string specified by replace_with str.
- Parameters
text (str) – raw text
replace_with (None or string) – if None (default), replace symbols with their standard 3-letter abbreviations (e.g. ‘$’ with ‘USD’, ‘£’ with ‘GBP’); otherwise, pass in a string with which to replace all symbols (e.g. “CURRENCY”)
- Returns
- Return type
string
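The symbol-to-abbreviation mapping described above can be sketched with a deliberately tiny dictionary (the library covers many more currencies):

```python
# Illustrative mapping only; not the library's full currency table.
CURRENCY_MAP = {"$": "USD", "£": "GBP", "€": "EUR"}

def replace_currency_symbols_sketch(text: str, replace_with=None) -> str:
    # Replace each known symbol either with its abbreviation (default)
    # or with the single replacement string the caller supplied.
    for symbol, abbrev in CURRENCY_MAP.items():
        text = text.replace(symbol, abbrev if replace_with is None else replace_with)
    return text

print(replace_currency_symbols_sketch("price: $5"))              # price: USD5
print(replace_currency_symbols_sketch("price: $5", "CURRENCY"))  # price: CURRENCY5
```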
-
nlpretext.basic.preprocess.replace_emails(text, replace_with='*EMAIL*') → str[source]¶
Replace all emails in text str with replace_with str.
- Parameters
text (string) –
replace_with (string) – the string you want the email addresses to be replaced with.
- Returns
- Return type
string
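A regex sketch of this replacement; the pattern below is a rough illustration and the library’s actual pattern may differ:

```python
import re

# A deliberately simple email pattern: local part, '@', domain with a dot.
EMAIL_RE = re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+")

def replace_emails_sketch(text: str, replace_with="*EMAIL*") -> str:
    return EMAIL_RE.sub(replace_with, text)

print(replace_emails_sketch("write to john@doe.com please"))
# write to *EMAIL* please
```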
-
nlpretext.basic.preprocess.replace_numbers(text, replace_with='*NUMBER*') → str[source]¶
Replace all numbers in text str with replace_with str.
- Parameters
text (string) –
replace_with (string) – the string you want the numbers to be replaced with.
- Returns
- Return type
string
-
nlpretext.basic.preprocess.replace_phone_numbers(text, country_to_detect: list, replace_with: str = '*PHONE*', method: str = 'regex') → str[source]¶
Replace all phone numbers in text str with replace_with str.
- Parameters
text (string) –
replace_with (string) – the string you want the phone numbers to be replaced with.
method ({'regex', 'detection'}) – ‘regex’ is faster but will miss many numbers, while ‘detection’ catches every number but takes a while.
country_to_detect (list) – list of country codes whose number formats should be caught. Only used when method = ‘detection’.
- Returns
- Return type
string
-
nlpretext.basic.preprocess.replace_urls(text, replace_with: str = '*URL*') → str[source]¶
Replace all URLs in text str with replace_with str.
- Parameters
text (string) –
replace_with (string) – the string you want the URLs to be replaced with.
- Returns
- Return type
string
-
nlpretext.basic.preprocess.unpack_english_contractions(text) → str[source]¶
Replace English contractions in text str with their unshortened forms. N.B. The “‘d” and “‘s” forms are ambiguous (had/would, is/has/possessive), so they are left as-is. e.g. “You’re fired. She’s nice.” -> “You are fired. She’s nice.”
- Parameters
text (string) –
- Returns
- Return type
string
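A small mapping-based sketch of the unshortening described above; the library’s rule set is more complete, and the ambiguous “‘d”/“‘s” forms are deliberately left untouched, matching the documented behaviour:

```python
import re

# Illustrative patterns only (note: "can't" would need a special case,
# since a naive "n't" rule would yield "ca not").
CONTRACTION_PATTERNS = [
    (r"'re\b", " are"),
    (r"'ll\b", " will"),
    (r"'ve\b", " have"),
]

def unpack_contractions_sketch(text: str) -> str:
    for pattern, replacement in CONTRACTION_PATTERNS:
        text = re.sub(pattern, replacement, text)
    return text

print(unpack_contractions_sketch("You're fired. She's nice."))
# You are fired. She's nice.
```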
nlpretext.social module¶
Convert emoji to their CLDR Short Name, according to the unicode convention http://www.unicode.org/emoji/charts/full-emoji-list.html eg. 😀 –> :grinning_face:
- Parameters
text (str) –
code_delimiters (tuple) – symbols around the emoji code, eg (':', ':') --> :grinning_face:
- Returns
string
- Return type
str
Function that extracts emojis from a text and translates them into words eg. “I take care of my skin 😀 :(” –> [“:grinning_face:”]
- Parameters
text (str) –
- Returns
list of all emojis converted with their unicode conventions
- Return type
list
Function that extracts words preceded with a ‘#’ eg. “I take care of my skin #selfcare#selfestim” –> [“selfcare”, “selfestim”]
- Parameters
text (str) –
- Returns
list of all hashtags
- Return type
list
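The hashtag extraction above can be sketched with a single regex (illustrative, not the library’s exact pattern):

```python
import re

# '#' followed by word characters; findall returns the captured words,
# so back-to-back hashtags like "#a#b" split correctly.
def extract_hashtags_sketch(text: str) -> list:
    return re.findall(r"#(\w+)", text)

print(extract_hashtags_sketch("I take care of my skin #selfcare#selfestim"))
# ['selfcare', 'selfestim']
```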
Function that extracts words preceded with a ‘@’ eg. “I take care of my skin with @thisproduct” –> [“@thisproduct”]
- Parameters
text (str) –
- Returns
list of all mentions
- Return type
list
Remove emoji from any str by stripping any character in the emoji unicode ranges, as defined in the unicode convention: http://www.unicode.org/emoji/charts/full-emoji-list.html
- Parameters
text (str) –
- Returns
- Return type
str
Function that removes words preceded with a ‘#’ eg. “I take care of my skin #selfcare#selfestim” –> “I take care of my skin”
- Parameters
text (str) –
- Returns
text of a post without hashtags
- Return type
str
Function that removes words between < and >
- Parameters
text (str) –
- Returns
- Return type
string
Function that removes words preceded with a ‘@’
- Parameters
text (str) –
- Returns
- Return type
string
nlpretext.token module¶
-
nlpretext.token.preprocess.remove_smallwords(tokens, smallwords_threshold: int) → list[source]¶
Function that removes words whose length is below a threshold. eg. [“hello”, “my”, “name”, “is”, “John”, “Doe”] –> [“hello”, “name”, “John”, “Doe”]
- Parameters
tokens (list) – list of strings
smallwords_threshold (int) – minimum word length to keep
- Returns
- Return type
list
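The threshold filter described above amounts to a single list comprehension (a sketch, not the library’s code):

```python
# Keep only tokens whose length is at least the threshold.
def remove_smallwords_sketch(tokens: list, smallwords_threshold: int) -> list:
    return [t for t in tokens if len(t) >= smallwords_threshold]

print(remove_smallwords_sketch(["hello", "my", "name", "is", "John", "Doe"], 3))
# ['hello', 'name', 'John', 'Doe']
```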
-
nlpretext.token.preprocess.remove_special_caracters_from_tokenslist(tokens) → list[source]¶
Remove tokens that don’t contain any number or letter. eg. [‘foo’,’bar’,’—’,“‘s”,’#’] -> [‘foo’,’bar’,“‘s”]
- Parameters
tokens (list) – list of tokens to be cleaned
- Returns
list of tokens without tokens that contain only special characters
- Return type
list
-
nlpretext.token.preprocess.remove_stopwords(tokens: list, lang: str, custom_stopwords: Optional[list] = None) → list[source]¶
Remove stopwords from a tokenized text. eg. ‘I like when you move your body !’ -> ‘I move body !’
- Parameters
tokens (list(str)) – list of tokens
lang (str) – language iso code (e.g. “en”)
custom_stopwords (list(str)|None) – list of custom stopwords to add. None by default
- Returns
tokens without stopwords
- Return type
list
- Raises
ValueError – When input is not a list
-
nlpretext.token.preprocess.remove_tokens_with_nonletters(tokens) → list[source]¶
Inputs a list of tokens, outputs a list of tokens without tokens that include numbers or special characters. [‘foo’,’bar’,’124’,’34euros’] -> [‘foo’,’bar’]
- Parameters
tokens (list) – list of tokens to be cleaned
- Returns
list of tokens without numbers or special characters
- Return type
list
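The filter described above can be sketched with str.isalpha (illustrative, not the library’s code):

```python
# Keep only tokens made purely of letters; tokens containing digits or
# special characters are dropped.
def remove_tokens_with_nonletters_sketch(tokens: list) -> list:
    return [t for t in tokens if t.isalpha()]

print(remove_tokens_with_nonletters_sketch(["foo", "bar", "124", "34euros"]))
# ['foo', 'bar']
```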
nlpretext.augmentation module¶
Bases: ValueError
-
nlpretext.augmentation.text_augmentation.are_entities_in_augmented_text(entities: list, augmented_text: str) → bool[source]¶
Given a list of entities, check if all the words associated to each entity are still present in the augmented text.
- Parameters
entities (list) – entities associated to the initial text, must be in the following format:
[{'entity': str, 'word': str, 'startCharIndex': int, 'endCharIndex': int}, {…}]
augmented_text (str) –
- Returns
True if all entities are present in the augmented text, False otherwise
- Return type
bool
-
nlpretext.augmentation.text_augmentation.augment_text(text: str, method: str, stopwords: Optional[List[str]] = None, entities: Optional[list] = None) → Tuple[str, list][source]¶
Given a text with or without associated entities, generate a new text by modifying some words in the initial one; modifications depend on the chosen method (substitution with synonym, addition, deletion). If entities are given as input, they will remain unchanged. If you want words other than entities to remain unchanged, specify them in the stopwords argument.
- Parameters
text (string) –
method ({'wordnet_synonym', 'aug_sub_bert'}) – augmenter to use
stopwords (list, optional) – list of words to freeze throughout the augmentation
entities (list, optional) – entities associated to the text if any, must be in the following format:
[{'entity': str, 'word': str, 'startCharIndex': int, 'endCharIndex': int}, {…}]
- Returns
- Return type
Augmented text and optional augmented entities
-
nlpretext.augmentation.text_augmentation.check_interval_included(element1: dict, element2: dict) → Optional[Tuple[dict, dict]][source]¶
Compare two entities on start and end positions to find out whether they are nested.
- Parameters
element1 (dict) –
element2 (dict) – both in the following format: {'entity': str, 'word': str, 'startCharIndex': int, 'endCharIndex': int}
- Returns
If there is an entity to remove among the two, returns a tuple (element to remove, element to keep); if not, returns None
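The nested-interval comparison can be sketched as follows, using the entity format documented above. This is an illustration of the idea (remove the nested, shorter entity), not the library’s exact logic:

```python
# Returns (entity to remove, entity to keep) when one interval is
# contained in the other, else None.
def check_interval_included_sketch(e1: dict, e2: dict):
    if e1["startCharIndex"] >= e2["startCharIndex"] and e1["endCharIndex"] <= e2["endCharIndex"]:
        return e1, e2  # e1 is nested inside e2: remove e1, keep e2
    if e2["startCharIndex"] >= e1["startCharIndex"] and e2["endCharIndex"] <= e1["endCharIndex"]:
        return e2, e1  # e2 is nested inside e1: remove e2, keep e1
    return None

outer = {"entity": "LOC", "word": "New York City", "startCharIndex": 0, "endCharIndex": 13}
inner = {"entity": "LOC", "word": "New York", "startCharIndex": 0, "endCharIndex": 8}
print(check_interval_included_sketch(inner, outer))
```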
-
nlpretext.augmentation.text_augmentation.clean_sentence_entities(text: str, entities: list) → list[source]¶
Check entities pairwise to remove nested entities; the longest entity is kept.
- Parameters
text (str) – augmented text
entities (list) – entities associated to the augmented text, must be in the following format:
[{'entity': str, 'word': str, 'startCharIndex': int, 'endCharIndex': int}, {…}]
- Returns
- Return type
Cleaned entities
-
nlpretext.augmentation.text_augmentation.get_augmented_entities(sentence_augmented: str, entities: list) → list[source]¶
Get entities with updated positions (start and end) in the augmented text.
- Parameters
sentence_augmented (str) – augmented text
entities (list) – entities associated to the initial text, must be in the following format:
[{'entity': str, 'word': str, 'startCharIndex': int, 'endCharIndex': int}, {…}]
- Returns
- Return type
Entities with updated positions related to augmented text
-
nlpretext.augmentation.text_augmentation.get_augmenter(method: str, stopwords: Optional[List[str]] = None) → nlpaug.augmenter.word.synonym.SynonymAug[source]¶
Initialize an augmenter depending on the given method.
- Parameters
method (str) – supported methods: wordnet_synonym and aug_sub_bert
stopwords (list) – list of words to freeze throughout the augmentation
- Returns
- Return type
Initialized nlpaug augmenter
-
nlpretext.augmentation.text_augmentation.process_entities_and_text(entities: list, text: str, augmented_text: str)[source]¶
Given a list of initial entities, verify that they have not been altered by the data augmentation operation and are still in the augmented text.
- Parameters
entities (list) – entities associated to the text, must be in the following format:
[{'entity': str, 'word': str, 'startCharIndex': int, 'endCharIndex': int}, {…}]
text (str) – initial text
augmented_text (str) – new text resulting from the data augmentation operation
- Returns
- Return type
Augmented text and entities with their updated positions in the augmented text