nlpretext.basic module
-
nlpretext.basic.preprocess.filter_non_latin_characters(text) → str
Filter non-Latin characters out of a text.
- Parameters
text (string) –
- Returns
- Return type
string
-
nlpretext.basic.preprocess.fix_bad_unicode(text, normalization: str = 'NFC') → str
Fix unicode text that’s “broken” using ftfy; this includes mojibake, HTML entities and other code cruft, and non-standard forms for display purposes.
- Parameters
text (string) –
normalization ({'NFC', 'NFKC', 'NFD', 'NFKD'}) – if ‘NFC’, combines characters and diacritics written using separate code points, e.g. converting “e” plus an acute accent modifier into “é”; unicode can be converted to NFC form without any change in its meaning. If ‘NFKC’, additional normalizations are applied that can change the meanings of characters, e.g. ellipsis characters will be replaced with three periods.
- Returns
- Return type
string
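The difference between the ‘NFC’ and ‘NFKC’ forms described above can be checked with the standard library alone. This is a minimal sketch that calls unicodedata.normalize directly; it does not go through ftfy or nlpretext, and it assumes the normalization argument carries the usual Unicode meaning.

```python
import unicodedata

decomposed = "e\u0301"  # "e" followed by a combining acute accent (two code points)
ellipsis = "\u2026"     # single-character horizontal ellipsis

# NFC composes the pair into one code point without changing the meaning.
nfc = unicodedata.normalize("NFC", decomposed)

# NFC leaves the ellipsis alone; NFKC replaces it with three periods.
nfc_ellipsis = unicodedata.normalize("NFC", ellipsis)
nfkc_ellipsis = unicodedata.normalize("NFKC", ellipsis)
```

Here nfc is the single character “é”, while nfkc_ellipsis is the three-character string "...", which is why ‘NFKC’ is described as potentially meaning-changing.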
-
nlpretext.basic.preprocess.lower_text(text: str)
Given a text string, transform it into lowercase.
- Parameters
text (string) –
- Returns
- Return type
string
-
nlpretext.basic.preprocess.normalize_whitespace(text) → str
Given a text string, replace one or more spacings with a single space, and one or more linebreaks with a single newline. Also strip leading/trailing whitespace, e.g. “  foo    bar  ” -> “foo bar”.
- Parameters
text (string) –
- Returns
- Return type
string
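The behaviour described for normalize_whitespace can be approximated with two regular expressions. The function below, normalize_whitespace_sketch, is a hypothetical stand-in written for illustration; the library's actual patterns may differ.

```python
import re


def normalize_whitespace_sketch(text: str) -> str:
    """Illustrative re-implementation; not the nlpretext source."""
    # Collapse each run of line breaks into a single newline.
    text = re.sub(r"(?:\r\n|\r|\n)+", "\n", text)
    # Collapse each run of non-linebreak whitespace into a single space.
    text = re.sub(r"[^\S\r\n]+", " ", text)
    # Strip leading and trailing whitespace.
    return text.strip()


print(normalize_whitespace_sketch("  foo    bar  "))  # prints "foo bar"
```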
-
nlpretext.basic.preprocess.remove_accents(text, method: str = 'unicode') → str
Remove accents from any accented unicode characters in a text string, either by transforming them into ascii equivalents or removing them entirely.
- Parameters
text (str) – raw text
method ({'unicode', 'ascii'}) – if ‘unicode’, remove accented char for any unicode symbol with a direct ASCII equivalent; if ‘ascii’, remove accented char for any unicode symbol.
NB: the ‘ascii’ method is notably faster than ‘unicode’, but produces lower-quality output.
- Returns
- Return type
string
- Raises
ValueError – if method is not in {‘unicode’, ‘ascii’}
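The two methods can be illustrated with the standard unicodedata module. This is a sketch under the assumption that ‘unicode’ strips combining marks after decomposition and ‘ascii’ drops everything that cannot be encoded as ASCII; remove_accents_sketch is a hypothetical name, not the library function.

```python
import unicodedata


def remove_accents_sketch(text: str, method: str = "unicode") -> str:
    """Illustrative re-implementation; not the nlpretext source."""
    # Decompose accented characters into base character + combining marks.
    decomposed = unicodedata.normalize("NFKD", text)
    if method == "unicode":
        # Drop only the combining marks; other non-ASCII characters survive.
        return "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    if method == "ascii":
        # Drop every character that has no ASCII encoding.
        return decomposed.encode("ascii", errors="ignore").decode("ascii")
    raise ValueError("method must be 'unicode' or 'ascii', got {!r}".format(method))
```

With this sketch, "naïve Ærø" becomes "naive Ærø" under ‘unicode’ but "naive r" under ‘ascii’, because “Æ” and “ø” have no decomposition and are simply discarded by the ASCII encoding step.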
-
nlpretext.basic.preprocess.remove_eol_characters(text) → str
Remove end-of-line (\n) characters.
- Parameters
text (str) –
- Return type
str
-
nlpretext.basic.preprocess.remove_multiple_spaces_and_strip_text(text) → str
Remove multiple spaces, strip text, and remove ‘-’ and ‘*’ characters.
- Parameters
text (str) – the text to be processed
- Returns
the stripped text with multiple spaces removed
- Return type
string
-
nlpretext.basic.preprocess.remove_punct(text, marks=None) → str
Remove punctuation from text by replacing all instances of marks with whitespace.
- Parameters
text (str) – raw text
marks (str or None) – If specified, remove only the characters in this string, e.g. marks=',;:' removes commas, semi-colons, and colons. Otherwise, all punctuation marks are removed.
- Returns
- Return type
string
Note
When marks=None, Python’s built-in str.translate() is used to remove punctuation; otherwise, a regular expression is used instead. The former is about 5-10x faster.
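The two code paths mentioned in the note can be sketched as follows. This is an illustration only: remove_punct_sketch is a hypothetical name, and the translation table here covers ASCII punctuation from string.punctuation, which is an assumption rather than the library's actual character set.

```python
import re
import string
from typing import Optional

# Assumption: ASCII punctuation only; each mark maps to a single space.
_PUNCT_TABLE = str.maketrans({ch: " " for ch in string.punctuation})


def remove_punct_sketch(text: str, marks: Optional[str] = None) -> str:
    """Illustrative re-implementation; not the nlpretext source."""
    if marks:
        # Regex path: each run of the requested marks becomes one space.
        return re.sub("[{}]+".format(re.escape(marks)), " ", text)
    # Fast path: one-to-one character translation via str.translate().
    return text.translate(_PUNCT_TABLE)


print(remove_punct_sketch("a,b;c:d!", marks=",;:"))  # prints "a b c d!"
```

Note that in this sketch the translate path replaces marks one for one, so "Hi, there!" keeps two spaces between the words and gains a trailing space.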
-
nlpretext.basic.preprocess.remove_stopwords(text: str, lang: str, custom_stopwords: Optional[list] = None) → str
Given a text string, remove classic stopwords for a given language and custom stopwords given as a list.
- Parameters
text (string) –
lang (string) –
custom_stopwords (list of strings) –
- Returns
- Return type
string
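How the language stopwords and the custom list combine can be pictured with a small sketch. Everything here is hypothetical: the tiny _DEMO_STOPWORDS table stands in for a real per-language list, and whitespace tokenization with case-insensitive matching is an assumption, not a statement about the library's behaviour.

```python
from typing import List, Optional

# Hypothetical, deliberately tiny stand-in for a per-language stopword list.
_DEMO_STOPWORDS = {"en": {"the", "a", "is", "of"}}


def remove_stopwords_sketch(
    text: str, lang: str, custom_stopwords: Optional[List[str]] = None
) -> str:
    """Illustrative re-implementation; not the nlpretext source."""
    stopwords = set(_DEMO_STOPWORDS[lang])
    if custom_stopwords:
        # Custom entries are added on top of the language defaults.
        stopwords.update(custom_stopwords)
    # Assumption: split on whitespace and compare case-insensitively.
    return " ".join(tok for tok in text.split() if tok.lower() not in stopwords)
```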
-
nlpretext.basic.preprocess.replace_currency_symbols(text, replace_with=None) → str
Replace all currency symbols in a text string with the string specified by replace_with.
- Parameters
text (str) – raw text
replace_with (None or string) – if None (default), replace symbols with their standard 3-letter abbreviations (e.g. ‘$’ with ‘USD’, ‘£’ with ‘GBP’); otherwise, pass in a string with which to replace all symbols (e.g. “CURRENCY”)
- Returns
- Return type
string
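The two modes of replace_with can be sketched with a small lookup table. The mapping below is a hypothetical three-entry example (the ‘€’ entry is an addition for illustration), and replace_currency_symbols_sketch is not the library function.

```python
import re
from typing import Optional

# Hypothetical symbol table; the library's own table is likely larger.
_CURRENCIES = {"$": "USD", "£": "GBP", "€": "EUR"}
_CURRENCY_RE = re.compile("[" + re.escape("".join(_CURRENCIES)) + "]")


def replace_currency_symbols_sketch(
    text: str, replace_with: Optional[str] = None
) -> str:
    """Illustrative re-implementation; not the nlpretext source."""
    if replace_with is None:
        # Default: swap each symbol for its 3-letter abbreviation.
        return _CURRENCY_RE.sub(lambda m: _CURRENCIES[m.group()], text)
    # Otherwise: swap every symbol for the same placeholder string.
    return _CURRENCY_RE.sub(lambda m: replace_with, text)


print(replace_currency_symbols_sketch("$5 and £3"))  # prints "USD5 and GBP3"
```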
-
nlpretext.basic.preprocess.replace_emails(text, replace_with='*EMAIL*') → str
Replace all emails in a text string with the replace_with string.
- Parameters
text (string) –
replace_with (string) – the string you want the email address to be replaced with.
- Returns
- Return type
string
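Placeholder substitution of this kind is typically a single regular-expression pass. The sketch below uses a deliberately simple pattern that is an assumption for illustration; real-world email matching is more involved, and the library's pattern is not reproduced here.

```python
import re

# Simplified pattern (assumption): local part, "@", domain, dot, TLD.
_EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")


def replace_emails_sketch(text: str, replace_with: str = "*EMAIL*") -> str:
    """Illustrative re-implementation; not the nlpretext source."""
    # A callable replacement keeps backslashes in replace_with literal.
    return _EMAIL_RE.sub(lambda m: replace_with, text)


print(replace_emails_sketch("Contact jane.doe@example.com today"))
# prints "Contact *EMAIL* today"
```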
-
nlpretext.basic.preprocess.replace_numbers(text, replace_with='*NUMBER*') → str
Replace all numbers in a text string with the replace_with string.
- Parameters
text (string) –
replace_with (string) – the string you want the number to be replaced with.
- Returns
- Return type
string
-
nlpretext.basic.preprocess.replace_phone_numbers(text, country_to_detect: list, replace_with: str = '*PHONE*', method: str = 'regex') → str
Replace all phone numbers in a text string with the replace_with string.
- Parameters
text (string) –
replace_with (string) – the string you want the phone number to be replaced with.
method ({'regex', 'detection'}) – ‘regex’ is faster but will miss many numbers, while ‘detection’ catches every number but is slower.
country_to_detect (list) – a list of country codes. If specified, every number formatted for those countries is caught. Only used when method = ‘detection’.
- Returns
- Return type
string
-
nlpretext.basic.preprocess.replace_urls(text, replace_with: str = '*URL*') → str
Replace all URLs in a text string with the replace_with string.
- Parameters
text (string) –
replace_with (string) – the string you want the URL to be replaced with.
- Returns
- Return type
string
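As a rough illustration of URL replacement, the sketch below treats anything starting with http://, https:// or www. as a URL up to the next whitespace. That rule is an assumption chosen for brevity: it will swallow trailing punctuation, and it is not the pattern the library uses.

```python
import re

# Simplified rule (assumption): scheme or "www." followed by non-space characters.
_URL_RE = re.compile(r"(?:https?://|www\.)\S+")


def replace_urls_sketch(text: str, replace_with: str = "*URL*") -> str:
    """Illustrative re-implementation; not the nlpretext source."""
    return _URL_RE.sub(lambda m: replace_with, text)


print(replace_urls_sketch("See https://example.com/docs and www.example.org now"))
# prints "See *URL* and *URL* now"
```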
-
nlpretext.basic.preprocess.unpack_english_contractions(text) → str
Replace English contractions in a text string with their unshortened forms. N.B. The “’d” and “’s” forms are ambiguous (had/would, is/has/possessive), so they are left as-is, e.g. “You’re fired. She’s nice.” -> “You are fired. She’s nice.”
- Parameters
text (string) –
- Returns
- Return type
string
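The rule-based expansion described above, including the decision to leave “’s” and “’d” untouched, can be sketched with a handful of regular expressions. The rule list is a small hypothetical subset and only handles the straight apostrophe ('); it is not the library's full rule set.

```python
import re

# Hypothetical subset of rules; "'s" and "'d" are intentionally absent
# because they are ambiguous (is/has/possessive, had/would).
_RULES = [
    (re.compile(r"\b([Yy]ou|[Ww]e|[Tt]hey)'re\b"), r"\1 are"),
    (re.compile(r"\b([Ii]|[Yy]ou|[Ww]e|[Tt]hey)'ve\b"), r"\1 have"),
    (re.compile(r"\b([Ii]s|[Aa]re|[Dd]o|[Dd]oes|[Dd]id)n't\b"), r"\1 not"),
    (re.compile(r"\b([Ii])'m\b"), r"\1 am"),
]


def unpack_contractions_sketch(text: str) -> str:
    """Illustrative re-implementation; not the nlpretext source."""
    for pattern, replacement in _RULES:
        text = pattern.sub(replacement, text)
    return text


print(unpack_contractions_sketch("You're fired. She's nice."))
# prints "You are fired. She's nice."
```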