nlpretext.basic module

nlpretext.basic.preprocess.filter_non_latin_characters(text) -> str

Filter non-Latin characters out of a text.

Parameters

text (string) –

Returns

Return type

string

nlpretext.basic.preprocess.fix_bad_unicode(text, normalization: str = 'NFC') -> str

Fix unicode text that’s “broken” using ftfy; this includes mojibake, HTML entities and other code cruft, and non-standard forms for display purposes.

Parameters
  • text (string) –

  • normalization ({'NFC', 'NFKC', 'NFD', 'NFKD'}) – if ‘NFC’, combines characters and diacritics written using separate code points, e.g. converting “e” plus an acute accent modifier into “é”; unicode can be converted to NFC form without any change in its meaning. If ‘NFKC’, additional normalizations are applied that can change the meanings of characters, e.g. ellipsis characters will be replaced with three periods.

Returns

Return type

string
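
The difference between the normalization forms can be seen with the standard library's unicodedata module alone. The following is a minimal sketch of the normalization step only; it does not reproduce the ftfy-based repair of mojibake or HTML entities, and it is not taken from the nlpretext source.

```python
import unicodedata

# "e" followed by a combining acute accent: two separate code points
decomposed = "e\u0301"

# NFC composes the pair into the single code point U+00E9 ("é")
composed = unicodedata.normalize("NFC", decomposed)
print(len(decomposed), len(composed))  # 2 1

# NFC leaves the ellipsis character (U+2026) alone, whereas NFKC applies
# a compatibility mapping that turns it into three periods
ellipsis = "\u2026"
print(unicodedata.normalize("NFC", ellipsis) == ellipsis)  # True
print(unicodedata.normalize("NFKC", ellipsis))  # ...
```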

nlpretext.basic.preprocess.lower_text(text: str)

Given text str, transform it into lowercase

Parameters

text (string) –

Returns

Return type

string

nlpretext.basic.preprocess.normalize_whitespace(text) -> str

Given text str, replace one or more spacings with a single space, and one or more linebreaks with a single newline. Also strip leading/trailing whitespace, e.g. “  foo   bar  ” -> “foo bar”.

Parameters

text (string) –

Returns

Return type

string
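
The behaviour described above can be approximated with two regular expressions. This is a sketch under assumptions, not the library's code: the function name normalize_whitespace_sketch and both patterns are hypothetical, chosen only to match the description.

```python
import re

# One or more consecutive line breaks (Unix, Windows, or old Mac style)
LINEBREAKS = re.compile(r"(?:\r\n|\r|\n)+")
# One or more whitespace characters that are not line breaks
SPACINGS = re.compile(r"[^\S\r\n]+")


def normalize_whitespace_sketch(text: str) -> str:
    """Collapse runs of line breaks and runs of spacing, then strip the ends."""
    text = LINEBREAKS.sub("\n", text)
    text = SPACINGS.sub(" ", text)
    return text.strip()


print(repr(normalize_whitespace_sketch("  foo   bar  ")))  # 'foo bar'
```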

nlpretext.basic.preprocess.remove_accents(text, method: str = 'unicode') -> str

Remove accents from any accented unicode characters in text str, either by transforming them into ascii equivalents or removing them entirely.

Parameters
  • text (str) – raw text

  • method ({'unicode', 'ascii'}) –

    if ‘unicode’, remove accents from any unicode symbol that has a direct ASCII equivalent; if ‘ascii’, remove accents from any unicode symbol

    NB: the ‘ascii’ method is notably faster than ‘unicode’, but gives lower-quality results

Returns

Return type

string

Raises

ValueError – if method is not in {‘unicode’, ‘ascii’}
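
One common way to implement the two methods relies on Unicode decomposition. The sketch below assumes that approach; remove_accents_sketch is a hypothetical name and the code is not copied from nlpretext, so details may differ from the real function.

```python
import unicodedata


def remove_accents_sketch(text: str, method: str = "unicode") -> str:
    if method == "unicode":
        # Decompose, then drop combining marks; characters that have no
        # decomposition (such as "ø") are kept unchanged
        decomposed = unicodedata.normalize("NFKD", text)
        return "".join(c for c in decomposed if not unicodedata.combining(c))
    if method == "ascii":
        # Decompose, then discard every character that is not ASCII
        decomposed = unicodedata.normalize("NFKD", text)
        return decomposed.encode("ascii", errors="ignore").decode("ascii")
    raise ValueError(f"method must be 'unicode' or 'ascii', got {method!r}")


print(remove_accents_sketch("café"))  # cafe
print(remove_accents_sketch("søt", method="ascii"))  # st
```

In this sketch the ‘ascii’ branch silently drops characters it cannot represent, which is one way the two methods can differ in output quality.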

nlpretext.basic.preprocess.remove_eol_characters(text) -> str

Remove end-of-line (\n) characters.

Parameters

text (str) –

Return type

str

nlpretext.basic.preprocess.remove_multiple_spaces_and_strip_text(text) -> str

Remove multiple spaces, strip text, and remove ‘-’ and ‘*’ characters.

Parameters

text (str) – the text to be processed

Returns

the text with multiple spaces removed and leading/trailing whitespace stripped

Return type

string

nlpretext.basic.preprocess.remove_punct(text, marks=None) -> str

Remove punctuation from text by replacing all instances of marks with whitespace.

Parameters
  • text (str) – raw text

  • marks (str or None) – If specified, remove only the characters in this string, e.g. marks=',;:' removes commas, semi-colons, and colons. Otherwise, all punctuation marks are removed.

Returns

Return type

string

Note

When marks=None, Python’s built-in str.translate() is used to remove punctuation; otherwise, a regular expression is used instead. The former is about 5-10x faster.
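
The two code paths mentioned in the note can be sketched as follows. This is an illustration under assumptions: remove_punct_sketch is a hypothetical name, and the translation table covers only ASCII punctuation from string.punctuation, whereas the real function may handle a wider set of marks.

```python
import re
import string

# Translation table mapping each ASCII punctuation mark to a space
# (assumption: the library's own table may cover more characters)
PUNCT_TABLE = str.maketrans({c: " " for c in string.punctuation})


def remove_punct_sketch(text, marks=None):
    if marks:
        # Regex path: replace each run of the selected marks with one space
        return re.sub("[{}]+".format(re.escape(marks)), " ", text)
    # Fast path: one pass over the text with a precomputed table
    return text.translate(PUNCT_TABLE)


print(remove_punct_sketch("a,b;c:d.", marks=",;:"))  # a b c d.
```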

nlpretext.basic.preprocess.remove_stopwords(text: str, lang: str, custom_stopwords: Optional[list] = None) -> str

Given text str, remove classic stopwords for a given language and custom stopwords given as a list.

Parameters
  • text (string) –

  • lang (string) –

  • custom_stopwords (list of strings) –

Returns

Return type

string
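
The idea of combining a per-language stopword list with custom stopwords can be sketched as below. Everything here is an assumption made for illustration: the tiny STOPWORDS table, the whitespace tokenization, the case-insensitive matching, and the name remove_stopwords_sketch are not taken from nlpretext.

```python
from typing import Iterable, Optional

# Hypothetical, deliberately tiny stopword lists keyed by language code
STOPWORDS = {"en": {"the", "a", "of", "and"}}


def remove_stopwords_sketch(
    text: str, lang: str, custom_stopwords: Optional[Iterable[str]] = None
) -> str:
    stopwords = set(STOPWORDS[lang])
    if custom_stopwords:
        stopwords.update(custom_stopwords)
    # Assumption: split on whitespace and compare tokens in lowercase
    kept = [tok for tok in text.split() if tok.lower() not in stopwords]
    return " ".join(kept)


print(remove_stopwords_sketch("The cat and the hat", "en"))  # cat hat
```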

nlpretext.basic.preprocess.replace_currency_symbols(text, replace_with=None) -> str

Replace all currency symbols in text str with string specified by replace_with str.

Parameters
  • text (str) – raw text

  • replace_with (None or string) –

    if None (default), replace symbols with their standard 3-letter abbreviations (e.g. ‘$’ with ‘USD’, ‘£’ with ‘GBP’); otherwise, pass in a string with which to replace all symbols (e.g. “CURRENCY”)

Returns

Return type

string
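
The two modes of replace_with can be illustrated with a small lookup table. This is a sketch only: the CURRENCIES mapping is a hand-picked subset, and replace_currency_symbols_sketch is a hypothetical name rather than the library's implementation.

```python
import re

# Small illustrative subset of symbol -> 3-letter code mappings
CURRENCIES = {"$": "USD", "£": "GBP", "€": "EUR", "¥": "JPY"}
CURRENCY_REGEX = re.compile("[{}]".format(re.escape("".join(CURRENCIES))))


def replace_currency_symbols_sketch(text, replace_with=None):
    if replace_with is None:
        # Look up the abbreviation of whichever symbol matched
        return CURRENCY_REGEX.sub(lambda m: CURRENCIES[m.group()], text)
    # A callable keeps replace_with literal (no backslash interpretation)
    return CURRENCY_REGEX.sub(lambda m: replace_with, text)


print(replace_currency_symbols_sketch("$10 or £8"))  # USD10 or GBP8
```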

nlpretext.basic.preprocess.replace_emails(text, replace_with='*EMAIL*') -> str

Replace all emails in text str with replace_with str

Parameters
  • text (string) –

  • replace_with (string) – the string you want the email address to be replaced with.

Returns

Return type

string
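
A regex-based replacement of this kind might look like the sketch below. The pattern is deliberately simplified and is an assumption; the expression used by the library is likely more thorough, and replace_emails_sketch is a hypothetical name.

```python
import re

# Simplified pattern: local part, "@", domain, and a top-level domain
EMAIL_REGEX = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")


def replace_emails_sketch(text, replace_with="*EMAIL*"):
    # A callable keeps replace_with literal (no backslash interpretation)
    return EMAIL_REGEX.sub(lambda m: replace_with, text)


print(replace_emails_sketch("Contact jane.doe@example.com today"))
# Contact *EMAIL* today
```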

nlpretext.basic.preprocess.replace_numbers(text, replace_with='*NUMBER*') -> str

Replace all numbers in text str with replace_with str.

Parameters
  • text (string) –

  • replace_with (string) – the string you want the number to be replaced with.

Returns

Return type

string
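
As a rough illustration, number replacement can be sketched with a single pattern. The pattern and the name replace_numbers_sketch are assumptions; in particular, this version ignores signs and leaves digits embedded in words untouched, which may not match the library's behaviour.

```python
import re

# Digits, optionally grouped by "." or "," (e.g. 1,250.50), bounded by
# word boundaries so that digits inside words are not matched
NUMBER_REGEX = re.compile(r"\b\d+(?:[.,]\d+)*\b")


def replace_numbers_sketch(text, replace_with="*NUMBER*"):
    # A callable keeps replace_with literal (no backslash interpretation)
    return NUMBER_REGEX.sub(lambda m: replace_with, text)


print(replace_numbers_sketch("I paid 1,250.50 for 3 items"))
# I paid *NUMBER* for *NUMBER* items
```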

nlpretext.basic.preprocess.replace_phone_numbers(text, country_to_detect: list, replace_with: str = '*PHONE*', method: str = 'regex') -> str

Replace all phone numbers in text str with replace_with str

Parameters
  • text (string) –

  • replace_with (string) – the string you want the phone number to be replaced with.

  • method (['regex','detection']) – ‘regex’ is faster but will miss many numbers, while ‘detection’ will catch every number but takes longer.

  • country_to_detect (list) – If a list of country codes is specified, every number formatted for those countries will be caught. Only used when method = ‘detection’.

Returns

Return type

string

nlpretext.basic.preprocess.replace_urls(text, replace_with: str = '*URL*') -> str

Replace all URLs in text str with replace_with str.

Parameters
  • text (string) –

  • replace_with (string) – the string you want the URL to be replaced with.

Returns

Return type

string
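
URL replacement follows the same pattern-and-substitute shape. The sketch below uses a deliberately loose pattern and the hypothetical name replace_urls_sketch; it is not the expression used by nlpretext, and it will include trailing punctuation that directly follows a URL.

```python
import re

# Anything starting with http://, https:// or www. up to the next
# whitespace, quote, or angle bracket
URL_REGEX = re.compile(r"(?:https?://|www\.)[^\s<>\"']+")


def replace_urls_sketch(text, replace_with="*URL*"):
    # A callable keeps replace_with literal (no backslash interpretation)
    return URL_REGEX.sub(lambda m: replace_with, text)


print(replace_urls_sketch("See https://example.com/docs and www.test.org now"))
# See *URL* and *URL* now
```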

nlpretext.basic.preprocess.unpack_english_contractions(text) -> str

Replace English contractions in text str with their unshortened forms. N.B. The “’d” and “’s” forms are ambiguous (had/would, is/has/possessive), so are left as-is, e.g. “You’re fired. She’s nice.” -> “You are fired. She’s nice.”

Parameters

text (string) –

Returns

Return type

string
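
The unpacking rules, including the decision to leave “’d” and “’s” alone, can be sketched with a handful of substitutions. This is an assumption-laden illustration: unpack_english_contractions_sketch is a hypothetical name, and the patterns are more permissive than a production rule set would be because they accept any word before the contraction.

```python
import re

# Accept both the ASCII apostrophe and the typographic one (U+2019)
APOS = "['\u2019]"


def unpack_english_contractions_sketch(text: str) -> str:
    # Irregular negations first, so the generic "n't" rule does not mangle them
    text = re.sub(rf"\b([Cc]a)n{APOS}t\b", r"\1nnot", text)
    text = re.sub(rf"\b([Ww])on{APOS}t\b", r"\1ill not", text)
    text = re.sub(rf"\b(\w+)n{APOS}t\b", r"\1 not", text)
    text = re.sub(rf"\b(\w+){APOS}ll\b", r"\1 will", text)
    text = re.sub(rf"\b(\w+){APOS}re\b", r"\1 are", text)
    text = re.sub(rf"\b(\w+){APOS}ve\b", r"\1 have", text)
    text = re.sub(rf"\b([Ii]){APOS}m\b", r"\1 am", text)
    # The ambiguous "'d" and "'s" forms are deliberately left untouched
    return text


print(unpack_english_contractions_sketch("You\u2019re fired. She\u2019s nice."))
# You are fired. She’s nice.
```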