nlpretext.augmentation module

exception nlpretext.augmentation.text_augmentation.CouldNotAugment

Bases: ValueError

exception nlpretext.augmentation.text_augmentation.UnavailableAugmenter

Bases: ValueError

nlpretext.augmentation.text_augmentation.are_entities_in_augmented_text(entities: list, augmented_text: str) → bool

Given a list of entities, check whether all the words associated with each entity are still present in the augmented text.

Parameters
  • entities (list) – entities associated with the initial text, must be in the following format:

    [
        {'entity': str, 'word': str, 'startCharIndex': int, 'endCharIndex': int},
        {...}
    ]

  • augmented_text (str) –

Returns

True if all entities are present in the augmented text, False otherwise

Return type

bool
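
The check described above can be sketched as follows. This is a minimal illustration, not the library's implementation: it assumes that "present" means each entity's word occurs as a substring of the augmented text, and the helper name entities_still_present is hypothetical.

```python
from typing import Dict, List


def entities_still_present(entities: List[Dict], augmented_text: str) -> bool:
    """Return True if every entity's word still occurs in the augmented text.

    Assumption: presence is tested with a plain substring match.
    """
    return all(entity["word"] in augmented_text for entity in entities)


# Entity taken from the initial text "I moved to Paris last year".
entities = [
    {"entity": "LOC", "word": "Paris", "startCharIndex": 11, "endCharIndex": 16},
]

print(entities_still_present(entities, "I relocated to Paris last year"))        # True
print(entities_still_present(entities, "I relocated to the capital last year"))  # False
```

A substring match is the simplest reading of the description; a stricter variant could compare whole tokens instead.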

nlpretext.augmentation.text_augmentation.augment_text(text: str, method: str, stopwords: Optional[List[str]] = None, entities: Optional[list] = None) → Tuple[str, list]

Given a text with or without associated entities, generate a new text by modifying some words in the initial one. The modifications depend on the chosen method (substitution with synonym, addition, deletion). If entities are given as input, they remain unchanged. To keep words other than entities unchanged, list them in the stopwords argument.

Parameters
  • text (str) –

  • method ({'wordnet_synonym', 'aug_sub_bert'}) – augmenter to use

  • stopwords (list, optional) – list of words to freeze throughout the augmentation

  • entities (list, optional) – entities associated with the text, if any; must be in the following format:

    [
        {'entity': str, 'word': str, 'startCharIndex': int, 'endCharIndex': int},
        {...}
    ]

Returns

Augmented text and, optionally, the augmented entities

Return type

Tuple[str, list]
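
Because entities are addressed by character offsets, it can help to confirm that each offset pair really points at its word before augmenting. The sketch below is an illustration only: the helper find_misaligned_entities is hypothetical, and it assumes that endCharIndex is exclusive, as in Python slicing.

```python
from typing import Dict, List


def find_misaligned_entities(text: str, entities: List[Dict]) -> List[Dict]:
    """Return the entities whose offsets do not point at their word.

    Assumption: endCharIndex is exclusive, so text[start:end] == word.
    """
    return [
        entity
        for entity in entities
        if text[entity["startCharIndex"]:entity["endCharIndex"]] != entity["word"]
    ]


text = "Alice flew to Berlin on Monday"
entities = [
    {"entity": "PER", "word": "Alice", "startCharIndex": 0, "endCharIndex": 5},
    {"entity": "LOC", "word": "Berlin", "startCharIndex": 14, "endCharIndex": 20},
]

# An empty list means every entity lines up with the text.
print(find_misaligned_entities(text, entities))

# With valid inputs, the documented call shape would be:
# augment_text(text, method="wordnet_synonym", stopwords=["Monday"], entities=entities)
```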

nlpretext.augmentation.text_augmentation.check_interval_included(element1: dict, element2: dict) → Optional[Tuple[dict, dict]]

Compare the start and end positions of two entities to determine whether one is nested in the other.

Parameters
  • element1 (dict) –

  • element2 (dict) –

    both in the following format:

    {'entity': str, 'word': str, 'startCharIndex': int, 'endCharIndex': int}

Returns

  • If one of the two entities should be removed, a tuple (element to remove, element to keep)

  • Otherwise, None
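
A nesting test of this kind can be sketched as an interval containment check. The code below is an assumption-laden illustration rather than the library's code: the name nested_pair is hypothetical, the contained span is taken to be the one to remove, and identical spans are treated as nested.

```python
from typing import Dict, Optional, Tuple


def nested_pair(element1: Dict, element2: Dict) -> Optional[Tuple[Dict, Dict]]:
    """Return (element to remove, element to keep) if one span contains the other.

    Assumption: the contained (shorter) span is the one to remove; identical
    spans are treated as nested and element1 is removed.
    """
    s1, e1 = element1["startCharIndex"], element1["endCharIndex"]
    s2, e2 = element2["startCharIndex"], element2["endCharIndex"]
    if s2 <= s1 and e1 <= e2:  # element1 lies inside element2
        return element1, element2
    if s1 <= s2 and e2 <= e1:  # element2 lies inside element1
        return element2, element1
    return None  # disjoint or only partially overlapping


new_york = {"entity": "LOC", "word": "New York", "startCharIndex": 0, "endCharIndex": 8}
york = {"entity": "LOC", "word": "York", "startCharIndex": 4, "endCharIndex": 8}
print(nested_pair(new_york, york))  # (york, new_york): the shorter span is removed
```

Partially overlapping spans return None here, since neither one contains the other.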

nlpretext.augmentation.text_augmentation.clean_sentence_entities(text: str, entities: list) → list

Check entities pairwise and remove nested ones; the longest entity is kept.

Parameters
  • text (str) – augmented text

  • entities (list) – entities associated with the augmented text, must be in the following format:

    [
        {'entity': str, 'word': str, 'startCharIndex': int, 'endCharIndex': int},
        {...}
    ]

Returns

Cleaned entities

Return type

list
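
The pairwise cleanup can be sketched as below. This is a simplified illustration under stated assumptions, not the library's implementation: the helper drop_nested_entities is hypothetical, it works on the offsets alone (so it does not take the text argument), and when two spans are identical it drops the later one.

```python
from itertools import combinations
from typing import Dict, List


def drop_nested_entities(entities: List[Dict]) -> List[Dict]:
    """Keep only entities that are not contained in a longer entity.

    Assumption: spans are compared pairwise on their character offsets, and
    when two spans are identical the later one in the list is dropped.
    """
    to_remove = set()
    for (i, a), (j, b) in combinations(enumerate(entities), 2):
        a_start, a_end = a["startCharIndex"], a["endCharIndex"]
        b_start, b_end = b["startCharIndex"], b["endCharIndex"]
        if a_start <= b_start and b_end <= a_end:
            to_remove.add(j)  # b is inside a (or identical to it)
        elif b_start <= a_start and a_end <= b_end:
            to_remove.add(i)  # a is inside the longer span b
    return [e for k, e in enumerate(entities) if k not in to_remove]


# Offsets refer to "She works at New York University".
entities = [
    {"entity": "LOC", "word": "New York", "startCharIndex": 13, "endCharIndex": 21},
    {"entity": "ORG", "word": "New York University", "startCharIndex": 13, "endCharIndex": 32},
]
cleaned = drop_nested_entities(entities)
print([e["word"] for e in cleaned])  # ['New York University']
```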

nlpretext.augmentation.text_augmentation.get_augmented_entities(sentence_augmented: str, entities: list) → list

Get entities with updated positions (start and end) in augmented text

Parameters
  • sentence_augmented (str) – augmented text

  • entities (list) – entities associated with the initial text, must be in the following format:

    [
        {'entity': str, 'word': str, 'startCharIndex': int, 'endCharIndex': int},
        {...}
    ]

Returns

Entities with updated positions in the augmented text

Return type

list
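
Recomputing positions after augmentation can be sketched as a search for each entity's word in the new text. The code below is a minimal illustration, not the library's implementation: the helper relocate_entities is hypothetical, it only uses the first occurrence of each word, it assumes endCharIndex is exclusive, and it skips words that cannot be found.

```python
from typing import Dict, List


def relocate_entities(augmented_text: str, entities: List[Dict]) -> List[Dict]:
    """Return copies of the entities with offsets recomputed for augmented_text.

    Assumptions: each word is located by its first occurrence (str.find),
    endCharIndex is exclusive, and words that cannot be found are skipped.
    """
    relocated = []
    for entity in entities:
        start = augmented_text.find(entity["word"])
        if start == -1:
            continue  # the word was altered by the augmentation
        end = start + len(entity["word"])
        relocated.append({**entity, "startCharIndex": start, "endCharIndex": end})
    return relocated


# Offsets refer to the initial text "I moved to Paris last year".
entities = [
    {"entity": "LOC", "word": "Paris", "startCharIndex": 11, "endCharIndex": 16},
]
augmented = "I relocated to Paris last year"
updated = relocate_entities(augmented, entities)
print(updated[0]["startCharIndex"], updated[0]["endCharIndex"])  # 15 20
```

Texts in which the same word appears several times would need a more careful search than str.find provides.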

nlpretext.augmentation.text_augmentation.get_augmenter(method: str, stopwords: Optional[List[str]] = None) → nlpaug.augmenter.word.synonym.SynonymAug

Initialize an augmenter depending on the given method.

Parameters
  • method (str) – augmenter to use; supported methods are wordnet_synonym and aug_sub_bert

  • stopwords (list, optional) – list of words to freeze throughout the augmentation

Returns

Initialized nlpaug augmenter

Return type

nlpaug.augmenter.word.synonym.SynonymAug

nlpretext.augmentation.text_augmentation.process_entities_and_text(entities: list, text: str, augmented_text: str)

Given a list of initial entities, verify that they have not been altered by the data augmentation operation and are still in the augmented text.

Parameters
  • entities (list) – entities associated with the text, must be in the following format:

    [
        {'entity': str, 'word': str, 'startCharIndex': int, 'endCharIndex': int},
        {...}
    ]

  • text (str) – initial text

  • augmented_text (str) – new text resulting from the data augmentation operation

Returns

Augmented text and entities with their updated positions in the augmented text