omdenalore.natural_language_processing package

Submodules

omdenalore.natural_language_processing.ner module

class omdenalore.natural_language_processing.ner.TaggedCorpus(text: str, annotations: List[Tuple[int, int, str]], tokenizer: Optional[spacy.tokenizer.Tokenizer] = None)

Bases: object

to_dict() Dict[str, List[str]]

Get the entities in a dictionary.

Returns

Dictionary with token sequences (ex: {entity_1: [ENT1, …], entity_2: [ENT2, …], …})

to_doc(entity: Optional[str] = None) spacy.tokens.Doc

Get the entities in an annotated spaCy document.

Parameters

entity (str) – Optional parameter - if set, the returned document is annotated with only the selected entity

Returns

spaCy document annotated with entities

to_iob_list(entity: Optional[str] = None) List[str]

Get the entities as a list of IOB tags.

Parameters

entity (str) – Optional parameter - if set, return tags for only the selected entity

Returns

List of strings where each string represents the entity associated with the token

to_multi_iob_list() Tuple[List[Tuple[str, ...]], List[str]]

Get all the entities in IOB format, with one tag per entity type for each token.

Returns

List of tuples where each tuple contains one element for each entity present in the document

to_token_char_spans() List[Tuple[int, int]]

Get the character spans for each token.

Returns

List of character spans

to_tokens() List[str]

Get the token list.

Returns

List of tokens
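
A minimal usage sketch, assuming character-offset annotations of the form (start, stop, label) as in the constructor signature, and that a default tokenizer is created when none is passed (the text and labels are illustrative):

>>> from omdenalore.natural_language_processing.ner import TaggedCorpus
>>> text = "Aspirin reduced fever in most patients."
>>> annotations = [(0, 7, "drug")]  # character offsets covering "Aspirin"
>>> corpus = TaggedCorpus(text, annotations)
>>> tokens = corpus.to_tokens()  # token strings from the tokenizer
>>> iob = corpus.to_iob_list()  # one IOB tag per token, e.g. "B-drug" then "O"s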

omdenalore.natural_language_processing.ner.create_blank_nlp(train_data: pandas.Series) spacy.Language

Create a blank NLP model for training

Parameters

train_data (pd.Series) – Series of training data

Returns

Blank spaCy model
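
For instance, a minimal sketch (the training texts are illustrative):

>>> import pandas as pd
>>> from omdenalore.natural_language_processing.ner import create_blank_nlp
>>> train_data = pd.Series(["Aspirin reduced fever.", "Ibuprofen relieved pain."])
>>> nlp = create_blank_nlp(train_data)  # blank pipeline ready for NER training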

omdenalore.natural_language_processing.ner.doc2ents(doc: spacy.tokens.Doc) List[str]

Take a spaCy Doc and return a list of token-level IOB entity tags

Parameters

doc (spacy.tokens.Doc) – spaCy document

Returns

List of IOB entity tag strings
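
A hedged sketch (the pretrained en_core_web_sm model is assumed to be installed; the exact tags depend on the entities found in the document):

>>> import spacy
>>> from omdenalore.natural_language_processing.ner import doc2ents
>>> nlp = spacy.load("en_core_web_sm")
>>> doc = nlp("Apple is looking at buying a U.K. startup.")
>>> doc2ents(doc)  # one IOB string per token, e.g. ["B-ORG", "O", "O", ...]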

omdenalore.natural_language_processing.ner.generate_examples(texts: Iterable[str], entity_offsets: Iterable[List[Tuple[int, int, str]]], nlp: spacy.language.Language) List[spacy.training.Example]

Generate spaCy Example objects for training

Parameters
  • texts (Iterable[str]) – List of texts to generate examples from

  • entity_offsets (Iterable[Annotations]) – List of annotations for each text

  • nlp (spacy.language.Language) – spaCy model

Returns

List of spaCy Example objects
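
A minimal sketch of how the pieces fit together for training (the texts and offsets are illustrative):

>>> import spacy
>>> from omdenalore.natural_language_processing.ner import generate_examples
>>> nlp = spacy.blank("en")
>>> texts = ["Aspirin reduced fever."]
>>> entity_offsets = [[(0, 7, "drug")]]  # one annotation list per text
>>> examples = generate_examples(texts, entity_offsets, nlp)
>>> # the resulting Example objects can be fed to nlp.update() during training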

omdenalore.natural_language_processing.ner.get_entities(str: str) List[Tuple[int, int, str]]

Process annotations into the format (start, stop, label)

Parameters

str (str) – String of annotations

Returns

List of tuples of the form (start, stop, label)

omdenalore.natural_language_processing.ner.rm_colliding(x: List[Tuple[int, int, str]]) List[Tuple[int, int, str]]

Remove any overlapping annotations (one token cannot have two entities)

Parameters

x (List[Tuple[int, int, str]]) – List of annotations

Returns

List of annotations with no overlapping annotations
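
For example (which of two colliding spans is kept is left to the implementation):

>>> from omdenalore.natural_language_processing.ner import rm_colliding
>>> spans = [(0, 7, "drug"), (4, 12, "dose"), (20, 26, "effect")]
>>> rm_colliding(spans)  # drops one of the two overlapping spans; (20, 26, "effect") survives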

omdenalore.natural_language_processing.ner.rm_groups(a: List[Tuple[int, int, str]]) List[Tuple[int, int, str]]

Remove the group number info, e.g. g2_response_rate becomes response_rate

Parameters

a (Annotations) – Annotations

Returns

Annotations with group number removed
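
For example, assuming any gN_ prefix is stripped as described above:

>>> from omdenalore.natural_language_processing.ner import rm_groups
>>> rm_groups([(0, 10, "g2_response_rate"), (15, 20, "g1_dose")])
[(0, 10, 'response_rate'), (15, 20, 'dose')]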

omdenalore.natural_language_processing.postprocess_text module

omdenalore.natural_language_processing.postprocess_text.adjust_tags(tags: List[str], tag_set: Optional[Set[str]] = None, length: int = 512) List[List[str]]

Keep only tags that were used in training and bring all tag lists to the same length.

Parameters
  • tags (List[str]) – List of tags

  • tag_set (Set[str]) – Set of tags to keep

  • length (int) – Target length for the tag lists

Returns

List of adjusted tag lists

omdenalore.natural_language_processing.postprocess_text.df2jsonl(df: pandas.DataFrame, path: Optional[str] = None, pmid: str = 'pmid', text: str = 'text', start: str = 'start', stop: str = 'stop', entity: str = 'entity', entity_text: str = 'entity_text') Optional[str]

Convert a pandas dataframe to the output JSONL format

Parameters
  • df (pd.DataFrame) – Dataframe with entity information

  • path (str) – Destination file path (if None, a string with the equivalent JSONL is returned)

  • pmid (str) – Column name for the PMID column

  • text (str) – Column name for the text column

  • start (str) – Column name for the start column

  • stop (str) – Column name for the stop column

  • entity (str) – Column name for the entity column

  • entity_text (str) – Column name for the entity_text column

Returns

None if path is provided (the file is written to disk); a JSONL string if path is None. The output JSONL file has the shape {id: pmid, text: <abstract>, predictions: [{start: x, end: y, entity: x}, …]}
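
A sketch with the default column names (the dataframe content is illustrative):

>>> import pandas as pd
>>> from omdenalore.natural_language_processing.postprocess_text import df2jsonl
>>> df = pd.DataFrame(
...     {
...         "pmid": ["12345"],
...         "text": ["Aspirin reduced fever."],
...         "start": [0],
...         "stop": [7],
...         "entity": ["drug"],
...         "entity_text": ["Aspirin"],
...     }
... )
>>> jsonl_string = df2jsonl(df)  # path=None, so the JSONL is returned as a string
>>> df2jsonl(df, path="predictions.jsonl")  # or write the file to disk instead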

omdenalore.natural_language_processing.postprocess_text.doc2df(doc: spacy.tokens.doc.Doc, regex_pmid: str = 'PMID: (\\d+)') pandas.DataFrame

Get a dataframe from a spaCy Doc object.

Parameters
  • doc (spacy.tokens.Doc) – A single spaCy Doc object with annotations (entities)

  • regex_pmid (str) – Raw string with the regex to get the PMID numbers from text (defaults to r"PMID: (\d+)")

Returns

Pandas dataframe with columns [pmid, text, start, stop, entity, entity_text]
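
A hedged sketch (the entity span is set manually; the PMID is pulled from the text by the default regex):

>>> import spacy
>>> from spacy.tokens import Span
>>> from omdenalore.natural_language_processing.postprocess_text import doc2df
>>> nlp = spacy.blank("en")
>>> doc = nlp("PMID: 12345 Aspirin reduced fever.")
>>> doc.ents = [Span(doc, 3, 4, label="drug")]  # token span covering "Aspirin"
>>> df = doc2df(doc)  # columns: pmid, text, start, stop, entity, entity_text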

omdenalore.natural_language_processing.postprocess_text.doc2df_wide(doc: spacy.tokens.doc.Doc, regex_pmid: str = 'PMID: (\\d+)') pandas.DataFrame

Get a wide dataframe from a spaCy Doc object.

Parameters
  • doc (spacy.tokens.Doc) – A single spaCy Doc object with annotations (entities)

  • regex_pmid (str) – Raw string with the regex to get the PMID numbers from text (defaults to r"PMID: (\d+)")

Returns

Pandas dataframe with one column for each of {start, stop, entity_text} per entity

omdenalore.natural_language_processing.postprocess_text.list2iob(tags: List[str]) List[str]

Convert a list of tag strings to a list of IOB tags.

Parameters

tags (List[str]) – List of tags

Returns

List of IOB tags
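
A sketch of the intended transformation, assuming runs of identical tags become B-/I- prefixed and "O" marks no entity:

>>> from omdenalore.natural_language_processing.postprocess_text import list2iob
>>> list2iob(["O", "drug", "drug", "O", "dose"])
['O', 'B-drug', 'I-drug', 'O', 'B-dose']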

omdenalore.natural_language_processing.preprocess_text module

class omdenalore.natural_language_processing.preprocess_text.TextPreprocessor(pos_tags_removal=None, spacy_model=None, remove_numbers: bool = True, remove_special_chars: bool = True, remove_stopwords: bool = True, lemmatize: bool = True, tokenize: bool = False)

Bases: object

Preprocess texts

Parameters
  • pos_tags_removal (List[str]) – list of PoS tags to remove

  • spacy_model – spaCy model to be used, default is en_core_web_sm

  • remove_numbers (bool) – whether or not to remove numbers from the inputs

  • remove_special_chars (bool) – whether or not to remove the special characters

  • remove_stopwords (bool) – whether or not to remove stopwords

  • lemmatize (bool) – whether or not to lemmatize the inputs

  • tokenize (bool) – whether or not to tokenize the inputs

Example

>>> model = TextPreprocessor.load_model()
>>> preprocessor = TextPreprocessor(spacy_model=model)
>>> sample = "Hello wROld! I hAvE $400 Dollars in my checking account and I am sleeping"
>>> clean = preprocessor.preprocess_text(sample)
>>> print(f"Sample: {sample} Cleaned: {clean}")

Sample: Hello wROld! I hAvE $400 Dollars in my checking account and I am sleeping

Cleaned: hello wrold dollar checking account sleep

>>> my_list = ["my name is $Yudhiesh", "that house is far AWAY", "the placement @RT"]
>>> cleaned_list = preprocessor.preprocess_text_lists(my_list)
>>> print(f"Sample: {my_list} Cleaned: {cleaned_list}")

Sample: ['my name is $Yudhiesh', 'that house is far AWAY', 'the placement @RT']

Cleaned: ['yudhiesh', 'house far away', 'placement rt']

static download_spacy_model(model='en_core_web_sm')

Downloads a spaCy model to be used for preprocessing the text inputs

static load_model(model='en_core_web_sm')

Loads and returns a spaCy model

preprocess_text(text: str) str

Preprocesses a single string

Parameters

text (str) – Text string to clean

Returns

Cleaned text string

preprocess_text_lists(texts: List[str]) List[str]

Preprocesses a list of strings

Parameters

texts (List[str]) – List of texts to preprocess

Returns

List of preprocessed strings

omdenalore.natural_language_processing.topic_modelling module

omdenalore.natural_language_processing.twitter module

omdenalore.natural_language_processing.utils module

class omdenalore.natural_language_processing.utils.TextUtils

Bases: object

Utility functions for handling text data

static noise_removal(input_text: str, noise_list=['is', 'a', 'this', 'the', 'an']) str

Remove noise from the input text. This function can be used to remove common English words such as "is", "a", and "the". You can supply your own noise words via the optional parameter noise_list (default: ["is", "a", "this", "the", "an"]).

Parameters
  • input_text (str) – Input text to remove noise from

  • noise_list (list) – (Optional) List of words that you want to remove from the input text

Returns

Clean text with no noise

Example

from omdenalore.natural_language_processing.utils import TextUtils

>>> input = "Hello, the chicken crossed the street"
>>> TextUtils.noise_removal(input)
"Hello, chicken crossed street"

static remove_hashtags(text: str) str

Remove hashtags from the input text.

Parameters

text (str) – Input text to remove hashtags from

Returns

Output text without hashtags

Example

from omdenalore.natural_language_processing.utils import TextUtils

>>> TextUtils.remove_hashtags("I love #hashtags")
"I love "

static remove_regex_pattern(input_text: str, regex_pattern: str) str

Remove any matches of a regular expression from the input text

Parameters
  • input_text (str) – Input text to remove regex from

  • regex_pattern (str) – String of regex that you want to remove from the input text

Returns

Clean text with the regex matches removed

Example

from omdenalore.natural_language_processing.utils import TextUtils

>>> regex_pattern = r"#[\w]*"
>>> input = "remove this #hastag"
>>> TextUtils.remove_regex_pattern(input, regex_pattern)
"remove this  hastag"
static remove_url(text: str) str

Remove URLs from the input text.

Parameters

text (str) – Input text to remove URLs from

Returns

Text without URLs

Example

from omdenalore.natural_language_processing.utils import TextUtils

>>> TextUtils.remove_url('I like urls http://www.google.com')
'I like urls '

static standardize_words(input_text: str, lookup_dictionary: Dict[str, str]) Optional[str]

Replace acronyms, hashtags, colloquial slang, etc. with standard words

Text data often contains words or phrases which are not present in any standard lexical dictionaries. These pieces are not recognized by search engines and models.

Parameters
  • input_text (str) – Input text to standardize

  • lookup_dictionary (dict) – Dictionary with slang as key and standard word as value

Returns

Text with standard words

Example

from omdenalore.natural_language_processing.utils import TextUtils

>>> lookup_dict = {"rt": "Retweet", "dm": "direct message", "awsm": "awesome", "luv": "love"}
>>> TextUtils.standardize_words("rt I luv to dm, it's awsm", lookup_dict)
"Retweet I love to direct message, it's awesome"

Module contents