omdenalore.natural_language_processing package
Submodules
omdenalore.natural_language_processing.ner module
- class omdenalore.natural_language_processing.ner.TaggedCorpus(text: str, annotations: List[Tuple[int, int, str]], tokenizer: Optional[spacy.tokenizer.Tokenizer] = None)
Bases:
object
- to_dict() Dict[str, List[str]]
Get the entities in a dictionary.
- Returns
Dictionary with token sequences (ex: {entity_1: [ENT1, …], entity_2: [ENT2, …], …})
- to_doc(entity: Optional[str] = None) spacy.tokens.Doc
Get the entities in an annotated spaCy document.
- Parameters
entity (str) – Optional parameter - if set, return document annotated with only selected entity
- Returns
SpaCy document annotated with entities
- to_iob_list(entity: Optional[str] = None) List[str]
Get the entities as a list of IOB tags.
- Parameters
entity (str) – Optional parameter - if set, return tags for only the selected entity
- Returns
List of strings where each string represents the entity associated with the token
- to_multi_iob_list() Tuple[List[Tuple[str, ...]], List[str]]
Get all the entities in a list of entities (IOB format).
- Returns
List of tuples where each tuple contains one element for each entity present in the document
- to_token_char_spans() List[Tuple[int, int]]
Get the character spans for each token.
- Returns
List of character spans
- to_tokens() List[str]
Get the token list.
- Returns
List of tokens
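- Example
A minimal usage sketch; the sample text and character offsets below are illustrative, not drawn from the library's documentation:
from omdenalore.natural_language_processing.ner import TaggedCorpus
>>> import spacy
>>> nlp = spacy.blank("en")
>>> text = "Aspirin reduced response rate in the trial."
>>> annotations = [(0, 7, "drug"), (16, 29, "response_rate")]
>>> corpus = TaggedCorpus(text, annotations, tokenizer=nlp.tokenizer)
>>> corpus.to_tokens()  # list of token strings
>>> corpus.to_iob_list(entity="drug")  # IOB tags for the "drug" entity only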
- omdenalore.natural_language_processing.ner.create_blank_nlp(train_data: pandas.Series) spacy.Language
Create a blank NLP model for training
- Parameters
train_data (pd.Series) – Series of training data
- Returns
Spacy model
- omdenalore.natural_language_processing.ner.doc2ents(doc: spacy.tokens.Doc) List[str]
Take a spaCy Doc and return a list of IOB token-based entities
- Parameters
doc (spacy.tokens.Doc) – Spacy document
- Returns
List of entity strings
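- Example
A sketch of the expected call pattern, assuming any spaCy pipeline with an NER component (the model name and text are illustrative):
from omdenalore.natural_language_processing.ner import doc2ents
>>> import spacy
>>> nlp = spacy.load("en_core_web_sm")
>>> doc = nlp("Apple is looking at buying a U.K. startup.")
>>> doc2ents(doc)  # one IOB string per token, e.g. "B-ORG", "O", ...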
- omdenalore.natural_language_processing.ner.generate_examples(texts: Iterable[str], entity_offsets: Iterable[List[Tuple[int, int, str]]], nlp: spacy.language.Language) List[spacy.training.Example]
Generate spacy Example objects for training
- Parameters
texts (Iterable[str]) – List of texts to generate examples from
entity_offsets (Iterable[Annotations]) – List of annotations for each text
nlp (spacy.language.Language) – Spacy model
- Returns
List of spacy Example objects
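- Example
A hedged sketch that chains create_blank_nlp and generate_examples; the text and offsets are illustrative:
from omdenalore.natural_language_processing.ner import create_blank_nlp, generate_examples
>>> import pandas as pd
>>> texts = pd.Series(["Aspirin reduced response rate in the trial."])
>>> entity_offsets = [[(0, 7, "drug")]]  # one annotation list per text
>>> nlp = create_blank_nlp(texts)
>>> examples = generate_examples(texts, entity_offsets, nlp)  # one Example per text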
- omdenalore.natural_language_processing.ner.get_entities(str: str) List[Tuple[int, int, str]]
Process annotations into the format (start, stop, label)
- Parameters
str (str) – String of annotations
- Returns
List of tuples of the form (start, stop, label)
- omdenalore.natural_language_processing.ner.rm_colliding(x: List[Tuple[int, int, str]]) List[Tuple[int, int, str]]
Remove any overlapping annotations (one token cannot have two entities)
- Parameters
x (List[Tuple[int, int, str]]) – List of annotations
- Returns
List of annotations with no overlapping annotations
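- Example
An illustrative call (the spans are made up; which of two colliding spans is kept is an implementation detail not specified here):
from omdenalore.natural_language_processing.ner import rm_colliding
>>> spans = [(0, 7, "drug"), (16, 29, "response_rate"), (20, 29, "rate")]  # last two overlap
>>> rm_colliding(spans)  # returns the list with the collision resolved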
- omdenalore.natural_language_processing.ner.rm_groups(a: List[Tuple[int, int, str]]) List[Tuple[int, int, str]]
Remove the group number info, e.g. g2_response_rate becomes response_rate
- Parameters
a (Annotations) – Annotations
- Returns
Annotations with group number removed
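- Example
An illustrative call based on the behaviour described above:
from omdenalore.natural_language_processing.ner import rm_groups
>>> rm_groups([(0, 7, "g2_response_rate")])  # label becomes "response_rate"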
omdenalore.natural_language_processing.postprocess_text module
- omdenalore.natural_language_processing.postprocess_text.adjust_tags(tags: List[str], tag_set: Optional[Set[str]] = None, length: int = 512) List[List[str]]
Keep only tags that were used in training and pad all token lists to the same length.
- Parameters
tags (List[str]) – List of tags
tag_set (Set[str]) – Set of tags to keep
length (int) – Length to pad or truncate the token lists to
- Returns
List of adjusted tag lists
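- Example
A hedged sketch; per the signature the result is a list of tag lists, and tags outside tag_set are assumed to be dropped or masked:
from omdenalore.natural_language_processing.postprocess_text import adjust_tags
>>> tags = ["O", "B-drug", "I-drug", "B-unseen"]
>>> adjust_tags(tags, tag_set={"O", "B-drug", "I-drug"}, length=8)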
- omdenalore.natural_language_processing.postprocess_text.df2jsonl(df: pandas.DataFrame, path: Optional[str] = None, pmid: str = 'pmid', text: str = 'text', start: str = 'start', stop: str = 'stop', entity: str = 'entity', entity_text: str = 'entity_text') Optional[str]
Convert a pandas dataframe to the output JSONL format
- Parameters
df (pd.DataFrame) – Dataframe with entity information
path (str) – Destination file path (if None, a string with the equivalent JSONL is returned)
pmid (str) – Column name for the PMID column
text (str) – Column name for the text column
start (str) – Column name for the start column
stop (str) – Column name for the stop column
entity (str) – Column name for the entity column
entity_text (str) – Column name for the entity_text column
- Returns
None if path is provided (the file is written to disk), or a JSONL string if path is None. The output JSONL has the shape {id: pmid, text: <abstract>, predictions: [{start: x, end: y, entity: x}, …]}
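- Example
A minimal sketch using the default column names from the signature (the row values are illustrative):
from omdenalore.natural_language_processing.postprocess_text import df2jsonl
>>> import pandas as pd
>>> df = pd.DataFrame({"pmid": ["12345"], "text": ["Aspirin reduced response rate."], "start": [0], "stop": [7], "entity": ["drug"], "entity_text": ["Aspirin"]})
>>> jsonl = df2jsonl(df)  # path=None, so the JSONL is returned as a string
>>> df2jsonl(df, path="predictions.jsonl")  # writes to disk and returns None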
- omdenalore.natural_language_processing.postprocess_text.doc2df(doc: spacy.tokens.doc.Doc, regex_pmid: str = 'PMID: (\\d+)') pandas.DataFrame
Get a dataframe from a spaCy Doc object.
- Parameters
doc (spacy.tokens.Doc) – A single spaCy Doc object with annotations (entities)
regex_pmid (str) – Raw string with the regex used to extract PMID numbers from the text (defaults to r"PMID: (\d+)")
- Returns
Pandas dataframe with columns [`pmid`, `text`, `start`, `stop`, `entity`, `entity_text`]
- omdenalore.natural_language_processing.postprocess_text.doc2df_wide(doc: spacy.tokens.doc.Doc, regex_pmid: str = 'PMID: (\\d+)') pandas.DataFrame
Get a wide dataframe from a spaCy Doc object.
- Parameters
doc (spacy.tokens.Doc) – A single spaCy Doc object with annotations (entities)
regex_pmid (str) – Raw string with the regex used to extract PMID numbers from the text (defaults to r"PMID: (\d+)")
- Returns
Pandas dataframe with one column for each of {`start`, `stop`, `entity_text`} per entity
- omdenalore.natural_language_processing.postprocess_text.list2iob(tags: List[str]) List[str]
Convert a list of string tags to a list of IOB tags.
- Parameters
tags (List[str]) – List of tags
- Returns
List of IOB tags
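- Example
A hedged sketch, assuming the input uses plain entity labels with "O" for non-entity tokens (the docstring does not pin down the input convention):
from omdenalore.natural_language_processing.postprocess_text import list2iob
>>> list2iob(["O", "drug", "drug", "O"])  # expected along the lines of ["O", "B-drug", "I-drug", "O"]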
omdenalore.natural_language_processing.preprocess_text module
- class omdenalore.natural_language_processing.preprocess_text.TextPreprocessor(pos_tags_removal=None, spacy_model=None, remove_numbers: bool = True, remove_special_chars: bool = True, remove_stopwords: bool = True, lemmatize: bool = True, tokenize: bool = False)
Bases:
object
Preprocess texts
- Parameters
pos_tags_removal (List[str]) – list of PoS tags to remove
spacy_model – spaCy model to be used, default is en_core_web_sm
remove_numbers (bool) – whether or not to remove numbers from the inputs
remove_special_chars (bool) – whether or not to remove the special characters
remove_stopwords (bool) – whether or not to remove stopwords
lemmatize (bool) – whether or not to lemmatize the inputs
tokenize (bool) – whether or not to tokenize the inputs
- Example
>>> model = TextPreprocessor.load_model()
>>> preprocessor = TextPreprocessor(spacy_model=model)
>>> sample = "Hello wROld! I hAvE $400 Dollars in my checking account and I am sleeping"
>>> clean = preprocessor.preprocess_text(sample)
>>> print(f"Sample: {sample} Cleaned: {clean}")
Sample: Hello wROld! I hAvE $400 Dollars in my checking account and I am sleeping
Cleaned: hello wrold dollar checking account sleep
>>> my_list = ["my name is $Yudhiesh", "that house is far AWAY", "the placement @RT"]
>>> cleaned_list = preprocessor.preprocess_text_lists(my_list)
>>> print(f"Sample: {my_list} Cleaned: {cleaned_list}")
Sample: ['my name is $Yudhiesh', 'that house is far AWAY', 'the placement @RT']
Cleaned: ['yudhiesh', 'house far away', 'placement rt']
- static download_spacy_model(model='en_core_web_sm')
Downloads a spaCy model to be used for preprocessing the text inputs
- static load_model(model='en_core_web_sm')
Loads and returns a spaCy model
- preprocess_text(text: str) str
Preprocesses a single string
- Parameters
text – text string to clean
- Returns
Preprocessed text string
- preprocess_text_lists(texts: List[str]) List[str]
Preprocesses a list of strings
- Parameters
texts – List of texts to preprocess
- Returns
List of strings
omdenalore.natural_language_processing.topic_modelling module
omdenalore.natural_language_processing.twitter module
omdenalore.natural_language_processing.utils module
- class omdenalore.natural_language_processing.utils.TextUtils
Bases:
object
Utility functions for handling text data
- static noise_removal(input_text: str, noise_list=['is', 'a', 'this', 'the', 'an']) str
Remove noise from the input text. This function can be used to remove common English words like "is", "a", "the". You can add your own noise words via the optional parameter noise_list (default: ["is", "a", "this", "the", "an"]).
- Parameters
input_text (str) – Input text to remove noise from
noise_list (list) – (Optional) List of words that you want to remove from the input text
- Returns
Clean text with no noise
- Example
from omdenalore.natural_language_processing.utils import TextUtils
>>> input = "Hello, the chicken crossed the street"
>>> TextUtils.noise_removal(input)
"Hello, chicken crossed the street"
- static remove_hashtags(text: str) str
Remove hashtags from the input text.
- Parameters
text (str) – Input text to remove hashtags from
- Returns
Output text without hashtags
- Example
from omdenalore.natural_language_processing.utils import TextUtils
>>> TextUtils.remove_hashtags("I love #hashtags")
"I love "
- static remove_regex_pattern(input_text: str, regex_pattern: str) str
Remove any matches of a regular expression from the input text
- Parameters
input_text (str) – Input text to remove regex from
regex_pattern (str) – String of regex that you want to remove from the input text
- Returns
Clean text with the regex matches removed
- Example
from omdenalore.natural_language_processing.utils import remove_regex_pattern
>>> regex_pattern = r"#[\w]*" >>> input = "remove this #hastag" >>> TextUtils.remove_regex_pattern(input, regex_pattern) "remove this hastag"
- static remove_url(text: str) str
Remove URLs from the input text.
- Parameters
text (str) – Input text to remove URLs from
- Returns
Text without URLs
- Example
from omdenalore.natural_language_processing.utils import TextUtils
>>> TextUtils.remove_url('I like urls http://www.google.com')
'I like urls '
- static standardize_words(input_text: str, lookup_dictionary: Dict[str, str]) Optional[str]
Replace acronyms, hashtags, colloquial slang, etc. with standard words
Text data often contains words or phrases which are not present in any standard lexical dictionaries. These pieces are not recognized by search engines and models.
- Parameters
input_text (str) – Input text to standardize
lookup_dictionary (dict) – Dictionary with slang as key and the standard word as value
- Returns
Text with standard words
- Example
from omdenalore.natural_language_processing.utils import TextUtils
>>> lookup_dict = {"rt": "Retweet", "dm": "direct message", "awsm": "awesome", "luv": "love"}
>>> TextUtils.standardize_words("rt I luv to dm, it's awsm", lookup_dict)
"Retweet I love to direct message, it's awesome"