omdenalore.natural_language_processing package
Submodules
omdenalore.natural_language_processing.ner module
- class omdenalore.natural_language_processing.ner.TaggedCorpus(text: str, annotations: List[Tuple[int, int, str]], tokenizer: Optional[spacy.tokenizer.Tokenizer] = None)
Bases:
object
- to_dict() Dict[str, List[str]]
Get the entities in a dictionary.
- Returns
Dictionary with token sequences (ex: {entity_1: [ENT1, …], entity_2: [ENT2, …], …})
- to_doc(entity: Optional[str] = None) spacy.tokens.Doc
Get the entities in an annotated spaCy document.
- Parameters
entity (str) – Optional parameter - if set, return document annotated with only selected entity
- Returns
SpaCy document annotated with entities
- to_iob_list(entity: Optional[str] = None) List[str]
Get the entities as a list of IOB tags.
- Parameters
entity (str) – Optional parameter - if set, return tags for only the selected entity
- Returns
List of strings where each string represents the entity associated with the token
- to_multi_iob_list() Tuple[List[Tuple[str, ...]], List[str]]
Get all the entities in a list of entities (IOB format).
- Returns
List of tuples where each tuple contains one element for each entity present in the document
- to_token_char_spans() List[Tuple[int, int]]
Get the character spans for each token.
- Returns
List of character spans
- to_tokens() List[str]
Get the token list.
- Returns
List of tokens
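- Example
A minimal usage sketch; the sample text and character offsets below are illustrative, not drawn from the library's documentation:
from omdenalore.natural_language_processing.ner import TaggedCorpus
>>> import spacy
>>> nlp = spacy.blank("en")
>>> text = "Aspirin reduced response rate in the trial."
>>> annotations = [(0, 7, "drug"), (16, 29, "response_rate")]
>>> corpus = TaggedCorpus(text, annotations, tokenizer=nlp.tokenizer)
>>> corpus.to_tokens()  # list of token strings
>>> corpus.to_iob_list(entity="drug")  # IOB tags for the "drug" entity only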
- omdenalore.natural_language_processing.ner.create_blank_nlp(train_data: pandas.Series) spacy.Language
Create a blank NLP model for training
- Parameters
train_data (pd.Series) – Series of training data
- Returns
Spacy model
- omdenalore.natural_language_processing.ner.doc2ents(doc: spacy.tokens.Doc) List[str]
Take a spaCy Doc and return a list of IOB token-based entities
- Parameters
doc (spacy.tokens.Doc) – Spacy document
- Returns
List of entity strings
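- Example
A sketch of the expected call pattern, assuming any spaCy pipeline with an NER component (the model name and text are illustrative):
from omdenalore.natural_language_processing.ner import doc2ents
>>> import spacy
>>> nlp = spacy.load("en_core_web_sm")
>>> doc = nlp("Apple is looking at buying a U.K. startup.")
>>> doc2ents(doc)  # one IOB string per token, e.g. "B-ORG", "O", ...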
- omdenalore.natural_language_processing.ner.generate_examples(texts: Iterable[str], entity_offsets: Iterable[List[Tuple[int, int, str]]], nlp: spacy.language.Language) List[spacy.training.Example]
Generate spacy Example objects for training
- Parameters
texts (Iterable[str]) – List of texts to generate examples from
entity_offsets (Iterable[Annotations]) – List of annotations for each text
nlp (spacy.language.Language) – Spacy model
- Returns
List of spacy Example objects
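- Example
A hedged sketch that chains create_blank_nlp and generate_examples; the text and offsets are illustrative:
from omdenalore.natural_language_processing.ner import create_blank_nlp, generate_examples
>>> import pandas as pd
>>> texts = pd.Series(["Aspirin reduced response rate in the trial."])
>>> entity_offsets = [[(0, 7, "drug")]]  # one annotation list per text
>>> nlp = create_blank_nlp(texts)
>>> examples = generate_examples(texts, entity_offsets, nlp)  # one Example per text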
- omdenalore.natural_language_processing.ner.get_entities(str: str) List[Tuple[int, int, str]]
Process annotations into the format (start, stop, label)
- Parameters
str (str) – String of annotations
- Returns
List of tuples of the form (start, stop, label)
- omdenalore.natural_language_processing.ner.rm_colliding(x: List[Tuple[int, int, str]]) List[Tuple[int, int, str]]
Remove any overlapping annotations (one token cannot have two entities)
- Parameters
x (List[Tuple[int, int, str]]) – List of annotations
- Returns
List of annotations with no overlapping annotations
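- Example
An illustrative call (the spans are made up; which of two colliding spans is kept is an implementation detail not specified here):
from omdenalore.natural_language_processing.ner import rm_colliding
>>> spans = [(0, 7, "drug"), (16, 29, "response_rate"), (20, 29, "rate")]  # last two overlap
>>> rm_colliding(spans)  # returns the list with the collision resolved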
- omdenalore.natural_language_processing.ner.rm_groups(a: List[Tuple[int, int, str]]) List[Tuple[int, int, str]]
Remove the group number info, e.g. g2_response_rate becomes response_rate
- Parameters
a (Annotations) – Annotations
- Returns
Annotations with group number removed
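- Example
An illustrative call based on the behaviour described above:
from omdenalore.natural_language_processing.ner import rm_groups
>>> rm_groups([(0, 7, "g2_response_rate")])  # label becomes "response_rate"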
omdenalore.natural_language_processing.postprocess_text module
- omdenalore.natural_language_processing.postprocess_text.adjust_tags(tags: List[str], tag_set: Optional[Set[str]] = None, length: int = 512) List[List[str]]
Keep only tags that were used in training and pad all token lists to the same length.
- Parameters
tags (List[str]) – List of tags
tag_set (Set[str]) – Set of tags to keep
length (int) – Length to pad or truncate the token lists to
- Returns
List of adjusted tag lists
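- Example
A hedged sketch; per the signature the result is a list of tag lists, and tags outside tag_set are assumed to be dropped or masked:
from omdenalore.natural_language_processing.postprocess_text import adjust_tags
>>> tags = ["O", "B-drug", "I-drug", "B-unseen"]
>>> adjust_tags(tags, tag_set={"O", "B-drug", "I-drug"}, length=8)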
- omdenalore.natural_language_processing.postprocess_text.df2jsonl(df: pandas.DataFrame, path: Optional[str] = None, pmid: str = 'pmid', text: str = 'text', start: str = 'start', stop: str = 'stop', entity: str = 'entity', entity_text: str = 'entity_text') Optional[str]
Convert a pandas dataframe to the output JSONL format
- Parameters
df (pd.DataFrame) – Dataframe with entity information
path (str) – Destination file path (if None, a string with the equivalent JSONL is returned)
pmid (str) – Column name for the PMID column
text (str) – Column name for the text column
start (str) – Column name for the start column
stop (str) – Column name for the stop column
entity (str) – Column name for the entity column
entity_text (str) – Column name for the entity_text column
- Returns
None if path is provided (the file is written to disk), or a JSONL string if path is None. The output JSONL has the shape {id: pmid, text: <abstract>, predictions: [{start: x, end: y, entity: x}, …]}
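- Example
A minimal sketch using the default column names from the signature (the row values are illustrative):
from omdenalore.natural_language_processing.postprocess_text import df2jsonl
>>> import pandas as pd
>>> df = pd.DataFrame({"pmid": ["12345"], "text": ["Aspirin reduced response rate."], "start": [0], "stop": [7], "entity": ["drug"], "entity_text": ["Aspirin"]})
>>> jsonl = df2jsonl(df)  # path=None, so the JSONL is returned as a string
>>> df2jsonl(df, path="predictions.jsonl")  # writes to disk and returns None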
- omdenalore.natural_language_processing.postprocess_text.doc2df(doc: spacy.tokens.doc.Doc, regex_pmid: str = 'PMID: (\\d+)') pandas.DataFrame
Get a dataframe from a spaCy Doc object.
- Parameters
doc (spacy.tokens.Doc) – A single spaCy Doc object with annotations (entities)
regex_pmid (str) – Raw string with the regex used to extract PMID numbers from the text (defaults to r"PMID: (\d+)")
- Returns
Pandas dataframe with columns [`pmid`, `text`, `start`, `stop`, `entity`, `entity_text`]
- omdenalore.natural_language_processing.postprocess_text.doc2df_wide(doc: spacy.tokens.doc.Doc, regex_pmid: str = 'PMID: (\\d+)') pandas.DataFrame
Get a wide dataframe from a spaCy Doc object.
- Parameters
doc (spacy.tokens.Doc) – A single spaCy Doc object with annotations (entities)
regex_pmid (str) – Raw string with the regex used to extract PMID numbers from the text (defaults to r"PMID: (\d+)")
- Returns
Pandas dataframe with one column for each of {`start`, `stop`, `entity_text`} per entity
- omdenalore.natural_language_processing.postprocess_text.list2iob(tags: List[str]) List[str]
Convert a list of string tags to a list of IOB tags.
- Parameters
tags (List[str]) – List of tags
- Returns
List of IOB tags
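- Example
A hedged sketch, assuming the input uses plain entity labels with "O" for non-entity tokens (the docstring does not pin down the input convention):
from omdenalore.natural_language_processing.postprocess_text import list2iob
>>> list2iob(["O", "drug", "drug", "O"])  # expected along the lines of ["O", "B-drug", "I-drug", "O"]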
omdenalore.natural_language_processing.preprocess_text module
- class omdenalore.natural_language_processing.preprocess_text.TextPreprocessor(pos_tags_removal=None, spacy_model=None, remove_numbers: bool = True, remove_special_chars: bool = True, remove_stopwords: bool = True, lemmatize: bool = True, tokenize: bool = False)
Bases:
object
Preprocess texts
- Parameters
pos_tags_removal (List[str]) – list of PoS tags to remove
spacy_model – spaCy model to be used, default is en_core_web_sm
remove_numbers (bool) – whether or not to remove numbers from the inputs
remove_special_chars (bool) – whether or not to remove the special characters
remove_stopwords (bool) – whether or not to remove stopwords
lemmatize (bool) – whether or not to lemmatize the inputs
tokenize (bool) – whether or not to tokenize the inputs
- Example
>>> model = TextPreprocessor.load_model()
>>> preprocessor = TextPreprocessor(spacy_model=model)
>>> sample = "Hello wROld! I hAvE $400 Dollars in my checking account and I am sleeping"
>>> clean = preprocessor.preprocess_text(sample)
>>> print(f"Sample: {sample} Cleaned: {clean}")
Sample: Hello wROld! I hAvE $400 Dollars in my checking account and I am sleeping
Cleaned: hello wrold dollar checking account sleep
>>> my_list = ["my name is $Yudhiesh", "that house is far AWAY", "the placement @RT"]
>>> cleaned_list = preprocessor.preprocess_text_lists(my_list)
>>> print(f"Sample: {my_list} Cleaned: {cleaned_list}")
Sample: ['my name is $Yudhiesh', 'that house is far AWAY', 'the placement @RT']
Cleaned: ['yudhiesh', 'house far away', 'placement rt']
- static download_spacy_model(model='en_core_web_sm')
Downloads a spaCy model to be used for preprocessing the text inputs
- static load_model(model='en_core_web_sm')
Loads and returns a spaCy model
- preprocess_text(text: str) str
Preprocesses a single string
- Parameters
text – text string to clean
- Returns
Preprocessed text string
- preprocess_text_lists(texts: List[str]) List[str]
Preprocesses a list of strings
- Parameters
texts – List of texts to preprocess
- Returns
List of strings
omdenalore.natural_language_processing.topic_modelling module
omdenalore.natural_language_processing.twitter module
omdenalore.natural_language_processing.utils module
- class omdenalore.natural_language_processing.utils.TextUtils
Bases:
object
Utility functions for handling text data
- static noise_removal(input_text: str, noise_list=['is', 'a', 'this', 'the', 'an']) str
Remove noise from the input text. This function can be used to remove common English words like "is", "a", "the". You can add your own noise words via the optional parameter noise_list (default: ["is", "a", "this", "the", "an"]).
- Parameters
input_text (str) – Input text to remove noise from
noise_list (list) – (Optional) List of words that you want to remove from the input text
- Returns
Clean text with no noise
- Example
from omdenalore.natural_language_processing.utils import TextUtils
>>> input = "Hello, the chicken crossed the street"
>>> TextUtils.noise_removal(input)
"Hello, chicken crossed the street"
- static remove_hashtags(text: str) str
Remove hashtags from the input text.
- Parameters
text (str) – Input text to remove hashtags from
- Returns
Output text without hashtags
- Example
from omdenalore.natural_language_processing.utils import TextUtils
>>> TextUtils.remove_hashtags("I love #hashtags")
"I love "
- static remove_regex_pattern(input_text: str, regex_pattern: str) str
Remove any matches of a regular expression from the input text
- Parameters
input_text (str) – Input text to remove regex from
regex_pattern (str) – String of regex that you want to remove from the input text
- Returns
Clean text with the regex matches removed
- Example
from omdenalore.natural_language_processing.utils import remove_regex_pattern
>>> regex_pattern = r"#[\w]*" >>> input = "remove this #hastag" >>> TextUtils.remove_regex_pattern(input, regex_pattern) "remove this hastag"
- static remove_url(text: str) str
Remove URLs from the input text.
- Parameters
text (str) – Input text to remove URLs from
- Returns
Text without URLs
- Example
from omdenalore.natural_language_processing.utils import TextUtils
>>> TextUtils.remove_url('I like urls http://www.google.com')
'I like urls '
- static standardize_words(input_text: str, lookup_dictionary: Dict[str, str]) Optional[str]
Replace acronyms, hashtags, colloquial slang, etc. with standard words
Text data often contains words or phrases which are not present in any standard lexical dictionaries. These pieces are not recognized by search engines and models.
- Parameters
input_text (str) – Input text to standardize
lookup_dictionary (dict) – Dictionary with slang as key and the standard word as value
- Returns
Text with standard words
- Example
from omdenalore.natural_language_processing.utils import TextUtils
>>> lookup_dict = {"rt": "Retweet", "dm": "direct message", "awsm": "awesome", "luv": "love"}
>>> TextUtils.standardize_words("rt I luv to dm, it's awsm", lookup_dict)
"Retweet I love to direct message, it's awesome"