Contractions in NLP

Out of the plethora of libraries and packages available for Natural Language Processing in Python, I found only one that assists with contractions, and it was insufficient. I won’t call it out by name, but it was useless in my endeavors. Because of this, and because of the wide array of contractions in the English language, I created my own process that can be used in any NLP task from here on out.

First, I gathered contractions by searching several websites for contractions and colloquialisms. Wikipedia defines colloquialism as follows: “colloquial language is the linguistic style used for casual communication. It is the most common functional style of speech, the idiom normally employed in conversation and other informal contexts. Colloquialism is characterized by wide usage of interjections and other expressive devices; it makes use of non-specialist terminology, and has a rapidly changing lexicon. It can also be distinguished by its usage of formulations with incomplete logical and syntactic ordering.”

When processing text-based data, keep in mind that colloquialisms are commonly used on social media sites like Facebook and Twitter, and a LOT of contractions have variations. I also added internet slang acronyms. I saved everything as a CSV file, which I imported as a dictionary assigned to ‘word_expansion_dict’.
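As a rough sketch of that import step, something like the following would work. The file name `word_expansions.csv`, the two-column layout (term, then expansion, with no header row), and the helper name `load_expansion_dict` are all my assumptions for illustration, so adjust them to match your own file.

```python
import csv


def load_expansion_dict(path):
    """Read a two-column CSV (term, expansion) into a dict keyed by lowercase term.

    Assumes no header row; rows with fewer than two columns are skipped.
    """
    expansions = {}
    with open(path, newline="", encoding="utf-8") as f:
        for row in csv.reader(f):
            if len(row) < 2:
                continue  # blank or malformed line
            term = row[0].strip().lower()
            expansion = row[1].strip()
            if term:
                expansions[term] = expansion
    return expansions


# Hypothetical usage (the file name is an assumption):
# word_expansion_dict = load_expansion_dict("word_expansions.csv")
```

Lowercasing the keys on load means lookups only need to lowercase the incoming token, which keeps the matching step simple.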

Currently it has 343 words or abbreviations with matching expansions. Standard contractions appear both with the apostrophe in the key and without it, for ease of use. I won’t include the entire text, but here is a snippet. I uploaded this as a gist on GitHub, along with other helpful word expansions, such as states and provinces with their postal codes or abbreviations.

Once I have this dictionary in my project, I can create a function to quickly knock these out before further exploration of the data. Analyzing the contractions themselves and how they are used in certain scenarios is also an option, but for this example, the dictionary is only used to expand them to their full forms.
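Here is a minimal sketch of what such a function could look like. The four dictionary entries are a tiny stand-in I made up for illustration, not the real 343-entry file, and the function name `expand_words` and the token pattern are my assumptions as well.

```python
import re

# Tiny illustrative stand-in for the full dictionary (keys are lowercase).
word_expansion_dict = {
    "can't": "cannot",
    "cant": "cannot",
    "i'm": "i am",
    "brb": "be right back",
}

# A token is a run of letters, optionally joined by internal apostrophes.
TOKEN_RE = re.compile(r"[A-Za-z]+(?:'[A-Za-z]+)*")


def expand_words(text, expansions=word_expansion_dict):
    """Replace every token found in `expansions` with its expansion."""
    # Normalize curly apostrophes so "can’t" matches the "can't" key.
    text = text.replace("\u2019", "'")

    def replace(match):
        token = match.group(0)
        # Case-insensitive lookup; unknown tokens are returned unchanged.
        return expansions.get(token.lower(), token)

    return TOKEN_RE.sub(replace, text)


print(expand_words("I can’t talk, brb"))
```

Note that the expansion comes back in whatever case the dictionary stores, so “I’m” becomes “i am” here. If you lowercase everything later in your cleaning anyway, that doesn’t matter. Also be careful with apostrophe-free keys: if the dictionary contains something like “ill” or “well”, ordinary words will get expanded too.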

This can be included in a larger pipeline to clean data; in my case, I used it on tweets.
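To give an idea of how the expansion step might slot into a tweet-cleaning pipeline, here is a hedged sketch. The other steps (dropping URLs and @mentions, collapsing whitespace) and the name `clean_tweet` are assumptions on my part, not a record of my exact pipeline, and the expansion function is passed in as a parameter so any implementation can be plugged in.

```python
import re

URL_RE = re.compile(r"https?://\S+")
MENTION_RE = re.compile(r"@\w+")


def clean_tweet(text, expand):
    """Run a few illustrative cleaning steps, with word expansion as one stage.

    `expand` is any function that takes a string and returns the expanded string.
    """
    text = URL_RE.sub("", text)      # drop links
    text = MENTION_RE.sub("", text)  # drop @mentions
    text = expand(text)              # expand contractions and slang
    return " ".join(text.split())    # collapse leftover whitespace


# Stand-in expander for the demo; swap in your real expansion function.
demo_expand = lambda t: t.replace("brb", "be right back")
print(clean_tweet("@sam brb https://t.co/x ok", demo_expand))
```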

This process quickly
