#ExpandHashtags for NLP

Expand hashtags written with uppercase and lowercase letters for Natural Language Processing


The infamous hashtag can present issues when processing text data for NLP. The annoyances begin with “those” people who overuse hashtags on a regular basis. They continue when millennials use them in casual conversation, or when the entire purpose is lost, such as hashtags on Facebook (cringe). Hashtags are metadata tags, always preceded by the hash symbol (#) and always user generated, so don’t fault the hashtag concept; fault the user who applies it in unnecessary ways. On platforms like Twitter or Instagram, ‘hashtagging’ is a useful tagging system that lets users filter topics by tag, at least when users use it correctly.

Now, to expand a hashtag into separate words this way, the user needs to have capitalized the first character of each word and written the rest of the word in lowercase, as in #ExpandHashtags. This style is known as ‘camel case’ (strictly speaking, upper camel case or Pascal case, since the first word is capitalized too). Start by importing the following:

# regular expressions aka regex 
import re
# inflect engine used to expand numbers to words (third-party package: pip install inflect)
import inflect

To begin, create two empty lists: one for the hashtags as they are pulled from the text, and one for their expanded forms, which regex will produce by locating the boundaries between lowercase and uppercase characters in each hashtag.

hashtag_list = []
hashtag_exp_list = []

The first cleaning step removes any incorrectly formatted hashtag that may (for some ungodly reason) contain a URL. Note that the pattern drops the entire whitespace-delimited token, not just the URL portion:

text_sans_url = re.sub(r"\S*https?:\S*", r"", text)

Next, remove any character that is not a letter, digit, underscore, or whitespace, which essentially strips punctuation, with the exception of the hash symbol:

text_sans_punct = re.sub(r"[^\w\s#]", r"", text_sans_url)

Then restrict the text to ASCII by removing every character outside the range U+0000 to U+007F, such as accented letters or any other non-ASCII characters that survived the previous step:

text_unicode = re.sub('[^\u0000-\u007f]', '', text_sans_punct)

Since the underscore counts as a word character (\w), the previous steps leave it in place, so this line removes it:

new_text_ = re.sub('_', '', text_unicode)

Combining the above process into a function to denoise the text:

def denoise_hashtag_text(text):
    text_sans_url = re.sub(r"\S*https?:\S*", r"", text)
    text_sans_punct = re.sub(r"[^\w\s#]", r"", text_sans_url)
    text_unicode = re.sub('[^\u0000-\u007f]', '', text_sans_punct)
    new_text_ = re.sub('_', '', text_unicode)
    return new_text_
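To see what each cleaning step contributes, here is a minimal sketch that applies the same four substitutions but returns every intermediate result. The helper name denoise_stages and the sample tokens are made up for illustration and are not part of the original code.

```python
import re

def denoise_stages(text):
    """Return the intermediate result of each cleaning step, in order."""
    sans_url = re.sub(r"\S*https?:\S*", "", text)             # drop tokens containing a URL
    sans_punct = re.sub(r"[^\w\s#]", "", sans_url)            # drop punctuation and symbols, keep '#'
    ascii_only = re.sub(r"[^\u0000-\u007f]", "", sans_punct)  # drop non-ASCII characters
    sans_underscore = re.sub(r"_", "", ascii_only)            # drop underscores
    return [sans_url, sans_punct, ascii_only, sans_underscore]

# Hypothetical sample tokens; \u00e9 is an accented e, \u2615 is a coffee-cup symbol.
samples = ["#Throwback_Thursday!", "#Caf\u00e9Time\u2615", "#Newshttps://example.com"]
for token in samples:
    print(repr(token), "->", denoise_stages(token))
```

Two things stand out when running this. The symbol is already removed by the punctuation step, while the accented letter counts as a word character and only disappears in the ASCII step. And a hashtag glued to a URL is removed entirely, leaving an empty string, so it may be worth skipping empty results before storing them.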

Then the following function uses the inflect package to expand numbers from digits to words. Note that it loops over whatever it receives, so a list is processed item by item and a string character by character:

def replace_numbers(text):
    digit_to_word = inflect.engine()
    new_text = []
    for word in text:
        if word.isdigit():
            new_word = digit_to_word.number_to_words(word)
            new_text.append(new_word)
        else:
            new_text.append(word)
    return new_text
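When the function above is called on a hashtag string, as it is later in this post, each digit is converted separately, so 10 comes out as "onezero" rather than "ten". If whole numbers are preferred, one option is to replace each run of digits in a single step. The sketch below is an assumption-laden illustration: the name expand_digit_runs, the tiny lookup table standing in for a real converter, and the choice to capitalize the result (so that a later camel-case split sees a word boundary) are all mine, not the original author's.

```python
import re

# Stand-in converter so the sketch runs without third-party packages.
# It only knows a handful of numbers and is NOT part of inflect.
_DEMO_WORDS = {"1": "one", "2": "two", "4": "four", "10": "ten"}

def expand_digit_runs(text, number_to_words=_DEMO_WORDS.get):
    """Replace each run of consecutive digits with its capitalized word form."""
    def _swap(match):
        words = number_to_words(match.group(0))
        # Leave the digits untouched if the converter has no answer.
        return words.capitalize() if words else match.group(0)
    return re.sub(r"\d+", _swap, text)

print(expand_digit_runs("#Top10Songs"))
print(expand_digit_runs("#Class99"))
```

With inflect installed, inflect.engine().number_to_words can be passed as the second argument in place of the stand-in table. Its output for some numbers contains hyphens or commas, which would need another cleaning pass.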

Then, using a for loop, iterate over the tweets (here text is a list of strings, one per tweet), split() each one on whitespace, and if a token starts with a hash symbol, run the above functions on it to remove extraneous noise.

for tweet in text:
    for x in tweet.split():
        if x.startswith('#'):
            clean_text = denoise_hashtag_text(x)
            cleaner_text = replace_numbers(clean_text)
            hashtag_list.append(''.join(cleaner_text))
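The extraction part of that loop can be tried on its own. This is a small sketch with made-up tweets and a hypothetical helper name, extract_hashtag_tokens, showing which tokens the startswith test picks up before any cleaning happens.

```python
def extract_hashtag_tokens(tweets):
    """Return the whitespace-delimited tokens that begin with '#'."""
    return [
        token
        for tweet in tweets
        for token in tweet.split()
        if token.startswith("#")
    ]

# Hypothetical input: a list of strings, one per tweet.
tweets = [
    "Trying the #ExpandHashtags trick for #NLP!",
    "no tags in this one",
]
print(extract_hashtag_tokens(tweets))
```

The trailing exclamation mark is still attached to the second hashtag at this point, which is why the denoising step runs on each token afterwards. Passing a single string instead of a list would make the outer loop walk over individual characters, so wrap a lone tweet in a list first.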

Next up… the headache truly begins with formatting regex…


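Now the hashtag text is cleaned up a bit, and the hashtag strings are appended to the hashtag list. The next step is to replace the hash symbol with a space and then use regex to insert a space at each word boundary. The pattern has two alternatives joined by |. Both start with a negative lookbehind, (?<!\A), which asserts that the match is not at the very beginning of the string. The first alternative, (?<=[a-z])[A-Z], uses a positive lookbehind to match an uppercase character that directly follows a lowercase character. The second, (?=[A-Z])[a-z+], was meant to cover the other arrangement of uppercase and lowercase characters, but as written it can never match: the lookahead requires the next character to be uppercase, while [a-z+] requires that same character to be lowercase or a plus sign. In practice, then, only the first alternative does any work, and the replacement r' \1' puts a space in front of each uppercase character it finds.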

def camel_case_split(text):
    text = re.sub('#', ' ', text)
    exp_hashtags = re.sub(r'((?<!\A)(?<=[a-z])[A-Z]|(?<!\A)(?=[A-Z])[a-z+])', r' \1', text)
    return exp_hashtags
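Because only the lowercase-to-uppercase rule ever fires in the pattern above, a hashtag that starts with an acronym, such as #NLPModels, stays in one piece. A minimal sketch of one way to handle that case as well is shown below. The function name split_camel_case and the boundary pattern are my own illustration, not the original author's regex; it matches zero-width positions instead of capturing a character.

```python
import re

# The pattern from this post, kept here only for comparison.
ORIGINAL = r'((?<!\A)(?<=[a-z])[A-Z]|(?<!\A)(?=[A-Z])[a-z+])'

def split_camel_case(hashtag):
    """Insert spaces at word boundaries in a camel-case hashtag."""
    text = hashtag.lstrip("#")
    # Boundary 1: between a lowercase letter and an uppercase letter.
    # Boundary 2: between an uppercase letter and an uppercase letter
    #             that is itself followed by a lowercase letter.
    boundary = r"(?<=[a-z])(?=[A-Z])|(?<=[A-Z])(?=[A-Z][a-z])"
    return re.sub(boundary, " ", text)

for tag in ["#ExpandHashtags", "#NLPModels", "#nlp"]:
    before = re.sub(ORIGINAL, r" \1", tag.lstrip("#"))
    print(tag, "| original:", repr(before), "| sketch:", repr(split_camel_case(tag)))
```

Neither version can do anything with all-lowercase or all-uppercase hashtags, since there is no case change to mark where one word ends and the next begins.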

Then another for loop on the previously created hashtag_list, appending the expanded hashtags to the hashtag_exp_list:

for hashtag in hashtag_list:
    exp_hashtag = camel_case_split(hashtag)
    strip_hash = exp_hashtag.strip()
    hashtag_exp_list.append(strip_hash)

These steps can be combined as the following function:

def expand_hashtags(text):
    hashtag_list = []
    hashtag_exp_list = []

    def camel_case_split(text):
        text = re.sub('#', ' ', text)
        # insert a space before each uppercase letter that follows a lowercase letter (positive lookbehind);
        # the second alternative cannot match as written, so only the first one takes effect
        exp_hashtags = re.sub(r'((?<!\A)(?<=[a-z])[A-Z]|(?<!\A)(?=[A-Z])[a-z+])', r' \1', text)
        return exp_hashtags
        
    def denoise_hashtag_text(text):
        text_sans_url = re.sub(r"\S*https?:\S*",  r"", text)
        text_sans_punct = re.sub(r"[^\w\s#]",  r"", text_sans_url)
        text_unicode = re.sub('[^\u0000-\u007f]', '',  text_sans_punct)
        new_text_ = re.sub('_', '',  text_unicode)
        return new_text_
    
    def replace_numbers(tokens):
        digit_to_word = inflect.engine()
        new_tokens = []
        for word in tokens:
            if word.isdigit():
                new_word = digit_to_word.number_to_words(word)
                new_tokens.append(new_word)
            else:
                new_tokens.append(word)
        return new_tokens
    
    for tweet in text:
        for x in tweet.split():
            if x.startswith('#'):
                clean_text = denoise_hashtag_text(x)
                cleaner_text = replace_numbers(clean_text)
                hashtag_list.append(''.join(cleaner_text))
                
    for hashtag in hashtag_list: 
        exp_hashtag = camel_case_split(hashtag)
        strip_hash = exp_hashtag.strip()
        hashtag_exp_list.append(strip_hash)
        
    return dict(zip(hashtag_list, hashtag_exp_list))

Correctly formatting regex can be a trial-and-error pursuit; I tried and failed many times before getting all of this to run the way I intended.

