Natural language processing (NLP) combines computer science and linguistics to understand and process the relationships within human language. Words, characters, sentences, documents, and punctuation all play a role in how humans understand language, and by analyzing these factors, computers can also learn to understand how humans communicate.
In the data science and machine learning field, preprocessing is often said to take 80-90% of the time spent on a project, and thorough preprocessing can make or break a model.
I am going to use the “Disaster Tweets” dataset from Kaggle, which can be found here. There are many packages that make this work easier, such as spaCy and NLTK; for this example, I will use regular expressions and NLTK.
Start by loading the csv file with the following line of code:
import re
import unicodedata

import inflect
import pandas as pd
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer, WordNetLemmatizer
from nltk.tokenize import TweetTokenizer

df = pd.read_csv('train.csv')
For now, I am only going to focus on the “text” column, which contains the tweet text. There are non-ASCII characters, digits, mixed upper- and lowercase, hashtags, and handles to deal with, and that’s just the beginning. First, I add a single whitespace at the start of each tweet, which will be explained later. Next, I remove the duplicate tweets from the dataset.
df['text'] = " " + df.text
df.drop_duplicates(subset=['text'], inplace=True)
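To see why the padding matters: the dictionary keys in the next section each begin with a space, so a contraction at the very start of a tweet would have no preceding space to match on. A minimal illustration (the tweet text here is made up):

```python
# dictionary keys like " im" start with a space, so a tweet that
# *begins* with a contraction has no preceding space to match on
tweet = "im at the evacuation site"
print(" im" in tweet)        # False: no match at the start of the string
print(" im" in " " + tweet)  # True: the padded copy matches
```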
The reason I added the whitespace at the head of each tweet is so I can expand contractions and abbreviations using a dictionary I created for this purpose. The dictionary appears as follows:
contractions_dict = {
" aint": " are not",
" arent": " are not",
" cant": " can not",
" cause": " because",
" couldve": " could have",
" couldnt": " could not",
" didnt": " did not",
" doesnt": " does not",
" dont": " do not",
" hadnt": " had not",
" hasnt": " has not",
" havent": " have not",
" hed": " he would",
" hes": " he is",
" howd": " how did",
" howdy": " how do you",
" howll": " how will",
" hows": " how is",
" id": " i would",
" ida": " i would have",
" im": " i am",
" ive": " i have",
" isnt": " is not",
" itd": " it had",
" itll": " it will",
" its": " it is",
" lets": " let us",
" maam": " madam",
" mightve": " might have",
" mighta": " might have",
" mightnt": " might not",
" mustve": " must have",
" musta": " must have",
" mustnt": " must not",
" neednt": " need not",
" oclock": " of the clock",
" shes": " she is",
" shoulda": " should have",
" shouldve": " should have",
" shouldnt": " should not",
" so'd": " so did",
" thatd": " that would",
" thats": " that is",
" thered": " there had",
" theres": " there is",
" theyd": " they would",
" theyda": " they would have",
" theyll": " they will",
" theyre": " they are",
" theyve": " they have",
" wasnt": " was not",
" weve": " we have",
" werent": " were not",
" whatll": " what will",
" whatllve": " what will have",
" whatre": " what are",
" whats": " what is",
" whatve": " what have",
" whens": " when is",
" whenve": " when have",
" whered": " where did",
" whers": " where is",
" whereve": " where have",
" wholl": " who will",
" whollve": " who will have",
" whos": " who is",
" whove": " who have",
" whys": " why is",
" whyve": " why have",
" willve": " will have",
" wont": " will not",
" wontve": " will not have",
" wouldve": " would have",
" wouldnt": " would not",
" wouldntve": " would not have",
" yall": " you all",
" yalls": " you alls",
" yalld": " you all would",
" yalldve": " you all would have",
" yallre": " you all are",
" yallve": " you all have",
" youd": " you had",
" youda": " you would have",
" youdve": " you would have",
" youll": " you you will",
" youllve": " you you will have",
" youre": " you are",
" youve": " you have",
" ain't": " are not",
" aren't": " are not",
" can't": " can not",
" can't've": " can not have",
" 'cause": " because",
" bc": " because",
" b/c": " because",
" could've": " could have",
" couldn't": " could not",
" couldn't've": " could not have",
" didn't": " did not",
" doesn't": " does not",
" don't": " do not",
" hadn't": " had not",
" hadn't've": " had not have",
" hasn't": " has not",
" haven't": " have not",
" he'd": " he would",
" he'd've": " he would have",
" he'll": " he will",
" he'll've": " he will have",
" he's": " he is",
" how'd": " how did",
" how'd'y": " how do you",
" how'll": " how will",
" how's": " how is",
" i'd": " i would",
" i'd've": " i would have",
" i'll": " i will",
" i'll've": " i will have",
" i'm": " i am",
" i've": " i have",
" isn't": " is not",
" it'd": " it had",
" it'd've": " it would have",
" it'll": " it will",
" it'll've": " it will have",
" it's": " it is",
" let's": " let us",
" ma'am": " madam",
" mayn't": " may not",
" might've": " might have",
" mightn't": " might not",
" mightn't've": " might not have",
" must've": " must have",
" mustn't": " must not",
" mustn't've": " must not have",
" needn't": " need not",
" needn't've": " need not have",
" o'clock": " of the clock",
" oughtn't": " ought not",
" oughtn't've": " ought not have",
" shan't": " shall not",
" sha'n't": " shall not",
" shan't've": " shall not have",
" she'd": " she would",
" she'd've": " she would have",
" she'll": " she will",
" she'll've": " she will have",
" she's": " she is",
" should've": " should have",
" shouldn't": " should not",
" shouldn't've": " should not have",
" so've": " so have",
" so's": " so is",
" that'd": " that would",
" that'd've": " that would have",
" that's": " that is",
" there'd": " there had",
" there'd've": " there would have",
" there's": " there is",
" they'd": " they would",
" they'd've": " they would have",
" they'll": " they will",
" they'll've": " they will have",
" they're": " they are",
" they've": " they have",
" to've": " to have",
" wasn't": " was not",
" we'd": " we had",
" we'd've": " we would have",
" we'll": " we will",
" we'll've": " we will have",
" we're": " we are",
" we've": " we have",
" weren't": " were not",
" what'll": " what will",
" what'll've": " what will have",
" what're": " what are",
" what's": " what is",
" what've": " what have",
" when's": " when is",
" when've": " when have",
" where'd": " where did",
" where's": " where is",
" where've": " where have",
" who'll": " who will",
" who'll've": " who will have",
" who's": " who is",
" who've": " who have",
" why's": " why is",
" why've": " why have",
" will've": " will have",
" won't": " will not",
" won't've": " will not have",
" would've": " would have",
" wouldn't": " would not",
" wouldn't've": " would not have",
" y'all": " you all",
" y'alls": " you alls",
" y'all'd": " you all would",
" y'all'd've": " you all would have",
" y'all're": " you all are",
" y'all've": " you all have",
" you'd": " you had",
" you'da": " you would have",
" you'd've": " you would have",
" you'll": " you you will",
" you'll've": " you you will have",
" you're": " you are",
" you've": " you have",
" hwy": " highway",
" fvck": " fuck",
" im": " i am",
" rt": " retweet",
" fyi": " for your information",
" omw": " on my way",
" 1st": " first",
" 2nd": " second",
" 3rd": " third",
" 4th": " fourth",
" u ": " you ",
" r ": " are ",
}
There are other options. For instance, removing the punctuation before processing the contractions would allow for a shorter dictionary; however, when punctuation is removed, some contractions become entirely different words. An example is “she’ll”, which would become “shell”, and those two words have completely different meanings. So I expand contractions before cleaning the punctuation from the text, so that word meaning is not lost — preserving meaning is, after all, the entire point of NLP. In the following function, I use regex to compile the dictionary keys into a single pattern, match each contraction to its corresponding expansion, and replace it using regex’s ‘sub’ function.
c_re = re.compile('|'.join('(%s)' % k for k in contractions_dict.keys()))

def expand_contractions(text, c_re=c_re):
    def replace(match):
        # the dictionary values already carry a leading space
        return contractions_dict[match.group(0)]
    return c_re.sub(replace, text.lower())
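As a quick sanity check, here is the function run against a trimmed-down, two-entry version of the dictionary (the full dictionary above works the same way):

```python
import re

# a two-entry stand-in for the full contractions_dict above
contractions_dict = {
    " she'll": " she will",
    " won't": " will not",
}

c_re = re.compile('|'.join('(%s)' % k for k in contractions_dict.keys()))

def expand_contractions(text, c_re=c_re):
    def replace(match):
        return contractions_dict[match.group(0)]
    return c_re.sub(replace, text.lower())

print(expand_contractions(" She'll be fine, but he won't go"))
# -> " she will be fine, but he will not go"
```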
Next, I will remove the ‘noise’ from the text. Noise includes emojis, punctuation, and URLs. I am also running the contraction expansion function in this block of self-explanatory code:
def denoise_text(text):
    # remove urls
    new_text = re.sub(r"\S*https?:\S*", "", text.lower())
    # expand contractions before stripping punctuation
    new_text_contractions = expand_contractions(new_text)
    # remove punctuation, keeping handles (@) and hashtags (#)
    new_text_punct = re.sub(r"[^\w\s@#]", "", new_text_contractions)
    # drop non-ascii characters
    new_text_ascii = re.sub(r"[^\u0000-\u007f]", "", new_text_punct)
    return new_text_ascii.strip()
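Run step by step on a made-up tweet (with an invented link and handle), the regex substitutions behave like this; the contraction-expansion step is omitted so the snippet stands alone:

```python
import re

tweet = " Huge blaze near the café!! see https://t.co/abc123 #fire @user"
lowered = tweet.lower()
no_urls = re.sub(r"\S*https?:\S*", "", lowered)         # strips the link
no_punct = re.sub(r"[^\w\s@#]", "", no_urls)            # strips "!!", keeps @ and #
ascii_only = re.sub(r"[^\u0000-\u007f]", "", no_punct)  # strips the accented character
print(ascii_only.strip())
```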
After denoising the tweets, I will tokenize and lemmatize the text using NLTK’s TweetTokenizer() and WordNetLemmatizer(). With TweetTokenizer(), NLTK provides a quick and accessible way to remove the Twitter handles and reduce the length of exaggerated words. Reduce_len is an optional argument, but I use it because of the way people use language in tweets: for example, both “waaaaay” and “waaaay” become “waaay”, providing a uniform baseline for exaggerated words.
def lemmatize_text(text):
    tokenizer = TweetTokenizer(strip_handles=True, reduce_len=True)
    lemmatizer = WordNetLemmatizer()
    return [lemmatizer.lemmatize(w, pos='v') for w in tokenizer.tokenize(text)]
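The reduce_len behavior itself happens inside TweetTokenizer, but it can be sketched with a plain regex that caps any run of repeated characters at three. This is an approximation for illustration, not NLTK’s implementation:

```python
import re

def reduce_len(word):
    # collapse runs of 3 or more identical characters down to exactly 3
    return re.sub(r"(.)\1{2,}", r"\1\1\1", word)

print(reduce_len("waaaaay"))  # -> "waaay"
print(reduce_len("waaaay"))   # -> "waaay"
print(reduce_len("way"))      # -> "way"
```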
There are several instances of digits in the tweets, and I am going to convert these digits to words using inflect. Instructions for installation can be found here.
def replace_numbers(tokens):
    # replace integer tokens with their word equivalents
    dig2word = inflect.engine()
    new_tokens = []
    for word in tokens:
        if word.isdigit():
            new_tokens.append(dig2word.number_to_words(word))
        else:
            new_tokens.append(word)
    return new_tokens
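inflect’s number_to_words() handles arbitrary integers; the token loop itself can be verified in isolation with a hypothetical two-entry lookup standing in for inflect:

```python
# hypothetical stand-in for inflect.engine().number_to_words(),
# covering only the digits this example needs
to_words = {"2": "two", "7": "seven"}

def replace_numbers(tokens):
    # swap digit tokens for their word forms, pass everything else through
    return [to_words.get(w, w) if w.isdigit() else w for w in tokens]

print(replace_numbers(["magnitude", "7", "quake"]))
# -> ['magnitude', 'seven', 'quake']
```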
I will also remove the non-ASCII characters from the tokens like so:
def remove_non_ascii(tokens):
    # normalize each token and drop any non-ascii characters
    new_tokens = []
    for word in tokens:
        new_token = unicodedata.normalize('NFKD', word).encode('ascii', 'ignore').decode('utf-8', 'ignore')
        new_tokens.append(new_token)
    return new_tokens
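Because this step relies only on the standard library, it is easy to check: NFKD decomposes accented characters into a base letter plus a combining mark, and the ASCII encode/decode round trip then drops the mark:

```python
import unicodedata

def remove_non_ascii(tokens):
    # decompose accents with NFKD, then drop anything outside ascii
    return [unicodedata.normalize('NFKD', w).encode('ascii', 'ignore').decode('utf-8', 'ignore')
            for w in tokens]

print(remove_non_ascii(["café", "naïve", "fire"]))
# -> ['cafe', 'naive', 'fire']
```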
I will be removing the “stopwords” from the tweets as well, using NLTK’s stopwords list, which is available in multiple languages. Stopwords are common words that occur too frequently to carry much information.
def remove_stopwords(tokens):
    stop_list = stopwords.words('english')
    new_tokens = []
    for word in tokens:
        if word not in stop_list:
            new_tokens.append(word)
    return new_tokens
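The filtering itself is just a membership test, shown here with a small illustrative subset in place of NLTK’s English list:

```python
# a small illustrative subset of nltk's english stopword list
stop_list = {"the", "is", "on", "a", "of"}

def remove_stopwords(tokens):
    # keep only the tokens that are not stopwords
    return [w for w in tokens if w not in stop_list]

print(remove_stopwords(["the", "forest", "is", "on", "fire"]))
# -> ['forest', 'fire']
```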
Next, I will create a wrapper to normalize the tokens combining the processing methods.
def norm_text(tokens):
    words = replace_numbers(tokens)
    tokens = remove_stopwords(words)
    return tokens
I am able to join all of the above processing techniques in the following function:
def process_text(text):
    clean_text = denoise_text(text)
    lem_text = lemmatize_text(clean_text)
    text = ' '.join(norm_text(lem_text))
    text = re.sub(r"-", " ", text)
    return text
To make these easy to use, I combine everything into a single function and run it on the “text” dataframe column.
def tweet_preprocess(df):
    """
    Combine regex and NLTK processing for tweet text.
    Includes the contractions dictionary and a stemming option:
    to stem instead, swap the lemmatizing tokenizer for the
    stemming function in process_text.
    """
    # contractions_dict is identical to the dictionary defined above,
    # so it is omitted here for brevity
    c_re = re.compile('|'.join('(%s)' % k for k in contractions_dict.keys()))

    def expand_contractions(text, c_re=c_re):
        def replace(match):
            return contractions_dict[match.group(0)]
        return c_re.sub(replace, text.lower())
    # expand contractions, remove urls and characters before tokenization
    def denoise_text(text):
        new_text = re.sub(r"\S*https?:\S*", "", text.lower())
        new_text_contractions = expand_contractions(new_text)
        new_text_punct = re.sub(r"[^\w\s@#]", "", new_text_contractions)
        new_text_ascii = re.sub(r"[^\u0000-\u007f]", "", new_text_punct)
        strip_text = new_text_ascii.strip()
        # this version also removes hashtags entirely
        return re.sub(r"#\w+", "", strip_text)
    # tokenization & lemmatization function returns tokens
    def lemmatize_text(text):
        tokenizer = TweetTokenizer(strip_handles=True, reduce_len=True)
        lemmatizer = WordNetLemmatizer()
        return [lemmatizer.lemmatize(w, pos='v') for w in tokenizer.tokenize(text)]

    # tokenization & stemming function returns tokens
    def stem_text(text):
        tokenizer = TweetTokenizer(strip_handles=True, reduce_len=True)
        stemmer = PorterStemmer()
        return [stemmer.stem(w) for w in tokenizer.tokenize(text)]
    def replace_numbers(tokens):
        # replace integer tokens with their word equivalents
        dig2word = inflect.engine()
        new_tokens = []
        for word in tokens:
            if word.isdigit():
                new_tokens.append(dig2word.number_to_words(word))
            else:
                new_tokens.append(word)
        return new_tokens

    def remove_non_ascii(tokens):
        # normalize tokens and drop any non-ascii characters
        new_tokens = []
        for word in tokens:
            new_token = unicodedata.normalize('NFKD', word).encode('ascii', 'ignore').decode('utf-8', 'ignore')
            new_tokens.append(new_token)
        return new_tokens

    # remove stopwords
    def remove_stopwords(tokens):
        stop_list = stopwords.words('english')
        new_tokens = []
        for word in tokens:
            if word not in stop_list:
                new_tokens.append(word)
        return new_tokens
    def norm_text(tokens):
        words = replace_numbers(tokens)
        tokens = remove_stopwords(words)
        return tokens

    def process_text(text):
        clean_text = denoise_text(text)
        lem_text = lemmatize_text(clean_text)
        text = ' '.join(norm_text(lem_text))
        text = re.sub(r"-", " ", text)
        return text

    return [process_text(x) for x in df]
To process the text:
df['tweets'] = tweet_preprocess(df.text)
This is just the initial preprocessing of the tweets, cleaning up the noise and extras that are not needed and would otherwise muddy the data when modeling.