Bigrams from Word2Vec


Once the text has been scrubbed, tokenized, and stemmed, there is additional information worth extracting. Bigrams are pairs of words that recur in the same order throughout a dataset, which can be a general corpus, like Gensim's Text8Corpus or a built-in NLTK corpus.

When I model my data, I will use a binary classification method in a one-vs-rest analysis, with word vectors created from the data I obtain in this process. Using Gensim's Phrases tool (part of the same package as Word2Vec), I will extract bigrams from the text based on the number of times each word combination occurs and a threshold, which is defined in the Gensim documentation as:

"threshold (float, optional) – Represent a score threshold for forming the phrases (higher means fewer phrases). A phrase of words a followed by b is accepted if the score of the phrase is greater than threshold. Heavily depends on concrete scoring-function, see the scoring parameter."

Increasing the threshold decreases the number of phrases returned from the data.

I currently have my text in a Pandas DataFrame; it has been cleaned, tokenized, and lemmatized according to part-of-speech tagging using NLTK. I simply assign it to 'texts', in string format.
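The original code for this step is not shown, but the idea can be sketched as follows. The column name "tokens" and the sample rows are assumptions for illustration:

```python
import pandas as pd

# Hypothetical DataFrame of tweets already cleaned, tokenized, and lemmatized
df = pd.DataFrame({"tokens": [["new", "york", "city"],
                              ["machine", "learning", "model"]]})

# Join each row's token list back into a single string and assign to 'texts'
df["texts"] = df["tokens"].apply(" ".join)
print(df["texts"].tolist())  # ['new york city', 'machine learning model']
```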

Now I split the above text into documents, each document being a single tweet in this case.
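A minimal sketch of that split, assuming 'texts' holds one cleaned string per tweet (the sample tweets here are made up):

```python
# texts: one cleaned string per tweet (hypothetical sample data)
texts = ["new york city is big", "i love new york"]

# Each inner list of tokens is one document, i.e. one tweet
docs = [tweet.split() for tweet in texts]
print(docs)  # [['new', 'york', 'city', 'is', 'big'], ['i', 'love', 'new', 'york']]
```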

Here, I create the Phrases instance, using a min_count of 5, meaning that any word combination returned must occur at least 5 times in the dataset.

I create an empty list, called "bigrams", to hold the text with bigrams substituted for the original word combinations, then iterate through the list of tweets created earlier, called "doc", applying the sentence_to_bigrams function from above to each row of the data.

This can now be used to create word vectors to build and train the model.
