The dataset was cleaned of unwanted elements such as usernames, hashtags, and links. After removing these, every tweet was tokenized with NLTK's Tweet tokenizer, converting it into a list of strings. We then used the LabeledSentence method available in the Doc2Vec class to convert the sentences into sentence objects with individual tokens; tokens are instances of a sequence of characters in a document, grouped together as a semantic unit for processing [1]. LabeledSentence creates an instance of TaggedDocument for this purpose: it aggregates all the words of a sentence into an object attached to a unique tag, so each sentence becomes a unique pair of a list of strings and a tag. Illustrative sketches of the individual steps of this pipeline are given at the end of the section.

Because of Twitter's 140-character limit, we did not remove stop words (prepositions, articles, etc.), on the premise that they contribute to the semantic structure of a tweet; it is vital for the word2vec model to capture the context between these words. Word embeddings were then trained for all the words in the tweets with the Word2Vec implementation in gensim, which produced vectors of dimension 300.

After this step each word had a vector representation, but our inputs to the classifier are whole tweets, so a method had to be chosen to represent an entire tweet as a single vector. At this point we had two options. The obvious and widely used one is to add the vectors of the words in a tweet and average them by the length of the tweet. However, a tweet may consist largely of articles and prepositions that contribute little to its context. We therefore calculated the TF-IDF value of every word. The TF-IDF score gives the importance of each word in a given document of a corpus. TF, the term frequency, measures how often a specific term occurs in a document; to avoid favoring terms that occur many times simply because a document is long, TF is divided by the number of words in the document, i.e. the length of the document [2]. The TF calculation treats all words in the document as equally important, so the IDF, the inverse document frequency, is calculated as the log of the ratio of the total number of documents to the number of documents containing the specified term [2]. This inverse proportion gives a higher value to rarely occurring terms and a lower value to frequent ones; the product of the two is the TF-IDF score [2].

This led us to the more effective way of vectorizing tweets: each word vector is multiplied by the word's TF-IDF score, used as its weight, the weighted vectors of all the words composing the tweet are summed, and the sum is divided by the length of the tweet. This weighting was particularly useful in our experiment because, owing to the character limit of tweets, we had not removed stop words such as 'as', 'is', and 'the'.

Once vectorization was done, the class imbalance in the dataset had to be resolved: random positive samples were oversampled until they matched the number of negative samples. All the averaged vectors were then scaled using scikit-learn's scale method, provided as part of its preprocessing library.
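The following is a minimal sketch of the cleaning, tokenization, and tagging steps. The regular expressions used to strip usernames, hashtags, and links, as well as the variable names and sample tweets, are our own illustrative assumptions; only the use of NLTK's TweetTokenizer and gensim's TaggedDocument (the class that LabeledSentence instantiates in older gensim releases) follows from the text.

```python
import re
from nltk.tokenize import TweetTokenizer
from gensim.models.doc2vec import TaggedDocument

tokenizer = TweetTokenizer()

def clean(tweet):
    """Strip usernames, hashtags and links (illustrative patterns)."""
    tweet = re.sub(r"@\w+", "", tweet)           # usernames
    tweet = re.sub(r"#\w+", "", tweet)           # hashtags
    tweet = re.sub(r"https?://\S+", "", tweet)   # links
    return tweet.strip()

raw_tweets = ["@user I love this movie! #film http://t.co/xyz",
              "worst service ever @airline"]

# Tokenize each cleaned tweet into a list of strings.
token_lists = [tokenizer.tokenize(clean(t)) for t in raw_tweets]

# Each tweet becomes a (list-of-strings, unique-tag) pair.
tagged_docs = [TaggedDocument(words=tokens, tags=[f"TWEET_{i}"])
               for i, tokens in enumerate(token_lists)]
```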
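A sketch of the embedding step, continuing from the snippet above and assuming gensim 4.x (where the dimensionality argument is `vector_size`; older releases used `size`). Apart from the 300-dimensional vectors stated in the text, the hyperparameters shown are illustrative defaults.

```python
from gensim.models import Word2Vec

# Train 300-dimensional word embeddings on the tokenized tweets.
w2v = Word2Vec(sentences=token_lists,
               vector_size=300,   # `size=300` in gensim < 4.0
               window=5,          # assumed context window
               min_count=1,       # keep rare words: tweets are short
               workers=4)

vector = w2v.wv["movie"]          # 300-dimensional vector for one word
```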
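The weighting scheme described above can be summarized as follows, using our own notation: $f_{t,d}$ is the count of term $t$ in tweet $d$, $|d|$ the number of words in $d$, $N$ the total number of tweets, $n_t$ the number of tweets containing $t$, and $w(t)$ the 300-dimensional Word2Vec embedding of $t$.

```latex
\mathrm{tf}(t,d) = \frac{f_{t,d}}{|d|}, \qquad
\mathrm{idf}(t) = \log\frac{N}{n_t}, \qquad
\mathrm{tfidf}(t,d) = \mathrm{tf}(t,d)\,\mathrm{idf}(t)

v(d) = \frac{1}{|d|} \sum_{t \in d} \mathrm{tfidf}(t,d)\, w(t)
```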
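A sketch of the TF-IDF-weighted averaging, again continuing from the previous snippets. The TF-IDF scores are computed directly from the formula above rather than with any particular library, so the exact values may differ from those in the original experiment.

```python
import math
from collections import Counter
import numpy as np

N = len(token_lists)
# Document frequency: number of tweets containing each term.
df = Counter(tok for tokens in token_lists for tok in set(tokens))
idf = {tok: math.log(N / n) for tok, n in df.items()}

def tweet_vector(tokens):
    """TF-IDF-weighted average of the Word2Vec vectors in one tweet."""
    vec = np.zeros(w2v.vector_size)
    if not tokens:
        return vec
    counts = Counter(tokens)
    for tok in tokens:
        if tok in w2v.wv:
            tf = counts[tok] / len(tokens)
            vec += tf * idf[tok] * w2v.wv[tok]
    return vec / len(tokens)

tweet_vectors = np.vstack([tweet_vector(tokens) for tokens in token_lists])
```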
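Finally, a sketch of the class balancing and scaling step. The label array and the assumption that positives are the minority class are placeholders; only random oversampling of positive samples and scikit-learn's preprocessing scale method follow from the text.

```python
import numpy as np
from sklearn.preprocessing import scale

labels = np.array([1, 0])   # placeholder labels: 1 = positive, 0 = negative

pos_idx = np.where(labels == 1)[0]
neg_idx = np.where(labels == 0)[0]

# Randomly duplicate positive samples until the classes are balanced.
rng = np.random.default_rng(0)
extra = rng.choice(pos_idx, size=len(neg_idx) - len(pos_idx), replace=True)
X = np.vstack([tweet_vectors, tweet_vectors[extra]])
y = np.concatenate([labels, labels[extra]])

# Zero-mean, unit-variance scaling of the averaged tweet vectors.
X_scaled = scale(X)
```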