
Sentiment Analysis using Word Embeddings

[Image: sentiment analysis illustration. Source: https://www.disruptiveadvertising.com/social-media/sentiment-analysis/]

Word embeddings are a feature representation of words that is typically richer than the count or frequency vectors used in the Bag of Words model (described in my previous post). The vector representations, or embeddings, for the words in a document or corpus can then be combined in some way to generate inputs to a classifier (otherwise called a model or an algorithm).

Pre-trained embedding vectors for 100,000 possible vocabulary words were used (i.e. the vocabulary size n was limited to n = 100,000). Each line of the embedding file consists of a word, followed by a 50-column embedding vector for it. The embeddings were generated by the Global Vectors (GloVe) method.
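
As a minimal sketch, loading such a file might look like the snippet below. The file name is an assumption (any 50-dimensional GloVe file in the standard "word v1 v2 ... v50" text format would work the same way):

```python
import numpy as np

GLOVE_PATH = "glove.6B.50d.txt"  # hypothetical file name
VOCAB_LIMIT = 100_000            # n = 100,000 as used in this project

embeddings = {}
with open(GLOVE_PATH, encoding="utf-8") as f:
    for i, line in enumerate(f):
        if i >= VOCAB_LIMIT:
            break
        parts = line.rstrip().split(" ")
        # first token is the word, the remaining 50 are its vector
        embeddings[parts[0]] = np.asarray(parts[1:], dtype=np.float32)

print(len(embeddings), embeddings["good"].shape)  # -> 100000 (50,)
```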

Similar to the previous project, we use the same sentiment data from three different domains: Amazon, IMDb and Yelp, consisting of 2400 examples of input and output variables. Again the output variable y is categorical, indicating whether a review is good (y = 1) or bad (y = 0).

Again, three models were compared using a k-fold cross-validation approach with k = 10: Logistic Regression, a Neural Network and a Support Vector Machine. In each of these models one hyperparameter is varied in order to determine the value that maximises the cross-validated accuracy, and I also show the variance across the selected range of values. The details of how this was run can be found in my GitHub repository: https://github.com/Qunlexie/Machine-Learning-for-Sentiment-Analysis

Preprocessing

The aim of this step is to represent the entire corpus of words in a form that the model can draw meaning from. Similar preprocessing steps to those in the sentiment analysis with the Bag of Words model were carried out. However, instead of a count or frequency representation, we use an existing trained model that contains a vector representation of each word: GloVe vectors. The vector is limited to 50 feature columns (d = 50, where d is the dimension), and each word in the input set is replaced with its vector representation. I then averaged the word vectors along the rows to obtain a single 50-feature representation for each line. For the entire input set (x) we therefore obtain a feature matrix (2400 by 50) that represents the nuances of each sentence. With this we can predict the sentiment using the models I will talk about next.
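
A minimal sketch of this averaging step, assuming the `embeddings` dictionary from the earlier snippet and the reviews already tokenized into lists of words:

```python
import numpy as np

def sentence_vector(tokens, embeddings, dim=50):
    """Average the GloVe vectors of the tokens; zero vector if none are known."""
    vectors = [embeddings[t] for t in tokens if t in embeddings]
    if not vectors:
        return np.zeros(dim, dtype=np.float32)
    return np.mean(vectors, axis=0)

# sentences: list of token lists, one per review (2400 in this project)
# X = np.vstack([sentence_vector(toks, embeddings) for toks in sentences])
# X.shape -> (2400, 50)
```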

Model Development and Results

Three models were built and tested. Each was evaluated on its performance on both the training set and the validation set using a 10-fold cross-validation technique.

1. Logistic Regression:

Logistic regression was fitted to the embedded vector transformation of the input obtained in the preprocessing step. The hyperparameter varied here is the regularisation parameter C, evaluated and validated using k-fold cross-validation with k = 10.
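
A sketch of this sweep, assuming `X` is the (2400, 50) averaged-embedding matrix and `y` the labels from above (the grid resolution is an assumption; the post only states the range):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

C_values = np.linspace(0.1, 10, 30)  # C varied within 0.1 - 10

means, stds = [], []
for C in C_values:
    model = LogisticRegression(C=C, max_iter=1000)
    scores = cross_val_score(model, X, y, cv=10, scoring="accuracy")
    means.append(scores.mean())
    stds.append(scores.std())

best = int(np.argmax(means))
print(f"best C = {C_values[best]:.2f}, accuracy = {means[best]:.3f}")
```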

The C value in the logistic model was varied within the range 0.1 – 10. The resulting curve is shown in Figure 1. As shown there, the model achieves higher training accuracy than validation accuracy across the range of C values. Both the training and the validation curves have an undulating shape. The lowest accuracy is obtained when C is in the range 0 – 2, and the highest validation accuracy of 77.3% is obtained at C = 2.15.

Fig 1: Accuracy curve across the range of the varied parameter (C value)
Fig 2: Standard deviation for training and validation across the range of the varied parameter for logistic regression

2. Neural Network Model

A neural network classifier was built and its accuracy was observed using the same cross-validation technique. The hyperparameter varied was the number of hidden layers (N).

The number of hidden layers was varied between 1 and 10 to evaluate the accuracy of the model in training and validation using 10-fold cross-validation. The resulting curve is given in Figure 3. The training accuracy shows an undulating pattern, and the best validation accuracy is obtained with 8 hidden layers.
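
In scikit-learn's MLPClassifier the number of hidden layers is set through the length of the hidden_layer_sizes tuple. A sketch under the same assumptions as before; the width of each layer is my assumption, as the post does not state it:

```python
from sklearn.neural_network import MLPClassifier
from sklearn.model_selection import cross_val_score

WIDTH = 50  # units per hidden layer; an assumption, not stated in the post

for n_layers in range(1, 11):
    model = MLPClassifier(hidden_layer_sizes=(WIDTH,) * n_layers,
                          max_iter=1000, random_state=0)
    scores = cross_val_score(model, X, y, cv=10, scoring="accuracy")
    print(f"{n_layers} hidden layers: "
          f"mean = {scores.mean():.3f}, std = {scores.std():.3f}")
```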

Fig 3: Accuracy curve across the range of the varied parameter (number of hidden layers)
Fig 4: Standard deviation for training and validation across the range of the varied parameter for the neural network

3. Support Vector Machine (SVM)

A Support Vector Machine was similarly fitted to the transformed input data. The varied hyperparameter for this model was gamma, the coefficient of the RBF kernel, which intuitively controls how far the influence of a single training example reaches.

The gamma value of the SVM was varied on a logarithmic scale from 10^-6 to 10^1.1, and the accuracy on the training and validation sets was evaluated as for the two models above. Increasing gamma produces an increasing trend in accuracy on the training set. As gamma increases beyond 0.01, the training curve continues to climb, while the validation accuracy holds roughly constant, rises, and then falls at the highest values of gamma.
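
A sketch of the gamma sweep, again assuming `X` and `y` from earlier (the number of grid points is an assumption):

```python
import numpy as np
from sklearn.svm import SVC
from sklearn.model_selection import cross_val_score

# gamma varied on a log scale from 10^-6 to 10^1.1
gammas = np.logspace(-6, 1.1, 30)

for gamma in gammas:
    model = SVC(kernel="rbf", gamma=gamma)
    scores = cross_val_score(model, X, y, cv=10, scoring="accuracy")
    print(f"gamma = {gamma:.4g}: "
          f"mean = {scores.mean():.3f}, std = {scores.std():.3f}")
```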

Fig 5: Accuracy curve across the range of the varied parameter (gamma)
Fig 6: Standard deviation for training and validation across the range of the varied parameter for the SVM

Conclusion

Tab 1: Parameter value at maximum accuracy for the three compared models

From the table above, it can be seen that the SVM achieves the highest accuracy among the three models compared: the hyperparameter search over the gamma value gives an accuracy of 78.2% at gamma = 0.043. The Neural Network (MLP classifier) gives slightly better accuracy than logistic regression at N = 8 (where N is the number of hidden layers in the network).

In all the models, the standard deviation at the best accuracy for each optimised hyperparameter is reasonably low, typically in the range 0.035 – 0.04.

Further Work

In this work we have used a 50-dimensional GloVe vector representation for the corpus. It might be interesting to consider higher dimensions, though this may not necessarily give better results (the curse of dimensionality).

It might also be interesting to optimise more than one hyperparameter per model. In this project we varied a single hyperparameter for each model, but a grid search over several hyperparameters would likely yield better accuracy, although it could be quite computationally expensive.

Acknowledgement

I would like to acknowledge Professor Mike Hughes and Professor Marty Allen, whose course on Machine Learning at Tufts University birthed this project.

Sentiment Analysis using Bag of Words model

This project is one I really enjoyed. It is based on my “Introduction to Machine Learning” class. The goal was to predict whether a review was good or bad using data from three domains: imdb.com, amazon.com, and yelp.com. This post explains the work; the full details of the code can be found here. The data comes from the paper by D. Kotzias, M. Denil, N. de Freitas and P. Smyth (2015), presented at KDD ’15: Proceedings of the 21st ACM SIGKDD International Conference on Knowledge Discovery and Data Mining.

Here, I have used 2400 samples of one-sentence reviews and their corresponding labels (positive = 1, negative = 0), which will be split into training and testing sets. A snapshot of the input data (x variable) is shown below:

Figure 1: Snapshot of reviews data

For some reason, the snapshot above seems to be composed of bad reviews only, but trust me, Amazon has good reviews too :). There are three main steps in the analysis performed here:

  1. Feature Extraction
  2. Model Development
  3. Performance Evaluation

Feature Extraction

The objective of this step is to ensure that the data input to the model (called the corpus in Natural Language Processing) is in the most useful form: one that facilitates the extraction of “maximum” information. To do this, the following steps were carried out (a sketch follows the list):

1. Removal of punctuation from each sentence in the predictor variable.
2. Retaining only alphabetic words, i.e. excluding numbers and alphanumeric tokens.
3. Removal of common English stop words. Stop words are words whose presence does not alter the meaning of a sentence.
4. Elimination of all one-letter words, e.g. “a”, “I”.
5. Conversion of all words to lower case for consistency.
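
A minimal sketch of these five steps. The choice of stop-word list is an assumption; here I use scikit-learn's built-in English list, though the project may have used a different one:

```python
import re
from sklearn.feature_extraction.text import ENGLISH_STOP_WORDS

def preprocess(sentence):
    """Apply steps 1-5: keep alphabetic words only, drop stop words and
    one-letter words, and lower-case everything."""
    tokens = re.findall(r"[A-Za-z]+", sentence)  # steps 1-2
    tokens = [t.lower() for t in tokens]         # step 5
    return [t for t in tokens
            if t not in ENGLISH_STOP_WORDS and len(t) > 1]  # steps 3-4

print(preprocess("So there is no way for me to plug it in here."))
```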

Further in the feature extraction step, it is important to define the pipeline that will be used for the input data. Here I have used a basic single-word tokenizer for each sentence. Note that the “tokens” produced take the initial preprocessing steps into account. Here is an example of the tokens for the first row of input; you can see that the words “and”, “I”, “to” and “the” have been removed compared to Figure 1.

First row of input data after tokenizing

The next step in feature extraction is to create the vocabulary: the list against which each line in the corpus will be compared. There are several ways to design a vocabulary, but here we use the list of individual words in the corpus without repetition. Each sentence in the corpus is then compared against this set. The process of comparing the corpus with the vocabulary to create a vector is called vectorisation.

There are many approaches to vectorisation, ranging from basic counting to the Term Frequency & Inverse Document Frequency (TF-IDF) method. The basic counting technique produces a binary feature vector based on the occurrence of each dictionary word in each input sample. TF-IDF, on the other hand, accounts for the frequency or rarity of words in the corpus: rare words carry more weight, since they can contribute more to determining whether a review is positive or negative, while frequently occurring words are penalised. Initial tests on this dataset showed better performance for TF-IDF over the basic count vector approach, so TF-IDF was used in this project.
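
In scikit-learn the two approaches correspond to CountVectorizer and TfidfVectorizer. A sketch, assuming `corpus` holds the 2400 preprocessed review strings:

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

# corpus: list of 2400 preprocessed review strings (assumed already cleaned)
count_X = CountVectorizer(binary=True).fit_transform(corpus)  # 0/1 occurrence
tfidf_X = TfidfVectorizer().fit_transform(corpus)             # rarity-weighted

print(count_X.shape, tfidf_X.shape)  # both (2400, vocabulary_size), sparse
```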

Model Development

Three models were built and tested. Each was evaluated on its performance on both the training set and the validation set using cross-validation. The goal was to use a single-hyperparameter search to determine the performance of each model across a range of values for that hyperparameter.

  1. Logistic Regression: logistic regression was fitted to the vectorised training data obtained using the TfidfVectorizer function in sklearn. The hyperparameter varied here is the C value. The accuracy scores were evaluated on the training and validation sets using k-fold cross-validation with k = 10.
  2. MLP Classifier: an MLP classifier was built and its accuracy observed using the same cross-validation technique (k = 10). The hyperparameter varied was the number of hidden layers.
  3. Support Vector Machine (SVM): an SVM was also fitted to the transformed input data. SVMs are frequently used in natural language tasks, especially sentiment analysis, hence their adoption here. Gamma is the free parameter of the radial basis function kernel, which intuitively controls the reach of each training example. The performance of the model was observed for varying values of gamma (also with k-fold cross-validation, k = 10). A sketch of this setup follows the list.
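
One way to wire the vectoriser and a model together is a scikit-learn Pipeline, which refits the TF-IDF vocabulary inside each cross-validation fold and so avoids leaking validation data into the features. A sketch with the SVM, assuming `texts` holds the raw review strings and `y` the labels; the gamma value shown is just the best region reported later:

```python
from sklearn.pipeline import make_pipeline
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import SVC
from sklearn.model_selection import cross_val_score

# texts: raw review sentences, y: 0/1 labels (assumed loaded)
pipeline = make_pipeline(TfidfVectorizer(), SVC(kernel="rbf", gamma=0.5))
scores = cross_val_score(pipeline, texts, y, cv=10, scoring="accuracy")
print(f"mean = {scores.mean():.3f}, std = {scores.std():.3f}")
```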

Model Performance

  • Logistic Regression Model Performance

Figures 2a and 2b show the mean and standard deviation of the model’s accuracy. The C value in the logistic model was varied within the range 0.1 to 10; some experimentation went into selecting this range, as the accuracy essentially flattens out for C values above 10. As shown in Figure 2a, the model achieves higher training accuracy than validation accuracy across the whole range of C values. The validation accuracy, like the training accuracy, increases sharply at small C values but plateaus before decreasing slightly at higher C values. Comparing Figures 2a and 2b shows some overfitting at higher C values. The maximum validation accuracy achieved with this model within the specified range of C is 82.4%.

Figure 2a: Mean Accuracy Score across a range of C values for Logistic Regression Model
Figure 2b: Standard deviation of Accuracy Score across a range of C values for Logistic Regression Model
  • Support Vector Machine (SVM) Model Performance

The gamma value of the SVM was varied and the accuracy curves in Figures 3a and 3b were observed. For gamma values less than 0.5, an increase in gamma results in a corresponding increase in accuracy on both the training and validation sets. As gamma increases beyond 0.5 there is clear evidence of overfitting: the validation accuracy decreases while the training accuracy continues to rise. The best validation accuracy within the range of gamma values occurs around gamma = 0.5.

Figure 3a: Mean Accuracy Score across a range of Gamma for SVM
Figure 3b: Standard Deviation of Accuracy Score across a range of Gamma for SVM
  • Neural Network Model Performance

The number of hidden layers in the neural network was varied between 1 and 10 to evaluate the accuracy of the model on the training and validation sets; increasing the number of hidden layers above 10 did not improve the model, hence this range. The resulting graphs are given in Figures 4a and 4b. The maximum validation accuracy of 80.25% was obtained with 2 hidden layers. The training accuracy is fairly constant across the range of hidden-layer counts, while the standard deviation curve shows an undulating pattern of highs and lows, with a relatively high standard deviation even at 2 hidden layers.

Figure 4a: Mean Accuracy Score across a range of Number of Hidden layers for a Neural Network
Figure 4b: Standard Deviation of Accuracy Score across a range of Number of Hidden layers for a Neural Network

Conclusion

The logistic model is the leading model in terms of accuracy score for the Bag of Words sentiment analysis carried out here. It is also quite good at avoiding overfitting (Figure 2a): overfitting appears only at extreme values of C. The logistic regression model is also quite stable, as shown in the standard deviation curve, especially at C values greater than 4 (Figure 2b). The SVM comes second, though its overfitting potential is high at large values of gamma (Figure 3a). The neural network comes third, with accuracy below both logistic regression and the SVM, and its standard deviation is quite unstable across the selected hyperparameter range (Figure 4b).

Further Work

This project takes a very simple approach and there is substantial room for improvement. The model could possibly be improved by using a bi-gram or tri-gram vectoriser instead of the simple unigram vectoriser used here. The “gram” is simply the number of words combined into each vocabulary entry: instead of single words, every two words (bi-gram) or three words (tri-gram) form a component of the vocabulary space. This can add more meaning to the model. For example, given the sentence “not like it”, the bi-gram tokens would be [“not like”, “like it”], which can mean two different things and thus influence the output label. We could also use other models such as word embeddings with GloVe vectors, and I will write about that improvement in my next post. Stay tuned 🙂
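
In scikit-learn this is a one-parameter change: the ngram_range argument of the vectoriser. A tiny sketch of what the bi-gram vocabulary looks like:

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# ngram_range=(1, 2) keeps unigrams and adds bi-grams to the vocabulary
bigram = TfidfVectorizer(ngram_range=(1, 2))
print(bigram.fit(["did not like it"]).get_feature_names_out())
# -> ['did' 'did not' 'it' 'like' 'like it' 'not' 'not like']
```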