Word embeddings are a feature representation of words that is typically richer than the count or frequency vectors used in the Bag of Words model (described in my previous post). The vector representations, or embeddings, for a document or corpus of words can then be combined in some way to generate inputs to a classifier (otherwise called a model or an algorithm).
Pre-trained embedding vectors for 100,000 vocabulary words were used (i.e. the vocabulary size n was limited to n = 100,000). Each line of the embedding file consists of a word followed by its 50-dimensional embedding vector. The embeddings were generated by the Global Vectors (GloVe) method.
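A file in that format can be parsed into a word-to-vector lookup with a few lines of Python. This is a minimal sketch, assuming the standard space-separated GloVe text format; the function name and the `max_words` cap are my own, not from the project code:

```python
import numpy as np

def load_glove(path, max_words=100_000):
    """Parse a GloVe text file: each line is a word followed by its
    50-dimensional embedding, space-separated."""
    embeddings = {}
    with open(path, encoding="utf-8") as f:
        for i, line in enumerate(f):
            if i >= max_words:          # cap the vocabulary at n words
                break
            parts = line.rstrip().split(" ")
            embeddings[parts[0]] = np.asarray(parts[1:], dtype=np.float32)
    return embeddings
```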
As in the previous project, we use the same sentiment data from three domains: Amazon, IMDb and Yelp, consisting of 2,400 examples of input and output variables. Again, the output variable y is categorical, indicating whether a review is good (y = 1) or bad (y = 0).
Again, three models were compared using k-fold cross-validation with k = 10: Logistic Regression, a Neural Network and a Support Vector Machine. For each model, one hyperparameter is varied to determine the value that maximises the cross-validated accuracy, and I also show the variance across the selected range of values. The details of how this was run can be found on my GitHub: https://github.com/Qunlexie/Machine-Learning-for-Sentiment-Analysis
The aim of this step is to represent the entire corpus in a form the model can draw meaning from. The same preprocessing steps as in the sentiment analysis with the Bag of Words model were carried out. However, instead of a count or frequency representation, we use an existing trained model that contains a vector representation of each word: GloVe vectors. Each word vector is limited to 50 feature columns (d = 50, where d is the dimension), and every word in the input set is replaced with its vector representation. I then averaged along the rows to obtain a single 50-feature representation for each line. For the entire input set x we therefore obtain a comprehensive feature matrix (2,400 by 50) that captures the nuances of each sentence. With this we can predict the sentiment using the models, which I will talk about next.
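The averaging step described above can be sketched as follows. This is an illustrative implementation, not the project's exact code; the zero-vector fallback for reviews with no in-vocabulary words is my own assumption:

```python
import numpy as np

def sentence_embedding(tokens, embeddings, d=50):
    """Average the d-dimensional GloVe vectors of all in-vocabulary
    tokens; fall back to a zero vector if none are found."""
    vecs = [embeddings[t] for t in tokens if t in embeddings]
    if not vecs:
        return np.zeros(d, dtype=np.float32)
    return np.mean(vecs, axis=0)

# Stacking one averaged vector per review yields the (2400, 50) matrix X.
```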
Model Development and Results
Three models were built and tested. Each was evaluated on both the training set and the validation set using 10-fold cross-validation.
1. Logistic Regression:
Logistic regression was fitted to the embedded vector representation of the input obtained in the preprocessing step. The hyperparameter varied here is the regularisation parameter C, evaluated and validated using k-fold cross-validation with k = 10.
The C value in the logistic model was varied within the range 0.1 – 10. The resulting curve is shown in Figure 1. As shown there, the model achieves a higher training accuracy than validation accuracy across the range of C values, and both curves have an undulating shape. The lowest accuracy is obtained when C is in the range 0 – 2, and the highest validation accuracy of 77.3% is obtained at C = 2.15.
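A hyperparameter sweep of this kind can be sketched with scikit-learn's `cross_val_score`. The synthetic data below stands in for the (2,400, 50) embedding matrix, and the grid spacing is illustrative rather than the exact grid used in the project:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 50))   # stand-in for the (2400, 50) embedding matrix
y = (X[:, 0] + 0.5 * rng.normal(size=200) > 0).astype(int)

# Sweep C over 0.1 - 10 and record the mean 10-fold validation accuracy.
C_grid = np.linspace(0.1, 10, 20)
scores = [cross_val_score(LogisticRegression(C=C, max_iter=1000),
                          X, y, cv=10).mean()
          for C in C_grid]
best_C = C_grid[int(np.argmax(scores))]
```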
2. Neural Network Model
A Neural Network classifier was built and its accuracy was observed using the same cross-validation technique. The hyperparameter varied was the number of hidden layers (N).
The number of hidden layers was varied between 1 and 10, and the training and validation accuracy were evaluated using 10-fold cross-validation. The resulting curve is given in Figure 3. The training accuracy shows an undulating pattern, and the best validation accuracy is obtained with 8 hidden layers.
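With scikit-learn's `MLPClassifier`, the depth of the network is controlled by the length of the `hidden_layer_sizes` tuple, so the sweep can be sketched as below. The layer width (8 units), solver choice and synthetic data are my own illustrative assumptions, not the project's settings:

```python
import numpy as np
from sklearn.neural_network import MLPClassifier
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(1)
X = rng.normal(size=(120, 50))   # stand-in for the embedding features
y = (X[:, 0] > 0).astype(int)

# Vary the number of hidden layers from 1 to 10.
results = {}
for n_layers in range(1, 11):
    clf = MLPClassifier(hidden_layer_sizes=(8,) * n_layers,
                        solver="lbfgs", max_iter=300, random_state=0)
    results[n_layers] = cross_val_score(clf, X, y, cv=10).mean()
best_n = max(results, key=results.get)
```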
3. Support Vector Machine (SVM)
A Support Vector Machine was similarly fitted to the transformed input data. The varied hyperparameter for this model was gamma, the RBF kernel coefficient, which intuitively controls how far the influence of a single training example reaches.
Gamma was varied over the logarithmic range 10^-6 to 10^1.1, and the accuracy on the training and validation sets was evaluated as for the two models above. Increasing gamma produces an increasing trend in training accuracy. As gamma increases beyond 0.01, the slope of the training curve tends to increase, while the validation accuracy stays roughly constant, then increases, and finally decreases at higher values of gamma.
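A log-spaced gamma sweep of this kind can be sketched with `np.logspace` and scikit-learn's `SVC`. The number of grid points and the synthetic data are illustrative assumptions:

```python
import numpy as np
from sklearn.svm import SVC
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(2)
X = rng.normal(size=(200, 50))   # stand-in for the embedding features
y = (X[:, 0] > 0).astype(int)

# Scan gamma logarithmically from 10^-6 to 10^1.1.
gammas = np.logspace(-6, 1.1, 15)
val_scores = [cross_val_score(SVC(kernel="rbf", gamma=g), X, y, cv=10).mean()
              for g in gammas]
best_gamma = gammas[int(np.argmax(val_scores))]
```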
From the table above, it can be seen that the SVM achieves the highest accuracy of the three models compared. A comprehensive hyperparameter search over gamma gives an accuracy of 78.2% at a gamma value of 0.043. The Neural Network (MLP classifier) also gives slightly better accuracy than logistic regression at N = 8 (where N is the number of hidden layers in the network).
In all the models, the standard deviation of the best accuracy for each optimised hyperparameter is reasonably low, typically in the range 0.035 – 0.04.
In this work we used a 50-dimensional GloVe vector representation of the corpus. It might be interesting to consider higher dimensions, though this may not necessarily give better results (curse of dimensionality).
It might also be interesting to optimise more than one hyperparameter per model. In this project we varied one hyperparameter for each model, but a grid search over several hyperparameters would likely yield better accuracy, although it might be quite computationally expensive.
I would like to acknowledge Professor Mike Hughes and Professor Marty Allen, whose course on Machine Learning at Tufts University birthed this project.