This project is one I really enjoyed. It is based on my “Introduction to Machine Learning” class. The goal was to predict whether a review was good or bad using data from three domains: imdb.com, amazon.com, and yelp.com. This post explains the work; the full details of the code can be found here. The data was obtained from the paper by D. Kotzias, M. Denil, N. De Freitas, and P. Smyth (2015), presented at KDD ’15: Proceedings of the 21st ACM SIGKDD International Conference on Knowledge Discovery and Data Mining.
Here, I have used 2400 samples of one-sentence reviews and their corresponding labels indicating whether each review is positive or negative (positive = 1, negative = 0), which will be split into training and testing sets. A snapshot of the input data (the x-variable) is shown below:
For some reason, the snapshot above seems to be composed of bad reviews only, but trust me, amazon has good reviews too :). There are three main steps in the analysis performed here:
- Feature Extraction
- Model Development
- Performance Evaluation
The objective of this step is to ensure that the data input to the model (called the corpus in Natural Language Processing) is in the most useful form: one that facilitates the extraction of “maximum” information. To do this, the following steps were carried out:
1. Removal of punctuation from each sentence in the predictor variable.
2. Retention of only alphabetic words, i.e. exclusion of numbers and alphanumeric tokens.
3. Removal of common English stop words. Stop words are words whose presence does not alter the meaning of a sentence.
4. Elimination of all one-letter words, e.g. “a”, “I”.
5. Conversion of all words to lower case for consistency.
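The preprocessing steps above can be sketched in a few lines of Python. The stop-word list here is a tiny illustrative subset (a real run would use a full list such as NLTK's English stop words), and the function name is my own:

```python
import string

# Illustrative subset of English stop words; a real pipeline would use a
# fuller list (e.g. NLTK's). This list is an assumption, not the post's exact one.
STOP_WORDS = {"the", "and", "to", "is", "it", "of", "in", "for", "this", "that"}

def preprocess(sentence):
    # 1. Remove punctuation
    sentence = sentence.translate(str.maketrans("", "", string.punctuation))
    # 5. Convert to lower case (done early so stop-word matching is consistent)
    words = sentence.lower().split()
    # 2. Keep only alphabetic words (drops numbers and alphanumeric tokens)
    words = [w for w in words if w.isalpha()]
    # 3. Remove stop words and 4. eliminate one-letter words
    return [w for w in words if w not in STOP_WORDS and len(w) > 1]

print(preprocess("I do not like it!"))  # ['do', 'not', 'like']
```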
Further into the feature extraction step, it is important to define the pipeline that will be used for the input data. Here I have used a basic, single-word tokenizer for each word in a sentence. Note that the “tokens” produced have taken into account the initial preprocessing steps. Here is an example of the tokens for the first row of input. You can clearly notice that the words “and”, “I”, “to”, and “the” have been removed compared to Figure 1.
The next thing in the feature extraction step is to create the vocabulary. The vocabulary is the list against which each line in the corpus will be compared. There are several ways to design a vocabulary, but here we use a list of the individual words in the corpus, without repetition. Each sentence in the corpus is then compared against this set. The process of comparing the corpus with the vocabulary to create a vector is called vectorisation.
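A minimal sketch of vocabulary building and binary vectorisation, on a toy tokenised corpus of my own (not the actual review data):

```python
# Toy corpus, already tokenised as in the preprocessing step (my own example)
corpus = [["great", "phone", "works"], ["phone", "broke", "fast"]]

# Vocabulary: every unique word in the corpus, without repetition
vocabulary = sorted({word for sentence in corpus for word in sentence})

def vectorise(tokens):
    # Binary occurrence vector: 1 if the vocabulary word appears in the sentence
    return [1 if word in tokens else 0 for word in vocabulary]

print(vocabulary)            # ['broke', 'fast', 'great', 'phone', 'works']
print(vectorise(corpus[0]))  # [0, 0, 1, 1, 1]
```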
There are many approaches to vectorisation, ranging from the basic counting approach to the term frequency and inverse document frequency method. The basic counting technique produces a binary feature vector based on the occurrence of each dictionary word in each input sample. Term Frequency & Inverse Document Frequency (TF-IDF), on the other hand, accounts for the frequency or rarity of words in the corpus. Rare words hold more weight, since they could contribute more to determining whether a review is positive or negative, while frequently occurring words are penalised. Initial tests on this dataset showed better performance for TF-IDF over the basic count vector approach, so TF-IDF was used in this project.
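The difference between the two vectorisers can be seen on a toy corpus using Sklearn's CountVectorizer and TfidfVectorizer (the two sentences are my own illustration, not rows from the dataset):

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

# Two toy sentences standing in for the review corpus (illustrative only)
corpus = ["great phone great battery", "terrible battery life"]

count_vec = CountVectorizer().fit(corpus)
tfidf_vec = TfidfVectorizer().fit(corpus)

# Vocabulary is alphabetical: battery, great, life, phone, terrible
print(count_vec.transform(corpus).toarray())          # first row: [1 2 0 1 0]
# "battery" appears in both sentences, so TF-IDF down-weights it
# relative to rarer words like "life"
print(tfidf_vec.transform(corpus).toarray().round(2))
```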
Three models were built and tested. Each was evaluated on both the training set and the validation set using a cross-validation technique. The goal was to use a single hyperparameter search to determine the performance of each model across a range of values for that hyperparameter.
- Logistic Regression: a logistic regression model was fitted to the vectorised and transformed training data obtained using the TfidfVectorizer function in Sklearn. The hyperparameter varied here is the C value. Accuracy scores were evaluated for the training and validation sets using k-fold cross-validation with k = 10.
- MLP Classifier: an MLP classifier was built and its accuracy observed using the same cross-validation technique (k = 10). The hyperparameter varied was the number of hidden layers.
- Support Vector Machine (SVM): an SVM was also fitted to the transformed input data. SVMs are frequently used in Natural Language tasks, especially sentiment analysis, hence their adoption here. Gamma is the free parameter of the radial basis function kernel, which intuitively quantifies the spread (variance) of the kernel. The performance of the model was observed for varying values of gamma (again with k-fold cross-validation, k = 10).
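The three evaluations above can be sketched with Sklearn's cross_val_score. The corpus below is a tiny synthetic stand-in for the real 2400 reviews, and the single hyperparameter value shown per model is just one example point from each sweep:

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.neural_network import MLPClassifier
from sklearn.svm import SVC

# Tiny synthetic stand-in for the review corpus (the real data has 2400 samples)
pos = ["great product love it", "works perfectly very happy"] * 10
neg = ["terrible waste of money", "broke after one day awful"] * 10
texts, labels = pos + neg, [1] * 20 + [0] * 20

X = TfidfVectorizer().fit_transform(texts)
y = np.array(labels)

# One hyperparameter per model, mirroring the post: C for logistic regression,
# gamma for the RBF SVM, hidden-layer count for the MLP
models = {
    "logreg (C=1)": LogisticRegression(C=1.0),
    "svm (gamma=0.5)": SVC(kernel="rbf", gamma=0.5),
    "mlp (2 layers)": MLPClassifier(hidden_layer_sizes=(10, 10), max_iter=500),
}
for name, model in models.items():
    scores = cross_val_score(model, X, y, cv=10)  # k-fold with k = 10
    print(f"{name}: mean={scores.mean():.3f} std={scores.std():.3f}")
```

In a real sweep, the loop would instead iterate over a range of C, gamma, or hidden-layer values for one model at a time, recording the mean and standard deviation at each point.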
- Logistic Regression Model Performance
Figures 2a and 2b show the mean and standard deviation of the model's accuracy. The C value in the logistic model was varied within the range of 0.1 to 10; some experimentation was done to select this range, as the accuracy essentially flattens out for C values above 10. The resulting graphs are given in Figures 2a and 2b. As shown there, the model achieves higher training accuracy than validation accuracy across the whole range of C values. The validation accuracy, like the training accuracy, increases sharply at low C values but plateaus before decreasing slightly at higher C values. Comparing Figures 2a and 2b shows some overfitting at higher C values. The maximum validation accuracy of 82.4% is achieved with this model within the specified range of C.
- Support Vector Machine (SVM) Model Performance
The gamma value of the support vector machine was varied and the accuracy curves shown in Figures 3a and 3b were observed. For gamma values less than 0.5, an increase in gamma results in a corresponding increase in accuracy on both the training and validation sets. As gamma increases beyond 0.5 there is clear evidence of overfitting: the accuracy on the validation set decreases while that of the training set continues to rise. The best validation accuracy within the range of gamma values for the SVM occurs around gamma = 0.5.
- Neural Network Model Performance
The number of hidden layers in the neural network was varied between 1 and 10 to evaluate the accuracy of the model on the training and validation sets. Increasing the number of hidden layers above 10 does not improve the model, hence the adoption of this range. The resulting graphs are given in Figures 4a and 4b. The maximum validation accuracy of 80.25% was obtained using 2 hidden layers. The training accuracy is fairly constant across the range of hidden-layer counts. The standard deviation curve shows an undulating pattern of highs and lows, and a relatively high standard deviation is observed even when the number of hidden layers is 2.
The logistic model is the leading model in terms of accuracy score for the Bag of Words sentiment analysis carried out here. It is also quite good at avoiding overfitting (Figure 2a): overfitting occurs only at extreme values of C. The logistic regression model is also quite stable, as shown in the standard deviation curve, especially at C values greater than 4 (Figure 2b). The SVM comes second, though its overfitting potential is high at high values of gamma (Figure 3a). The neural network comes third, with a model accuracy below both the logistic regression and the SVM; its standard deviation is also quite unstable across the range of hyperparameter values selected (Figure 4b).
This project takes a very simple approach and there is substantial room for improvement. The model could possibly be improved by using a bi-gram or tri-gram vectoriser instead of the simple unigram vectoriser used here. “Gram” simply refers to the number of consecutive words combined into one vocabulary entry: instead of single words, every pair of words (bi-gram) or triple of words (tri-gram) forms a component of the vocabulary space. This can add more meaning to the model. For example, given the sentence “not like it”, the bi-gram tokens will be [“not like”, “like it”], which carry different meanings and could thus influence the label output. We could also use other models, such as word embeddings using GloVe vectors, and I will write about improving this model in my next post. Stay tuned 🙂