
A Comprehensive Classification Model for Predicting Wildfires with Uncertainty

Source: https://calmatters.org/explainers/californias-worsening-wildfires-explained/

Introduction

I came across the paper by Younes Oulad Sayad, Hajar Mousannif, and Hassan Al Moatassime (2019) titled Predictive modeling of wildfires: A new dataset and machine learning approach. Wildfires are a serious environmental problem today, heightened by the menace of global warming and climate change. Their work introduced a new dataset for predicting the occurrence of wildfires using processed remote sensing data. A brief description of their work is given below:

“This Dataset was created based on Remote Sensing data to predict the occurrence of wildfires, it contains Data related to the state of crops (NDVI: Normalized Difference Vegetation Index), meteorological conditions (LST: Land Surface Temperature) as well as the fire indicator “Thermal Anomalies”. All three parameters were collected from MODIS (Moderate Resolution Imaging Spectroradiometer), an instrument carried on board the Terra platform. The collected data went through several preprocessing techniques before building the final Dataset. The experimental Dataset is considered as a case study to illustrate what can be done at larger scales”

(Oulad Sayad et al., 2019)

In their work they used 804 of the 1,713 available data samples to predict the occurrence of wildfires using two machine learning models: Neural Networks and Support Vector Machines (SVM). Here, I use the full dataset of 1,713 samples: 1,327 instances of the class “no_fire” and 386 instances of the “fire” class (more details on this data can be found here). I also used 11 classification models to predict whether or not a wildfire occurs, using three feature variables: Normalized Difference Vegetation Index (NDVI), Land Surface Temperature (LST) and Thermal Anomalies (referred to as BURNED_AREA in the dataset). The aim of this work was to compare the performance of different machine learning models (with careful hyperparameter selection) for wildfire prediction. The full code for this work can be found in my Github: https://github.com/Qunlexie/A-Comprehensive-Classification-Model-for-Predicting-Wildfires-with-Uncertainty.

Machine Learning Models

  1. Support Vector Machines: A Support Vector Classifier chooses the hyperplane that maximises the margin between the two classes. The non-linear ‘rbf’ kernel tends to perform better for this task, so it was kept constant in the grid search, while the C and gamma values were varied. The grid search parameters are given below. The best combination was then used in the classification model, and the cross-validation accuracy (k-fold with k = 5) and its standard deviation were recorded; the standard deviation is used to understand the uncertainty of the model. A code sketch of this grid-search workflow is given after this list.
    param_grid = {'C': np.logspace(1, 2.3, 10), 'gamma': np.linspace(0, 2, 5), 'kernel': ['rbf']}
  2. Bagging Classifier: The idea of the bagging classifier is to combine several simple classifiers to create a more complex classification boundary. Here, several SVM classifiers were pooled together, or ‘bagged’. This model was tuned using a grid search over a range of selected hyperparameters, given below:
    param_grid = {'n_estimators': list(range(0, 110, 10))[1:], 'max_features': list(range(1, 21, 2))}
  3. Extremely Randomised Trees Classifier: Extremely randomized trees pick node splits at random (both the variable index and the variable splitting value are chosen randomly), as opposed to Random Forest, which finds the best split (the optimal variable index and splitting value) among a random subset of variables [1]. This model was tuned using a grid search over a range of selected hyperparameters, given below:
    param_grid = {'n_estimators': list(range(0, 110, 10))[1:], 'max_features': list(range(1, 21, 2)), 'max_depth': list(range(1, 21, 2))}
  4. Adaboost Classifier: AdaBoost, or Adaptive Boosting, builds a strong, high-accuracy classifier by combining multiple poorly performing classifiers. The basic concept behind AdaBoost is to set the weights of the classifiers and re-weight the training samples at each iteration so that unusual, hard-to-classify observations are predicted accurately [2]. The AdaBoost classification model was tuned using a grid search over a range of selected hyperparameters, given below:
    param_grid = {'n_estimators': list(range(0, 110, 10))[1:], 'learning_rate': [0.2, 0.4, 0.6, 0.8, 1]}
  5. Gradient Boosting Classifier: Gradient boosting combines the AdaBoost approach with weighted minimization, after which the classifiers and weighted inputs are recalculated. The objective of gradient boosting classifiers is to minimize the loss, i.e. the difference between the actual class value of a training example and the predicted class value [3]. The gradient boosting classifier was tuned using a grid search over a range of selected hyperparameters, given below:
    param_grid = {'n_estimators': list(range(0, 110, 10))[1:], 'max_features': list(range(1, 21, 2)), 'max_depth': list(range(1, 21, 2))}
  6. Extreme Gradient Boosting (XGBoost) Classifier: XGBoost is a refined and customized version of a gradient boosting decision tree system, created with performance and speed in mind. XGBoost stands for “eXtreme Gradient Boosting”, referring to the fact that the algorithms and methods have been customized to push the limit of what is possible for gradient boosting algorithms [3]. The XGBoost classifier was tuned using a grid search with the same parameters as the gradient boosting classifier.
  7. Random Forest: Random Forest is an ensemble tree-based learning algorithm: a set of decision trees built from randomly selected subsets of the training set. It aggregates the votes from the different decision trees to decide the final class of the test object [4]. The Random Forest classifier was tuned using a grid search with the same parameters as the gradient boosting and Extremely Randomized Trees classifiers.
  8. Neural Network: Neural networks are built of simple elements called neurons, which take in a real value, multiply it by a weight, and run it through a non-linear activation function. By constructing multiple layers of neurons, each of which receives part of the input variables and passes its results on to the next layer, the network can learn very complex functions [5]. The size of the hidden layer was the hyperparameter varied in the grid search, over the range given below:
    param_grid = {'hidden_layer_sizes': list(range(0, 51, 5))[1:]}
  9. Nearest Neighbour Classifier: Classifies each data point by analyzing its nearest neighbours in the training set; the data point is assigned the class most common among those neighbours. The algorithm is non-parametric (it makes no assumptions about the underlying data) and uses lazy learning (it does not pre-train; all training data is used during classification) [5]. For the Nearest Neighbour classifier, the number of neighbours was the hyperparameter varied in the grid search, over the range given below:
    param_grid = {'n_neighbors': list(range(1, 15, 2))}
  10. Gaussian Naive Bayes Classifier: A probability-based classifier built on Bayes’ theorem. Using conditional probability, it calculates the probability that each of the features of a data point (the input variables) occurs in each of the target classes, then selects the class for which that probability is maximal [5].
  11. Stochastic Gradient Descent Classifier: Stochastic Gradient Descent is similar to gradient descent except that, at each iteration, the gradient of the cost function is computed on a single example rather than as the sum of the gradients over all examples [6]. For the Stochastic Gradient Descent classifier, the type of loss function was the hyperparameter varied in the grid search, over the range given below:
    param_grid = {'loss': ['hinge', 'log', 'modified_huber', 'squared_hinge', 'perceptron', 'squared_loss', 'huber', 'epsilon_insensitive', 'squared_epsilon_insensitive']}
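
To make the workflow above concrete, here is a minimal sketch of the grid search applied to each classifier, shown for the SVM. It assumes scikit-learn; the synthetic data is only a stand-in for the real (NDVI, LST, BURNED_AREA) features and class proportions.

import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, cross_val_score
from sklearn.svm import SVC

# Placeholder data mimicking the 1,713-sample, 3-feature dataset with
# roughly 77% "no_fire"; swap in the real NDVI/LST/BURNED_AREA matrix.
X, y = make_classification(n_samples=1713, n_features=3, n_informative=3,
                           n_redundant=0, weights=[0.77], random_state=0)

# Note: the original grid used np.linspace(0, 2, 5) for gamma, but gamma
# must be strictly positive in recent scikit-learn, so this sketch starts at 0.01.
param_grid = {'C': np.logspace(1, 2.3, 10),
              'gamma': np.linspace(0.01, 2, 5),
              'kernel': ['rbf']}

# Exhaustive search over the grid with 5-fold cross-validation
search = GridSearchCV(SVC(), param_grid, cv=5).fit(X, y)

# Re-score the best model: the mean accuracy is reported alongside the
# standard deviation across folds, used here as the uncertainty measure.
scores = cross_val_score(SVC(**search.best_params_), X, y, cv=5)
print(search.best_params_)
print(f"accuracy: {scores.mean():.4f} +/- {scores.std():.4f}")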

Results and Discussion

Without Class Weight Balance

A class imbalance occurs because there are far more samples in the ‘no_fire’ class (1,327) than in the ‘fire’ class (386). Here, the analysis was done without balancing the classes; class weight balancing can significantly affect classifier performance.

The best accuracy among the 11 models is obtained with the Random Forest classifier. The best Random Forest parameters used to fit the model (as obtained from the grid search) are given below:

{'max_depth': 16, 'max_features': 1, 'n_estimators': 40}

The accuracy is 84.76%. The standard deviation is used in this analysis to depict model uncertainty; the uncertainty of the best model is reasonably low, with a standard deviation of 2.44%.

Accuracy and standard deviation of classifiers for predicting wildfire without class weight balance

Four of the classifiers have accuracy values above 80%: the Extremely Randomised Trees, Gradient Boosting, Extreme Gradient Boosting and Random Forest classifiers. The lowest accuracy is obtained with the Stochastic Gradient Descent algorithm, though it also has the smallest standard deviation (least uncertainty). The accuracy of the Random Forest and Gradient Boosting classifiers is almost the same, with the Random Forest predicting with lower uncertainty. Beyond these top performers, the SVM and the bagged SVMs tend to perform better than the remaining models.

With Class Weight Balance

Here, the class weights were balanced for four of the classification models: the Support Vector Machine, the bagging classifier (of SVMs), the Extra Trees classifier and the Stochastic Gradient Descent classifier. The accuracy tends to drop, but more instances of the no_fire class are correctly classified in the classification report, i.e. fewer mistakes are made on the no_fire class (more details on the classification reports can be found in the Github code). The bar chart below shows the accuracy values for all classifiers, including those with class weight balance. In scikit-learn, balancing is a one-line change for the classifiers that support it, as sketched below.
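
A minimal sketch of enabling class weight balancing, assuming scikit-learn estimators:

from sklearn.ensemble import ExtraTreesClassifier
from sklearn.linear_model import SGDClassifier
from sklearn.svm import SVC

# class_weight='balanced' weights each class inversely to its frequency,
# so errors on the rare 'fire' class are penalised more heavily.
svc = SVC(kernel='rbf', class_weight='balanced')
extra_trees = ExtraTreesClassifier(class_weight='balanced')
sgd = SGDClassifier(loss='hinge', class_weight='balanced')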

Accuracy and standard deviation of classifiers for predicting wildfire with class weight balance

Conclusion

The Random Forest algorithm tends to perform better than the other classifiers for the wildfire prediction problem. Note that the accuracy values obtained in this work are lower than those reported in the original paper; this is because more data samples were used.

References

[1] Extremely randomized trees: https://docs.opencv.org/2.4/modules/ml/doc/ertrees.html

[2] AdaBoost Classifier: https://www.datacamp.com/community/tutorials/adaboost-classifier-python

[3] Gradient boosting Classifiers: https://stackabuse.com/gradient-boosting-classifiers-in-python-with-scikit-learn/

[4] Random Forest Classification: https://towardsdatascience.com/random-forest-classification-and-its-implementation-d5d840dbead0

[5] Types of Classification Algorithms: Which Is Right for Your Problem?: https://missinglink.ai/guides/neural-network-concepts/classification-neural-networks-neural-network-right-choice/

[6] ML | Stochastic Gradient Descent (SGD): https://www.geeksforgeeks.org/ml-stochastic-gradient-descent-sgd/

Sentiment Analysis using Word Embeddings

Source: https://www.disruptiveadvertising.com/social-media/sentiment-analysis/

Word Embeddings are a feature representation of words that is typically richer than the count or frequency vectors used in the Bag of Words model described in my previous post. The vector representations, or embeddings, for an entire document or corpus of words can then be combined in some way to generate inputs to a classifier (otherwise called a model or an algorithm).

Pre-trained embedding vectors for 100,000 possible vocabulary words were used (i.e. the vocabulary size was limited to n = 100,000). Each line of the embedding file consists of a word followed by its 50-column embedding vector. The embeddings were generated by the Global Vectors (GloVe) method.

As in the previous project, we use the same sentiment data from three different domains: Amazon, IMDb and Yelp, consisting of 2,400 examples for the input and output variables. Again, the output variable y is categorical, indicating whether a review is good (y = 1) or bad (y = 0).

Again, three models were compared using a k-fold cross-validation approach with k = 10: Logistic Regression, a Neural Network and a Support Vector Machine. In each model, one hyperparameter was varied to determine the value that maximises the cross-validated accuracy, and I also show the variance across the selected range of values. The details of how this was run can be found in my GitHub: https://github.com/Qunlexie/Machine-Learning-for-Sentiment-Analysis

Preprocessing

The aim of this step is to ensure that the entire corpus of words is represented in a form from which the model can draw meaning. Similar preprocessing steps were carried out as in the sentiment analysis using the Bag of Words model. However, instead of a count or frequency representation, we used an existing trained model containing a vector representation of each word: GloVe vectors. The vector is limited to 50 feature columns (d = 50, where d is the dimension) for each word, and every word in the input set is replaced with this vector representation. I then averaged along the rows to obtain a single 50-feature representation for each sentence. For the entire input set (x) we therefore obtain a feature matrix (2400 by 50) that represents the nuances of each sentence. With this we can predict the sentiment using the models described next.
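
A minimal sketch of this averaging step, assuming a GloVe file such as glove.6B.50d.txt (one word followed by 50 floats per line); the file name and helper names are illustrative.

import numpy as np

def load_glove(path, max_words=100_000):
    """Read up to max_words embeddings into a {word: vector} dict."""
    embeddings = {}
    with open(path, encoding='utf-8') as f:
        for i, line in enumerate(f):
            if i >= max_words:
                break
            parts = line.split()
            embeddings[parts[0]] = np.asarray(parts[1:], dtype=float)
    return embeddings

def sentence_vector(tokens, embeddings, d=50):
    """Average the embeddings of the known words in one sentence."""
    vectors = [embeddings[w] for w in tokens if w in embeddings]
    return np.mean(vectors, axis=0) if vectors else np.zeros(d)

# Usage (embedding file not bundled here):
# embeddings = load_glove('glove.6B.50d.txt')
# X = np.vstack([sentence_vector(s.split(), embeddings) for s in sentences])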

Model Development and Results

Three models were built and tested. Each was evaluated on both the training set and the validation set using a 10-fold cross-validation technique.

1. Logistic Regression:

Logistic regression was fitted to the embedded vector transformation of the input obtained in the preprocessing step. The hyperparameter varied here is the C value, evaluated and validated using k-fold cross-validation with k = 10.

The C value in the logistic model was varied within the range 0.1 – 10. The resulting curve is shown in Figure 1. As shown there, the model achieves a higher training accuracy than validation accuracy across the range of C values, and both curves have an undulating shape. The lowest accuracy is obtained when C is in the range 0 – 2; the highest accuracy of 77.3% is obtained at C = 2.15. A code sketch of this sweep follows the figures.

Fig 1: Accuracy curve across the range of varied parameter (C-value)
Fig 2: Standard Deviation for Training and Validation across the range of varied parameter for logistic regression
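
A minimal sketch of the C sweep, assuming scikit-learn; the synthetic data stands in for the 2400-by-50 matrix of averaged GloVe features.

import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# Placeholder for the 2400-by-50 averaged-embedding feature matrix
X, y = make_classification(n_samples=2400, n_features=50, random_state=0)

for C in np.linspace(0.1, 10, 20):
    # 10-fold cross-validated accuracy at each C value
    scores = cross_val_score(LogisticRegression(C=C, max_iter=1000), X, y, cv=10)
    print(f"C={C:.2f}  acc={scores.mean():.3f}  std={scores.std():.3f}")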

2. Neural Network Model

A Neural Network classifier was built and the accuracy was observed using the same Cross Validation technique. The hyperparameter varied was the number of hidden layers (N).

The number of hidden layers was varied between 1 and 10 to evaluate the accuracy of the model in training and validation using 10-fold cross-validation. The resulting curve is given in Figure 3. The training accuracy shows an undulating pattern, and the best validation accuracy is obtained with 8 hidden layers. A code sketch of this sweep follows the figures.

Fig 3: Accuracy curve across the range of varied parameter (Number of Hidden Layers).
Fig 4: Standard Deviation for Training and Validation across the range of varied parameter for Neural Network
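
A minimal sketch of the depth sweep, assuming scikit-learn's MLPClassifier. Note that hidden_layer_sizes is a tuple: (50,) is one hidden layer of 50 units, (50, 50) is two layers, and so on; the layer width of 50 is an assumption here.

from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.neural_network import MLPClassifier

# Placeholder for the averaged-embedding features
X, y = make_classification(n_samples=2400, n_features=50, random_state=0)

for n_layers in range(1, 11):
    # (50,) * n_layers stacks n_layers hidden layers of 50 units each
    clf = MLPClassifier(hidden_layer_sizes=(50,) * n_layers, max_iter=500)
    scores = cross_val_score(clf, X, y, cv=10)
    print(f"layers={n_layers}  acc={scores.mean():.3f}  std={scores.std():.3f}")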

3. Support Vector Machine (SVM)

A Support Vector Machine was similarly fitted to the transformed input data. The gamma value – the free parameter of the RBF kernel, which intuitively controls how far the influence of a single training example reaches – was the hyperparameter varied for this model.

The gamma value of the support vector machine was varied on a logarithmic scale from 10^-6 to 10^1.1, and the accuracy on the training and validation sets was evaluated as for the two models above. Increasing the gamma value results in an increasing trend in training accuracy. As gamma increases beyond 0.01, the training accuracy continues to climb, while the validation accuracy first holds roughly constant, then increases, and finally decreases at higher values of gamma. A code sketch of this sweep follows the figures.

Fig 5: Accuracy curve across the range of varied parameter (Gamma).
Fig 6: Standard Deviation for Training and Validation across the range of varied parameter for SVM
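
A minimal sketch of the logarithmic gamma sweep, under the same placeholder-data assumption as above.

import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC

# Placeholder for the averaged-embedding features
X, y = make_classification(n_samples=2400, n_features=50, random_state=0)

# Sweep gamma from 10**-6 to 10**1.1 on a log scale
for gamma in np.logspace(-6, 1.1, 15):
    scores = cross_val_score(SVC(kernel='rbf', gamma=gamma), X, y, cv=10)
    print(f"gamma={gamma:.2e}  acc={scores.mean():.3f}")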

Conclusion

Tab 1: Parameter value at maximum accuracy for the three compared models

From the table above, it can be seen that the SVM achieves the highest accuracy of the three models compared. A comprehensive hyperparameter search over the gamma value gives an accuracy of 78.2% at a gamma value of 0.043. The Neural Network (MLP Classifier) also gives slightly better accuracy than the logistic regression at N = 8 (where N is the number of hidden layers in the network).

In all the models, the standard deviation obtained at the best accuracy for each optimised hyperparameter is reasonably low, often in the range 0.035 – 0.04.

Further Work

In this work we used a 50-dimensional GloVe vector representation of the corpus. It might be interesting to consider higher dimensions, though this may not necessarily give better results (curse of dimensionality).

It might also be interesting to optimise more than one hyperparameter per model. In this project we varied one hyperparameter for each model, but a fuller grid search over several hyperparameters would likely yield better accuracy, although it might be quite computationally expensive.

Acknowledgement

I would like to acknowledge Professor Mike Hughes and Professor Marty Allen, whose course on Machine Learning at Tufts University birthed this project.

Sentiment Analysis using Bag of Words model

This project is one I really enjoyed. It is based on my “Introduction to Machine Learning” class. The goal was to predict whether a review was good or bad using data from three domains: imdb.com, amazon.com, and yelp.com. This post explains the work, but the full details of the code can be found here. The data comes from the paper by D. Kotzias, M. Denil, N. De Freitas and P. Smyth (2015), presented at KDD ’15: Proceedings of the 21st ACM SIGKDD International Conference on Knowledge Discovery and Data Mining.

Here, I used 2,400 data samples of one-sentence reviews and their corresponding labels (positive = 1, negative = 0), which were split into training and testing sets. A snapshot of the input data (x-variable) is shown below:

Figure 1: Snapshot of reviews data

For some reason, the snapshot above seems to be composed of bad reviews only, but trust me, Amazon has good reviews too :). There are three main steps in the analysis performed here:

  1. Feature Extraction
  2. Model Development
  3. Performance Evaluation

Feature Extraction

The objective of this step is to ensure that the data input to the model (called the corpus in Natural Language Processing) is in the most useful form: one that facilitates the extraction of “maximum” information. To this end, the following were carried out (a code sketch follows the list):

1. Removal of punctuation from each sentence in the predictor variable.
2. Retaining only alphabetic words, i.e. excluding numbers and alphanumeric tokens.
3. Removal of common English stop words. Stop words are words whose presence does not alter the meaning of a sentence.
4. Elimination of all one-letter words, e.g. ‘a’, ‘I’.
5. Conversion of all words to lower case for consistency.
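
A minimal sketch of these steps, assuming NLTK's English stop-word list; the sample sentence is illustrative.

import re
from nltk.corpus import stopwords  # run nltk.download('stopwords') once

STOP_WORDS = set(stopwords.words('english'))

def clean(sentence):
    words = re.findall(r'[a-zA-Z]+', sentence)      # drop punctuation and numbers
    words = [w.lower() for w in words]              # lower-case for consistency
    return [w for w in words
            if w not in STOP_WORDS and len(w) > 1]  # drop stop words and 1-letter words

print(clean("So there is no way for me to plug it in here in the US."))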

Further in the feature extraction step, it is important to define the pipeline that will be used for the input data. Here I used a basic, single-word tokenizer for each word in a sentence. Note that the “tokens” produced take the initial preprocessing steps into account. Below is an example of the tokens for the first row of input; you can see that the words “and”, “I”, “to” and “the” have been removed compared to Figure 1.

First row of input data after tokenizing

The next step in feature extraction is to create the vocabulary: the list against which each line in the corpus will be compared. There are several ways to design a vocabulary, but here we use the list of individual words in the corpus, without repetition. Each sentence in the corpus is then compared against this set. The process of comparing the corpus with the vocabulary to create a vector is called vectorisation.

There are many approaches to vectorisation, ranging from basic counting to the Term Frequency-Inverse Document Frequency method. The basic counting technique produces a feature vector based on the occurrence of each dictionary word in each input sample. Term Frequency-Inverse Document Frequency (TF-IDF), on the other hand, accounts for the frequency or rarity of words in the corpus: rare words hold more weight, since they could contribute more to determining whether a review is positive or negative, while frequently occurring words are penalised. Initial tests on this dataset showed better performance for TF-IDF than for the basic count vector approach, so TF-IDF was used in this project, as sketched below.
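
A minimal sketch of TF-IDF vectorisation with scikit-learn; the tiny corpus is illustrative.

from sklearn.feature_extraction.text import TfidfVectorizer

corpus = ["great product, works perfectly",
          "terrible, broke after one day",
          "works great, highly recommended"]

# Builds the vocabulary from the corpus and weights each term by TF-IDF
vectorizer = TfidfVectorizer(stop_words='english')
X = vectorizer.fit_transform(corpus)   # sparse matrix, one row per sentence
print(vectorizer.get_feature_names_out())
print(X.toarray().round(2))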

Model Development

Three models were built and tested. Each was evaluated on both the training set and the validation set using a cross-validation technique. The goal was to use a single-hyperparameter search to determine the performance of each model across a range of values for that hyperparameter.

  1. Logistic Regression: Logistic regression was fitted to the vectorized training data obtained using the TfidfVectorizer function in sklearn. The hyperparameter varied here is the C value. The accuracy scores were evaluated for the training and validation sets using k-fold cross-validation with k = 10.
  2. MLP Classifier: An MLP classifier was built and its accuracy observed using a similar cross-validation technique (k = 10). The hyperparameter varied was the number of hidden layers.
  3. Support Vector Machine (SVM): A Support Vector Machine was also fitted to the transformed input data. SVMs are frequently used in natural language tasks, especially sentiment analysis, hence their adoption here. Gamma is the free parameter of the radial basis function kernel, which intuitively controls how far the influence of a single training example reaches. The performance of the model was observed for varying values of gamma (also with k-fold cross-validation, k = 10).

Model Performance

  • Logistic Regression Model Performance

Figures 2a and 2b show the mean and standard deviation of the accuracy of the model. The C value in the logistic model was varied within the range 0.1 to 10; some experimentation was done to select this range, as the accuracy essentially flattens out for C values above 10. The resulting graph is given in Figure 2a. As shown in the figure, the model achieves higher training accuracy than validation accuracy across the whole range of C values. The validation accuracy, like the training accuracy, increases sharply at low C values but plateaus shortly afterwards before decreasing slightly at higher C values. Comparison of Figures 2a and 2b shows some overfitting at higher C values. The maximum validation accuracy of 82.4% is achieved within the specified range of C.

Figure 2a: Mean Accuracy Score across a range of C values for Logistic Regression Model
Figure 2b: Standard deviation of Accuracy Score across a range of C values for Logistic Regression Model
  • Support Vector Machine (SVM) Model Performance

The gamma value of the support vector machine was varied and the accuracy curves shown in Figures 3a and 3b were observed. For gamma values less than 0.5, an increase in gamma results in a corresponding increase in accuracy on both the training and validation sets. As gamma increases beyond 0.5 there is clear evidence of overfitting, as the accuracy on the validation set decreases while that on the training set continues to rise. The best validation accuracy within the range of gamma values occurs around gamma = 0.5.

Figure 3a: Mean Accuracy Score across a range of Gamma for SVM
Figure 3b: Standard Deviation of Accuracy Score across a range of Gamma for SVM
  • Neural Network Model Performance

The number of hidden layers in the Neural Network was varied between 1 and 10 to evaluate the accuracy of the model on the training and testing sets; increasing the number of hidden layers above 10 does not improve the model, hence the adoption of this range. The resulting graphs are given in Figures 4a and 4b. The maximum validation accuracy of 80.25% was obtained using 2 hidden layers. The training accuracy is fairly constant across the range of numbers of hidden layers. The standard deviation curve shows an undulating pattern of highs and lows, and a relatively high standard deviation is observed even at 2 hidden layers.

Figure 4a: Mean Accuracy Score across a range of Number of Hidden layers for a Neural Network
Figure 4b: Standard Deviation of Accuracy Score across a range of Number of Hidden layers for a Neural Network

Conclusion

The logistic model is the leading model in terms of accuracy score for the Bag of Words sentiment analysis carried out here. It is also quite good at avoiding overfitting (Figure 2a): the overfitting is mild and occurs only at extreme values of C. The logistic regression model is also quite stable, as shown in the standard deviation curve, especially at C values greater than 4 (Figure 2b). The SVM comes second, though its overfitting potential is high at large values of gamma (Figure 3a). The Neural Network comes third, with a model accuracy lower than both the logistic regression and the SVM; its standard deviation is also quite unstable across the range of hyperparameter values selected (Figure 4b).

Further Work

This project takes a very simple approach, and there is substantial room for improvement. The model could possibly be improved by using a bi-gram or tri-gram vectoriser instead of the simple unigram vectoriser used here. “Gram” simply refers to the number of words combined into one vocabulary entry: instead of single words, every two words (bi-gram) or three words (tri-gram) form a component of the vocabulary space. This can add more meaning to the model. For example, given the sentence “not like it”, the bi-gram tokens will be [“not like”, “like it”], which carry different meanings and could thus influence the predicted label (a sketch is given below). We could also use other models, such as word embeddings using GloVe vectors, and I will write on improving this model in my next post. Stay tuned 🙂
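
A minimal sketch of the bi-gram extension in scikit-learn: ngram_range=(1, 2) makes the vocabulary contain both single words and adjacent word pairs.

from sklearn.feature_extraction.text import TfidfVectorizer

# ngram_range=(1, 2) keeps unigrams and adds bi-grams to the vocabulary
bigram_vectorizer = TfidfVectorizer(ngram_range=(1, 2))
X = bigram_vectorizer.fit_transform(["did not like it", "did like it"])
print(bigram_vectorizer.get_feature_names_out())
# ['did', 'did like', 'did not', 'it', 'like', 'like it', 'not', 'not like']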