By Harris Hardiman-Mostow
Mentor: James Murphy, Mathematics; Funding Source: NSF TRIPODS
Hello! My name is Harris, and I’m a senior at Tufts studying Mathematics and Mechanical Engineering. This summer and fall I have been working under the supervision of Prof. James Murphy in the Department of Mathematics, along with Marshall, a PhD student at Tufts, and Ope, an undergraduate from Penn State, researching methods of anomaly detection.
Anomaly detection is a subfield of data science and, as the name suggests, it’s concerned with identifying when something unexpected happens. Usually this is defined in precise mathematical terms, and there are applications in many fields: tumor diagnosis and fraud detection, for example, both employ anomaly detection algorithms, so it’s quite a high-impact field. Sometimes we are concerned with exactly how to define an anomaly, and other times we want to figure out the best ways to identify anomalies. In this project, we are primarily concerned with the latter. The NSF, which collected and distributed the data we used, provides an exact definition of an anomaly, which I summarized at the bottom of the “Introduction” section and which is visualized in Figure 4.
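To give a flavor of what detecting an anomaly can look like, here is a toy sketch in Python. The rule below (flagging hours that deviate far from the typical flow at that hour of day) is entirely hypothetical and is not the NSF’s definition; the synthetic data and the injected spike are made up for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic week of hourly traffic: a daily sine pattern plus noise,
# with one artificial spike injected at hour 60. All values are made up.
hours = np.arange(24 * 7)
flow = 100 + 50 * np.sin(2 * np.pi * hours / 24) + rng.normal(0, 5, hours.size)
flow[60] += 200  # the injected "anomaly"

# Toy rule (NOT the NSF's): compare each observation to the median flow
# at the same hour of day, and flag large deviations.
hour_of_day = hours % 24
typical = np.array([np.median(flow[hour_of_day == h]) for h in range(24)])
residual = flow - typical[hour_of_day]
anomalies = np.abs(residual) > 4 * residual.std()

print("flagged hours:", np.flatnonzero(anomalies))
```

Using the median of each hour-of-day keeps the baseline robust, so a single large spike doesn’t drag the “typical” value toward itself.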
The data we used this summer came from the NSF’s Algorithms for Threat Detection Challenge, a sort of competition the NSF organizes for research groups to encourage investigation in the area. The goal of this year’s challenge was to perform the best possible anomaly detection on sparse traffic flow data. The NSF provides 3 distinct datasets. One of them, very creatively called City 1, contains 500 sensors, each with 2 years’ worth of hourly traffic flow, and, importantly, every data point is labeled as an anomaly or not. This is the so-called “training” dataset. Figure 1 shows an example week of data. As you can see, every traffic flow observation is available to us to experiment with.
The 2nd and 3rd datasets are sparse – they contain somewhere between 1 and 20 percent of the traffic flow observations, and the NSF evaluates our final model on its ability to correctly label these sparse data points as anomalies or not. These are the “testing” sets. It’s common practice in data science to experiment on a training set, improving the efficacy of different models and algorithms, before applying them to the testing set.
Because we’re given a precise rule for labeling anomalies, we can perform anomaly detection with 100% accuracy if we can fill in the gaps in the sparse data. That is to say, if we can recreate the full “picture” of traffic flow, as illustrated in Figure 3, we can create the chart shown in Figure 4, which lets us explicitly label each point as an anomaly or not.
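The fill-in-then-label idea can be sketched in a few lines. To be clear, this is not our actual algorithm (that is summarized in the “Methodology and Results” section): the gap-filling here is plain linear interpolation, the synthetic series is made up, and the threshold rule at the end is hypothetical, standing in for whatever explicit rule one is given.

```python
import numpy as np

rng = np.random.default_rng(1)

# A made-up "complete" hourly traffic series for one week.
t = np.arange(24 * 7)
true_flow = 100 + 50 * np.sin(2 * np.pi * t / 24)

# Keep roughly 10% of observations, mimicking the sparse test sets
# (the challenge data retains between 1 and 20 percent).
observed = rng.random(t.size) < 0.10
sparse_t, sparse_flow = t[observed], true_flow[observed]

# Baseline gap-filling: linear interpolation between observed points.
filled = np.interp(t, sparse_t, sparse_flow)

# Once a full series is recovered, any explicit labeling rule can be
# applied deterministically; this threshold is invented for illustration.
labels = filled > 140
print(labels.sum(), "of", t.size, "hours labeled anomalous")
```

The point is the order of operations: estimate the missing values first, then apply the exact rule to the reconstructed series, so the labeling step itself introduces no error.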
As a result, our problem becomes a sparse data problem rather than an anomaly detection problem per se. The algorithm we used to estimate the missing data is summarized in the “Methodology and Results” section. Ideally, I would be able to tell you more about how the algorithm performs on data it has never seen before – testing data from a different city – but unfortunately the NSF has yet to fully publish the datasets used to evaluate our methods. However, we do know our algorithm finished 4th overall among all the research teams working on this, which we were all very excited about!
Thanks for reading!