I had to write an OpEd

One of my classes required me to write a “provocative issue paper”, and I decided to write about why we should stop collecting student data. In the interest of putting out something this week, here is what I wrote:

It is becoming cliché to comment on the degree to which we live in an increasingly virtual world. However, this is a fact that policymakers and voters should be repeating as a mantra. It is without a doubt the most important shift of the 21st century. Sure, by the end of the year 2000 the dotcom bubble had already burst, but the five years leading up to that were just a short preamble to the incredible and continuing changes wrought by our increasingly technological world.

One of the consequences of our new virtual reality is that corporations, governments, and researchers have the ability to collect data at unprecedented scale. Netflix collects a data point every second you are on its site or app, even if you do nothing. The NSA had, for a time, the ability to access metadata on every call and text we make. And Pearson… actually, it’s not clear what Pearson does with its data. I am not picking on Pearson either: whether it’s Khan Academy, Coursera, or Houghton Mifflin Harcourt, educational technology companies have incredible reach into the educational lives of students, with very little oversight or transparency about what data they are collecting, much less what they do with it. Because of the dangers inherent both in the existence of large volumes of student data and in its use in AI and ML, we would be better off, as a society, with strict restrictions on its collection.

Of course, these educational companies and not-for-profits will claim that they collect data only for the purpose of improving student outcomes, but a decade into the big data revolution, no big data research has fundamentally changed the way we teach or learn. In his book, “Failure to Disrupt”, Justin Reich covers the few discoveries that have come from data collected from MOOCs and “personal” tutoring systems and finds that they are either rediscoveries from the 1990s and earlier or intuitively obvious. It is hard to prove a lack of progress from educational big data, but I would challenge any reader to ask their local education Ph.D. for a paper that used big data and that changed their mind about how teaching and learning should occur.

The lack of progress to date is not, on its own, a reason to curtail the collection of data or the attempts to make progress with research. The reason we need to slow or stop collection is instead the inherently toxic nature of data.

We do not, cannot, and will never know all of the things that can be predicted about us with data. Maybe students who log into Khan Academy earlier in the morning are more likely to become senators and the ones who log in later are more likely to go to jail. Maybe students whose usernames include numbers are more likely to suffer from depression or anxiety. These are correlation hypotheses and they are uncountably infinite, but each is a chance for a company or government to make choices about students. To sell them products or to sell them as products, to hire them, fire them, or raise the cost of their car insurance.

We really don’t know what power this data holds, but what we do know, from books like “Weapons of Math Destruction” and “Algorithms of Oppression”, is that this data can be combined with AI and ML algorithms to do immense harm. We know that, in general, the artificial intelligences we train reproduce and exaggerate the inequalities and biases of our society. That they tend to assume the best of the more privileged population and the worst of those who have less.

In short, the collection of this data and its use for any purpose represents a serious risk to our students, especially students who are children, and to our society, with no noticeable upsides to date. I am no policy wonk, but I have a rough idea of how we could protect against these risks: FERPA needs to be reformed to specifically force creators of educational technology to clear all data collection with an external board, similar to an IRB, and to delete that data after a reasonable and relatively short period of time. Data approved for analysis should be stored and analyzed in a manner that provides for differential privacy. Ultimately, we need to treat student data as we treat medical data, not as an afterthought.
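
For readers who have not met differential privacy before, here is a rough sketch of the core idea, not a production recipe. It answers a counting question about students by adding Laplace noise scaled to a privacy budget `epsilon`, so that no single student's record changes the answer much. The function name `private_count`, the login-hour data, and the choice of `epsilon` are all hypothetical and only for illustration. Real systems should use a vetted library, since naive floating-point noise like this has known weaknesses.

```python
import random


def private_count(records, predicate, epsilon, rng=None):
    """Return a noisy count of records matching predicate.

    Uses the Laplace mechanism: a count changes by at most 1 when one
    record is added or removed (sensitivity 1), so Laplace noise with
    scale 1/epsilon gives epsilon-differential privacy for this query.
    """
    if epsilon <= 0:
        raise ValueError("epsilon must be positive")
    if rng is None:
        rng = random.Random()

    true_count = sum(1 for record in records if predicate(record))

    # The difference of two independent exponential draws with rate
    # epsilon is a Laplace(0, 1/epsilon) sample.
    noise = rng.expovariate(epsilon) - rng.expovariate(epsilon)
    return true_count + noise


# Hypothetical data: the hour of day at which each student logged in.
login_hours = [6, 7, 9, 14, 22, 23]

# "How many students logged in before 8 am?" answered with noise.
noisy = private_count(login_hours, lambda hour: hour < 8, epsilon=1.0)
print(f"Noisy count of early logins: {noisy:.2f}")
```

A smaller `epsilon` means more noise and stronger privacy; the analyst still sees roughly how many students logged in early, but cannot confidently tell whether any particular student is in the data.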