A little while back, we realized we were nipping at the heels of “big data” territory, arriving at nearly 1 terabyte of data. “We” being the team at Bankable Frontier Associates I work with, through a partnership with the Center for Emerging Markets Enterprises at Fletcher. That is not quite a size that would titillate folks who drink Hadoop and sleep in Elastic Clouds, but it was a sobering moment that called for some reflection on the limits of what we could do with all this information, even as we strained a pretty souped-up machine to its limits.

[Image: XKCD - Statistical significance]

Most of my concerns stem from the fact that big data has the disconcerting property of confessing to something – anything – under sufficient coercion. It’s a variation of the age-old problem of statistical correlation, aptly captured in the XKCD comic above.

As the venerable Nassim Taleb points out, “We’re more fooled by noise than ever before, and it’s because … with big data, researchers have brought cherry-picking to an industrial level. … I am not saying here that there is no information in big data. There is plenty of information. The problem — the central issue — is that the needle comes in an increasingly larger haystack.”

When you’re dealing with data on tens of millions of accounts and billions of transactions from financial institutions serving clients in eight countries, that’s a rather massive haystack to get lost in. In situations like this, it is all the more important to start from a set of null hypotheses that can be tested conclusively, which keeps us honest instead of chasing spurious connections.
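To make the cherry-picking worry concrete, here is a minimal Python sketch. It is purely illustrative and is not the analysis we ran: the sample sizes, the seed, and the function names (`pearson`, `p_value`, `count_false_hits`) are all assumptions made up for this example. It correlates one pure-noise "outcome" against hundreds of pure-noise "features" and counts how many look significant, first naively and then with a Bonferroni correction for the number of tests. The p-values use the Fisher z approximation rather than an exact t-test.

```python
import math
import random
from statistics import NormalDist


def pearson(xs, ys):
    """Plain Pearson correlation coefficient of two equal-length lists."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


def p_value(r, n):
    """Two-sided p-value for r under 'no correlation' (Fisher z approximation)."""
    z = math.atanh(r) * math.sqrt(n - 3)
    return 2 * (1 - NormalDist().cdf(abs(z)))


def count_false_hits(n_accounts=200, n_features=400, alpha=0.05, seed=1):
    """Count 'significant' correlations when every variable is pure noise.

    All sizes here are hypothetical, chosen only so the example runs quickly.
    """
    rng = random.Random(seed)
    outcome = [rng.gauss(0, 1) for _ in range(n_accounts)]
    naive = 0      # hits at the usual alpha, ignoring how many tests we ran
    corrected = 0  # hits that survive a Bonferroni correction
    for _ in range(n_features):
        feature = [rng.gauss(0, 1) for _ in range(n_accounts)]
        p = p_value(pearson(feature, outcome), n_accounts)
        if p < alpha:
            naive += 1
        if p < alpha / n_features:
            corrected += 1
    return naive, corrected


if __name__ == "__main__":
    naive, corrected = count_false_hits()
    print(f"naive hits: {naive}, after Bonferroni: {corrected}")
```

Since nothing in this data is related to anything else, every naive hit is a false one, and on average about `alpha * n_features` of them turn up. Fixing the hypotheses in advance, or at least correcting for how many were tried, is what stops the haystack from producing needles on demand.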

Which brings us to correlation vs causation. Yes, we have granular transaction data over the course of years for each account holder, meaning we know everything they are doing with that account. We may also have up to twenty characteristics of the client and the account type – age, gender, income, occupation, age of account, interest rate paid, etc.

But unless such studies are paired with detailed financial diaries, we know nothing of the individuals’ motivations for doing what they do, or of the rest of their financial portfolio and the financial tools at their disposal. This means we usually cannot say things like, “the average account holder saves Ksh X for her child’s school uniform”.

And that’s ok.

Causality in the social sciences is a hard problem. It’s not possible to hold “everything else constant” as we can in the hard sciences. Human free will allows for a mind-boggling array of choices, people may not always make the same decision despite being faced with the same choices, and sometimes an effect may have multiple contributing causes.

Quantitative researchers do the best they can to account for all possible explanatory variables and then attribute degrees of causality to certain variables. Because we don’t have all possible explanatory variables when dealing with big data, we restrict ourselves to demonstrating strong correlations. We usually end up indicating potential causal connections and letting others take it from there, such as field researchers who can conduct focus groups to dig in deep.

This may not sound intellectually gratifying, but it is once you get into the thick of things. Let’s consider the example of two savings types: A, which is a short-term, low-balance, almost transactional behavior, and B, which is accretionary savings over the course of a year leading to a decent balance. A is strongly correlated with ATM card usage, while B is strongly correlated with branch usage. 20-40% of all savings accounts seem to display A-type behavior, while about 1% display B-type behavior, across many of the financial institutions we have looked at that I can talk about. What questions come to mind? How about:

  • Do ATMs make it hard to save larger amounts over the long term because it’s just so easy to take money out? Do branches make it harder for folks to withdraw funds willy nilly and therefore save more over the long term?
  • Or do clients self-select to use ATMs when they need easy access to money, and move small amounts through that channel, leaving the larger transactions aimed at building a lump sum for some purpose to happen at branches, not least because they don’t feel safe hauling a satchel of cash to an ATM in the middle of nowhere?

The implications of potential answer(s) can be profound. The first would imply that while we have celebrated ATMs as a successful de-congestion measure for banks, reducing staff load, client wait-times and operational expenses associated with physical branches, they have also caused people to save less, which can be antithetical to the cause of financial inclusion. On the other hand, the second would imply that branches still have certain benefits that are not being captured by other channels, and more effort needs to be made to address this convenience/security factor.

Of course, as with any complex system, the actual answer probably contains kernels of truth from both possibilities, and then some. Unless ridiculously fortuitous natural experiments present themselves with just the right incentives, say through subtle product rule changes intended to “nudge” a certain type of behavior, it’s well nigh impossible to find answers to these kinds of questions, irrespective of how big “big data” is.
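A toy simulation shows why more rows alone do not settle the question. This is a hedged sketch with entirely made-up numbers, not our data: the balances, effect sizes, probabilities, and function names (`world_atm_causes_low_balance`, `world_self_selection`) are hypothetical. In the first world, ATM use itself lowers balances. In the second, a hidden trait (a need for ready cash, which we never observe) drives both the choice of channel and the balance, and the ATM has no effect at all. Both worlds produce a strong negative correlation between ATM use and balance.

```python
import math
import random


def pearson(xs, ys):
    """Plain Pearson correlation coefficient of two equal-length lists."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


def world_atm_causes_low_balance(n, rng):
    """Hypothetical world 1: channel is assigned at random and ATM use cuts balances."""
    uses_atm, balance = [], []
    for _ in range(n):
        atm = rng.random() < 0.5
        uses_atm.append(1.0 if atm else 0.0)
        balance.append(rng.gauss(100, 20) - (40 if atm else 0))
    return uses_atm, balance


def world_self_selection(n, rng):
    """Hypothetical world 2: a hidden need for liquidity drives both variables."""
    uses_atm, balance = [], []
    for _ in range(n):
        needs_liquidity = rng.random() < 0.5  # never recorded in the data
        atm = rng.random() < (0.9 if needs_liquidity else 0.1)
        uses_atm.append(1.0 if atm else 0.0)
        # The balance depends only on the hidden trait, not on the channel.
        balance.append(rng.gauss(100, 20) - (50 if needs_liquidity else 0))
    return uses_atm, balance


def compare_worlds(n=5000, seed=7):
    """Return the ATM-vs-balance correlation observed in each world."""
    rng = random.Random(seed)
    r_causal = pearson(*world_atm_causes_low_balance(n, rng))
    r_selection = pearson(*world_self_selection(n, rng))
    return r_causal, r_selection


if __name__ == "__main__":
    r_causal, r_selection = compare_worlds()
    print(f"ATM causes low balance: r = {r_causal:.2f}")
    print(f"Self-selection only:    r = {r_selection:.2f}")
```

Raising `n` only makes both correlations more precise; it does nothing to tell the two stories apart. What would separate them is either the hidden variable itself or something that changes channel use independently of the client's needs, which is exactly what a natural experiment provides.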

(Btw, having 1% of accounts display a particular type of behavior across different banks in different countries is highly interesting in itself, since there is nothing definitional that would force this to happen. But that’s another story.)

I, for one, sleep peacefully at night knowing that often, all I can expect to get from “big data” are glorified correlations; anything else is gravy.