JP Morgan put out an interesting financial-markets note on “big data”, illustrating the changes it could bring and the value it could create for vendors. I agree with both those points in principle, but I think there’s a bit more to the problem (and thus to the realization of the opportunity) than the note suggests.
“Big data” is the term we use to describe the mountains of “uncategorized” or “unstructured” information that businesses generate. Unlike the business-transactional data that’s neatly packaged into RDBMS tables, big data is…well…just out there. It’s tempting to say that if we analyzed it thoroughly and correctly, we could make better decisions. It’s also likely correct, but that doesn’t go far enough.
There’s insight buried in virtually every form of information, but is it “valuable” insight? That depends on the net of what the extracted knowledge is worth and what extracting it costs. My argument is that most of our big data hype tends to presume that extraction is nearly cost-less and that the value is easy to recognize. On a broad scale, neither assumption holds up. On some narrower scales, though, the problem is tantalizing because it offers us a vision of what harnessing big data might really take.
Take health care as an example. Everyone says we can improve patient care by harnessing big data, but in most cases the examples that are cited (including the ones in the JP Morgan report) are really not about “big data” at all but about better use of regular data. There are mountains of prescription information, patient information, and so on, and it’s often fairly well categorized, but it’s separated into chunks by administrative boundaries: pharmacies have some, insurance companies have some, and doctors have some. Could we get things out of the combination that we can’t extract from the pieces? Sure, but how do we cross those company borders? Nobody is going to hand the other guy the data to keep, nor are they going to pay to store others’ data in duplicate on their own systems.
I think what we really need to be thinking about isn’t so much big data but “big query”. We already have a cloud model (Hadoop) that’s designed to push processing out to wherever the data needed to answer a question resides. The problem is that these systems presume the data can be independently analyzed in each of the repositories in which it’s stored. If we want to analyze the correlation of two variables, we need both of them in the same place, and that might well mean we have to suck all the big data into one massive virtual database to analyze, which would be enormously costly. Further, we could have done that in the old days by shipping tapes, and even then it was unwieldy. We need to make technology work better for us on this problem.
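To make that distinction concrete, here’s a minimal sketch (my own illustration, not anything from the note or a particular product) of why co-location matters. When the two variables you want to correlate live together on each node, the nodes can return a handful of partial sums and the center can compute the correlation without moving any raw records; the node_a/node_b data and the functions are invented for the example.

```python
import math

def local_sums(records):
    """Run on each data node: reduce its (x, y) rows to six partial sums."""
    n = len(records)
    sx = sum(x for x, _ in records)
    sy = sum(y for _, y in records)
    sxx = sum(x * x for x, _ in records)
    syy = sum(y * y for _, y in records)
    sxy = sum(x * y for x, y in records)
    return (n, sx, sy, sxx, syy, sxy)

def combine(partials):
    """Run centrally: merge the partial sums and compute Pearson correlation."""
    n, sx, sy, sxx, syy, sxy = map(sum, zip(*partials))
    cov = n * sxy - sx * sy
    var_x = n * sxx - sx * sx
    var_y = n * syy - sy * sy
    return cov / math.sqrt(var_x * var_y)

# Hypothetical repositories where x and y happen to live in the same rows.
node_a = [(1.0, 2.1), (2.0, 3.9), (3.0, 6.2)]
node_b = [(4.0, 8.1), (5.0, 9.8)]
print(combine([local_sums(node_a), local_sums(node_b)]))

# If x lived in one organization's repository and y in another's, keyed by
# patient, neither side could compute these sums alone: the rows would first
# have to be joined across the administrative boundary, which is exactly the
# expensive step described above.
```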
Big query would have to be a technique to perform culling on big data in its distributed form, to extract from it the elements that might meet the correlative criteria on which most data analysis has always depended. That means sending screening queries to each data node, having them reply with results, and then correlating them centrally. This is almost a data topology problem in one sense; based on the results from each data node we could pick the cheapest place to join them. It may also be a probabilistic problem, like finding the God Particle. We could apply screening criteria with the knowledge that each level of screening would increase the risk we’d miss something by excluding it but decrease the burden of the analysis needed. So maybe we say “I want a three-sigma or less risk of exclusion” and see what data volumes and costs result, then maybe increase that risk to two sigmas if we can’t afford the results.
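Here’s a toy sketch of that screening-then-join mechanic, again my own illustration rather than an existing system: each node answers a cheap screening query with candidate keys only, and the center uses the result sizes to pick the cheaper place to perform the actual join. The node names, the predicates, and the cost model are all hypothetical.

```python
from dataclasses import dataclass

@dataclass
class Node:
    name: str
    rows: dict            # key -> locally held value for that key
    transfer_cost: float  # relative cost of shipping one row off this node

def screen(node, predicate):
    """Screening query run at the node: return only the keys whose local
    value passes the screen, not the raw rows themselves."""
    return {k for k, v in node.rows.items() if predicate(v)}

def plan_join(node_a, node_b, keys_a, keys_b):
    """Pick the cheaper place to do the real join: ship the screened key
    set that costs less to move, and join at the other node."""
    cost_ship_a = len(keys_a) * node_a.transfer_cost
    cost_ship_b = len(keys_b) * node_b.transfer_cost
    if cost_ship_a <= cost_ship_b:
        return node_b.name, cost_ship_a
    return node_a.name, cost_ship_b

# Hypothetical repositories: a pharmacy holds prescription records and an
# insurer holds claim records, both keyed by (anonymized) patient id.
pharmacy = Node("pharmacy", {p: 10 + p for p in range(1000)}, 1.0)
insurer = Node("insurer", {p: 100 - p for p in range(500)}, 2.0)

# Tighter screens keep fewer keys (cheaper, but a higher risk of excluding
# something relevant); looser screens keep more. That's the sigma trade-off
# in the text, reduced here to a simple value cutoff.
keys_a = screen(pharmacy, lambda v: v > 500)
keys_b = screen(insurer, lambda v: v < 90)

where, cost = plan_join(pharmacy, insurer, keys_a, keys_b)
print(f"join at {where}, estimated shipping cost {cost}")
```

The point of the sketch is that only the screened key sets ever cross the administrative boundary; loosening or tightening the screens is where the exclusion-risk-versus-cost dial would live.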
I agree big data is a great opportunity, as I said. I think we need to start thinking about the specifics of how to address it and not sweep them aside in an orgy of enthusiasm. That’s the only way to make sure that the big data wave doesn’t become just a hype-wave flash in the pan.