Big Data vs. A Lot of Data

The term Big Data is getting thrown around a lot lately. As is the case with most buzzwords, people have begun to use it to describe a broad category of loosely related ideas (similar things happened to “innovation”, “social”, and “Web 2.0”). If that weren’t enough, add all the hype and marketing from hardware, software, and services firms pushing the “importance of Big Data”, and finding any real clarity becomes nearly impossible.

A lot of people seem to be using “big data” as a proxy for systems at scale and the data that comes with those systems. The general suggestion is that if you have a large system with lots of users, there must be patterns hidden in that data. And it follows that those hidden patterns must be worth something to somebody (right?)…so there’s gold in them there digital hills. (So many prospecting references in the data world: mining, sharding, etc.)

I had the good fortune of hearing Cesar Hidalgo speak this week at the Media Lab. He spends a lot of time thinking about networks and large data sets, and he had some great thoughts on the topic. In his talk, Hidalgo laid out a nice framework for distinguishing Big Data from a lot of data, built around three simple qualifying questions.

– Do you have size? – This is relative to the problem you’re working on, but it’s usually in the hundreds of thousands or millions of records. You’ll need enough to provide statistical significance across your population, and the larger the data set, the more edges you may be able to discover.

– Do you have resolution? – This brings some analysis to the data at hand. Just as all rock does not contain gold, all data does not contain (new) patterns. Low-fidelity data might be all customer transactions with only order-level detail (total amount spent, etc.). High-fidelity data would be all the customer transactions with item-level detail (the things the customer purchased to make up each transaction). Visa has the former and Amazon has the latter, and it’s no surprise Amazon knows you better. High-resolution data will illuminate new patterns, like Target’s recent misstep of identifying a pregnant teen before she could tell her father.
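The order-level vs. item-level distinction can be made concrete with a minimal sketch. All customer names, dates, items, and amounts below are invented for illustration; this is not real Visa, Amazon, or Target data:

```python
# Hypothetical records illustrating the two resolutions (all values invented).

# Order-level data (the Visa-style view): you know when and how much, not what.
order_level = [
    {"customer": "c1", "date": "2012-03-01", "total": 42.50},
    {"customer": "c1", "date": "2012-03-08", "total": 17.25},
]

# Item-level data (the Amazon-style view): the same transactions, plus contents.
item_level = [
    {"customer": "c1", "date": "2012-03-01",
     "items": [("prenatal vitamins", 12.50), ("unscented lotion", 30.00)]},
    {"customer": "c1", "date": "2012-03-08",
     "items": [("cotton balls", 5.25), ("zinc supplement", 12.00)]},
]

# With only order-level data, the analysis tops out at spend-over-time.
total_spend = sum(t["total"] for t in order_level)

# With item-level data you can ask what was bought, not just how much --
# the kind of signal behind the Target pregnancy-prediction story.
purchased = {name for txn in item_level for name, _ in txn["items"]}
```

Both data sets describe the same two transactions; only the second one can surface a pattern in the products themselves.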

– Do you have scope? – This question considers the reach of your data. Are you only gathering data against a very focused problem, or are you gathering data that will give you insight beyond your core business? Being able to understand patterns outside your immediate market creates new opportunities for understanding. As an example, Hidalgo spoke about telephone companies: they know your calling patterns, but they can also make determinations about mobility patterns, because they know which cell towers you’ve used throughout your day.
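Hidalgo’s telephone example can be sketched the same way. The call records and tower names below are hypothetical; the point is that a data set collected for one purpose (billing calls) carries a second signal (movement) when its scope is wide enough:

```python
# Hypothetical call-detail records (all values invented): each call logs the
# cell tower that handled it, as a byproduct of ordinary billing data.
calls = [
    {"caller": "c1", "hour": 19, "tower": "suburb-west"},
    {"caller": "c1", "hour": 8,  "tower": "downtown"},
    {"caller": "c1", "hour": 13, "tower": "midtown"},
]

# The core-business view: how much calling happened.
calls_made = len(calls)

# The beyond-core view: the same records, re-read as a mobility trace by
# ordering the towers visited over the day.
daily_path = [c["tower"] for c in sorted(calls, key=lambda c: c["hour"])]
```

No new data was gathered for the second question; the scope of the original records was simply wide enough to answer it.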

So, though there’s a lot of noise around this space, there’s also a lot to be done here. And as the hardware, software, and services companies wind people up to capture more data, there will be more patterns to discover – this space is very self-fulfilling like that. Along those lines, one stat came up during the talk: 70% of all data captured about people is gathered by machines. As we put more sensors in everything, we’ll push that ratio even further.

Getting beyond the hype, I’m excited to see what new patterns emerge from deeper analysis of data. There’s definitely space for data scientists to unearth new patterns that help designers create new experiences. But to be sure, the real opportunity isn’t in Big Data; it’s in gaining better resolution on the problems we’re trying to solve and the markets we’re trying to serve.

(If you’re into this sort of thing, here’s another talk by Cesar Hidalgo. It’s really nice, definitely worth your time.)