By Samuel Arbesman, Published: August 16, Washington Post
Samuel Arbesman, an applied mathematician and network scientist, is a
senior scholar at the Ewing Marion Kauffman Foundation and the author of
“The Half-Life of Facts.”
Big data holds the promise of harnessing huge amounts of
information to help us better understand the world. But when talking about big
data, there’s a tendency to fall into hyperbole. It is what compels contrarians
to write such tweets as
“Big Data, n.: the belief that any sufficiently large pile of s--- contains a
pony.” Let’s deflate the hype.
1. “Big data” has a clear definition.
The term “big data” has been in circulation since at least the 1990s,
when it is believed to have originated in Silicon Valley. IBM offers a
seemingly simple definition:
Big data is characterized by the four V’s of volume, variety, velocity and
veracity. But the term is thrown around so often, in so many contexts —
science, marketing, politics, sports — that its meaning has become vague and
ambiguous.
There’s general agreement that ranking every page on the
Internet according to relevance and searching the
phone records of every Verizon customer in the United States qualify as
applications of big data. Beyond that, there’s much debate. Does big data need
to involve more information than can be processed by a single home computer? If
so, marketing
analytics wouldn’t qualify, and neither would most of the work done by Facebook.
Is it still big data if it doesn’t use certain tools from the fields of
artificial intelligence and machine learning? Probably.
Should narrowly focused industry efforts to glean consumer
insight from large datasets be grouped under the same term used to describe the
sophisticated and varied things scientists are trying to do? There’s a lot of
confusion, and industry experts and scientists often end up talking past one
another.
2. Big data is new.
By many accounts, big data exploded onto the scene quite
recently. “If wonks were fashionistas, big data would be this season’s hot new
color,” a Reuters report quipped last year. In a
May 2011 report, the McKinsey Global Institute declared big data “the next
frontier for innovation, competition, and productivity.”
It’s true that today we can mine massive amounts of data —
textual, social, scientific and otherwise — using complex algorithms and
computer power. But big data has been around for a long time. It’s just that
exhaustive datasets were more exhausting to compile and study in the days when
“computer” meant a person who performed calculations.
Vast linguistic datasets, for example, go back nearly 800
years. Early biblical concordances — alphabetical indexes of words in the
Bible, along with their context — allowed for some of the same types of
analyses found in modern-day textual data-crunching.
The sciences also have been using big data for some time. In
the early 1600s, Johannes Kepler used Tycho Brahe’s detailed astronomical
dataset to elucidate certain laws of planetary motion. Astronomy in the age of
the Sloan Digital Sky Survey is
certainly different and more awesome, but it’s still astronomy.
Ask statisticians, and they will tell you that they have
been analyzing big data — or “data,” as they less redundantly call it — for
centuries. As they like to argue, big data isn’t much more than a sexier
version of statistics, with a few new tools that allow us to think more broadly
about what data can be and how we generate it.
3. Big data is revolutionary.
In their new book, “Big Data: A Revolution That Will Transform How We
Live, Work, and Think,” Viktor Mayer-Schonberger and Kenneth Cukier
compare “the current data deluge” to the transformation brought about by
the Gutenberg printing press.
If you want more precise advertising directed toward you,
then yes, big data is revolutionary. Generally, though, it’s likely to have a
modest and gradual impact on our lives.
When a phenomenon or an effect is large, we usually don’t
need huge amounts of data to recognize it (and science has traditionally
focused on these large effects). As things become more subtle, bigger data
helps. It can lead us to smaller pieces of knowledge: how to tailor a product
or how to treat a disease a little bit better. If those bits can help lots of
people, the effect may be large. But revolutionary for an individual? Probably
not.
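The link between effect size and the amount of data needed can be made concrete with a rough calculation. The Python sketch below uses the standard textbook approximation for comparing two group means; the significance level, the power target and the three effect sizes are assumptions chosen for illustration, not figures from any study mentioned here.

```python
import math
from statistics import NormalDist


def samples_per_group(effect_size, alpha=0.05, power=0.80):
    """Approximate sample size per group for a two-sided, two-sample z-test.

    effect_size is the difference in means divided by the common standard
    deviation. This is a textbook approximation, not an exact power
    calculation.
    """
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)
    z_power = NormalDist().inv_cdf(power)
    return math.ceil(2 * ((z_alpha + z_power) / effect_size) ** 2)


# Illustrative effect sizes: large, small, and very subtle.
for d in (0.8, 0.2, 0.05):
    print(f"effect size {d}: about {samples_per_group(d)} samples per group")
```

Because the required sample grows with the inverse square of the effect size, halving the effect roughly quadruples the data needed. That is why subtle effects, and only subtle effects, call for very large datasets.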
4. Bigger data is better.
In science, some admittedly mind-blowing big-data analyses
are being done. In business, companies are being told to “embrace big data
before your competitors do.” But big data is not automatically better.
Really big datasets can be a mess. Unless researchers and
analysts can reduce the number of variables and make the data more manageable,
they get quantity without a whole lot of quality. Give me some quality medium
data over bad big data any day.
And let’s not forget about bias. There’s a common misconception
that throwing more data at a problem makes it easier to solve. But if there’s
an inherent bias in how the data are collected or examined, a bigger dataset
doesn’t help. For example, if you’re trying to understand how people interact
based on mobile phone data, a year of data rather than a month’s worth doesn’t
address the limitation that certain populations don’t use mobile phones.
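A toy simulation illustrates the point. The Python sketch below assumes an invented population in which 70 percent of people use mobile phones and average 12 contacts a day, while the remaining 30 percent average six; all of these numbers are hypothetical. Sampling only phone users gives an answer that becomes more precise as the sample grows, but no closer to the truth.

```python
import random

# Hypothetical population: 70% phone users averaging 12 daily contacts,
# 30% non-users averaging 6. These figures are invented for illustration.
PHONE_USER_MEAN = 12.0
TRUE_MEAN = 0.7 * PHONE_USER_MEAN + 0.3 * 6.0  # whole-population average


def biased_estimate(sample_size, seed=0):
    """Average daily contacts estimated from phone users only."""
    rng = random.Random(seed)
    sample = [rng.gauss(PHONE_USER_MEAN, 3.0) for _ in range(sample_size)]
    return sum(sample) / len(sample)


for n in (1_000, 100_000):
    estimate = biased_estimate(n)
    print(f"n={n}: estimate {estimate:.2f}, "
          f"error vs. whole population {estimate - TRUE_MEAN:+.2f}")
```

Under these assumptions the larger sample settles near 12, the phone-user average, and stays well away from the whole-population figure. More data shrinks the noise, not the bias.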
Many interesting questions can be explored with little
datasets. Big data has refined our idea of six degrees of separation:
Facebook has shown that it’s actually closer to four degrees. But the
first six-degrees study was done by psychologist Stanley Milgram using a
lot of cleverness and a small number of postcards.
Furthermore, although it’s exciting to have massive datasets
with incredible breadth, too often they lack much in the way of a temporal
dimension. To really understand a phenomenon, such as a social one, we need
datasets with large historical sweep. We need long
data, not just big data.
5. Big data means the end of scientific theories.
Chris Anderson argued in a 2008 Wired essay that big data renders
the scientific method obsolete: Throw enough data at an advanced
machine-learning technique, and all the correlations and relationships will
simply jump out. We’ll understand everything.
But you can’t just go fishing for correlations and hope they
will explain the world. If you’re not careful, you’ll end up with spurious
correlations. Even more important, to contend with the “why” of things, we
still need ideas, hypotheses and theories. If you don’t have good questions,
your results can be silly and meaningless.
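The danger of fishing is easy to demonstrate. The Python sketch below generates a short series of pure noise, then searches 1,000 other series of pure noise for the one that correlates with it best. The series length, the number of candidates and the function names are assumptions made for illustration; nothing in the data has any real relationship to anything else.

```python
import math
import random


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


def strongest_chance_correlation(n_points=20, n_candidates=1000, seed=0):
    """Largest absolute correlation found between unrelated random series."""
    rng = random.Random(seed)
    target = [rng.gauss(0, 1) for _ in range(n_points)]
    best = 0.0
    for _ in range(n_candidates):
        candidate = [rng.gauss(0, 1) for _ in range(n_points)]
        best = max(best, abs(pearson(target, candidate)))
    return best


print(f"best correlation found in pure noise: "
      f"{strongest_chance_correlation():.2f}")
```

With only 20 observations and 1,000 candidates, the search reliably turns up a correlation that would look impressive in isolation, even though it means nothing. A hypothesis stated before the search is what separates a finding from a coincidence.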
Having more data won’t substitute for thinking hard, recognizing
anomalies and exploring deep truths.