The article that
prompted me to read Too Much To Know (blogged here, here, and here)
was the epilogue in a journal issue exploring the intersection of data
and history. The article is intriguingly titled “Big Data Is the Answer … But
What Is the Question?” (figure below shows authors, abstract, citation).
The authors
provide their musings on eight questions they ask to probe the concept of data
(listed in the abstract). I will explore a couple that attracted my attention.
I recommend reading the article in full, and possibly other articles in that
issue of Osiris, if you’re interested in knowing more.
Question 1 is
“What counts as data?” I thought this was obvious at first glance and then
quickly realized that I bring a rather narrow chemist’s idea of what
constitutes data. The authors use the example of crystallography: “… x-rays
diffracted by a crystal produce an image containing dark spots, whose
intensities are used to calculate ‘structure factors’, which in turn are used
to determine the coordinates of each atom composing the crystal. But where are
the data? Crystallographers were first content to publish atomic coordinates as
the ‘data’ supporting a proposed structure, before they were asked to provide
more foundational data…” Perhaps all of them are data. And given that I know
something about chemical structure determination via crystallography, I might
even add that simulation parameters used in various steps might be data of yet
another sort.
Additionally, what
counts as data may change over time. Being a computational chemist, I’ve certainly
experienced this when attempting to publish research results. Which data is
primary and should be part of the main article? Which is secondary and should
be moved to Supporting Information? The authors quote a philosopher writing
that data are “fungible objects defined by their portability and prospective
usefulness as evidence”. When something is categorized as data, it is being used,
at that moment in time, to support a knowledge claim. How many of us try to
bolster an argument by saying “I have data to support…”? I have done this many
times, especially when trying to get increased resources. But is the data sufficient?
Is that data relevant? Is it raw data? Is it derivative data? Has it been ‘interpreted’
in some way? Is it ‘compelling’?
Question 4 is
“What makes data measurable? What does quantification do to data?” If you use (standard
Shannon information) bits and bytes to measure data, then data sizes might be
quantified differently depending on what constitutes your base data. For
example, the authors discuss the ASCII text encoding system and how bytes can
represent the space of the ‘English’ alphabet and alphanumeric system. But ASCII
cannot represent Chinese characters or the accented letters of Czech. As a second example, “a 500-page
book and a single scanned photograph require the same number of bytes of
computer memory, yet from a human point of view, the book usually contains far
more information.” Comparisons may be tricky. When we say we are drowning in
petabytes of data, what does that really mean qualitatively? It’s very
challenging to measure data of different types with a single byte-based unit
system.
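As a rough illustration of the encoding point, here is a minimal Python sketch that counts how many bytes the same short word needs in different encodings. The sample words (written as Unicode escapes) and the function name `encoded_sizes` are illustrative assumptions, not taken from the article.

```python
def encoded_sizes(text):
    """Return the number of bytes needed to store `text` in a few encodings.

    An encoding that cannot represent the text at all is reported as None.
    """
    sizes = {}
    for encoding in ("ascii", "utf-8", "utf-16-le"):
        try:
            sizes[encoding] = len(text.encode(encoding))
        except UnicodeEncodeError:
            # ASCII only covers 128 code points, so accented or
            # Chinese characters cannot be stored in it.
            sizes[encoding] = None
    return sizes


# Hypothetical sample words, each meaning roughly "data".
samples = {
    "English": "data",
    "Czech": "\u00fadaje",      # u-acute followed by "daje"
    "Chinese": "\u6570\u636e",  # two Chinese characters
}

for language, word in samples.items():
    print(language, encoded_sizes(word))
```

The byte count depends on the encoding as much as on the text: the two Chinese characters take more bytes than the four English letters in UTF-8 but fewer in UTF-16, and neither the Czech nor the Chinese word can be stored in ASCII at all.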
What does
quantification do to data? I’m not sure, but trying to compare elephants and
oranges on the same metric might be very misleading. I’ve been thinking about
how to measure molecular complexity, as I’ve been probing the question of
whether chemical evolution in a non-equilibrium situation leads to the
formation of more ‘complex’ molecules. But as I’ve delved into the literature
on measuring molecular complexity, choosing an appropriate metric has become
decidedly more ‘complex’. Most approaches use
something akin to a Shannon or Boltzmann-type scale coupled with some ad-hoc
add-ons.
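To make the idea of a Shannon-type scale concrete, here is a toy Python sketch that computes the Shannon entropy of a molecule's elemental composition. It is only an assumption-laden illustration and not any published complexity metric: it looks at atom counts alone and ignores bonding, symmetry, and stereochemistry, which is exactly where the ad-hoc add-ons tend to come in. The function name `shannon_entropy_bits` is hypothetical.

```python
import math
from collections import Counter


def shannon_entropy_bits(symbols):
    """Shannon entropy (in bits per symbol) of a sequence of labels."""
    counts = Counter(symbols)
    total = sum(counts.values())
    if total == 0:
        return 0.0
    # H = -sum(p_i * log2(p_i)) over the observed label frequencies.
    return -sum((n / total) * math.log2(n / total) for n in counts.values())


# Atoms listed by element only (toy inputs; structure is ignored).
methane = ["C"] + ["H"] * 4                          # CH4
glycine = ["C"] * 2 + ["H"] * 5 + ["N"] + ["O"] * 2  # C2H5NO2

print("methane:", shannon_entropy_bits(methane))
print("glycine:", shannon_entropy_bits(glycine))
```

On this crude scale glycine scores higher than methane simply because its atoms are spread over more elements, and two isomers with the same formula would score identically, which hints at why composition alone is a poor stand-in for complexity.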
Up to this point I
haven’t touched on Big Data. The authors’ last two questions address this
aspect. In “Who owns data? Who uses data?” they briefly discuss “supply chains”
of data. One interesting historical aspect that I suspect has far-reaching
ramifications is that in recent years, “data suppliers, data managers, and data
users have become far more differentiated and specialized. As the collection,
organization and curation of data become increasingly professionalized, a
divide has appeared between the scientists who produce data, those who manage
it, and those who analyze it.” This has led to tension or even conflict between
data producers and data analyzers.
Big Data is here
to stay. Beyond privacy issues, we should be cautious and thoughtful about what
data signifies in whatever context it is being used. We should certainly ask
questions.
Some previous
posts on big data.