Wednesday, August 8, 2018

Questioning Data


The article that prompted me to read Too Much To Know (blogged here, here, and here) was the epilogue in a journal issue exploring the intersection of data and history. The article is intriguingly titled “Big Data Is the Answer … But What Is the Question?” (figure below shows authors, abstract, citation).


The authors provide their musings on eight questions they ask to probe the concept of data (listed in the abstract). I will explore a couple that attracted my attention. I recommend reading the article in full, and possibly other articles in that issue of Osiris, if you’re interested in knowing more.

Question 1 is “What counts as data?” I thought this was obvious at first glance and then quickly realized that I bring a rather narrow chemist’s idea of what constitutes data. The authors use the example of crystallography: “… x-rays diffracted by a crystal produce an image containing dark spots, whose intensities are used to calculate ‘structure factors’, which in turn are used to determine the coordinates of each atom composing the crystal. But where are the data? Crystallographers were first content to publish atomic coordinates as the ‘data’ supporting a proposed structure, before they were asked to provide more foundational data…” Perhaps all of them are data. And given that I know something about chemical structure determination via crystallography, I might even add that simulation parameters used in various steps might be data of yet another sort.

Additionally, what counts as data may change over time. Being a computational chemist, I’ve certainly experienced this when attempting to publish research results. Which data is primary and should be part of the main article? Which is secondary and should be moved to Supporting Information? The authors quote a philosopher writing that data are “fungible objects defined by their portability and prospective usefulness as evidence”. When something is categorized as data, it is being used, at that moment in time, to support a knowledge claim. How many of us try to bolster an argument by saying “I have data to support…”? I have done this many times, especially when trying to get increased resources. But is the data sufficient? Is it relevant? Is it raw data? Is it derivative data? Has it been ‘interpreted’ in some way? Is it ‘compelling’?

Question 4 is “What makes data measurable? What does quantification do to data?” If you use (standard Shannon information) bits and bytes to measure data, then data sizes might be quantified differently depending on what constitutes your base data. For example, the authors discuss the ASCII text encoding system and how bytes can represent the space of the ‘English’ alphabet and alphanumeric system. But ASCII has difficulty with Chinese or Czech, whose characters fall outside that space. As a second example, “a 500-page book and a single scanned photograph require the same number of bytes of computer memory, yet from a human point of view, the book usually contains far more information.” Comparisons may be tricky. When we say we are drowning in petabytes of data, what does that really mean qualitatively? It’s very challenging to measure data of different types with a single unit system such as ASCII bytes.
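To make the encoding point concrete, here is a minimal Python sketch of my own (it is not from the article). The sample words and the helper name byte_sizes are illustrative assumptions; the encodings are the standard ones built into Python. It counts how many bytes the same short word takes under different encodings, and records None when an encoding cannot represent the text at all.

```python
# Illustrative sketch: the sample words and byte_sizes are assumptions for this post.
SAMPLES = {
    "English": "data",
    "Czech": "\u00fadaje",      # "údaje"
    "Chinese": "\u6570\u636e",  # "数据"
}

ENCODINGS = ("ascii", "utf-8", "utf-16-le")


def byte_sizes(text):
    """Return {encoding: byte count}, or None where the text cannot be encoded."""
    sizes = {}
    for encoding in ENCODINGS:
        try:
            sizes[encoding] = len(text.encode(encoding))
        except UnicodeEncodeError:
            # ASCII only covers 128 code points, so accented and CJK characters fail here.
            sizes[encoding] = None
    return sizes


if __name__ == "__main__":
    for language, word in SAMPLES.items():
        print(language, word, byte_sizes(word))
```

The English word fits in ASCII at one byte per character, while the Czech and Chinese words cannot be written in ASCII at all, and their byte counts in the other encodings depend on which encoding you pick. The "size" of the same piece of data is therefore partly a property of the measuring system.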

What does quantification do to data? I’m not sure, but trying to compare elephants and oranges on the same metric might be very misleading. I’ve been thinking about how to measure molecular complexity, as I’ve been probing the question of whether chemical evolution in a non-equilibrium situation leads to the formation of more ‘complex’ molecules. But as I’ve delved into the literature of how to measure molecular complexity, the situation becomes decidedly more ‘complex’ when attempting to choose an appropriate metric. Most approaches use something akin to a Shannon or Boltzmann-type scale coupled with some ad-hoc add-ons.
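As a toy illustration of why the choice of metric matters, here is a minimal Python sketch of my own. It is not a complexity measure taken from the article or from any particular paper; it simply applies the Shannon entropy formula to the element composition of a molecule, and the function name and example molecules are assumptions chosen for illustration.

```python
import math
from collections import Counter


def shannon_entropy_bits(symbols):
    """Shannon entropy (in bits) of the frequency distribution of the given symbols."""
    counts = Counter(symbols)
    total = sum(counts.values())
    return -sum((n / total) * math.log2(n / total) for n in counts.values())


# Each molecule is written as a flat list of its atoms' element symbols.
methane = ["C"] + ["H"] * 4                                 # CH4
glycine = ["C"] * 2 + ["H"] * 5 + ["N"] + ["O"] * 2         # C2H5NO2
ethanol = ["C"] * 2 + ["H"] * 6 + ["O"]                     # C2H6O
dimethyl_ether = ["C"] * 2 + ["H"] * 6 + ["O"]              # C2H6O, different connectivity

if __name__ == "__main__":
    for name, atoms in [("methane", methane), ("glycine", glycine),
                        ("ethanol", ethanol), ("dimethyl ether", dimethyl_ether)]:
        print(name, round(shannon_entropy_bits(atoms), 3))
```

By this composition-only score glycine comes out as more 'complex' than methane, which seems reasonable. But ethanol and dimethyl ether get exactly the same score because they share a molecular formula, even though their structures differ. The metric sees only what it was built to count, which is presumably why the ad-hoc add-ons creep in.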

Up to this point I haven’t touched on Big Data. The last two questions by the authors touch on this aspect. In “Who owns data? Who uses data?” they briefly discuss “supply chains” of data. One interesting historical aspect that I suspect has far-reaching ramifications is that in recent years, “data suppliers, data managers, and data users have become far more differentiated and specialized. As the collection, organization and curation of data become increasingly professionalized, a divide has appeared between the scientists who produce data, those who manage it, and those who analyze it.” This has led to tension or even conflict between data producers and data analyzers.

Big Data is here to stay. Beyond privacy issues, we should be cautious and thoughtful about what data signifies in whatever context it is being used. We should certainly ask questions.

Some previous posts on big data.
·      Deep-Fried Data
