STM innovations for tackling big data

17 December 2013

Ever since the first mass-market web browser was introduced in the mid-1990s, online publications have flourished and multiplied. Early on, publishing communities identified the need to build structure around these publications, lest they get lost in the recesses of the vast World Wide Web. We have had moderate success in interlinking communications across the web, but a formidable challenge remains in a relatively new online development: an abundance of data, much of it associated with scholarly publications.

How to structure this data and link it to the relevant publications was the central issue at this year's innovations seminar of the International Association of Scientific, Technical and Medical Publishers (STM). Publishers gathered early this month in London to learn from those at the forefront of the issue. Frank Stein of IBM's Watson Project put the matter into perspective: 90% of the world's data was created in the last two years, and 80% of that data is unstructured.

Plenary speaker Sayeed Choudhury, associate dean for research data management at Johns Hopkins University (JHU), noted that the entities best equipped to handle massive quantities of digital data are the internet giants: Facebook, Amazon, Google, and Apple. No business or government sector (except perhaps the NSA) matches their scale in speed and throughput of data management. He sees data management as a unique opportunity for libraries and STM publishers to work together, particularly in defining standards, identifiers, and structures for data and in linking publications to the relevant data repositories.

Choudhury defined two classes of scientific data: (1) “big data,” characterized by the three V’s of high volume, high velocity, and variety; and (2) “spreadsheet science,” which encompasses single-investigator or small-group science. JHU hosts the repository for a very large astronomical database, the Sloan Digital Sky Survey, which currently holds more than 150 terabytes of data from observations of galactic and extragalactic objects. The project is a superb example of a library team taking the lead in a large data-management effort. Even so, Choudhury noted that the database, despite its well-structured data system, would have benefited from a metadata system for identifying data structures much earlier in the project timeline.
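
To make the metadata point concrete, the sketch below shows the kind of descriptive record a repository might attach to a data set. The field names are illustrative, loosely modeled on DataCite-style metadata; they are not taken from the seminar or from the SDSS archive itself.

```python
# Illustrative only: a minimal descriptive record a repository might keep for
# a data set. Field names are hypothetical, loosely modeled on DataCite-style
# metadata; this is not the SDSS schema.
dataset_record = {
    "identifier": "doi:10.0000/example-survey-release",   # placeholder DOI
    "title": "Imaging catalog, example survey region",
    "creators": ["Example Survey Collaboration"],
    "publication_year": 2013,
    "resource_type": "Dataset",
    "size": "150 TB",
    "related_publications": ["doi:10.0000/example-article"],  # links back to papers
}

REQUIRED_FIELDS = {"identifier", "title", "creators", "publication_year", "resource_type"}

def missing_fields(record: dict) -> set:
    """Return whichever required descriptive fields the record lacks."""
    return REQUIRED_FIELDS - set(record)

print(missing_fields(dataset_record))  # an empty set means the record is complete
```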

Many agree that the first and most important data-management problem to solve is the preservation and linking of data associated with peer-reviewed publications. It is ironic that such data is usually born digital yet frequently disappears on a departing researcher’s thumb drive. Of note, the journals of EMBO (the European Molecular Biology Organization) routinely enable authors to connect data sets to the figures and tables in their publications. AIP Publishing and AAS are participating in a current NSF-funded project to examine author attitudes and publishing protocols for linking data sets to publications in several astronomy and plasma physics journals.
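
As a rough sketch of what figure-level data linking could look like, the example below maps an article's figures and tables to data-set identifiers. The structure and DOIs are purely hypothetical and do not represent any publisher's actual schema.

```python
# Hypothetical sketch of per-figure data links for a single article; the
# structure and identifiers are illustrative, not any publisher's schema.
article_data_links = {
    "article_doi": "doi:10.0000/example-article",
    "figures": {
        "Figure 1": "doi:10.0000/example-dataset-fig1",
        "Figure 2": "doi:10.0000/example-dataset-fig2",
        "Table 1": "doi:10.0000/example-dataset-tab1",
    },
}

def linked_datasets(article: dict) -> list:
    """Collect the data-set identifiers linked from the article's figures and tables."""
    return sorted(set(article["figures"].values()))

print(linked_datasets(article_data_links))
```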

There is much to be done in dealing with data on three fronts: the hardware that performs calculations on, stores, and displays the data; the software that manipulates the data; and the human interface for interacting with and interpreting the tremendous volume of information.

On the hardware front, Frank Stein described how IBM is developing entire new business divisions around the power of its Watson supercomputer. Near-term applications include delivering medical information to caregivers on hand-held devices that have access to large fractions of the world’s clinical and pharmaceutical information. Behind the hand-held delivery device is Watson itself: a 10-ft cube of computer hardware that consumes 100 kW of power and whose judgment a qualified medical professional can still surpass.

Matthew Day of Wolfram (known for its powerful “Mathematica” software) described a new venture, Wolfram|Alpha, an information-processing tool that allows anyone to pose questions of varying complexity to Wolfram’s system of interlinked databases. The answers can be delivered as simple graphical outputs that distill the voluminous underlying data. A modest version of Wolfram|Alpha powers Apple’s Siri service on the iPhone.
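
For readers who want to experiment, the sketch below poses a question to the public Wolfram|Alpha web API from Python. The endpoint and parameters follow the publicly documented v2 query API as I understand it; the app ID is a placeholder, and this is not the integration Day described.

```python
# Minimal sketch: posing a plain-text question to the Wolfram|Alpha web API.
# "YOUR-APP-ID" is a placeholder that must be replaced with a real key.
import requests

def ask_wolfram_alpha(question: str, app_id: str = "YOUR-APP-ID") -> str:
    """Send a question and return the raw XML result, organized into 'pods' of answers."""
    response = requests.get(
        "https://api.wolframalpha.com/v2/query",
        params={"appid": app_id, "input": question},
        timeout=30,
    )
    response.raise_for_status()
    return response.text

if __name__ == "__main__":
    print(ask_wolfram_alpha("distance from the Earth to the Moon")[:500])
```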

Chris Lintott of Oxford University, the principal investigator behind the crowd-sourced science project Zooniverse, gave a striking example of what can be accomplished by linking multiple observers. The first Zooniverse project involved the classification of millions of galaxies now visible in deep-space images taken by both ground-based and space-based telescopes. The Zooniverse website asks volunteers to help classify galaxy images. In the first day of the website’s existence, classification rates exceeded 70,000 per hour, and in its first year more than 3.5 million galaxies were classified. Despite the sophistication of modern image-processing software, the human brain is still better at certain pattern-recognition tasks. As a citizen-science enterprise, Zooniverse has since moved beyond astronomy to tackle myriad identification tasks in zoology and archaeology.
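
The value of linking many observers comes from aggregating their individual judgments. The toy sketch below combines volunteer classifications of a single image by simple majority vote; the actual Zooniverse pipeline uses more sophisticated weighting, so treat this only as an illustration of the idea.

```python
# Toy illustration of crowd-sourced classification: several volunteers label
# the same galaxy image, and the labels are combined by majority vote.
from collections import Counter

def consensus_label(votes):
    """Return the most common label among the volunteers' classifications."""
    label, _count = Counter(votes).most_common(1)[0]
    return label

# Hypothetical volunteer classifications for a single image.
votes_for_image = ["spiral", "spiral", "elliptical", "spiral", "merger"]
print(consensus_label(votes_for_image))  # -> "spiral"
```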

Seminar participants also learned of several new tools for data management in a smorgasbord of 5-minute “flash” presentations, which included techniques for highlighting data (LENS), characterizing materials more accurately (SCAZZL), enabling peer review of author citations (Social Cite), and geotagging location information in articles so that research locations can be mapped (JournalMap).

Building structure for data management is still in its infancy, but I believe the powerful tools now being developed will help the community converge on solutions to the big problem of big data.