AIP | Matters
January 14, 2013

Director's Matters

By H. Frederick Dylla, Executive Director & CEO

Facing the big data problem

An emerging front in the international call to provide wide access to scientific information is the push for access to scientific data. Scientific data underpin the charts, figures, and other displays shown in scientific publications. For mega-science projects in particle physics, astronomy, earth science, and genomics, the extensive data associated with these projects usually reside in separate repositories managed by the research institutions or their funding agencies. For a scientific endeavor of any size, whether “big science” or individual-investigator science, data collection, analysis, and display are largely digital. After three decades of Moore’s law and with the ubiquity of the microprocessor, the exponential growth of digital data has outpaced our ability to properly archive these data so that they can be easily managed and identified for discovery and use.

Scholarly publishers are in a unique position to offer important resources to help solve these problems. As the world’s scholarly literature largely went digital in the first decade of the web era, the publishing industry developed standards and persistent identifiers (the digital object identifier or DOI) for the discovery and tracking of articles, as well as redundant repositories for archiving digital content. This practice can be transferred and applied to the data problem. At the annual STM Innovations Seminar last month in London, AIP director of business development, Terry Hulbert, organized a session on the challenges and opportunities afforded by what the popular press has labeled “big data.” I offer you a few highlights to provide insight into this important subject.

Stephen Boyer has spent most of his career helping IBM Research develop software for dealing with big data problems. He subtitled his talk, “How to Deal with Too Much Content and Not Enough Discovery”—quite suitable for any student, researcher, scholar, or casual keystroker performing a Google search of the massive amount of information available online. Mr. Boyer’s IBM colleagues developed machine-readable tools for the world’s patent literature in the 1990s and later expanded their efforts to include machine-readable chemical compounds. With the use of IBM’s Blue Gene supercomputer, billions of pages of tagged text could be analyzed in minutes. The project has since moved to pharmaceuticals. A consortium of drug companies has helped fund schemas for extensive tagging of biomedical literature so that drug chemistry can be tied to the myriad of trade names and effectively associated with the consequences of clinical drug use. The latest incarnation runs on IBM’s Watson supercomputer with the near-term goal of providing medical doctors and clinicians with manageable access to this complex literature. One intriguing question posed to Boyer concerned the legal question of IBM’s liability, if Watson’s data were used for a presumed misdiagnosis. Boyer, being an astute computer scientist, promptly referred the question to the IBM legal department.

Hans Pfeiffenberger, of the Alfred Wegener Institute for Polar and Marine Research, looked at several examples of the big data problem arising from the basic sciences. For example, the likely discovery of a Higgs boson from among the thousands of terabytes of data collected by CERN’s Large Hadron Collider has been well publicized as a massive big data problem. It is also, however, a perfect example of one of the world’s largest networks of computers performing both the analyses and archiving of the data.

Pfeiffenberger offered two other examples on the same scale. In the geosciences, the worldwide network of subsurface ocean buoys is collecting and transmitting data on the ocean’s temperature and salinity; research institutions around the world collect and analyze these data. The new Beijing Genomics Institute (BGI) runs 180 gene sequencers, has its own supercomputer running in the cloud, and copublishes its own journal, “GigaScience.” None of this existed a few years ago.

Stefan Winkler-Nees from the German Research Foundation discussed the explicit connection of the big data problem to the infrastructure and protocols set up by the scholarly publishing community for web-based journals. Winkler-Nees observed that the scientific community needs to address this problem starting at the front end, when an experiment or massive theoretical problem is first being planned. We need to design systems so that it is easier to manage and share data. He noted that few reward structures for data management are currently in place at our funding or research institutions. The National Science Foundation’s (NSF) requirement for all grant applications to include data management plans is a start, but NSF managers will be the first to admit that few good models or standards are in place. Winkler-Nees called for the adoption of persistent identifiers, which can encode more value than the simple provenance of the data; the development of peer review methods for data to provide more confidence than the author’s endorsement; and essential links to the researchers and institutions providing the data. All of these protocols are well established for scholarly publications and can form a basis for putting some order into the chaos of big data. Big data need publications linked to these massive data sets to provide the essential protocols for quality assurance, as well as established tools for discovery and archiving (the underlying metadata). He felt that publications can provide the “linking hubs” in our increasingly “digital ecosystem.”

I am pleased to note that one of our Member Societies, the American Astronomical Society, has teamed up with AIP to explore some of the key aspects of linking data with publications. In early October we were notified by NSF that our partnership had been awarded a grant to explore data linking of two journals published by AAS and one by AIP. The project will first examine author attitudes toward linking data sets with publications, and then develop protocols that will be tested by volunteer authors from the candidate journals. What better way to solve a daunting problem than to test potential solutions with a series of experiments?

Publishing Matters

CrossMark implemented on AIP Journals to track papers' update status

AIP Publishing now displays a CrossMark logo on the HTML page or PDF file of its journal articles. The logo indicates AIP is maintaining the integrity of the published document through any updates, corrections, enhancements, retractions, and other such changes. Readers can click on the CrossMark logo to learn about status updates and to make sure that they are reading the most recent and reliable version of the paper. As long as there is an internet connection, this works whether the reader is on the publisher’s site, a third-party site, or viewing a PDF that was downloaded months earlier. If the document has been updated, clicking the logo will display a link to the newest version.

The CrossMark logo is a service of CrossRef, a nonprofit corporation formed by a group of scholarly publishers in 2000, and signals to researchers that publishers are committed to maintaining the scientific accuracy and integrity of their scholarly content. Many scholarly publishers, including Elsevier, Oxford University Press, and The Royal Society, already display the CrossMark logo on some of their published content. You can learn more about CrossMark from the CrossRef website.

Physics Resources Matters

Inside Science TV experiences significant growth in TV stations

Aiming to reach television viewers who do not ordinarily seek out science news, Inside Science TV produces two short-form video stories per week for distribution to television stations and the web. Celebrating its one-year anniversary this month, the program worked throughout much of 2012 to build a network of local television stations that air this new program across the United States.

In a recent Inside Science TV segment, “New Solar Cell Absorbs and Emits Light,” UC Berkeley professor Eli Yablonovitch explains a new solar-cell design that promises significant increases in efficiency.

In a major development, Inside Science TV recently finalized an agreement with Gray Television, Inc., a company that owns stations affiliated with CBS, NBC, ABC, and Fox in many television markets across the United States, from Colorado Springs, CO to Tallahassee, FL. The agreement increases syndication of ISTV segments to 33 local news stations in the U.S. ISTV is continuing its efforts to add more television markets in the U.S. throughout the year and is also pursuing sales of the program in international markets.

After the segments are provided to television stations, they are posted on the web, where they can be easily shared through social media outlets such as Facebook and Twitter. In addition, the National Science Foundation will soon be showcasing Inside Science TV segments on their Science 360 website as well as the Knowledge Network, an Internet video feed that they send to universities.

We encourage readers to visit Inside Science TV on the website and YouTube channel to sample the science video content that we are providing to the broad general public, and to spread the word about ISTV to your video-viewing friends and loved ones.

Member Society Spotlight

Students petition Congress to protect funding for science

As 2012 drew to a close, APS joined other scientific societies to encourage students from across the country to stand up for science. In an effort to draw Congress’s attention to the consequences that sequestration would have on science, 6200 students urged Congress to sustain the budget for science funding. Their message: Reducing science funding to help reduce the deficit would be counterproductive; the future of our country’s economic prosperity would be compromised. See the APS press release for more details.

Mandatory cuts were to take place on January 2 if Congress did not take action. Funding for civilian science programs would have been cut by 8.2% and for defense science programs by 9.4%. In the first hours of 2013, Congress officially delayed most sequestration decision making with a new deadline of March 1. (See FYI #3: No Resolution in Sight.) Science continues to need strong advocacy from the community.

Coming Up

Thursday, January 24

  • Physics Today Advisory Committee meeting (College Park)

Friday, January 25

  • AIP Advisory Panel on Committees meeting (College Park)
  • AIP Executive Committee meeting (College Park)