Internet2 Global Summit
“Delivering on the Promise of Big Data and Big Science:
The Most Important Networks Are Not Fiber Optic”
Shirley Ann Jackson, Ph.D.
President, Rensselaer Polytechnic Institute
Sheraton Denver Downtown
Tuesday, April 8, 2014
I am delighted to be here today to speak to the world’s key expediters of scientific discovery and technological innovation.
As a scientist and university president, I represent the large group of people who ask the impossible of you. We rely on you for the undergirding technologies that allow the connectivity of brilliant people and massive data sets, for collaboration, discovery, and innovation.
And we will become even more impossible in the foreseeable future. Allow me to offer an example: Astrophysicists tell us that less than 5% of the universe is ordinary matter. The other 95%, dark matter and dark energy, are as yet absolute mysteries.
Only one thing is entirely certain about that 95%: The herculean task of intercepting, verifying, and transmitting the data that will describe it, as well as enabling its analysis, is going to fall to the people assembled here, and to your counterparts around the world.
Speaking globally in fact, the Square Kilometer Array currently under development, the world’s largest radio telescope, soon will track young galaxies to see how dark energy and dark matter behave. One aspect of this single project is expected to generate a data stream equivalent to 100 times the global Internet traffic.
Closer to home, bioengineers at Rensselaer and other universities are trying to solve one of the knottiest problems of modern medicine: understanding why, for any given person, only about a third of the pharmaceuticals routinely prescribed will be effective, another third will be ineffective, and the last third will be absolutely harmful. We now have high-throughput assays that can test a pharmaceutical against huge numbers of liver cells with the range of gene expression we see in different individuals. Eventually, we will create a database that, when matched up against a patient’s genome, should allow us accurately to predict the effectiveness or toxicity of a drug. In the meanwhile, we are generating enormous amounts of data.
We expect you to help move forward these investigations and, in addition, to help us…
- to commercialize nuclear fusion as an energy source;
- to determine whether there is life in other solar systems;
- now that we have evidence of cosmic inflation and the Higgs boson, to reconcile general relativity and quantum physics;
- as well as to predict what Warren Buffett has wagered cannot be predicted: a perfect March Madness bracket.
We expect you, also, to help us derive insights from the massive amounts of web-based data that humanity is producing about itself, during the ordinary course of every day. In fact, this may be the greatest intellectual challenge and opportunity we all face in academic life.
As you know, the rate at which humankind is creating data is accelerating rapidly, with much of the increase due to the unstructured data on social media, and the infrastructure and sensor data generated by the “Internet of Things,” though it must be noted, the latter is about to become a much more coherent force. Both Qualcomm and the Industrial Internet Consortium, which includes General Electric, IBM, and other major corporations, are working to create standards for sensored and networked devices.
Yes, garage doors speaking to thermostats is becoming commonplace. More importantly, we are creating streams of information that can radically improve everything from health care to our energy security.
Today, we analyze less than 1 percent of the data we capture, even though the answers to many of the great global challenges lie within this overabundant natural resource.
For example, we have an initiative at Rensselaer that we call The Jefferson Project at Lake George, which is designed to turn Lake George into the “smartest lake in the world”: in other words, to instrument and monitor the lake to the hilt, and to build layers of data about its biochemistry, its circulation, and its food web into a multi-dimensional scaffolding that, together with data analytics and data visualization, will allow us to understand and mitigate its stressors. We see this as an opportunity to revolutionize the way the earth’s water resources are protected, with a new model fully informed by science.
Clearly, with its Innovation Platform, Internet2 has the future in view, creating...
- The robust bandwidth to move massive amounts of data around;
- Configurable networks tailored to specific research tasks; and
- Demarcations that create safe harbors for scientific collaborations.
Today, I want to offer my thoughts about the challenges in fully realizing the potential of this era of Big Data and Big Science.
Now, the Internet2 Network is lightning fast. However, all of us in academe understand the limitations of even the most capacious pipeline, given an exploding volume and velocity of data, including the petascale data generation currently occurring on the platforms of Big Science.
Because of the mismatch of capacity between our networks and our supercomputers, researchers are still mailing each other data on hard disks. As well, storage and its integration with computation represent a challenge.
This mismatch is likely to grow, as we examine important research questions at ever-greater resolution and scale. Petascale supercomputers will help us model the effects of human decision-making on a city forced to evacuate in a disaster. They will help us model the molecular dynamics of protein folding, or how proteins assemble themselves into different shapes in order to carry out their functions in our bodies, something that has been called the most significant problem in biochemistry. Why? Because a number of diseases, including Alzheimer’s, involve misfolded proteins.
And the scaling up will not halt there. Problems in energy, such as designing cooling systems for optimum efficiency where the coolant absorbs maximum heat without boiling, or modeling the incorporation of wind energy into the electrical grid on a national scale, demand exascale computing power.
IBM Senior Vice President and Director of Research John Kelly has called today’s computers, for all their speed and power, “brilliant idiots,” because they must be told what to do at every moment. By themselves, they are not well-suited to help us cut the volume and velocity of data down to size. Greater intelligence upstream is part of the answer, whether intelligent security cameras that only transmit the unusual, or cognitive computing systems able to cull, curate, and interpret for the researcher.
For example, we are using the remarkable IBM cognitive computing system Watson to help us interpret the enormous amount of data generated by a survey of Lake George’s underwater topography. A typical Geographic Information System can tell us what is the deepest point of the lake, but it cannot answer semantic questions based on spatial data, such as, what are Lake George’s major basins? How are these major basins connected? How does the shape of these basins affect the time it takes a layer of water to circulate? Watson can.
Clearly, the data challenges are not only about data at rest, but also about data in motion, generated in rushing streams, whether by sensored devices, or enormous scientific platforms. As we consider advanced networking, the question arises, can we embed artificial intelligence within the networks themselves to decide what data should be moved, and how it should be queued, depending on the research questions being addressed? In other words, can we expand the definition of software-defined networks to include cognitive systems that can handle such challenges?
However, while we must find new ways to manage the volume and velocity of data, we should try not to micro-manage it. The truth is, we probably do not want to be too selective about what we store! Data is clearly a realm in which one investigator’s trash is another investigator’s treasure. What used to be considered “junk DNA,” the overwhelming majority of the human genome that does not code for proteins, represents a prime example: further research reveals that it includes important regulators of gene expression, with significant implications for health and disease.
As we know, it is not just the volume and velocity of data that strain our capacities, but the third V, variety: the fact that relevant data can come from an unlimited range of sources, platforms, and geographies.
This variety includes the work of diverse researchers collaborating in ever-larger and more far-flung groups, particularly as the production of reference data in certain fields becomes more important in hypothesis-driven research.
International collaborations always have been part of scientific research and will become more so in the future, as researchers around the world share data to address global challenges. As you develop the underlying networks, the Research Data Alliance, its U.S. arm led by Rensselaer Professor Francine Berman, is working to establish the upper layers of the infrastructure and policies to allow more facile and trustworthy data sharing among researchers globally.
There are a few other problems universities must solve to allow greater connectivity. As exciting as is the variety of data we now are collecting, researchers often cannot identify which of their peers has what data and what tools. As a result, a duplication of effort slows down the progress of discovery and innovation.
Just as PubMed and Google Scholar make it easy to find citations for the published literature by subject or author, it should be easier to search for metadata about datasets, data tools, and the researchers who created them. To use an extremely old-fashioned analogy, we need a Yellow Pages for data.
Rensselaer Professor Peter Fox already has devised something similar for The Deep Carbon Observatory, a multi-disciplinary project funded by the Sloan Foundation to understand the carbon locked in the earth in all its forms, whether within microbial life or as hydrocarbons. Now, a group of web and data scientists at The Rensselaer Institute for Data Exploration and Applications (or The Rensselaer IDEA), including Dr. Fox, are creating a portal for Rensselaer researchers in all fields to find information about data and tools. Ultimately, however, we will want to expand this effort nationwide and globally.
And we will require smarter systems to bookmark the Yellow Pages for us. At Rensselaer, we are teaching Watson to serve as a data advisor, steering researchers to the places they are likely to find the information they require within 1,000,000 open government data sets around the world. Down the road, it is easy to envision Watson understanding research partnerships and expanding on the safe harbors Internet2 offers them, by finding data that would interest collaborators as a group and, as well, by directing data of particular interest to the particular individuals within a collaboration.
Finally, we will need to address, increasingly, the fourth V of Big Data: veracity.
Layered data from many sources will help us see more clearly. But it also will raise contextual questions. Researchers will want to know, what is the provenance of the data? Is it accurate? Who is permitted to use it, correct it, and add to it in future?
Data sharing on the scale I am arguing for is likely to be a case of “trust, but verify.” Transparency about the tools used to produce the data will help to reassure researchers and enable them to test and reproduce it, and help to speed discovery and innovation.
Again, the question arises, can cognitive and semantic tools be embedded into advanced networking, so that the networks themselves can use the provenance of data to decide what should be moved?
As we contemplate complex global challenges that range from climate change, to dislocations within the financial markets, to disruptions of global supply chains, to hyper-connected infrastructure, to the possibility of a flu pandemic, it is clear that we humans all are subject to intersecting vulnerabilities with cascading consequences. We are connected by our exposuresand we are exposed by our connections.
Therefore, it is important that greater resilience be built into our networks, both for the security of the Internet of Things and for avoiding disruption of important collaborative research efforts.
So, as one thinks about the future of Internet2, not only about how it enables global research, but also about how robust it is, the question arises, is there more intelligence we can build into our networks to reinforce their resilience?
It is important, also, to recognize that the tools we are creating (networking tools that enable consortia of researchers to form, semantic and cognitive tools that allow investigations across unrelated data sets and that weigh and value data according to its provenance), as well as the increasingly central role computation plays in every field, all are contributing to another revolution in research: the crumbling of the walls between disciplines.
That means we all are challenged to consider the ways we teach, and how we organize and fund research, as we prepare the next generation to lead. We must find new ways to bring together the innovators and students in data science, networking, and computation, with the innovators and students in every other domain.
More broadly, in this era of Big Data and Big Science, universities must serve as a crossroads for collaboration. They must model themselves on what I have defined as The New Polytechnic, using advanced technologies, in new ways, to unite a multiplicity of disciplines and perspectives. We must do this because, as we all know, the most important networks in discovery and innovation are human. But unlocking human potential depends not only on the technologies we put in place, but on how we are able to use them.
The greatest challenge all of us in academic life face, whether we are network technicians or theoretical physicists, whether we are CIOs, CTOs, or university presidents, is fostering the right connections.
Therefore, I am so very glad Internet2 is on the job!
Source citations are available from the division of Strategic Communications and External Relations, Rensselaer Polytechnic Institute. Statistical data contained herein were factually accurate at the time it was delivered. Rensselaer Polytechnic Institute assumes no duty to change it to reflect new developments.