One of the interesting aspects of GigaOM’s Structure conference last week from a NewNet perspective was the discussion of what some participants called the “Big Data” problem. In a sense, this issue affects virtually every other aspect of the infrastructure industry that was the focus of the conference, both on the hardware side and the software side, and it stems from the fundamental nature of the web and the Internet itself. Not only that, but it is an issue that no one — with the possible exception of Google and Microsoft — can really claim to have a handle on.
The nature of the problem is fairly straightforward: There is a vast amount of digital data out there, and it is growing rapidly. Just how much there is, and how rapidly it is growing, is the subject of much debate, but some recent estimates put the amount of Internet content at 500 billion gigabytes, and networking experts estimate that this figure will likely double within 18 months. At Structure, the chief information officer at NASA's Ames Research Center said the agency has telescopes that are producing almost a billion gigabytes of data every day. At those kinds of levels, we are well into the arena of what information scientists call "exascale" computing, where data is measured in exabytes: a million trillion (10^18) bytes each.
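As a rough sanity check on those figures (assuming decimal SI units, where a gigabyte is 10^9 bytes and an exabyte is 10^18), the arithmetic works out like this:

```python
# Scale check for the estimates above, using decimal (SI) units.
GB = 10**9   # bytes in a gigabyte
EB = 10**18  # bytes in an exabyte (a million trillion bytes)

internet_content = 500 * 10**9 * GB   # 500 billion gigabytes
telescope_daily = 10**9 * GB          # ~1 billion gigabytes per day

print(internet_content / EB)   # 500.0 -- exabytes of Internet content
print(telescope_daily / EB)    # 1.0   -- roughly an exabyte per day
```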
It may be a straightforward problem, but that doesn’t mean anyone has figured out an easy solution. When you have billions of gigabytes of data being produced, you have a number of problems to solve, including:
Where to put it: Even with storage costs dropping rapidly (a terabyte of data storage used to cost hundreds of thousands of dollars, but now costs less than $100), finding a place to put a billion gigabytes of data is not an easy task. Scientists are working on massively scalable clusters and even some new information technologies such as quantum-level data storage to handle the problem.
How to move it: As Joe Weinman, vice-president of strategy and business development at AT&T, mentioned on a panel at Structure, the public Internet — even the fastest parts of it — simply isn’t capable of transferring billions of gigabytes of data in anything close to a reasonable time-frame. Some scientists say the easiest way to handle this problem is by using FedEx: in other words, literally shipping hard drives and servers across the country by truck.
What to do with it: Even when you figure out where to put it, the members of a panel at Structure talking about exascale computing noted that you still have to do something meaningful with it, and that means you need very powerful computers. John West, special assistant for computation strategy with the U.S. Army Engineer Research and Development Center, said the most powerful computer in existence is at the Department of Energy's Oak Ridge National Laboratory and does 1.8 petaflops, or 1.8 quadrillion floating-point operations per second, which is still well short of an exascale machine, and it draws 7 megawatts of power, enough to supply thousands of homes. Those kinds of computers also cost upwards of $500 million each.
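Both the moving problem and the compute gap are easy to see with back-of-envelope arithmetic. In the sketch below, the 10 gigabit-per-second link speed is my own assumption for illustration, not a figure cited by any of the panelists:

```python
# How long would a billion gigabytes take over a fast pipe?
data_bytes = 10**9 * 10**9            # a billion gigabytes = 1 exabyte
link_bps = 10 * 10**9                 # hypothetical 10 Gbit/s link (assumption)
seconds = data_bytes * 8 / link_bps   # convert bytes to bits, divide by rate
years = seconds / (365 * 24 * 3600)
print(round(years, 1))                # ~25.4 years of continuous transfer

# And how far is a 1.8-petaflop machine from exascale (10**18 flops)?
print(round(10**18 / (1.8 * 10**15)))  # ~556x more compute still needed
```

At roughly 25 years per exabyte over even a fast dedicated link, a truck full of hard drives starts to look like the high-bandwidth option.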
As the Structure panel noted, NASA and university research departments aren’t the only ones contributing to this data deluge — every mobile phone and iPod and iPad does as well, as more people take photos and video at ever-increasing resolutions. Apple has shipped three million iPads in less than three months, and that represents another hundred million gigabytes of storage or more added to the pile. Those ever-growing numbers pose a formidable challenge for a company like Google, whose stated goal is to be able to search all of the world’s information. Even as it grows (the company has an estimated one million servers at its disposal, and continues to build more server farms), that goal seems more and more elusive.