A near-term outlook for big data
Why service providers matter for the future of big data — by Derrick Harris
By now most IT professionals have seen the report from the McKinsey Global Institute and the Bureau of Labor Statistics predicting a significant big data skills shortage by 2018. They predict the U.S. workforce will be short between 140,000 and 190,000 of what are popularly called “data scientists” and 1.5 million people short for more-traditional data analyst positions. If those numbers are accurate (and they very well might be, given the importance companies currently place, and will continue to place, on big data and analytics), one can only imagine the skills shortage today, during the infancy of big data.
And McKinsey did not address shortages of workers capable of deploying and managing the distributed systems necessary for running many big data technologies, such as Hadoop. Those skills might be more commonplace and less important several years down the road, when they evolve into everyday systems administration work, but they are vital today.
To date, one major solution to the big data skills shortage has been the advent of consulting and outsourcing firms specializing in deploying big data systems and developing the algorithms and applications companies need in order to actually derive value from their information. Almost unheard of just a few years ago, these companies are cropping up fairly frequently, and some are bringing in a lot of money from clients and investors.
If McKinsey’s predictions hold true, these companies will continue to play a vital role in helping the greater corporate world make sense of the mountains of data it is collecting from an ever-growing number of sources. Indeed, in a recent GigaOM Pro and Logicworks survey of more than 300 IT professionals (full results available in a forthcoming GigaOM Pro report), 61 percent said they would consider outsourcing their big data workloads to a service provider.
However, if the current wave of democratizing big data lives up to its ultimate potential, today’s consultants and outsourcers will have to find a way to keep a few steps ahead of the game in order to remain relevant, because what’s cutting edge today will be commonplace tomorrow.
Snapshot: What’s happening now?
Today most big data outsourcing takes one of three shapes:
1. Firms that help companies design, deploy and manage big data systems
2. Firms that help companies build custom algorithms and applications to analyze data
3. Firms that provide some combination of engineering, algorithm and application development, and hosting services
We examine each in more detail below.
The first category is probably the most popular in terms of the number of firms, if not in mind share. Systems design and management is important: Hadoop clusters and massively parallel databases are not child’s play. So, too, is helping companies select the right set of tools for the job. Hadoop, for example, gets a lot of attention, but it is not right for every type of workload (e.g., real-time analytics). Assuming they have a multiplatform environment that includes a traditional relational database, possibly an analytic database such as Teradata or Greenplum, and Hadoop, many companies will also need guidance in connecting the various platforms so data can move relatively freely among them.
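Much of that integration work is straightforward but fiddly plumbing. As a purely illustrative sketch, the Python snippet below exports a table from a relational database into the kind of tab-delimited flat file Hadoop tooling consumes; the database, table and file names are hypothetical, and in practice a dedicated tool such as Apache Sqoop would automate the transfer.

```python
# Illustrative sketch: export rows from a relational database into a
# tab-delimited flat file that a Hadoop job can consume. The database,
# table and file names here are hypothetical stand-ins.
import csv
import sqlite3

# Stand-in for a production relational database (e.g., MySQL or Oracle).
conn = sqlite3.connect("warehouse.db")
cursor = conn.execute("SELECT order_id, customer_id, amount FROM orders")

# Hadoop streaming and Hive handle delimited text natively, so a plain
# tab-separated dump is often the simplest interchange format.
with open("orders.tsv", "w", newline="") as out:
    writer = csv.writer(out, delimiter="\t")
    for row in cursor:
        writer.writerow(row)

conn.close()
# The file would then be pushed into HDFS, e.g.:
#   hadoop fs -put orders.tsv /data/orders/
```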
There are a handful of firms in this space, some big and some quite small. Two of the bigger ones are Impetus and Scale Unlimited, which share a heavy focus on implementing Hadoop and other big data technologies while providing fairly basic analytics services. An interesting up-and-comer is MetaScale, which has the big-business experience and financial backing that come from being a wholly owned subsidiary of Sears Holdings Corporation. MetaScale draws on Sears’ experience building big data systems to provide an end-to-end consulting-through-management service while partnering with specialists on the algorithm front.
That brings us to the group of firms that specializes in helping companies create analytics algorithms best suited to their specific needs. A number of smaller firms are making names for themselves in this space, including Think Big Analytics and Nuevora, but Mu Sigma is the 800-pound gorilla. Its financials alone illustrate its dominance: Mu Sigma has raised well over $100 million in investment capital, and the company’s 886 percent revenue growth between 2008 and 2010 (from $4.2 million to $41.5 million) landed it a place on Inc. magazine’s list of the 500 fastest-growing private companies.
Mu Sigma helps customers, including Microsoft, Dell and numerous other large enterprises, with what it calls “decision sciences,” using its DIPP (descriptive, inquisitive, predictive, prescriptive) index. The firm does some consulting and outsourcing work on the system side, but its strong suit is bringing advanced analytics techniques to business problems. Right now it helps clients across a range of vertical markets develop targeted algorithms for marketing, risk and supply chain analytics.
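To make “targeted algorithms for marketing” concrete: one common deliverable from such firms is a propensity model that scores customers on their likelihood to respond to an offer. The sketch below is a generic, hypothetical illustration using scikit-learn on synthetic data; it does not reflect Mu Sigma’s proprietary methods.

```python
# Generic illustration of a targeted marketing algorithm: a propensity
# model scoring customers on their likelihood to respond to an offer.
# The features and data are synthetic stand-ins, not any firm's method.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(42)

# Synthetic customer features: [recency (days), frequency, monetary value]
X = rng.normal(loc=[30, 5, 200], scale=[10, 2, 50], size=(1000, 3))
# Synthetic labels: customers who responded to a past campaign
y = (X[:, 1] + rng.normal(0, 1, 1000) > 5).astype(int)

model = LogisticRegression(max_iter=1000).fit(X, y)

# Score customers and target the top decile by predicted propensity.
scores = model.predict_proba(X)[:, 1]
top_decile = np.argsort(scores)[-len(scores) // 10:]
print("Customers to target:", top_decile[:10])
```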
The whole package
At the top of the pyramid are firms that do it all: system design, algorithms and their own hosted platform for actually processing user data. There are some newer, smaller firms in this space, such as RedGiant Analytics, but probably the biggest firm dedicated to providing these types of services is Opera Solutions. It did $69.5 million in revenue in 2010 and touts itself as the biggest employer of computer science Ph.D.s outside IBM. Opera focuses on finding and analyzing the “signals” within a specific customer’s data and developing applications that utilize that information. At the technological core of its service is the Vektor platform, a collection of technologies and algorithms for storing, processing and analyzing user data.
The vendors themselves
For all their specialized expertise in big data technologies and techniques, though, the firms mentioned above aren’t operating in a vacuum. Especially at the infrastructure level, they face natural competition from the vendors of big data technologies themselves. Hadoop distribution vendors such as Cloudera, Hortonworks, MapR and EMC have partnerships across the data-management software ecosystem, meaning companies wanting to implement a big data environment, no matter how complex, will always have the opportunity to pay for professional support and services.
Even on the hardware front, Dell, Cisco, Oracle, EMC and SGI are among the server makers selling either reference architectures or appliances tuned especially for running Hadoop workloads. Such offerings can eliminate many of the questions and much of the legwork needed to build and deploy big data environments.
And then there is IBM, which has an entire suite of software, hardware and services it can bring to bear on customers’ needs. If they are willing to pay what IBM charges and go with a single-vendor stack, companies can get everything from software to analytics guidance to hosting from Big Blue. IBM predicts analytics will be a $16 billion business for it by 2015, and services will play a large role in that growth.
Is disruption ahead for data specialists?
Increasingly, however, big data outsourcing firms may find themselves competing on many fronts for companies’ analytics dollars. At the most basic level, the growing acceptance of cloud computing as a delivery model for big data workloads could prove problematic. While the majority of our survey respondents plan to outsource some big data workloads in the upcoming year, 70 percent said they would consider using a cloud provider such as Rackspace or Amazon Web Services, versus just 46 percent who said they would use analytics specialists such as Mu Sigma or Opera Solutions. Presumably, as with all things cloud, many are looking for any way to eliminate the costs associated with buying and managing physical infrastructure.
Of course, simply choosing to use a cloud provider’s resources doesn’t eliminate the need for specialist firms. Such a decision may remove the need to worry about buying and configuring hardware, but it doesn’t make big data easy. Companies will likely still need help configuring the right virtual infrastructure and choosing the right software tools for their applications, unless they use a hosted service such as Amazon Elastic MapReduce, IBM SmartCloud BigInsights or Cloudant (or any number of other hosted NoSQL databases).
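Those hosted services do lower the bar considerably. As a rough sketch of what hosted Hadoop looks like in practice, the Python snippet below launches AWS’s sample word-count job on Elastic MapReduce using the boto library; the credentials and bucket names are placeholders.

```python
# Sketch of running a hosted Hadoop job on Amazon Elastic MapReduce
# via the boto library; credentials and bucket names are placeholders.
from boto.emr.connection import EmrConnection
from boto.emr.step import StreamingStep

conn = EmrConnection('<aws_access_key>', '<aws_secret_key>')

# A classic Hadoop streaming step: AWS's sample word-count job.
step = StreamingStep(
    name='Word count',
    mapper='s3n://elasticmapreduce/samples/wordcount/wordSplitter.py',
    reducer='aggregate',
    input='s3n://elasticmapreduce/samples/wordcount/input',
    output='s3n://<your-bucket>/wordcount/output')

# Amazon provisions the cluster, runs the step and tears it down;
# there is no physical infrastructure to buy or manage.
jobid = conn.run_jobflow(
    name='Hosted Hadoop demo',
    log_uri='s3://<your-bucket>/logs',
    steps=[step])
print(conn.describe_jobflow(jobid).state)
```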
One particularly promising but brand-new hosted service provider is Infochimps, which has pivoted from being primarily a data marketplace into a big data platform provider. Its new Infochimps Platform product, which is itself hosted on Amazon Web Services, aims to make it as easy as possible to deploy, scale and use a Hadoop cluster as well as a variety of databases.
However, none of these hosted services provides users with data scientists who can tell them what questions to ask of their data and help create the right models to find the answers. In the end, that’s what big data is all about. Firms like Mu Sigma and Opera Solutions that help with the actual creation of algorithms and models should remain very appealing, assuming the cloud providers don’t become analytics experts overnight. In that sense, cloud infrastructure is just like physical infrastructure: It’s the easy part, but what runs on top of it is what matters.