Last week my colleague John Furrier reported that YARN, also known as Next-Generation MapReduce, was upgraded to a full-fledged Apache Hadoop sub-project. While YARN is still considered alpha-quality, the move is a good sign for Hadoop. Here’s why.
Critics often cite Hadoop’s inability to process data with any method other than MapReduce as evidence that Hadoop isn’t enterprise-ready. And indeed this is a shortcoming. MapReduce is great for batch processing large volumes of distributed data, but it’s less than ideal for real-time data processing, graph processing and other non-batch methods.
YARN is the open source community’s effort to overcome this limitation and transform Hadoop from a one-trick pony into a truly comprehensive Big Data management and analytics platform. Specifically, YARN gives each application running on Hadoop its own ApplicationMaster. As described at Hadoop.Apache.org, the ApplicationMaster serves as a framework-specific library that, in conjunction with a global ResourceManager, lets each application process data with any of several frameworks, traditional MapReduce among them.
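To make that division of labor concrete, here is a toy sketch in Python. It is not YARN’s actual API: the class names borrow YARN’s terms, but the `request`, `release`, and `run` methods, the idea of counting capacity in “slots,” and the application names are all assumptions made up for illustration. It shows only the general idea of one global resource manager handing out shared cluster capacity to per-application masters that each belong to a different framework.

```python
from dataclasses import dataclass, field


@dataclass
class ResourceManager:
    """Toy stand-in for the global ResourceManager: tracks shared cluster slots."""
    total_slots: int
    allocated: dict = field(default_factory=dict)  # app name -> slots granted

    def free_slots(self) -> int:
        return self.total_slots - sum(self.allocated.values())

    def request(self, app_name: str, slots: int) -> int:
        """Grant up to `slots` slots, capped by what is still free (hypothetical method)."""
        granted = min(slots, self.free_slots())
        self.allocated[app_name] = self.allocated.get(app_name, 0) + granted
        return granted

    def release(self, app_name: str) -> None:
        """Return all of an application's slots to the shared pool."""
        self.allocated.pop(app_name, None)


class ApplicationMaster:
    """Toy per-application master: framework-specific, negotiates with the manager."""
    framework = "generic"

    def __init__(self, name: str, rm: ResourceManager):
        self.name = name
        self.rm = rm

    def run(self, wanted_slots: int) -> str:
        granted = self.rm.request(self.name, wanted_slots)
        return f"{self.name} ({self.framework}) running on {granted} slots"


class MapReduceMaster(ApplicationMaster):
    framework = "MapReduce"


class GraphMaster(ApplicationMaster):
    framework = "graph processing"


# One cluster, two applications from different frameworks.
rm = ResourceManager(total_slots=10)
batch = MapReduceMaster("nightly-report", rm)
graph = GraphMaster("friend-graph", rm)

print(batch.run(6))   # nightly-report (MapReduce) running on 6 slots
print(graph.run(6))   # friend-graph (graph processing) running on 4 slots
print(rm.free_slots())  # 0
```

In this sketch the second application asks for six slots but gets only the four that remain, because both draw on the same pool. A real scheduler is far more sophisticated, but the point carries over: the frameworks differ per application while the cluster and its resource accounting are shared.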
You can get all the technical details here, but the important takeaway for CIOs and others responsible for Big Data investments is that YARN enables enterprises to wring significantly more value from Hadoop by allowing both MapReduce-focused applications and applications for other data processing frameworks to run on the same cluster.
Hortonworks’ Arun Murthy, who plays a major role in developing YARN, explains its significance:
“People are not going to be comfortable buying a $5 million Hadoop cluster just to do MapReduce and a $2 million cluster to do something else. If you can allow them to run both apps in the same cluster, it’s not only easier for you in terms of a CapEx perspective … it’s also easier from an operational perspective because you don’t have to have two separate sets of people managing your clusters or two sets of tools for managing your clusters.”
So instead of maintaining one large-scale Hadoop cluster to support historical analysis of Big Data sets (along with dedicated staff and software to manage the cluster) and a separate cluster, staff and software to support real-time end-user-facing Big Data applications, you can deploy just one cluster for both. That results in significant savings in time, money and manpower, making Hadoop a much more attractive option for the enterprise.
Of course, YARN is not quite ready for prime time, and there are other areas that need improvement for Hadoop to be considered a comprehensive data management platform. But the upgrade to its own sub-project means that YARN will receive even more attention from committers and develop that much faster. Once YARN is stable enough for production-level deployments, Murthy said, Hortonworks will integrate it into its own Hadoop distribution, the Hortonworks Data Platform.
Check out the video below for a succinct explanation of YARN and its benefits from Murthy at Hadoop Summit 2012.