Hadoop YARN is essentially an upgrade to MapReduce, an attempt to take Apache Hadoop beyond MapReduce for data processing.
According to a blog post by Hortonworks cofounder Arun Murthy, Apache Hadoop YARN joins Hadoop Common (core libraries), Hadoop HDFS (storage) and Hadoop MapReduce (the MapReduce implementation) as sub-projects of Apache Hadoop, which is itself a Top Level Project in the Apache Software Foundation. Until this milestone, YARN was a part of the Hadoop MapReduce project; it is now poised to stand on its own as a sub-project of Hadoop.
This is a huge win for Hortonworks, which, alongside Cloudera, has established itself as an emerging leader and a credible steward of a stable and open Hadoop. The marketplace is looking for confidence in the stability and maturity of Apache Hadoop, as many organizations have already succeeded in driving real business and technical value with it.
As folks are aware, Hadoop HDFS is the data storage layer for Hadoop and MapReduce was the data-processing layer. However, the MapReduce algorithm, by itself, isn’t sufficient for the very wide variety of use-cases we see Hadoop being employed to solve. With YARN, Hadoop now has a generic resource-management and distributed application framework, whereby one can implement multiple data-processing applications customized for the task at hand. Hadoop MapReduce is now one such application for YARN, and I see several others given my vantage point – in the future you will see MPI, graph-processing, simple services etc., all co-existing with MapReduce applications in a Hadoop YARN cluster.
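To make the idea of a generic resource layer shared by several kinds of applications concrete, here is a minimal toy sketch in Python. It is not YARN's actual API: the class `ToyResourceManager`, its methods, and the application names are all hypothetical, and it only illustrates one pool of containers being handed out to a MapReduce job and to non-MapReduce jobs side by side.

```python
from dataclasses import dataclass, field


@dataclass
class ToyResourceManager:
    """Hypothetical stand-in for a cluster-wide resource manager (not YARN's API)."""
    total_containers: int
    allocations: dict = field(default_factory=dict)

    def free(self) -> int:
        # Containers not currently held by any application.
        return self.total_containers - sum(self.allocations.values())

    def request(self, app_name: str, wanted: int) -> int:
        # Grant up to `wanted` containers, capped by what is still free.
        granted = min(wanted, self.free())
        if granted > 0:
            self.allocations[app_name] = self.allocations.get(app_name, 0) + granted
        return granted

    def release(self, app_name: str) -> None:
        # Return all containers held by a finished application.
        self.allocations.pop(app_name, None)


# Three different kinds of application share one 10-container cluster.
rm = ToyResourceManager(total_containers=10)
granted = {
    "mapreduce-job": rm.request("mapreduce-job", 6),
    "mpi-job": rm.request("mpi-job", 3),
    "graph-job": rm.request("graph-job", 4),  # only part of this request fits
}
```

The point of the sketch is that the resource manager knows nothing about MapReduce specifically; each application simply asks for containers and runs its own processing logic inside them.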
Implications for the Apache Hadoop Developer community
I’d like to take a brief moment to walk folks through the implications of making Hadoop YARN a sub-project, particularly for members of the Hadoop developer community.
• We will now see a top-level hadoop-yarn-project source folder in Hadoop trunk.
• We will now use a separate JIRA project for YARN issue tracking, i.e. https://issues.apache.org/jira/browse/YARN
• We will also use a new, dedicated YARN mailing list for collaboration.
• We will continue to co-release a single Apache Hadoop release that will include the Common, HDFS, YARN and MapReduce sub-projects.
If you would like to play with YARN, please download the latest hadoop-2 release from the ASF and get involved – either by contributing to the core YARN sub-project or by building your cool application on top!
Please do remember that hadoop-2 is still deemed alpha quality by the Apache Hadoop community, but YARN itself shows a lot of promise and we are excited by the future possibilities!
Overall, having Hadoop YARN as a sub-project of Apache Hadoop is a significant milestone for Hadoop, several years in the making. Personally, it is very exciting given that this journey started more than 4 years ago with https://issues.apache.org/jira/browse/MAPREDUCE-279. It’s a great pleasure, and an honor, to get to this point by collaborating with the fantastic community that is driving Apache Hadoop.
What Does This Mean for the Hadoop Community
Hadoop has established itself as the big data platform of choice. Many big organizations are moving to Hadoop and those who don’t have a big data strategy will be left behind.
• Open Source Innovation vs. Stability: As an open source technology matures and becomes mainstream, it becomes increasingly important to balance community innovation and enterprise stability. Core Apache Hadoop has reached stability and has been proven in large-scale deployments. It can be trusted, and it is no longer necessary to rely on the bleeding-edge development lines of Hadoop.
• Apache Hadoop Platform Completeness: Apache Hadoop, with its core set of related projects, presents a wide array of functions to enable the ecosystem, ease operations and empower the developer with enterprise-ready tools.
• Apache Hadoop Maturity: Apache Hadoop has come a long way. Trusted and tested versions of Hadoop are very important. Additionally, upgrades to the core such as YARN are a great example of balancing innovation and stability.
• Community Stewardship: the Apache Hadoop community continues to push the platform forward. Hortonworks and Cloudera are working closely with the community and are stewards of the core so that it remains a viable solution for the enterprise, but they are also innovators at the edge to advance Hadoop further.
The trend points toward half the world’s data being processed by Apache Hadoop. The Hadoop community continues to revolutionize and commoditize the storage and processing of big data via open source. The major focus needs to be on the scale and adoption of Apache Hadoop. All the players are dedicating significant engineering resources to making Apache Hadoop more robust and easier to integrate, extend, deploy and use.
Hadoop continues to be the big data platform of choice, and more upgrades will come. I’m looking forward to more conversations around this between now and Hadoop World and Strata this fall in NYC, October 23-25.
More info on Hadoop World / Strata here – http://strataconf.com/
Here is my interview with Hortonworks cofounder Arun Murthy at Hadoop Summit this past June.