Microsoft discusses Azure Data Lake and Hadoop integration

Staff Writer

Azure Data Lake, as reported yesterday, is a group of big data storage and analytics services that enables analysts, scientists, businesses, and organizations to analyze and process big data of all shapes and characteristics. The new hyper-scale data store is made up of components including:

  • Azure Data Lake Store, which will be previewed later this year, is a one-stop repository for dynamically capturing and storing data without needing to change the application as the data scales.
  • Azure Data Lake Analytics, which will also be previewed later this year, is a new, dynamically scaled service built on Apache YARN. It features U-SQL, a language that lets users combine custom code with SQL. Its scalable distributed query capability enables efficient analysis of data stored across SQL Server instances in Azure, Azure SQL Database, and Azure SQL Data Warehouse.
  • Azure HDInsight, a managed Apache Hadoop cluster service that uses open source analytics engines such as Hive, Spark, HBase, and Storm. HDInsight now offers managed clusters on Linux.

Microsoft Data Platforms Technical Fellow Raghu Ramakrishnan took the opportunity today to describe how these services came to be, with a sort of behind-the-scenes candor.
Ramakrishnan is a former Yahoo employee who worked in depth with Hadoop and many other open source tools. Apache Hadoop is an open source software framework for distributed storage and processing of very large data sets. It is built on the assumption that single systems or clusters can fail, and that the framework must handle those failures automatically.
When Ramakrishnan came to Microsoft, he set about integrating Hadoop into Microsoft’s big data strategy. This was after he saw Microsoft’s engineers and analysts productively using tools like Cosmos and Scope to easily manage, process, and analyze big data in massively scalable environments. He became convinced that he wanted to combine the advanced productivity he found at Microsoft with the vibrant openness and flexibility found in the Hadoop ecosystem.
And he succeeded. HDInsight, Azure Data Lake, Azure Data Lake Store, and several other Azure services all offer tight Hadoop integration.
The central theme of Ramakrishnan’s anecdotes is Microsoft’s newfound commitment to contributing to open source. In addition to being a huge contributor to the Apache Hadoop project and its core element, HDFS, the company continues to be a major contributor to the Apache YARN project. The company itself is actively incorporating Hadoop and YARN into its big data workflows. This requires Microsoft to grow YARN’s capabilities to better match the company’s needs, and it then funnels those additions back to the open source community.
Some of Microsoft’s significant contributions to YARN include:

  • Support for work-conserving preemption (YARN-45).
  • Rayon (YARN-1051), a resource reservation component that ships with the Hadoop 2.6 release.
  • Mercury (YARN-2877) and Tetris (YARN-2745), both of which enhance the YARN scheduler.
  • REEF (Retainable Evaluator Execution Framework), a framework that runs on top of YARN and is well suited to machine learning jobs.

Other noteworthy contributions outside of YARN include:

  • Hadoop on Azure and Windows
  • Hive and ORC
  • OAuth2 support in WebHDFS
  • Spark Kernel for Jupyter

The company’s pivot to open source goodness extends well beyond its cloud offerings. As we’ve reported previously, Microsoft has open sourced several traditionally proprietary components, such as the CoreCLR, the Roslyn compiler, and Live Writer. It has also created entirely new open source projects, such as its iOS porting tool, Facebook SDK, WinJS, and TypeScript.
Microsoft’s products and services are rapidly changing to be as inclusive as possible of other, even competing, products, all to better serve the unique needs of its customers. Azure Data Lake and its integration with the Hadoop ecosystem are the latest evidence of Microsoft’s new long-term strategy.