
Cloudera Hadoop Developer Training – My experience

Last week (Nov 11-14) I attended the Cloudera Developer Training for Apache Hadoop course, and in this post I share my experience and takeaways from that training. But first, a brief bit about me to lay out the context. I have worked in the telecommunications industry and have several years of experience in embedded systems software design and development; about two months ago I decided to move into the Big Data space, which led me to this training course.

Research and Preparation

I did a lot of research online and talked with friends about ways to get started in Big Data. The general recommendation was to start by getting trained in Hadoop. I decided to go with Cloudera training because Cloudera is widely regarded as a leader in the Hadoop space. I reviewed the course topics and, intending to get credentialed as a Cloudera Certified Developer for Apache Hadoop, a certification well regarded in the industry, I registered for the Developer Training for Apache Hadoop course.

I had about 6 weeks until the start of training and, wanting to make the most of the training class, I prepared for it. My goal was to have a few things covered by the start of training: a basic understanding of the two key Hadoop components, MapReduce and HDFS; high-level knowledge of the Hadoop ecosystem; a little bit of Hadoop programming; and some knowledge of the practical applications, issues, and limits of Hadoop as they relate to Big Data. To that end I relied heavily on the Web, Google search, and books. I started learning Hadoop from Hadoop Beginner's Guide, and a week later started reading Tom White's Hadoop: The Definitive Guide in parallel. I installed Hadoop on my laptop and ran example programs from the first book. I pored over articles, blog posts, presentations, and white papers about anything Hadoop and Big Data. As the training date approached, I felt reasonably well prepared. In particular, Tom White's book was very helpful -- it provides a great introduction to Hadoop with simple, clear explanations of fairly involved concepts, lots of real code examples in Java, C, Python, etc., and detailed discussions of the relative benefits and limitations of the different methods used in Hadoop.

Hadoop Developer Training

The 4-day training was delivered by ExitCertified at their clean, quiet, and well-prepared facility in San Francisco. Our instructor, Joel, taught the class all four days. The class ran each day from 9 AM to 5 PM, with a one-hour lunch break around noon and two 5-10 minute breaks each during the morning and afternoon sessions. Our class comprised 6 on-site and 8 remote trainees, the latter joining in via video conferencing. One of the trainees had been flown in from Ireland by her company to get trained on-site! The trainees came from diverse job backgrounds, industries, and experience levels, and the majority did not have Hadoop development experience. I was somewhat puzzled by the rather small number of on-site trainees given the high demand for Hadoop developers in Silicon Valley; maybe the upcoming Thanksgiving break had something to do with it. But the small class made for more intimate discussions and detailed question-and-answer sessions.

The training started with each student signing into the Cloudera training portal, which gave access to the Cloudera Hadoop Training PowerPoint slides and the Developer Exercise Instructions document. Our instructor Joel also generously provided a USB thumb drive containing a lot of additional material, including his own lecture slides (separate from the Cloudera training slides), articles, FAQs, and demo code snippets, for us to download to our personal laptops for later study and reference. The setup for each student included a desktop with two monitors, making it easy to view lecture slides and run lab exercises side by side. The desktop came with a virtual machine (VM) installed, which served as our platform for the lab exercises.

The Hadoop Developer Training course aims for comprehensive coverage of Hadoop and includes a variety of topics. It starts with a high-level introduction to Hadoop, moves on to detailed coverage of MapReduce and HDFS, and winds down with an introduction to some of the Hadoop ecosystem projects. Each day, time was split between lectures (about 65%) and hands-on lab exercises (about 35%). On Day 1 the topics were the motivation for Hadoop and an introduction to MapReduce and HDFS; Days 2 and 3 covered MapReduce in depth, including writing MapReduce programs, writing unit tests, and a detailed presentation of reducers and combiners; and on the final day, Day 4, the topics were other Hadoop ecosystem projects, including Hive, Pig, Impala, Sqoop, and Oozie. Each lecture on a key topic was followed by a lab exercise exploring that topic. On Day 4 the class got a bonus: Joel asked for a vote on other Hadoop ecosystem projects of interest, and the class chose Spark, Mahout, ZooKeeper, and graph processing, which he then lectured on in good detail.
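To give a flavor of what the MapReduce labs involved, here is a minimal word-count Mapper and Reducer written against the standard org.apache.hadoop.mapreduce Java API. This is my own illustrative sketch, not the Cloudera lab code, and the class and file names are hypothetical.

```java
// WordCountMapper.java -- emits (word, 1) for every word in an input line.
import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class WordCountMapper
        extends Mapper<LongWritable, Text, Text, IntWritable> {

    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        // The key is the byte offset of the line; the value is the line itself.
        for (String token : value.toString().toLowerCase().split("\\W+")) {
            if (!token.isEmpty()) {
                word.set(token);
                context.write(word, ONE);  // intermediate (word, 1) pair
            }
        }
    }
}

// WordCountReducer.java -- sums the counts emitted for each word.
import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

public class WordCountReducer
        extends Reducer<Text, IntWritable, Text, IntWritable> {

    @Override
    protected void reduce(Text key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
        int sum = 0;
        for (IntWritable count : values) {
            sum += count.get();
        }
        context.write(key, new IntWritable(sum));  // final (word, total) pair
    }
}
```

Splitting the logic into a Mapper and a Reducer like this is what lets Hadoop parallelize the map phase across the cluster and then aggregate the results in the reduce phase.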

The Cloudera Hadoop training slide deck contains 600+ slides, so there was a lot of material to cover during the 4-day course. Appropriately, Joel maintained a brisk pace in his lectures while still leaving room for questions and discussions. Occasionally some topics were skimmed or skipped entirely in favor of in-depth discussions of the more important topics around MapReduce. I liked Joel's lecture style: he placed just the right emphasis on the details and highlights of important topics and frequently switched to his own lecture slides (which were detailed, accurate, and more up to date) to supplement the Cloudera training material. Early on it was clear that Joel came prepared for the lectures, that he was hands-on (he wrote code snippets), and that he was knowledgeable about the wider industry and related technologies. To keep students engaged in the often dense technical material, Joel injected doses of humor into his lectures and frequently ended lectures on major topics with review questions directed at the trainees. At the end of each lab exercise, Joel went over the solution source code, explaining the details and pointing out the key pieces.

Speaking of lab exercises, the Cloudera Hadoop Developer training webpage states under the Audience and Prerequisites section that "...Knowledge of Java is strongly recommended and is required to complete the hands-on exercises...". This prerequisite is desirable but should not be viewed as a show-stopper. I do not have Java programming experience, and that limitation only slowed me down during the lab exercises. The solutions (i.e., full source code) for the lab exercises were provided within the project directories, and one could refer to them while attempting the exercises, which was a big help for trainees without Java experience. The important thing was to understand how the technique or methodology presented during the lecture was implemented in code.

In summary, the following are my takeaways from the training:

  • Hadoop is a powerful technology, and its processing component, MapReduce, is powerful but also complex; an optimal Hadoop implementation for a real-world problem requires a deep understanding of the various "components" of MapReduce (the driver sketch after this list gives a sense of the pieces involved). The training provides enough material to start writing and debugging MapReduce programs on Hadoop. It helps flatten the learning curve, but, as with any endeavor toward mastery, only deliberate practice will make you a Hadoop expert.
  • Newer Hadoop ecosystem projects such as Pig and Hive hide some of the complexities of MapReduce from the user. The question then is: does a user need to intimately understand MapReduce in order to solve the problem at hand (somewhat akin to needing to intimately understand internal combustion engines in order to drive a car)? In my opinion, for many problems the answer is no. For example, a candidate MapReduce problem could potentially be solved with just a few lines of a Pig script while requiring minimal to no understanding of MapReduce. So, for somebody seeking to learn data analysis, for example, a more beneficial course might be the Cloudera Data Analyst Training, which covers Pig, Hive, and Impala in greater depth than the Developer Training for Hadoop course.
  • More recently, Apache Spark has been gaining popularity for problems requiring iterative computation (e.g., machine learning). The Spark project claims that programs can run up to 100x faster than equivalent Hadoop MapReduce programs when the working data fits in the cluster's memory. For someone attempting to get started in Big Data, the Cloudera Developer Training for Apache Spark may be a more attractive, or at least equally good, alternative to the Developer Training for Hadoop.
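As a concrete illustration of the "components" mentioned above, and of the wiring that Pig and Hive hide, here is a driver class that assembles the Mapper and Reducer from the earlier sketch into a runnable job. Again, this is my own sketch, not the Cloudera lab solution, and it assumes the hypothetical class names from that earlier snippet.

```java
// WordCountDriver.java -- configures and submits the word-count job.
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCountDriver {
    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "word count");
        job.setJarByClass(WordCountDriver.class);

        job.setMapperClass(WordCountMapper.class);
        job.setCombinerClass(WordCountReducer.class);  // combiner: map-side pre-aggregation
        job.setReducerClass(WordCountReducer.class);

        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);

        FileInputFormat.addInputPath(job, new Path(args[0]));    // HDFS input path
        FileOutputFormat.setOutputPath(job, new Path(args[1]));  // output dir must not already exist

        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```

Even for a toy problem you have to reason about key and value types, the combiner, jar packaging, and HDFS paths, whereas the rough Pig equivalent of this whole job is a handful of lines (load the file, tokenize and flatten the lines, group by word, count, store), which is the trade-off the second bullet above is getting at.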

Conclusion

Overall, I found the training useful. The classes held my interest, and I gained insights that I might not have easily discovered on my own by merely reading books or searching the web. Another benefit of attending on-site was the opportunity to interact with the other students, which helped me get a picture of the kinds of problems companies are attempting to solve with Hadoop. For its technical content, immediate applicability, and the high quality of the lecturer, I would happily recommend the Cloudera Developer Training for Hadoop course. But I would also urge anyone considering a training course, especially someone just getting started, to carefully evaluate each option -- the course content, the merits of the underlying technology, and whether use of that technology in the market is growing -- and then sign up for the one that (a) can help solve their problem(s) relatively quickly (without too steep a learning curve), and (b) uses a technology with demonstrated advantages over similar, perhaps competing, technologies.