Category Archives: book review

How to Develop Big Data Applications for Hadoop

This video is a great introduction to implementing Hadoop on Amazon Web Services using Karmasphere Studio.

The video records a morning session at O’Reilly’s Strata 2011 conference, presented by a panel of speakers from Karmasphere, Amazon Web Services, and Concurrent. It comes in three parts totaling 145 minutes, and while the editing could have been better, the content is excellent.

It starts off with the history of Hadoop, the basics of map-reduce infrastructure, and the languages, libraries, and other supporting projects that go with it.

Ken Krugler of Amazon gives an overview of Amazon Web Services (AWS), followed by Chris Wensel of Concurrent talking about their Cascading product.

One of the central ideas of the video is that MapReduce (MR) is too low-level to express anything more than a simple algorithm. Tools such as Karmasphere Studio can help generate the needed boilerplate code when given a higher-level model. Tools that work with these higher-level models include:

  • Cascading, a Java API for combining multiple MR steps into a single data flow
  • Hive, a SQL-like language that can work with almost any file type, including flat files
  • Pig, a language for data analysis
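
To give a feel for why raw MapReduce is considered low-level, here is a minimal sketch of a word count split into map, shuffle, and reduce phases. It is plain Python rather than Hadoop code, and the function names are made up for illustration; none of it comes from the video.

```python
from collections import defaultdict


def map_phase(lines):
    """Emit a (word, 1) pair for every word, as a mapper would."""
    for line in lines:
        for word in line.lower().split():
            yield word, 1


def shuffle_phase(pairs):
    """Group values by key, the step the framework performs between map and reduce."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(grouped):
    """Sum the counts collected for each word, as a reducer would."""
    return {word: sum(counts) for word, counts in grouped.items()}


def word_count(lines):
    """Run the three phases in order on an in-memory list of lines."""
    return reduce_phase(shuffle_phase(map_phase(lines)))


if __name__ == "__main__":
    print(word_count(["big data on hadoop", "big clusters on aws"]))
```

Even this toy job needs three separate pieces, and anything more involved means chaining several such jobs together, which is the gap the higher-level tools aim to fill.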

A case study follows on how Playfish, a company that makes Facebook games, uses Karmasphere Analyst to produce its reports. Every click in a Playfish game is treated as a tuple to be processed, and reports used to take a long time to run. Now, with Analyst and AWS, the reporting has sped up tremendously, letting Playfish respond to trends that much quicker.

Next, a hands-on lab, led by Abe Taha of Karmasphere, was the highlight of the video. It covered:

  • installation of Karmasphere Studio into Eclipse
  • working with the Hadoop perspective to set up clusters
  • using the Java perspective to create various artifacts, like reducers, mappers, and partitioners
  • defining and loading datafiles with Karmasphere Analyst
  • using Hive to implement joins, which are easy in Hive but would be difficult in Java MR
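
The point about joins can be illustrated with a rough sketch of a reduce-side join, written here in plain Python rather than Java MR. The table names, columns, and sample rows are hypothetical and not taken from the lab; the Hive query in the comment is the sort of single statement that replaces all of this.

```python
from collections import defaultdict

# Roughly equivalent HiveQL, assuming hypothetical tables users and clicks:
#   SELECT u.user_id, u.name, c.item
#   FROM users u JOIN clicks c ON (u.user_id = c.user_id);


def map_tagged(users, clicks):
    """Mapper: key every record by user_id and tag it with its source table."""
    for user_id, name in users:
        yield user_id, ("user", name)
    for user_id, item in clicks:
        yield user_id, ("click", item)


def reduce_join(tagged_pairs):
    """Reducer: for each key, pair every user record with every click record."""
    grouped = defaultdict(lambda: {"user": [], "click": []})
    for key, (tag, value) in tagged_pairs:
        grouped[key][tag].append(value)
    joined = []
    for key in sorted(grouped):
        for name in grouped[key]["user"]:
            for item in grouped[key]["click"]:
                joined.append((key, name, item))
    return joined


if __name__ == "__main__":
    users = [(1, "ann"), (2, "bob")]
    clicks = [(1, "farm"), (1, "shop"), (3, "gift")]
    # Keys present in only one table drop out, as in an inner join.
    print(reduce_join(map_tagged(users, clicks)))
```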

The video closes with a Q&A session.

Overall, a great video well worth the time.

This video is available at O’Reilly.



Filed under aws, book review, hadoop

Mining the Social Web by Matthew Russell

Some basic programming ability is a must for this book, as the first page starts with installing the Python development tools. If you don’t know Python, that is okay since all the code is easy to follow. Everything you need to develop and run the examples is described step by step with clear instructions at every point.

Once you get comfortable with the basics, the author quickly moves from topic to topic, giving a good introduction to many aspects of how to mine data and draw useful conclusions. Some of the examples include:

  • accessing your Twitter feed with OAuth,
  • processing feeds to determine influence,
  • using set-wise operations with Redis to determine which of your friends are also followers,
  • storing data in CouchDB,
  • using map-reduce to determine the most popular mentions and topics,
  • natural language processing,
  • and viewing data with various visualization tools.

And that was just for Twitter.
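
As a rough illustration of two items from that list, the sketch below finds friends who are also followers with a set intersection and counts the most frequent @mentions. It is not code from the book: plain Python sets stand in for Redis (which offers the same intersection through its SINTER command), a Counter stands in for the map-reduce step, and the IDs and tweets are invented.

```python
import re
from collections import Counter

# Assumption: a mention is "@" followed by word characters.
MENTION = re.compile(r"@(\w+)")


def mutual_friends(friend_ids, follower_ids):
    """Return the IDs that appear in both collections (set intersection)."""
    return set(friend_ids) & set(follower_ids)


def top_mentions(tweets, n=3):
    """Count @mentions case-insensitively and return the n most common."""
    counts = Counter()
    for text in tweets:
        counts.update(name.lower() for name in MENTION.findall(text))
    return counts.most_common(n)


if __name__ == "__main__":
    print(mutual_friends([1, 2, 3], [2, 3, 4]))
    print(top_mentions(["hi @Ann and @bob", "@ann again", "no mention"], n=1))
```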

The book continues with examples of processing mailboxes, LinkedIn, Google Buzz, blogs, Facebook, and the Semantic Web. The examples show how easy it is to gather and analyze data from all these social web sites.

With a good breadth of coverage, I highly recommend this book for anyone wanting to learn to process and visualize large amounts of data, either from the social web or any other data source.

This book is available online at Amazon, Chapters and O’Reilly.


Filed under book review, facebook, twitter