How to Develop Big Data Applications for Hadoop

This video is a great introduction to implementing Hadoop on Amazon Web Services using Karmasphere Studio.

Recorded at a morning session of the O’Reilly Strata 2011 conference, it features a panel of speakers from Karmasphere, Amazon Web Services, and Concurrent. The video comes in three parts totaling 145 minutes, and while the editing could have been better, the content is excellent.

It starts off with the history of Hadoop, the basics of the MapReduce infrastructure, and the languages, libraries, and other supporting projects that go with it.

Ken Krugler of Amazon gives an overview of Amazon Web Services (AWS), followed by Chris Wensel of Concurrent, who talks about the company’s Cascading product.

One of the central ideas of the video is that MapReduce (MR) is too low-level to express anything more than a simple algorithm. Tools such as Karmasphere Studio can help generate the needed boilerplate code when given a higher-level model. Tools that work with these higher-level models include:

  • Cascading, a Java library for combining multiple MR steps into a single data flow
  • Hive, a SQL-like language that can work with most file types, including flat files
  • Pig, a language for data analysis
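
To give a feel for why raw MR counts as low-level, here is a minimal sketch of the map, shuffle, and reduce steps in plain Python. It is not Hadoop code and was not shown in the video; the function names and sample lines are made up for illustration. Even a simple word count needs three separate pieces.

```python
from collections import defaultdict


def map_phase(line):
    """Mapper: emit a (word, 1) pair for every word in one input line."""
    for word in line.lower().split():
        yield word, 1


def shuffle(pairs):
    """Group values by key, the step a framework performs between map and reduce."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(key, values):
    """Reducer: sum the counts collected for one word."""
    return key, sum(values)


def word_count(lines):
    """Run the three steps in order over an iterable of text lines."""
    pairs = (pair for line in lines for pair in map_phase(line))
    return dict(reduce_phase(k, v) for k, v in shuffle(pairs).items())


# Hypothetical sample input, only for illustration.
counts = word_count(["big data", "big clusters"])
print(counts)  # {'big': 2, 'data': 1, 'clusters': 1}
```

Anything beyond this, such as chaining several jobs or joining two inputs, means writing and wiring up more of these pieces by hand, which is the gap the higher-level tools aim to fill.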

A case study follows on how Playfish, a company that makes games for Facebook, uses Karmasphere Analyst to produce its reports. Every click in a Playfish game is considered a tuple to be processed, and it used to take a long time to run a report. Now, with Analyst and AWS, reporting has sped up tremendously, enabling Playfish to respond to trends much more quickly.

Next, a hands-on lab, led by Abe Taha of Karmasphere, was the highlight of the video. It covered:

  • installation of Karmasphere Studio into Eclipse
  • working with the Hadoop perspective to set up clusters
  • using the Java perspective to create various artifacts, like reducers, mappers, and partitioners
  • defining and loading data files with Karmasphere Analyst
  • using Hive to implement joins, which are easy in Hive but would be difficult in Java MR
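
To illustrate the point about joins, here is a rough sketch of a reduce-side join written in plain Python rather than Java MR. The tables, field names, and helper functions are hypothetical and do not come from the lab. In Hive, the same result would be roughly one statement along the lines of SELECT u.name, c.item FROM users u JOIN clicks c ON (u.user_id = c.user_id).

```python
from collections import defaultdict


def map_users(record):
    """Tag each user row so the reducer can tell which table it came from."""
    user_id, name = record
    return user_id, ("user", name)


def map_clicks(record):
    """Tag each click row the same way, keyed on the join column."""
    user_id, item = record
    return user_id, ("click", item)


def reduce_join(tagged_values):
    """Pair every user name with every click that shares the same key."""
    names = [value for tag, value in tagged_values if tag == "user"]
    items = [value for tag, value in tagged_values if tag == "click"]
    return [(name, item) for name in names for item in items]


def join(users, clicks):
    """Inner join of two lists of rows on user_id, in map/shuffle/reduce style."""
    grouped = defaultdict(list)
    mapped = [map_users(u) for u in users] + [map_clicks(c) for c in clicks]
    for key, value in mapped:
        grouped[key].append(value)
    rows = []
    for key in sorted(grouped):
        rows.extend(reduce_join(grouped[key]))
    return rows


# Hypothetical sample rows, only for illustration.
users = [(1, "ann"), (2, "bob")]
clicks = [(1, "sword"), (2, "shield"), (1, "potion"), (3, "gem")]
joined = join(users, clicks)
print(joined)  # [('ann', 'sword'), ('ann', 'potion'), ('bob', 'shield')]
```

The click for user 3 drops out because no user row shares its key, which matches inner-join behavior. All the tagging and grouping is bookkeeping that a SQL-like layer takes care of for you.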

The session finished off with a Q&A.

Overall, a great video well worth the time.

This video is available at O’Reilly.


Filed under aws, book review, hadoop

Connecting my Android

I got the first Android phone that Samsung put out, the i7500. It didn’t come with any software or drivers, so connecting it to my Vista laptop was a challenge. Then I upgraded the laptop to Windows 7 64-bit, and the connection broke.

The phone is still running Android 1.5 Cupcake since Samsung won’t be updating it, so trying to find some USB drivers for it has been tough. For a few months, the only way I could get music onto the phone and pictures/videos off was to remove the microSD card every time.

I finally managed to find a set of Samsung drivers that was put out by Verizon. I extracted the file but did not run the included setup. Instead, I used the Windows Device Manager and pointed it to where I had extracted the drivers. I had to go back to Device Manager a couple of times until everything was installed.

Now, the process to connect the phone and computer is simple:

  • plug the phone in with the supplied USB cable, and the USB icon appears in the phone’s status bar
  • on the phone, select the USB connected notification, and a dialog comes up asking if you want to mount the device
  • select Mount
  • two devices then appear in Windows

The first device is labeled “MICROSD” and maps to the microSD card in my phone. The other device is labeled “Removable Disk” and maps to the SIM card.

Now, everything works great. My phone might be stuck with an old version of Android, but I like it.

Filed under android

Mining the Social Web by Matthew Russell

Some basic programming ability is a must for this book, as the first page starts with installing the Python development tools. If you don’t know Python, that is okay since all the code is easy to follow. Everything you need to develop and run the examples is described step by step with clear instructions at every point.

Once you get comfortable with the basics, the author quickly moves from topic to topic, giving a good introduction to many aspects of how to mine data and generate useful conclusions. Some of the examples include:

  • accessing your Twitter feed with OAuth,
  • processing feeds to determine influence,
  • using set-wise operations with Redis to determine which of your friends are also followers,
  • storing data in CouchDB,
  • using map-reduce to determine the most popular mentions and topics,
  • natural language processing,
  • and viewing data with various visualization tools.

And that was just for Twitter.
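
As a small taste of two ideas from the Twitter list above, here is a sketch that uses Python's built-in sets as a stand-in for Redis set operations and a Counter as a stand-in for a map-reduce tally of mentions. This is not code from the book; the screen names, tweets, and variable names are invented for illustration.

```python
import re
from collections import Counter

# Hypothetical screen names, only for illustration.
friends = {"alice", "bob", "carol"}      # accounts I follow
followers = {"bob", "carol", "dave"}     # accounts that follow me

# Set intersection: friends who are also followers.
mutual = friends & followers

# Set difference: friends who do not follow back.
not_following_back = friends - followers

# Hypothetical tweets; count how often each @name is mentioned.
tweets = [
    "@bob see you at #strata",
    "thanks @bob and @carol",
    "#hadoop talk with @bob",
]
mentions = Counter(
    name.lower() for tweet in tweets for name in re.findall(r"@(\w+)", tweet)
)
top = mentions.most_common(1)

print(sorted(mutual))              # ['bob', 'carol']
print(sorted(not_following_back))  # ['alice']
print(top)                         # [('bob', 3)]
```

With real data the sets would be far too large to juggle comfortably in a script like this, which is where a store with server-side set operations, such as Redis, earns its place.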

The book continues on with examples of processing mailboxes, LinkedIn, Google Buzz, blogs, Facebook, and the Semantic Web. The examples show how easy it is to gather and analyze data from all these social web sites.

With a good breadth of coverage, I highly recommend this book for anyone wanting to learn to process and visualize large amounts of data, either from the social web or any other data source.

This book is available online at Amazon, Chapters and O’Reilly.

Filed under book review, facebook, twitter

Always wanted to be a programmer

One day back in high school, the math teacher brought something new into class. It was a Commodore PET.

“We are going to spend the next six weeks learning how to program this computer to make it do new and interesting things.” She had me at “program”. I was hooked.

I’ve been a programmer ever since. Even when I’ve been working as a business analyst, classroom teacher, or database designer, I have still felt the need to code. I have learned a few things over the years, so the main reason for this blog is to share them with others who are facing similar challenges. Writing also helps me organize my thoughts, and I hope it will help jump-start my career.

This blog will be a chronicle of my path: the successes, failures, questions, answers and whatever else comes along. Hop on and enjoy the ride!

Filed under ramblings