Introducing Some Short Tutorials for Hadoop
I’ve recently joined Rittman Mead and part of my job will be looking at ‘Big Data’ technologies. This includes looking at how we can apply these technologies to manage big data sets, whether that means lightweight (but large) key-value stores, capturing and moving data, or running batch jobs. My background is primarily in Java development and I’ve spent a lot of time working with many open source tools and open standards that make development easier.
The open source tools that are symbolic of the term ‘Big Data’ are constantly evolving, providing better features and performance. They are unstable, not in terms of quality, but in how quickly their APIs and general best practices change. Fortunately the dust is starting to settle on the core projects, so if you haven’t had time to work with these tools yet, things have only been getting easier and you’re in a good position to start.
Over the next few posts I’ll introduce Hadoop along with a few other open source tools that can be used together to quickly develop applications. Hadoop appears to have a steep learning curve and even installing it can look tricky. In fact it’s quite easy to set up a development environment, and with packages like those from Cloudera it’s becoming much easier and quicker to set up production clusters too.
Our example project will stream data from Twitter using their API and store it in raw form in Hadoop. After that we can look at the ways we can transform and process that data using other tools such as Hive, Pig, HBase and Oracle NoSQL. We’ll also be using some other open source tools along the way, so even if you never use Hadoop I hope you’ll find them interesting.
In the first post we’ll only start up half of Hadoop: the two daemons that provide the distributed file store, HDFS (the NameNode and the DataNode). Another two daemons, the JobTracker and the TaskTracker, make up the MapReduce framework used for running batch processes, and we’ll look at those in a later post.
To add data to the file store we’ll use node.js, the popular server-side JavaScript platform. It isn’t related to Hadoop, but it demonstrates how we can move data between two web services with a dozen lines of code.
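To give a flavour of what that looks like, here’s a minimal sketch using only node.js core modules. It writes a single JSON record into HDFS through the WebHDFS REST interface; the Twitter side, which needs OAuth, is left as a placeholder until the next post. The host, port, user and file path are assumptions based on a default Hadoop 1.x install with WebHDFS switched on (dfs.webhdfs.enabled set to true in hdfs-site.xml), so treat it as the shape of the solution rather than the finished code.

```javascript
// A rough sketch only: write one JSON record into HDFS over the WebHDFS
// REST API using nothing but node.js core modules.
// Assumptions: a Hadoop 1.x NameNode on localhost with its web interface on
// the default port 50070 and WebHDFS enabled; the 'hadoop' user and the
// target path are placeholders for whatever your install uses.
var http = require('http');
var url  = require('url');

function writeToHdfs(path, data) {
  // Step 1: ask the NameNode to create the file. WebHDFS answers with a
  // 307 redirect whose Location header points at a DataNode.
  var createReq = http.request({
    host: 'localhost',
    port: 50070,
    method: 'PUT',
    path: '/webhdfs/v1' + path + '?op=CREATE&user.name=hadoop&overwrite=true'
  }, function (res) {
    res.resume(); // the redirect carries no body we care about
    var dataNode = url.parse(res.headers.location);

    // Step 2: send the actual bytes to the DataNode we were redirected to.
    var writeReq = http.request({
      host: dataNode.hostname,
      port: dataNode.port,
      method: 'PUT',
      path: dataNode.path
    }, function (res2) {
      res2.resume();
      console.log('HDFS write finished with status ' + res2.statusCode);
    });
    writeReq.end(data);
  });
  createReq.end();
}

// The Twitter streaming client is left for the next post; for now just
// write a hard-coded record so we can see the flow end to end.
writeToHdfs('/user/hadoop/twitter/sample.json',
            JSON.stringify({ text: 'hello hadoop' }) + '\n');
```

Reading data back out works the same way: an HTTP GET with op=OPEN, again redirected to a DataNode.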
I’ll be writing the steps against the configuration listed below. Most of the examples should work with other versions of Hadoop and the other big data tools, but for reference, the versions I’ll be using are as follows:
- A CentOS VM with 2GB RAM (the steps also work on Ubuntu and OS X)
- Hadoop 1.0.4
- node.js 0.8.21
- A Twitter API key (free but with volume restrictions) from https://dev.twitter.com/apps/new