Introduction to Hadoop HDFS (and writing to it with node.js)
The core Hadoop project solves two problems with big data - fast, reliable storage and batch processing.
For this first post on Hadoop we are going to focus on the default storage engine and how to integrate with it using its REST API. Hadoop is actually quite easy to install so let’s see what we can do in 15 minutes. I’ve assumed some knowledge of the Unix shell but hopefully it’s not too difficult to follow - the software versions are listed in the previous post.
If you’re completely new to Hadoop, three things worth knowing are...
- The default storage engine is HDFS - a distributed file system with directories and files (ls, mkdir, rm, etc)
- Data written to HDFS is immutable - although there is some support for appends
- HDFS is suited for large files - avoid lots of small files
Starting up a local Hadoop instance for development is pretty simple, and even easier here as we’re only going to start half of it. The only setting that’s required is the host and port where the HDFS master ‘namenode’ will listen, but we’ll also add a property for the location of the filesystem on disk.
After downloading and unpacking Hadoop add the following under the <configuration> tags in core-site.xml...
conf/core-site.xml:
<property>
  <name>fs.default.name</name>
  <value>hdfs://localhost:9000</value>
</property>
<property>
  <name>hadoop.tmp.dir</name>
  <value>/home/${user.name}/hdfs-filesystem</value>
</property>
Add your Hadoop bin directory to the PATH
export PATH=$PWD/hadoop-1.0.4/bin:$PATH
The only other step before starting Hadoop is to format the filesystem...
hadoop namenode -format
Hadoop normally runs with a master and many slaves. The master 'namenode' tracks the location of file blocks and the files they represent and the slave 'datanodes' just store file blocks. To start with we’ll run both a master and a slave on the same machine...
# start the master namenode
hadoop-daemon.sh start namenode

# start a slave datanode
hadoop-daemon.sh start datanode
At this point we can write some data to the filesystem...
hadoop dfs -put /etc/hosts /test/hosts-file
hadoop dfs -ls /test
You can check that the Hadoop daemons are running correctly by running jps (Java ps). The daemons can be shut down quickly with killall java - do this now.
To add data we’ll be using the WebHDFS REST API with node.js, as both are simple but still fast.
First we need to enable the WebHDFS and Append features in HDFS. Append has some issues and has been disabled in 1.1.x so make sure you are using 1.0.4. It should be back in 2.x and should be fine for our use - this is what Hadoop development is like! Add the following properties...
conf/hdfs-site.xml:
<property>
  <name>dfs.webhdfs.enabled</name>
  <value>true</value>
</property>
<property>
  <name>dfs.support.append</name>
  <value>true</value>
</property>
Restart HDFS...
killall java
hadoop-daemon.sh start namenode && hadoop-daemon.sh start datanode
Before loading data we need to create the file that will store the JSON. We’ll append all incoming data to this file...
hadoop dfs -touchz /test/feed.data
hadoop dfs -ls /test
If you see the message "Name node is in safe mode" then just wait for a minute as the namenode is still starting up.
Next download node.js (http://nodejs.org/download/) - if you’re using Unix you can ‘export PATH’ in the same way we did for Hadoop.
export PATH=$PWD/node-v0.10.0-linux-x64/bin/:$PATH
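With node on the PATH it’s worth a quick sanity check that we can reach WebHDFS using nothing but the built-in http module. This is just a sketch of my own, not part of the feed script - it assumes the default namenode web port (50070, the same one the script below uses) and the /test directory we created earlier...

// check-webhdfs.js - quick sanity check of the raw WebHDFS REST API
var http = require("http");

// list the contents of /test; user.name is WebHDFS's simple authentication parameter
var url = "http://localhost:50070/webhdfs/v1/test?op=LISTSTATUS&user.name=" + process.env.USER;

http.get(url, function (res) {
  var body = "";
  res.on("data", function (chunk) { body += chunk; });
  res.on("end", function () {
    // prints a JSON FileStatuses document describing the files under /test
    console.log(body);
  });
}).on("error", function (err) {
  console.log("WebHDFS not reachable: " + err.message);
});

The append calls we make later work the same way under the hood - an op=APPEND request to the namenode that gets redirected to a datanode, which accepts the data.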
Scripting in node.js is very quick thanks to the large number of packages developed by users. Obviously the quality can vary but for quick prototypes there always seems to be a package for anything. All we need to start is an empty directory where the packages and our script will be installed. I’ve picked three packages that will help us...
mkdir hdfs-example
cd hdfs-example
npm install node-webhdfs
npm install ntwitter
npm install syncqueue
The webhdfs and twitter packages are obvious, but I’ve also used the syncqueue package so that only one append command is sent at a time - JavaScript is asynchronous. To use these, create a file named twitter.js and add...
var hdfs = new (require("node-webhdfs")).WebHDFSClient({
  user: process.env.USER,
  namenode_host: "localhost",
  namenode_port: 50070
});
var twitter = require("ntwitter");
var SyncQueue = require("syncqueue");

var hdfsFile = "/test/feed.data";

// make appending synchronous
var queue = new SyncQueue();

// get your developer keys from: https://dev.twitter.com/apps/new
var twit = new twitter({
  consumer_key: "keykeykeykeykeykey",
  consumer_secret: "secretsecretsecretsecret",
  access_token_key: "keykeykeykeykeykey",
  access_token_secret: "secretsecretsecretsecret"
});

twit.stream("statuses/filter", {"track": "hadoop,big data"}, function (stream) {
  stream.on("data", function (data) {
    queue.push(function (done) {
      console.log(data.text);
      hdfs.append(hdfsFile, JSON.stringify(data), function (err, success) {
        if (err instanceof Error) { console.log(err); }
        done();
      });
    });
  });
});
And run it...

node twitter.js
Now sit back and watch the data flow - here we’re filtering on “hadoop,big data” but you might want to choose a different query or even a different source - e.g. tail a local log file, call a web service, or run a webserver.
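As a rough sketch of that last idea (mine, not from the original post), tailing a local log file into the same HDFS file could look like this, reusing the node-webhdfs client and syncqueue pattern from above - the /tmp/app.log path is just a placeholder for any file that’s being written to...

// tail-to-hdfs.js - rough sketch: append new lines from a growing local file to HDFS
var fs = require("fs");
var SyncQueue = require("syncqueue");
var hdfs = new (require("node-webhdfs")).WebHDFSClient({
  user: process.env.USER,
  namenode_host: "localhost",
  namenode_port: 50070
});

var localFile = "/tmp/app.log"; // placeholder - point this at a real log file
var hdfsFile = "/test/feed.data";
var queue = new SyncQueue();
var offset = fs.existsSync(localFile) ? fs.statSync(localFile).size : 0;

// poll the file once a second and append anything written since the last check
fs.watchFile(localFile, { interval: 1000 }, function (curr, prev) {
  if (curr.size <= offset) { return; }
  var stream = fs.createReadStream(localFile, { start: offset, end: curr.size - 1 });
  offset = curr.size;
  var chunks = [];
  stream.on("data", function (chunk) { chunks.push(chunk); });
  stream.on("end", function () {
    queue.push(function (done) {
      hdfs.append(hdfsFile, Buffer.concat(chunks).toString(), function (err, success) {
        if (err instanceof Error) { console.log(err); }
        done();
      });
    });
  });
});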