Archive for the ‘hadoop’ Category

Installing Hadoop-LZO compression module RPMs

Recently I wrote about compiling and installing LZO support for Hadoop.

But now I have found RPMs for this from Cloudera. Strangely, they aren't mentioned anywhere except in “Installing-and-Using-Impala”. So it's much simpler to install now.

# This is for RHEL/CentOS. For Debian, Ubuntu and SLES/SUSE see http://archive.cloudera.com/gplextras/
cd /etc/yum.repos.d/ && wget http://archive.cloudera.com/gplextras/redhat/6/x86_64/gplextras/cloudera-gplextras4.repo
yum install hadoop-lzo-cdh4 hadoop-lzo-cdh4-mr1

It's really as simple as that! This installs LZO for Hadoop and for MapReduce to /usr/lib/hadoop/lib/ and /usr/lib/hadoop-0.20-mapreduce/lib, and also puts the native libraries into the native/ paths.
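To confirm that the packages actually put their files where expected, a small check like the following can help. It is only a sketch: the file name patterns `hadoop-lzo*.jar` and `libgplcompression*` are my assumptions about what the RPMs install, and `find_lzo_files` is a hypothetical helper name, so adjust them if your system differs.

```shell
#!/bin/sh
# Print every file that looks like a hadoop-lzo jar or native library
# anywhere below the directories given as arguments.
# Directories that do not exist are skipped silently.
find_lzo_files() {
    for dir in "$@"; do
        [ -d "$dir" ] || continue
        # The name patterns are assumptions, not taken from the RPM manifest.
        find "$dir" \( -name 'hadoop-lzo*.jar' -o -name 'libgplcompression*' \) -print
    done
}

# The two library directories mentioned above; native/ lives below them.
find_lzo_files /usr/lib/hadoop/lib /usr/lib/hadoop-0.20-mapreduce/lib
```

If nothing is printed, either the packages are not installed or the files use different names than assumed here.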

See my older blog post for the configuration settings that still need to be made. Just skip the compilation part.
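A quick way to see whether that configuration is already in place is to look for the LZO codec class in core-site.xml. The sketch below assumes the config file lives at /etc/hadoop/conf/core-site.xml and that the codec is registered under its usual class name `com.hadoop.compression.lzo.LzoCodec`; neither detail is stated in this post, and `lzo_codec_configured` is a hypothetical helper name.

```shell
#!/bin/sh
# Succeed if the given Hadoop config file mentions the LZO codec class.
lzo_codec_configured() {
    grep -q 'com\.hadoop\.compression\.lzo\.LzoCodec' "$1"
}

# Assumed default location of the client configuration.
conf=/etc/hadoop/conf/core-site.xml
if [ -f "$conf" ] && lzo_codec_configured "$conf"; then
    echo "LZO codec is registered in $conf"
else
    echo "LZO codec not found in $conf"
fi
```

This only checks that the class name appears in the file. It does not verify that the native library can be loaded at runtime.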

Compiling and installing Hadoop-LZO compression support module

If you want to benefit from splittable LZO compression in Hadoop, you have to build it yourself. For licensing reasons, the module isn't shipped with Apache Hadoop or Cloudera's distribution.

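LZO compresses and decompresses significantly faster than GZip, at the cost of a somewhat lower compression ratio. Its speed is in the same range as Snappy's, but unlike a Snappy-compressed text file, an indexed LZO file can be split across mappers.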

The original work is located at http://code.google.com/p/hadoop-gpl-compression/; however, there are two improved forks that are kept in sync. We will be using Todd Lipcon's fork. He is employed at Cloudera, so his fork is the closest to the CDH4 stack.

I am writing this short article because all the installation articles for hadoop-lzo I found were longer than they needed to be. So here we go:

yum install lzo-devel ant java-1.7.0-openjdk-devel gcc
cd /usr/local/src
git clone https://github.com/toddlipcon/hadoop-lzo.git
cd hadoop-lzo
