Install Apache Hadoop 2.7 (on *buntu 16.04)

For this tutorial, I'll use a VM running Ubuntu Server 16.04 (64-bit), relying on VirtualBox 5.1.4 for the virtualization.

The guest system setup is as follows:

  • Both cores of my i5-6200U
  • 4096 MB of RAM (although 1024 MB should be enough)
  • A dynamically allocated 10 GB VDI hard disk (5 GB is the bare minimum)
  • Ubuntu Server 16.04 x64 ISO file (but any *buntu flavour should be fine)
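
If you prefer the command line to the VirtualBox GUI, the VM can also be created on the host with something along these lines (a sketch: the hadoop-vm name and the disk file path are my own choices):

VBoxManage createvm --name hadoop-vm --ostype Ubuntu_64 --register
VBoxManage modifyvm hadoop-vm --cpus 2 --memory 4096
VBoxManage createhd --filename hadoop-vm.vdi --size 10240
VBoxManage storagectl hadoop-vm --name SATA --add sata
VBoxManage storageattach hadoop-vm --storagectl SATA --port 0 --device 0 --type hdd --medium hadoop-vm.vdi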

Notes

When you read a line like this:

jdoe@farlands ~ $ echo "Hello, world!"

I mean a bash prompt without root privileges, where jdoe is the username and farlands is the hostname.

On the other hand, when the line is like this:

farlands % echo "Hello, world!"

I mean a bash prompt with root privileges.

OK, let's start: run the guest OS installation with the default values, then let's jump into the Hadoop headaches.

Update the guest system

Open up a terminal and run these commands to update the repositories and upgrade the guest system.

farlands % apt update
farlands % apt upgrade -y

Java 8

We're going to use a precompiled and prepackaged version of Oracle Java 8 from the Webupd8 PPA, to avoid further difficulties.

Open up the usual terminal and input:

farlands % apt purge 'openjdk*'
farlands % add-apt-repository -y ppa:webupd8team/java
farlands % apt update
farlands % apt install -y oracle-java8-installer

You can verify the Java version by typing:

farlands % java -version
java version "1.8.0_60"
Java(TM) SE Runtime Environment (build 1.8.0_60-b27)
Java HotSpot(TM) 64-Bit Server VM (build 25.60-b23, mixed mode)

If you see similar output, this step is complete.

Next, we need to create the JAVA_HOME environment variable, so that Hadoop can find the Java executables.

farlands % echo "export JAVA_HOME=/usr" >> /etc/profile
farlands % source /etc/profile
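
To double check that the variable is actually visible in a new shell (the expected output of the first command is shown):

farlands % echo $JAVA_HOME
/usr
farlands % $JAVA_HOME/bin/java -version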

Disable IPv6

Apache Hadoop supports IPv4 only, so let's disable IPv6 through the kernel parameters.

Open the file /etc/sysctl.conf:

farlands % editor /etc/sysctl.conf

And append to the end:

# Disable IPv6
net.ipv6.conf.all.disable_ipv6 = 1
net.ipv6.conf.default.disable_ipv6 = 1
net.ipv6.conf.lo.disable_ipv6 = 1

Then reboot:

farlands % reboot
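
After the reboot, you can verify that IPv6 is really off: the kernel flag should read 1.

farlands % cat /proc/sys/net/ipv6/conf/all/disable_ipv6
1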

Configure SSH keys

We want to run our setup under a dedicated user rather than a general-purpose one, so we will create a hadoopuser user and a hadoopgroup group.

farlands % addgroup hadoopgroup
farlands % adduser --ingroup hadoopgroup hadoopuser

We need SSH access to our machine, so let's install and enable an OpenSSH server.

farlands % apt install ssh
farlands % systemctl enable ssh
farlands % systemctl start ssh

Now we need to set up passwordless SSH by means of cryptographic keys. First, we switch to the hadoopuser account, then we create the key pair using RSA and finally we authorize the key for the current user.

farlands % su - hadoopuser
hadoopuser@farlands ~ $ ssh-keygen -t rsa -P ""
hadoopuser@farlands ~ $ cat ~/.ssh/id_rsa.pub >> ~/.ssh/authorized_keys
hadoopuser@farlands ~ $ chmod 600 ~/.ssh/authorized_keys
hadoopuser@farlands ~ $ ssh-copy-id -i ~/.ssh/id_rsa.pub localhost
hadoopuser@farlands ~ $ ssh localhost

If no password was asked on SSH login, you have successfully configured passwordless SSH for the hadoopuser user.
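
As a further check, a one-shot remote command should also run without prompting for anything:

hadoopuser@farlands ~ $ ssh localhost hostname
farlands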

Install Hadoop

We are ready to install Hadoop. Unfortunately, it does not come prepackaged: we have to download the tarball, extract it and move it to /usr/local.

farlands % wget http://it.apache.contactlab.it/hadoop/common/hadoop-2.7.3/hadoop-2.7.3.tar.gz
farlands % tar xzf hadoop-2.7.3.tar.gz
farlands % rm -rf hadoop-2.7.3.tar.gz
farlands % mv hadoop-2.7.3 /usr/local
farlands % ln -sf /usr/local/hadoop-2.7.3/ /usr/local/hadoop
farlands % chown -R hadoopuser:hadoopgroup /usr/local/hadoop-2.7.3/
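
A quick sanity check that the symlink and the ownership are as expected:

farlands % ls -ld /usr/local/hadoop /usr/local/hadoop-2.7.3

The first entry should be a symlink pointing to /usr/local/hadoop-2.7.3/, and the second should be owned by hadoopuser:hadoopgroup.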

Now we need to configure some environment variables for the hadoopuser account. Switch to that account and edit ~/.bashrc:

hadoopuser@farlands ~ $ editor ~/.bashrc

Append at the end:

# Hadoop config
export HADOOP_PREFIX=/usr/local/hadoop
export HADOOP_HOME=/usr/local/hadoop
export HADOOP_MAPRED_HOME=${HADOOP_HOME}
export HADOOP_COMMON_HOME=${HADOOP_HOME}
export HADOOP_HDFS_HOME=${HADOOP_HOME}
export YARN_HOME=${HADOOP_HOME}
export HADOOP_CONF_DIR=${HADOOP_HOME}/etc/hadoop
# Native path
export HADOOP_COMMON_LIB_NATIVE_DIR=${HADOOP_PREFIX}/lib/native
export HADOOP_OPTS="-Djava.library.path=$HADOOP_PREFIX/lib/native"
# Java path
export JAVA_HOME="/usr"
# OS path
export PATH=$PATH:$HADOOP_HOME/bin:$JAVA_HOME/bin:$HADOOP_HOME/sbin

Next, source ~/.bashrc to apply changes.

hadoopuser@farlands ~ $ source ~/.bashrc
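
If everything went well, the hadoop executable is now on the PATH:

hadoopuser@farlands ~ $ which hadoop
/usr/local/hadoop/bin/hadoop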

Now we need to edit /usr/local/hadoop/etc/hadoop/hadoop-env.sh:

farlands % editor /usr/local/hadoop/etc/hadoop/hadoop-env.sh

And add this at the end:

export JAVA_HOME="/usr"

Configure Hadoop

Hadoop configuration is the hardest part, because there are a lot of config files. We need to navigate to /usr/local/hadoop/etc/hadoop and edit these files:

  • core-site.xml
  • hdfs-site.xml
  • mapred-site.xml (needs to be copied from mapred-site.xml.template, as shown below)
  • yarn-site.xml

They are all XML files with a top-level <configuration> node; for clarity, only the configuration node is reported below.
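
The only one of the four that does not exist out of the box is mapred-site.xml, so create it from the bundled template first:

hadoopuser@farlands ~ $ cp /usr/local/hadoop/etc/hadoop/mapred-site.xml.template /usr/local/hadoop/etc/hadoop/mapred-site.xml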

core-site.xml

<configuration>
  <property>
    <name>fs.default.name</name>
    <value>hdfs://localhost:9000</value>
  </property>
</configuration>

hdfs-site.xml

<configuration>
  <property>
    <name>dfs.replication</name>
    <value>1</value>
  </property>

  <property>
    <name>dfs.name.dir</name>
    <value>file:/usr/local/hadoop/hadoopdata/hdfs/namenode</value>
  </property>

  <property>
    <name>dfs.data.dir</name>
    <value>file:/usr/local/hadoop/hadoopdata/hdfs/datanode</value>
  </property>
</configuration>

mapred-site.xml

<configuration>
  <property>
    <name>mapreduce.framework.name</name>
    <value>yarn</value>
  </property>
</configuration>

yarn-site.xml

<configuration>
  <property>
    <name>yarn.nodemanager.aux-services</name>
    <value>mapreduce_shuffle</value>
  </property>
</configuration>
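
A single malformed tag will make the daemons fail at startup, so it's worth checking that the four files are well-formed XML. A minimal sketch using xmllint (assuming the libxml2-utils package, which is not installed by default):

farlands % apt install -y libxml2-utils
farlands % cd /usr/local/hadoop/etc/hadoop
farlands % xmllint --noout core-site.xml hdfs-site.xml mapred-site.xml yarn-site.xml

If xmllint prints nothing, the files are well-formed.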

Format namenode

Next, we need to format the namenode filesystem with the following command:

hadoopuser@farlands ~ $ hdfs namenode -format

Scan the output: if you can read a line like this:

INFO common.Storage: Storage directory /usr/local/hadoop/hadoopdata/hdfs/namenode has been successfully formatted.

It's done.
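
You can also peek at the freshly formatted directory: it should contain a current/ subdirectory with a VERSION file and an initial fsimage (the exact file names are indicative):

hadoopuser@farlands ~ $ ls /usr/local/hadoop/hadoopdata/hdfs/namenode/current
fsimage_0000000000000000000  fsimage_0000000000000000000.md5  seen_txid  VERSION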

Start and stop services

Now, the last thing to do is to start the Hadoop services:

hadoopuser@farlands ~ $ start-dfs.sh
hadoopuser@farlands ~ $ start-yarn.sh

To check the status of the services use the jps command:

hadoopuser@farlands ~ $ jps
26899 Jps
26216 SecondaryNameNode
25912 NameNode
26041 DataNode
26378 ResourceManager
26494 NodeManager
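
With all five daemons up, you can optionally run a quick smoke test with the examples jar that ships with the distribution; the pi job below estimates π using 2 map tasks with 5 samples each:

hadoopuser@farlands ~ $ hadoop jar /usr/local/hadoop/share/hadoop/mapreduce/hadoop-mapreduce-examples-2.7.3.jar pi 2 5

The HDFS and YARN web interfaces should also be reachable at http://localhost:50070 and http://localhost:8088 respectively.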

To stop the services, use these commands:

hadoopuser@farlands ~ $ stop-dfs.sh
hadoopuser@farlands ~ $ stop-yarn.sh

Congratulations, you made it!