If you are interested in Hadoop, read more here.
For this tutorial, I'll use a VM with Ubuntu Server 16.04, 64 bit version, relying on VirtualBox 5.1.4 for the virtualization.
The guest system setup is as follows:
- Both cores of my i5-6200U
- 4096 MB of RAM (although 1024 MB should be enough)
- A dynamically allocated 10 GB VDI hard disk (5 GB is the minimum)
- The Ubuntu Server 16.04 x64 ISO file (but any *buntu flavour should be fine)
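If you prefer to create the VM from the command line, here is a minimal sketch using VBoxManage, to be run on the host (the VM name hadoop-vm is just an example; the disk is dynamically allocated by default):
VBoxManage createvm --name "hadoop-vm" --ostype Ubuntu_64 --register
VBoxManage modifyvm "hadoop-vm" --cpus 2 --memory 4096
VBoxManage createhd --filename hadoop-vm.vdi --size 10240
You still have to attach the disk and the ISO (via the GUI or VBoxManage storageattach) before the first boot.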
Notes
When you read a line like this:
jdoe@farlands ~ $ echo "Hello, world!"
I imply a bash prompt without root privileges, where jdoe is the username and farlands is the hostname.
On the other hand, when the line is like this:
farlands % echo "Hello, world!"
I imply a bash prompt with root privileges.
OK, let's start: run the guest OS installation with the default values, then let's jump to the Hadoop headaches.
Update the guest system
Open up a terminal and run these commands to update the repositories and upgrade the guest system.
farlands % apt update
farlands % apt upgrade -y
Java 8
We're going to use a precompiled and prepackaged version of Oracle Java 8 in the Webupd8 repo, to avoid further difficulties.
Open up the usual terminal and input:
farlands % apt purge openjdk*
farlands % add-apt-repository -y ppa:webupd8team/java
farlands % apt update
farlands % apt install -y oracle-java8-installer
You can verify Java version by typing:
farlands % java -version
java version "1.8.0_60"
Java(TM) SE Runtime Environment (build 1.8.0_60-b27)
Java HotSpot(TM) 64-Bit Server VM (build 25.60-b23, mixed mode)
If you read a similar output, you completed this step.
Next, we need to create the JAVA_HOME environment variable, so that Hadoop is able to find the Java executables.
farlands % echo "export JAVA_HOME=/usr" >> /etc/profile
farlands % source /etc/profile
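To double check that the variable is visible, run:
farlands % echo $JAVA_HOME
/usr
farlands % $JAVA_HOME/bin/java -version
If the second command prints the same version banner as above, Hadoop will be able to find Java.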
Disable IPv6
Apache Hadoop supports only IPv4, so let's disable IPv6 in the kernel parameters.
Open the file /etc/sysctl.conf:
farlands % editor /etc/sysctl.conf
And append to the end:
# Disable IPv6
net.ipv6.conf.all.disable_ipv6 = 1
net.ipv6.conf.default.disable_ipv6 = 1
net.ipv6.conf.lo.disable_ipv6 = 1
Then reboot:
farlands % reboot
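If you want to skip the reboot, sysctl can reload the file on the fly. Either way, you can verify that IPv6 is really disabled; the expected output is 1:
farlands % sysctl -p
farlands % cat /proc/sys/net/ipv6/conf/all/disable_ipv6
1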
Configure SSH keys
We want to run our setup under a dedicated user rather than a general purpose one, so we will create a hadoopuser user and a hadoopgroup group.
farlands % addgroup hadoopgroup
farlands % adduser --ingroup hadoopgroup hadoopuser
We need ssh access to our machine, so let's install and start an OpenSSH server.
farlands % apt install ssh
farlands % systemctl enable ssh
farlands % systemctl start ssh
Now we need to set up passwordless SSH by means of crypto keys. First, we switch to the hadoopuser account, then we create a key pair using RSA encryption and finally we authorize the key for the current user.
farlands % su - hadoopuser
hadoopuser@farlands ~ $ ssh-keygen -t rsa -P ""
hadoopuser@farlands ~ $ cat ~/.ssh/id_rsa.pub >> ~/.ssh/authorized_keys
hadoopuser@farlands ~ $ chmod 600 ~/.ssh/authorized_keys
hadoopuser@farlands ~ $ ssh-copy-id -i ~/.ssh/id_rsa.pub localhost
hadoopuser@farlands ~ $ ssh localhost
If no password was asked on SSH login, you successfully configured passwordless SSH for the user hadoopuser.
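A quick way to prove it: with BatchMode enabled, ssh fails instead of prompting for a password, so the following command succeeds only if the keys are in place.
hadoopuser@farlands ~ $ ssh -o BatchMode=yes localhost echo "passwordless ok"
passwordless ok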
Install Hadoop
We are ready to install Hadoop. Unfortunately, it does not come prepackaged: we have to download the tarball, extract it and move it to /usr/local.
farlands % wget http://it.apache.contactlab.it/hadoop/common/hadoop-2.7.3/hadoop-2.7.3.tar.gz
farlands % tar xzf hadoop-2.7.3.tar.gz
farlands % rm -rf hadoop-2.7.3.tar.gz
farlands % mv hadoop-2.7.3 /usr/local
farlands % ln -sf /usr/local/hadoop-2.7.3/ /usr/local/hadoop
farlands % chown -R hadoopuser:hadoopgroup /usr/local/hadoop-2.7.3/
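Before moving on, a quick smoke test, using the full path since we have not set up the environment yet (the first line of the output should report the version):
farlands % /usr/local/hadoop/bin/hadoop version
Hadoop 2.7.3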
Now we need to configure some environment variables for the hadoopuser account. Switch to that account and edit ~/.bashrc:
hadoopuser@farlands ~ $ editor ~/.bashrc
Append at the end:
# Hadoop config
export HADOOP_PREFIX=/usr/local/hadoop
export HADOOP_HOME=/usr/local/hadoop
export HADOOP_MAPRED_HOME=${HADOOP_HOME}
export HADOOP_COMMON_HOME=${HADOOP_HOME}
export HADOOP_HDFS_HOME=${HADOOP_HOME}
export YARN_HOME=${HADOOP_HOME}
export HADOOP_CONF_DIR=${HADOOP_HOME}/etc/hadoop
# Native path
export HADOOP_COMMON_LIB_NATIVE_DIR=${HADOOP_PREFIX}/lib/native
export HADOOP_OPTS="-Djava.library.path=$HADOOP_PREFIX/lib/native"
# Java path
export JAVA_HOME="/usr"
# OS path
export PATH=$PATH:$HADOOP_HOME/bin:$JAVA_HOME/bin:$HADOOP_HOME/sbin
Next, source ~/.bashrc to apply the changes.
hadoopuser@farlands ~ $ source ~/.bashrc
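If everything is in place, the shell should now resolve the hadoop binary through the updated PATH:
hadoopuser@farlands ~ $ which hadoop
/usr/local/hadoop/bin/hadoop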
Now we need to edit /usr/local/hadoop/etc/hadoop/hadoop-env.sh:
farlands % editor /usr/local/hadoop/etc/hadoop/hadoop-env.sh
And add this at the end:
export JAVA_HOME="/usr"
Configure Hadoop
Hadoop configuration is quite tedious, because it is spread across several config files. We need to navigate to /usr/local/hadoop/etc/hadoop and edit these files:
- core-site.xml
- hdfs-site.xml
- mapred-site.xml (needs to be copied from mapred-site.xml.template, as shown below)
- yarn-site.xml
They are all XML files with a top-level <configuration> node. For clarity we report the configuration node only.
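For mapred-site.xml, first create it from the bundled template:
farlands % cd /usr/local/hadoop/etc/hadoop
farlands % cp mapred-site.xml.template mapred-site.xml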
core-site.xml
<configuration>
<property>
<name>fs.default.name</name>
<value>hdfs://localhost:9000</value>
</property>
</configuration>
hdfs-site.xml
<configuration>
<property>
<name>dfs.replication</name>
<value>1</value>
</property>
<property>
<name>dfs.name.dir</name>
<value>file:/usr/local/hadoop/hadoopdata/hdfs/namenode</value>
</property>
<property>
<name>dfs.data.dir</name>
<value>file:/usr/local/hadoop/hadoopdata/hdfs/datanode</value>
</property>
</configuration>
mapred-site.xml
<configuration>
<property>
<name>mapreduce.framework.name</name>
<value>yarn</value>
</property>
</configuration>
yarn-site.xml
<configuration>
<property>
<name>yarn.nodemanager.aux-services</name>
<value>mapreduce_shuffle</value>
</property>
</configuration>
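Hadoop can create the directories referenced in hdfs-site.xml on its own, since hadoopuser owns the whole tree, but creating them explicitly does not hurt:
farlands % mkdir -p /usr/local/hadoop/hadoopdata/hdfs/namenode
farlands % mkdir -p /usr/local/hadoop/hadoopdata/hdfs/datanode
farlands % chown -R hadoopuser:hadoopgroup /usr/local/hadoop/hadoopdata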
Format namenode
Next, we need to format the namenode filesystem with the following command:
hadoopuser@farlands ~ $ hdfs namenode -format
Scan the output: if you can see a line like this:
INFO common.Storage: Storage directory /usr/local/hadoop/hadoopdata/hdfs/namenode has been successfully formatted.
It's done.
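You can also peek at the freshly formatted directory: it should contain a current subdirectory holding, among others, a VERSION file and an initial fsimage.
hadoopuser@farlands ~ $ ls /usr/local/hadoop/hadoopdata/hdfs/namenode/current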
Start and stop services
Now, the last thing to do is to start the Hadoop services:
hadoopuser@farlands ~ $ start-dfs.sh
hadoopuser@farlands ~ $ start-yarn.sh
To check the status of the services, use the jps command:
hadoopuser@farlands ~ $ jps
26899 Jps
26216 SecondaryNameNode
25912 NameNode
26041 DataNode
26378 ResourceManager
26494 NodeManager
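Another sanity check: Hadoop 2.7.3 exposes a NameNode web UI on port 50070 and a ResourceManager web UI on port 8088. Since wget is already available, we can probe them from the terminal (or point a browser at them):
hadoopuser@farlands ~ $ wget -q --spider http://localhost:50070 && echo "NameNode UI up"
NameNode UI up
hadoopuser@farlands ~ $ wget -q --spider http://localhost:8088 && echo "ResourceManager UI up"
ResourceManager UI up
Finally, for an end-to-end test, you can run one of the example jobs shipped with Hadoop, like the pi estimator with two mappers and five samples each:
hadoopuser@farlands ~ $ hadoop jar $HADOOP_HOME/share/hadoop/mapreduce/hadoop-mapreduce-examples-2.7.3.jar pi 2 5
If the job completes with an estimated value of Pi, the whole stack is working.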
To stop the services, use these commands:
hadoopuser@farlands ~ $ stop-dfs.sh
hadoopuser@farlands ~ $ stop-yarn.sh
Congratulations, you made it!