Hadoop Installation Guide

Run every command in this guide from your home directory. Running cd with no arguments takes you there.

This guide uses hadoop as the username. Replace it with your own system username (which should be your SRN).

Change every /home/hadoop/ in the commands and configuration files to /home/<your username>/
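If you are unsure what your paths should look like, the shell can tell you. The lines below are a minimal sketch: the variable names hadoop_path and my_path are only for illustration, and the sed call simply swaps the /home/hadoop prefix for your own home directory.

```shell
# $HOME already expands to /home/<your username>, so it can stand in for
# the hard-coded /home/hadoop prefix used throughout this guide.
echo "Logged in as: $(id -un)"
echo "Home directory: $HOME"

# Illustrative only: rewrite one path from this guide for the current user.
hadoop_path="/home/hadoop/tmpdata"
my_path="$(printf '%s\n' "$hadoop_path" | sed "s|^/home/hadoop|$HOME|")"
echo "Use this instead of $hadoop_path: $my_path"
```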

Start by updating your system with the following commands

cd
sudo apt update -y
sudo apt upgrade -y

Install Java

Hadoop 3.2.2 runs on Java 8, so we will install that version

sudo apt install openjdk-8-jdk -y

Check your Java versions with the following commands

java -version
javac -version

Setup SSH

We now need to set up passwordless SSH, which the Hadoop start scripts use to launch the daemons

sudo apt install openssh-server openssh-client -y

Enable passwordless SSH

Generate an RSA key pair with an empty passphrase and store it in ~/.ssh/id_rsa. Then use the cat command to append the public key to authorized_keys in the .ssh directory. Finally, restrict the permissions on that file so that only you can read and write it.

ssh-keygen -t rsa -P '' -f ~/.ssh/id_rsa
cat ~/.ssh/id_rsa.pub >> ~/.ssh/authorized_keys
chmod 0600 ~/.ssh/authorized_keys

Verify passwordless SSH with

ssh localhost

Type exit to quit SSH.
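If ssh localhost still asks for a password, file permissions are a common cause. The sketch below checks for the usual restrictive modes (700 on ~/.ssh and 600 on authorized_keys). The helper name check_mode is made up for this example, and it assumes GNU stat as shipped with Ubuntu.

```shell
# Compare a path's octal permission bits with the expected value.
check_mode() {
  # $1 = path, $2 = expected octal mode
  actual="$(stat -c '%a' "$1" 2>/dev/null)" || { echo "missing: $1"; return 1; }
  if [ "$actual" = "$2" ]; then
    echo "ok: $1 ($actual)"
  else
    echo "fix: $1 is $actual, expected $2"
    return 1
  fi
}

# "|| true" keeps going so that both results are always reported.
check_mode "$HOME/.ssh" 700 || true
check_mode "$HOME/.ssh/authorized_keys" 600 || true
```

For a check that never stops to prompt, ssh -o BatchMode=yes localhost true fails immediately instead of asking for a password. Run it only after you have connected once with plain ssh localhost and accepted the host key.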

Downloading Hadoop

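Get the download URL from any Apache mirror, then download and extract Hadoop using the following commands. If the URL below no longer resolves, older releases are kept under archive.apache.org/dist/hadoop/common/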

wget https://downloads.apache.org/hadoop/common/hadoop-3.2.2/hadoop-3.2.2.tar.gz
tar xzf hadoop-3.2.2.tar.gz
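It is worth checking the download before extracting it. Apache releases are normally published with a .sha512 file next to the tarball. The sketch below compares a file against a digest you copy from that file. The function name verify_sha512 is an assumption of this example, and no real digest is shown here.

```shell
# Return success when the file's SHA-512 digest equals the expected hex string.
verify_sha512() {
  # $1 = file to check, $2 = expected digest (upper or lower case hex)
  actual="$(sha512sum "$1" | cut -d ' ' -f 1)"
  expected="$(printf '%s' "$2" | tr 'A-F' 'a-f')"
  [ "$actual" = "$expected" ]
}

# Usage, with a placeholder where the published digest goes:
#   verify_sha512 hadoop-3.2.2.tar.gz "<digest from the .sha512 file>" \
#     && echo "checksum OK" || echo "checksum MISMATCH, download again"
```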

Single Node Deployment

This setup, also called pseudo-distributed mode, runs every Hadoop daemon on one machine, each in its own separate Java process. A Hadoop environment is configured by editing a set of configuration files:

  • .bashrc
  • hadoop-env.sh
  • core-site.xml
  • hdfs-site.xml
  • mapred-site.xml
  • yarn-site.xml

Before proceeding, we need to create the storage directories for the NameNode and the DataNode, plus a temporary directory, and give them the required ownership.

cd
mkdir dfsdata
mkdir tmpdata
mkdir dfsdata/datanode
mkdir dfsdata/namenode

Set the owner of these directories using the following commands. Remember to replace hadoop with your username.

sudo chown -R hadoop:hadoop /home/hadoop/dfsdata/
sudo chown -R hadoop:hadoop /home/hadoop/dfsdata/datanode/
sudo chown -R hadoop:hadoop /home/hadoop/dfsdata/namenode/
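The directory steps above can also be sketched as one re-runnable function. This is only an illustration: the name make_hadoop_dirs and its optional base-directory argument are assumptions of this example, and stat -c is the GNU form found on Ubuntu.

```shell
# Create the three directories used in this guide and show who owns them.
make_hadoop_dirs() {
  # $1 = base directory; defaults to the current user's home
  base="${1:-$HOME}"
  # -p creates missing parents and does not fail if a directory exists.
  mkdir -p "$base/dfsdata/namenode" "$base/dfsdata/datanode" "$base/tmpdata"
  # Print owner:group for each directory; all should show your username.
  stat -c '%U:%G %n' "$base/dfsdata/namenode" "$base/dfsdata/datanode" "$base/tmpdata"
}

# Usage: run with no argument to work in your home directory.
#   make_hadoop_dirs
```

Directories you create yourself inside your home directory are already owned by you, so the chown commands mainly matter if something was created with sudo.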

Setup ~/.bashrc

Open .bashrc with the following command

sudo nano ~/.bashrc

Scroll to the bottom of the file. Copy and paste these statements right at the bottom.

#Hadoop Related Options
export HADOOP_HOME=/home/hadoop/hadoop-3.2.2
export HADOOP_INSTALL=$HADOOP_HOME
export HADOOP_MAPRED_HOME=$HADOOP_HOME
export HADOOP_COMMON_HOME=$HADOOP_HOME
export HADOOP_HDFS_HOME=$HADOOP_HOME
export YARN_HOME=$HADOOP_HOME
export HADOOP_COMMON_LIB_NATIVE_DIR=$HADOOP_HOME/lib/native
export PATH=$PATH:$HADOOP_HOME/sbin:$HADOOP_HOME/bin
export HADOOP_OPTS="-Djava.library.path=$HADOOP_HOME/lib/native"

Press Ctrl + O and then Enter to save (recent versions of nano also accept Ctrl + S), then Ctrl + X to quit. Apply the changes with

source ~/.bashrc
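To confirm that the new variables took effect in the current shell, a quick check along these lines can help. The helper name on_path is made up for this sketch.

```shell
# Report whether a directory appears as a whole entry in PATH.
on_path() {
  case ":$PATH:" in
    *":$1:"*) return 0 ;;
    *) return 1 ;;
  esac
}

if [ -n "${HADOOP_HOME:-}" ] && on_path "$HADOOP_HOME/bin" && on_path "$HADOOP_HOME/sbin"; then
  echo "HADOOP_HOME=$HADOOP_HOME and its bin and sbin directories are on PATH"
else
  echo "Hadoop variables are not set in this shell; run: source ~/.bashrc"
fi
```

Once JAVA_HOME is configured in the next step, hadoop version should also print the release number.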

Setup hadoop-env.sh

Open the file with

sudo nano $HADOOP_HOME/etc/hadoop/hadoop-env.sh

Scroll down until you find the commented line # export JAVA_HOME=. Uncomment the line and replace the path with your Java path. The final line should look like this

export JAVA_HOME=/usr/lib/jvm/java-8-openjdk-amd64

Save and exit the file as shown previously.
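If you are not sure what your Java path is, it can usually be derived from the javac binary. The sketch below assumes javac is on your PATH and that readlink -f is available (it is on Ubuntu). The function name find_java_home is hypothetical.

```shell
# Resolve the JDK root directory from the javac binary found on PATH.
find_java_home() {
  javac_path="$(command -v javac)" || return 1
  # readlink -f follows the chain of symlinks to the real binary.
  real_path="$(readlink -f "$javac_path")"
  # Strip the trailing /bin/javac to get the JDK root.
  dirname "$(dirname "$real_path")"
}

if java_home="$(find_java_home)"; then
  echo "export JAVA_HOME=$java_home"
else
  echo "javac not found on PATH; install the JDK first" >&2
fi
```

The printed line is what goes into hadoop-env.sh.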

Setup core-site.xml

Open the file with

sudo nano $HADOOP_HOME/etc/hadoop/core-site.xml

Replace the existing configuration tags with the following

<configuration>
<property>
  <name>hadoop.tmp.dir</name>
  <value>/home/hadoop/tmpdata</value>
</property>
<property>
  <name>fs.default.name</name>
  <value>hdfs://127.0.0.1:9000</value>
</property>
</configuration>
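If you would rather not edit the username by hand, the same file can be generated from the shell. This is a sketch only: write_core_site is a hypothetical helper, it overwrites the target file completely (including the licence comment in the stock file), and it uses the same two properties as above.

```shell
# Write core-site.xml with the current user's home directory filled in.
write_core_site() {
  # $1 = output file; the unquoted EOF lets $HOME expand inside the heredoc
  cat > "$1" <<EOF
<?xml version="1.0" encoding="UTF-8"?>
<configuration>
<property>
  <name>hadoop.tmp.dir</name>
  <value>$HOME/tmpdata</value>
</property>
<property>
  <name>fs.default.name</name>
  <value>hdfs://127.0.0.1:9000</value>
</property>
</configuration>
EOF
}

# Preview the result in a temporary file before touching the real one.
preview="$(mktemp)"
write_core_site "$preview"
cat "$preview"
rm -f "$preview"
```

To apply it for real, pass $HADOOP_HOME/etc/hadoop/core-site.xml as the argument.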

Save and exit the file.

Setup hdfs-site.xml

Open the file using

sudo nano $HADOOP_HOME/etc/hadoop/hdfs-site.xml

Replace the existing configuration tags with the following

<configuration>
<property>
  <name>dfs.name.dir</name>
  <value>/home/hadoop/dfsdata/namenode</value>
</property>
<property>
  <name>dfs.data.dir</name>
  <value>/home/hadoop/dfsdata/datanode</value>
</property>
<property>
  <name>dfs.replication</name>
  <value>1</value>
</property>
</configuration>

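The dfs.replication property sets how many copies of each block HDFS keeps. It is 1 here because a single-node setup has only one DataNode. In a multi-node setup, raise the value inside the <value></value> tags to the number of copies you want, which should not exceed the number of DataNodes. Save and exit the file after making all the changes.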
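After editing, it can be reassuring to read a value back out of the file. The helper below is a rough sketch, not a real XML parser: get_prop is a made-up name, and it relies on the one-tag-per-line layout used in this guide.

```shell
# Print the <value> that follows a given <name> in a Hadoop *-site.xml file.
get_prop() {
  # $1 = file, $2 = property name
  awk -v key="$2" '
    index($0, "<name>" key "</name>") { found = 1; next }
    found && match($0, /<value>.*<\/value>/) {
      # Skip the 7-character <value> tag and drop the 8-character </value> tag.
      print substr($0, RSTART + 7, RLENGTH - 15)
      exit
    }
  ' "$1"
}

conf="${HADOOP_HOME:-}/etc/hadoop/hdfs-site.xml"
if [ -f "$conf" ]; then
  echo "dfs.replication = $(get_prop "$conf" dfs.replication)"
fi
```

Once Hadoop is on your PATH, hdfs getconf -confKey dfs.replication reports the value Hadoop itself has loaded, which is the more authoritative check.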

Setup mapred-site.xml

Open the file with

sudo nano $HADOOP_HOME/etc/hadoop/mapred-site.xml

Replace the existing configuration tags with the following

<configuration>
<property>
  <name>mapreduce.framework.name</name>
  <value>yarn</value>
</property>
</configuration>

Save and exit the file.

Setup yarn-site.xml

Open the file with

sudo nano $HADOOP_HOME/etc/hadoop/yarn-site.xml

Replace the existing configuration tags with the following

<configuration>
<property>
  <name>yarn.nodemanager.aux-services</name>
  <value>mapreduce_shuffle</value>
</property>
<property>
  <name>yarn.nodemanager.aux-services.mapreduce_shuffle.class</name>
  <value>org.apache.hadoop.mapred.ShuffleHandler</value>
</property>
<property>
  <name>yarn.resourcemanager.hostname</name>
  <value>127.0.0.1</value>
</property>
<property>
  <name>yarn.acl.enable</name>
  <value>false</value>
</property>
<property>
  <name>yarn.nodemanager.env-whitelist</name>
  <value>JAVA_HOME,HADOOP_COMMON_HOME,HADOOP_HDFS_HOME,HADOOP_CONF_DIR,CLASSPATH_PREPEND_DISTCACHE,HADOOP_YARN_HOME,HADOOP_MAPRED_HOME</value>
</property>
</configuration>

Save and exit the file.

Format HDFS NameNode

Before we start Hadoop for the first time, we need to format the namenode. Use the following command

hdfs namenode -format

A SHUTDOWN_MSG line signals the end of the formatting process. Just above it, look for a message saying that the storage directory has been successfully formatted.
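Formatting writes a current/VERSION file into the NameNode directory, so its presence is a simple way to tell whether the step has already been done. The sketch below assumes the directory layout from this guide; namenode_formatted is a hypothetical helper name.

```shell
# Succeed when the given NameNode directory has already been formatted.
namenode_formatted() {
  # $1 = NameNode storage directory
  [ -f "$1/current/VERSION" ]
}

name_dir="$HOME/dfsdata/namenode"
if namenode_formatted "$name_dir"; then
  echo "NameNode metadata found in $name_dir"
  # Show the identifiers that were generated during formatting.
  grep -E 'clusterID|namespaceID' "$name_dir/current/VERSION"
else
  echo "No metadata in $name_dir yet; run: hdfs namenode -format"
fi
```

Formatting again later wipes the existing HDFS metadata, so checking first avoids doing it by accident.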

Congratulations! You have now installed Hadoop!

Starting Hadoop

From your home directory, navigate to Hadoop's sbin directory and execute the start script with the following commands

cd hadoop-3.2.2/sbin/
./start-all.sh

Type jps to list the running Java processes. You should see 6 processes in total, including jps itself. The order of the items and the process IDs will differ on your machine.

2994 DataNode
3219 SecondaryNameNode
3927 Jps
3431 ResourceManager
2856 NameNode
3566 NodeManager
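Checking the list by eye works, but the comparison can also be scripted. The sketch below looks for the five daemons shown above in jps-style output and prints any that are absent. The function name missing_daemons is an assumption of this example.

```shell
# Print each expected daemon that does not appear in jps-style output.
missing_daemons() {
  # $1 = text in "PID Name" form, one process per line
  for daemon in NameNode DataNode SecondaryNameNode ResourceManager NodeManager; do
    # Compare the whole second field so that "NameNode" is not
    # satisfied by "SecondaryNameNode".
    if ! printf '%s\n' "$1" | awk -v d="$daemon" '$2 == d { ok = 1 } END { exit !ok }'; then
      echo "$daemon"
    fi
  done
}

if command -v jps >/dev/null 2>&1; then
  missing="$(missing_daemons "$(jps)")"
  if [ -z "$missing" ]; then
    echo "All five Hadoop daemons are running"
  else
    echo "Not running:"
    echo "$missing"
  fi
fi
```

If a daemon is missing, its log file under $HADOOP_HOME/logs is usually the first place to look.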

Alternatively, you can start the HDFS daemons and then the YARN daemons separately using

./start-dfs.sh
./start-yarn.sh

Access Hadoop from Browser

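You can access the Hadoop web interfaces on localhost at the following ports (these are the Hadoop 3.x defaults)

  • NameNode: http://localhost:9870
  • DataNode: http://localhost:9864
  • YARN ResourceManager: http://localhost:8088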

Remember to stop all processes when you are done with your work.

./stop-all.sh