Spark Installation Guide

Please ensure that you have first installed Hadoop before you install Spark. Also remember to perform these installations in the home directory of the the Hadoop user profile. Switch to the Hadoop user and then execute cd to reach the home directory.

Install Scala and Git

Since we have already installed Java 8, we just need to install Scala and Git

sudo apt install scala git -y

Check the versions of all the installed packages so far with

java -version
javac -version
scala -version
git --version

Downloading Spark

Download Spark compatible with the version of Hadoop on your system. Extract it and move it to opt/spark directory.

wget https://downloads.apache.org/spark/spark-3.1.2/spark-3.1.2-bin-hadoop3.2.tgz
tar xvf spark-3.1.2-bin-hadoop3.2.tgz
sudo mv spark-3.1.2-bin-hadoop3.2 /opt/spark

Configure Spark

We need to configure a few environment variables. Execute the following

echo "export SPARK_HOME=/opt/spark" >> ~/.profile
echo "export PATH=$PATH:$SPARK_HOME/bin:$SPARK_HOME/sbin" >> ~/.profile
echo "export PYSPARK_PYTHON=/usr/bin/python3" >> ~/.profile

Starting Spark

To start the Spark application, we need to navigate to the right directory and then convert all shell scripts to an executable. We can then execute the start commands.

cd /opt/spark/sbin
sudo chmod +x *.sh

Then run ./start-all.sh to start Spark. This will create a master and slave with default configurations. Remember to use ./stop-all.sh to shut down all processes once you are done.

You can find more details about Master and Slave processes here.

Access Spark from Browser

You can view Spark on a browser, visit http://localhost:8080

Spark Shell

You can also use the Spark Shell to execute commands. First navigate to the right directory and then convert the required files to executables

cd /opt/spark/bin
sudo chmod +x pyspark 
sudo chmod +x spark-shell

Then execute the Spark Shell with Scala using

./spark-shell

Or, you can also access Spark using Python3 with

./pyspark