UE19CS322: Assignment 0 | Aditeya Baral

Word Count using MapReduce

This is the zeroth assignment for the UE19CS322 Big Data course at PES University. It consists of a single task: running a MapReduce job that counts word occurrences in a text file. The assignment is ungraded but mandatory; it is used to test your Hadoop installation and does not involve writing any code.

The link to all the files required for the assignment can be found here.

Assignment Objectives and Outcomes

This assignment will help the student install and set up Hadoop.
At the end of this assignment, the student will be able to view their VM configuration and set up Hadoop for further assignments.

Ethical practices

Although ungraded, this assignment is mandatory, not advisory. All students are expected to complete it. It verifies your Hadoop installation and also sets up your team’s submission profile for later assignments. It additionally helps us confirm whether your virtual machine configuration is suitable for running all the assignments.

Please do not run the script repeatedly. All requests are logged, and frequent requests made in an attempt to sabotage the system will result in your team being blacklisted on the portal.

The Dataset

The dataset is the book “Alice in Wonderland” by Lewis Carroll, obtained from Project Gutenberg.

Software/Languages to be used:

Python 3.8.x
Hadoop v3.2.2 only

Marks

This assignment is ungraded

Tasks Overview:

Clone the repository and obtain all the required files
Run the script
Verify your installation on this portal.

Submission Date

16th September, 11:59 PM

Task Specifications

Problem Statement

Find the number of occurrences of a given word

Description

Find the number of occurrences of the word “alice”
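The assignment itself requires no code, but the following minimal sketch illustrates what a MapReduce word count computes. It runs locally without Hadoop. The function names (`mapper`, `reducer`, `count_word`) and the tokenization rule (lowercase, letters only) are assumptions made for illustration; the provided script may tokenize and match words differently.

```python
import re
from itertools import groupby


def mapper(line):
    """Map phase: emit a (word, 1) pair for every word in the line.

    Assumption: words are runs of letters, compared case-insensitively.
    """
    for word in re.findall(r"[a-z]+", line.lower()):
        yield word, 1


def reducer(pairs):
    """Reduce phase: sum the counts for each word.

    Sorting the pairs stands in for Hadoop's shuffle-and-sort step,
    which groups identical keys together before reduction.
    """
    totals = {}
    for word, group in groupby(sorted(pairs), key=lambda pair: pair[0]):
        totals[word] = sum(count for _, count in group)
    return totals


def count_word(lines, target="alice"):
    """Run the map and reduce phases and return the count for one word."""
    pairs = [pair for line in lines for pair in mapper(line)]
    return reducer(pairs).get(target, 0)


sample = [
    "Alice was beginning to get very tired.",
    "So Alice thought; poor Alice!",
]
print(count_word(sample))  # prints 3
```

On a real cluster the map and reduce phases run as separate tasks over blocks of the input file; here they are ordinary functions so the data flow is easy to follow.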

Running the Script

Ensure that Hadoop is running
```
$HADOOP_HOME/sbin/start-all.sh
```
You should see the following processes (in any order) when you run jps. If any of them is missing, your Hadoop installation is incorrect.
```
DataNode
SecondaryNameNode
Jps
ResourceManager
NameNode
NodeManager
```
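If you prefer not to compare the jps output by eye, the check can be automated. The sketch below is an optional helper, not part of the assignment files; the names `EXPECTED`, `missing_daemons`, and `check_local` are hypothetical. It assumes each line of jps output ends with the process name, which covers both the usual `<pid> <name>` form and the bare names listed above.

```python
import subprocess

# Daemons listed in this assignment (Jps itself is not a Hadoop daemon).
EXPECTED = {
    "DataNode",
    "SecondaryNameNode",
    "ResourceManager",
    "NameNode",
    "NodeManager",
}


def missing_daemons(jps_output):
    """Return the set of expected daemons that do not appear in jps output."""
    running = set()
    for line in jps_output.splitlines():
        parts = line.split()
        if parts:
            # Assumption: the process name is the last token on the line.
            running.add(parts[-1])
    return EXPECTED - running


def check_local():
    """Run jps on this machine and report any missing daemons."""
    result = subprocess.run(["jps"], capture_output=True, text=True, check=True)
    missing = missing_daemons(result.stdout)
    if missing:
        print("Missing:", ", ".join(sorted(missing)))
    else:
        print("All expected Hadoop daemons are running")
```

Call `check_local()` on the VM after starting Hadoop; it requires jps to be on your PATH.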
Install curl
```
sudo apt install curl -y
```

Clone the repository and navigate into the directory

```
git clone https://github.com/Cloud-Computing-Big-Data/UE19CS322-A0
cd UE19CS322-A0/
```

Make the script executable
```
chmod +x *.pyc
```
Run the script with your Team ID as a command-line argument
```
python3 script.pyc BD_1_2_3_4
```
After execution of the complete script, you should see the following message on your terminal. Visit the portal and view the results of the execution.
```
Starting Hadoop Installation verification...
Verification concluded. Submission has been made to the portal.
Please check the portal for the results.
```
Note that your results will take a few minutes to show up on the portal. Please be patient and do not submit repeatedly.