UE19CS322: Assignment 1

Analysis of US Road Accident Data using MapReduce

This is the first assignment for the UE19CS322 Big Data Course at PES University. The assignment consists of two tasks and focuses on running MapReduce jobs to analyse accident data recorded in the USA.

The files required for the assignment can be found here.

Assignment Objectives and Outcomes

  1. This assignment will help students become familiar with the MapReduce programming environment and HDFS.
  2. At the end of this assignment, the student will be able to write and debug MapReduce code.

Ethical practices

Please submit original code only. You may discuss your approach with your friends, but the code you submit must be your own. All solutions must be submitted through the portal. We will run a plagiarism check on the code, and you will be penalised if your code is found to be plagiarised.

The Dataset

You will be provided with a link to the dataset on PESU Forum. You will be working with the following set of attributes.

| Key | Type | Description |
| --- | --- | --- |
| Severity | integer | Severity of the accident (between 1 - 4) |
| Start_Time | datetime | Start time of accident in local time zone |
| Start_Lat | float | Latitude as GPS coordinate of the start point |
| Start_Lng | float | Longitude as GPS coordinate of the start point |
| Description | string | Natural language description of the accident |
| Visibility(mi) | float | Visibility (in miles) during the accident |
| Precipitation(in) | float | Precipitation amount in inches, if there is any |
| Weather_Condition | string | Weather condition during the accident - rain, snow, thunderstorm, fog, etc |
| Sunrise_Sunset | string | Shows the period of day (i.e. day or night) during the accident |

Software/Languages to be used:

  1. Python 3.8.x
  2. Hadoop v3.2.2 only

Marks

Task 1: 2 marks
Task 2: 2 marks
Report: 1 mark

Tasks Overview:

  1. Load the data into HDFS.
  2. Create mapper.py and reducer.py for Task 1 and Task 2.
  3. Run your code on the sample dataset until you get the right answer.
  4. Submit the files to the portal.
  5. Submit a one-page report based on the template and answer the questions in the report.

Submission Date

16th September, 11:59 PM

Submission Guidelines

You will need to make the following changes to your mapper.py and reducer.py scripts to run them on the portal:

  1. Include the following shebang on the first line of your code

    #!/usr/bin/env python3
    
  2. Make your files executable

    chmod +x mapper.py reducer.py
    
  3. Convert DOS (CRLF) line endings to Unix (LF) line endings (this is necessary if you are coding on Windows)

    dos2unix mapper.py reducer.py
    

Task Specifications

Task 1

Problem Statement

Find record count per hour

Description

Find the number of accidents per hour that satisfy a set of conditions, and display the counts in sorted order.

All the following conditions must be satisfied by a record:

| Attribute | Condition |
| --- | --- |
| Description | Accident should result in either a “lane blocked”, “shoulder blocked” or an “overturned vehicle” |
| Severity | >= 2 |
| Sunrise_Sunset | Night |
| Visibility(mi) | <= 10 |
| Precipitation(in) | >= 0.2 inches |
| Weather_Condition | Should either be “Heavy Snow”, “Thunderstorm”, “Heavy Rain”, “Heavy Rain Showers” or “Blowing Dust” |

Comments

Ignore records which do not satisfy the mentioned conditions. Additionally, if any of the required attributes contain NaN, ignore the record.

Recommended module: datetime
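
The conditions above could be checked in a mapper along the following lines. This is only a sketch under assumptions the assignment text does not confirm: each input line is taken to be one JSON object keyed by the attribute names from the dataset table, Start_Time is taken to begin with YYYY-MM-DD HH:MM:SS, and text comparisons are made case-insensitively. The function names (is_missing, qualifying_hour, map_lines) and the tab-separated "hour, 1" output are illustrative choices, not part of the assignment.

```python
#!/usr/bin/env python3
"""Sketch of a Task 1 mapper filter (ASSUMES one JSON object per input line)."""
import json
import math
from datetime import datetime

# Phrases and weather values taken from the conditions table, lower-cased
# because this sketch ASSUMES case-insensitive matching.
DESCRIPTION_TERMS = ("lane blocked", "shoulder blocked", "overturned vehicle")
WEATHER = {"heavy snow", "thunderstorm", "heavy rain",
           "heavy rain showers", "blowing dust"}
REQUIRED = ("Description", "Severity", "Sunrise_Sunset", "Visibility(mi)",
            "Precipitation(in)", "Weather_Condition", "Start_Time")


def is_missing(value):
    """Treat None, NaN floats, and empty or 'nan' strings as missing."""
    if value is None:
        return True
    if isinstance(value, float):
        return math.isnan(value)
    if isinstance(value, str):
        return value.strip().lower() in ("", "nan")
    return False


def qualifying_hour(record):
    """Return the accident hour (0-23) if every condition holds, else None."""
    if any(is_missing(record.get(key)) for key in REQUIRED):
        return None
    try:
        description = record["Description"].lower()
        ok = (
            any(term in description for term in DESCRIPTION_TERMS)
            and int(record["Severity"]) >= 2
            and record["Sunrise_Sunset"].strip().lower() == "night"
            and float(record["Visibility(mi)"]) <= 10
            and float(record["Precipitation(in)"]) >= 0.2
            and record["Weather_Condition"].strip().lower() in WEATHER
        )
        if not ok:
            return None
        # ASSUMED timestamp layout; anything after the seconds is ignored.
        start = datetime.strptime(record["Start_Time"][:19], "%Y-%m-%d %H:%M:%S")
    except (ValueError, TypeError, AttributeError):
        return None
    return start.hour


def map_lines(lines):
    """Yield 'hour<TAB>1' for each input line that passes the filter."""
    for line in lines:
        line = line.strip()
        if not line:
            continue
        try:
            record = json.loads(line)
        except json.JSONDecodeError:
            continue
        if not isinstance(record, dict):
            continue
        hour = qualifying_hour(record)
        if hour is not None:
            yield f"{hour}\t1"


if __name__ == "__main__":
    # In-memory demo; a streaming mapper would pass sys.stdin instead.
    sample = json.dumps({
        "Description": "Right lane blocked due to accident", "Severity": 3,
        "Sunrise_Sunset": "Night", "Visibility(mi)": 2.0,
        "Precipitation(in)": 0.5, "Weather_Condition": "Heavy Rain",
        "Start_Time": "2019-01-05 23:15:00",
    })
    for out in map_lines([sample]):
        print(out)
```

Because map_lines accepts any iterable of lines, the same function can be tested on a small list and then fed sys.stdin when run as mapper.py.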

Output Format

For each hour that contains accident data satisfying the provided conditions, print the hour followed by the number of accidents on a separate line.
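
A reducer producing this output might look like the sketch below. It assumes the mapper emits tab-separated "hour, count" lines and that the hour and its count are printed separated by a single space; neither detail is fixed by the text above, and the function names are hypothetical. The hours are sorted numerically inside the reducer because sort -k 1,1 and Hadoop's default text key ordering compare keys as strings, which would place "10" before "2".

```python
#!/usr/bin/env python3
"""Sketch of a Task 1 reducer (ASSUMES 'hour<TAB>count' input lines)."""
from collections import Counter


def reduce_counts(lines):
    """Sum the counts per hour and return (hour, total) pairs sorted by hour."""
    totals = Counter()
    for line in lines:
        parts = line.strip().split("\t")
        if len(parts) != 2:
            continue  # skip malformed lines
        try:
            hour, count = int(parts[0]), int(parts[1])
        except ValueError:
            continue
        totals[hour] += count
    return sorted(totals.items())


def format_output(pairs):
    """Render each pair as 'hour count' (the separator is an ASSUMPTION)."""
    return [f"{hour} {count}" for hour, count in pairs]


if __name__ == "__main__":
    # In-memory demo; a streaming reducer would pass sys.stdin instead.
    demo = ["10\t1", "2\t1", "10\t1"]
    for row in format_output(reduce_counts(demo)):
        print(row)
```

Keeping at most 24 counters in memory makes the reducer independent of the order in which keys arrive.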

Task 2

Problem Statement

Find record count per city and state

Description

Find the number of accidents per city and state for records whose start coordinates (Start_Lat, Start_Lng) lie within distance D of a given pair of coordinates (LATITUDE, LONGITUDE), measured using Euclidean distance.

For each record, make a POST request to obtain the city and state information. The expected JSON payload format:

    {
        "latitude": Start_Lat,
        "longitude": Start_Lng
    }
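
One way to build such a request with only the standard library is sketched below. The endpoint URL is not given in this document, so the example uses an obviously fake placeholder, and the response is assumed (not confirmed) to be JSON; its exact fields are left uninterpreted. The function names are hypothetical.

```python
"""Sketch of the POST request for Task 2 (the URL below is a PLACEHOLDER)."""
import json
import urllib.request

# Hypothetical endpoint; replace with the URL provided for the assignment.
PLACEHOLDER_URL = "http://example.invalid/lookup"


def build_location_request(url, start_lat, start_lng):
    """Build a POST request carrying the latitude/longitude JSON payload."""
    payload = json.dumps({"latitude": start_lat, "longitude": start_lng})
    return urllib.request.Request(
        url,
        data=payload.encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def fetch_location(url, start_lat, start_lng, timeout=10):
    """Send the request and decode the reply, ASSUMING the reply is JSON."""
    request = build_location_request(url, start_lat, start_lng)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))


if __name__ == "__main__":
    # Only builds the request; nothing is sent to the placeholder URL.
    req = build_location_request(PLACEHOLDER_URL, 40.1, -82.9)
    print(req.get_method(), req.full_url, req.data.decode("utf-8"))
```

Separating request construction from sending keeps the payload easy to inspect without network access.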

Your mapper.py script must accept three command-line arguments:

    LATITUDE LONGITUDE D
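
Reading these arguments and applying the distance check could be sketched as follows. The sketch assumes "within D" means a distance less than or equal to D, and it computes a plain Euclidean distance on the raw coordinate values, as the description states. The function names are hypothetical.

```python
"""Sketch of argument parsing and the Euclidean distance check for Task 2."""
import math


def parse_args(argv):
    """Return (latitude, longitude, d) from an argv-style list."""
    if len(argv) != 4:
        raise SystemExit("usage: mapper.py LATITUDE LONGITUDE D")
    latitude, longitude, d = (float(value) for value in argv[1:4])
    return latitude, longitude, d


def within_distance(start_lat, start_lng, latitude, longitude, d):
    """True if the Euclidean distance between the points is at most d (ASSUMED <=)."""
    return math.hypot(start_lat - latitude, start_lng - longitude) <= d


if __name__ == "__main__":
    # Demo with a literal list; mapper.py would call parse_args(sys.argv).
    lat, lng, d = parse_args(["mapper.py", "40.0", "-83.0", "5"])
    print(within_distance(43.0, -79.0, lat, lng, d))
```

Filtering by distance before making the POST request avoids a network call for records that would be discarded anyway.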

Output Format

For each state, display the state name, followed by each city and its accident count, then the state total.
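
A reducer for this grouping might be sketched as below. Several details are assumptions, since the text above does not fix them: the mapper is assumed to emit tab-separated "state, city, count" lines, states and cities are listed alphabetically, and each count is printed after its name separated by a single space. The function names are hypothetical.

```python
#!/usr/bin/env python3
"""Sketch of a Task 2 reducer (ASSUMES 'state<TAB>city<TAB>count' input lines)."""
from collections import defaultdict


def aggregate(lines):
    """Return a nested mapping {state: {city: total}} built from the input."""
    counts = defaultdict(lambda: defaultdict(int))
    for line in lines:
        parts = line.rstrip("\n").split("\t")
        if len(parts) != 3:
            continue  # skip malformed lines
        state, city, value = parts
        try:
            amount = int(value)
        except ValueError:
            continue
        counts[state][city] += amount
    return counts


def format_report(counts):
    """List each state, its cities with counts, then the state total."""
    rows = []
    for state in sorted(counts):
        rows.append(state)
        for city in sorted(counts[state]):
            rows.append(f"{city} {counts[state][city]}")
        rows.append(f"{state} {sum(counts[state].values())}")
    return rows


if __name__ == "__main__":
    # In-memory demo; a streaming reducer would pass sys.stdin instead.
    demo = ["OH\tColumbus\t1", "CA\tLos Angeles\t1", "OH\tDayton\t1"]
    for row in format_report(aggregate(demo)):
        print(row)
```

The exact layout expected by the portal should be checked against the sample output before submitting.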

Helpful Commands

Running MapReduce without Hadoop

    cat path_to_dataset | python3 mapper.py [args] | sort -k 1,1 | python3 reducer.py [args] > output.txt

Starting Hadoop

    $HADOOP_HOME/sbin/start-all.sh

Running a MapReduce Job

    hadoop jar path-to-streaming-jar-file \
        -input path_to_input_folder_on_hdfs \
        -output path_to_output_folder_on_hdfs \
        -mapper absolute_path_to_mapper.py \
        -reducer absolute_path_to_reducer.py