UE19CS322: Assignment 1
Analysis of US Road Accident Data using MapReduce
This is the first assignment for the UE19CS322 Big Data Course at PES University. The assignment consists of 2 tasks and focuses on running MapReduce jobs to analyse data recorded from accidents in the USA.
The files required for the assignment can be found here.
Assignment Objectives and Outcomes
- This assignment will help students become familiar with the MapReduce programming environment and HDFS.
- At the end of this assignment, the student will be able to write and debug MapReduce code.
Ethical practices
Please submit original code only. You can discuss your approach with your friends but you must write original code. All solutions must be submitted through the portal. We will perform a plagiarism check on the code and you will be penalised if your code is found to be plagiarised.
The Dataset
You will be provided with a link to the dataset on PESU Forum. You will be working with the following set of attributes.
| Key | Type | Description |
|---|---|---|
| Severity | integer | Severity of the accident (between 1 and 4) |
| Start_Time | datetime | Start time of accident in local time zone |
| Start_Lat | float | Latitude as GPS coordinate of the start point |
| Start_Lng | float | Longitude as GPS coordinate of the start point |
| Description | string | Natural language description of the accident |
| Visibility(mi) | float | Visibility (in miles) during the accident |
| Precipitation(in) | float | Precipitation amount in inches, if there is any |
| Weather_Condition | string | Weather condition during the accident - rain, snow, thunderstorm, fog, etc |
| Sunrise_Sunset | string | Shows the period of day (i.e. day or night) during the accident |
Software/Languages to be used:
- Python 3.8.x
- Hadoop v3.2.2 only
Marks
Task 1: 2 marks
Task 2: 2 marks
Report: 1 mark
Tasks Overview:
- Load the data into HDFS.
- Create mapper.py and reducer.py for Task 1 and Task 2
- Run your code on the sample dataset until you get the right answer
- Submit the files to the portal
- Submit a one-page report based on the template and answer the questions in the report
Submission Date
16th September, 11:59 PM
Submission Guidelines
You will need to make the following changes to your mapper.py and reducer.py scripts to run them on the portal:
- Include the following shebang on the first line of your code: `#!/usr/bin/env python3`
- Make your files executable: `chmod +x mapper.py reducer.py`
- Convert line breaks from DOS format to Unix format (this is necessary if you are coding on Windows): `dos2unix mapper.py reducer.py`
Task Specifications
Task 1
Problem Statement
Find record count per hour
Description
Find the number of accidents occurring per hour that satisfy a set of conditions and display them in sorted fashion.
All the following conditions must be satisfied by a record:
| Attribute | Condition |
|---|---|
| Description | Accident should result in either a “lane blocked”, “shoulder blocked” or an “overturned vehicle” |
| Severity | >= 2 |
| Sunrise_Sunset | Night |
| Visibility(mi) | <= 10 |
| Precipitation(in) | >= 0.2 inches |
| Weather_Condition | Should either be “Heavy Snow”, “Thunderstorm”, “Heavy Rain”, “Heavy Rain Showers” or “Blowing Dust” |
Comments
Ignore records which do not satisfy the mentioned conditions. Additionally, if any of the required attributes contain NaN, ignore the record.
Recommended module: datetime
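The filter above can be sketched in Python roughly as follows. This is only an illustration, not a reference solution: it assumes each input line is one JSON object keyed by the attribute names from the dataset table, that `Start_Time` begins with `YYYY-MM-DD HH:MM:SS`, and that text comparisons are case-insensitive. The helper names (`is_missing`, `matching_hour`, `map_lines`) are hypothetical.

```python
import json
import math
from datetime import datetime

DESCRIPTION_PHRASES = ("lane blocked", "shoulder blocked", "overturned vehicle")
WEATHER_CONDITIONS = {
    "heavy snow", "thunderstorm", "heavy rain", "heavy rain showers", "blowing dust",
}
REQUIRED_KEYS = (
    "Description", "Severity", "Sunrise_Sunset", "Visibility(mi)",
    "Precipitation(in)", "Weather_Condition", "Start_Time",
)


def is_missing(value):
    """Treat an absent value, None, or a float NaN as missing."""
    return value is None or (isinstance(value, float) and math.isnan(value))


def matching_hour(line):
    """Return the hour (0-23) of a record that passes every condition, else None."""
    try:
        record = json.loads(line)  # assumption: one JSON object per line
        if not isinstance(record, dict):
            return None
        if any(is_missing(record.get(key)) for key in REQUIRED_KEYS):
            return None
        description = str(record["Description"]).lower()
        passes = (
            any(phrase in description for phrase in DESCRIPTION_PHRASES)
            and float(record["Severity"]) >= 2
            and str(record["Sunrise_Sunset"]).strip().lower() == "night"
            and float(record["Visibility(mi)"]) <= 10
            and float(record["Precipitation(in)"]) >= 0.2
            and str(record["Weather_Condition"]).strip().lower() in WEATHER_CONDITIONS
        )
        if not passes:
            return None
        # Assumed timestamp layout; anything after the seconds is cut off.
        start = datetime.strptime(str(record["Start_Time"])[:19], "%Y-%m-%d %H:%M:%S")
        return start.hour
    except (ValueError, TypeError):
        # Malformed lines or values of an unexpected type are ignored.
        return None


def map_lines(lines):
    """Yield one 'hour<TAB>1' pair per record that passes the filter."""
    for line in lines:
        hour = matching_hour(line)
        if hour is not None:
            yield f"{hour}\t1"
```

In a mapper.py built on this sketch, the driver would be a loop such as `for pair in map_lines(sys.stdin): print(pair)`.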
Output Format
For each hour that contains accident data satisfying the provided conditions, print the hour followed by the number of accidents on a separate line.
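A reducer for this output could look roughly like the sketch below. It assumes the mapper emits lines of the form `hour<TAB>1`, that "sorted" means ascending numeric hour, and that hour and count are separated by a single space on output; none of these details are fixed by the text above, and the name `reduce_counts` is hypothetical.

```python
from collections import Counter


def reduce_counts(lines):
    """Sum the counts per hour and return 'hour count' lines in ascending hour order."""
    counts = Counter()
    for line in lines:
        hour_text, _, value_text = line.strip().partition("\t")
        try:
            hour, value = int(hour_text), int(value_text)
        except ValueError:
            continue  # skip blank or malformed lines
        counts[hour] += value
    # Sort numerically so that hour 2 comes before hour 10.
    return [f"{hour} {count}" for hour, count in sorted(counts.items())]
```

Keeping the totals in a `Counter` is cheap here because there are at most 24 distinct keys, and it avoids depending on the string sort order of the shuffle phase (where `10` sorts before `2`).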
Task 2
Problem Statement
Find record count per city and state
Description
Find the number of accidents per city and state, counting only records whose start coordinates (Start_Lat, Start_Lng) lie within Euclidean distance D of a given pair of coordinates (LATITUDE, LONGITUDE).
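As a small illustration of the distance check, the sketch below treats latitude and longitude as plain 2D coordinates and computes sqrt((Start_Lat - LATITUDE)^2 + (Start_Lng - LONGITUDE)^2). Whether "within D" includes the boundary is not stated, so the inclusive comparison is an assumption, and `within_distance` is a hypothetical helper name.

```python
import math


def within_distance(start_lat, start_lng, latitude, longitude, d):
    """Return True if the Euclidean distance between the two points is at most d."""
    # math.hypot(dx, dy) computes sqrt(dx**2 + dy**2).
    return math.hypot(start_lat - latitude, start_lng - longitude) <= d
```

For example, `within_distance(3.0, 4.0, 0.0, 0.0, 5.0)` is true because the distance is exactly 5.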
For each record, make a POST request to obtain the city and state information. The expected JSON payload format:
{
"latitude": Start_Lat,
"longitude": Start_Lng
}
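One way to build and send such a request using only the standard library is sketched below. The endpoint URL is not given in this document, so `url` is a parameter the caller must supply; the `Content-Type` header and the function names are assumptions, and the structure of the response is not specified here, so the sketch simply returns the decoded JSON.

```python
import json
import urllib.request


def build_location_request(url, start_lat, start_lng):
    """Build a POST request carrying the latitude/longitude payload as JSON."""
    payload = json.dumps({"latitude": start_lat, "longitude": start_lng}).encode("utf-8")
    return urllib.request.Request(
        url,
        data=payload,
        headers={"Content-Type": "application/json"},  # assumed; usual for JSON bodies
        method="POST",
    )


def fetch_location(url, start_lat, start_lng, timeout=10):
    """Send the request and return the decoded JSON response."""
    request = build_location_request(url, start_lat, start_lng)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))
```

Since one request is made per record, it is worth applying the distance filter first so that only matching records trigger a network call.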
You are required to take 3 command line arguments in your mapper.py script:
LATITUDE LONGITUDE D
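Reading the three arguments might look like the following sketch. It assumes they arrive in the order shown and are all numeric; the function name `parse_args` and the error messages are illustrative only.

```python
import sys


def parse_args(argv):
    """Return (latitude, longitude, d) as floats from an argv-style list."""
    if len(argv) != 4:  # script name plus three arguments
        raise SystemExit("usage: mapper.py LATITUDE LONGITUDE D")
    try:
        latitude, longitude, d = (float(arg) for arg in argv[1:4])
    except ValueError:
        raise SystemExit("LATITUDE, LONGITUDE and D must be numbers")
    return latitude, longitude, d
```

In mapper.py this would typically be called once at start-up as `latitude, longitude, d = parse_args(sys.argv)`.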
Output Format
For each state, display the state name, followed by each city and its accident count, then the state total.
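The grouping described above can be sketched as follows. The exact line layout is not spelled out here, so this sketch makes several assumptions: the mapper emits `state<TAB>city<TAB>1`, states and cities are printed in alphabetical order, and names and counts are separated by a single space. `summarise` is a hypothetical name.

```python
from collections import defaultdict


def summarise(lines):
    """Return output lines: state, then 'city count' per city, then 'state total'."""
    counts = defaultdict(lambda: defaultdict(int))
    for line in lines:
        parts = line.rstrip("\n").split("\t")
        if len(parts) != 3:
            continue  # skip malformed lines
        state, city, value_text = parts
        try:
            value = int(value_text)
        except ValueError:
            continue
        counts[state][city] += value

    output = []
    for state in sorted(counts):
        output.append(state)
        for city in sorted(counts[state]):
            output.append(f"{city} {counts[state][city]}")
        output.append(f"{state} {sum(counts[state].values())}")
    return output
```

This version holds all counts in memory for simplicity. A streaming variant that relies on the sorted reducer input and flushes each state as soon as the key changes would use less memory on large inputs.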
Helpful Commands
Running MapReduce without Hadoop
cat path_to_dataset | python3 mapper.py [args] | sort -k 1,1 | python3 reducer.py [args] > output.txt
Starting Hadoop
$HADOOP_HOME/sbin/start-all.sh
Running a MapReduce Job
hadoop jar path-to-streaming-jar-file \
-input path_to_input_folder_on_hdfs \
-output path_to_output_folder_on_hdfs \
-mapper absolute_path_to_mapper.py \
-reducer absolute_path_to_reducer.py