UE19CS322: Assignment 3
Analysis of Earth Surface Temperature using Spark
This is the third assignment for the UE19CS322 Big Data Course at PES University. The assignment consists of 2 tasks and focuses on using Spark and manipulating dataframes to analyse and obtain insights from Earth’s surface temperature readings.
The files required for the assignment can be found here.
Assignment Objectives and Outcomes
- This assignment will help students get familiar with Spark and DataFrames
- At the end of this assignment, the student will be able to write and debug code to work with DataFrames and manipulate them using Spark
Ethical practices
Please submit original code only. You can discuss your approach with your friends but you must write original code. All solutions must be submitted through the portal. We will perform a plagiarism check on the code and you will be penalised if your code is found to be plagiarised.
The Dataset
The dataset consists of readings of the Earth’s surface temperature and was compiled by Berkeley Earth, an organization focused on environmental data science that records data on Earth’s climate.
There are 2 DataFrames, and you will be working with the following attributes in each of them.
City.csv
This DataFrame contains surface temperature readings captured every day for every city in the world from 1750 to the present.
| Attribute | Type | Description |
|---|---|---|
| dt | String | Date of the Record |
| City | String | Name of the city |
| Country | String | Name of the country |
| AverageTemperature | Float | Average Temperature of the city on that Date |
Global.csv
This DataFrame contains worldwide surface temperature readings captured every day from 1750 to the present.
| Attribute | Type | Description |
|---|---|---|
| dt | String | Date of the Record |
| LandAverageTemperature | Float | Global Land Average Temperature on that Date |
Software/Languages to be used:
- Python 3.8.x
- Spark v3.1.2 only
Marks
- Task 1: 2 marks
- Task 2: 2 marks
- Report: 1 mark
Tasks Overview:
- Create task1.py and task2.py for Task 1 and Task 2 respectively
- Run your code on the sample dataset until you get the right answer
- Submit the files to the portal
- Submit a one-page report based on the template and answer the questions on the report
Submission Link
Portal for Big Data Assignment Submissions
Submission Deadline
8th November, 11:59 PM
Submission Guidelines
You will need to make the following changes to your task1.py and task2.py scripts to run them on the portal:
- Make sure to always fetch the available Spark Context instead of creating a new one. This will prevent any errors while attempting to create a new Spark Context to connect to the Spark Cluster.
  spark_context = SparkContext.getOrCreate()
- Convert line breaks in DOS format to Unix format (this is necessary if you are coding on Windows; your code will not run on our portal otherwise).
  dos2unix task1.py task2.py
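Where the dos2unix utility is not available, the same line-ending conversion can be approximated in Python. The sketch below is only an illustration and not part of the official instructions: the helper names to_unix_line_endings and convert_file are hypothetical, and it assumes the scripts are plain text files that use CRLF line endings.

```python
from pathlib import Path


def to_unix_line_endings(data: bytes) -> bytes:
    """Replace DOS (CRLF) line endings with Unix (LF) line endings."""
    return data.replace(b"\r\n", b"\n")


def convert_file(path: str) -> None:
    """Rewrite a file in place with Unix line endings (hypothetical helper)."""
    file_path = Path(path)
    file_path.write_bytes(to_unix_line_endings(file_path.read_bytes()))


# Demonstration on an in-memory sample rather than a real script.
print(to_unix_line_endings(b"import sys\r\nprint(sys.argv)\r\n"))
```

Calling convert_file("task1.py") would rewrite that script in place; files that already use Unix line endings are left unchanged.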
Task Specifications
The following sample splits (containing only the required columns for both Tasks) will be used to explain the examples for each Task
City.csv
| dt | AverageTemperature | City | Country |
|---|---|---|---|
| 1876-02-01 | 20 | Bangalore | India |
| 1862-04-01 | 32 | Bangalore | India |
| 1876-02-01 | 28 | Indore | India |
| 1862-04-01 | 34 | Kolkata | India |
| 1856-01-01 | 32 | Kolkata | India |
| 1856-01-01 | 33 | Delhi | India |
| 1876-02-01 | 27 | Frankfurt | Germany |
| 1862-04-01 | 29 | Frankfurt | Germany |
| 1856-01-01 | 26 | Frankfurt | Germany |
Global.csv
| dt | LandAverageTemperature |
|---|---|
| 1876-02-01 | 20 |
| 1862-04-01 | 30 |
| 1856-01-01 | 25 |
Task 1
Problem Statement
For each city in a given country, find the number of records in which the city’s average temperature on a date was higher than that city’s average temperature throughout the dataset.
Description
Use the city.csv DataFrame to find the number of times a city’s average temperature on a date was higher than that city’s overall average temperature, restricted to cities in the given country, and display each city along with its count on a new line.
Comments
- The country and the path to the city.csv DataFrame will be provided as the two command line arguments for the Task.
- You are expected to use Spark transformations wherever possible instead of custom code snippets (loops and data structures to manipulate data), since the transformations will speed up your implementation.
- Important: Never load the entire dataset into your memory!
Input Format
The input to the Task consists of two command line arguments in the following order: country path_to_city.csv
Output Format
Display each city in that country with its corresponding count. The values are \t separated and newline delimited. Do not print the city if it does not have any instances where the required condition has been satisfied.
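The input and output conventions above can be sketched in plain Python. This is a minimal illustration under stated assumptions, not a required structure: the helper names parse_task1_args and format_counts are hypothetical, and the (name, count) pairs are assumed to have already been produced by your Spark code.

```python
def parse_task1_args(argv):
    """Return (country, city_csv_path) from a sys.argv-style list."""
    if len(argv) != 3:
        raise SystemExit("usage: task1.py <country> <path_to_city.csv>")
    return argv[1], argv[2]


def format_counts(rows):
    """Render (name, count) pairs as tab-separated, newline-delimited text.

    Pairs with a count of zero are skipped, as the output format requires.
    """
    return "\n".join(f"{name}\t{count}" for name, count in rows if count > 0)


# In a real script the argument list would be sys.argv.
country, city_path = parse_task1_args(["task1.py", "India", "city.csv"])
print(country, city_path)
print(format_counts([("Bangalore", 1), ("Indore", 0), ("Kolkata", 1)]))
```

The same formatting helper would apply to Task 2, where the pairs are countries and their counts.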
Example
Assuming the value of the command line argument country to be India, we obtain the average temperatures of each city in the given country as 26 for Bangalore, 33 for Kolkata, 28 for Indore, and 33 for Delhi.
Exactly two records in the given split have a city’s average temperature on a date higher than that city’s average temperature throughout the dataset:
- Bangalore (occurred on 1862-04-01 with average temperature of 32)
- Kolkata (occurred on 1862-04-01 with average temperature of 34)
Hence, the output for this Task will be as follows.
Bangalore 1
Kolkata 1
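The arithmetic in this example can be cross-checked with a few lines of plain Python. This is only a reference calculation on the small sample split, written with loops and dictionaries for readability; it is not a Spark implementation and is not suitable as a submission. The function name task1_reference is hypothetical.

```python
from collections import defaultdict

# Sample City.csv rows from the Task Specifications: (dt, temperature, city, country)
SAMPLE_CITY = [
    ("1876-02-01", 20, "Bangalore", "India"),
    ("1862-04-01", 32, "Bangalore", "India"),
    ("1876-02-01", 28, "Indore", "India"),
    ("1862-04-01", 34, "Kolkata", "India"),
    ("1856-01-01", 32, "Kolkata", "India"),
    ("1856-01-01", 33, "Delhi", "India"),
    ("1876-02-01", 27, "Frankfurt", "Germany"),
    ("1862-04-01", 29, "Frankfurt", "Germany"),
    ("1856-01-01", 26, "Frankfurt", "Germany"),
]


def task1_reference(rows, country):
    """Count, per city, the records that exceed the city's overall mean."""
    temps_by_city = defaultdict(list)
    for _dt, temp, city, row_country in rows:
        if row_country == country:
            temps_by_city[city].append(temp)

    counts = {}
    for city, temps in temps_by_city.items():
        mean = sum(temps) / len(temps)
        above = sum(1 for temp in temps if temp > mean)
        if above > 0:  # cities with no qualifying record are not printed
            counts[city] = above
    return counts


for city, count in task1_reference(SAMPLE_CITY, "India").items():
    print(f"{city}\t{count}")
```

Indore and Delhi each have a single record, which equals their own average, so they never exceed it and do not appear in the output.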
Task 2
Problem Statement
For each country, find the number of dates on which the country’s maximum average temperature was higher than the worldwide land average temperature on the same date.
Description
Use the city.csv DataFrame to find the maximum average temperature of a country on every date. Count the number of occurrences where the country’s maximum average temperature on a date was higher than the worldwide land average temperature on the same date from the global.csv DataFrame, and display the count for each country.
Comments
- The maximum average temperature of a country is defined as the maximum of the average temperatures of all cities in that country on that date.
- The paths to the city.csv and global.csv DataFrames will be provided as the two command line arguments for the Task.
- You are expected to use Spark transformations wherever possible instead of custom code snippets (loops and data structures to manipulate data), since the transformations will speed up your implementation.
- Important: Never load the entire datasets into your memory!
Input Format
The input to the Task consists of two command line arguments in the following order: path_to_city.csv path_to_global.csv
Output Format
Display each country with its corresponding count. The values are \t separated and newline delimited. Do not print the country if it does not have any instances where the required condition has been satisfied.
Example
After computing the maximum average temperature for each country on every date, we obtain the following:
- 1876-02-01
  - India: max(20, 28) = 28
  - Germany: max(27) = 27
  Since both countries have their maximum temperature greater than the land average temperature on this date (20), we increment the counter for both these countries.
  Current count: India: 1, Germany: 1
- 1862-04-01
  - India: max(32, 34) = 34
  - Germany: max(29) = 29
  Since only India’s maximum temperature is greater than the land average temperature on this date (30), we increment only India’s count.
  Current count: India: 2, Germany: 1
- 1856-01-01
  - India: max(32, 33) = 33
  - Germany: max(26) = 26
  Since both countries have their maximum temperature greater than the land average temperature on this date (25), we increment the counter for both these countries.
  Current count: India: 3, Germany: 2
Hence, the final output will be as follows.
Germany 2
India 3
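The counts in this example can be cross-checked with a short plain-Python calculation. As with any hand check, this is only a reference for the small sample split, using loops and dictionaries for readability; it is not a Spark implementation and is not suitable as a submission. The function name task2_reference is hypothetical.

```python
from collections import defaultdict

# Sample City.csv rows from the Task Specifications: (dt, temperature, city, country)
SAMPLE_CITY = [
    ("1876-02-01", 20, "Bangalore", "India"),
    ("1862-04-01", 32, "Bangalore", "India"),
    ("1876-02-01", 28, "Indore", "India"),
    ("1862-04-01", 34, "Kolkata", "India"),
    ("1856-01-01", 32, "Kolkata", "India"),
    ("1856-01-01", 33, "Delhi", "India"),
    ("1876-02-01", 27, "Frankfurt", "Germany"),
    ("1862-04-01", 29, "Frankfurt", "Germany"),
    ("1856-01-01", 26, "Frankfurt", "Germany"),
]

# Sample Global.csv rows: dt -> LandAverageTemperature
SAMPLE_GLOBAL = {"1876-02-01": 20, "1862-04-01": 30, "1856-01-01": 25}


def task2_reference(city_rows, global_temps):
    """Count, per country, the dates where its maximum beats the global value."""
    # Maximum of the city temperatures for every (country, date) pair.
    max_temp = {}
    for dt, temp, _city, country in city_rows:
        key = (country, dt)
        max_temp[key] = max(temp, max_temp.get(key, temp))

    counts = defaultdict(int)
    for (country, dt), temp in max_temp.items():
        # Dates missing from the global table are skipped (an assumption).
        if dt in global_temps and temp > global_temps[dt]:
            counts[country] += 1
    return dict(counts)


for country, count in sorted(task2_reference(SAMPLE_CITY, SAMPLE_GLOBAL).items()):
    print(f"{country}\t{count}")
```

Sorting the countries alphabetically here simply mirrors the order shown in the example output; the specification does not state a required ordering.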