TDM 20200: Project 8 — 2024

Motivation: Spark uses a distributed computing model, which means that data is processed in parallel across a cluster of machines. PySpark is the Python API for Spark: it lets you interact with Spark from the Python shell, Jupyter Lab, Databricks, and similar environments, combining the simplicity of Python with the power of Apache Spark.

Context: Understand components of Spark’s ecosystem that PySpark can use

Scope: Python, Spark SQL, Spark Streaming, MLlib, GraphX

Learning Objectives
  • Develop skills and techniques to use PySpark to read a dataset, perform transformations such as filter and map, and execute actions such as count and collect

  • Understand how to use PySpark SQL to run SQL queries

Dataset(s)

The questions in this project use the following dataset:

  • /anvil/projects/tdm/data/whin/weather.parquet

Readings and Resources

You need to use 2 cores for your Jupyter Lab session for Project 8 this week.

Dr. Ward created 5 videos to help with this project.

Questions

Question 1 (2 points)

  1. Run the example from the Examples Book's PySpark section, first in Pandas and then in Spark. Make sure to show the time used by each method (a timing sketch follows this list).

  2. Comment on the difference in speed when processing the data with Pandas versus PySpark.
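To illustrate one way to measure the timing, here is a hedged sketch. It assumes the SparkSession sp created in Question 2 below; the Examples Book may structure its timing differently (for instance, with Jupyter's %%time cell magic):

import time
import pandas as pd

# Time reading the parquet file with Pandas.
start = time.time()
pandas_df = pd.read_parquet("/anvil/projects/tdm/data/whin/weather.parquet")
print(f"Pandas read {len(pandas_df)} rows in {time.time() - start:.2f} seconds")

# Time the same read with PySpark. Spark is lazy, so count()
# forces the read to actually happen before we stop the clock.
start = time.time()
spark_df = sp.read.parquet("/anvil/projects/tdm/data/whin/weather.parquet")
print(f"PySpark read {spark_df.count()} rows in {time.time() - start:.2f} seconds")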

Question 2 (2 points)

  1. Run the following code to initiate a PySpark application. The name of our PySpark session is sp, but you may use a different name. The SparkSession is the entry point to Spark’s DataFrame and SQL APIs.

  2. Read the file /anvil/projects/tdm/data/whin/weather.parquet into a PySpark DataFrame called myDF

  3. Show the first 5 rows of the resulting PySpark DataFrame. (A short sketch of parts 2 and 3 follows the session code below.)

import pyspark
from pyspark.sql import SparkSession
from pyspark.sql.functions import *

# Create (or retrieve) a SparkSession named TDM_S with 2 GB of driver memory.
# This session object, sp, is the entry point to the DataFrame and SQL APIs.
sp = SparkSession.builder.appName('TDM_S').config("spark.driver.memory", "2g").getOrCreate()
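
A minimal sketch of parts 2 and 3, using the sp session created above:

# Read the parquet file into a PySpark DataFrame called myDF.
myDF = sp.read.parquet("/anvil/projects/tdm/data/whin/weather.parquet")

# Display the first 5 rows.
myDF.show(5)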

Question 3 (2 points)

  1. List the DataFrame’s column names and data types

  2. How many rows are in the DataFrame?

  3. How many unique station_id values are in the data set?

  • The printSchema() function is useful to explore a DataFrame’s structure.

  • You may use the select() function to select the station_id column, then chain the distinct() and count() functions to count the distinct values (see the sketch below).
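
Putting these hints together, a minimal sketch (the column name station_id comes from the question itself):

# Column names and data types.
myDF.printSchema()

# Total number of rows in the DataFrame.
print(myDF.count())

# Number of unique station_id values.
print(myDF.select("station_id").distinct().count())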

Question 4 (2 points)

  1. Create a Temporary View called weather from the PySpark DataFrame myDF.

  2. Run a SQL Query to get the total number of records for each station.

  3. Run a SQL Query to get the maximum wind speed recorded by each station.

  4. Run a SQL Query to get the average temperature recorded by each station.

  • createOrReplaceTempView() is useful for creating the temporary view in the first part of this question.

  • Use sp.sql() to run SQL queries against the view.

  • Use GROUP BY station_id in each query to aggregate the records by station (a sketch follows these hints).
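
A hedged sketch of the temporary view and the three queries. The wind speed and temperature column names below are assumptions; check printSchema() for the actual names in the dataset:

# Register the DataFrame as a temporary view named "weather".
myDF.createOrReplaceTempView("weather")

# Total number of records for each station.
sp.sql("SELECT station_id, COUNT(*) AS num_records FROM weather GROUP BY station_id").show()

# Maximum wind speed per station ("wind_speed" is an assumed column name).
sp.sql("SELECT station_id, MAX(wind_speed) AS max_wind FROM weather GROUP BY station_id").show()

# Average temperature per station ("temperature" is an assumed column name).
sp.sql("SELECT station_id, AVG(temperature) AS avg_temp FROM weather GROUP BY station_id").show()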

Question 5 (2 points)

  1. Explore the DataFrame and run two SQL queries of your own choosing. Explain the meaning of each query and what you learn from its results.

Project 08 Assignment Checklist

  • Jupyter Lab notebook with your code, comments and outputs for the assignment

    • firstname-lastname-project08.ipynb

  • Python file with code and comments for the assignment

    • firstname-lastname-project08.py

  • Submit files through Gradescope

Please double-check that your submission is complete and contains all of your code and output before submitting. If you are on a spotty internet connection, we recommend downloading your submission after submitting it to make sure that what you think you submitted is what you actually submitted.

In addition, please review our submission guidelines before submitting your project.