STAT 39000: Project 1 — Spring 2022

Motivation: Welcome back! This semester should be a bit more straightforward than last semester in many ways. In the first project back, we will do a bit of UNIX review, a bit of Python review, and I’ll ask you to learn and write about some terminology.

Context: This is the first project of the semester! We will be taking it easy and slowly getting back to it.

Scope: UNIX, Python

Learning Objectives
  • Differentiate between concurrency and parallelism at a high level.

  • Differentiate between synchronous and asynchronous.

Make sure to read about, and use the template found here, and the important information about projects submissions here.

Dataset(s)

The following questions will use the following dataset(s):

  • /depot/datamine/data/

Questions

Question 1

Google the difference between synchronous and asynchronous — there is a lot of information online about this.

Explain what the following tasks are (in day-to-day usage) and why: asynchronous, or synchronous.

  • Communicating via email.

  • Watching a live lecture.

  • Watching a lecture that is recorded.

Please review our updated submission guidelines before submitting your project.

Items to submit
  • Code used to solve this problem.

  • Output from running the code.

Solution
  • Email: Async

  • Live lecture: Sync

  • Recorded lecture: Async

Question 2

Given the following scenario and rules, explain the synchronous and asynchronous ways of completing the task.

You have 2 reports to write, and 2 wooden pencils. 1 sharpened pencil will write 1/2 of 1 report. You have a helper that is willing to sharpen 1 pencil at a time, for you, and that helper is able to sharpen a pencil in the time it takes to write 1/2 of 1 report.

You can assume you start with 2 sharpened pencils. Of course, if you assumed otherwise before the project was modified, you will get full credit with a different assumption.

Items to submit
  • Code used to solve this problem.

  • Output from running the code.

Solution

Sync:

Use the first pencil to write .5 reports. Use the second pencil to write .5 reports. Go sharpen both pencils. Use the first sharpened pencil to write .5 reports. Use the second pencil to write the last .5 report.

Async:

Use the first pencil to write .5 reports. Ask sharpener to sharpen first pencil and begin writing second .5 report with second pencil. Use first (now sharpened) pencil to write another .5 report and give second pencil to sharpener to sharpen. Use second (now sharpened) pencil to write last .5 reports.

Question 3

Write Python code that simulates the scenario in question (2) that is synchronous. Make the time it takes to sharpen a pencil be 2 seconds. Make the time it takes to write .5 reports 5 seconds.

Use time.sleep to accomplish this.

How much time does it take to write the reports in theory?

Items to submit
  • Code used to solve this problem.

  • Output from running the code.

Solution

In theory it would take 24 seconds to do this synchronously.

import time
from typing import List
from dataclasses import dataclass


@dataclass
class Pencil:
    is_sharp: bool = True

    def sharpen(self):
        time.sleep(2)
        self.is_sharp = True


def write_essays(number_essays: int, pencils: List[Pencil]):
        for essay in range(number_essays):

            # make sure both pencils are sharpened and sharpen both otherwise
            [pencil.sharpen() for pencil in pencils if not pencil.is_sharp]

            # write half the essay
            time.sleep(5)

            # dull first pencil
            pencils[0].is_sharp = False

            # write the other half essay
            time.sleep(5)

            # dull second pencil
            pencils[1].is_sharp = False


def simulate_story():
    pencils = [Pencil(), Pencil()]
    write_essays(2, pencils)

simulate_story() # 24 seconds

Question 4

The original text of the question is below. This is too difficult to do for this project. For this question, you are not required to write the code yourself. Rather, just answer the theoretical component to the question.

This question will be addressed in a future project, with better examples, and many more hints.

Read the stackoverflow post and write Python code that simulates the scenario in question (2) that is asynchronous. The time it takes to sharpen a pencil is 2 seconds and the time it takes to write .5 reports is 5 seconds.

Use async functions and asyncio.sleep to accomplish this.

How much time does it take to write the reports in theory?

Items to submit
  • Code used to solve this problem.

  • Output from running the code.

Solution

In theory it would take 20 seconds to do this asynchronously.

Code will be provided in the next project’s solutions.

Question 5

In your own words, describe the difference between concurrency and parallelism. Then, look at the flights datasets here: /depot/datamine/data/flights/subset. Describe an operation that you could do to the entire dataset as a whole. Describe how you (in theory) could parallelize that process.

Now, assume that you had the entire frontend system at your disposal. Use a UNIX command to find out how many cores the frontend has. If processing 1 file took 10 seconds to do. How many seconds would it take to process all of the files? Now, approximately how many seconds would it take to process all the files if you had the ability to parallelize on this system?

Don’t worry about overhead or the like. Just think at a very high level.

Best make sure this sounds like a task you’d actually like to do — I may be asking you to do it in the not-too-distant future.

Items to submit
  • Code used to solve this problem.

  • Output from running the code.

Solution

Parallelism is when processes are actually doing things at the same time. Concurrency is when multiple things can appear to be doing things at the same time, but really just start, run, and stop in an overlapping period of time.

Count the number of flights. You could parallelize by having each core run a job that processes a single year. Then, sum the results at the very end.

lscpu
Architecture:          x86_64
CPU op-mode(s):        32-bit, 64-bit
Byte Order:            Little Endian
CPU(s):                24
On-line CPU(s) list:   0-23
Thread(s) per core:    1
Core(s) per socket:    12
...

24 cores.

If 1 file took 10 seconds to process, and we have 22 files, and 24 cores, it would (in theory) take just about 10 seconds with parallelism and about 220 seconds without.

Please make sure to double check that your submission is complete, and contains all of your code and output before submitting. If you are on a spotty internet connect ion, it is recommended to download your submission after submitting it to make sure what you think you submitted, was what you actually submitted.

In addition, please review our submission guidelines before submitting your project.