Motivation: In Python it is important to understand some of the data types in more depth than you would in R. Many of the data types in Python will seem very familiar.

R            Python

character    string 'str'
integer      integer 'int'
numeric      float
logical      boolean 'bool'
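
One way to check the mapping in the table above is to ask Python for the type of a few values. This is only a minimal sketch; the variable names and values are made up for illustration.

```python
# Hypothetical example values, one per row of the table.
name = "Purdue"      # R character -> Python str
count = 42           # R integer   -> Python int
price = 19.99        # R numeric   -> Python float
is_clean = True      # R logical   -> Python bool

for value in (name, count, price, is_clean):
    # type(value).__name__ gives the short type name, such as 'str'
    print(f"{value!r} is of type {type(value).__name__}")
```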

In addition to these, Python has other types that diverge from R, such as `tuple`s, `list`s, `set`s, and `dict`s, and packages add further classifications of their own. The most popular packages are pandas and numpy.
It is important to understand these basic concepts in Python before jumping in.

Context: This is the second project introducing some basic data types, and demonstrating some familiar control flow concepts, all while digging right into a dataset.

Scope: tuples, lists, if statements, opening files

Learning Objectives
  • List the differences between lists & tuples and when to use each.

  • Gain familiarity with string methods, list methods, and tuple methods.

  • Demonstrate the ability to read and write data of various formats using various packages.

Make sure to read about and use the template found here, and read the important information about project submissions here.

Dataset

The following questions will use the dataset found in Anvil:

/anvil/projects/tdm/data/craigslist/vehicles.csv

Questions

ONE

Read in the dataset /anvil/projects/tdm/data/craigslist/vehicles.csv into a pandas DataFrame called myDF.

Insider Knowledge

pandas is an integral tool for various data science tasks in Python. You can read a quick intro here. We will be slowly introducing bits and pieces of this package throughout the semester.

Helpful Hints
import pandas as pd
from pathlib import Path
  1. How big is the dataset (in MB or GB)?
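
As a rough sketch of how pathlib can report a file's size: `Path.stat().st_size` gives the size in bytes, which can then be divided to get MB. The helper name `file_size_mb` and the temporary demo file are assumptions made for illustration, and 1 MB is taken to be 1,000,000 bytes here.

```python
from pathlib import Path
import tempfile

def file_size_mb(path):
    """Return the size of the file at `path` in MB (1 MB = 1,000,000 bytes)."""
    return Path(path).stat().st_size / 1_000_000

# Demonstrate on a small temporary file so the sketch runs anywhere;
# on Anvil you would pass the path to vehicles.csv instead.
with tempfile.TemporaryDirectory() as tmp_dir:
    demo_file = Path(tmp_dir) / "demo.csv"
    demo_file.write_bytes(b"x" * 2_500_000)  # exactly 2.5 million bytes
    demo_size = file_size_mb(demo_file)

print(f"The demo file is {demo_size:.2f} MB")
```

Dividing by 1,000 once more would give the size in GB.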

Items to submit
  • Code used to solve this problem.

  • Output from running the code.

  • The answers to the questions above

TWO

Use DataFrame attributes to get the number of columns and rows of our dataset. How many columns and rows are there? Use f-strings[1] to print a message, for example:

There are 123 columns in the DataFrame
There are 321 rows in the DataFrame!
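
Messages like these could be built with f-strings along the following lines. This is a sketch only: `example_shape` is a made-up (rows, columns) pair standing in for the real dimensions, which you still need to look up yourself.

```python
# Hypothetical (rows, columns) pair; not the real size of the dataset.
example_shape = (321, 123)

n_rows, n_cols = example_shape  # tuple unpacking

# Expressions inside the curly braces are evaluated and inserted into the string.
columns_message = f"There are {n_cols} columns in the DataFrame"
rows_message = f"There are {n_rows} rows in the DataFrame!"
print(columns_message)
print(rows_message)
```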

Insider Knowledge

Attributes are properties of a DataFrame that return data or information about that particular DataFrame.

  • index- a DataFrame has two kinds of index, a row index and a column index; this attribute returns the row index (the row labels)

myDF.index
  • columns- returns the column labels of the dataset

myDF.columns
  • axes- returns the row labels AND the column labels at the same time, as a list

myDF.axes
  • dtypes- short for data types; shows the data type of each column in the DataFrame

myDF.dtypes
  • size- shows the total number of elements/items in a DataFrame (rows times columns)

myDF.size
  • shape- shows the number of rows and columns of a DataFrame as a (rows, columns) tuple

myDF.shape
  • ndim- stands for number of dimensions; shows the number of dimensions of a DataFrame

myDF.ndim
  • empty- checks whether the DataFrame is empty; it returns True if the DataFrame has no elements and False otherwise

myDF.empty
  • T- stands for transpose; returns the DataFrame with its rows turned into columns and its columns turned into rows

myDF.T
  • values- returns the values of the DataFrame as a NumPy array, without the row and column labels

myDF.values
Helpful Hint

Earlier we learned how to read a CSV file into Python, line by line, and print values.
Use the csv package to print just the first row, which should contain the names of the columns, or, instead of using the csv package, use one of the pandas attributes of myDF to print the column names.
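
The csv route could look roughly like the sketch below. To keep it runnable anywhere, it reads made-up CSV text from an in-memory `io.StringIO` object instead of the real file; with a real file you would pass an open file handle (opened with `newline=""`) to `csv.reader` in the same way.

```python
import csv
import io

# Made-up CSV content standing in for a real file.
fake_file = io.StringIO("id,price,state\n1,5000,in\n2,7200,oh\n")

reader = csv.reader(fake_file)
header = next(reader)  # read the first row only
print(header)
```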

Items to submit
  • Code used to solve this problem.

  • Output from running the code.

  • The answers to the questions above

THREE

  • Using the pandas package get a list called our_columns that contains the column names.

  • Print the second value in the list.

  • Without using a loop, print the 1st, 3rd, 5th, etc. elements of the list.

  • Print the last four elements of the list (these include "state", "lat", and "long") by using negative indexes.
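
The indexing and slicing ideas in these steps can be sketched on a small list. The column names in `example_columns` are made up for illustration and are not the real columns of the dataset.

```python
# Made-up column names; the real list would come from the DataFrame.
example_columns = ["id", "url", "price", "year", "state", "lat", "long"]

print(example_columns[1])     # second value (indexes start at 0)
print(example_columns[::2])   # 1st, 3rd, 5th, ... elements, using a step of 2
print(example_columns[-4:])   # last four elements, using a negative index
```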

Items to submit
  • Code used to solve this problem.

  • Output from running the code.

  • The answers to the questions above

FOUR

  1. matplotlib is one of the primary plotting packages in Python. Use the following code:

my_values = tuple(myDF.loc[:, 'odometer'].dropna().to_list())

Using the result, which is a tuple, create a line plot of the odometer readings.

Insider Knowledge

A tuple is used to store multiple items in a single variable. It is one of four built-in data types that Python uses to store collections of data; the other three are list, set, and dictionary.

Tuple: a collection of items, separated by commas, that is ordered and cannot be changed (immutable). A tuple allows duplicate values, and its items can be of any data type.
List: an ordered collection that can be changed (mutable). Each value inside the list is an item. A list allows duplicate values, and its items can be of any data type.
Set: an unordered collection of unique items, so there are no duplicates. Items cannot be accessed by index because a set is unordered. Items can be added or removed, but each item must itself be hashable (for example a number, a string, or a tuple of such values).
Dictionary: a collection of key-value pairs, also known as an associative array. Dictionaries keep insertion order and are changeable (mutable). Keys must be unique; values can be of any data type and may repeat.
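
A short sketch of the four collection types side by side may help. All names and values here are made up for illustration.

```python
my_tuple = (3, 1, 3)          # ordered, immutable, duplicates allowed
my_list = [3, 1, 3]           # ordered, mutable, duplicates allowed
my_set = {3, 1, 3}            # unordered, the duplicate 3 is dropped
my_dict = {"a": 1, "b": 2}    # key-value pairs, keys are unique

my_list.append(7)             # a list can grow
my_set.add(7)                 # items can be added to a set
my_dict["a"] = 10             # assigning to an existing key replaces its value

print(my_tuple, my_list, my_set, my_dict)
```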

The line plot is not informative, and we cannot sort the values in place because a tuple is immutable. A good article about this is here. In addition, this is a great post that will help you understand when using a tuple would be beneficial.
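
To see what immutability means in practice, consider this small sketch. The `readings` tuple holds made-up numbers, not values from the dataset.

```python
readings = (30000, 12000, 98000)  # made-up odometer readings

# A tuple has no in-place sort method, unlike a list.
print(hasattr(readings, "sort"))
print(hasattr([], "sort"))

# Item assignment is rejected as well.
try:
    readings[0] = 0
except TypeError as err:
    print(f"TypeError: {err}")

# sorted() accepts any iterable, but it returns a new list
# and leaves the original tuple untouched.
sorted_readings = sorted(readings)
print(sorted_readings)
```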

  2. Now that we know lists are mutable, we know they can be sorted. Convert my_values to a list, sort it, and re-plot.

    Notice that there are potential outliers in your data that skew the plot. We can use negative indexing to choose which of the sorted values to plot. Remove the last 50 values (which are the highest values) and plot again.
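
The convert, sort, and trim steps can be sketched on a tiny example. The numbers below are made up, and `n_drop` is set to 2 only because the example is so small; the project itself asks for the last 50.

```python
my_values = (5, 900, 2, 7, 1000, 3)   # made-up stand-in for the odometer tuple

values_list = list(my_values)   # convert the tuple to a mutable list
values_list.sort()              # sort in place, smallest to largest

n_drop = 2                        # hypothetical; the project uses 50
trimmed = values_list[:-n_drop]   # everything except the last n_drop values
print(trimmed)
```

Because the list is sorted in ascending order, the slice `[:-n_drop]` leaves out exactly the largest values.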

Helpful Hint

To avoid drawing a new plot on top of the previous one, call the close method after showing each plot.

import matplotlib.pyplot as plt
my_values = [1,2,3,4,5]
plt.plot(my_values)
plt.show()
plt.close()
Items to submit
  • Code used to solve this problem.

  • Output from running the code.

  • The answers to the questions above

So far we have summarized some data and created a graph using matplotlib, used pandas to find interesting facts about our dataset, and highlighted the differences between lists and tuples. On to the next!


1. f-strings are an improved way to format strings: less prone to errors, faster, easier to read, and more concise.