Control Flow with the NOAA Data
This example is based on question 1-5 from TDM 102 Project 3 Spring 2024.
These example(s) depend on the database directory:
-
/anvil/projects/tdm/data/noaa
The NOAA data set includes weather-related information from multiple weather stations across the United States. It contains daily records for several years in CSV files.
Examine the Files in the NOAA Data Directory
Explore the files in the provided data set directory. Find out how many years are included in the data set.
ls /anvil/projects/tdm/data/noaa
1750.csv 1800.csv 1838.csv 1876.csv 1914.csv 1952.csv 1990.csv 1763.csv 1801.csv 1839.csv 1877.csv 1915.csv 1953.csv 1991.csv 1764.csv 1802.csv 1840.csv 1878.csv 1916.csv 1954.csv 1992.csv 1765.csv 1803.csv 1841.csv 1879.csv 1917.csv 1955.csv 1993.csv 1766.csv 1804.csv 1842.csv 1880.csv 1918.csv 1956.csv 1994.csv 1767.csv 1805.csv 1843.csv 1881.csv 1919.csv 1957.csv 1995.csv 1768.csv 1806.csv 1844.csv 1882.csv 1920.csv 1958.csv 1996.csv 1769.csv 1807.csv 1845.csv 1883.csv 1921.csv 1959.csv 1997.csv 1770.csv 1808.csv 1846.csv 1884.csv 1922.csv 1960.csv 1998.csv 1771.csv 1809.csv 1847.csv 1885.csv 1923.csv 1961.csv 1999.csv 1772.csv 1810.csv 1848.csv 1886.csv 1924.csv 1962.csv 2000.csv 1773.csv 1811.csv 1849.csv 1887.csv 1925.csv 1963.csv 2001.csv 1774.csv 1812.csv 1850.csv 1888.csv 1926.csv 1964.csv 2002.csv 1775.csv 1813.csv 1851.csv 1889.csv 1927.csv 1965.csv 2003.csv 1776.csv 1814.csv 1852.csv 1890.csv 1928.csv 1966.csv 2004.csv 1777.csv 1815.csv 1853.csv 1891.csv 1929.csv 1967.csv 2005.csv 1778.csv 1816.csv 1854.csv 1892.csv 1930.csv 1968.csv 2006.csv 1779.csv 1817.csv 1855.csv 1893.csv 1931.csv 1969.csv 2007.csv 1780.csv 1818.csv 1856.csv 1894.csv 1932.csv 1970.csv 2008.csv 1781.csv 1819.csv 1857.csv 1895.csv 1933.csv 1971.csv 2009.csv 1782.csv 1820.csv 1858.csv 1896.csv 1934.csv 1972.csv 2010.csv 1783.csv 1821.csv 1859.csv 1897.csv 1935.csv 1973.csv 2011.csv 1784.csv 1822.csv 1860.csv 1898.csv 1936.csv 1974.csv 2012.csv 1785.csv 1823.csv 1861.csv 1899.csv 1937.csv 1975.csv 2013.csv 1786.csv 1824.csv 1862.csv 1900.csv 1938.csv 1976.csv 2014.csv 1787.csv 1825.csv 1863.csv 1901.csv 1939.csv 1977.csv 2015.csv 1788.csv 1826.csv 1864.csv 1902.csv 1940.csv 1978.csv 2016.csv 1789.csv 1827.csv 1865.csv 1903.csv 1941.csv 1979.csv 2017.csv 1790.csv 1828.csv 1866.csv 1904.csv 1942.csv 1980.csv 2018.csv 1791.csv 1829.csv 1867.csv 1905.csv 1943.csv 1981.csv 2019.csv 1792.csv 1830.csv 1868.csv 1906.csv 1944.csv 1982.csv 2020.csv 1793.csv 1831.csv 1869.csv 1907.csv 1945.csv 1983.csv 2020_sample.csv 1794.csv 1832.csv 1870.csv 1908.csv 1946.csv 1984.csv 2020_sampleB.csv 1795.csv 1833.csv 1871.csv 1909.csv 1947.csv 1985.csv 2021.csv 1796.csv 1834.csv 1872.csv 1910.csv 1948.csv 1986.csv 2022.csv 1797.csv 1835.csv 1873.csv 1911.csv 1949.csv 1987.csv 2023.csv 1798.csv 1836.csv 1874.csv 1912.csv 1950.csv 1988.csv readme-by_year.txt 1799.csv 1837.csv 1875.csv 1913.csv 1951.csv 1989.csv status-by_year.txt
Import Pandas and Pathlib
Import pandas and pathlib using: import pandas as pd and also from pathlib import Path
import pandas as pd
from pathlib import Path
Using a For Loop to List NOAA Data Files
Create a list named files
, to hold Path objects from 1880.csv to 1883.csv in the data set folder using list comprehension:
files=[]
for year in range(1880,1884):
file = Path(f'/anvil/projects/tdm/data/noaa/{year}.csv')
files.append(file)
files
[PosixPath('/anvil/projects/tdm/data/noaa/1880.csv'), PosixPath('/anvil/projects/tdm/data/noaa/1881.csv'), PosixPath('/anvil/projects/tdm/data/noaa/1882.csv'), PosixPath('/anvil/projects/tdm/data/noaa/1883.csv')]
OR you could try:
files = [Path(f'/anvil/projects/tdm/data/noaa/{year}.csv') for year in range(1880,1884)]
files
[PosixPath('/anvil/projects/tdm/data/noaa/1880.csv'), PosixPath('/anvil/projects/tdm/data/noaa/1881.csv'), PosixPath('/anvil/projects/tdm/data/noaa/1882.csv'), PosixPath('/anvil/projects/tdm/data/noaa/1883.csv')]
Calculate the Number of Records in a Single File
Calculate how many records are in the file 1880.csv. (Each line is one record.)
file= Path('/anvil/projects/tdm/data/noaa/1880.csv')
with open(file,'r') as f:
count = 0
for line in f:
count+=1
print(f'there are {count} records in the {file}')
there are 370779 records in the /anvil/projects/tdm/data/noaa/1880.csv
OR you could try:
with open(files[0],'r') as f:
count = 0
for line in f:
count+=1
print(f'there are {count} records in the {file}')
there are 370779 records in the /anvil/projects/tdm/data/noaa/1880.csv
Total Number of Records Across 4 Files
Calculate how many records there are (altogether) in the 4 files from 1880.csv to 1883.csv. Use the list files
that you created previously:
files = [Path(f'/anvil/projects/tdm/data/noaa/{year}.csv') for year in range(1880,1884)]
files
total = 0
for file in files:
with open(file,'r') as f:
count = 0
for line in f:
count+= 1
total += count
print(f'there are {total} records in the {files}')
there are 1906862 records in the [PosixPath('/anvil/projects/tdm/data/noaa/1880.csv'), PosixPath('/anvil/projects/tdm/data/noaa/1881.csv'), PosixPath('/anvil/projects/tdm/data/noaa/1882.csv'), PosixPath('/anvil/projects/tdm/data/noaa/1883.csv')]
Display Columns and Preview Data in the First File
Run the following statement, to read in the first file from the list myfiles into a DataFrame using pd.read_csv(files[0])
.
file_df = pd.read_csv(files[0])
file_df
AGE00147705 | 18800101 | PRCP | 0 | Unnamed: 4 | Unnamed: 5 | E | Unnamed: 7 |
---|---|---|---|---|---|---|---|
AGE00147708 |
18800101 |
PRCP |
36 |
NaN |
NaN |
E |
NaN |
AGE00147709 |
18800101 |
PRCP |
36 |
NaN |
NaN |
E |
NaN |
AGE00147712 |
18800101 |
PRCP |
0 |
NaN |
NaN |
E |
NaN |
AGE00147713 |
18800101 |
TMAX |
52 |
NaN |
NaN |
E |
NaN |
AGE00147713 |
18800101 |
TMIN |
0 |
NaN |
NaN |
E |
NaN |
… |
… |
… |
… |
… |
… |
… |
… |
USW00093852 |
18801231 |
TMAX |
39 |
NaN |
NaN |
X |
NaN |
USW00093852 |
18801231 |
TMIN |
-17 |
NaN |
NaN |
X |
NaN |
USW00094728 |
18801231 |
TMAX |
-122 |
NaN |
NaN |
0 |
NaN |
USW00094728 |
18801231 |
TMIN |
-194 |
NaN |
NaN |
0 |
NaN |
USW00094728 |
18801231 |
PRCP |
0 |
NaN |
NaN |
0 |
NaN |
370778 rows × 8 columns
Display Columns
Show the column names for the dataframe:
column_names = file_df.columns
print(column_names)
Index(['AGE00147705', '18800101', 'PRCP', '0', 'Unnamed: 4', 'Unnamed: 5', 'E', 'Unnamed: 7'], dtype='object')
Fix Column Headers by Specifying header=None
If we examine the column names and preview the data up above for filedf
, you will notice that there is an issue with the column names. It seems that the column names are actually the first observation. Correct this by specifying the argument header=None
:
file_df = pd.read_csv(files[0],header=None)
file_df
0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 |
---|---|---|---|---|---|---|---|
0 |
AGE00147705 |
18800101 |
PRCP |
0 |
NaN |
NaN |
E |
1 |
AGE00147708 |
18800101 |
PRCP |
36 |
NaN |
NaN |
E |
2 |
AGE00147709 |
18800101 |
PRCP |
36 |
NaN |
NaN |
E |
3 |
AGE00147712 |
18800101 |
PRCP |
0 |
NaN |
NaN |
E |
4 |
AGE00147713 |
18800101 |
TMAX |
52 |
NaN |
NaN |
E |
… |
… |
… |
… |
… |
… |
… |
… |
370774 |
USW00093852 |
18801231 |
TMAX |
39 |
NaN |
NaN |
X |
370775 |
USW00093852 |
18801231 |
TMIN |
-17 |
NaN |
NaN |
X |
370776 |
USW00094728 |
18801231 |
TMAX |
-122 |
NaN |
NaN |
0 |
370777 |
USW00094728 |
18801231 |
TMIN |
-194 |
NaN |
NaN |
0 |
370778 |
USW00094728 |
18801231 |
PRCP |
0 |
NaN |
NaN |
0 |
Add Column Names
Now let us add these 7 column names: id, date, element_code, value, mflag, qflag, sflag, and obstime to the data frame:
pd.read_csv(files[0],names = ["id","date","element_code","value","mflag","qflag","sflag","obstime"])
Make a List of DataFrames for Each Year (1880-1883)
Make a list called mydataframes (of length 4) that contains 4 data frames, one for each year, from 1880.csv to 1883.csv. Starting with the sample code (above) for reading in the first file, modify our example, so that you have a "for" loop that reads in all 4 files. Test your work with a for loop that displays the column names of each of the four data frames.
import pandas as pd
file_DFs = []
for file in files:
file_df = pd.read_csv(file,names= ["id","date","element_code","value","mflag","qflag","sflag","obstime"])
file_DFs.append(file_df)
file_DFs
[ id date element_code value mflag qflag sflag obstime 0 AGE00147705 18800101 PRCP 0 NaN NaN E NaN 1 AGE00147708 18800101 PRCP 36 NaN NaN E NaN 2 AGE00147709 18800101 PRCP 36 NaN NaN E NaN 3 AGE00147712 18800101 PRCP 0 NaN NaN E NaN 4 AGE00147713 18800101 TMAX 52 NaN NaN E NaN ... ... ... ... ... ... ... ... ... 370774 USW00093852 18801231 TMAX 39 NaN NaN X NaN 370775 USW00093852 18801231 TMIN -17 NaN NaN X NaN 370776 USW00094728 18801231 TMAX -122 NaN NaN 0 NaN 370777 USW00094728 18801231 TMIN -194 NaN NaN 0 NaN 370778 USW00094728 18801231 PRCP 0 NaN NaN 0 NaN [370779 rows x 8 columns], id date element_code value mflag qflag sflag obstime 0 AGE00135039 18810101 PRCP 30 NaN NaN E NaN 1 AGE00147705 18810101 PRCP 479 NaN NaN E NaN 2 AGE00147708 18810101 PRCP 125 NaN NaN E NaN 3 AGE00147709 18810101 PRCP 125 NaN NaN E NaN 4 AGE00147711 18810101 PRCP 0 NaN NaN E NaN ... ... ... ... ... ... ... ... ... 443265 UZM00038457 18811231 PRCP 100 NaN NaN r NaN 443266 UZM00038457 18811231 TAVG -3 NaN NaN r NaN 443267 UZM00038618 18811231 TMIN -71 NaN NaN r NaN 443268 UZM00038618 18811231 PRCP 161 NaN NaN r NaN 443269 UZM00038618 18811231 TAVG -49 NaN NaN r NaN [443270 rows x 8 columns], id date element_code value mflag qflag sflag obstime 0 AGE00135039 18820101 PRCP 0 NaN NaN E NaN 1 AGE00147705 18820101 PRCP 0 NaN NaN E NaN 2 AGE00147708 18820101 PRCP 0 NaN NaN E NaN 3 AGE00147709 18820101 PRCP 0 NaN NaN E NaN 4 AGE00147711 18820101 PRCP 0 NaN NaN E NaN ... ... ... ... ... ... ... ... ... 514078 UZM00038457 18821231 PRCP 0 NaN NaN r NaN 514079 UZM00038457 18821231 TAVG 16 NaN NaN r NaN 514080 UZM00038618 18821231 TMIN 10 NaN NaN r NaN 514081 UZM00038618 18821231 PRCP 0 NaN NaN r NaN 514082 UZM00038618 18821231 TAVG 33 NaN NaN r NaN [514083 rows x 8 columns], id date element_code value mflag qflag sflag obstime 0 AGE00135039 18830101 PRCP 0 NaN NaN E NaN 1 AGE00147705 18830101 PRCP 0 NaN NaN E NaN 2 AGE00147708 18830101 PRCP 0 NaN NaN E NaN 3 AGE00147709 18830101 PRCP 0 NaN NaN E NaN 4 AGE00147711 18830101 PRCP 0 NaN NaN E NaN ... ... ... ... ... ... ... ... ... 578725 UZM00038457 18831231 PRCP 0 NaN NaN r NaN 578726 UZM00038457 18831231 TAVG -69 NaN NaN r NaN 578727 UZM00038618 18831231 TMIN -22 NaN NaN r NaN 578728 UZM00038618 18831231 PRCP 0 NaN NaN r NaN 578729 UZM00038618 18831231 TAVG 1 NaN NaN r NaN [578730 rows x 8 columns]]
Unique elements of the column element_code
Print out the (unique) elements of the column element_code (i.e., show each element just one time).
i=1
for file_df in file_DFs:
print(f"file {i}'s unique element_codes:{file_df['element_code'].unique()}")
i+=1
file 1's unique element_codes:['PRCP' 'TMAX' 'TMIN' 'SNOW' 'SNWD' 'TAVG' 'MDPR' 'DATN' 'DATX' 'MDTN' 'MDTX' 'DAPR' 'DWPR' 'MDSF' 'WT01' 'TOBS' 'WT04' 'WT05'] file 2's unique element_codes:['PRCP' 'TMAX' 'TMIN' 'TAVG' 'SNOW' 'SNWD' 'MDPR' 'MDSF' 'DATN' 'DATX' 'MDTN' 'MDTX' 'WT04' 'DAPR' 'DWPR' 'TOBS'] file 3's unique element_codes:['PRCP' 'TMAX' 'TMIN' 'TAVG' 'SNOW' 'SNWD' 'DATN' 'DATX' 'MDTN' 'MDTX' 'DAPR' 'DWPR' 'MDPR' 'MDSF' 'WT04' 'TOBS' 'WT05' 'WT03' 'WT16' 'WT18' 'DASF'] file 4's unique element_codes:['PRCP' 'TMAX' 'TMIN' 'TAVG' 'SNOW' 'SNWD' 'DAPR' 'DWPR' 'MDPR' 'MDSF' 'WT11' 'DATN' 'DATX' 'MDTN' 'MDTX' 'WT03' 'DASF' 'WT18' 'WT04' 'WT16' 'WT08']
Snow Occurence
Find the number of times that SNOW occurs in the element_code column.
You could try:
i=1
for file_df in file_DFs:
snow_ct = len(file_df[file_df['element_code']=='SNOW'])
print(f"file {i} has {snow_ct} element_code as 'SNOW'")
i+=1
OR you could try:
i=1
for file_df in file_DFs:
snow_ct = file_df['element_code'].value_counts()['SNOW']
print(f"file {i} has {snow_ct} element_code as 'SNOW'")
i+=1
OR you could try:
i=1
count=0
for file_df in file_DFs:
snow_ct = (file_df['element_code']=='SNOW').sum()
count+=snow_ct
print(f"file {i} has {snow_ct} element_code as 'SNOW'")
i+=1
print(f'count total is {count}')
These three methods would lead to the same output:
file 1 has 33700 element_code as 'SNOW' file 2 has 36451 element_code as 'SNOW' file 3 has 40648 element_code as 'SNOW' file 4 has 53417 element_code as 'SNOW'
Chunksize
Now let us practice using the chunksize
feature for big data. You may refer to this document, to get more information about chunksize
.
Try to run the following 2 programs, to find the number of times that SNOW
occurs in the element_code
column, from the year 1880 to year 1883.
Version 1
import pandas as pd
from pathlib import Path
myfiles=[]
for year in range (1880, 1884):
file= Path(f'/anvil/projects/tdm/data/noaa/{year}.csv')
myfiles.append(file)
import time
count = 0
startT = time.time()
for file in files:
for df in pd.read_csv(file, names=["id", "date", "element_code", "value", "mflag", "qflag", "sflag", "obstime"], chunksize=10000):
count += len(df[df['element_code'] == 'SNOW'])
print(count)
total_T = time.time() - startT
print(total_T)
164216 0.760974645614624
Version 2
count = 0
for file in files:
for df in pd.read_csv(file, names=["id", "date", "element_code", "value", "mflag", "qflag", "sflag", "obstime"], chunksize=10000):
for index, row in df.iterrows():
if row['element_code'] == 'SNOW':
count += 1
print(count)
164216 56.58658051490784