pandas Dates and Times

Basics

Visualization is one of the crucial foundations of practicing data science, and analyzing information on a timeline can be vital to understanding its behavior. However, our understanding and structure of time is difficult to translate into raw computing — we can recognize that the phrase "8/1/2022" is the day that follows "7/31/2022", but without help, a computer would just see those as two different strings.

To help, pandas introduced the to_datetime function to convert strings (like the ones above) to Timestamp objects, which have many attributes that help translate our human perception of dates and times for the computer’s sake.

Python already has a datetime package with a datetime object — pandas Timestamp is an adaptation of this object that works with pandas functions, though the two are often used interchangeably. In fact, the pandas documentation doesn’t have datetime object page, just a Timestamp one.


to_datetime

Let’s start with an example DataFrame that contains event types, attendance numbers, and the dates on which the events occur. We’ll check the data types of our columns to see what pandas starts out with.

import pandas as pd

list_1 = ['7/1/2021', '7/4/2021', '8/20/2021', '9/2/2021', '4/1/2050']
list_2 = ['Presentation', 'Class', 'Presentation', 'Event', 'Class']
list_3 = [25, 0, 50, 48, 1000000]

myDF = pd.DataFrame(zip(list_1, list_2, list_3), columns=['date', 'type', 'attendance'])
print(myDF.dtypes)
date          object
type          object
attendance     int64

As we can see from the output, date is a general object — this doesn’t help us, as we need a Timestamp object and its host of useful functions and attributes. The simplest way to convert this only requires our to_datetime function, the column we want to change, and the format our dates come in:

myDF['date'] = pd.to_datetime(myDF['date'], format='%m/%d/%Y')
print(myDF.dtypes)
date          datetime64[ns]
type                  object
attendance             int64

Hold on — what’s with the datetime64[ns] object? In spite of this output, if we test the type of our individual dates, we find that they are indeed Timestamp objects. The implication is that all the Timestamp attributes are usable when working with date.

type(myDF.date[0])
<class 'pandas._libs.tslibs.timestamps.Timestamp'>


format Argument

The date column is now saved with the data type that we want. If you recall, we had an odd-looking format argument. The idea is that we are telling pandas how to parse the date: month first, then day, then year, separated by forward slashes. If the dates had dashes or some other separator, we could indicate this in format to ensure pandas is reading it correctly.

format does pay attention to capitalization. For example, %y means the date excludes century, while %Y includes century (think 13 compared to 2013). There are helpful tables to explain the different formats, and we’ll include the main ones here:

  • Month:

    • %b — abbreviated month, "Oct"; %B — full month name, "October"

    • %m — month number with leading zero, "04"; %-m — month number, no leading zero, "4"

  • Day:

    • %a — abbreviated weekday, "Mon"; %A — full weekday name, "Monday"

    • %d — day number with leading zero, "07"; %-d — day number, no leading zero, "7"

  • Year:

    • %y — last two numbers of year, "14"; %Y — full year, "1914"

%M corresponds to minute, while %m corresponds to month.


Dates/Times Objects, Methods, and Attributes

Converting data using to_datetime is only useful if we can extract further information using Timestamp and other pandas packages.


Time Differences: Timedelta and DateOffset

The Timedelta and DateOffset objects are simple: they are time/date objects different from date times that can be supplemented a difference between dates (4 days from now, 4 months ago, etc) to find the desired future/past date. The key difference is that Timedelta works with absolute time (1 day = 24 hours), while DateOffset works with calendar time, accounting for things such as daylight savings time. DateOffset also lacks an array class corresponding to timedelta64[ns].


Timedelta

For Timedelta, you can use either keyword arguments or a pair including (value, unit) to indicate the duration of time.

import pandas as pd

day_example = pd.Timedelta(days=6)
week_example = pd.Timedelta(weeks=14)
print(day_example)
print(week_example)
6 days 00:00:00
98 days 00:00:00

As we can see here, Timedelta works in days, hours, minutes, and seconds, where the equivalency of 1 week = 7 days applies before output. Since months and years have differing numbers of days, you cannot use them when generating Timedeltas.


DateOffset

Imported from pandas.tseries.offsets, DateOffset is another method allowing for the changing of times. Since this category uses calendar-based logic, we can supplement years, months, or weeks as keywords.

A plural parameter (such as years) and a singular parameter (such as year) mean different things — including a plural will add to the Timestamp, including a singular will replace the Timestamp. We’ll demonstrate the difference in the next example.

print(pd.Timestamp("2019-12-25") + pd.DateOffset(day=1))
print(pd.Timestamp("2019-12-25") + pd.DateOffset(days=1))
2019-12-01 00:00:00
2019-12-26 00:00:00

Knowing the different uses of the two data types is important.

  • DateOffset is useful due to its broader acceptance of time parameters and use of calendar logic. We often want to know the dates of things weeks, months, and years in advance, and it’s inconvenient to translate those to number of days for use with Timedelta. Additionally, 4 months from January 15th is informally understood to be May 15th, and DateOffset understands this where Timedelta does not.

  • Timedelta can be interpreted by DataFrames and Series, while DateOffset cannot and is cast as a simple object. Additionally, any DateOffset time measurements equal to or shorter than an hour function like Timedelta.

It’s important to think of these two objects as cooperating tools to your success rather than mutually exclusive options.


dt operator

You’ll notice that Timestamp attributes include basic elements of date information — month, day, year, second, day_of_week, and so on. Let’s try and creating a month column from date in myDF:

myDF['month'] = myDF['date'].month
AttributeError: 'Series' object has no attribute 'month'

As the error message says, we can’t get month from the date column because it’s a Series, not a Timestamp. Instead of looping through each value or using apply, we have the dt accessor, allowing us to use Timestamp attributes and functions column-wide. Now we can create our month column:

myDF['month'] = myDF['date'].dt.month
print(myDF)
        date          type  attendance  month
0 2021-07-01  Presentation          25      7
1 2021-07-04         Class           0      7
2 2021-08-20  Presentation          50      8
3 2021-09-02         Event          48      9
4 2050-04-01         Class     1000000      4


Examples

Create month and year columns from our date column.

Click to see solution
myDF['month'] = myDF['date'].dt.month
myDF['year'] = myDF['date'].dt.year
print(myDF)
        date          type  attendance  month  year
0 2021-07-01  Presentation          25      7  2021
1 2021-07-04         Class           0      7  2021
2 2021-08-20  Presentation          50      8  2021
3 2021-09-02         Event          48      9  2021
4 2050-04-01         Class     1000000      4  2050

Create the weekday column from date.

Click to see solution
myDF['weekday'] = myDF['date'].dt.day_name()
print(myDF)
        date          type  attendance   weekday
0 2021-07-01  Presentation          25  Thursday
1 2021-07-04         Class           0    Sunday
2 2021-08-20  Presentation          50    Friday
3 2021-09-02         Event          48  Thursday
4 2050-04-01         Class     1000000    Friday

Shift all the days in date forward by one week, replacing the old dates in the process.

Click to see solution
myDF['date'] = myDF['date'] + pd.Timedelta("7 days")
print(myDF)
        date          type  attendance
0 2021-07-08  Presentation          25
1 2021-07-11         Class           0
2 2021-08-27  Presentation          50
3 2021-09-09         Event          48
4 2050-04-08         Class     1000000

Suppose myDF.date contains exclusively days from the first semester of an academic year, and each year ends on May 31st. Create the end_of_school column using the date column and DateOffset, which contains the last day of school for that academic year. Then create days_until_school_is_over, a column that contains the number of days between date and end_of_school.

Click to see solution
one_year_later = myDF['date'] + pd.offsets.DateOffset(years=1)
myDF['end_of_school'] = pd.to_datetime({'month': 5, 'day': 31, 'year':one_year_later.dt.year})
myDF['days_until_school_is_over'] = myDF['end_of_school'] - myDF['date']
print(myDF)
        date          type  attendance end_of_school days_until_school_is_over
0 2021-07-01  Presentation          25    2022-05-31                  334 days
1 2021-07-04         Class           0    2022-05-31                  331 days
2 2021-08-20  Presentation          50    2022-05-31                  284 days
3 2021-09-02         Event          48    2022-05-31                  271 days
4 2050-11-01         Class     1000000    2051-05-31                  211 days


Resources

As always, the information we include here is just a portion of all you can know about using dates and times in pandas. They have a great, extensive user guide that includes DatetimeIndex, indexing using dates, and much more.