Home Data Engineering Data DIY Analyzing Your Personal Netflix Data

# Analyzing Your Personal Netflix Data

 Audio version of the article

How much time have I spent watching The Office?

That’s a question that has run through my head repeatedly over the years. The beloved sitcom has been my top “comfort show/background noise” choice for a long time.

It used to be a question I couldn’t answer, because the data Netflix allowed users to download about their activity was extremely limited.

Now, though, Netflix allows you to download a veritable treasure-trove of data about your account. With a just a little Python and pandas programming, we can now get a concrete answer to the question:Â how much time have I spent watching The Office?

Want to find out how much time you have spent watchingÂ The Office, or any other show on Netflix?

In this tutorial, we’ll walk you through exactly how to do it step by step!

But first, let’s answer a quick question . . .

## Can’t I Just Use Excel? Why Do I Need to Write Code?

Depending on how much Netflix you watch and how long you’ve had the service, youÂ mightÂ be able to use Excel or some other spreadsheet software to analyze your data.

But there’s a good chance that will be tough.

The dataset you’ll get from Netflix includesÂ everyÂ time a video ofÂ anyÂ length played â€” that includes those trailers that auto-play as you’re browsing your list.

So, if you use Netflix often or have had the streaming service for a long time, the file you’re working with is likely to be pretty big. My own viewing activity data, for example, was over 27,000 rows long.

OpeningÂ a file that big in Excel is no problem. But to do our analysis, we’ll need to do a bunch of filtering and performing calculations. With that much data, Excel can get seriously bogged-down, especially if your computer isn’t particularly powerful.

Scrolling through such a huge dataset trying to find specific cells and formulas can also become confusing fast.

Python can handle large datasets and calculations like this much more smoothly because it doesn’t have to render everything visually. And since we can do everything with just a few lines of code, it’ll be really easy to see everything we’re doing, without having to scroll through a big spreadsheet looking for cells with formulas.

For the purposes of this tutorial, I’ll be using my own Netflix data. To grab your own, make sure you’re logged in to Netflix and thenÂ visit this page. From the main Netflix screen, you can also find this page by clicking your account icon in the top right, clicking “Account”, and then clicking “Download your personal information” on the page that loads.

On the next page, you should see this:

Click “Submit a Request.” Netflix will send you a confirmation email, which you’ll need to click.

Then, unfortunately, you’ll have to wait. Netflix says preparing your data report can take up to 30 days. I once got one report within 24 hours, but another one took several weeks. Consider bookmarking this page so that you can come back once you’ve got your data.

If you’d like, I’ve also madeÂ a small sample from my own data available for download here. If you’d like, you can download that file and use it work through this project. Then, when your own data becomes available, simply substitute your file for the same, run your code again, and you’ll get your answers almost instantly!

Netflix will email you when your report is available to download. When it is, act fast because the download will “expire” and disappear again after a couple of weeks!

The download will arrive as a .zip file that contains roughly a dozen folders, most of which contain data tables in .csv format. There are also two PDFs with additional information about the data.

## Step 2: Familiarize Yourself with the Data

This is a critical step in the data analysis process. The better we understand our data, the better our chances are of producing meaningful analysis.

Let’s take a look at what we’ve got. Here’s what we’ll see when we unzip the file:

Our goal here is to figure out how much time I’ve spent watching Netflix.Â `Content Interaction`Â seems like the most likely folder to contain that data. If we open it, we’ll find a file calledÂ `ViewingActivity.csv`Â that looks exactly like what we want â€” a log of everything we’ve viewed over the history of the account.

Looking at the data, we can quickly spot one potential challenge. There’s a single column,Â `Title`, that contains both showÂ andÂ episode titles, so we’ll need to do a little extra work to filter for only episodes ofÂ The Office.

At this point, it would be tempting to dive right into the analysis using that data, but let’s make sure we understand it first! In the downloaded zip file, there’s a file calledÂ `Cover sheet.pdf`Â that contains data dictionaries for all of the .csv files, includingÂ `ViewingActivity.csv`.

This data dictionary can help us answer questions and avoid errors. For example, consulting the dictionary forÂ `ViewingActivity.csv`, we can see that the columnÂ `Start Time`Â uses theÂ UTC timezone. If we want to analyze which times of day we most often watch Netflix, for example, we’ll need to convert this column to our local timezone.

Take some time to look over the data inÂ `ViewingActivity.csv`Â and the data dictionary inÂ `Cover sheet.pdf`Â before moving on to the next step!

For this tutorial, we’ll be analyzing our data using Python and pandas in a Jupyter notebook. If you don’t already have that set up, you can find a quick, beginner-friendly guide at the beginning ofÂ this tutorial, or check outÂ a more in depth Jupyter Notebook for Beginners post.

Once we’ve got a notebook open, we’ll import the pandas library and read our Netflix data CSV into a pandas dataframe we’ll callÂ `df`:

``````import pandas as pd

Now, let’s do a quick preview of the data to make sure everything looks correct. We’ll start withÂ `df.shape`, which will tell us the number of rows and columns in the dataframe we’ve just created.

``df.shape``
``(27354, 10)``

That result means we have 27,353 rows and 10 columns. Now, let’s see what it looks like by previewing the first few rows of data usingÂ `df.head()`.

To maintain some privacy, I’ll be adding the additional argumentÂ `1`Â inside theÂ `.head()`Â parentheses so that only a single row prints in this blog post. In your own analysis, however, you can use the defaultÂ `.head()`Â to print the first five rows.

``df.head(1)``
Profile Name Start Time Duration Attributes Title Supplemental Video Type Device Type Bookmark Latest Bookmark Country
0 Charlie 2020-10-29 3:27:48 0:00:02 NaN The Office (U.S.): Season 7: Ultimatum (Episod… NaN Sony PS4 0:00:02 0:00:02 US (United States)

Perfect!

## Step 4: Preparing the Data for Analysis

Before we can do our number-crunching, let’s clean up this data a bit to make it easier to work with.

### Dropping Unnecessary Columns (Optional)

First, we’ll start by dropping the columns we’re not planning to use. This is totally optional, and it’s probably not a good idea for large-scale or ongoing projects. But for a small-scale personal project like this, it can be nice to work with a dataframe that includesÂ onlyÂ columns we’re actually using.

In this case, we’re planning to analyzeÂ how muchÂ andÂ whenÂ I’ve watchedÂ The Office, so we’ll need to keep theÂ `Start Time`,Â `Duration`, andÂ `Title`Â columns. Everything else can go.

To do this, we’ll useÂ `df.drop()`Â and pass it two arguments:

1. A list of the columns we’d like to drop
2. `axis=1`, which tells pandas to drop columns

Here’s what it looks like:

``````df = df.drop(['Profile Name', 'Attributes', 'Supplemental Video Type', 'Device Type', 'Bookmark', 'Latest Bookmark', 'Country'], axis=1)
Start Time Duration Title
0 2020-10-29 3:27:48 0:00:02 The Office (U.S.): Season 7: Ultimatum (Episod…

Great! Next, let’s work with the time data.

### Converting Strings to Datetime and Timedelta in Pandas

The data in our two time-related columns certainly looks correct, but what format is this data actually being stored in? We can useÂ `df.dtypes`Â to get a quick list of the data types for each column in our dataframe:

``df.dtypes``
``````Start Time    object
Duration      object
Title         object
dtype: object``````

As we can see here, all three columns are stored asÂ `object`, which means they’re strings. That’s fine for theÂ `Title`Â column, but we need to change the two time-related columns into the correct datatypes before we can work with them.

Specifically, we need to do the following:

• ConvertÂ `Start Time`Â to datetime (a data and time format pandas can understand and perform calculations with)
• ConvertÂ `Start Time`Â from UTC to our local timezone
• ConvertÂ `Duration`Â to timedelta (a time duration format pandas can understand and perform calculations with)

So, let’s approach those tasks in that order, starting with convertingÂ `Start Time`Â to datetime using pandas’sÂ `pd.to_datetime()`.

We’ll also add the optional argumentÂ `utc=True`Â so that our datetime data has the UTC timezone attached to it. This is important, since we’ll need to convert it to a different timezone in the next step.

We’ll then runÂ `df.dtypes`Â again just to confirm that this has worked as expected.

``````df['Start Time'] = pd.to_datetime(df['Start Time'], utc=True)
df.dtypes``````
``````Start Time    datetime64[ns, UTC]
Duration                   object
Title                      object
dtype: object``````

Now we’ve got that column in the correct format, it’s time to change the timezone so that when we do our analysis, we’ll see everything in local time.

We can convert datetimes to any timezone using theÂ `.tz_convert()`Â and passing it an argument with the string for the timezone we want to convert to. In this case, that’sÂ `'US/Eastern'`. To find your specific timezone,Â here’s a handy reference of TZ timezone options.

The tricky bit here is that we can only useÂ `.tz_convert()`Â on a DatetimeIndex, so we need to set ourÂ `Start Time`Â column as the index usingÂ `set_index()`Â before we perform the conversion.

In this tutorial, we’ll then useÂ `reset_index()`Â to turn it back into a regular column afterwards. Depending on your preference and goals, this may not be necessary, but for the purposes of simplicity here, we’ll try to do our analysis with all of our data in columns rather than having some of it as the index.

Putting all of that together looks like this:

``````# change the Start Time column into the dataframe's index
df = df.set_index('Start Time')

# convert from UTC timezone to eastern time
df.index = df.index.tz_convert('US/Eastern')

# reset the index so that Start Time becomes a column again
df = df.reset_index()

#double-check that it worked
Start Time Duration Title
0 2020-10-28 23:27:48-04:00 0:00:02 The Office (U.S.): Season 7: Ultimatum (Episod…

We can see this is correct because the previous first row in our dataset had aÂ `Start Time`Â ofÂ `2020-10-29 03:27:48`. During Daylight Savings Time, the U.S. Eastern time zone is four hours behind UTC, so we can see that our conversion has happened correctly!

Now, let’s deal with ourÂ `Duration`Â column. This is, as the name suggests, a duration â€” a measure of aÂ lengthÂ of time. So, rather than converting it to a datetime, we need to convert it to a timedelta, which is a measure of time duration that pandas understands.

This is very similar to what we did when converting theÂ `Start Time`Â column. We’ll just need to useÂ `pd.to_timedelta()`Â and pass it the column we want to convert as an argument.

Once again, we’ll useÂ `df.dtypes`Â to quickly check our work.

``````df['Duration'] = pd.to_timedelta(df['Duration'])
df.dtypes``````
``````Start Time    datetime64[ns, US/Eastern]
Duration                 timedelta64[ns]
Title                             object
dtype: object``````

Perfect! But we’ve got one more data preparation task to handle: filtering thatÂ `Title`Â column so that we can analyze only views ofÂ The Office.

### Filtering Strings by Substring in pandas Using str.contains

There are many ways we could approach filteringÂ The OfficeÂ views. For our purposes here, though, we’re going to create a new dataframe calledÂ `office`Â and populate it only with rows where theÂ `Title`Â column containsÂ `'The Office (U.S.)'`.

We can do this usingÂ `str.contains()`, giving it two arguments:

• `'The Office (U.S.)'`, which is the substring we’re using to pick out only episodes ofÂ The Office.
• `regex=False`, which tells the function that the previous argument is a string and not a regular expression.

Here’s what it looks like in practice:

``````# create a new dataframe called office that that takes from df
# only the rows in which the Title column contains 'The Office (U.S.)'
office = df[df['Title'].str.contains('The Office (U.S.)', regex=False)]``````

Once we’ve done this, there are a few ways we could double-check our work. For example, we could useÂ `office.sample(20)`Â to inspect a random ten rows of our newÂ `office`Â dataframe. If all twenty rows contained Office episodes, we could be pretty confident things worked as expected.

For the purposes of preserving a little privacy in this tutorial, though, I’ll runÂ `office.shape`Â to check the size of the new dataframe. Since this dataframe should containÂ onlyÂ my views ofÂ The Office, we should expect it to have significantly fewer rows than the 27,000+ rowÂ `df`Â dataset.

``office.shape``
``(5479, 3)``

### Filtering Out Short Durations Using Timedelta

Before we really dig in and analyze, we should probably take one final step. We noticed in our data exploration that when something like an episode preview auto-plays on the homepage, it counts as a view in our data.

However, watching two seconds of a trailer as you scroll past isn’t the same as actually watching an episode! So let’s filter ourÂ `office`Â dataframe down a little bit further by limiting it to only rows where theÂ `Duration`Â value is greater than one minute. This should effectively count the watchtime for partially watched episodes, while filtering out those short, unavoidable “preview” views.

Again,Â `office.head()`Â orÂ `office.sample()`Â would be good ways to check our work here, but to maintain some semblance of privacy, I’ll again useÂ `df.shape`Â just to confirm that some rows were removed from the dataframe.

``````office = office[(office['Duration'] > '0 days 00:01:00')]
office.shape``````
``(5005, 3)``

That looks good, so let’s move on to the fun stuff!

## Analyzing the Data

### How much time have I spent watchingÂ The Office?

First, let’s answer the big question: How much time have I spent watchingÂ The Office?

Since we’ve already got ourÂ `Duration`Â column in a format that pandas can compute, answering this question is quite straightforward. We can useÂ `.sum()`Â to add up the total duration:

``office['Duration'].sum()``
``Timedelta('58 days 14:03:33')``

So, I’ve spent a total of 58 days, 14 hours, 3 minutes and 33 seconds watchingÂ The OfficeÂ on Netflix. That is . . . a lot.

In my defense, that’s over the course of a decade, and a good percentage of that time wasn’t spent actively watching! When I’m doing brain-off work, working out, playing old video games, etc., I’ll often turnÂ The OfficeÂ on as a kind of background noise that I can zone in and out of. I also used to use it as a kind of white noise while falling asleep.

But we’re not here to make excuses for my terrible lifestyle choices! Now that we’ve answered the big question, let’s dig a little deeper into myÂ The Office-viewing habits:

### When do I watchÂ The Office?

Let’s answer this question in two different ways:

• On which days of the week have I watched the mostÂ OfficeÂ episodes?
• During which hours of the day do I most often startÂ OfficeÂ episodes?

We’ll start with a little prep work that’ll make these tasks a little more straightforward: creating new columns for “weekday” and “hour”.

We can use theÂ `.dt.weekday`Â andÂ `.dt.hour`Â methods on theÂ `Start Time`Â column to do this and assign the results to new columns namedÂ `weekday`Â andÂ `hour`:

``````office['weekday'] = office['Start Time'].dt.weekday
office['hour'] = office['Start Time'].dt.hour

# check to make sure the columns were added correctly
Start Time Duration Title weekday hour
1 2020-10-28 23:09:43-04:00 0 days 00:18:04 The Office (U.S.): Season 7: Classy Christmas:… 2 23

Now, let’s do a little analysis! These results will be easier to understand visually, so we’ll start by using theÂ `%matplotlib inline`Â magic to make our charts show up in our Jupyter notebook. Then, we’ll importÂ `matplotlib`.

``````%matplotlib inline
import matplotlib``````

Now, let’s plot a chart of my viewing habits by day of the week. To do this, we’ll need to work through a few steps:

• Tell pandas the order we want to chart the days in usingÂ `pd.Categorical`Â â€” by default, it will plot them in descending order based on the number of episodes watched on each day, but when looking at a graph, it’ll be more intuitive to see the data in Monday-Sunday order.
• Count the number of episodes I viewed on each day in total
• Sort and plot the data

(There are also many other ways we could approach analyzing and visualizing this data, of course.)

Let’s see how it looks step by step:

``````# set our categorical and define the order so the days are plotted Monday-Sunday
office['weekday'] = pd.Categorical(office['weekday'], categories=
[0,1,2,3,4,5,6],
ordered=True)

# create office_by_day and count the rows for each weekday, assigning the result to that variable
office_by_day = office['weekday'].value_counts()

# sort the index using our categorical, so that Monday (0) is first, Tuesday (1) is second, etc.
office_by_day = office_by_day.sort_index()

# optional: update the font size to make it a bit larger and easier to read
matplotlib.rcParams.update({'font.size': 22})

# plot office_by_day as a bar chart with the listed size and title
````office_by_day.plot(kind='bar', figsize=(20,10), title='Office Episodes Watched by Day')````

As we can see, I’ve actually tended to watch The Office more during the week than on weekends. This makes sense based on my habits, since it’s often background noise during evening work, workouts, etc.

Now, let’s take a look at the same data by hour. The process here is very similar to what we just did above:

``````# set our categorical and define the order so the hours are plotted 0-23
office['hour'] = pd.Categorical(office['hour'], categories=
[0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,23],
ordered=True)

# create office_by_hour and count the rows for each hour, assigning the result to that variable
office_by_hour = office['hour'].value_counts()

# sort the index using our categorical, so that midnight (0) is first, 1 a.m. (1) is second, etc.
office_by_hour = office_by_hour.sort_index()

# plot office_by_hour as a bar chart with the listed size and title
````office_by_hour.plot(kind='bar', figsize=(20,10), title='Office Episodes Watched by Hour')````

From the data, it looks like 12 a.m. and 1 a.m. were the hours during which I most often started episodes ofÂ The Office. This is due to my (unhealthy) habit of using the show as white noise while going to sleep â€” many of these episodes probably auto-played while I was already asleep!

Outside of that, it’s no surprise to see that most of my viewing happened during the evenings.

(Note: This data actually may not reflect my real habits very well, because I lived in China for a significant portion of my Netflix account ownership. We didn’t account for that in this tutorial because it’s a unique situation that won’t apply for most people. If you’ve spent significant time in different timezones during your Netflix usage, then you may need to do some additional date filtering and timezone conversion in the data cleaning stage before analysis.)