I have kids. Our family has had a family Netflix account since 2017. My daughter decided to jump into 18 Seasons of Grey’s Anatomy last year, and has been binging it ever since.
So when I put together a post last week talking about using SQL to analyze Spotify streaming data, my brain quietly whispered to itself, I bet Netflix data would be more fun.
So let’s do it. Let’s move this data into a database, and use SQL to figure out how much time she’s invested in Dr. Meredith Grey, and her friends.
Step 0: Download your Netflix data.
You request it from your online account. Do that in your browser, not on your Netflix app.
You’ll have to confirm the request. They say it will take up to 30 days. It took only 1 day for me.
Then you’ll get an email saying it’s ready. You’ll following the link, provide your password (again), and download a Zip file.
We’re going to import that ‘ViewingActivity.csv’ file as a new table in an Oracle Database, so you’ll need one of those. I’ve talked about that before, a lot.
Step 1: Import your data
I’m going to use my Always Free Oracle Autonomous Database, and the Database Actions (SQL Developer Web!) interface in my browser to do it.
This will open a wizard, and on the 2nd page, you can define the columns. I’ve left the data type defaults, but have changed the precision of the text fields.
Note that the wizard has recognized that ‘START_TIME’ is a date, and it’s recommending we bring it in as a TIMESTAMP, and has even recognized the proper format mask to read those values in.
We can click to the end of the wizard, and hit ‘Finish.’
When it’s done processing my 2.5MB of data, I can see it’s imported more than 14,000 rows. Is that sad? I don’t know, the four of us watch a TON of online content.
I simply added this data as a table called ‘NETFLIX’ –
Step 2: Start doing SQL
Again, kudos to Netflix for including a ‘data dictionary’ in their personal data package. It’s quite nice –
- Profile – my daughter
- Duration – how long did she spend watching it
- Title – anything Grey’s Anatomy
- Supplemental Video Type – just episodes, not teasers, trailers, bonus content, etc
Starting Simpler, what have I seen recently?
I know MY data, so let’s just make sure this CSV dump is accurate. So I’m going to ask what I’ve watched recently.
SELECT START_TIME, DURATION, TITLE FROM NETFLIX WHERE PROFILE_NAME = 'Jefferson' AND SUPPLEMENTAL_VIDEO_TYPE IS NULL ORDER BY START_TIME DESC
DURATION was brought in as a VARCHAR2, or a string. So doing MATHs on this column later will get a bit tricksy. However, the results seem ‘right’ to me. I gave up on re-watching the original Fletch about 30 minutes into it. I just wasn’t feeling it. So yeah, it looks good.
Let’s find my daughter’s ‘raw’ data for Grey’s
SELECT START_TIME, DURATION, TITLE FROM NETFLIX WHERE PROFILE_NAME = 'Daughter' AND SUPPLEMENTAL_VIDEO_TYPE IS NULL AND TITLE LIKE 'Grey''s%';
The only tricky thing here is the TITLE predicate clause. Grey’s Anatomy has an apostrophe, and that’s also the character we use to enclose strings in Oracle SQL. So the ‘trick’ is you have to escape it. One way is to just use an extra quote.
Tip: Perhaps better way to deal with this is to use our built-in function q[].
Docs Link.
LiveLabs Tutorial.
SELECT START_TIME, DURATION, TITLE FROM NETFLIX WHERE PROFILE_NAME = 'Daughter' AND SUPPLEMENTAL_VIDEO_TYPE IS NULL AND TITLE LIKE q'[Grey's Anatomy%]');
OK, now how do I sum up the duration?
That string is representing what Oracle basically considers an INTERVAL YEAR TO SECOND piece of data. Except in this case it’s only HOURS:MINUTES:SECONDS.
So what I’m going to do is create a VIRTUAL COLUMN that uses this data to create an actual INTERVAL interpretation of this data.
We’re not physically storing new information here. VIRTUAL COLUMNs in an Oracle table allow for that data to be derived from other values in the row.
In this case we have
DURATION 0:13:3
and we want
DURATION 00 0:13:3
ALTER TABLE NETFLIX ADD ( DURATION_INTERVAL INTERVAL DAY TO SECOND generated always AS ( TO_DSINTERVAL('00 ' || duration) ) virtual );
The TYPE of the new VIRTUAL COLUMN is ‘INTERVAL DAY TO SECOND.’ And the way it’s being derived is taking the value of the DURATION column and prefixing it with ’00 ‘
Then with that string being ‘computed’, I’m then sending that as an input to the TO_DSINTERVAL function, which returns an INTERVAL value.
Ok, so now I can just do a SUM on DURATION_INTERVAL, right?
Well, not quite. Unfortunately the database doesn’t give us a SUM() function for the INTERVAL type. So we need to do something else.
From the INTERVAL value, we’re going to extract the total number of minutes. And from that value, we’re going to total that up, and divide by ’60’ to get the total number of hours.
I asked my wife, does that sound about right? And yes, about 8 straight days of binging over the last 18 months sounded about right.
Ok, let’s look at that code, esp the the SELECT
-- Hours spent watching Grey's Anatomy SELECT SUM(EXTRACT(MINUTE FROM duration_interval))/60 time_wasted --title, duration_interval FROM NETFLIX WHERE PROFILE_NAME = 'Daughter' AND SUPPLEMENTAL_VIDEO_TYPE IS NULL AND TITLE LIKE 'Grey''s%';
It’s always easier to break the nested function calls into the various pieces, so let’s do that.
- extract(minute from duration_interval) – pull out the minutes portion of time
- sum sum(extract…) – add this number up
- /60 – take the sum and divide to get ‘hours’
Wait a second, those pesky seconds really add up, right?
Now, simply writing this post and the description has forced me into a different level of thinking. What about the SECONDS? Over 666 viewings, surely the fractional number of minutes will be significant?
Let’s find out.
So if I ignored the seconds, I’d be off ~2.5% in my answer. Is that significant to worry about? Probably not. But you need to KNOW what you’re asking, when you’re writing the SQL. Assumptions DIG HUGE HOLES that you might have to crawl out of later.
So to answer my original question, my daughter has spent ~ 205 hours watching Grey’s Anatomy since May 22nd, 2022.
Kudos to my fellow product managers here on the Database team at Oracle. They helped me a bit with this SQL, especially Michelle with the Extract function().
Connor reminds me that this gets much easier if you take advantage of SQL Macros (introduced in 21c) and even EASIER EASIER once 23c is released.
Disclaimer: I’ve probably spent at least 50 hours watching this show with her. Enough to know the writing got EXTREMELY lazy when it came to killing off the main cast of characters.
Disclaimer 2: if you’re going to leave a nasty comment about my kid’s taste in pop culture entertainment or my parenting skills, just don’t. Feel free to bash my data modeling and SQL skills, as always.
DDL for my TABLE
CREATE TABLE NETFLIX ( PROFILE_NAME VARCHAR2(10), START_TIME TIMESTAMP (6), DURATION VARCHAR2(20), ATTRIBUTES VARCHAR2(256), TITLE VARCHAR2(256), SUPPLEMENTAL_VIDEO_TYPE VARCHAR2(256), DEVICE_TYPE VARCHAR2(256), BOOKMARK VARCHAR2(20), LATEST_BOOKMARK VARCHAR2(20), COUNTRY VARCHAR2(256), DURATION_INTERVAL INTERVAL DAY (2) TO SECOND (6) GENERATED ALWAYS AS (TO_DSINTERVAL('00 '||"DURATION")) VIRTUAL ) ;
Working with Oracle SQL and Temporal Data, The Movie
Watch this, it’s great. Chris is our SQL evangelist, and he’s been working with Connor to answer these types of questions for years.
4 Comments
Nice 🙂
Would suggest a warning that this works because each episode apparently is less than an hour?
“extract(minute from…” results in 30 minutes if the interval is an hour-and-a-half, *not* 90 minutes. From your text “From the INTERVAL value, we’re going to extract the total number of minutes” a reader might think that extract function gives you *total* minutes as in “hours*60 + minutes”, which it does not.
You can try it like this:
with d (duration_interval) as (
select interval ’00:30:00′ hour to second from dual union all
select interval ’01:30:00′ hour to second from dual union all
select interval ’02:30:00′ hour to second from dual
)
select
extract(hour from duration_interval) + extract(minute from duration_interval)/60 as hours
from d;
Cheerio
/Kim
Yeah that was in my head as I wrote it and didn’t put that in writing – since it was based on a TV show that never exceeded 40-45 minutes, I could assume the window was always an hour or less.
What I love about blogs is that smarter people than me can come in and add their insight and experience, not to mention, probably better code!
Hi Jeff – what a very excellent post, thank you! I didn’t know it was possible to get access to Netflix data in this way. Thank you very much. Also – my daughter has watched a lot of Grey’s Anatomy, and I have also watched a fair amount of it with her. Your instructions are very clear to follow, and I have enjoyed reading your blog for a number of years, just have never posted before. Thanks again.
Jim, you just made my week.
Cheers!
PS Hand me the 10 blade.