Session 3.3: Parsing large files (15:15 - 15:45)

Having fun with IMDb data files

Download title.basics.tsv.gz and title.ratings.tsv.gz from https://datasets.imdbws.com (these datasets contain movie titles and movie ratings):

$ wget https://datasets.imdbws.com/title.basics.tsv.gz
$ wget https://datasets.imdbws.com/title.ratings.tsv.gz

If you don't have wget, you can use curl instead:

$ curl -O https://datasets.imdbws.com/title.basics.tsv.gz
$ curl -O https://datasets.imdbws.com/title.ratings.tsv.gz

Extract these files:

$ gunzip title.basics.tsv.gz
$ gunzip title.ratings.tsv.gz
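If you would rather keep the downloads compressed, Python can also read them directly. The following is a minimal sketch, not part of the original exercise; the helper name `open_tsv` is made up for illustration. It returns a text-mode file object for either a `.tsv` or a `.tsv.gz` path, so the rest of a script does not need to care which one it got.

```python
import gzip


def open_tsv(path):
    """Open a .tsv or .tsv.gz file for reading as UTF-8 text.

    Hypothetical helper: the name and behaviour are assumptions
    made for this sketch.
    """
    if path.endswith(".gz"):
        # "rt" asks gzip for decoded text lines instead of raw bytes.
        return gzip.open(path, "rt", encoding="utf-8", newline="")
    return open(path, "r", encoding="utf-8", newline="")
```

Usage would look like `with open_tsv("title.basics.tsv.gz") as f:` followed by a loop over `f`, which decompresses lazily as lines are consumed.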

Together:

We explore and discuss these files.

Our challenge:

Find all movies whose title contains the word "python".
Discuss the problems with the following solution:

"""
This script will find all movies which
contain the word "python".
"""

titles = []
with open('title.basics.tsv', 'r') as f:
    for line in f.read().splitlines():
        if not 'primaryTitle' in line:
            s = line.split('\t')
            titles.append(line.split('\t')[2])

for title in titles:
    if 'python' in title.lower():
        print(title)
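
For comparison during the discussion, here is one possible streaming variant. It is a sketch under assumptions, not the reference solution: the function name `find_titles` and the sample rows are invented for illustration, and it assumes the file has a header row with a `primaryTitle` column. It reads one row at a time instead of loading the whole file, looks the column up by name instead of by position, and lets the `csv` module consume the header instead of skipping every line that happens to contain "primaryTitle".

```python
import csv


def find_titles(lines, word):
    """Yield every primaryTitle that contains `word`, ignoring case.

    `lines` can be any iterable of text lines, such as an open file.
    """
    # QUOTE_NONE: quote characters inside titles are treated as plain text.
    reader = csv.DictReader(lines, delimiter="\t", quoting=csv.QUOTE_NONE)
    needle = word.lower()
    for row in reader:
        title = row["primaryTitle"]
        # Short or malformed rows yield None for missing columns.
        if title and needle in title.lower():
            yield title


# Invented sample rows (not real IMDb data) so the sketch runs anywhere.
sample = [
    "tconst\ttitleType\tprimaryTitle\n",
    "tt0000001\tmovie\tPython Night\n",
    "tt0000002\tmovie\tA Quiet Day\n",
    "tt0000003\tshort\tThe \"PYTHON\" Tapes\n",
]

for title in find_titles(sample, "python"):
    print(title)
```

To run it on the real data, pass an open file instead of the sample list, for example `with open('title.basics.tsv', encoding='utf-8', newline='') as f:` and then iterate over `find_titles(f, 'python')`. Because the function is a generator, memory use stays roughly constant regardless of file size. Note that, like the original script, this sketch does not restrict results to rows whose `titleType` is `movie`.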

Optional take-home exercises:

Find the 20 movies with the highest ratings (use only those with many votes).
Find the 10 most popular comedies.
Write your code so that it can, in principle, handle datasets of any size.
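
As a starting point for the first exercise, the sketch below shows one way to keep memory use bounded: stream the ratings file through `heapq.nlargest`, which holds only the best `n` candidates at any time, then make a second streaming pass over the basics file to look up titles for just those IDs. The function names, the `min_votes` threshold, and the sample rows are assumptions made for illustration; the column names `tconst`, `averageRating`, `numVotes`, and `primaryTitle` are assumed to match the file headers.

```python
import csv
import heapq


def tsv_rows(lines):
    """Iterate over tab-separated lines as dicts keyed by the header row."""
    return csv.DictReader(lines, delimiter="\t", quoting=csv.QUOTE_NONE)


def top_rated(rating_lines, n=20, min_votes=10000):
    """Return the n best (rating, votes, tconst) tuples, highest first.

    min_votes is an arbitrary cut-off for "many votes"; tune as needed.
    """
    candidates = (
        (float(r["averageRating"]), int(r["numVotes"]), r["tconst"])
        for r in tsv_rows(rating_lines)
        if int(r["numVotes"]) >= min_votes
    )
    # nlargest consumes the generator lazily and keeps only n items.
    return heapq.nlargest(n, candidates)


def attach_titles(top, basics_lines):
    """Second pass: look up titles only for the IDs that made the cut."""
    wanted = {tconst for _, _, tconst in top}
    titles = {}
    for r in tsv_rows(basics_lines):
        if r["tconst"] in wanted:
            titles[r["tconst"]] = r["primaryTitle"]
    return [(titles.get(t, "?"), rating, votes) for rating, votes, t in top]


# Invented sample rows (not real IMDb data).
ratings = [
    "tconst\taverageRating\tnumVotes\n",
    "tt0000001\t9.1\t50000\n",
    "tt0000002\t9.8\t12\n",
    "tt0000003\t7.5\t20000\n",
]
basics = [
    "tconst\ttitleType\tprimaryTitle\n",
    "tt0000001\tmovie\tFirst\n",
    "tt0000002\tmovie\tSecond\n",
    "tt0000003\tmovie\tThird\n",
]

for row in attach_titles(top_rated(ratings, n=2, min_votes=1000), basics):
    print(row)
```

This sketch does not filter by `titleType` or `genres`, so restricting the results to movies, or to comedies for the second exercise, is left open.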
