All transcript records share three fields: context (scene identifier), speaker, and text. The examples below use the combined Parquet file so queries span the full dataset, but the same syntax applies to any per-chapter file.
import pandas as pd

df = pd.read_parquet("parquet/full_chapters_dataset.parquet")
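As a quick sanity check, the three-field schema can be exercised on a tiny hand-built frame (the rows below are invented for illustration; they are not from the dataset):

```python
import pandas as pd

# Invented sample rows mirroring the dataset's three fields
sample = pd.DataFrame({
    "context": ["Scene: Obj Classscene"],
    "speaker": ["Susie"],
    "text": ["..."],
})

print(sample.columns.tolist())  # ['context', 'speaker', 'text']
```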

Filter by speaker

# All Susie lines
susie = df[df["speaker"] == "Susie"]

# All narration
narrator = df[df["speaker"] == "Narrator"]

# All player choices
player = df[df["speaker"] == "Player"]

Filter by scene

The context field stores the full scene label, including the "Scene: " prefix.
# All lines from a single scene
classscene = df[df["context"] == "Scene: Obj Classscene"]

# Partial match — all scenes whose name contains a substring
school_scenes = df[df["context"].str.contains("School", case=False)]
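Speaker and scene filters compose with pandas boolean operators, so a single expression can narrow to one speaker within one scene. A minimal sketch on invented rows (note that each condition needs its own parentheses):

```python
import pandas as pd

# Invented sample rows for illustration only
sample = pd.DataFrame({
    "context": ["Scene: Obj Classscene", "Scene: Obj Classscene", "Scene: Dark World"],
    "speaker": ["Susie", "Narrator", "Susie"],
    "text": ["...", "...", "..."],
})

# Combine masks with & (and) or | (or); parenthesize each condition
susie_in_class = sample[
    (sample["speaker"] == "Susie") & (sample["context"].str.contains("Class"))
]
```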

Filter by chapter

For chapter-scoped analysis, the easiest approach is to load the per-chapter file directly.
import pandas as pd

chap2 = pd.read_parquet("parquet/chap2_dataset.parquet")
To filter by chapter across the whole dataset, load each per-chapter file, tag it with a chapter column, and then concatenate:
import pandas as pd

chapters = {}
for i in range(1, 5):  # chapters 1-4
    path = f"parquet/chap{i}_dataset.parquet"
    chapters[i] = pd.read_parquet(path).assign(chapter=i)

df = pd.concat(chapters.values(), ignore_index=True)

# Now filter by chapter number
chap3_df = df[df["chapter"] == 3]

Aggregations

Count lines per speaker

lines_per_speaker = (
    df.groupby("speaker")["text"]
    .count()
    .sort_values(ascending=False)
    .rename("line_count")
)
print(lines_per_speaker.head(10))
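The same tally is available in a single call with value_counts, which sorts descending by default. Shown here on invented rows rather than the real file:

```python
import pandas as pd

# Invented sample rows for illustration only
sample = pd.DataFrame({
    "speaker": ["Susie", "Susie", "Narrator"],
    "text": ["a", "b", "c"],
})

# value_counts counts occurrences and sorts descending by default
counts = sample["speaker"].value_counts()
```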

List all unique scenes

scenes = df["context"].unique()
print(f"{len(scenes)} unique scenes")
for scene in sorted(scenes):
    print(scene)

List all speakers in a chapter

chap1 = pd.read_parquet("parquet/chap1_dataset.parquet")
print(chap1["speaker"].unique())

Count scenes per chapter

This uses the chapter column added in the concatenation step under "Filter by chapter"; the combined file alone does not carry it.

scenes_per_chapter = (
    df.groupby("chapter")["context"]
    .nunique()
    .rename("unique_scenes")
)
print(scenes_per_chapter)
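With the chapter column in place, pd.crosstab can break line counts down by speaker and chapter at once. A sketch on invented rows (the real frame would be the concatenated one built above):

```python
import pandas as pd

# Invented sample with the chapter column already attached
sample = pd.DataFrame({
    "speaker": ["Susie", "Susie", "Narrator"],
    "chapter": [1, 2, 1],
})

# Rows are speakers, columns are chapters, cells are line counts
speaker_by_chapter = pd.crosstab(sample["speaker"], sample["chapter"])
```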

SQL-style alternatives

import duckdb

# Query the Parquet file directly in SQL; no pandas load needed
con = duckdb.connect()

susie_lines = con.execute("""
    SELECT context, text
    FROM 'parquet/full_chapters_dataset.parquet'
    WHERE speaker = 'Susie'
""").df()

lines_per_speaker = con.execute("""
    SELECT speaker, COUNT(*) AS line_count
    FROM 'parquet/full_chapters_dataset.parquet'
    GROUP BY speaker
    ORDER BY line_count DESC
""").df()
DuckDB can query Parquet files on disk without loading them into memory first, which is useful when working with large combined files.