All transcript records share three fields: context (scene identifier), speaker, and text. The examples below use the combined Parquet file so queries span the full dataset, but the same syntax applies to any per-chapter file.
import pandas as pd
df = pd.read_parquet("parquet/full_chapters_dataset.parquet")
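As a sanity check after loading, you can confirm the three-column schema. Here is a minimal sketch on a hand-built frame with the same columns (the rows are hypothetical; real data comes from read_parquet):

```python
import pandas as pd

# Toy frame mirroring the transcript schema (made-up sample rows)
toy = pd.DataFrame({
    "context": ["Scene: Intro", "Scene: Intro"],
    "speaker": ["Narrator", "Susie"],
    "text": ["It was a dark night.", "Hey!"],
})

print(list(toy.columns))  # ['context', 'speaker', 'text']
print(len(toy))           # total number of transcript lines
```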
Filter by speaker
# All Susie lines
susie = df[df["speaker"] == "Susie"]
# All narration
narrator = df[df["speaker"] == "Narrator"]
# All player choices
player = df[df["speaker"] == "Player"]
Filter by scene
The context field stores the full scene label, including the "Scene: " prefix.
# All lines from a single scene
classscene = df[df["context"] == "Scene: Obj Classscene"]
# Partial match: all scenes whose name contains a substring
school_scenes = df[df["context"].str.contains("School", case=False)]
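One caveat: str.contains treats its pattern as a regular expression by default, so pass regex=False when the scene name should be matched literally (for example, if it contains regex metacharacters). A short sketch on toy data:

```python
import pandas as pd

# Toy frame with illustrative scene labels
toy = pd.DataFrame({
    "context": ["Scene: School Hallway", "Scene: Obj Classscene", "Scene: Dark World"],
    "speaker": ["Narrator", "Narrator", "Narrator"],
    "text": ["a", "b", "c"],
})

# Literal, case-insensitive substring match
school = toy[toy["context"].str.contains("school", case=False, regex=False)]
print(school["context"].tolist())  # ['Scene: School Hallway']
```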
Filter by chapter
For chapter-scoped analysis, the simplest approach is to load the per-chapter file directly.
import pandas as pd
chap2 = pd.read_parquet("parquet/chap2_dataset.parquet")
To filter by chapter within a single DataFrame, load each per-chapter file, tag it with a chapter column, and concatenate:
import pandas as pd
chapters = {}
for i in range(1, 5):
    path = f"parquet/chap{i}_dataset.parquet"
    chapters[i] = pd.read_parquet(path).assign(chapter=i)
df = pd.concat(chapters.values(), ignore_index=True)
# Now filter by chapter number
chap3_df = df[df["chapter"] == 3]
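With the chapter column in place, chapter and speaker filters combine with the usual boolean operators. A minimal sketch on a toy frame:

```python
import pandas as pd

# Toy frame with a chapter column, as built by the loop above (made-up rows)
toy = pd.DataFrame({
    "context": ["Scene: A", "Scene: B", "Scene: B"],
    "speaker": ["Susie", "Susie", "Narrator"],
    "text": ["x", "y", "z"],
    "chapter": [1, 3, 3],
})

# Parentheses around each condition are required with & / |
susie_ch3 = toy[(toy["chapter"] == 3) & (toy["speaker"] == "Susie")]
print(len(susie_ch3))  # 1
```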
Aggregations
Count lines per speaker
lines_per_speaker = (
    df.groupby("speaker")["text"]
    .count()
    .sort_values(ascending=False)
    .rename("line_count")
)
print(lines_per_speaker.head(10))
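The same tally can be written more compactly with value_counts, which counts and sorts in descending order in one call (toy data for illustration):

```python
import pandas as pd

# Toy speaker column for illustration
toy = pd.DataFrame({"speaker": ["Susie", "Susie", "Narrator"]})

lines_per_speaker = toy["speaker"].value_counts()
print(lines_per_speaker.head(10))  # Susie first, with 2 lines
```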
List all unique scenes
scenes = df["context"].unique()
print(f"{len(scenes)} unique scenes")
for scene in sorted(scenes):
    print(scene)
List all speakers in a chapter
chap1 = pd.read_parquet("parquet/chap1_dataset.parquet")
print(chap1["speaker"].unique())
Count scenes per chapter
This assumes the chapter column added in the concatenation step above.
scenes_per_chapter = (
    df.groupby("chapter")["context"]
    .nunique()
    .rename("unique_scenes")
)
print(scenes_per_chapter)
SQL-style alternatives
import duckdb
# Query the Parquet file directly; no pandas load needed
con = duckdb.connect()
susie_lines = con.execute("""
    SELECT context, text
    FROM 'parquet/full_chapters_dataset.parquet'
    WHERE speaker = 'Susie'
""").df()
lines_per_speaker = con.execute("""
    SELECT speaker, COUNT(*) AS line_count
    FROM 'parquet/full_chapters_dataset.parquet'
    GROUP BY speaker
    ORDER BY line_count DESC
""").df()
DuckDB can query Parquet files on disk without loading them into memory first, which is useful when working with large combined files.