Apache Parquet is a columnar binary storage format optimized for analytical workloads. Compared to JSONL, Parquet files load faster, use less disk space due to built-in compression, and allow engines to skip reading columns or row groups that are not needed for a query.
Parquet files for this dataset are in the parquet/ directory.
## Available files

| File | Contents |
|---|---|
| `parquet/chap1_dataset.parquet` | Chapter 1 only |
| `parquet/chap2_dataset.parquet` | Chapter 2 only |
| `parquet/chap3_dataset.parquet` | Chapter 3 only |
| `parquet/chap4_dataset.parquet` | Chapter 4 only |
| `parquet/full_chapters_dataset.parquet` | All chapters combined |
Use `full_chapters_dataset.parquet` for any cross-chapter analysis. It is generated from all four per-chapter JSONL files and is the most convenient starting point for most queries.
## Schema

All Parquet files share the same three-column schema:

| Column | Type | Description |
|---|---|---|
| `context` | string | Scene identifier, prefixed with `Scene: ` |
| `speaker` | string | Character name or `Narrator` |
| `text` | string | Exact line of dialogue or narration |
## Loading with pandas

```python
import pandas as pd

# Load a single chapter
df = pd.read_parquet("parquet/chap2_dataset.parquet")
print(df.head())
print(f"Records: {len(df)}")

# Load the combined file
df_all = pd.read_parquet("parquet/full_chapters_dataset.parquet")
print(f"Total records: {len(df_all)}")
```
## Loading with pyarrow

```python
import pyarrow.parquet as pq

# Read as an Arrow table (efficient for large files)
table = pq.read_table("parquet/full_chapters_dataset.parquet")
print(table.schema)
print(table.num_rows)

# Convert to pandas for filtering
df = table.to_pandas()
```
## Query examples

### Filter by speaker

```python
import pandas as pd

df = pd.read_parquet("parquet/full_chapters_dataset.parquet")

# All Ralsei lines across all chapters
ralsei = df[df["speaker"] == "Ralsei"]
print(f"Ralsei has {len(ralsei)} lines")
print(ralsei["text"].head(5).tolist())
```
### Filter by scene

```python
import pandas as pd

df = pd.read_parquet("parquet/full_chapters_dataset.parquet")

# All lines in a specific scene
scene = df[df["context"] == "Scene: Obj Krisroom"]
for _, row in scene.iterrows():
    print(f"{row['speaker']}: {row['text']}")
```
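Scene identifiers vary by chapter, so it can help to list the distinct `context` values before filtering. A sketch (shown with an inline stand-in frame so it runs anywhere; with the real data, load `parquet/full_chapters_dataset.parquet` instead):

```python
import pandas as pd

# Stand-in for pd.read_parquet("parquet/full_chapters_dataset.parquet");
# "Scene: Example" and these rows are invented for illustration.
df = pd.DataFrame({
    "context": ["Scene: Obj Krisroom", "Scene: Obj Krisroom", "Scene: Example"],
    "speaker": ["Narrator", "Susie", "Ralsei"],
    "text": ["(...)", "Hey.", "Welcome!"],
})

# Distinct scenes, and how many lines each contains
print(df["context"].nunique())                 # 2
print(df["context"].value_counts().to_dict())
```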
### Count records per chapter (using per-chapter files)

```python
import glob

import pandas as pd

for path in sorted(glob.glob("parquet/chap*_dataset.parquet")):
    df = pd.read_parquet(path)
    chapter = path.split("/")[-1].replace("_dataset.parquet", "")
    print(f"{chapter}: {len(df)} records")
```
### Top speakers across all chapters

```python
import pandas as pd

df = pd.read_parquet("parquet/full_chapters_dataset.parquet")
counts = df["speaker"].value_counts()
print(counts.head(10))
```
## Regenerating Parquet files

The `parquet.py` script in the repository root regenerates the Parquet files from the source JSONL files:
```python
import glob
import os

import pandas as pd


def make_parquets():
    jsonl_files = glob.glob('chap*_dataset.jsonl')
    if not jsonl_files:
        print("[-] No chap*_dataset.jsonl files found in the current directory")
        return
    all_dataframes = []
    required_columns = ['context', 'speaker', 'text']
    for file_name in jsonl_files:
        try:
            df = pd.read_json(file_name, lines=True)
            df = df[required_columns]
            output_name = file_name.replace('.jsonl', '.parquet')
            df.to_parquet(output_name, index=False, engine='pyarrow')
            print(f"[+] Created: {output_name} ({len(df)} rows)")
            all_dataframes.append(df)
        except KeyError:
            print(f"[!] Error in {file_name}: missing required columns {required_columns}")
        except Exception as e:
            print(f"[!] Unable to process {file_name}: {e}")
    if all_dataframes:
        full_df = pd.concat(all_dataframes, ignore_index=True)
        full_df.to_parquet('full_chapters_dataset.parquet', index=False, engine='pyarrow')
        print(f"\n[OK] Main file ready: full_chapters_dataset.parquet ({len(full_df)} rows)")


if __name__ == "__main__":
    make_parquets()
```
The script uses `glob.glob('chap*_dataset.jsonl')` to find input files and writes output Parquet files to the current working directory. Run it from the directory containing the JSONL files (`data/`). Output Parquet files will appear alongside the JSONL files, not in the `parquet/` directory.