JSONL (newline-delimited JSON) stores one JSON object per line. Each line is a self-contained, valid JSON record. This makes it easy to stream large files without loading everything into memory, and to process records incrementally in scripts and pipelines. The JSONL files for this dataset are located at data/chap*_dataset.jsonl, one file per chapter.
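Because each line is an independent JSON document, a file can be parsed one record at a time with nothing beyond the standard json module. The sketch below uses an in-memory sample in place of an actual data/chap*_dataset.jsonl file, but any file-like object works the same way:

```python
import io
import json

# Stand-in for open("data/chap1_dataset.jsonl") -- any file-like object
# that yields one JSON document per line can be streamed identically.
sample = io.StringIO(
    '{"context": "Scene: Device Contact", "speaker": "Narrator", "text": "ARE YOU THERE?"}\n'
    '{"context": "Scene: Obj Krisroom", "speaker": "Toriel", "text": "Kris...!"}\n'
)

for line in sample:            # reads one line at a time; never loads the whole file
    record = json.loads(line)  # each line is a complete, valid JSON object
    print(record["speaker"], "->", record["text"])
```

This is the pattern to reach for when a chapter file is too large to hold in memory at once.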

Record structure

Each record has three fields:
Field     Type     Description
context   string   The scene identifier, prefixed with Scene:
speaker   string   The character delivering this line, or Narrator for non-dialogue text
text      string   The exact line of dialogue or narration
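As a quick sanity check, every parsed record should contain exactly these three string fields. A minimal validator sketch (the helper name and error messages are ours, not part of the dataset):

```python
import json

EXPECTED_FIELDS = {"context", "speaker", "text"}

def validate_record(line: str) -> dict:
    """Parse one JSONL line and check it matches the documented schema."""
    record = json.loads(line)
    if set(record) != EXPECTED_FIELDS:
        raise ValueError(f"unexpected fields: {sorted(record)}")
    if not all(isinstance(record[f], str) for f in EXPECTED_FIELDS):
        raise ValueError("all fields must be strings")
    return record

ok = validate_record('{"context": "Scene: Device Contact", "speaker": "Player", "text": "YES"}')
print(ok["speaker"])
```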

Example records

{"context": "Scene: Device Contact", "speaker": "Narrator", "text": "ARE YOU THERE?"}
{"context": "Scene: Device Contact", "speaker": "Narrator", "text": "ARE WE CONNECTED?"}
{"context": "Scene: Device Contact", "speaker": "Narrator", "text": "EXCELLENT."}
{"context": "Scene: Obj Krisroom", "speaker": "Toriel", "text": "Kris...!"}
{"context": "Scene: Obj Krisroom", "speaker": "Toriel", "text": "Wake up!"}
{"context": "Scene: Obj Classscene", "speaker": "Alphys", "text": "So, does everyone have a..."}
{"context": "Scene: Obj Classscene", "speaker": "Susie", "text": "... am I late?"}
{"context": "Scene: Device Contact", "speaker": "Player", "text": "YES"}
Records within each file appear in the same chronological order as in the game. Records carry no explicit ordering key, so when combining multiple files, process them in a deterministic order (for example, sorted by filename) if chronology matters.

Loading the data

import pandas as pd

# Load a single chapter
df = pd.read_json("data/chap1_dataset.jsonl", lines=True)
print(df.head())
print(f"Total records: {len(df)}")

Filtering examples

Filter by chapter

To work with a specific chapter, load only that chapter’s file.
import pandas as pd

# Chapter 3 only
df = pd.read_json("data/chap3_dataset.jsonl", lines=True)

Filter by speaker

import pandas as pd

df = pd.read_json("data/chap1_dataset.jsonl", lines=True)

# All lines spoken by Susie
susie_lines = df[df["speaker"] == "Susie"]
print(f"Susie has {len(susie_lines)} lines in Chapter 1")
print(susie_lines[["context", "text"]].head(10))

Filter by scene

import pandas as pd

df = pd.read_json("data/chap1_dataset.jsonl", lines=True)

# All lines in the classroom scene
classroom = df[df["context"] == "Scene: Obj Classscene"]
for _, row in classroom.iterrows():
    print(f"{row['speaker']}: {row['text']}")

Combine all chapters

import pandas as pd
import glob

files = sorted(glob.glob("data/chap*_dataset.jsonl"))
dfs = [pd.read_json(f, lines=True) for f in files]
df_all = pd.concat(dfs, ignore_index=True)

print(f"Total records across all chapters: {len(df_all)}")
print(df_all["speaker"].value_counts().head(10))
If you need cross-chapter analysis regularly, use the parquet/full_chapters_dataset.parquet file instead. It is pre-built from all JSONL files and loads faster for repeated queries.