1. Clone or download the repository

Clone the repository from its source, or download and extract the archive.
git clone https://github.com/ntvm/Deltarune-Complete-Transcript-Cleaned
cd Deltarune-Complete-Transcript-Cleaned
The repository contains two main data directories:
data/
  chap1_dataset.jsonl
  chap2_dataset.jsonl
  chap3_dataset.jsonl
  chap4_dataset.jsonl
  chatml/
    deltarune_story_chatml.jsonl
parquet/
  chap1_dataset.parquet
  chap2_dataset.parquet
  chap3_dataset.parquet
  chap4_dataset.parquet
  full_chapters_dataset.parquet
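After cloning, a quick check can confirm that the files in the listing above are present. The following is a minimal sketch that assumes the layout shown here and is run from the repository root; `EXPECTED_FILES` and `missing_files` are hypothetical names used for illustration, not part of the repository.

```python
from pathlib import Path

# Relative paths copied from the directory listing above.
EXPECTED_FILES = [
    "data/chap1_dataset.jsonl",
    "data/chap2_dataset.jsonl",
    "data/chap3_dataset.jsonl",
    "data/chap4_dataset.jsonl",
    "data/chatml/deltarune_story_chatml.jsonl",
    "parquet/chap1_dataset.parquet",
    "parquet/chap2_dataset.parquet",
    "parquet/chap3_dataset.parquet",
    "parquet/chap4_dataset.parquet",
    "parquet/full_chapters_dataset.parquet",
]


def missing_files(root="."):
    """Return the expected relative paths that do not exist under root."""
    root = Path(root)
    return [rel for rel in EXPECTED_FILES if not (root / rel).is_file()]


if __name__ == "__main__":
    missing = missing_files(".")
    print("All expected files found." if not missing else f"Missing: {missing}")
```

An empty result means the clone or archive extraction produced every file the later steps rely on.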
2. Install dependencies

The dataset examples use pandas and pyarrow. Install them with your package manager of choice.
pip install pandas pyarrow
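To confirm the installation worked in the interpreter you plan to use, you can check that both packages are importable. This is a small sketch using only the standard library; `missing_packages` is a hypothetical helper name.

```python
from importlib.util import find_spec


def missing_packages(names):
    """Return the top-level package names that cannot be found for import."""
    return [name for name in names if find_spec(name) is None]


if __name__ == "__main__":
    missing = missing_packages(["pandas", "pyarrow"])
    if missing:
        print("Not installed:", ", ".join(missing))
    else:
        print("pandas and pyarrow are available.")
```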
3. Load a chapter from JSONL

Each chapter is available as a JSONL file with one record per line. Use pandas.read_json with lines=True to load it into a DataFrame.
load_jsonl.py
import pandas as pd

# Load a single chapter
df = pd.read_json('data/chap1_dataset.jsonl', lines=True)

print(df.shape)        # (1068, 3)
print(df.columns.tolist())  # ['context', 'speaker', 'text']
print(df.head())
Example output:
                 context   speaker          text
0  Scene: Device Contact  Narrator  ARE YOU THERE?
1  Scene: Device Contact  Narrator  ARE WE CONNECTED?
2  Scene: Device Contact  Narrator  ...
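"One record per line" means every line is a standalone JSON object, so a file can also be inspected without pandas. The sketch below uses only the standard library and assumes each record carries the `context`, `speaker`, and `text` keys listed above; `iter_jsonl` is a hypothetical helper name.

```python
import json


def iter_jsonl(path):
    """Yield one dict per non-empty line of a JSONL file."""
    with open(path, encoding="utf-8") as handle:
        for line in handle:
            line = line.strip()
            if line:  # skip blank lines, such as a trailing newline at end of file
                yield json.loads(line)


# Usage, run from the repository root:
# for record in iter_jsonl("data/chap1_dataset.jsonl"):
#     print(record["speaker"], "|", record["context"], "|", record["text"])
```

Because the function is a generator, it reads one line at a time rather than holding the whole file in memory.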
4. Load Parquet with pandas

Parquet is a columnar binary format, so it typically loads faster than JSONL, which pandas must parse line by line as text. Use the per-chapter files or the combined file.
load_parquet.py
import pandas as pd

# Load a single chapter as Parquet
df_ch1 = pd.read_parquet('parquet/chap1_dataset.parquet')

# Load all chapters at once
df_all = pd.read_parquet('parquet/full_chapters_dataset.parquet')

print(f"Total records: {len(df_all)}")
print(df_all['speaker'].value_counts().head(10))
Use full_chapters_dataset.parquet whenever you need to query across chapters — for example, to find every line spoken by a character regardless of chapter, or to analyze narrator patterns across the full game.
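As one illustration of a cross-chapter query, the sketch below counts how many lines a speaker has in each scene. It assumes the combined file has the same `context`, `speaker`, and `text` columns as the per-chapter files, which this page does not state explicitly; `lines_per_scene` is a hypothetical helper name.

```python
import pandas as pd


def lines_per_scene(df, speaker):
    """Count how many lines `speaker` has in each scene, most lines first."""
    spoken = df[df["speaker"] == speaker]
    return spoken.groupby("context").size().sort_values(ascending=False)


# Usage, assuming the combined file shares the per-chapter column names:
# df_all = pd.read_parquet("parquet/full_chapters_dataset.parquet")
# print(lines_per_scene(df_all, "Narrator").head(10))
```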
5. Filter by chapter, speaker, and scene

All three fields (context, speaker, and text) support standard pandas filtering. To restrict results to one chapter, load that chapter's file, as the final example does.
filter_examples.py
import pandas as pd

df = pd.read_parquet('parquet/full_chapters_dataset.parquet')

# Filter by speaker
susie_lines = df[df['speaker'] == 'Susie']
print(f"Susie has {len(susie_lines)} lines")

# Filter narrator lines only
narrator = df[df['speaker'] == 'Narrator']

# Filter by scene (context field)
cyber_world = df[df['context'] == 'Scene: Cyber World']

# Combine filters: Ralsei lines in any scene whose name contains "Castle"
# na=False treats rows with a missing context as non-matching
ralsei_castle = df[
    (df['speaker'] == 'Ralsei') &
    (df['context'].str.contains('Castle', case=False, na=False))
]

# Load a specific chapter by file
ch2 = pd.read_parquet('parquet/chap2_dataset.parquet')
ch2_player_choices = ch2[ch2['speaker'] == 'Player']

print(ch2_player_choices['text'].tolist())
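
The filters above can be wrapped in a small reusable helper. This is a sketch rather than part of the dataset tooling: `filter_lines` and its parameters are hypothetical names, and it assumes the `speaker` and `context` columns shown earlier.

```python
import pandas as pd


def filter_lines(df, speaker=None, scene_contains=None):
    """Return rows matching an exact speaker and/or a scene-name substring."""
    mask = pd.Series(True, index=df.index, dtype=bool)
    if speaker is not None:
        mask &= df["speaker"] == speaker
    if scene_contains is not None:
        # regex=False matches the text literally; na=False skips missing contexts
        mask &= df["context"].str.contains(
            scene_contains, case=False, na=False, regex=False
        )
    return df[mask]


# Usage:
# df = pd.read_parquet("parquet/full_chapters_dataset.parquet")
# ralsei_castle = filter_lines(df, speaker="Ralsei", scene_contains="Castle")
```

Passing neither argument returns the DataFrame unfiltered, which keeps the helper safe to call with optional user input.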

ChatML format

The data/chatml/deltarune_story_chatml.jsonl file contains the same transcript data pre-formatted as ChatML message sequences, suitable for supervised fine-tuning of instruction-following models. Each record holds a messages list with three roles (system, user, and assistant), as in this example record:
chatml example
{
  "messages": [
    {
      "role": "system",
      "content": "You are the Deltarune Script Archive. You provide exact, chronological scene transcripts including dialogues, actions, and pauses."
    },
    {
      "role": "user",
      "content": "Provide the transcript for Scene: Obj Krisroom.\nContext: This scene occurs after 'Device Contact' and before 'Obj Carcutscene'."
    },
    {
      "role": "assistant",
      "content": "Toriel: \"Kris...!\"\nToriel: \"Wake up!\"\nToriel: \"KRIS...!\"\nToriel: \"Kris, if you do not wake up, we will be late for school!\"\nToriel: \"I will wait outside for you, alright?\""
    }
  ]
}
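In the example record, the assistant content is a sequence of `Speaker: "text"` lines joined by newlines. The sketch below shows how rows of (speaker, text) pairs could be rendered into that shape. It is inferred from this single example only; the script that actually produced the file is not shown here and may treat narration or actions differently. `render_scene` is a hypothetical name.

```python
def render_scene(rows):
    """Join (speaker, text) pairs into one string, one quoted line per pair."""
    return "\n".join(f'{speaker}: "{text}"' for speaker, text in rows)


rows = [
    ("Toriel", "Kris...!"),
    ("Toriel", "Wake up!"),
]
print(render_scene(rows))
# Toriel: "Kris...!"
# Toriel: "Wake up!"
```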
Load it the same way as any other JSONL file:
load_chatml.py
import pandas as pd

chatml = pd.read_json('data/chatml/deltarune_story_chatml.jsonl', lines=True)
print(chatml.head())
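Each loaded record has a `messages` list, so it is often convenient to split it into one string per role before passing it to a training pipeline. The sketch below assumes every record follows the system, user, assistant order of the example above, which is shown here for one record only; `split_roles` and `EXPECTED_ROLES` are hypothetical names.

```python
import json

EXPECTED_ROLES = ["system", "user", "assistant"]


def split_roles(record):
    """Map each role to its content, rejecting records with another role order."""
    roles = [message["role"] for message in record["messages"]]
    if roles != EXPECTED_ROLES:
        raise ValueError(f"unexpected role sequence: {roles}")
    return {message["role"]: message["content"] for message in record["messages"]}


# A shortened, illustrative record in the same shape as the example above.
line = (
    '{"messages": ['
    '{"role": "system", "content": "You are the Deltarune Script Archive."}, '
    '{"role": "user", "content": "Provide the transcript for Scene: Obj Krisroom."}, '
    '{"role": "assistant", "content": "Toriel: \\"Kris...!\\""}'
    "]}"
)
parts = split_roles(json.loads(line))
print(parts["assistant"])
# Toriel: "Kris...!"
```

Raising on an unexpected role order surfaces malformed records early instead of silently mislabeling prompts and targets.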
See LLM fine-tuning for guidance on using this file with training frameworks.