This dataset is released under the CC0 1.0 Universal public domain dedication. You are free to use, modify, and redistribute it for any purpose without attribution.
Data sources
Transcript data was sourced from video playthroughs of each chapter. Raw dialogue and narration were manually transcribed, then segmented into structured records with the assistance of Google Gemini for scene boundary detection and speaker attribution. Each record captures a single utterance or narration block alongside its scene context.
Record structure
Every record in the dataset contains exactly three fields:

| Field | Description |
|---|---|
| `context` | The scene or location where the line occurs (e.g., "Scene: Cyber World") |
| `speaker` | Who is speaking: a character name, `Narrator`, or `Player` |
| `text` | The raw dialogue, narration text, or player choice option |
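
The three-field structure above can be checked mechanically when reading a line of JSONL. The following is a minimal sketch, not part of the dataset's tooling: the helper name `parse_record` is hypothetical, it assumes all three values are stored as strings, and the sample `text` value is a placeholder rather than a quote from the dataset.

```python
import json

# The three fields documented in the table above.
REQUIRED_FIELDS = ("context", "speaker", "text")


def parse_record(line: str) -> dict:
    """Parse one JSONL line and check it has exactly the three documented fields.

    Hypothetical helper for illustration; assumes every value is a string.
    """
    record = json.loads(line)
    if not isinstance(record, dict):
        raise ValueError("each line should hold a single JSON object")
    if set(record) != set(REQUIRED_FIELDS):
        raise ValueError(f"unexpected fields: {sorted(record)}")
    for name in REQUIRED_FIELDS:
        if not isinstance(record[name], str):
            raise ValueError(f"field {name!r} must be a string")
    return record


# Illustrative line: the context comes from the table above,
# the text is a placeholder and not actual dataset content.
sample = (
    '{"context": "Scene: Cyber World", '
    '"speaker": "Narrator", '
    '"text": "(placeholder narration)"}'
)
record = parse_record(sample)
print(record["speaker"], "|", record["context"])
```

Rejecting extra or missing keys up front keeps later filtering code simple, at the cost of failing loudly if the schema ever gains a field.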
Speaker types
- Character names (Kris, Susie, Ralsei, Toriel, etc.): spoken dialogue
- `Narrator`: game narration, item descriptions, and visual descriptions
- `Player`: selectable choice options presented to the player
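
Because the `speaker` field doubles as a record-type marker, the three speaker types can be separated with a small lookup. This is a sketch under the assumption that narration and choices are labeled with the exact strings `Narrator` and `Player`; the function name `speaker_type` and the returned category labels are made up for illustration.

```python
def speaker_type(speaker: str) -> str:
    """Map a speaker value to one of the three documented speaker types.

    Assumes the exact labels "Narrator" and "Player"; any other value
    is treated as a character name.
    """
    if speaker == "Narrator":
        return "narration"
    if speaker == "Player":
        return "choice"
    return "dialogue"


# Character names fall through to the default branch.
for name in ("Susie", "Narrator", "Player"):
    print(name, "->", speaker_type(name))
```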
Available formats
The dataset is distributed in three formats to support different use cases:
- JSONL: one JSON object per line, one file per chapter (`data/chap1_dataset.jsonl` through `chap4_dataset.jsonl`)
- Parquet: columnar format for efficient querying, one file per chapter plus a combined `full_chapters_dataset.parquet` (`parquet/` directory)
- ChatML: pre-formatted for LLM fine-tuning (`data/chatml/deltarune_story_chatml.jsonl`)
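
As a rough illustration of the JSONL layout, the per-chapter files can be read with the Python standard library alone. The sketch below assumes the files sit in a local `data/` directory and follow the `chapN_dataset.jsonl` naming for chapters 1 through 4; the helper names `iter_records`, `chapter_path`, and `load_chapter` are hypothetical.

```python
import json
from pathlib import Path
from typing import Iterable, Iterator


def iter_records(lines: Iterable[str]) -> Iterator[dict]:
    """Yield one dict per non-empty JSONL line."""
    for line in lines:
        line = line.strip()
        if line:
            yield json.loads(line)


def chapter_path(chapter: int, data_dir: str = "data") -> Path:
    """Build the assumed per-chapter path, e.g. data/chap1_dataset.jsonl."""
    return Path(data_dir) / f"chap{chapter}_dataset.jsonl"


def load_chapter(path: Path) -> list[dict]:
    """Read every record from a single chapter file."""
    with open(path, encoding="utf-8") as handle:
        return list(iter_records(handle))


if __name__ == "__main__":
    # Chapters 1-4, per the file range listed above; missing files are skipped.
    for chapter in range(1, 5):
        path = chapter_path(chapter)
        if path.exists():
            print(path, len(load_chapter(path)), "records")
```

The Parquet files should be readable in a similar way with a dataframe library, for example `pandas.read_parquet`, provided a Parquet engine such as `pyarrow` is installed.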
Explore the dataset
Quickstart
Load the dataset in Python and run your first queries in minutes.
Schema reference
Full documentation for the `context`, `speaker`, and `text` fields.
Dataset overview
Record counts, chapter breakdowns, speaker distributions, and coverage notes.
LLM fine-tuning
Use the ChatML file to fine-tune a model on Deltarune narrative content.