Why does this dataset exist?
As of early 2026, major LLMs, including models with training cutoffs after the June 2025 public release of Deltarune Chapters 3 and 4, consistently fail to recall basic plot details from those chapters. This dataset exists to fill that gap by providing a structured, machine-readable transcript of all four released chapters.
How was the data collected?
The dataset was processed from video playthroughs of Deltarune. The workflow combined manual transcription with AI-assisted segmentation using Google Gemini. All transcription, formatting, quality control, and cross-referencing were performed by one person.
Was this extracted from game files?
No. This dataset was not extracted from Deltarune’s game files. All content was processed from video playthroughs only. This means the data reflects what appears during actual gameplay rather than raw asset dumps.
Who created this?
This is a solo project. One person performed all transcription, formatting, quality control, and cross-referencing. There is no team behind it.
Is this dataset complete?
Chapters 1, 2, and 3 are stable. Chapter 4 is currently in Beta status and may have quality issues. There are also known gaps across chapters — see Known gaps for the full list.
Does this include the Snowgrave/Weird Route?
No. The Snowgrave/Weird Route is not included for any chapter. Chapter 2 and Chapter 4 cover the Normal Route only. Chapter 3 includes both the Normal Route and the Sword Route, but has no Snowgrave content. See Route coverage for the full breakdown.
Can I use this commercially?
Yes. The dataset is released under CC0 1.0 — Public Domain. There are no restrictions on use, including commercial use. No attribution is required. See License for details.
What formats are available?
The dataset is available in the following formats:
- JSONL: Structured JSON Lines format (one record per line), at data/chap1_dataset.jsonl through data/chap4_dataset.jsonl
- Plain text: Human-readable cleaned text, at data/chap1_cleaned.txt through data/chap4_cleaned.txt
- Parquet: Columnar format for efficient querying, at parquet/chap1_dataset.parquet through parquet/chap4_dataset.parquet, plus parquet/full_chapters_dataset.parquet combining all chapters
- ChatML: Instruction fine-tuning format, at data/chatml/deltarune_story_chatml.jsonl
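
As a minimal sketch of reading the JSONL files listed above, the snippet below iterates over the four per-chapter files and reports how many records each contains. The helper names `iter_jsonl` and `chapter_paths` are hypothetical and not part of the dataset, and the record schema is not assumed: the code prints whatever field names it finds in the first record rather than relying on specific keys. It also assumes each record is a JSON object and that the script runs from the repository root.

```python
import json
from pathlib import Path
from typing import Iterator


def iter_jsonl(path: Path) -> Iterator[dict]:
    """Yield one parsed record per non-empty line of a JSON Lines file."""
    with path.open(encoding="utf-8") as handle:
        for line in handle:
            line = line.strip()
            if line:  # tolerate blank lines between records
                yield json.loads(line)


def chapter_paths(data_dir: Path) -> list[Path]:
    """Build the four per-chapter JSONL paths (chap1 through chap4)."""
    return [data_dir / f"chap{n}_dataset.jsonl" for n in range(1, 5)]


if __name__ == "__main__":
    for path in chapter_paths(Path("data")):
        if not path.exists():
            print(f"{path}: not found (run from the repository root)")
            continue
        records = list(iter_jsonl(path))
        # Field names are not assumed; inspect the keys of the first record.
        first_keys = sorted(records[0]) if records else []
        print(f"{path}: {len(records)} records, fields: {first_keys}")
```

Reading line by line keeps memory use low for large files. Drop the `list(...)` call if you only need to stream records.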
How do I report issues or contribute?
Issues and contributions can be reported at the project’s GitHub repository: https://github.com/ntvm/Deltarune-Complete-Transcript-Cleaned
What's the difference between raw/ and data/ files?
- raw/: Contains pre-processed source files, before structured formatting was applied.
- data/: Contains the cleaned, structured JSONL files suitable for direct use in training pipelines or retrieval systems.

Use data/ unless you need the unprocessed source material for a specific reason.