File coverage
| File | Coverage | Status |
|---|---|---|
| data/chap1_cleaned.txt / chap1_dataset.jsonl | Chapter 1 (full) | Stable |
| data/chap2_cleaned.txt / chap2_dataset.jsonl | Chapter 2 (full, Normal Route) | Stable |
| data/chap3_cleaned.txt / chap3_dataset.jsonl | Chapter 3 (full, includes Sword Route) | Stable |
| data/chap4_cleaned.txt / chap4_dataset.jsonl | Chapter 4 (full, Normal Route) | Beta |

Chapter 4 data is currently in Beta. Record counts and scene ordering may change in future releases.
Available formats
The dataset is distributed in three formats. Each serves a different use case.

| Format | Location | Best for |
|---|---|---|
| JSONL | data/chap*_dataset.jsonl | Streaming, scripting, custom pipelines |
| Parquet | parquet/chap*_dataset.parquet, parquet/full_chapters_dataset.parquet | Analytical queries, columnar access, large-scale filtering |
| ChatML | data/chatml/deltarune_story_chatml.jsonl | Instruction fine-tuning of language models |
JSONL format
Newline-delimited JSON records. One dialogue line per record with context, speaker, and text fields.
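As a rough illustration of streaming this format, the sketch below reads a JSONL file one record at a time and counts dialogue lines per speaker. The helper names `iter_records` and `lines_per_speaker` are hypothetical; only the `speaker` field name is taken from the description above, and the exact value types are not assumed.

```python
import json
from collections import Counter


def iter_records(path):
    """Yield one dict per non-empty line of a JSONL file."""
    with open(path, encoding="utf-8") as handle:
        for line in handle:
            line = line.strip()
            if line:  # tolerate blank lines such as a trailing newline
                yield json.loads(line)


def lines_per_speaker(path):
    """Count how many records each speaker has (hypothetical helper)."""
    return Counter(record["speaker"] for record in iter_records(path))
```

For example, `lines_per_speaker("data/chap1_dataset.jsonl").most_common(5)` would list the five most frequent speakers in Chapter 1. Because records are yielded lazily, the file never has to fit in memory at once.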
Parquet format
Columnar binary format for fast filtering and aggregation. Includes a combined file covering all chapters.
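The columnar access mentioned above could look roughly like the following sketch, which loads only the needed columns and filters by speaker using pandas. It assumes the Parquet files expose the same `speaker` and `text` columns as the JSONL records, which this document does not confirm; `load_speaker_lines` is a hypothetical helper name.

```python
def load_speaker_lines(parquet_path, speaker, columns=("speaker", "text")):
    """Return the rows spoken by `speaker`, reading only `columns`.

    ASSUMPTION: the Parquet schema mirrors the JSONL fields, so columns
    named "speaker" and "text" exist. Adjust `columns` if it does not.
    """
    # Imported here so the helper can live in a module that does not
    # always need pandas. Requires pandas plus a Parquet engine (pyarrow).
    import pandas as pd

    frame = pd.read_parquet(parquet_path, columns=list(columns))
    return frame[frame["speaker"] == speaker].reset_index(drop=True)
```

Pointing it at `parquet/full_chapters_dataset.parquet` would cover all chapters in one call. Restricting `columns` lets the reader skip the data for every other column.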
ChatML format
Instruction-tuning format with system, user, and assistant roles. Use for fine-tuning language models.
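Before feeding the file to a fine-tuning job, it can help to check that every record uses only the three roles named above. The sketch below does that under a loud assumption: each line is taken to be a JSON object with a `messages` list of `{"role": ..., "content": ...}` entries. That layout is a common convention, not something this document specifies, so the key is exposed as a parameter. `iter_conversations` is a hypothetical helper.

```python
import json

EXPECTED_ROLES = {"system", "user", "assistant"}


def iter_conversations(path, messages_key="messages"):
    """Yield each record's message list, rejecting unexpected roles.

    ASSUMPTION: every line is a JSON object whose `messages_key` entry
    is a list of dicts with a "role" key. Inspect the first line of the
    real file and change `messages_key` if the layout differs.
    """
    with open(path, encoding="utf-8") as handle:
        for line_number, line in enumerate(handle, start=1):
            if not line.strip():
                continue  # skip blank lines
            messages = json.loads(line)[messages_key]
            unknown = {message["role"] for message in messages} - EXPECTED_ROLES
            if unknown:
                raise ValueError(
                    f"line {line_number}: unexpected roles {sorted(unknown)}"
                )
            yield messages
```

Calling `sum(1 for _ in iter_conversations("data/chatml/deltarune_story_chatml.jsonl"))` would count the conversations while validating them in a single pass.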
Repository structure
The raw/ directory
The raw/ directory contains the original, unprocessed transcript text files (chap1.txt through chap4.txt). These are unstructured plain text and are not suitable for direct programmatic use: they are the source material as it stood before scene segmentation, speaker attribution, and field extraction were applied.
Use the data/chap*_cleaned.txt files if you need a human-readable cleaned version, or any of the structured formats above for programmatic access.