The dataset is distributed as JSONL and Parquet files, organized by chapter, with a combined Parquet file for cross-chapter access. Each record contains three fields:
  • context — Scene identifier, e.g. "Scene: Obj Krisroom"
  • speaker — Speaker name or tag (Narrator, Player, or a character name)
  • text — The dialogue or narration text
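The three-field schema above maps directly onto one JSON object per line. A minimal sketch of what a single JSONL record looks like; the text value here is invented for illustration, not taken from the dataset:

```python
import json

# Illustrative record matching the three-field schema. The context value
# mirrors the example above; the text is made up.
record = {
    "context": "Scene: Obj Krisroom",
    "speaker": "Narrator",
    "text": "* (An example narration line.)",
}

# Each line of a JSONL file is one such object, serialized on its own line.
line = json.dumps(record)
parsed = json.loads(line)
print(parsed["speaker"])  # Narrator
```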

Install dependencies

pip install pandas pyarrow datasets

Clone the repository

git clone https://github.com/ntvm/Deltarune-Complete-Transcript-Cleaned
cd Deltarune-Complete-Transcript-Cleaned

Load data

1. Load a single chapter (JSONL)

Each chapter has its own JSONL file under data/.
import pandas as pd

df = pd.read_json("data/chap1_dataset.jsonl", lines=True)
print(df.columns.tolist())  # ['context', 'speaker', 'text']
print(df.head())
Available files: chap1_dataset.jsonl through chap4_dataset.jsonl.
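To work with all four JSONL files at once, you can concatenate them with a chapter tag so rows stay traceable to their source file. A sketch using tiny in-memory stand-ins for the four files; with the repository cloned, replace each stand-in with `pd.read_json(path, lines=True)`:

```python
import pandas as pd

# Paths to the four per-chapter files in the cloned repository.
paths = [f"data/chap{i}_dataset.jsonl" for i in range(1, 5)]

# Hypothetical one-row stand-ins for the real files, used here so the
# example is self-contained.
frames = [
    pd.DataFrame({"context": ["Scene: Example"],
                  "speaker": ["Narrator"],
                  "text": [f"Line from chapter {i}"]})
    for i in range(1, 5)
]

# Tag each frame with its chapter number before concatenating.
combined = pd.concat(
    [f.assign(chapter=i) for i, f in enumerate(frames, start=1)],
    ignore_index=True,
)
print(combined[["chapter", "speaker"]].head())
```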
2. Load a single chapter (Parquet)

Parquet files are stored under parquet/ and load faster than JSONL for large chapters.
import pandas as pd

df = pd.read_parquet("parquet/chap1_dataset.parquet")
print(df.dtypes)
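Once a chapter is loaded, from either format, ordinary pandas filtering applies. A sketch with hypothetical rows standing in for a loaded chapter; in practice `df` comes from `read_parquet` or `read_json`:

```python
import pandas as pd

# Hypothetical rows standing in for a loaded chapter DataFrame.
df = pd.DataFrame({
    "context": ["Scene: Obj Krisroom"] * 3,
    "speaker": ["Narrator", "Susie", "Narrator"],
    "text": ["* (A)", "B", "* (C)"],
})

# Keep only narration rows.
narration = df[df["speaker"] == "Narrator"]
print(len(narration))  # 2
```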
3. Load all chapters combined

Use the combined Parquet file to query across all chapters without manually concatenating files.
import pandas as pd

df = pd.read_parquet("parquet/full_chapters_dataset.parquet")
print(f"Total records: {len(df)}")
Prefer full_chapters_dataset.parquet when you need cross-chapter analysis. It avoids the overhead of loading and concatenating four separate files.
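A typical cross-chapter query is counting lines per speaker over the whole transcript. A sketch with a hypothetical stand-in DataFrame; in practice `df` comes from `pd.read_parquet("parquet/full_chapters_dataset.parquet")`:

```python
import pandas as pd

# Hypothetical stand-in for the combined file.
df = pd.DataFrame({
    "context": ["Scene: A", "Scene: A", "Scene: B"],
    "speaker": ["Narrator", "Susie", "Narrator"],
    "text": ["x", "y", "z"],
})

# Lines per speaker across every chapter at once.
counts = df["speaker"].value_counts()
print(counts)
```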
4. Load the ChatML file

The ChatML file is used for fine-tuning and contains multi-turn conversations in the OpenAI messages format.
import json

records = []
with open("data/chatml/deltarune_story_chatml.jsonl", encoding="utf-8") as f:
    for line in f:
        records.append(json.loads(line))

# Each record has a 'messages' key with a list of role/content dicts
print(records[0]["messages"][0]["role"])    # 'system'
print(records[0]["messages"][1]["role"])    # 'user'
print(records[0]["messages"][2]["role"])    # 'assistant'
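The messages list can be flattened into a readable transcript for inspection. A sketch using an invented record whose role order matches the system/user/assistant pattern shown above; the content strings are illustrative, not real dataset text:

```python
# Illustrative ChatML-style record; real records come from the JSONL file.
record = {
    "messages": [
        {"role": "system", "content": "You narrate the story."},
        {"role": "user", "content": "What happens in Kris's room?"},
        {"role": "assistant", "content": "* (You looked around the room.)"},
    ]
}

# Flatten the multi-turn conversation into one readable string.
transcript = "\n".join(
    f"{m['role']}: {m['content']}" for m in record["messages"]
)
print(transcript)
```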
5. Load from HuggingFace Hub

If the dataset has been published to the HuggingFace Hub, load it directly with the datasets library — no local clone required.
from datasets import load_dataset

# Load a single chapter split
ds = load_dataset("ntvm/Deltarune-Complete-Transcript-Cleaned", split="chap1")
print(ds.column_names)  # ['context', 'speaker', 'text']

# Convert to a pandas DataFrame
df = ds.to_pandas()
print(df.head())
The exact dataset identifier and available splits depend on how the dataset was published to the Hub. Check the dataset card on HuggingFace for the authoritative identifier and split names.