The context field in every record identifies which scene the line belongs to. Understanding scene context is essential for filtering, ordering, and using the data effectively.
All context values follow this pattern: "Scene: <scene name>".
The scene name portion uses one of two conventions.
Naming conventions
Descriptive scene names
Some scenes use human-readable names that describe the setting or situation:
| Context value | What it represents |
|---|---|
| "Scene: Cyber World" | The Dark World cyber-themed area |
| "Scene: Card Castle" | The card-themed castle area |
| "Scene: Dark Worlds" | Generic Dark World scenes |
Internal object names
Other scenes use the game's internal object names, usually prefixed with Obj:
| Context value | What it represents |
|---|---|
| "Scene: Device Contact" | Opening sequence (the SOUL creation screen) |
| "Scene: Obj Krisroom" | Kris’s bedroom (game start) |
| "Scene: Obj Carcutscene" | Car ride to school |
| "Scene: Obj Classscene" | Classroom scene |
| "Scene: Obj Schoollobbycutscene" | School lobby encounter with Susie |
| "Scene: Obj Insideclosetcutscene" | Inside the closet |
The internal object name convention (Obj prefix) comes from the game’s asset/object naming system. The data keeps these names exactly as the transcription process produced them.
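The two conventions can be told apart mechanically. The following is a minimal sketch, assuming every value starts with the literal prefix "Scene: " and that internal names start with "Obj " as in the tables above. The function name `parse_context` and the labels "internal" and "descriptive" are hypothetical, not part of the dataset.

```python
SCENE_PREFIX = "Scene: "  # assumed common prefix of every context value
OBJ_PREFIX = "Obj "       # assumed marker of an internal object name


def parse_context(context: str) -> dict:
    """Split a context value into its scene name and naming convention."""
    if not context.startswith(SCENE_PREFIX):
        raise ValueError(f"unexpected context value: {context!r}")
    name = context[len(SCENE_PREFIX):]
    # Heuristic only: a name without the Obj prefix is labeled descriptive.
    convention = "internal" if name.startswith(OBJ_PREFIX) else "descriptive"
    return {"name": name, "convention": convention}


examples = ["Scene: Cyber World", "Scene: Obj Krisroom", "Scene: Device Contact"]
parsed = [parse_context(value) for value in examples]
for value, info in zip(examples, parsed):
    print(value, "->", info)
```

The prefix check is only a heuristic: "Scene: Device Contact" is listed with the internal names above but has no Obj prefix, so this sketch labels it descriptive.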
Scene ordering
Within each JSONL file, records are stored in chronological order — the order in which they appear in the game’s story progression. The first record in chap1_dataset.jsonl is the first line of Chapter 1; the last record is the final line.
All records for a given scene are grouped together consecutively.
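Because each scene's records sit next to each other, the scene order and per-scene line counts can be recovered in one pass over the context values. The sketch below uses only the standard library. The helper names and the sample line counts are invented for illustration.

```python
from itertools import groupby


def scene_runs(contexts):
    """Collapse consecutive identical context values into (scene, line_count) runs."""
    return [(scene, sum(1 for _ in run)) for scene, run in groupby(contexts)]


def scenes_are_contiguous(contexts):
    """Return True if no scene name starts a second, separate run."""
    names = [scene for scene, _ in scene_runs(contexts)]
    return len(names) == len(set(names))


# Made-up sample: the line counts here are illustrative, not taken from the dataset.
sample = (
    ["Scene: Device Contact"] * 2
    + ["Scene: Obj Krisroom"] * 3
    + ["Scene: Obj Carcutscene"]
)
runs = scene_runs(sample)
print(runs)
print(scenes_are_contiguous(sample))
```

With a loaded DataFrame, `df['context'].tolist()` can be passed in place of `sample`. A False result would mean a scene name is reused later in the file.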
Scene context in ChatML
The ChatML format makes scene ordering explicit. Each user prompt includes the predecessor and successor scene:
{
  "role": "user",
  "content": "Provide the transcript for Scene: Obj Krisroom.\nContext: This scene occurs after 'Device Contact' and before 'Obj Carcutscene'."
}
This temporal context is encoded to help language models understand scene sequencing when fine-tuned on this data.
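A prompt of this shape can be rebuilt from an ordered list of scenes. This is a rough sketch, not the script that generated the dataset: the function names are hypothetical. How the real data words the first and last scenes is not documented here, so the sketch simply omits the context line for them.

```python
def strip_prefix(context, prefix="Scene: "):
    """Drop the leading 'Scene: ' so neighbors read like 'Device Contact'."""
    return context[len(prefix):] if context.startswith(prefix) else context


def build_user_message(scenes, index):
    """Build a ChatML-style user message for scenes[index]."""
    content = f"Provide the transcript for {scenes[index]}."
    has_before = index > 0
    has_after = index + 1 < len(scenes)
    # Assumption: only scenes with both neighbors get a context line.
    if has_before and has_after:
        before = strip_prefix(scenes[index - 1])
        after = strip_prefix(scenes[index + 1])
        content += (
            f"\nContext: This scene occurs after '{before}' and before '{after}'."
        )
    return {"role": "user", "content": content}


scenes = ["Scene: Device Contact", "Scene: Obj Krisroom", "Scene: Obj Carcutscene"]
msg = build_user_message(scenes, 1)
print(msg["content"])
```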
Working with scenes in Python
import pandas as pd
df = pd.read_json('data/chap1_dataset.jsonl', lines=True)
# Get all unique scenes in chapter order
scenes = df['context'].unique()
print(scenes)
# Get all lines for a specific scene
closet_scene = df[df['context'] == 'Scene: Obj Insideclosetcutscene']
# Find all scenes containing a keyword
cyber_scenes = df[df['context'].str.contains('Cyber', case=False)]
# Count lines per scene
scene_counts = df.groupby('context').size().sort_values(ascending=False)
print(scene_counts.head(10))
Cross-chapter scenes
Some location names appear across multiple chapters (e.g., recurring settings), so a scene name alone may not identify a chapter. To tell them apart, load the per-chapter Parquet files, tag each with its chapter number, and then combine them:
import pandas as pd
# Load per-chapter data and tag with chapter number
chapters = []
for i in range(1, 5):
    df = pd.read_parquet(f'parquet/chap{i}_dataset.parquet')
    df['chapter'] = i
    chapters.append(df)
full = pd.concat(chapters, ignore_index=True)
# Filter a scene name in a specific chapter
result = full[(full['context'] == 'Scene: Card Castle') & (full['chapter'] == 2)]
The full_chapters_dataset.parquet file does not include a chapter column — it’s a direct concatenation of the four per-chapter Parquet files. Add the chapter label yourself as shown above if you need to distinguish sources.
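Once each record carries a chapter label, the scene names shared between chapters can be listed. The sketch below works on plain (context, chapter) pairs so that it runs without the data files. The function name is hypothetical and the sample rows are placeholders, not real scene names.

```python
from collections import defaultdict


def scenes_in_multiple_chapters(rows):
    """Map each scene name seen in more than one chapter to its sorted chapter list."""
    chapters_by_scene = defaultdict(set)
    for context, chapter in rows:
        chapters_by_scene[context].add(chapter)
    return {
        scene: sorted(chapters)
        for scene, chapters in chapters_by_scene.items()
        if len(chapters) > 1
    }


# Placeholder rows for illustration only; these are not real dataset values.
rows = [
    ("Scene: Example A", 1),
    ("Scene: Example B", 1),
    ("Scene: Example A", 2),
    ("Scene: Example C", 2),
]
recurring = scenes_in_multiple_chapters(rows)
print(recurring)
```

With the tagged DataFrame built above, the pairs can be produced with `full[['context', 'chapter']].itertuples(index=False, name=None)`.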