
Case Study · Game Design

Difficulty as Data

McVroom, N. (2026). Difficulty as Data. NUR Technical Reports, NUR-2026-002. nickmcvroom.com/work/difficulty-as-data-crackpoint

Abstract

CRACKPOINT's time pressure system eliminates special cases by treating every difficulty vector as a factor in an equation. Three modes, four zones, five modifiers, a sigmoid, and a log-normal distribution, all independent, all composable.

Problem

Difficulty systems built from special cases

Approach

Everything is a multiplier, table lookup, or continuous function

Outcome

One code path for every room in every mode

Most difficulty systems are built from special cases. Easy mode disables the timer. Hard mode adds extra enemies. A particular level has a hand-tuned exception because playtesters said it was too punishing. Each exception is a conditional branch, and each branch is a maintenance liability that interacts unpredictably with every other branch.

CRACKPOINT’s difficulty system has a single rule: everything is a factor in an equation. Room one and room fifty go through the same code path. Three modes, four pressure zones, room modifiers, code length scaling, intrusion probability, hint rarity scoring: all expressed as multipliers, table lookups, and continuous functions. No mode-specific branching in the difficulty pipeline. No hardcoded triggers for special events. The logic is identical everywhere. The numbers change.

Modes as Threshold Tables

A mode is a constant object, not a code branch. PressureMode is one of three values: zen, progression, or frantic. The selection happens once, a single switch statement picks a threshold table, and from that point forward, the mode is just data flowing through the same pipeline as everything else.

Progression mode defines per-tier thresholds. Every two rooms, the tier advances, capping at tier five:

// Tier = Math.min(Math.floor((roomNumber - 1) / 2), 5)

const PROGRESSION_THRESHOLDS = [
  { amber: 90,  red: 150, critical: 240 },  // Tier 0: rooms 1-2
  { amber: 75,  red: 120, critical: 180 },  // Tier 1: rooms 3-4
  // ...tightening every two rooms...
  { amber: 40,  red: 65,  critical: 100 },  // Tier 5: rooms 11+
];

Frantic mode is a single constant: { amber: 30, red: 50, critical: 80 }. Zen returns no thresholds at all: the zone calculation has a single early return for zen mode, and everything else flows through the same threshold comparisons. Three modes, three data shapes, one trivial branch.
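The zen early return and the threshold comparisons can be sketched as one pure function. Names like `calculateZone` and `ZoneThresholds` are illustrative here, not CRACKPOINT's actual identifiers:

```typescript
type PressureZone = "green" | "amber" | "red" | "critical";

interface ZoneThresholds {
  amber: number;
  red: number;
  critical: number;
}

// Zone is a pure function of elapsed time and the selected threshold
// table. Zen mode supplies no thresholds, so a single early return
// keeps the player permanently in green -- the one trivial branch.
function calculateZone(
  elapsedSeconds: number,
  thresholds: ZoneThresholds | null, // null = zen mode
): PressureZone {
  if (thresholds === null) return "green";
  if (elapsedSeconds >= thresholds.critical) return "critical";
  if (elapsedSeconds >= thresholds.red) return "red";
  if (elapsedSeconds >= thresholds.amber) return "amber";
  return "green";
}
```

Because the function is pure, resetting the timer for a new room resets the zone for free: there is no state to clear.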

Zones as Multipliers

Within a room, elapsed time pushes the player through four pressure zones: green, amber, red, critical. Each zone escalates the penalties for mistakes, and the escalation is a multiplier table, not a conditional:

// Zone penalties - same structure, different multipliers
// Green:    wrong guess = 10s × 1.0,  hint draw = 5s × 1.0
// Amber:    wrong guess = 10s × 1.5,  hint draw = 5s × 1.0
// Red:      wrong guess = 10s × 2.0,  hint draw = 5s × 1.5
// Critical: wrong guess = 10s × 3.0,  hint draw = 5s × 2.0
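As a sketch, that table is one lookup plus one multiplication; the constant names below are hypothetical:

```typescript
type Zone = "green" | "amber" | "red" | "critical";

// Multipliers from the table above; base penalties in seconds.
const ZONE_PENALTY_MULTIPLIERS: Record<Zone, { wrongGuess: number; hintDraw: number }> = {
  green:    { wrongGuess: 1.0, hintDraw: 1.0 },
  amber:    { wrongGuess: 1.5, hintDraw: 1.0 },
  red:      { wrongGuess: 2.0, hintDraw: 1.5 },
  critical: { wrongGuess: 3.0, hintDraw: 2.0 },
};

const BASE_WRONG_GUESS_SECONDS = 10;
const BASE_HINT_DRAW_SECONDS = 5;

// Same structure for every zone -- no per-zone conditionals.
function wrongGuessPenalty(zone: Zone): number {
  return BASE_WRONG_GUESS_SECONDS * ZONE_PENALTY_MULTIPLIERS[zone].wrongGuess;
}

function hintDrawPenalty(zone: Zone): number {
  return BASE_HINT_DRAW_SECONDS * ZONE_PENALTY_MULTIPLIERS[zone].hintDraw;
}
```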

The player always knows their zone: the timer changes colour. Critical zone also triggers periodic bark messages from the game’s personality system: a pool of themed voice lines indexed by game state (zone, streak, modifier) that give the game a consistent tone without any mode-specific logic. Even the barks are data-driven: a lookup by zone, not a bespoke critical-mode handler.

Zones reset to green when the player enters a new room. The zone calculation is a pure function of elapsed time and config: when the timer resets for a new room, the zone follows. The pressure builds within a room, releases between rooms, and the rhythm emerges from the data rather than from an explicit reset handler.

Modifiers as Factors

Each room has a modifier (clean, sparse, locked, noisy, or dark) that scales the zone thresholds. The scaling is a single multiplication:

const MODIFIER_MULTIPLIERS = {
  clean: 1.0, sparse: 1.1, locked: 1.1, noisy: 1.2, dark: 1.3
};

// effective_threshold = base_threshold[tier] × modifier_multiplier[modifier]

A dark room in progression tier 3 has an amber threshold of 50 × 1.3 = 65 seconds. The modifier doesn’t know about modes. The mode doesn’t know about modifiers. They compose through multiplication because the system was designed for composition, not for awareness of its own parts.
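In code, the composition is a single map over the threshold object. A minimal sketch, with `effectiveThresholds` as a hypothetical helper name:

```typescript
const MODIFIER_MULTIPLIERS = {
  clean: 1.0, sparse: 1.1, locked: 1.1, noisy: 1.2, dark: 1.3,
} as const;

type Modifier = keyof typeof MODIFIER_MULTIPLIERS;

interface Thresholds { amber: number; red: number; critical: number }

// Neither side knows about the other: the tier table supplies the base
// numbers, the modifier supplies a scalar, and multiplication composes them.
function effectiveThresholds(base: Thresholds, modifier: Modifier): Thresholds {
  const m = MODIFIER_MULTIPLIERS[modifier];
  return { amber: base.amber * m, red: base.red * m, critical: base.critical * m };
}
```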

The Intrusion Sigmoid

This is the anecdote that tests the philosophy.

CRACKPOINT’s standard puzzles use three-digit and four-digit codes. Five-digit codes exist as intrusions: harder puzzles that appear with increasing probability as the player progresses. The original design was a special case: “if the room number exceeds X and the player is performing well, inject a five-digit room.” That’s a hardcoded trigger, a separate code path, and a design decision buried in an if statement.

Instead, intrusion probability is a sigmoid curve with four tunable parameters:

// ceiling: 0.35    - never more than 35% chance
// steepness: 0.08  - gradual S-curve
// midpoint: 40     - inflection at room 40
// minRoom: 10      - zero probability before room 10

At room ten, the intrusion chance is roughly 0.3%. At room forty, it’s about 17.5%. At room one hundred, it’s approaching the 35% ceiling. Intrusions emerge as a natural consequence of progression; a continuous curve, not a scripted event.

The reuse proof is simple. A five-digit-only mode sets intrusionProbability to 1.0. A three-digit-only mode sets it to 0.0. The “special” modes are just parameter values fed into the same function that handles everything else. No new code path. No mode-specific logic.

Fig. 1 — Intrusion probability as a function of room number (minRoom = 10, midpoint = 40 at 17.5%, ceiling = 0.35). The curve, not a threshold, governs five-digit code appearance.

Ghost Hints: Data In, Data Out

The noisy room modifier introduces ghost hints: hints that look legitimate but mislead. The temptation was to special-case it: “if the modifier is noisy, replace one hint with a lie.”

Instead, a createGhostHint function generates a random fake guess, evaluates it against the real solution to get real placed/misplaced counts, then perturbs those counts by ±1. The ghost is assigned a random rarity from the upper tiers (uncommon, rare, or epic) rather than going through the standard scoring function: this is an honest departure from “same pipeline everywhere.” The creation is different. But the consumption is identical: the ghost enters the room as a hint object with a boolean isGhost flag. The engine, the UI, and the room’s hint pool all treat it as just another hint. The only visual difference is a tell rendered by the UI layer.

The pattern is “constructed differently, consumed identically”: not as pure as running through the same scoring function, but the engine never branches on hint type. Data in, data out.
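A sketch of the construction under those rules. The `createGhostHint` signature and hint shape here are assumptions for illustration; the real evaluator is passed in as a stand-in:

```typescript
interface Hint {
  guess: string;
  placed: number;    // digits correct and in position
  misplaced: number; // digits correct but out of position
  rarity: "uncommon" | "rare" | "epic";
  isGhost: boolean;
}

const GHOST_RARITIES = ["uncommon", "rare", "epic"] as const;

// Constructed differently, consumed identically: evaluate a random fake
// guess against the real solution, then perturb each count by +/-1 and
// clamp to the valid range. The result is an ordinary Hint object.
function createGhostHint(
  solution: string,
  evaluateGuess: (guess: string, solution: string) => { placed: number; misplaced: number },
): Hint {
  const guess = Array.from(solution, () => Math.floor(Math.random() * 10)).join("");
  const real = evaluateGuess(guess, solution);
  const perturb = (n: number) =>
    Math.max(0, Math.min(solution.length, n + (Math.random() < 0.5 ? -1 : 1)));
  return {
    guess,
    placed: perturb(real.placed),
    misplaced: perturb(real.misplaced),
    rarity: GHOST_RARITIES[Math.floor(Math.random() * GHOST_RARITIES.length)],
    isGhost: true,
  };
}
```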

Power-Ups From Pattern Matching

Hints have a rarity scoring function:

// score = (extremity × 0.5) + (placedRatio × 0.3) + (distinctRatio × 0.2)
// where extremity = |totalCorrect / codeLength - 0.5| × 2

Hints at the extremes score highest: an evaluation that is almost entirely right or almost entirely wrong is statistically rare, and the extremity term rewards exactly that. Rarity cutoffs map the score to tiers: legendary (top 10%), epic, rare, uncommon, common.
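The scoring comment above, written out as a function; the names and the `HintEval` shape are assumptions:

```typescript
interface HintEval {
  placed: number;    // correct digit, correct position
  misplaced: number; // correct digit, wrong position
  distinct: number;  // distinct digits in the guess
}

// score = (extremity × 0.5) + (placedRatio × 0.3) + (distinctRatio × 0.2)
function rarityScore(h: HintEval, codeLength: number): number {
  const totalCorrect = h.placed + h.misplaced;
  const extremity = Math.abs(totalCorrect / codeLength - 0.5) * 2; // 0 at 50% correct, 1 at the extremes
  const placedRatio = h.placed / codeLength;
  const distinctRatio = h.distinct / codeLength;
  return extremity * 0.5 + placedRatio * 0.3 + distinctRatio * 0.2;
}
```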

But certain guess patterns produce degenerate hints: evaluations so distinctive that they become power-ups. The classification is a lookup table against the evaluation result, not a flag set during generation:

Pattern         | Condition                  | Power-Up
Repeated digits | All digits the same        | Boost
Partial anagram | Overlap ≥ n-1              | Surge I
Near solution   | Exactly 1 digit off        | Surge II
Full anagram    | All correct, all misplaced | Oracle

The degenerate type is a computed property of the hint data: placed count, misplaced count, code length, distinct digits. Power-ups are selected from the classified pool, with Oracle always included if available. If no Oracle exists naturally, one is constructed from a solution permutation, but it still enters the room as data in the same hint structure.

Power-ups aren’t a separate system bolted onto the difficulty model. They emerge from the hint evaluation pipeline. A hint that happens to be an anagram of the solution isn’t “special”; it’s a hint whose evaluation pattern matches the Oracle row in a lookup table.
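A sketch of that lookup as predicates over the evaluation data. The predicate details are inferred from the table and may differ from the actual implementation:

```typescript
type PowerUp = "Boost" | "Surge I" | "Surge II" | "Oracle";

interface HintData {
  guess: string;
  placed: number;
  misplaced: number;
  codeLength: number;
}

// Classification is computed from the hint data, not flagged at
// generation time. Most specific pattern first, since a full anagram
// also satisfies the partial-anagram overlap test.
function classifyPowerUp(h: HintData): PowerUp | null {
  const overlap = h.placed + h.misplaced;
  const distinct = new Set(h.guess).size;
  if (h.placed === 0 && h.misplaced === h.codeLength) return "Oracle"; // full anagram
  if (h.placed === h.codeLength - 1) return "Surge II";                // exactly 1 digit off
  if (overlap >= h.codeLength - 1) return "Surge I";                   // partial anagram
  if (distinct === 1) return "Boost";                                  // repeated digits
  return null; // not degenerate -- an ordinary hint
}
```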

Synthetic Percentiles

Players see a percentile ranking after each room: “faster than 73% of players.” At launch, there are no real players to compare against. The rankings are generated from a log-normal distribution with per-tier parameters:

const DISTRIBUTIONS = [
  { medianSeconds: 45,  sigma: 0.5 },   // Tier 0: fast rooms, tight spread
  // ...scaling with difficulty...
  { medianSeconds: 180, sigma: 0.7 },   // Tier 5: slow rooms, wide spread
];

Log-normal because solve times naturally skew right: some rooms take much longer than average, but few are dramatically faster. The sigma widens at higher tiers because variance increases with difficulty. The percentile comes from the cumulative distribution function (CDF): given a solve time, what proportion of the modelled population would have been slower? The CDF uses the Abramowitz and Stegun error-function approximation: no external math library, just a polynomial that has been accurate enough for fifty years.

A “Speed Demon” badge triggers at the 90th percentile. Roughly one in ten rooms: feels earned but achievable. For aggregate run percentiles, room medians are summed and sigmas averaged, then the same CDF applies. One function. Every context.
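The whole percentile path can be sketched in a few lines, using the Abramowitz and Stegun 7.1.26 polynomial for erf (accurate to about 1.5e-7, far more precision than a percentile display needs). Function names here are illustrative:

```typescript
// Abramowitz & Stegun 7.1.26 polynomial approximation of erf(x).
function erf(x: number): number {
  const sign = x < 0 ? -1 : 1;
  x = Math.abs(x);
  const t = 1 / (1 + 0.3275911 * x);
  const poly =
    ((((1.061405429 * t - 1.453152027) * t + 1.421413741) * t - 0.284496736) * t +
      0.254829592) * t;
  return sign * (1 - poly * Math.exp(-x * x));
}

// Log-normal CDF: proportion of the modelled population faster than t.
function logNormalCdf(t: number, medianSeconds: number, sigma: number): number {
  const z = (Math.log(t) - Math.log(medianSeconds)) / (sigma * Math.SQRT2);
  return 0.5 * (1 + erf(z));
}

// "Faster than X% of players": the proportion that would have been slower.
function percentileFasterThan(t: number, medianSeconds: number, sigma: number): number {
  return 100 * (1 - logNormalCdf(t, medianSeconds, sigma));
}
```

A solve at exactly the tier median lands at the 50th percentile by construction; the Speed Demon check is then just `percentileFasterThan(...) >= 90`.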

What I’d Do Differently

The config values live in TypeScript source files (thresholds.ts, intrusionConfig.ts, hintScoringConfig.ts). They're isolated from logic, but they're not external. A dedicated tuning tool that visualises the difficulty curve, the sigmoid, and the percentile distributions in real time, with editable sliders feeding live previews, would have dramatically accelerated balancing. Right now, tuning means editing constants, saving, and watching the dev server hot-reload. It works. It's not efficient.

The synthetic percentile distributions were hand-tuned by feel. They should have been calibrated against automated play-throughs: run a thousand simulated sessions with varying "skill" profiles and fit the distributions to the simulated population. The current parameters produce results that feel reasonable, but "feels reasonable" is a weaker foundation than "matches simulated reality." This is a lesson I applied directly in AFTERTOUCH, where an automated tuning pipeline plays simulated games and emits calibrated config values in under eight minutes, replacing intuition with data.

Outcome

Three modes. Four zones. Five modifiers. A sigmoid. A log-normal distribution. A rarity scoring function. All of them independent. All of them composable. None of them aware of each other. The same getRoomConfig() function handles every room in every mode; it just multiplies different numbers.

The philosophy isn’t “avoid special cases because special cases are bad.” It’s that special cases are debt. Each one is cheap to add and expensive to maintain, because it interacts with every other special case in ways you can’t predict at the time you write it. Treating everything as a factor in an equation isn’t elegance for its own sake. It’s a survival strategy for a system that needs to grow.

References

McVroom, N. (2026). Shipping One Game Across Four Platforms. NUR Technical Reports, NUR-2026-001. nickmcvroom.com/work/platform-abstraction-crackpoint

McVroom, N. (2026). Generating Puzzles Worth Playing. NUR Technical Reports, NUR-2026-003. nickmcvroom.com/work/sculpted-erosion-aftertouch