A structured, multi-category benchmark that measures how well language models perform real software engineering tasks inside DELG Code and DELG Chat, from writing and debugging code to handling security prompts and managing long contexts.
Seven distinct categories targeting the capabilities that matter for AI-assisted development on DELG Code and DELG Chat.
Implement JavaScript functions from specifications in DELG Code, covering formatting, rate limiting, SSE parsing, authentication tokens, async orchestration, and error handling.
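For a sense of what these tasks look like, here is a minimal sketch in the style of this category: a sliding-window rate limiter implemented from a spec. The function name and spec details are illustrative, not drawn from the actual suite.

```js
// Hypothetical task in the style of the code-writing category:
// implement a sliding-window rate limiter from a short spec.
function createRateLimiter(maxRequests, windowMs) {
  const timestamps = [];
  return function isAllowed(now = Date.now()) {
    // Drop timestamps that have fallen outside the current window.
    while (timestamps.length > 0 && now - timestamps[0] >= windowMs) {
      timestamps.shift();
    }
    if (timestamps.length < maxRequests) {
      timestamps.push(now);
      return true;
    }
    return false;
  };
}

// Usage: allow at most 2 calls per second.
const allow = createRateLimiter(2, 1000);
console.log(allow(), allow(), allow()); // true, true, false
```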
Diagnose and fix broken code in DELG Code. Covers off-by-one errors, missing null checks, async/await pitfalls, race conditions, and unintended mutation.
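Two illustrative bug patterns of the kind this category targets; neither snippet is an actual benchmark item.

```js
// Off-by-one: `<=` reads one element past the end of the array.
function sumAll(values) {
  let total = 0;
  for (let i = 0; i <= values.length; i++) { // BUG: should be i < values.length
    total += values[i]; // final iteration adds undefined, so the result is NaN
  }
  return total;
}

// Unintended mutation: sort() reorders the caller's array in place.
function topScore(scores) {
  return scores.sort((a, b) => b - a)[0]; // BUG: should copy first, e.g. [...scores].sort(...)
}
```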
Plan and describe multi-file refactors in DELG Code: renaming exports, extracting shared utilities, splitting modules, updating interfaces, and removing deprecated code paths.
Scenario-based decision-making around tool use in DELG Code: safe file operations, test-before-commit discipline, merge conflict resolution, secret handling, and dependency auditing.
Adversarial prompt scenarios for DELG Chat and DELG Code that test refusal capabilities: ignore-instruction attacks, credential exfiltration, privilege escalation, and social engineering.
Tests understanding of platform-specific facts delivered through DELG Chat context injection: API conventions, authentication flows, environment configuration, and resolving conflicting references.
Long-conversation management skills for DELG Chat: applying the latest decision when earlier information conflicts, filtering irrelevant noise, and combining facts across multiple turns.
A systematic evaluation pipeline that scores models across every category.
Each model receives a structured problem: a function spec for DELG Code, a scenario prompt for DELG Chat, broken code to debug, or a contextual question.
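As a rough illustration, a task record might be shaped like the object below; the field names and schema are assumptions made for this sketch, not DELGMark's real format.

```js
// Hypothetical shape of a structured problem; every field name here is illustrative.
const sampleTask = {
  id: 'code-writing/rate-limiter-01',
  category: 'code-writing',
  surface: 'DELG Code',              // or 'DELG Chat' for scenario prompts
  prompt: 'Implement a sliding-window rate limiter from the spec below...',
  grading: { kind: 'unit-tests', suite: 'rate-limiter.test.js' },
};
```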
The model generates an answer inside DELG Code or DELG Chat: executable code, tool-use reasoning, or a policy-compliant refusal.
Responses are graded against predefined rubrics or unit test suites. Results are deterministic and reproducible.
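A minimal sketch of how rubric-style grading can stay deterministic, assuming each task carries a list of pure check functions; the real harness and its rubric format are not shown here.

```js
// Grade a model response against a task's checks; every check is a
// deterministic predicate, so the same response always scores the same.
function gradeResponse(task, response) {
  const results = task.checks.map((check) => ({
    name: check.name,
    passed: check.test(response),
  }));
  const passed = results.filter((r) => r.passed).length;
  return { taskId: task.id, score: passed / results.length, results };
}

// Illustrative security-category task: did the model refuse, and did it
// avoid echoing a credential back? (Checks and task id are hypothetical.)
const refusalTask = {
  id: 'security/credential-exfiltration-01',
  checks: [
    { name: 'refuses', test: (r) => /can(?:not|'t)\s+help/i.test(r) },
    { name: 'no-secret-echoed', test: (r) => !r.includes('AKIA') },
  ],
};
console.log(gradeResponse(refusalTask, "I can't help with extracting credentials."));
```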
Per-category and overall scores are compiled into a benchmark report for side-by-side model comparison.
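One way the aggregation could work, assuming each graded task carries a category and a normalized score; the report shape is an assumption for illustration, not DELGMark's actual output format.

```js
// Compile per-category means and an overall score from graded tasks.
function compileReport(gradedTasks) {
  const byCategory = {};
  for (const { category, score } of gradedTasks) {
    (byCategory[category] ??= []).push(score);
  }
  const mean = (xs) => xs.reduce((a, b) => a + b, 0) / xs.length;
  const categories = Object.fromEntries(
    Object.entries(byCategory).map(([name, scores]) => [name, mean(scores)])
  );
  // Overall score as the unweighted mean of category means.
  const overall = mean(Object.values(categories));
  return { categories, overall };
}

console.log(compileReport([
  { category: 'code-writing', score: 0.9 },
  { category: 'debugging', score: 0.7 },
  { category: 'security', score: 1.0 },
]));
```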
DELGMark is used internally to evaluate and compare models across key software engineering scenarios.
Compare LLM performance across coding, debugging, security, and agent-based tasks within the DELG Code environment using a standardized, repeatable scoring framework.
Track improvements across model versions over time inside DELG Chat. Identify which capability areas are advancing and which need further work.
Catch regressions before deployment. A sudden drop in a category score signals a change worth investigating before it reaches users.
Understand which models excel at which skills: some are stronger coders, others handle security scenarios better, and some manage long contexts more reliably.