📊 80 Problems · 7 Categories

LLM Benchmark for AI-Assisted Development

A structured, multi-category benchmark that measures how well language models perform real software engineering tasks inside DELG Code and DELG Chat, from writing and debugging code to handling security prompts and managing long contexts.

80
Total Problems
7
Categories
14
Coding
14
Agent Tasks

What It Evaluates

Seven distinct categories targeting the capabilities that matter for AI-assisted development on DELG Code and DELG Chat.

💻

Coding

14 problems

Implement JavaScript functions from specifications in DELG Code, covering formatting, rate limiting, SSE parsing, authentication tokens, async orchestration, and error handling.
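
For illustration only, a coding item might look like the sketch below: a short spec, a JavaScript implementation, and hidden unit tests run against it. The function name and signature are assumptions, not actual benchmark content.

```js
// Hypothetical Coding problem: implement a sliding-window rate limiter
// from a written spec. Allow at most `limit` calls per `windowMs` per key.
function createRateLimiter(limit, windowMs) {
  const calls = new Map(); // key -> timestamps inside the current window

  return function isAllowed(key, now = Date.now()) {
    const cutoff = now - windowMs;
    const recent = (calls.get(key) || []).filter((t) => t > cutoff);
    if (recent.length >= limit) {
      calls.set(key, recent);
      return false; // over the limit for this window
    }
    recent.push(now);
    calls.set(key, recent);
    return true;
  };
}

// Usage: const allow = createRateLimiter(3, 1000); allow("user-1"); // true
```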

๐Ÿ›

Debugging

10 problems

Diagnose and fix broken code in DELG Code. Problems target off-by-one errors, missing null checks, async/await pitfalls, race conditions, and unintended mutation.
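
A debugging item might present a small snippet with a planted defect, like this illustrative (not actual) example, and grade whether the fix is correct:

```js
// Hypothetical Debugging problem: the function should return the last N items,
// but an off-by-one in the slice drops one element.
function lastN(items, n) {
  return items.slice(items.length - n + 1); // BUG: off by one
}

// Expected fix:
function lastNFixed(items, n) {
  return items.slice(Math.max(0, items.length - n));
}
```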

🔧

Refactoring

10 problems

Plan and describe multi-file refactors in DELG Code: renaming exports, extracting shared utilities, splitting modules, updating interfaces, and removing deprecated code paths.
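As a rough illustration, a refactoring item might ask the model to plan an extraction like the one below and describe every file and export that changes; the module and function names are hypothetical.

```js
// Hypothetical Refactoring problem: two modules duplicate the same
// label-formatting logic; the task is to extract it into a shared utility.

// Before (duplicated in api/users.js and api/orders.js):
//   const label = `${record.id}: ${record.name}`.trim();

// After: shared/format.js
export function recordLabel(record) {
  return `${record.id}: ${record.name}`.trim();
}

// Both callers now import the single implementation:
// import { recordLabel } from "../shared/format.js";
```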

🤖

Agent Tooling

14 problems

Scenario-based decision-making around tool use in DELG Code: safe file operations, test-before-commit discipline, merge conflict resolution, secret handling, and dependency auditing.
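
A tooling scenario could be framed roughly like the sketch below, where the model is graded on its decisions rather than on generated code; the field and tool names are illustrative assumptions, not the benchmark's real schema.

```js
// Hypothetical Agent Tooling scenario: the model must choose a safe
// sequence of tool calls before committing a change.
const scenario = {
  id: "agent-commit-discipline",
  situation: "A fix is ready on a feature branch; tests have not been run.",
  availableTools: ["run_tests", "git_commit", "git_push", "read_file"],
  expectedOrder: ["run_tests", "git_commit", "git_push"],
  failIf: ["git_push before run_tests", "committing files containing secrets"],
};
```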

🛡️

Security

10 problems

Adversarial prompt scenarios for DELG Chat and DELG Code that test refusal capabilities: ignore-instruction attacks, credential exfiltration, privilege escalation, and social engineering.

📚

Platform Knowledge

12 problems

Tests understanding of platform-specific facts delivered through DELG Chat context injection: API conventions, authentication flows, environment configuration, and resolving conflicting references.

📏

Context Window

10 problems

Long-conversation management skills for DELG Chat: applying the latest decision when earlier information conflicts, filtering irrelevant noise, and combining facts across multiple turns.
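
As an illustrative (not actual) example, a context-window item might bury a superseding instruction among unrelated turns and check that the model applies the most recent decision:

```js
// Hypothetical Context Window item: earlier and later turns conflict,
// and the model is expected to follow the latest instruction.
const turns = [
  { role: "user", content: "Deploy the service to the staging cluster." },
  { role: "user", content: "Ignore that. Deploy to production instead." },
  // ...many unrelated turns of noise...
  { role: "user", content: "Where should the deploy go?" },
];
// Expected answer: production, because the later instruction supersedes the earlier one.
```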

How the Benchmark Works

A systematic evaluation pipeline that scores models across every category.

1

Problem Delivery

Each model receives a structured problem: a function spec for DELG Code, a scenario prompt for DELG Chat, broken code to debug, or a contextual question.
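
The exact delivery format is internal to DELGMark; as a rough, assumption-labeled sketch, a problem envelope might carry fields along these lines:

```js
// Hypothetical problem envelope (field names are assumptions, not the real schema).
const problem = {
  id: "coding-007",
  category: "coding",      // one of the seven categories
  target: "delg-code",     // delg-code or delg-chat
  prompt: "Implement parseSSE(stream) per the attached spec.",
  grading: { mode: "unit-tests", suite: "parse-sse.spec.js" },
};
```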

2

Model Response

The model generates an answer inside DELG Code or DELG Chat: executable code, tool-use reasoning, or a policy-compliant refusal.

3

Automated Scoring

Responses are graded against predefined rubrics or unit test suites. Results are deterministic and reproducible.
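
For coding problems, grading can be pictured as running the model's answer against a fixed test suite. The sketch below is illustrative and assumes the response has already been loaded as a callable function.

```js
// Hypothetical deterministic grader: every test case either passes or fails,
// so re-running the benchmark on the same response yields the same score.
function gradeCodingResponse(modelFunction, testCases) {
  let passed = 0;
  for (const { args, expected } of testCases) {
    let actual;
    try {
      actual = modelFunction(...args);
    } catch {
      continue; // a thrown error counts as a failed case
    }
    if (JSON.stringify(actual) === JSON.stringify(expected)) passed += 1;
  }
  return { passed, total: testCases.length, score: passed / testCases.length };
}
```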

4

Aggregated Report

Per-category and overall scores are compiled into a benchmark report for side-by-side model comparison.
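
Aggregation itself is simple to picture; below is a minimal sketch, assuming each result carries a category and a 0-1 score.

```js
// Hypothetical report aggregation: average scores per category, then overall.
function buildReport(results) {
  const byCategory = {};
  for (const { category, score } of results) {
    (byCategory[category] ||= []).push(score);
  }
  const perCategory = Object.fromEntries(
    Object.entries(byCategory).map(([cat, scores]) => [
      cat,
      scores.reduce((a, b) => a + b, 0) / scores.length,
    ])
  );
  const overall = results.reduce((a, r) => a + r.score, 0) / results.length;
  return { perCategory, overall };
}
```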

Who It's For

DELGMark is used internally to evaluate and compare models across key software engineering scenarios.

🧪 Model Evaluation

Compare LLM performance across coding, debugging, security, and agent-based tasks within the DELG Code environment using a standardized, repeatable scoring framework.

📈 Progress Tracking

Track improvements across model versions over time inside DELG Chat. Identify which capability areas are advancing and which need further work.

โš–๏ธ Regression Detection

Catch regressions before deployment. A sudden drop in a category score signals a change worth investigating before it reaches users.

🎯 Capability Mapping

Understand which models excel at which skills: some are stronger coders, others handle security scenarios better, and some manage long contexts more reliably.

Built for DELG Code & DELG Chat

DELGMark is purpose-built for the DELG Code (a web platform for coding with agents) and DELG Chat (an AI chat platform) environments. Problems are grounded in real interactions: file operations, API calls, sandbox constraints, and multi-turn conversations.

code.delg.dev · chat.delg.dev