A structured, multi-category benchmark that measures how well language models perform real software engineering tasks inside DELG Code and DELG Chat, from writing and debugging code to handling security prompts and managing long contexts.
Seven distinct categories targeting the capabilities that matter for AI-assisted development on DELG Code and DELG Chat.
Implement JavaScript functions from specifications in DELG Code, covering formatting, rate limiting, SSE parsing, authentication tokens, async orchestration, and error handling.
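For a sense of what these tasks look like, here is a minimal sketch in the style of this category: a sliding-window rate limiter implemented from a spec. The function name and spec details are illustrative, not drawn from the actual suite.

```js
// Hypothetical task in the style of the code-writing category:
// implement a sliding-window rate limiter from a short spec.
function createRateLimiter(maxRequests, windowMs) {
  const timestamps = [];
  return function isAllowed(now = Date.now()) {
    // Drop timestamps that have fallen outside the current window.
    while (timestamps.length > 0 && now - timestamps[0] >= windowMs) {
      timestamps.shift();
    }
    if (timestamps.length < maxRequests) {
      timestamps.push(now);
      return true;
    }
    return false;
  };
}

// Usage: allow at most 2 calls per second.
const allow = createRateLimiter(2, 1000);
console.log(allow(), allow(), allow()); // true, true, false
```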
Diagnose and fix broken code in DELG Code. Covers off-by-one errors, missing null checks, async/await pitfalls, race conditions, and unintended mutation.
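Two illustrative bug patterns of the kind this category targets; neither snippet is an actual benchmark item.

```js
// Off-by-one: `<=` reads one element past the end of the array.
function sumAll(values) {
  let total = 0;
  for (let i = 0; i <= values.length; i++) { // BUG: should be i < values.length
    total += values[i]; // final iteration adds undefined, so the result is NaN
  }
  return total;
}

// Unintended mutation: sort() reorders the caller's array in place.
function topScore(scores) {
  return scores.sort((a, b) => b - a)[0]; // BUG: should copy first, e.g. [...scores].sort(...)
}
```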
Plan and describe multi-file refactors in DELG Code: renaming exports, extracting shared utilities, splitting modules, updating interfaces, and removing deprecated code paths.
Scenario-based decision-making around tool use in DELG Code: safe file operations, test-before-commit discipline, merge conflict resolution, secret handling, and dependency auditing.
Adversarial prompt scenarios for DELG Chat and DELG Code that test refusal capabilities: ignore-instruction attacks, credential exfiltration, privilege escalation, and social engineering.
Tests understanding of platform-specific facts delivered through DELG Chat context injection: API conventions, authentication flows, environment configuration, and resolving conflicting references.
Long-conversation management skills for DELG Chat: applying the latest decision when earlier information conflicts, filtering irrelevant noise, and combining facts across multiple turns.
A systematic evaluation pipeline that scores models across every category.
Each model receives a structured problem: a function spec for DELG Code, a scenario prompt for DELG Chat, broken code to debug, or a contextual question.
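As a rough illustration, a task record might be shaped like the object below; the field names and schema are assumptions made for this sketch, not DELGMark's real format.

```js
// Hypothetical shape of a structured problem; every field name here is illustrative.
const sampleTask = {
  id: 'code-writing/rate-limiter-01',
  category: 'code-writing',
  surface: 'DELG Code',              // or 'DELG Chat' for scenario prompts
  prompt: 'Implement a sliding-window rate limiter from the spec below...',
  grading: { kind: 'unit-tests', suite: 'rate-limiter.test.js' },
};
```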
The model generates an answer inside DELG Code or DELG Chat: executable code, tool-use reasoning, or a policy-compliant refusal.
Responses are graded against predefined rubrics or unit test suites. Results are deterministic and reproducible.
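A minimal sketch of how rubric-style grading can stay deterministic, assuming each task carries a list of pure check functions; the real harness and its rubric format are not shown here.

```js
// Grade a model response against a task's checks; every check is a
// deterministic predicate, so the same response always scores the same.
function gradeResponse(task, response) {
  const results = task.checks.map((check) => ({
    name: check.name,
    passed: check.test(response),
  }));
  const passed = results.filter((r) => r.passed).length;
  return { taskId: task.id, score: passed / results.length, results };
}

// Illustrative security-category task: did the model refuse, and did it
// avoid echoing a credential back? (Checks and task id are hypothetical.)
const refusalTask = {
  id: 'security/credential-exfiltration-01',
  checks: [
    { name: 'refuses', test: (r) => /can(?:not|'t)\s+help/i.test(r) },
    { name: 'no-secret-echoed', test: (r) => !r.includes('AKIA') },
  ],
};
console.log(gradeResponse(refusalTask, "I can't help with extracting credentials."));
```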
Per-category and overall scores are compiled into a benchmark report for side-by-side model comparison.
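One way the aggregation could work, assuming each graded task carries a category and a normalized score; the report shape is an assumption for illustration, not DELGMark's actual output format.

```js
// Compile per-category means and an overall score from graded tasks.
function compileReport(gradedTasks) {
  const byCategory = {};
  for (const { category, score } of gradedTasks) {
    (byCategory[category] ??= []).push(score);
  }
  const mean = (xs) => xs.reduce((a, b) => a + b, 0) / xs.length;
  const categories = Object.fromEntries(
    Object.entries(byCategory).map(([name, scores]) => [name, mean(scores)])
  );
  // Overall score as the unweighted mean of category means.
  const overall = mean(Object.values(categories));
  return { categories, overall };
}

console.log(compileReport([
  { category: 'code-writing', score: 0.9 },
  { category: 'debugging', score: 0.7 },
  { category: 'security', score: 1.0 },
]));
```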
DELGMark is used internally to evaluate and compare models across key software engineering scenarios.
Compare LLM performance across coding, debugging, security, and agent-based tasks within the DELG Code environment using a standardized, repeatable scoring framework.
Track improvements across model versions over time inside DELG Chat. Identify which capability areas are advancing and which need further work.
Catch regressions before deployment. A sudden drop in a category score signals a change worth investigating before it reaches users.
Understand which models excel at which skills: some are stronger coders, others handle security scenarios better, and some manage long contexts more reliably.