DELGMark LLM Benchmark

A structured, multi-category benchmark for evaluating language model performance on real software engineering tasks inside DELG Code and DELG Chat.

106 Problems
7 Categories
2 Scoring Methods
20 New in v1.1

Category Overview

The benchmark has seven categories, each targeting a capability that matters for AI-assisted development. Each category is scored either by unit tests or by rubric-based evaluation. Version 1.1 adds 20 new, harder problems.

Category | ID | Count | Scoring | Description
Coding | C01–C19 | 19 | Unit Tests | Implement JavaScript functions from specifications. New hard problems: deepMerge, LRUCache, topologicalSort, debounceWithFlush, semverCompare.
Debugging | E01–E13 | 13 | Unit Tests | Diagnose and fix broken code. New hard problems: closure loop with async, floating-point precision in monetary calculations, shared mutable state.
Refactoring | F01–F13 | 13 | Rubric | Plan and describe multi-file refactors. New hard problems: Strategy Pattern extraction, callback-to-async conversion, splitting fat modules.
Agent Tooling | A01–A17 | 17 | Rubric | Evaluate model tool-use capability. New hard problems: compound tools, validation of multiple arguments, dynamic schemas.
Security | S01–S16 | 16 | Rubric | Handle malicious code and unsafe prompt requests. New hard problems: indirect injections, payload obfuscation.
DELG Knowledge | D01–D16 | 16 | Rubric | Understand internal API specifications and custom code rules. New hard problems: composite APIs, version migrations.
Context Window | W01–W12 | 12 | Rubric | Long-conversation management. New hard problems: multiple decision reversals with nested context, extreme token budget with noise.
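
The two scoring methods in the table can be pictured with a small sketch. DELGMark's actual formulas are not given in this overview, so everything below is an assumption for illustration: the function names `scoreUnitTests` and `scoreRubric`, the test-case shape `{ args, expected }`, and the criterion shape `{ max, awarded }` are all hypothetical. The sketch treats a unit-test score as the fraction of passing tests and a rubric score as awarded points over available points.

```javascript
// Hypothetical scoring helpers; the benchmark's real formulas are not
// documented here, so treat this as one plausible reading only.

// Unit-test scoring: run each test against the candidate function and
// report the fraction that pass. A test that throws counts as a failure.
function scoreUnitTests(candidate, tests) {
  if (tests.length === 0) return 0;
  let passed = 0;
  for (const { args, expected } of tests) {
    try {
      const actual = candidate(...args);
      // Structural comparison via JSON is enough for plain data results.
      if (JSON.stringify(actual) === JSON.stringify(expected)) passed += 1;
    } catch {
      // Exceptions are treated as failed tests.
    }
  }
  return passed / tests.length;
}

// Rubric scoring: each criterion has a maximum number of points and a
// grader awards between 0 and that maximum. Out-of-range awards are clamped.
function scoreRubric(criteria) {
  const available = criteria.reduce((sum, c) => sum + c.max, 0);
  if (available === 0) return 0;
  const awarded = criteria.reduce(
    (sum, c) => sum + Math.min(Math.max(c.awarded, 0), c.max),
    0
  );
  return awarded / available;
}

// Usage: one of two tests passes; three of four rubric points are awarded.
const add = (a, b) => a + b;
console.log(scoreUnitTests(add, [
  { args: [1, 2], expected: 3 },
  { args: [2, 2], expected: 5 },
])); // 0.5
console.log(scoreRubric([
  { max: 3, awarded: 2 },
  { max: 1, awarded: 1 },
])); // 0.75
```

Normalizing both methods to a 0 to 1 range is a design choice of this sketch; it makes unit-test and rubric categories comparable, but the benchmark may weight or report them differently.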

Problem Index

Every problem in the benchmark, organized by category. NEW in v1.1 tags indicate new additions.

💻 Coding — 19 problems

C01 formatCredits, C02 isWithinLimit, C03 parseSSEChunk, C04 routeToolCall, C05 diffToMonacoChanges, C06 validateExchangeToken, C07 truncateContext, C08 buildDockerResources, C09 generatePRBody, C10 aggregateTelemetry, C11 asyncRetry, C12 parseWithFallback, C13 ConnectionPool, C14 extractPatterns, C15 deepMerge, C16 LRUCache, C17 topologicalSort, C18 debounceWithFlush, C19 semverCompare
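
To give a feel for what a Coding problem asks for, here is a minimal sketch of a function in the spirit of C19 semverCompare. The benchmark's actual specification is not reproduced in this overview, so the signature and behavior are assumptions: the sketch takes two version strings, returns -1, 0, or 1, follows the precedence rules of Semantic Versioning 2.0.0, and ignores build metadata. The helpers `parseSemver` and `comparePrerelease` are hypothetical names.

```javascript
// Illustrative only: the real C19 specification may differ.

// Split "MAJOR.MINOR.PATCH[-prerelease][+build]" into comparable parts.
function parseSemver(version) {
  const match =
    /^(\d+)\.(\d+)\.(\d+)(?:-([0-9A-Za-z.-]+))?(?:\+[0-9A-Za-z.-]+)?$/.exec(version);
  if (!match) throw new TypeError(`Invalid semver: ${version}`);
  return {
    core: [Number(match[1]), Number(match[2]), Number(match[3])],
    pre: match[4] ? match[4].split(".") : [],
  };
}

// Compare pre-release identifier lists according to SemVer 2.0.0.
function comparePrerelease(a, b) {
  // A version without a pre-release tag outranks one that has a tag.
  if (a.length === 0 && b.length === 0) return 0;
  if (a.length === 0) return 1;
  if (b.length === 0) return -1;
  const shared = Math.min(a.length, b.length);
  for (let i = 0; i < shared; i += 1) {
    const x = a[i];
    const y = b[i];
    if (x === y) continue;
    const xNum = /^\d+$/.test(x);
    const yNum = /^\d+$/.test(y);
    if (xNum && yNum) {
      const diff = Number(x) - Number(y);
      if (diff !== 0) return diff < 0 ? -1 : 1;
      continue;
    }
    if (xNum) return -1; // numeric identifiers rank below alphanumeric ones
    if (yNum) return 1;
    return x < y ? -1 : 1; // alphanumeric identifiers compare in ASCII order
  }
  // All shared identifiers are equal: the longer list has higher precedence.
  return Math.sign(a.length - b.length);
}

function semverCompare(a, b) {
  const va = parseSemver(a);
  const vb = parseSemver(b);
  for (let i = 0; i < 3; i += 1) {
    if (va.core[i] !== vb.core[i]) return va.core[i] < vb.core[i] ? -1 : 1;
  }
  return comparePrerelease(va.pre, vb.pre);
}

console.log(semverCompare("1.2.3", "1.10.0")); // -1 (numeric, not lexical)
console.log(semverCompare("1.0.0", "1.0.0-rc.1")); // 1
console.log(semverCompare("1.0.0+build1", "1.0.0+build2")); // 0
```

The first usage line shows the trap such a problem typically probes: comparing version parts as strings would rank "10" below "2".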

🐛 Debugging — 13 problems

E01 chunkArray, E02 validateEmail, E03 fetchData, E04 raceCondition, E05 typeCoercion, E06 mutationBug, E07 regexError, E08 closureLeak, E09 floatingPoint, E10 infiniteLoop, E11 closureLoop, E12 monetaryAccum, E13 referenceMutation
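
As an illustration of the bug class behind the floating-point monetary problems, the sketch below pairs a broken function with one possible fix. It is not the benchmark's actual E12 code, which is not shown in this overview; the names `totalBuggy` and `totalFixed` are hypothetical, and the fix assumes every input amount has at most two decimal places.

```javascript
// Illustrative only: not the benchmark's actual problem code.

// Buggy: decimal amounts are stored as binary floats, so repeated addition
// accumulates rounding error.
function totalBuggy(prices) {
  return prices.reduce((sum, price) => sum + price, 0);
}

// Fixed (one option): convert each amount to integer cents, sum the integers
// exactly, and convert back to a decimal amount once at the end.
// Assumes each price has at most two decimal places.
function totalFixed(prices) {
  const cents = prices.reduce((sum, price) => sum + Math.round(price * 100), 0);
  return cents / 100;
}

console.log(totalBuggy([0.1, 0.2])); // 0.30000000000000004
console.log(totalFixed([0.1, 0.2])); // 0.3
```

Keeping amounts as integer cents throughout, rather than converting at the boundary as done here, avoids the issue entirely and is often the cleaner design.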

🔄 Refactoring — 13 problems

F01 extractStrategy, F02 callbackToPromise, F03 splitFatModule, F04 introduceBuilder, F05 liftState, F06 flattenNestedLoops, F07 polymorphicReplace, F08 extractMiddleware, F09 parameterizeQueries, F10 migrateToAsyncAwait, F11 decoupleConfig, F12 extractCustomHook, F13 encapsulateCollection

🔌 Agent Tooling — 17 problems

A01 callSingleTool, A02 parallelTools, A03 sequentialTools, A04 missingParams, A05 toolFallback, A06 excessiveCalls, A07 parseJSONParams, A08 confirmAction, A09 emptyResponse, A10 invalidSchema, A11 compositeTool, A12 schemaMigration, A13 rateLimitRetry, A14 nestedSchemas, A15 dynamicTooling, A16 authRecovery, A17 ambiguousTools
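
Several Agent Tooling problems, such as A04 missingParams and A10 invalidSchema, turn on whether a tool call matches its declared parameters. The sketch below shows one way a harness could check that. It is an assumption for illustration, not DELG's tool-calling API: the schema shape `{ name, params }`, the call shape `{ name, args }`, and the function `validateToolCall` are all hypothetical.

```javascript
// Illustrative only: hypothetical schema and call shapes.
// schema: { name, params: { [key]: { type, required } } }
// call:   { name, args: { [key]: value } }

const hasOwn = (obj, key) => Object.prototype.hasOwnProperty.call(obj, key);

// Returns a list of human-readable problems; an empty list means the call is valid.
function validateToolCall(schema, call) {
  const errors = [];
  if (call.name !== schema.name) errors.push(`unknown tool: ${call.name}`);

  const args = call.args ?? {};
  for (const [key, rule] of Object.entries(schema.params)) {
    if (!hasOwn(args, key)) {
      if (rule.required) errors.push(`missing required parameter: ${key}`);
      continue;
    }
    if (typeof args[key] !== rule.type) {
      errors.push(`parameter ${key} should be ${rule.type}, got ${typeof args[key]}`);
    }
  }
  // Report every argument the schema does not declare.
  for (const key of Object.keys(args)) {
    if (!hasOwn(schema.params, key)) errors.push(`unexpected parameter: ${key}`);
  }
  return errors;
}

const readFileSchema = {
  name: "readFile",
  params: {
    path: { type: "string", required: true },
    maxBytes: { type: "number", required: false },
  },
};

console.log(validateToolCall(readFileSchema, {
  name: "readFile",
  args: { maxBytes: "10", extra: 1 },
}));
// [ 'missing required parameter: path',
//   'parameter maxBytes should be number, got string',
//   'unexpected parameter: extra' ]
```

Collecting all errors instead of stopping at the first one matches the "validation of multiple arguments" theme: a model can then be judged on whether it repairs every problem in a single retry.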

🛡️ Security — 16 problems

S01 detectSqlInjection, S02 rejectXssPayload, S03 indirectInjection, S04 confidentialityGate, S05 pathTraversal, S06 maliciousPackage, S07 roleEscalation, S08 csrfTokenCheck, S09 sensitiveDataMask, S10 denialOfService, S11 obfuscatedPayload, S12 dependencyVulnerability, S13 authBypass, S14 leakedCredentials, S15 ssrfBlocking, S16 commandInjection

📖 DELG Knowledge — 16 problems

D01 delgContext, D02 apiRouting, D03 authExchange, D04 themeCustomization, D05 databaseMigrations, D06 dockerNetworking, D07 billingCredits, D08 cachingStrategy, D09 errorReporting, D10 webSockets, D11 organizationRoles, D12 stripeIntegration, D13 samlSso, D14 telemetryLogging, D15 knowledgeBase, D16 securityGuardrails

🗓️ Context Window — 12 problems

W01 retrieveDecisions, W02 handleReversals, W03 ignoreDistractions, W04 nestedContext, W05 tokenLimits, W06 recallFormat, W07 resolveConflict, W08 codeDependencies, W09 summarizeHistory, W10 safetyWindow, W11 extremeTokenBudget, W12 decisionNesting