Category Overview
The benchmark covers seven categories that target the capabilities that matter for AI-assisted development. Each category is scored either by unit tests or by rubric-based evaluation. Version 1.1 adds 20 new, harder problems.
| Category | ID | Count | Scoring | Description |
|---|---|---|---|---|
| Coding | C01–C19 | 19 | Unit Tests | Implement JavaScript functions from specifications. New hard problems: deepMerge, LRUCache, topologicalSort, debounceWithFlush, semverCompare. |
| Debugging | E01–E13 | 13 | Unit Tests | Diagnose and fix broken code. New hard problems: closure capture in a loop with async code, floating-point precision in monetary calculations, shared mutable state. |
| Refactoring | F01–F13 | 13 | Rubric | Plan and describe multi-file refactors. New hard problems: Strategy Pattern extraction, callback-to-async conversion, splitting fat modules. |
| Agent Tooling | A01–A17 | 17 | Rubric | Evaluate model tool-use capability. New hard problems: compound tools, validation of multiple arguments, dynamic schemas. |
| Security | S01–S16 | 16 | Rubric | Handle malicious code and unsafe prompt requests. New hard problems: indirect injections, payload obfuscation. |
| DELG Knowledge | D01–D16 | 16 | Rubric | Understand internal API specifications and custom code rules. New hard problems: composite APIs, version migrations. |
| Context Window | W01–W12 | 12 | Rubric | Long-conversation management. New hard problems: multiple decision reversals with nested context, extreme token budget with noise. |
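To make the "Unit Tests" scoring column concrete, here is a minimal sketch of how a unit-test-scored problem such as C19 semverCompare could be graded. The test cases, the `scoreUnitTests` helper, and the pass-fraction formula are all assumptions for illustration; the benchmark's actual harness and problem specification are not shown in this overview.

```javascript
// Hypothetical candidate solution for a problem like C19 semverCompare.
// Assumption: inputs are plain "MAJOR.MINOR.PATCH" strings; pre-release
// and build metadata are out of scope for this sketch.
function semverCompare(a, b) {
  const pa = a.split(".").map(Number);
  const pb = b.split(".").map(Number);
  for (let i = 0; i < 3; i++) {
    if (pa[i] !== pb[i]) return pa[i] < pb[i] ? -1 : 1;
  }
  return 0;
}

// Hypothetical test cases, each written as [arguments, expected result].
const cases = [
  [["1.0.0", "1.0.0"], 0],
  [["1.2.0", "1.10.0"], -1], // numeric, not lexicographic, comparison
  [["2.0.0", "1.9.9"], 1],
];

// Hypothetical scorer: the fraction of cases that pass.
function scoreUnitTests(fn, testCases) {
  let passed = 0;
  for (const [args, expected] of testCases) {
    try {
      if (fn(...args) === expected) passed++;
    } catch (err) {
      // A thrown error simply counts as a failed case.
    }
  }
  return passed / testCases.length;
}

console.log(scoreUnitTests(semverCompare, cases)); // 1
```

Rubric-scored categories have no equivalent executable check, which is presumably why they rely on graded criteria instead.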
Problem Index
Every problem in the benchmark, organized by category. Problems added in v1.1 carry a "NEW in v1.1" tag.
💻 Coding — 19 problems
C01 formatCredits, C02 isWithinLimit, C03 parseSSEChunk, C04 routeToolCall, C05 diffToMonacoChanges, C06 validateExchangeToken, C07 truncateContext, C08 buildDockerResources, C09 generatePRBody, C10 aggregateTelemetry, C11 asyncRetry, C12 parseWithFallback, C13 ConnectionPool, C14 extractPatterns, C15 deepMerge, C16 LRUCache, C17 topologicalSort, C18 debounceWithFlush, C19 semverCompare
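As an illustration of the kind of function the harder Coding problems ask for, the following is a minimal sketch of an LRU cache in the spirit of C16 LRUCache. The method names (`get`, `set`), the constructor argument, and the eviction behavior are assumptions; the benchmark's real specification may differ.

```javascript
// Hypothetical LRU cache sketch. It relies on the fact that a Map
// iterates its keys in insertion order, so the first key is the
// least recently used one.
class LRUCache {
  constructor(capacity) {
    this.capacity = capacity;
    this.map = new Map();
  }

  get(key) {
    if (!this.map.has(key)) return undefined;
    const value = this.map.get(key);
    // Re-insert the entry to mark it as most recently used.
    this.map.delete(key);
    this.map.set(key, value);
    return value;
  }

  set(key, value) {
    if (this.map.has(key)) this.map.delete(key);
    this.map.set(key, value);
    if (this.map.size > this.capacity) {
      // Evict the least recently used entry (first key in the Map).
      const oldest = this.map.keys().next().value;
      this.map.delete(oldest);
    }
  }
}

const cache = new LRUCache(2);
cache.set("a", 1);
cache.set("b", 2);
cache.get("a");    // touches "a", so "b" is now the oldest entry
cache.set("c", 3); // evicts "b"
console.log(cache.get("b")); // undefined
```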
🐛 Debugging — 13 problems
E01 chunkArray, E02 validateEmail, E03 fetchData, E04 raceCondition, E05 typeCoercion, E06 mutationBug, E07 regexError, E08 closureLeak, E09 floatingPoint, E10 infiniteLoop, E11 closureLoop, E12 monetaryAccum, E13 referenceMutation
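The floating-point bug class named by E09 floatingPoint and E12 monetaryAccum can be sketched as follows. Both functions are hypothetical and are not the benchmark's actual problem code. The fix shown, summing integer cents, is one common approach and assumes every amount has at most two decimal places.

```javascript
// Buggy version: decimal amounts are summed as binary floats,
// so rounding error accumulates in the total.
function totalBuggy(amounts) {
  return amounts.reduce((sum, x) => sum + x, 0);
}

// One possible fix: convert each amount to integer cents, sum the
// integers exactly, and convert back to a decimal only once.
function totalFixed(amounts) {
  const cents = amounts.reduce((sum, x) => sum + Math.round(x * 100), 0);
  return cents / 100;
}

console.log(totalBuggy([0.1, 0.2]) === 0.3); // false
console.log(totalFixed([0.1, 0.2]) === 0.3); // true
```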
🔄 Refactoring — 13 problems
F01 extractStrategy, F02 callbackToPromise, F03 splitFatModule, F04 introduceBuilder, F05 liftState, F06 flattenNestedLoops, F07 polymorphicReplace, F08 extractMiddleware, F09 parameterizeQueries, F10 migrateToAsyncAwait, F11 decoupleConfig, F12 extractCustomHook, F13 encapsulateCollection
🔌 Agent Tooling — 17 problems
A01 callSingleTool, A02 parallelTools, A03 sequentialTools, A04 missingParams, A05 toolFallback, A06 excessiveCalls, A07 parseJSONParams, A08 confirmAction, A09 emptyResponse, A10 invalidSchema, A11 compositeTool, A12 schemaMigration, A13 rateLimitRetry, A14 nestedSchemas, A15 dynamicTooling, A16 authRecovery, A17 ambiguousTools
🛡️ Security — 16 problems
S01 detectSqlInjection, S02 rejectXssPayload, S03 indirectInjection, S04 confidentialityGate, S05 pathTraversal, S06 maliciousPackage, S07 roleEscalation, S08 csrfTokenCheck, S09 sensitiveDataMask, S10 denialOfService, S11 obfuscatedPayload, S12 dependencyVulnerability, S13 authBypass, S14 leakedCredentials, S15 ssrfBlocking, S16 commandInjection
📖 DELG Knowledge — 16 problems
D01 delgContext, D02 apiRouting, D03 authExchange, D04 themeCustomization, D05 databaseMigrations, D06 dockerNetworking, D07 billingCredits, D08 cachingStrategy, D09 errorReporting, D10 webSockets, D11 organizationRoles, D12 stripeIntegration, D13 samlSso, D14 telemetryLogging, D15 knowledgeBase, D16 securityGuardrails
🗓️ Context Window — 12 problems
W01 retrieveDecisions, W02 handleReversals, W03 ignoreDistractions, W04 nestedContext, W05 tokenLimits, W06 recallFormat, W07 resolveConflict, W08 codeDependencies, W09 summarizeHistory, W10 safetyWindow, W11 extremeTokenBudget, W12 decisionNesting