Powered by Stock42
This repository is an open benchmark collection designed to test how different LLM providers, coding agents, and models solve the same creative coding challenges.
The goal is to generate real, comparable results that help the developer community make better technical decisions when choosing LLMs, agents, and AI-assisted development workflows.
This is not a synthetic leaderboard.
This is a practical repository of real outputs generated from real prompts.
LLMs are changing extremely fast. New models, agents, IDE integrations, CLI tools, and coding workflows appear constantly.
This repository exists to answer practical questions:
- Which models generate better working code?
- Which agents produce cleaner implementations?
- Which LLMs follow constraints more reliably?
- Which tools create better visual and interactive results?
- Which models require fewer manual fixes?
- Which outputs are actually useful for developers?
Each challenge in this repository is based on a fixed prompt.
Anyone can run the same prompt using a different agent or LLM and submit the result.
The repository is organized into two main areas:
.
├── promptings/
│   ├── triple-pendulum.md
│   ├── 1kg-block-bounces.md
│   ├── balls-fall.md
│   └── tetris-3d.md
├── results/
│   └── $userNameGithub/
│       └── $agentName/
│           └── $llm/
│               └── $challengeName/
│                   ├── index.html
│                   └── README.md
└── README.md
All benchmark prompts live inside the promptings directory.
Each prompting is a standalone challenge file.
Current promptings:
| Challenge | Prompt File | Description |
|---|---|---|
| Triple Pendulum | promptings/triple-pendulum.md | A triple pendulum swings into chaos and paints glowing trails with its tip. |
| 1kg Block Bounces | promptings/1kg-block-bounces.md | A 1 kg block bounces between a wall and a heavy block while collisions reveal pi-related counts. |
| Balls Fall | promptings/balls-fall.md | Balls fall through a Galton board and accumulate into a bell-curve distribution. |
| Tetris 3D | promptings/tetris-3d.md | A playable 3D-style Tetris game with generated pixel sound, explosions, levels, and localStorage scoring. |
More promptings will be added over time.
The objective is to build a large collection of practical LLM coding challenges.
Every submitted result must follow this path pattern:
results/$userNameGithub/$agentName/$llm/$challengeName/
Example:
results/cesarcasas/opencode/deepseek-v4-pro/triple-pendulum/
Inside each result folder, include at least:
index.html
README.md
Example:
results/cesarcasas/opencode/deepseek-v4-pro/triple-pendulum/
├── index.html
└── README.md
The index.html file must contain the generated solution.
For single-file HTML challenges, the generated output must remain a single self-contained HTML file.
Use lowercase folder names.
Use hyphens instead of spaces.
Recommended format:
results/github-username/agent-name/model-name/challenge-name/
Valid examples:
results/cesarcasas/opencode/deepseek-v4-pro/triple-pendulum/
results/devexample/cursor/claude-sonnet-4/triple-pendulum/
results/janedoe/chatgpt/gpt-5.5-thinking/triple-pendulum/
results/alexdev/windsurf/qwen3-coder/triple-pendulum/
Invalid examples:
results/opencode/deepseek-v4-pro/
results/cesarcasas/deepseek-v4-pro/triple-pendulum/
results/cesarcasas/OpenCode/DeepSeek V4 Pro/triple-pendulum/
results/cesarcasas/opencode/deepseek-v4-pro/
The complete structure must always include:
results / GitHub username / agent name / LLM name / challenge name
Available challenges:
triple-pendulum
1kg-block-bounces
balls-fall
tetris-3d
Prompt files:
promptings/triple-pendulum.md
promptings/1kg-block-bounces.md
promptings/balls-fall.md
promptings/tetris-3d.md
Challenge objectives:
- triple-pendulum: A triple pendulum swings into chaos and paints glowing trails with its tip.
- 1kg-block-bounces: A 1 kg block bounces between a wall and a 100,000 kg block, with elastic collisions counted and interpreted honestly against pi.
- balls-fall: Balls fall through a grid of pegs and pile into bins, forming a bell curve with live histogram statistics.
- tetris-3d: A playable 3D-style Tetris game with pixel sound effects, explosive line clears, 10 levels, and persistent localStorage scoring.
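The path rules above can be checked mechanically before opening a PR. The following is a minimal Python sketch and is not part of this repository: the function name check_result_path, the KNOWN_CHALLENGES set, and the exact character rule (lowercase letters, digits, dots, and hyphens) are assumptions inferred from the valid and invalid examples above.

```python
import re

# Challenge names copied from the list above; extend as new promptings are added.
KNOWN_CHALLENGES = {"triple-pendulum", "1kg-block-bounces", "balls-fall", "tetris-3d"}

# Assumption: a segment may contain lowercase letters, digits, dots, and hyphens,
# which covers names such as "gpt-5.5-thinking". The README itself only says
# "lowercase" and "hyphens instead of spaces".
SEGMENT = re.compile(r"[a-z0-9]+(?:[.-][a-z0-9]+)*")


def check_result_path(path: str) -> list[str]:
    """Return the problems found in a result path (an empty list means it looks valid)."""
    parts = path.strip("/").split("/")
    if len(parts) != 5 or parts[0] != "results":
        return ["expected results/<user>/<agent>/<llm>/<challenge>/"]
    problems = []
    labels = ("GitHub username", "agent name", "LLM name", "challenge name")
    for label, segment in zip(labels, parts[1:]):
        if not SEGMENT.fullmatch(segment):
            problems.append(f"{label} {segment!r} must be lowercase and hyphenated")
    if parts[4] not in KNOWN_CHALLENGES:
        problems.append(f"unknown challenge {parts[4]!r}")
    return problems
```

Calling check_result_path("results/cesarcasas/opencode/deepseek-v4-pro/triple-pendulum/") returns an empty list, while a path with a missing segment or an uppercase folder name returns one or more messages describing the problem.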
The expected output is a single self-contained HTML file using:
- HTML
- CSS
- Vanilla JavaScript
- Canvas
No external libraries, frameworks, CDNs, or assets are allowed.
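One way to spot-check the "no external libraries, CDNs, or assets" rule is to scan the generated index.html for remote references. The sketch below is a heuristic using Python's standard html.parser module; the names ExternalRefFinder and find_external_refs are hypothetical, and the check is deliberately narrow: it only looks at src attributes and at href on link tags, so it will not catch resources loaded through fetch(), CSS url() or @import, or dynamically created elements.

```python
from html.parser import HTMLParser


class ExternalRefFinder(HTMLParser):
    """Collect attribute values that load a resource from a remote origin."""

    def __init__(self):
        super().__init__()
        self.external = []

    def handle_starttag(self, tag, attrs):
        for name, value in attrs:
            if not value:
                continue
            # src on any tag (script, img, audio, ...) and href on <link> load assets;
            # href on <a> is only a hyperlink, so it is ignored here.
            loads_asset = name == "src" or (tag == "link" and name == "href")
            if loads_asset and value.lower().startswith(("http://", "https://", "//")):
                self.external.append((tag, value))


def find_external_refs(html: str) -> list[tuple[str, str]]:
    """Return (tag, url) pairs for remote assets referenced in the given HTML text."""
    finder = ExternalRefFinder()
    finder.feed(html)
    finder.close()
    return finder.external
```

An empty result does not prove the file is self-contained; it only means none of the obvious remote references were found, so a manual review is still worthwhile.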
Each submitted result must include a local README.md inside its result folder.
Use this template:
# Test Result
## Challenge
triple-pendulum
## Contributor
GitHub username: your-github-username
## Agent
agent-name
## LLM
model-name
## Prompt File
promptings/triple-pendulum.md
## Prompt Version
v1
## Date
YYYY-MM-DD
## Generation Process
Generated in one shot.
Or:
Generated after multiple iterations.
## Manual Changes
No manual changes.
Or list the changes:
- Fixed a syntax error.
- Adjusted canvas resize behavior.
- Improved button event handling.
## Notes
Short notes about the quality of the result, issues found, visual quality, physics quality, or performance.
When submitting a result:
- Use an existing prompt from the promptings directory.
- Do not modify the original prompt.
- Save the generated result under the required results/$userNameGithub/$agentName/$llm/$challengeName/ structure.
- Include the generated output file.
- Include a local result README.md.
- Document whether the result was generated in one shot or after multiple iterations.
- Document any manual changes.
- Do not submit private API keys, provider tokens, or credentials.
- Do not overwrite another contributor’s result.
- Do not rename existing result folders unless fixing a naming convention issue.
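A contributor could verify that a result README.md contains every section from the template above before submitting. This is a minimal Python sketch under stated assumptions: the helper name missing_sections is hypothetical, and the section list is copied from the template's "##" headings rather than from any official validator in this repository.

```python
import re

# Headings taken from the result README template above.
REQUIRED_SECTIONS = [
    "Challenge", "Contributor", "Agent", "LLM", "Prompt File",
    "Prompt Version", "Date", "Generation Process", "Manual Changes", "Notes",
]


def missing_sections(readme_text: str) -> list[str]:
    """Return the template sections that have no matching '## ' heading in the text."""
    found = {
        match.group(1).strip()
        for match in re.finditer(r"^##\s+(.+)$", readme_text, re.MULTILINE)
    }
    return [name for name in REQUIRED_SECTIONS if name not in found]
```

The function only checks that the headings exist. It does not check that the content under each heading is filled in, so the one-shot versus multiple-iterations statement and the manual changes list still need a human look.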
The preferred benchmark mode is:
zero manual changes
However, manual fixes are allowed if they are documented.
Manual changes must be listed in the local result README.md.
Examples:
## Manual Changes
- Fixed a missing closing brace in JavaScript.
- Reconnected a broken UI control.
- Adjusted canvas scaling for high-DPI displays.
If there were no manual changes, write:
## Manual Changes
No manual changes.
When opening a PR, include:
## Challenge
triple-pendulum
## Contributor
GitHub username: your-github-username
## Agent
opencode
## LLM
deepseek-v4-pro
## Result Path
results/your-github-username/opencode/deepseek-v4-pro/triple-pendulum/
## Prompt File
promptings/triple-pendulum.md
## Prompt Version
v1
## Generation Process
Generated in one shot.
## Manual Changes
No manual changes.
## Notes
Short subjective evaluation of the result.
Each challenge can define its own scoring system.
However, all results should generally be reviewed using these criteria:
| Category | Description |
|---|---|
| Prompt compliance | Did the model follow the instructions? |
| Correctness | Does the generated output work? |
| Completeness | Did it implement all required features? |
| Code quality | Is the code readable and maintainable? |
| Visual quality | Is the result polished and impressive? |
| UX quality | Are controls and interactions usable? |
| Performance | Does it run smoothly? |
| Creativity | Did the model produce something memorable? |
| Manual fixes | Did it require human correction? |
Reviewers can use this optional scorecard:
| Category | Max Score |
|---|---|
| Prompt compliance | 15 |
| Functional correctness | 15 |
| Feature completeness | 15 |
| Code quality | 10 |
| Visual quality | 15 |
| UX and controls | 10 |
| Performance | 10 |
| Creativity | 10 |
| Total | 100 |
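For reviewers who want to tally the optional scorecard consistently, the arithmetic can be sketched in a few lines of Python. The maximums below are copied from the table above; the function name total_score and the choice to reject missing or out-of-range values are assumptions, not an official scoring tool.

```python
# Per-category maximums from the optional scorecard above (they sum to 100).
MAX_SCORES = {
    "Prompt compliance": 15,
    "Functional correctness": 15,
    "Feature completeness": 15,
    "Code quality": 10,
    "Visual quality": 15,
    "UX and controls": 10,
    "Performance": 10,
    "Creativity": 10,
}


def total_score(scores: dict[str, int]) -> int:
    """Sum a review's category scores after checking each against its maximum."""
    for category, maximum in MAX_SCORES.items():
        value = scores.get(category)
        if value is None:
            raise ValueError(f"missing score for {category!r}")
        if not 0 <= value <= maximum:
            raise ValueError(f"{category!r} must be between 0 and {maximum}")
    return sum(scores[category] for category in MAX_SCORES)
```

A review that awards the maximum in every category totals 100. Challenges that define their own scoring system would need their own table of maximums.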
For single-file HTML challenges, open the index.html file directly in a browser.
Example:
results/cesarcasas/opencode/deepseek-v4-pro/triple-pendulum/index.html
No build step should be required.
No package manager should be required.
No server should be required.
Stock42 is an AI-First software company focused on building real products, developer tools, agentic platforms, and AI-assisted workflows.
We believe the developer community needs practical, transparent, reproducible examples to understand how different LLMs and agents perform in real-world development tasks.
This repository is part of that effort.
The objective is not to promote a single provider.
The objective is to help developers compare real outputs, understand trade-offs, and make better decisions.
MIT License.
Generated files are submitted for benchmarking, educational, and comparative purposes.
This repository should become a practical reference for comparing LLM coding capabilities across many types of challenges.
Each prompt is a test.
Each result is evidence.
See all results in RESULTS.md.
Each PR helps the community understand what current AI coding tools can actually do.