
stock42/test-llm-coding

LLM Creative Coding Challenges

Powered by Stock42

This repository is an open benchmark collection designed to test how different LLM providers, coding agents, and models solve the same creative coding challenges.

The goal is to generate real, comparable results that help the developer community make better technical decisions when choosing LLMs, agents, and AI-assisted development workflows.

This is not a synthetic leaderboard.

This is a practical repository of real outputs generated from real prompts.


Purpose

LLMs are changing extremely fast. New models, agents, IDE integrations, CLI tools, and coding workflows appear constantly.

This repository exists to answer practical questions:

  • Which models generate better working code?
  • Which agents produce cleaner implementations?
  • Which LLMs follow constraints more reliably?
  • Which tools create better visual and interactive results?
  • Which models require fewer manual fixes?
  • Which outputs are actually useful for developers?

Each challenge in this repository is based on a fixed prompt.

Anyone can run the same prompt using a different agent or LLM and submit the result.


Repository Structure

The repository is organized into two main areas:

.
├── promptings/
│   ├── triple-pendulum.md
│   ├── 1kg-block-bounces.md
│   ├── balls-fall.md
│   └── tetris-3d.md
├── results/
│   └── $userNameGithub/
│       └── $agentName/
│           └── $llm/
│               └── $challengeName/
│                   ├── index.html
│                   └── README.md
└── README.md

Promptings

All benchmark prompts live inside the promptings directory.

Each prompting is a standalone challenge file.

Current promptings:

| Challenge | Prompt File | Description |
| --- | --- | --- |
| Triple Pendulum | promptings/triple-pendulum.md | A triple pendulum swings into chaos and paints glowing trails with its tip. |
| 1kg Block Bounces | promptings/1kg-block-bounces.md | A 1 kg block bounces between a wall and a heavy block while collisions reveal pi-related counts. |
| Balls Fall | promptings/balls-fall.md | Balls fall through a Galton board and accumulate into a bell-curve distribution. |
| Tetris 3D | promptings/tetris-3d.md | A playable 3D-style Tetris game with generated pixel sound, explosions, levels, and localStorage scoring. |

More promptings will be added over time.

The objective is to build a large collection of practical LLM coding challenges.


Results Structure

Every submitted result must follow this path pattern:

results/$userNameGithub/$agentName/$llm/$challengeName/

Example:

results/cesarcasas/opencode/deepseek-v4-pro/triple-pendulum/

Inside each result folder, include at least:

index.html
README.md

Example:

results/cesarcasas/opencode/deepseek-v4-pro/triple-pendulum/
├── index.html
└── README.md

The index.html file must contain the generated solution.

For single-file HTML challenges, the generated output must remain a single self-contained HTML file.
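
The folder requirement above can be checked before opening a PR. The following is a minimal sketch, not an official tool of this repository; the function name `missing_files` and the constant `REQUIRED_FILES` are hypothetical names chosen for illustration.

```python
from pathlib import Path

# The two files every result folder must contain, per the section above.
REQUIRED_FILES = ("index.html", "README.md")


def missing_files(result_dir):
    """Return the required file names that are absent from result_dir."""
    folder = Path(result_dir)
    return [name for name in REQUIRED_FILES if not (folder / name).is_file()]
```

An empty list means both files are present. The check only looks at file names, not at whether index.html actually contains a working solution.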


Naming Rules

Use lowercase folder names.

Use hyphens instead of spaces.

Recommended format:

results/github-username/agent-name/model-name/challenge-name/

Valid examples:

results/cesarcasas/opencode/deepseek-v4-pro/triple-pendulum/
results/devexample/cursor/claude-sonnet-4/triple-pendulum/
results/janedoe/chatgpt/gpt-5.5-thinking/triple-pendulum/
results/alexdev/windsurf/qwen3-coder/triple-pendulum/

Invalid examples (each one is missing a path segment or uses uppercase letters or spaces):

results/opencode/deepseek-v4-pro/
results/cesarcasas/deepseek-v4-pro/triple-pendulum/
results/cesarcasas/OpenCode/DeepSeek V4 Pro/triple-pendulum/
results/cesarcasas/opencode/deepseek-v4-pro/

The complete structure must always include:

results / GitHub username / agent name / LLM name / challenge name
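
The naming rules can be expressed as a small validator. This is a sketch under assumptions: the helper `check_result_path` is hypothetical, and the segment pattern (lowercase letters and digits joined by single hyphens or dots) is inferred from the valid examples above, such as gpt-5.5-thinking, rather than stated by the repository.

```python
import re

# Assumption: a segment is lowercase letters/digits separated by "-" or ".".
SEGMENT = re.compile(r"^[a-z0-9]+(?:[.-][a-z0-9]+)*$")
LABELS = ("username", "agent", "llm", "challenge")


def check_result_path(path):
    """Return a list of problems found in a result path (empty if none)."""
    parts = path.strip("/").split("/")
    if len(parts) != 5 or parts[0] != "results":
        return ["expected results/<username>/<agent>/<llm>/<challenge>/"]
    problems = []
    for label, segment in zip(LABELS, parts[1:]):
        if not SEGMENT.match(segment):
            problems.append(f"{label} segment {segment!r} is not lowercase-hyphenated")
    return problems
```

Run against the examples above, the four valid paths return an empty list, while each invalid path returns at least one problem: three of them have the wrong number of segments, and the OpenCode/DeepSeek V4 Pro path fails on both the agent and the LLM segment.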

Current Challenges

Available challenges:

triple-pendulum
1kg-block-bounces
balls-fall
tetris-3d

Prompt files:

promptings/triple-pendulum.md
promptings/1kg-block-bounces.md
promptings/balls-fall.md
promptings/tetris-3d.md

Challenge objectives:

  • triple-pendulum: A triple pendulum swings into chaos and paints glowing trails with its tip.
  • 1kg-block-bounces: A 1 kg block bounces between a wall and a 100,000 kg block, with elastic collisions counted and interpreted honestly against pi.
  • balls-fall: Balls fall through a grid of pegs and pile into bins, forming a bell curve with live histogram statistics.
  • tetris-3d: A playable 3D-style Tetris game with pixel sound effects, explosive line clears, 10 levels, and persistent localStorage scoring.

The expected output is a single self-contained HTML file using:

  • HTML
  • CSS
  • Vanilla JavaScript
  • Canvas

No external libraries, frameworks, CDNs, or assets are allowed.
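
One rough way to spot violations of the no-external-assets rule is to scan the generated HTML for tags that load a resource from somewhere else. The sketch below uses only Python's standard `html.parser`; the class `ExternalRefFinder`, the function `find_external_refs`, and the tag list are illustrative assumptions, and the check is a heuristic: it does not detect CSS `url(...)` references or network requests made from JavaScript.

```python
from html.parser import HTMLParser

# Assumption: these tag/attribute pairs cover the common ways to load a file.
LOADING_ATTRS = {
    "script": "src", "link": "href", "img": "src", "iframe": "src",
    "audio": "src", "video": "src", "source": "src",
}


class ExternalRefFinder(HTMLParser):
    """Collect (tag, value) pairs for resources loaded from outside the file."""

    def __init__(self):
        super().__init__()
        self.external = []

    def handle_starttag(self, tag, attrs):
        wanted = LOADING_ATTRS.get(tag)
        for name, value in attrs:
            # Inline data: URIs stay inside the file, so they are allowed.
            if name == wanted and value and not value.startswith("data:"):
                self.external.append((tag, value))


def find_external_refs(html_text):
    """Return every external resource reference found in html_text."""
    finder = ExternalRefFinder()
    finder.feed(html_text)
    finder.close()
    return finder.external
```

An empty result is a necessary but not sufficient sign that the file is self-contained; a manual look at the source is still advisable.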


Local Result README

Each submitted result must include a local README.md inside its result folder.

Use this template:

# Test Result

## Challenge

triple-pendulum

## Contributor

GitHub username: your-github-username

## Agent

agent-name

## LLM

model-name

## Prompt File

promptings/triple-pendulum.md

## Prompt Version

v1

## Date

YYYY-MM-DD

## Generation Process

Generated in one shot.

Or:

Generated after multiple iterations.

## Manual Changes

No manual changes.

Or list the changes:

- Fixed a syntax error.
- Adjusted canvas resize behavior.
- Improved button event handling.

## Notes

Short notes about the quality of the result, issues found, visual quality, physics quality, or performance.
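
Filling in the template by hand is fine, but it can also be generated. The following is a minimal sketch that renders the template above from a few fields; `ResultInfo` and `render_readme` are hypothetical names, and the prompt file path is assumed to follow the `promptings/<challenge>.md` pattern used by the current challenges.

```python
from dataclasses import dataclass, field
from datetime import date


@dataclass
class ResultInfo:
    challenge: str
    username: str
    agent: str
    llm: str
    one_shot: bool = True
    manual_changes: list = field(default_factory=list)
    notes: str = ""
    prompt_version: str = "v1"
    run_date: date = field(default_factory=date.today)


def render_readme(info):
    """Render the local result README text for one submission."""
    process = ("Generated in one shot." if info.one_shot
               else "Generated after multiple iterations.")
    changes = "\n".join(f"- {c}" for c in info.manual_changes) or "No manual changes."
    sections = [
        ("Challenge", info.challenge),
        ("Contributor", f"GitHub username: {info.username}"),
        ("Agent", info.agent),
        ("LLM", info.llm),
        # Assumption: prompt files are named after the challenge.
        ("Prompt File", f"promptings/{info.challenge}.md"),
        ("Prompt Version", info.prompt_version),
        ("Date", info.run_date.isoformat()),
        ("Generation Process", process),
        ("Manual Changes", changes),
        ("Notes", info.notes),
    ]
    body = "\n\n".join(f"## {title}\n\n{text}" for title, text in sections)
    return f"# Test Result\n\n{body}\n"
```

For example, `render_readme(ResultInfo("triple-pendulum", "your-github-username", "opencode", "deepseek-v4-pro"))` produces a README with today's date and "No manual changes." in the Manual Changes section.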

Contribution Rules

When submitting a result:

  1. Use an existing prompt from the promptings directory.
  2. Do not modify the original prompt.
  3. Save the generated result under the required results/$userNameGithub/$agentName/$llm/$challengeName/ structure.
  4. Include the generated output file.
  5. Include a local result README.md.
  6. Document whether the result was generated in one shot or after multiple iterations.
  7. Document any manual changes.
  8. Do not submit private API keys, provider tokens, or credentials.
  9. Do not overwrite another contributor’s result.
  10. Do not rename existing result folders unless fixing a naming convention issue.

Manual Fix Policy

The preferred benchmark mode is:

zero manual changes

However, manual fixes are allowed if they are documented.

Manual changes must be listed in the local result README.md.

Examples:

## Manual Changes

- Fixed a missing closing brace in JavaScript.
- Reconnected a broken UI control.
- Adjusted canvas scaling for high-DPI displays.

If there were no manual changes, write:

## Manual Changes

No manual changes.
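
Because the Manual Changes section has a fixed heading, a reviewer could extract it automatically to separate zero-manual-change results from the rest. This is a sketch only: the function `manual_changes` is a hypothetical helper, and it assumes the README follows the template exactly (a `## Manual Changes` heading followed either by the sentence "No manual changes." or by one change per line).

```python
import re

# Capture everything between the heading and the next "## " heading (or end of text).
SECTION = re.compile(r"^## Manual Changes\s*\n(.*?)(?=^## |\Z)", re.S | re.M)


def manual_changes(readme_text):
    """Return the documented manual changes; an empty list means none."""
    match = SECTION.search(readme_text)
    if match is None:
        raise ValueError("README has no '## Manual Changes' section")
    body = match.group(1).strip()
    if body == "No manual changes.":
        return []
    # Strip the leading "- " from bullet lines and skip blank lines.
    return [line.strip().removeprefix("- ").strip()
            for line in body.splitlines() if line.strip()]
```

A result is in the preferred benchmark mode when the returned list is empty. `str.removeprefix` requires Python 3.9 or newer.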

Pull Request Guidelines

When opening a PR, include:

## Challenge

triple-pendulum

## Contributor

GitHub username: your-github-username

## Agent

opencode

## LLM

deepseek-v4-pro

## Result Path

results/your-github-username/opencode/deepseek-v4-pro/triple-pendulum/

## Prompt File

promptings/triple-pendulum.md

## Prompt Version

v1

## Generation Process

Generated in one shot.

## Manual Changes

No manual changes.

## Notes

Short subjective evaluation of the result.

Evaluation Criteria

Each challenge can define its own scoring system.

However, all results should generally be reviewed using these criteria:

| Category | Description |
| --- | --- |
| Prompt compliance | Did the model follow the instructions? |
| Correctness | Does the generated output work? |
| Completeness | Did it implement all required features? |
| Code quality | Is the code readable and maintainable? |
| Visual quality | Is the result polished and impressive? |
| UX quality | Are controls and interactions usable? |
| Performance | Does it run smoothly? |
| Creativity | Did the model produce something memorable? |
| Manual fixes | Did it require human correction? |

Suggested Review Format

Reviewers can use this optional scorecard:

| Category | Max Score |
| --- | --- |
| Prompt compliance | 15 |
| Functional correctness | 15 |
| Feature completeness | 15 |
| Code quality | 10 |
| Visual quality | 15 |
| UX and controls | 10 |
| Performance | 10 |
| Creativity | 10 |
| Total | 100 |
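
The scorecard arithmetic is simple enough to automate. The sketch below totals a review and rejects scores outside each category's maximum; the maxima come from the scorecard above, while the names `MAX_SCORES` and `total_score` are hypothetical.

```python
# Maximum score per category, copied from the optional scorecard.
MAX_SCORES = {
    "Prompt compliance": 15,
    "Functional correctness": 15,
    "Feature completeness": 15,
    "Code quality": 10,
    "Visual quality": 15,
    "UX and controls": 10,
    "Performance": 10,
    "Creativity": 10,
}


def total_score(scores):
    """Validate a review against MAX_SCORES and return its total out of 100."""
    missing = set(MAX_SCORES) - set(scores)
    if missing:
        raise ValueError(f"missing categories: {sorted(missing)}")
    for category, value in scores.items():
        if category not in MAX_SCORES:
            raise ValueError(f"unknown category: {category}")
        if not 0 <= value <= MAX_SCORES[category]:
            raise ValueError(f"{category} must be between 0 and {MAX_SCORES[category]}")
    return sum(scores.values())
```

Requiring every category keeps totals comparable between reviewers, since a skipped category would otherwise silently count as zero.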

Running a Result

For single-file HTML challenges, open the index.html file directly in a browser.

Example:

results/cesarcasas/opencode/deepseek-v4-pro/triple-pendulum/index.html

No build step should be required.

No package manager should be required.

No server should be required.


Why Stock42 Supports This

Stock42 is an AI-first software company focused on building real products, developer tools, agentic platforms, and AI-assisted workflows.

We believe the developer community needs practical, transparent, reproducible examples to understand how different LLMs and agents perform in real-world development tasks.

This repository is part of that effort.

The objective is not to promote a single provider.

The objective is to help developers compare real outputs, understand trade-offs, and make better decisions.


License

MIT License.

Generated files are submitted for benchmarking, educational, and comparative purposes.


Final Objective

This repository should become a practical reference for comparing LLM coding capabilities across many types of challenges.

Each prompt is a test.

Each result is evidence. See all results in RESULTS.md.

Each PR helps the community understand what current AI coding tools can actually do.
