Here are some "real-world" agentic benchmarks with OpenCode (an open-source alternative to Claude Code) and Pi-Coding-Agent (the core harness of OpenClaw).
What to test?
I created a small Bun/Svelte application and asked the LLM to make some changes that require using a custom skill and a custom CLI tool.