Show HN: Magnitude – open-source, AI-native test framework for web apps (github.com)

Hey HN, Anders and Tom here - we’ve been building an end-to-end testing framework powered by visual LLM agents to replace traditional web testing.

We know there's a lot of noise about different browser agents. If you've tried any of them, you know they're slow, expensive, and inconsistent. That's why we built an agent specifically for running test cases and optimized it just for that:

- Pure vision instead of the error-prone "set-of-marks" system (the colorful boxes you see in browser-use, for example)

- Use a tiny VLM (Moondream) instead of OpenAI/Anthropic computer use for dramatically faster and cheaper execution

- Use two agents: one for planning and adapting test cases, and one for executing them quickly and consistently.

The idea is that the planner builds up a general plan, which the executor runs. We can save this plan and re-run it with only the executor for quick, cheap, and consistent runs. When something goes wrong, it can kick back out to the planner agent and re-adjust the test.
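Roughly, the control flow looks something like this (a simplified sketch, not the real API; execute and planWithVisualAgent are stand-ins):

    // Simplified sketch of the planner/executor split -- not the real API.
    type Action =
      | { kind: "click"; x: number; y: number }
      | { kind: "type"; text: string };

    interface CachedStep {
      step: string;      // natural-language step from the test case
      actions: Action[]; // concrete actions the executor can replay
    }

    declare function execute(actions: Action[]): Promise<void>;            // fast executor path
    declare function planWithVisualAgent(step: string): Promise<Action[]>; // slower planner path

    async function runStep(step: string, cached?: CachedStep): Promise<CachedStep> {
      if (cached) {
        try {
          await execute(cached.actions); // quick, cheap, consistent replay
          return cached;
        } catch {
          // Something on the page changed -- kick back out to the planner.
        }
      }
      const actions = await planWithVisualAgent(step);
      await execute(actions);
      return { step, actions }; // saved for the next run
    }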

It’s completely open source. Would love to have more people try it out and tell us how we can make it great.

Repo: https://github.com/magnitudedev/magnitude

NitpickLawyer 3 days ago

> The idea is the planner builds up a general plan which the executor runs. We can save this plan and re-run it with only the executor for quick, cheap, and consistent runs. When something goes wrong, it can kick back out to the planner agent and re-adjust the test.

I've recently been thinking about testing/QA with VLMs + LLMs. One area I haven't seen explored (but which should 100% be feasible) is to have the first run be LLM + VLM, and then have the LLM(s?) write repeatable "cheap" tests with traditional libraries (Playwright, Puppeteer, etc.). On every run you do the "cheap" traditional checks; if any fail, go with the LLM + VLM again and see what broke, and only fail the test if both fail. Makes sense?
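Something like this, roughly (a sketch; agentCheck stands in for whatever LLM + VLM verification you'd plug in):

    import { chromium } from "playwright";

    // Stand-in for the LLM + VLM verification pass.
    declare function agentCheck(description: string, url: string): Promise<boolean>;

    // Cheap deterministic check first; fall back to the agent, and only
    // fail if both fail.
    async function hybridCheck(url: string, cheapSelector: string, description: string): Promise<boolean> {
      const browser = await chromium.launch();
      const page = await browser.newPage();
      try {
        await page.goto(url);
        await page.waitForSelector(cheapSelector, { timeout: 5_000 });
        return true; // cheap check passed, no LLM spend
      } catch {
        // Selector broke -- maybe the markup changed, maybe the feature did.
        return agentCheck(description, url);
      } finally {
        await browser.close();
      }
    }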

o1o1o1 3 days ago

Thanks for sharing, this looks interesting.

However, I do not see a big advantage over Cypress tests.

The article mentions shortcomings of Cypress (and Playwright):

> They start a dev server with bootstrapping code to load the component and/or setup code you want, which limits their ability to handle complex enterprise applications that might have OAuth or a complex build pipeline.

The simple solution is to containerise the whole application (including whatever OAuth provider is used), which lets you launch the whole thing and run the tests against it. Most apps (especially in enterprise) should already be containerised anyway, so most of the time we can just go ahead and run any tests against them.
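With Playwright, for example, the test runner itself can bring the containerised stack up first (a sketch; the compose setup, port, and timeout are assumptions about your app):

    // playwright.config.ts
    import { defineConfig } from "@playwright/test";

    export default defineConfig({
      use: { baseURL: "http://localhost:8080" }, // wherever the containerised app is exposed
      webServer: {
        command: "docker compose up --build",    // app + OAuth provider + any other deps
        url: "http://localhost:8080",            // wait until this responds before running tests
        timeout: 180_000,                        // containers can take a while to boot
        reuseExistingServer: true,
      },
    });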

How is SafeTest better than that when my goal is to test my application in a real-world scenario?

SparkyMcUnicorn 3 days ago

This is pretty much exactly what I was going to build. It's missing a few things, so I'll either be contributing or forking this in the future.

I'll need a way to extract data as part of the tests, like screenshots and page content. This would allow supplementing the tests with non-Magnitude features, as well as adding things that are a bit more deterministic: assert that the added todo item exactly matches what was used as input data, screenshot diffs when the planner fallback came into play, execution log data, etc.

This isn't currently possible from what I can see in the docs, but maybe I'm wrong?
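Roughly the kind of hook I have in mind (purely hypothetical API, not what Magnitude exposes today):

    import { expect, Page } from "@playwright/test";

    // Hypothetical: "agent" drives the visual steps, "page" is a plain
    // Playwright page for deterministic checks and artifact capture.
    declare const agent: { act(step: string): Promise<void> };
    declare const page: Page;

    async function addTodoTest() {
      await agent.act("add a todo item called 'buy milk'");

      // Deterministic assertion on page content, no LLM involved:
      const items = await page.locator(".todo-item").allTextContents();
      expect(items).toContain("buy milk");

      // Raw artifacts for screenshot diffs and execution logs:
      await page.screenshot({ path: "after-add-todo.png" });
    }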

It'd also be ideal if it had an LLM-free executor mode to reduce costs and increase speed (caching outputs, or maybe using the accessibility tree instead of a VLM), and to fit requirements where the planner should not automatically kick in.

tobr 3 days ago

Interesting! My first concern is - isn’t this the ultimate non-deterministic test? In practice, does it seem flaky?

chrisweekly 3 days ago

This looks pretty cool, at least at first glance. I think "traditional web testing" means different things to different people. Last year, the Netflix engineering team published "SafeTest"[1], an interesting hybrid / superset of unit and e2e testing. Have you guys (Magnitude devs) considered incorporating any of their ideas?

1. https://netflixtechblog.com/introducing-safetest-a-novel-app...
