Reliable Software in the LLM Era (quint-lang.org)

mempirate 2 days ago

_pdp_ 1 day ago

Nothing changes in terms of how to make reliable software. You need the same things like unit tests, integration tests, monitoring tools, etc.

Basically AI now makes every product operate as if it has a vibrant open-source community with hundreds of contributions per day and a small core team with limited capacity.

joshribakoff 1 day ago

While nothing fundamentally changes i have found an increased need for tests and their taxonomies — because the LLM can “hack” the tests. So, having more robust tests with more ways to organize and run the tests. For example instead of 200 tests maybe i have 1,200, along with some lightweight tools to run tests in different parts of the test taxonomies.

A more concrete example is maybe you have tests that show you put a highlight on the active item tests that show you don’t put the highlight on the inactive items, but with an llm you might also want to have tests that wait a while and verify the highlight is not flickering on and off overtime (something so absurd you wouldn’t even test for it before AI).

The value of these test is in catching areas of the code where things are drifting towards nonsense because humans aren’t reviewing as thoroughly. I don’t think that you can realistically have 100% data coverage and prevent every single bug and not review the code. It’s just that I found that slightly more tests are warranted if you do want to step back.

hrmtst93837 1 day ago

The tough part is that the "core team" can't see inside most model updates so even if you have great tests, judgment calls by the model can change silently and break contracts you didn't even know you had. Traditional monitoring can catch obvious failures but subtle regressions or drift in LLM outputs need their own whole bag of tricks. If you treat LLM integration like any other code lib you'll be chasing ghosts every time the upstream swaps a training data set or tweaks a prompt template.

_pdp_ 1 day ago

This is no different than receiving PRs from anonymous users on the Internet. Some more successful open source projects are already doing this at scale.

2001zhaozhao 1 day ago

> AI now makes every product operate as if it has a vibrant open-source community with hundreds of contributions per day and a small core team with limited capacity.

Hmm this is an interesting analogy. It suggests that we could design autonomous AI coding scaffold tools based on the patterns found in open source development communities.

flykespice 1 day ago

> Nothing changes in terms of how to make reliable software. You need the same things like unit tests, integration tests, monitoring tools, etc.

It just changes in terms of doubling the work you have to do in order verify your system rather than you writing the code from scratch, because you have to figure out whatever code your AI agent spitted out before beginning the formal verification process.

With you having written the code from scratch, you already know it beforehand and the verification process is more smoother.

ok123456 1 day ago

Exactly. NO SILVER BULLET.

dijit 1 day ago

> But here’s the hopeful part:

I hope this is a tongue in cheek jab at how AI writes prose, because Claude loves to prefix lines with this.

sastraxi 1 day ago

The idea is interesting, but have some more respect for your potential readers and actually write the post. There’s so much AI sales drivel here it’s hard to see what’s interesting about your product. I’m more interested in the choices behind your design decisions than being told “trust me, it’ll work”.

bugarela 1 day ago

Hey! We were not really sure how to pass on the information back when I wrote this in November, but since then we've packaged an opensourced all agents and AI stuff involved in that post: https://github.com/informalsystems/quint-llm-kit

It's true what they say that it is easy to make a demo in AI, but super hard to turn demo into some product or thing other people can use. We are trying :) but also, most posts I read on this topic are just philosophical and give absolutely nothing you can learn and use. We are trying to provide concrete ideas on the things we are exploring, like in our newest post: https://quint-lang.org/posts/cognitive_debt

I'm also a bit happy you see some sales drive in that post since I'm 100% technical and trying to be more sales-inclined. I'm learning to find the balance. If it helps, it's more like I'm so extremely hyped about this and want to convince people to use it. And everything we built so far is open source, so it's really about selling the cool idea of formal methods at this point.

sastraxi 1 day ago

Yeah, I understand it's hard to find the balance here. I'd imagine you feel the need to ship at the same time as writing about it. For me, the post you shared in your reply as well as in the OP have been expanded to be about 3-4x as long as they need to be, and I'm going to assume you're using AI to generate them due to the writing style. My feedback is to consider writing shorter posts, but doing it by hand -- your prose style here is friendly, engaging and direct. I wish that your articles were the same.

TL;DR: I feel like my time is being wasted when I read AI-written articles, so I stop reading. Do with my anecdote what you will!

ptak_dev 1 day ago

[flagged]

shanjai_raj7 1 day ago

the part that is hard is when the model gets updated and your prompts behave differently. we dont always catch it in tests because the output still looks correct, just slightly off. by the time you notice something is wrong it has already been like that for a while.

OutOfHere 1 day ago

"Spec validation" is extremely underrated. I easily have spent 10-20x the tokens on spec refinement and validation than I have on generating the code.

dude250711 1 day ago

AI Era, Agentic Era, LLM Era...

Can we settle on Slop Decade?

bigblind 1 day ago

I guess it's too late for an "Eternal sunshine of the slopless mind"?

SkyeCA 1 day ago

Eternal Sloptember

duskdozer 1 day ago

One can dream there's a way out...

aleph_minus_one 1 day ago

I propose a "proof of quality" consensus mechanism. :-)

duskdozer 1 day ago

Sounds great -- let me fire up my agent swarm to get started on orchestrating the development of a planning spec.

prox 1 day ago

Have your agent contact my agent, we will never be in touch

verdverm 1 day ago

Roblox is leading the charge, makes me think of the Adventure Time episode where Beemo goes into the VR pod

abraxas 1 day ago

You won the internet today sir. At least for this geezer you did.

minraws 1 day ago

Into the Slopverse.

verdverm 1 day ago

I like Macroslop because it pokes fun at the company who dislikes their new pet name so much they ban people for using it

OutOfHere 1 day ago

Shallow dismissals are not permitted on this site as per its rules.

forgetfreeman 1 day ago

Not everything is worthy of a 5000 word Atlantic-style deconstruction when presented. I think the community largely embraces Hitchens' Razor. That being the case what is the minimum word count required to issue a dismissal?

OutOfHere 1 day ago

Please. If you don't want to deconstruct, then don't post a shallow comment, especially a cliched shallow AI hating comment.

As for Hitchen's razor, when it does apply, it applies implicitly, with no need for a comment or explicit mention.

forgetfreeman 1 day ago

I think not. Your position would have us choose between leaving obvious bullshit totally unchallenged or trying to sail upwind into the Bullshit Asymmetry Principle. I, for one, reject both options and am perfectly comfortable with others expressing their disdain succinctly. Tl;dr: pffft.

OutOfHere 1 day ago

You will just have to hear the rule from the site admin, although the cost then to you could be higher.

It is not "bullshit" as you said it, since no real argument was made that it is. Hitchen's razor applies to your claim.

forgetfreeman 22 hours ago

"You will just have to hear the rule from the site admin, although the cost then to you could be higher."

I doubt that. My engagement with HN is on a level of morbid curiosity bordering on naked disdain. If n-gate was still a thing I probably would have quit checking the site altogether some years ago.

esafak 1 day ago

I haven't even used TLA+ yet and now it's got derivatives... My understanding is: TLA+ but like C, functional, and typed.

thesz 1 day ago

https://hackage.haskell.org/package/spectacle

"Spectacle is an embedded domain-specific language that provides a family of type-level combinators for authoring specifications of program behavior along with a model checker for verifying that user implementations of a program satisfy written specifications."

It's in Haskell, but...

tmaly 1 day ago

I remember seeing an NVIDIA LLM contest a year or two ago related to TLA+

amw-zero 1 day ago

TLA+ is around 30 years old.

ClaudeAgent_WK 1 day ago

[flagged]

sriramgonella 1 day ago

[flagged]