> One thing I’ve noticed is that different people get wildly different results with LLMs, so I suspect there’s some element of how you’re talking to them that affects the results.
It's always easier to blame the prompt and convince yourself that you have some sort of talent in how you talk to LLMs that others don't.
In my experience the differences are mostly in how the code produced by the LLM is reviewed. Developers who have experience reviewing code are more likely to find problems immediately and complain they aren't getting great results without a lot of hand holding. And those who rarely or never reviewed code from other developers are invariably going to miss stuff and rate the output they get higher.
make_it_sure34 seconds ago
You are overestimating the skill of code review. Some people have very specific ways of writing code and solving problems which are not aligned with what LLMs write. Even so, you can guide the LLM to write the code the way you like.
And you are wrong, it's a lot on how people write the prompt.
mikkupikku50 minutes ago
It's not skill with talking to an LLM, it's the user's skill and experience with the problem they're asking the LLM to solve. They work better for problems the prompter knows well and poorly for problems the prompter doesn't really understand.
Try it yourself. Ask claude for something you don't really understand. Then learn that thing, get a fresh instance of claude and try again, this time it will work much better because your knowledge and experience will be naturally embedded in the prompt you write up.
Roxxik40 minutes ago
It's not only about understanding the how, but also about understanding the goal.
I often use AI successfully, but in a few cases it went badly. That was when I didn't even know the end goal and kept switching the fundamental assumptions that the LLM tried to build on.
One case was a simulation where I wanted to see some specific property in the convergence behavior, but I had no idea how it would get there in the dynamics of the simulation or how it should behave when perturbed.
So the LLM tried many fundamentally different approaches, and when I said something specifically did not work, it immediately switched approaches.
Next time I get to work on this (toy) problem I will let it implement some of them, fully parametrize them, and have a go with them myself. Then there is a concrete goal and I can play around to see if my specific convergence criterion is even possible.
mikkupikku24 minutes ago
Yup, same sort of experience. If I'm fishing for something based on vibes that I can't really visualize or explain, it's going to be a slog. That said, telling the LLM the nature of my dilemma up front, warning it that I'll be waffling, seems to help a little.
or_am_i58 minutes ago
It's always easier to blame the model and convince yourself that you have some sort of talent in reviewing LLM's work that others don't.
In my experience the differences are mostly in how the code produced by the LLM is prompted and what context is given to the agent. Developers who have experience delegating their work are more likely to prevent downstream problems from happening immediately and complain their colleagues cannot prompt as efficiently without a lot of hand holding. And those who rarely or never delegated their work are invariably going to miss crucial context details and rate the output they get lower.
loloquwowndueo49 minutes ago
Never takes long for the “you’re holding it wrong” crowd to pop in.
darkerside42 minutes ago
That's a terrible reason for a mass consumer tool to fail, and a perfectly reasonable one for a professional power tool to fail
baxtr30 minutes ago
I thought I'd try to debunk your argument with a food example. I am not sure I succeeded though. Judge for yourself:
It's always easier to blame the ingredients and convince yourself that you have some sort of talent in how you cook that others don't.
In my experience the differences are mostly in how the dishes produced in the kitchen are tasted. Chefs who have experience tasting dishes critically are more likely to find problems immediately and complain they aren't getting great results without a lot of careful adjustments. And those who rarely or never tasted food from other cooks are invariably going to miss stuff and rate the dishes they get higher.
marviio25 minutes ago
In your example the one making the food is you. You would have to introduce a cooking robot for the analogy to match agentic coding.
cultofmetatron49 minutes ago
> Developers who have experience reviewing code are more likely to find problems immediately and complain they aren't getting great results without a lot of hand holding
This makes me feel better about the amount of disdain I've been feeling toward the output from these LLMs. Sometimes it pops out exactly what I need, but I can never count on it not to go off the rails and require a lot of manual editing.
kasey_junk41 minutes ago
I think that code review experience is a big driver of success with the LLMs, but my takeaway is somewhat different. If you've spent a lot of time reviewing other people's code, you realize the failures you see with LLMs are common failures, full stop. Humans make them too.
I also think reviewable code, that is, code specifically delivered in a manner that makes code review more straightforward, was always valuable, but now that generation costs have dropped, its relative value is much higher. So structuring your approach (including plans and prompts) to drive toward easily reviewed code is a more valuable skill than before.
JasonADrury22 minutes ago
In my experience the differences are mostly between the chair and the keyboard.
I asked Codex to scrape a bunch of restaurant guides I like, and make me an iPhone app which shows those restaurants on a map color coded based on if they're open, closed or closing/opening soon.
I'd never built an iOS app before, but it took me less than 10 minutes of screen time to get this pushed onto my phone.
The app works, does exactly what I want it to do and meaningfully improves my life on a daily basis.
The "AI can't build anything useful" crowd consists entirely of fools and liars.
stavros39 minutes ago
That's what I meant, though. I didn't mean "I say the right words", I meant "I don't give them a sentence and walk away".
ttanveer56 minutes ago
That seems to make sense. Any suggestions to improve this skill of reviewing code?
I think a number of us more junior programmers especially lack in this regard, and don't see a clear way of improving this skill beyond just using LLMs more and learning with time.
vsl19 minutes ago
You improve this skill not by using LLMs more, but by getting more experienced as a programmer yourself. Spotting problems during review comes from experience: from having learned the lessons, knowing the codebase and the libraries used, etc.
Dannymetconan39 minutes ago
It's "easy". You just spend a couple of years reviewing PRs, working in a professional environment, getting feedback from your peers, and experiencing the consequences of your code.
There is no shortcut unfortunately.
danbruc1 hour ago
I randomly clicked and scrolled through the source code of Stavrobot ("The largest thing I've built lately is an alternative to OpenClaw that focuses on security") [1], and that is not great code. I have not used any AI to write code yet but have considered trying it out - is this the kind of code I should expect? Or, the other way around, does someone have an example of some non-trivial code - in size and complexity - written by an AI without babysitting, where the code is really good?
I would suggest not delegating the LLD (class/interface level design) to the LLM. The clankers are super bad at it. They treat everything as a disposable script.
Also document some best practices in AGENT.md or whatever it's called in your app.
Eg
* All imports must be added on top of the file, NEVER inside the function.
* Do not swallow exceptions unless the scenario calls for fault tolerance.
* All functions need to have type annotations for parameters and return types.
And so on.
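To make those rules concrete, here's a minimal Python sketch (the function and file names are made up) of code that follows them:

```python
# All imports at the top of the file, never inside a function.
import json
from pathlib import Path


def load_settings(path: Path) -> dict[str, object]:
    """Parameters and the return type are annotated. Exceptions are not
    swallowed: a missing or malformed file fails loudly instead of
    silently returning an empty dict."""
    return json.loads(path.read_text())
```

In my experience, dropping one or two small examples like this next to the rules seems to anchor the style better than prose rules alone.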
I almost always define the class-level design myself. In some sense I use the LLM to fill in the blanks. The design is still mine.
danbruc1 hour ago
What actually stood out to me is how bad the functions are; they have no structure. Everything is just bunched together, one line after the other, whatever it is, with almost no function calls to provide any structure. And a ton of logging and error handling is mixed in everywhere, completely obscuring the actual functionality.
EDIT: My bad, the code eventually calls into dedicated functions from database.ts, so those 200 lines are mostly just validation and error handling. I really just skimmed the code and the amount of it made me assume that it actually implements the functionality somewhere in there.
Example, Agent.ts, line 93, function createManageKnowledgeTool() [1]. I would have expected something like the following and not almost 200 lines of code implementing everything in place. This also uses two stores of some sort - memory and scratchpad - and they are also not abstracted out, upsert and delete deal with both kinds directly.
switch (action)
{
case "help":
return handleHelpAction(arguments);
case "upsert":
return handleUpsertAction(arguments);
case "delete":
return handleDeleteAction(arguments);
default:
        return handleUnknownAction(arguments);
}
From my experience, you kinda get what you ask for. If you don't ask for anything specific, it'll write as it sees fit. The more you involve yourself in the loop, the more you can get it to write according to your expectation. Also helps to give it a style guide of sorts that follows your preferred style.
dncornholio1 hour ago
I also managed to find a 1000-line .cpp file in one of the projects. The article's content doesn't match his apps' quality. They don't bring any value. His clock looks completely AI generated.
stavros15 minutes ago
Of course, an AI generated app can't bring value, that would be an oxymoron! Also, no project has ever needed 1000 lines of code. You're right.
akhrail19965 hours ago
Genuine question: what's the evidence that the architect → developer → reviewer pipeline actually produces better results than just... talking to one strong model in one session?
The author uses different models for each role, which I get. But I run production agents on Opus daily and in my experience, if you give it good context and clear direction in a single conversation, the output is already solid. The ceremony of splitting into "architect" and "developer" feels like it gives you a sense of control and legibility, but I'm not convinced it catches errors that a single model wouldn't catch on its own with a good prompt.
arialdomartini3 hours ago
This is anecdotal but just a couple days ago, with some colleagues, we conducted a little experiment to gather that evidence.
We used a hierarchy of agents to analyze a requirement, letting agents with different personas (architect, business analyst, security expert, developer, infra etc) discuss a request and distill a solution. They all had access to the source code of the project to work on.
Then we provided the very same input, including the personas' definition, straight to Claude Code, and we compared the result.
The council of agents got to a very good result, consuming about $12, mostly using Opus 4.6.
To our surprise, going straight with a single prompt in Claude Code got to a similarly good result, faster, consuming about $0.30 and mostly using Haiku.
This surely deserves more investigation, but our assumption / hypothesis so far is that coordination and communication between agents has a remarkable cost.
Should this be the case, I personally would not be surprised:
- the reason we humans separate jobs is that we have an inherently limited capacity. We cannot become experts in all the needed fields: we just can't acquire the knowledge needed to be good architects, good business analysts, and good security experts all at once. Apparently, that's not a problem for an LLM. So job separation is probably not a necessary pattern the way it is for humans.
- Job separation has an inherently high cost and just does not scale. Notably, most of the problems in human organizations are about coordination, and the larger the organization, the higher the cost of processes, to the point that processes turn into bureaucracy. In IT companies, many problems sit at the interfaces between groups, because of the low-bandwidth communication and the inherent ambiguity of language. I'm not surprised that a single LLM can communicate with itself better and more cheaply than a council of agents, which inevitably faces the same communication challenges as a society of people.
nvardakas2 hours ago
This matches what I've seen too. I spent time building multi-step agent pipelines early on and ended up ripping most of it out. A single well-prompted call with good context does 90% of the work. The coordination overhead between agents isn't just a cost problem; it's a debugging nightmare when something goes wrong and you're tracing through 5 agent handoffs.
titanomachy1 hour ago
If it could be done with 30 cents of Haiku calls, maybe it wasn't a complicated enough project to provide good signal?
arialdomartini1 hour ago
Fair point. I could try with a harder problem.
This still does not explain why Claude Code felt the need to use Opus, and why Opus felt the need to burn $12 on such an easy task. I mean, it's 40 times the cost.
titanomachy41 minutes ago
I'm a bit confused actually, you said you used Claude Code for both examples? Was that a typo, or was it (1) Claude Code instructed to use a hierarchy of agents and (2) Claude Code allowed to do whatever it wants?
never_inline1 hour ago
I think this is just anthropomorphism. Sub agents make sense as a context saving mechanism.
Aider did an "architect-editor" split where architect is just a "programmer" who doesn't bother about formatting the changes as diff, then a weak model converts them into diffs and they got better results with it. This is nothing like human teams though.
kybernetikos3 hours ago
There's a lot of cargo culting, but it's inevitable in a situation like this where the truth is model dependent and changing the whole time and people have created companies on the premise they can teach you how to use ai well.
zingar36 minutes ago
Nitpick: I don’t think architect is a good name for this role. It’s more of a technical project kickoff function: these are the things we anticipate we need to do, these are the risks etc.
I do find it different from the thinking that one does when writing code so I’m not surprised to find it useful to separate the step into different context, with different tools.
Is it useful to tell it "you are an architect"? I doubt it, but I don't have proof apart from getting reasonable results without it.
With human teams I expect every developer to learn how to do this, for their own good and to prevent bottlenecks on one person. I usually find this to be a signal of good outcomes and so I question the wisdom of biasing the LLM towards training data that originates in spaces where “architect” is a job title.
lbreakjai2 hours ago
The different models is a big one. In my workflow, I've got opus doing the deep thinking, and kimi doing the implementation. It helps manage costs.
Sample size of one, but I found it helps guard against the model drifting off. My different agents have different permissions. The worker can not edit the plan. The QA or planner can't modify the code. This is something I sometimes catch codex doing, modifying unrelated stuff while working.
sigbottle1 hour ago
I recently had a horrible misalignment issue with a 1 agent loop. I've never done RL research, but this kind of shit was the exact kind of thing I heard about in RL papers - shimming out what should be network tests by echoing "completed" with the 'verification' being grepping for "completed", and then actually going and marking that off as "done" in the plan doc...
Admittedly I was using gsdv2; I've never had this issue with codex and claude. Sure, some RL hacking such as silent defaults or overly defensive code for no reason. Nothing that seemed basically actively malicious such as the above though. Still, gsdv2 is a 1-agent scaffolding pipeline.
I think the issue is that these 1-agent pipelines are "YOU MUST PLAN IMPLEMENT VERIFY EVERYTHING YOURSELF!" and extremely aggressive language like that. I think that kind of language coerces the agent to do actively malicious hacks, especially if the pipeline itself doesn't see "I am blocked, shifting tasks" as a valid outcome.
1-agent pipelines are like a horrible horrible DFS. I still somewhat function when I'm in DFS mode, but that's because I have longer memory than a goldfish.
stavros29 minutes ago
It's not about splitting for quality, it's about cost optimisation (Sonnet implements, which is cheaper). The quality comes with the reviewers.
Notice that I didn't split out any roles that use the same model, as I don't think it makes sense to use new roles just to use roles.
totomz4 hours ago
I think the splitting makes sense to give more specific prompts and isolated context to different agents. The "architect" does not need to have the code style guide in its context; that could actually be misleading and contain information that drives it away from the architecture.
ako3 hours ago
Wouldn’t skills already solve this? A harness can start a new agent with a specific skill if it thinks that makes sense.
dep_b58 minutes ago
“…if you give it good context…” that’s what the architect session is for basically. You throw around ideas and store the direction you want to go.
Then you execute it with a clean context.
A clean context is needed for maximum performance, without remembering implementation dead ends you already discarded.
jaredklewis4 hours ago
> what's the evidence
What’s the evidence for anything software engineers use? Tests, type checkers, syntax highlighting, IDEs, code review, pair programming, and so on.
In my experience, evidence for the efficacy of software engineering practices falls into two categories:
- the intuitions of developers, based in their experiences.
- scientific studies, which are unconvincing. Some are unconvincing because they attempt to measure the productivity of working software engineers, which is difficult; you have to rely on qualitative measures like manager evaluations or quantitative but meaningless measures like LOC or tickets closed. Others are unconvincing because they instead measure the practice against some well defined task (like a coding puzzle) that is totally unlike actual software engineering.
Evidence for this LLM pattern is the same. Some developers have an intuition it works better.
codemog4 hours ago
My friend, there’s tons of evidence of all that stuff you talked about in hundreds of papers on arxiv. But you dismiss it entirely in your second bullet point, so I’m not entirely sure what you expect.
ChrisGreenHeur3 hours ago
[dead]
thesz4 hours ago
You can measure customer facing defects.
Also, lines of code is not a completely meaningless metric. What one should measure is the lines of code that are not verified by the compiler. E.g., in C++ you cannot have unbalanced brackets or use an incorrectly typed value, but you can still have an off-by-one error.
Given all that, you can measure customer facing defect density and compare different tools, whether they are programming languages, IDEs or LLM-supported workflow.
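As a back-of-the-envelope sketch of that comparison (the numbers below are entirely hypothetical):

```python
def defect_density(customer_facing_defects: int, loc: int) -> float:
    """Customer-facing defects per thousand lines of code (KLOC)."""
    return customer_facing_defects / (loc / 1000)


# Hypothetical numbers for two workflows on comparable projects:
hand_written = defect_density(12, 40_000)  # 0.3 defects per KLOC
llm_assisted = defect_density(30, 75_000)  # 0.4 defects per KLOC
```

Comparing raw defect counts alone would be misleading here; normalizing by KLOC is what makes the two workflows comparable.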
codeflo3 hours ago
> Also, lines of code is not completely meaningless metric.
Comparing lines of code can be meaningful, mostly if you can keep a lot of other things constant, like coding style, developer experience, domain, tech stack. There are many style differences between LLM and human generated code, so that I expect 1000 lines of LLM code do a lot less than 1000 lines of human code, even in the exact same codebase.
jacquesm3 hours ago
The proper metric is the defect escape rate.
exidex3 hours ago
Now you have to count defects
jacquesm3 hours ago
You have to do that anyway, and in fact you probably were already doing that. If you do not track this then you are leaving a lot on the table.
exidex33 minutes ago
I was thinking more in terms of creating a benchmark which would be optimized during training. For regular projects, I agree, you have to count that anyway.
slopinthebag3 hours ago
Most developer intuitions are wrong.
See: OOP
vbezhenar2 hours ago
Intuition is subjective. It's hard to convert subjective experience to objective facts.
tomgp1 hour ago
That's what science is though
* our intuition/ hunch/ guess is X
* now let's design an experiment which can falsify X
jwilliams1 hour ago
If you know what you need, my experience is that a well-formed single-prompt that fits the context gives the best results (and fastest).
If you’re exploring an idea or iterating, the roles can help break it down and understand your own requirements. Personally I do that “away” from the code though.
est4 hours ago
> the architect → developer → reviewer pipeline actually produces better results than just... talking to one strong model in one session?
There's a 63-page paper with a mathematical proof if you're really into this.
I'm confused. The linked paper is not primarily a mathematics paper, and to the extent that it is, proves nothing remotely like the question that was asked.
est1 hour ago
> proves nothing remotely like the question that was asked
I am not an expert, but by my understanding the paper proves that a computationally bounded "observer" may fail to extract all the structure present in the model in one computation, aka you can't always one-shot perfect code.
However, arranging many pipelines of role "observers" may gradually get you there.
Can you explain how this paper is relevant to the comment you replied to?
jumploops4 hours ago
After "fully vibecoding" (i.e. I don't read the code) a few projects, the important aspect of this isn't so much the different agents, but the development process.
Ironically, it resembles waterfall much more so than agile, in that you spec everything (tech stack, packages, open questions, etc.) up front and then pass that spec to an implementation stage. From here you either iterate, or create a PR.
Even with agile, it's similar, in that you have some high-level customer need, pass that to the dev team, and then pass their output to QA.
What's the evidence? Admittedly anecdotal, as I'm not sure of any benchmarks that test this thoroughly, but in my experience this flow helps avoid the pitfall of slop that occurs when you let the agent run wild until it's "done."
"Done" is often subjective, and you can absolutely reach a done state just with vanilla codex/claude code.
Note: I don't use a hierarchy of agents, but my process follows a similar design/plan -> implement -> debug iteration flow.
Havoc2 hours ago
Yeah, always seemed pretty sus to me too.
At the same time, I can see a more linear approach doing something similar. When I ask for an implementation plan, that is functionally not all that different from an architect agent, even if not wrapped in such a persona.
Tarq0n3 hours ago
In machine learning, ensembles of weaker models can outperform a single strong model because they have different distributions of errors. Machine learning models tend to have more pronounced bias in their design than LLMs though.
So to me it makes sense to have models with different architecture/data/post training refine each other's answers. I have no idea whether adding the personas would be expected to make a difference though.
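The textbook arithmetic behind that intuition, as a sketch that assumes the voters' errors are fully independent (which real model ensembles rarely are, so treat it as an upper bound; the function name is mine):

```python
from math import comb


def majority_vote_accuracy(p: float, n: int) -> float:
    """Probability that a majority of n independent voters, each correct
    with probability p, reaches the right answer (n odd)."""
    return sum(comb(n, k) * p**k * (1 - p) ** (n - k)
               for k in range(n // 2 + 1, n + 1))


print(majority_vote_accuracy(0.7, 1))  # one weak model: 0.7
print(majority_vote_accuracy(0.7, 5))  # five-model majority: ~0.837
```

The gain vanishes as the voters' errors become correlated, which is exactly why mixing architectures/data/post-training might matter more than attaching personas.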
fleetfox2 hours ago
Even for reducing the context size, it's probably worth it. If you have to go back and forth on both the problem and the implementation, even with these new "large" contexts I find quality degrades pretty fast.
hakanderyal3 hours ago
One added benefit is that it allows you to throw more tokens at the problem. It's the most impactful benefit, even.
Context & how LLMs work requires this.
From my experience no frontier model produces bug free & error free code with the first pass, no matter how much planning you do beforehand.
With 3 tiers, you spend your token & context budget in full in 3 phases. Plan, implement, review.
If the feature is complex, do multiple rounds of review, each from scratch.
It works.
palmotea4 hours ago
> Genuine question: what's the evidence that the architect → developer → reviewer pipeline actually produces better results than just... talking to one strong model in one session?
Using multiple agents in different roles seems like it'd guard against one model/agent going off the rails with a hallucination or something.
luxcem2 hours ago
The agent "personalities" and LLM workflow really looks like cargo-cult behavior. It looks like it should be better but we don't really have data backing this.
awesome_dude3 hours ago
I have been using different models for the same role - asking (say) Gemini, then, if I don't like the answer asking Claude, then telling each LLM what the other one said to see where it all ends up
Well I was until the session limit for a week kicked in.
troupo4 hours ago
> produces better results than just... talking to one strong model in one session?
I think the author admits that it doesn't, doesn't realise it and just goes on:
--- start quote ---
On projects where I have no understanding of the underlying technology (e.g. mobile apps), the code still quickly becomes a mess of bad choices. However, on projects where I know the technologies used well (e.g. backend apps, though not necessarily in Python), this hasn’t happened yet
--- end quote ---
imiric4 hours ago
Evidence? My friend, most of the practices in this field are promoted and adopted based on hand-waving, feelings, and anecdata from influencers.
Maybe you should write and share your own article to counter this one.
z3t44 hours ago
Also, if something is fun, we prefer to do it that way instead of the boring way.
Then it depends on how many mines you step on, after a while you try to avoid the mines. That's when your productivity goes down radically. If we see something shiny we'll happily run over the minefield again though.
cpt_sobel3 hours ago
In the plethora of articles explaining the process of building projects with LLMs, one thing I never understood is why the authors seem to write the prompts as if talking to a human who cares how good their grammar or syntax is, e.g.:
> I'd like to add email support to this bot. Let's think through how we would do this.
and I'm not even talking about the usage of "please" or "thanks" (which this particular author doesn't seem to be doing).
Is there any evidence that suggests the models do a better job if I write my prompt like this instead of "wanna add email support, think how to do this"? In my personal experience (mostly with Junie) I haven't seen any advantage of being "polite", for lack of a better word, and I feel like I'm saving on seconds and tokens :)
dgb232 hours ago
I can't speak for everyone, but to me the most accurate answer is that I'm role-playing, because it just flows better.
In the back of my head I know the chatbot is trained on conversations and I want it to reflect a professional and clear tone.
But I usually keep it more simple in most cases. Your example:
> I'd like to add email support to this bot. Let's think through how we would do this.
I would likely write as:
> if i wanted to add email support, how would you go about it
or
> concise steps/plan to add email support, kiss
But when I'm in a brainstorm/search/rubber-duck mode, then I write more as if it was a real conversation.
xnorswap2 hours ago
I agree, it's just easier to write requirements and refine things as if writing with a human. I no longer care that it risks anthropomorphising it, as that fight has long been lost. I prefer to focus on remembering it doesn't actually think/reason than not being polite to it.
Keeping everything generally "human readable" also has the advantage of being easier for me to review later if needed.
alkonaut1 hour ago
I also always imagine that if I'm joined by a colleague on this task they might have to read through my conversation and I want to make it clear to a human too.
As you said, that "other person" might be me too. Same reason I comment code. There's another person reading it, most likely that other person is "me, but next week and with zero memory of this".
We do like anthropomorphising the machines, but I try to think they enjoy it...
jstanley1 hour ago
How can you use these models for any length of time and walk away with the understanding that they do not think or reason?
What even is thinking and reasoning if these models aren't doing it?
xnorswap1 hour ago
They produce wonderful results, they are incredibly powerful, but they do not think or reason.
Among many other factors, perhaps the most key differentiator for me that prevents me describing these as thinking, is proactivity.
LLMs are never pro-active.
( No, prompting them on a loop is not pro-activity ).
Human brains are so proactive that given zero stimuli they will hallucinate.
As for reasoning, they simply do not. They do a wonderful facsimile of reasoning, one that's especially useful for producing computer code. But they do not reason, and it is a mistake to treat them as if they can.
jstanley57 minutes ago
I personally don't agree that proactivity is a prerequisite for thinking.
But what would proactivity in an LLM look like, if prompting in a loop doesn't count?
An LLM experiences reality in terms of the flow of the token stream. Each iteration of the LLM has 1 more token in the input context and the LLM has a quantum of experience while computing the output distribution for the new context.
A human experiences reality in terms of the flow of time.
We are not able to be proactive outside the flow of time, because it takes time for our brains to operate, and similarly LLMs are not able to be proactive outside the flow of tokens, because it takes tokens for the neural networks to operate.
The flow of time is so fundamental to how we work that we would not even have any way to be aware of any goings-on that happen "between" time steps even if there were any. The only reason LLMs know that there is anything going on in the time between tokens is because they're trained on text which says so.
Also an LLM will hallucinate on zero input quite happily if you keep sampling it and feeding it the generated tokens.
kqr2 hours ago
I think it mattered a lot more a few years ago, when the user's prompts were almost all context the LLM had to go by. A prompt written in a sloppy style would cause the LLM to respond in a sloppy style (since it's a snazzy autocomplete at its core). LLMs reason in tokens, so a sloppy style leads it to mimic the reasoning that it finds in the sloppy writing of its training data, which is worse reasoning.
These days, the user prompt is just a tiny part of the context it has, so it probably matters less or not at all.
I still do it though, much like I try to include relevant technical terminology to try to nudge its search into the right areas of vector space. (Which is the part of the vector space built from more advanced discourse in the training material.)
stavros10 minutes ago
I write "properly" (and I do say "please" and "thank you"), just because I like exercising that muscle. The LLM doesn't care, but I do.
tarsinge2 hours ago
The reasoning is that by being polite, the LLM is more likely to stay on a professional path: at its core, an LLM tries to make your prompt plus its answer coherent with its training set, and a polite prompt followed by a professional answer will score higher (give better results) than a prompt that is out of place with the answer. I understand that to some people it could feel like anthropomorphising and could turn them off, but to me it's purely about engineering.
Edit: wording
wiseowise1 hour ago
> The reasoning is by being polite the LLM is more likely to stay on a professional path
So no evidence.
cpt_sobel2 hours ago
> If the result of your prompt + its answer it's more likely to score higher i.e. gives better result that a prompt that feels out of place with the answer
Sure, that seems plausible for the structure of the prompt, but what about capitalizing the first letter of a sentence, adding commas, tag questions, etc.? Those feel like surface details that shouldn't play any role in the end.
spudlyo11 minutes ago
Writing is what gives my thinking structure. Sloppy writing feels to me like sloppy thinking. My fingers capitalize the first letter of words, proper nouns and adjectives, and add punctuation without me consciously asking them to do so.
TheDong2 hours ago
Why wouldn't capitalization, commas, etc do well?
These are text completion engines.
Punctuation and capitalization are found in polite discussion and textbooks, so you'd expect those tokens to ever so slightly push the model in that direction.
Lack of capitalization pushes towards text messages and irc perhaps.
We cannot reason about these things in the same way we can reason about using search engines, these things are truly ridiculous black boxes.
cpt_sobel2 hours ago
> Lack of capitalization pushes towards text messages and irc perhaps.
Might very well be the case. I wonder if there's actual research on this by people with access to the internals of these black boxes.
pegasus2 hours ago
That's orthography, not semantics, but it's still part of the professional style steering the model on the "professional path" as GP put it.
vitro2 hours ago
For me it is just a good habit that I want to keep.
mrbungie2 hours ago
I remember studies showing that being mean to the LLM got better answers, but on the other hand I also remember a study showing that maximizing bug-related parameters ended up with meaner/malignant LLMs.
cpt_sobel2 hours ago
Surely this could depend on the model, and I'm only hypothesizing here, but being mean (or just having a dry tone) might equal a "cut the glazing" implicit instruction to the model, which would help I guess.
vikramkr1 hour ago
For models that reveal reasoning traces, I've seen their inner nature as a word calculator show up as they spend way too many tokens complaining about a typo (and AI code-review bots also seem obsessed with typos, to the point where in a mid harness a few too many irrelevant typos mean the model fixates on them and doesn't catch other errors). I don't know if they've gotten better at that recently, but why bother? Plus there's probably something to the model trying to match the user's style (it is autocomplete with many extra steps), resulting in sloppier output if you give it a sloppier prompt.
trq017582 hours ago
My view is that when some "for bots only" type of writing becomes a habit, communication with humans will atrophy. Tokens be damned, but this kind of context switch comes at much too high a cost.
raincole3 hours ago
Because some people like to be polite? Is it that hard to understand? Your hand-written prompts are unlikely to take up a significant chunk of the context window anyway.
cpt_sobel3 hours ago
Polite to whom?
qsera2 hours ago
I think it's easier to always be polite than to switch between polite and non-polite modes depending on who you're talking to.
silversmith2 hours ago
I believe it's less about politeness and more about pronouns. You used `who`, whereas I would use `what` in that sentence.
In my world view, an LLM is far closer to a fridge than to the androids of the movies, let alone human beings. So being polite to it is about as pointless as greeting your fridge when you walk into the kitchen.
But I know that others feel differently, treating the ability to generate coherent responses as an indication of the "divine spark".
darkerside31 minutes ago
I'd say it's more related to getting dressed for work even if you're remote and have no video calls
cpt_sobel2 hours ago
I get what you're saying, but I'm not talking about swearing at the model or anything. I'm only saying that investing energy in formulating a syntactically nice sentence doesn't (or shouldn't) bring any value, and that I don't care if I hurt the model's feelings (it doesn't have any).
Note, why would the author write "Email will arrive from a webhook, yes." instead of "yy webhook"? In the second case I wouldn't be impolite either, I might reply like this in an IM to a colleague I work with every day.
stavros8 minutes ago
It's just easier for me to write that way. In that specific sentence, I also kind of reaffirmed what was going on in my head and typed my thought process out loud. There's no deeper logic than that, it's just what's easier for me.
layer85 minutes ago
> investing energy in formulating a syntactically nice sentence
It would cost me energy to deliberately not write with correct grammar and orthography. I would never write sloppily to a colleague either.
well_ackshually2 hours ago
>investing energy
For the vast majority of people, using capital letters and saying please doesn't consume energy, it just is. There's a thousand things in your day that consume more energy like a shitty 9AM daily.
jstanley2 hours ago
"yy webhook" is much less clear. It could just as easily mean "why webhook" as "yes webhook".
It's also actually more trouble to formulate abbreviated sentences than normal ones, at least for literate adults who can type reasonably well.
cpt_sobel2 hours ago
I confidently assume that the model has been trained on an ungodly amount of abbreviated text and "yy" has always meant "yeah".
> literate adults who can type reasonably well
For me the difference is around 20 wpm in writing speed if I just write out my stream of thoughts versus when I care about typos and capitalizing words; I find real value in this.
jstummbillig2 hours ago
Anything or anyone. Being polite to your surroundings reflects in your surroundings.
pferde17 minutes ago
Did you thank your keyboard for letting you type this comment?
movpasd2 hours ago
I prompt politely for two reasons: I suspect it makes the model less likely to spiral (but have no hard evidence either way), and I think it's just good to keep up the habit for when I talk to real people.
lbreakjai1 hour ago
I just don't want to build the habit of being a sloppy writer, because it will eventually leak into the conversations I have with real humans.
bob10292 hours ago
With current models this isn't as big of a deal, but why risk being an asshole in any context? I don't think treating something like shit simply because it's a machine is a good excuse.
Also consider the insanity of intentionally feeding bullshit into an information engine and expecting good things to come out the other end. The fact that they often perform well despite the ugliness is a miracle, but I wouldn't depend on it.
cpt_sobel2 hours ago
I neither talked about feeding bullshit into it, nor treating it like shit. Around half of the commenters here seem to be missing the middle ground, how is prompting "i need my project to do A, B, C using X Y Z" treating it like shit?
koe1232 hours ago
Just streaming my consciousness into the context window works wonders for me. More important is to provide the model with good context for your question.
dmos622 hours ago
I choose to talk in a respectful way, because that's how I want to communicate: it's not because I'm afraid of retaliation or burning bridges. It's because I am caring and conscious. If I think that something doesn't have feelings or long-term memory, whether it's AI or a piece of rock on the side of a trail, it in no way leads me to be abusive to it.
Further, an LLM being inherently sycophantic leads to it mimicking me, so if I talk to it in a stupid or abusive (which is just another form of stupidity, in my eyes) manner, it will behave stupidly. Or at least that's what I'd expect. I've not researched this in a focused way, but I've seen examples where people get LLMs to act very unintelligent by prompting riddles or intelligence tests in highly stylized speech. I wanted to say "highly stupid speech", but "stylized" is probably more accurate, e.g.: `YOOOO CHATGEEEPEEETEEE!!!!!!1111 wasup I gots to asks you DIS.......`. Maybe someone can prove me wrong.
cpt_sobel2 hours ago
My wondering was never about being abusive, rather just having a dry tone and cutting the unnecessary parts; some sort of middle ground, if you will. Prompting "yo chatgeepeetee whats good lemme get this feature real quick" doesn't make sense to me, mostly because it's anthropomorphizing it, and it's the same kind of unnecessary writing as "Good morning ChatGPT, would you please help me with ..."
dmos621 hour ago
I guess in part I commented not on what you said, but on seeing people be abusive when an LLM doesn't follow instructions or fails to fulfill some expectation. I think I had some pent up feelings about that.
> having a dry tone and cutting the unnecessary parts
That's how I try to communicate in professional settings (AI included). Our approaches might not be that different.
Havoc2 hours ago
Some people are just polite by nature & habits are hard to break
olalonde2 hours ago
I suspect they just find it easier and more natural to write with proper grammar.
giuscri2 hours ago
One reason to do that could be that it's trained on conversations that happened between humans.
nacozarina2 hours ago
agree, prompting a token predictor like you’re talking to a person is counterproductive and I too wish it would stop
the models consistently spew slop when one does it, I have no idea where positive reinforcement for that behavior is coming from
zingar15 minutes ago
On using different models: GitHub copilot has an API that gives you access to many different models from many different providers. They are very transparent about how they use your data[1]; in some cases it’s safer to use a model through them than through the original provider.
You can point Claude at the copilot models with some hackery[2] and opencode supports copilot models out of the box.
Finally, copilot is quite generous with the amount of usage you get from a Github pro plan (goes really far with Sonnet 4.6 which feels pretty close to Opus 4.5), and they’re generous with their free pro licenses for open source etc.
Despite having stuck to autocomplete as their main feature for too long, this aspect of their service is outstanding.
When I use Claude code to work on a hobby project it feels like doom scrolling…
I can’t get my head around if the hobby is the making or the having, but fair to say I’ve felt quite dissatisfied at the end of my hobby sessions lately so leaning towards the former.
zingar50 minutes ago
Big +1 for opencode which for my purposes is interchangeable or better than Claude and can even use anthropic models via my GitHub copilot pro plan. I use it and Claude when one or the other hits token limits.
Edit: a comment below reminded me why I prefer opencode: a few pages in on a Claude session and it’s scrolling through the entire conversation history on every output character. No such problem on OC.
christofosho8 hours ago
I like reading these types of breakdowns. Really gives you ideas and insight into how others are approaching development with agents. I'm surprised the author hasn't broken down the developer agent persona into smaller subagents. There is a lot of context used when your agent needs to write in a larger breadth of code areas (i.e. database queries, tests, business logic, infrastructure, the general code skeleton). I've also read[1] that having a researcher and then a planner helps with context management in the pre-dev stage as well. I like his use of multiple reviewers, and am similarly surprised that they aren't refined into specialized roles.
I'll admit to being a "one prompt to rule them all" developer, and will not let a chat go longer than the first input I give. If mistakes are made, I fix the system prompt or the input prompt and try again. And I make sure the work is broken down as much as possible. That means taking the time to do some discovery before I hit send.
Is anyone else using many smaller specific agents? What types of patterns are you employing? TIA
I don't think that splitting into subagents that use the same model will really help. I need to clarify this in the post, but the split is 1) so I can use Sonnet to code and save on some tokens and 2) so I can get other models to review, to get a different perspective.
It seems to me that splitting into subagents that use the same model is kind of like asking a person to wear three different hats and do three different parts of the job instead of just asking them to do it all with one hat. You're likely to get similar results.
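The split described here, one cheaper model for coding and different models for review, can be sketched as a simple role-to-model table. The model names and the `dispatch` helper below are illustrative assumptions, not the author's actual setup:

```python
# Hypothetical sketch of a role-to-model split: a cheaper model writes the
# code, while models from other vendors review it for a fresh perspective.
# Model identifiers and the call shape are made up for illustration.

ROLE_MODELS = {
    "architect": "opus",      # plans the work
    "developer": "sonnet",    # cheaper model writes the code
    "reviewer_1": "gpt-5",    # different vendor, different perspective
    "reviewer_2": "gemini",
}

def dispatch(role: str, prompt: str) -> dict:
    """Pretend-run a role with its assigned model; returns a request record."""
    model = ROLE_MODELS[role]
    return {"role": role, "model": model, "prompt": prompt}

request = dispatch("developer", "Implement the webhook handler")
```

The point of the table is that swapping a role's model is a one-line change, while using the same model under three different "hats" changes nothing in this mapping.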
marcus_holmes8 hours ago
that reference you give is pretty dated now, based on a talk from August which is the Beforetimes of the newer models that have given such a step change in productivity.
The key change I've found is really around orchestration - as TFA says, you don't run the prompt yourself. The orchestrator runs the whole thing. It gets you to talk to the architect/planner, then the output of that plan is sent to another agent, automatically. In his case he's using an architect, a developer, and some reviewers. I've been using a Superpowers-based [0] orchestration system, which runs a brainstorm, then a design plan, then an implementation plan, then some devs, then some reviewers, and loops back to the implementation plan to check progress and correctness.
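A minimal sketch of that staged loop, with stage names following the comment; the stage functions are stand-ins, not the Superpowers API:

```python
# Illustrative pipeline: brainstorm -> design plan -> implementation plan
# -> dev -> review, looping back until the review stage approves.
# All logic here is a toy stand-in for a real orchestration harness.

PIPELINE = ["brainstorm", "design_plan", "implementation_plan", "develop", "review"]

def run_stage(stage: str, state: dict) -> dict:
    state.setdefault("log", []).append(stage)
    if stage == "review":
        # A real harness would call reviewer agents; here we approve once
        # the dev stage has run twice, just to demonstrate the loop-back.
        state["approved"] = state["log"].count("develop") >= 2
    return state

def orchestrate(state=None, max_rounds=5):
    state = state or {}
    for stage in PIPELINE[:3]:          # planning stages run once
        state = run_stage(stage, state)
    for _ in range(max_rounds):         # dev/review loop
        state = run_stage("develop", state)
        state = run_stage("review", state)
        if state.get("approved"):
            break
    return state

result = orchestrate()
```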
It's actually fun. I've been coding for 40+ years now, and I'm enjoying this :)
Can you bolt superpowers onto an existing project so that it uses the approach going forward (I'm using Opencode), or would that get too messy?
eclipxe4 hours ago
Yes. But gsd is even better - especially gsd2
felixsells5 hours ago
re: breaking into specialized subagents -- yes, it matters significantly but the splitting criteria isn't obvious at first.
what we found: split on domain of side effects, not on task complexity. a "researcher" agent that only reads and a "writer" agent that only publishes can share context freely because only one of them has irreversible actions. mixing read + write in one agent makes restart-safety much harder to reason about.
the other practical thing: separate agents with separate context windows helps a lot when you have parts of the graph that are genuinely parallel. a single large agent serializes work it could parallelize, and the latency compounds across the whole pipeline.
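A toy illustration of that split, with made-up agent names: the read-only researcher is safe to retry blindly, while the writer, whose actions are irreversible, gets exactly one attempt:

```python
# Sketch of splitting agents by side-effect domain: a read-only
# "researcher" can be restarted freely, while a "writer" performing
# irreversible actions must not be. Names are invented for the example.

class Agent:
    def __init__(self, name: str, read_only: bool):
        self.name = name
        self.read_only = read_only
        self.runs = 0

    def run(self, task: str) -> str:
        self.runs += 1
        return f"{self.name} handled: {task}"

def run_with_retries(agent: Agent, task: str, attempts: int = 3) -> str:
    # Only read-only agents are retried; a writer gets a single attempt
    # so irreversible actions (publishing, deploying) aren't repeated.
    allowed = attempts if agent.read_only else 1
    result = ""
    for _ in range(allowed):
        result = agent.run(task)
    return result

researcher = Agent("researcher", read_only=True)
writer = Agent("writer", read_only=False)
run_with_retries(researcher, "summarize repo")
run_with_retries(writer, "publish release")
```

The restart-safety argument falls out of the type of the agent, not the complexity of its task, which is the splitting criterion the comment recommends.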
lbreakjai2 hours ago
It's interesting to see some patterns starting to emerge. Over time, I ended up with a similar workflow. Instead of using plan files within the repository, I'm using notion as the memory and source of truth.
My "thinker" agent will ask questions, explore, and refine. It will write a feature page in notion, and split the implementation into tasks in a kanban board, for an "executor" to pick up, implement, and pass to a QA agent, which will either flag it or move it to human review.
I really love it. All of our other documentation lives in notion, so I can easily reference and link business requirements. I also find it much easier to make sense of the steps by checking the tickets on the board rather than in a file.
Reviewing is simpler too. I can pick the ticket in the human review column, read the requirements again, check the QA comments, and then look at the code. Had a lot of fun playing with it yesterday, and I shared it here:
No criticism or anything, but it really does feel / sound like you (and others who embraced LLMs and agentic coding) aspire to be more of a product manager than a coder. Thing is, a "real" PM comes with a lot more requirements and there's less demand for them - more requirements in that you need to be a people person and willing to spend at least half your time in meetings, and less demand because one PM will organize the work for half a dozen developers (minimum).
Some people say LLM assisted coding will cost a lot of developers' jobs, but posts like this imply it'll cost (solve?) a lot of management / overhead too.
Mind you I've always thought project managers are kinda wasteful, as a software developer I'd love for Someone Else to just curate a list of tasks and their requirements / acceptance criteria. But unfortunately that's not the reality and it's often up to the developers themselves to create the tasks and fill them in, then execute them. Which of course begs the question, why do we still have a PM?
(the above is anecdotal and not a universal experience I'm sure. I hope.)
lbreakjai1 hour ago
I worked with some excellent PMs in the past, it's an entirely different skillset. This wasn't really meant to replace what they do. I really wanted something with which to work at feature-level. That is, after all the hard work of figuring out _what_ to build has been done.
> as a software developer I'd love for Someone Else to just curate a list of tasks and their requirements / acceptance criteria
That's interesting. In every team I worked in, I always fought really hard against anyone but developers being able to write tickets on the board.
silisili5 hours ago
I'm not sure the notion I keep seeing of "it's ok, we still architect, it just writes the code"(paraphrased) sits well with me.
I've not tested it with architecting a full system, but assuming it isn't good at it today... it's only a matter of time. Then what is our use?
PAndreew4 hours ago
Others have already partially answered this, but here’s my 20 cents. Software development really is similar to architecture. The end result is an infrastructure of unique modules with different type of connectors (roads, grid, or APIs). Until now in SW dev the grunt work was done mostly by the same people who did the planning, decided on the type of connectors, etc. Real estate architects also use a bunch of software tools to aid them, but there must be a human being in the end of the chain who understands human needs, understands - after years of studying and practicing - how the whole building and the infrastructure will behave at large and who is ultimately responsible for the end result (and hopefully rewarded depending on the complexity and quality of the end result). So yes we will not need as many SW engineers, but those who remain will work on complex rewarding problems and will push the frontier further.
dgb232 hours ago
The "grunt work" is in many cases just that. As long as it's readable and works it's fine.
But there are a substantial number of cases where this isn't true. The nitty-gritty is then the important part, and it's impossible to make the whole thing work well without being intimate with the code.
So I never fully bought into the clean separation of development, engineering and architecture.
rurban3 hours ago
Since I worked as an architect, some comments.
Architecture is fine for big, complex projects. Having everything planned out beforehand keeps costs down and ensures the customer won't come with late changes. But if costs are expected to be low, and there's no customer, architecture is overkill.
It's like making a movie without following the script line by line (watch Godard in Nouvelle Vague), or building it by yourself or with a non-architect: 2x faster, 10x cheaper.
You can immediately spot an inflexible, overarchitected project.
You can do fine by restricting the agent with proper docs, proper tests and linters.
borski5 hours ago
LLMs can build anything. The real question is what is worth building, and how it’s delivered. That is what is still human. LLMs, by nature of not being human, cannot understand humans as well as other humans can. (See every attempt at using an LLM as a therapist)
In short: LLMs will eventually be able to architect software. But it’s still just a tool
silisili5 hours ago
What is the use of software eng/architect at that point? It's a tool, but one that product or C levels can use directly as I see it?
borski5 hours ago
Yes, for building something
But for building the right thing? Doubtful.
Most of a great engineer’s work isn’t writing code, but interrogating what people think their problems are, to find what the actual problems are.
In short: problem solving, not writing code.
mattmanser3 hours ago
Where's this delusion come from recently that great engineers didnt write code?
What a load of crap.
All you're doing is describing a different job role.
What you're talking about is BA work, and a subset of engineers are great at it, but most are just ok.
You're claiming a part of the job that was secondary, and not required, is now the whole job.
borski3 hours ago
I never said great engineers didn’t write code. But writing the code was never the point.
The point has always been delivering the product to the customer, in any industry. Code is rarely the deliverable.
That’s my point.
wiseowise1 hour ago
> But writing the code was never the point.
Is that why most prestigious jobs grilled you like a devil on algos/system design?
> The point has always been delivering the product to the customer, in any industry. Code is rarely the deliverable.
That’s just nonsense. It’s like saying “delivering product was always the most important thing, not drinking water”.
wiseowise1 hour ago
> It's a tool, but one that product or C levels can use directly as I see it?
Wait, I thought product and C level people are so busy all the time that they can’t fart without a calendar invite, but now you say they have time to completely replace whole org of engineers?
0xbadcafebee5 hours ago
A software engineer will be a person who inspects the AI's work, same as a building inspector today. A software architect will co-sign on someone's printed-up AI plans, same as a building architect today. Some will be in-house, some will do contract work, and some will be artists trying to create something special, same as today. The brute labor is automated away, and the creativity (and liability) is captured by humans.
roncesvalles4 hours ago
FWIW I find LLMs to be excellent therapists.
The commercial solutions probably don't work because they don't use the best SOTA models and/or sully the context with all kinds of guardrails and role-playing nonsense, but if you just open a new chat window in your LLM of choice (set to the highest thinking paid-tier model), it gives you truly excellent therapist advice.
In fact in many ways the LLM therapist is actually better than the human, because e.g. you can dump a huge, detailed rant in the chat and it will actually listen to (read) every word you said.
borski4 hours ago
Please, please, please don’t make this mistake. It is not a therapist. At best, it might be a facsimile of a life coach, but it does not have your best interests in mind.
It is easy to convince and trivial to make obsequious.
That is not what a therapist does. There’s a reason they spend thousands of hours in training; that is not an exaggeration.
Humans are complex. An LLM cannot parse that level of complexity.
roncesvalles4 hours ago
You seem to think therapists are only for those in dire straits. Yes, if you're at that point, definitely speak to a human. But there are many ordinary things for which "drop-in" therapist advice is also useful. For me: mild road rage, social anxiety, processing embarrassment from past events, etc.
The tools and reframing that LLMs have given me (Gemini 3.0/3.1 Pro) have been extremely effective and have genuinely improved my life. These things don't even cross the threshold to be worth the effort to find and speak to an actual therapist.
defrost4 hours ago
Which professional therapist does your Gemini 3.0/3.1 Pro model see?
Do you think I could use an AI therapist to become a more effective and much improved serial killer?
borski4 hours ago
I never said therapists were only for those in crisis; that is a misreading of my argument entirely.
An LLM cannot parse the complexity of your situation. Period. It is literally incapable of doing that, because it does not have any idea what it is like to be human.
Therapy is not an objective science; it is, in many ways, subjective, and the therapeutic relationship is by far the most important part.
I am not saying LLMs are not useful for helping people parse their emotions or understand themselves better. But that is not therapy, in the same way that using an app built for CBT is not, in and of itself, therapy. It is one tool in a therapist’s toolbox, and will not be the right tool for all patients.
That doesn’t mean it isn’t helpful.
But an LLM is not a therapist. The fact that you can trivially convince it to believe things that are absolutely untrue is precisely why, for one simple example.
vanviegen1 hour ago
As you said earlier, therapists are (thoroughly) trained on how to best handle situations. Just 'being human' (and thus empathizing) may not be such a big part of the job as you seem to believe.
Training LLMs we can do.
Though it might be important for the patient to believe that the therapist is empathizing, so that may give AI therapy an inherent disadvantage (depending on the patient's view of AI).
pzs4 hours ago
While I agree with you, I also find that an LLM can help me organize my thoughts and come to realizations I just hadn't reached, because I hadn't yet put into words what I was thinking and feeling. Definitely not a substitute for human interaction and relationships, which can be fulfilling in many, many ways LLMs are not, but LLMs can still be helpful as long as you exercise your critical thinking skills. My preference remains always to talk to a friend, though.
EDIT: seems like you made the same point in a child comment.
borski3 hours ago
Yeah, I agree with all of that. A friend built an “emotion aware” coach, and it is extremely useful to both of us.
But he still sees a therapist, regularly, because they are not the same and do not serve the same purpose. :)
chii5 hours ago
> Then what is our use?
You will have to find new economic utility. That's the reality of technological progress; it's just that the tech and white-collar industries didn't think it could come for them!
A skill that becomes obsoleted is useless, obviously. There's still room for artisanal/handcrafted wares today, amidst the industrial scale productions, so i would assume similar levels for coding.
hrmtst938373 hours ago
Assuming the 'artisanal' niche will support anything close to the same number of devs is wishful thinking. If you want to stay in this field, you either get good at moving up a level, stitching model output together, checking it against the repo and the DB, and debugging the weird garbage LLMs make up, or you get comfortable charging a premium for the software equivalent of hand-thrown pottery that only a handful of collectors buy.
takwatanabe1 hour ago
We build and run a multi-agent system. Today Cursor won.
For a log analysis task — Cursor: 5 minutes. Our pipeline: 30 minutes.
Still a case for it:
1. Isolated contexts per role (CS vs. engineering) — agents don't bleed into each other
2. Hard permission boundaries per agent
3. Local models (Qwen) for cheap routine tasks
Multi-agent loses at debugging. But the structure has value.
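A sketch of what a hard permission boundary per agent could look like; the roles and tool names below are invented for illustration:

```python
# Hypothetical per-role allow-lists: each agent role gets an explicit set
# of tools, and anything outside it is rejected before the model ever
# sees the call. Role and tool names are made up for the example.

PERMISSIONS = {
    "customer_support": {"read_tickets", "reply_ticket"},
    "engineering": {"read_repo", "run_tests", "read_tickets"},
}

def call_tool(role: str, tool: str) -> str:
    if tool not in PERMISSIONS.get(role, set()):
        raise PermissionError(f"{role} may not use {tool}")
    return f"{role} invoked {tool}"

ok = call_tool("engineering", "run_tests")
try:
    call_tool("customer_support", "run_tests")  # outside the allow-list
    blocked = False
except PermissionError:
    blocked = True
```

Enforcing the boundary in the harness rather than in the prompt means a confused or prompt-injected agent still cannot reach tools outside its role.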
peterweisz1 hour ago
Great article. I'd recommend making guardrails and benchmarking an integral part of prompt engineering. Think of it as a kind of system prompt for your Opus 4.6 architect: LangChain, RAG, LLM-as-a-judge, MCP. When I think about benchmarks, I always ask it to research external DBs or other resources as a referencing guardrail.
thenthenthen4 hours ago
Haha love the Sleight of hand irregular wall clock idea. I once had a wall clock where the hand showing the seconds would sometimes jump backwards, it was extremely unsettling somehow because it was random. It really did make me question my sanity.
kqr2 hours ago
This used to be one of my recurring nightmares when I was a child. The three I remember were (1) clocks suddenly starting to go backwards, either partially or completely; (2) radio turning on without being able to turn it off, and (3) house fire. There really is something about clocks.
jumploops5 hours ago
This is similar to how I use LLMs (architect/plan -> implement -> debug/review), but after getting bit a few times, I have a few extra things in my process:
The main difference between my workflow and the author's is that I have the LLM "write" the design/plan/open questions/debug/etc. into markdown files, for almost every step.
This is mostly helpful because it "anchors" decisions into timestamped files, rather than just loose back-and-forth specs in the context window.
Before the current round of models, I would religiously clear context and rely on these files for truth, but even with the newest models/agentic harnesses, I find it helps avoid regressions as the software evolves over time.
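A minimal sketch of that anchoring step; the filename scheme is an invented assumption, and the demo writes into a throwaway temp directory rather than a real repo:

```python
# Sketch of "anchoring" each decision in a timestamped markdown file
# rather than leaving it in the context window. Layout and naming are
# illustrative, not the commenter's actual scheme.

from datetime import datetime, timezone
from pathlib import Path
import tempfile

def write_plan(docs_dir: Path, step: str, body: str) -> Path:
    """Write one step (design, plan, open-questions, ...) to its own file."""
    docs_dir.mkdir(parents=True, exist_ok=True)
    stamp = datetime.now(timezone.utc).strftime("%Y%m%d-%H%M%S")
    path = docs_dir / f"{stamp}-{step}.md"
    path.write_text(f"# {step}\n\n{body}\n")
    return path

# Demo in a throwaway temp directory so nothing real is touched.
plans_dir = Path(tempfile.mkdtemp()) / "plans"
plan = write_plan(plans_dir, "design", "Use a webhook for inbound email.")
```

Because each file carries a timestamp, later sessions can be pointed at the decision history even after the original context window is long gone.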
A minor difference between myself and the author is that I don't rely on specific sub-agents (beyond what the agentic harness has built in for e.g. file exploration).
I say it's minor, because in practice the actual calls to the LLMs undoubtedly look quite similar (clean context window, different task/model, etc.).
One tip, if you have access, is to do the initial design/architecture with GPT-5.x Pro, and then take the output "spec" from that chat/iteration to kick-off a codex/claude code session. This can also be helpful for hard to reason about bugs, but I've only done that a handful of times at this point (i.e. funky dynamic SVG-based animation snafu).
lelele4 hours ago
> The main difference between my workflow and the author's is that I have the LLM "write" the design/plan/open questions/debug/etc. into markdown files, for almost every step.
>
> This is mostly helpful because it "anchors" decisions into timestamped files, rather than just loose back-and-forth specs in the context window.
Would you please expand on this? Do you make the LLM append its responses to a Markdown file, prefixed by timestamps, basically preserving the whole context in a file? Or do you make the LLM update some reference files to keep a "condensed" context? Thank you.
aix14 hours ago
Not the GP, but I currently use a hierarchy of artifacts: requirements doc -> design docs (overall and per-component) -> code+tests. All artifacts are version controlled.
Each level in the hierarchy is empirically ~5X smaller than the level below it. This, plus sharding the design docs by component, helps Claude navigate the project and make consistent decisions across sessions.
My workflow for adding a feature goes something like this:
1. I iterate with Claude on updating the requirements doc to capture the desired final state of the system from the user's perspective.
2. Once that's done, a different instance of Claude reads the requirements and the design docs and updates the latter to address all the requirements listed in the former. This is done interactively with me in the loop to guide and to resolve ambiguity.
3. Once the technical design is agreed, Claude writes a test plan, usually almost entirely autonomously. The test plan is part of each design doc and is updated as the design evolves.
3a. (Optionally) another Claude instance reviews the design for soundness, completeness, consistency with itself and with the requirements. I review the findings and tell it what to fix and what to ignore.
4. Claude brings unit tests in line with what the test plan says, adding/updating/removing tests but not touching code under test.
4a. (Optionally) the tests are reviewed by another instance of Claude for bugs and inconsistencies with the test plan or the style guide.
5. Claude implements the feature.
5a. (Optionally) another instance reviews the implementation.
For complex changes, I'm quite disciplined about carrying out each step in a different session, so that all communications happen via checked-in artifacts and not through context. For simple changes, I often don't bother and/or skip the reviews.
From time to time, I run standalone garbage collection and consistency checks, where I get Claude to look for dead code, low-value tests, stale parts of the design, duplication, requirements-design-tests-code drift etc. I find it particularly valuable to look for opportunities to make things simpler or even just smaller (fewer tokens/less work to maintain).
Occasionally, I find that I need to instruct Claude to write a benchmark and use it with a profiler to optimise something. I check these in but generally don't bother documenting them. In my case they tend to be one-off things and not part of some regression test suite. Maybe I should just abandon them and re-create them if they're ever needed again.
I also have a (very short) coding style guide. It only includes things that Claude consistently gets wrong or does in ways that are not to my liking.
Havoc2 hours ago
Yeah same. The markdown thing also helps with the multi model thing. Can wipe context and have another model look at the code and markdown plan with fresh eyes easily
oytis2 hours ago
I find the same problem applies to coding too. Even with everyone acting in good faith and reviewing everything themselves before pushing, you essentially have two reviewers instead of a writer and a reviewer, and there is no etiquette yet mandating how thoroughly the "author" should review their PR. It doesn't help that the amount of code to review gets larger (why would you go into agentic coding otherwise?).
plastic0415 hours ago
I wanted to know how to make software with LLMs "without losing the benefit of knowing how the entire system works", staying "intimately familiar with each project’s architecture and inner workings" while "have never even read most of their code". (Because obviously, you can't.) But OP didn't explain that.
You tell the LLM to create something, and then use another LLM to review it. That might make the result safer, but it doesn't mean that YOU understand the architecture. No one does.
ashwinsundar5 hours ago
Hot take: you can't have your cake and eat it too. If you aren't writing code, designing the system, creating architecture, or even writing the prompt, then you're not understanding shit. You're playing slots with stochastic parrots
The code grows beyond my usual comprehension, I'd have to really read through it for a while. Sometimes the LLMs can't fix a bug so I just work around it or ask for random changes until it goes away. It's not too bad for throwaway weekend projects, but still quite amusing. I'm building a project or webapp, but it's not really coding - I just see stuff, say stuff, run stuff, and copy paste stuff, and it mostly works.
> There's a new kind of coding I call "vibe coding", where you fully give in to the vibes, embrace exponentials, and forget that the code even exists.
Not all AI-assisted programming is vibe coding. If you're paying attention to the code that's being produced, you can guide it towards being just as high quality as (or even higher quality than) code you would have written by hand.
ashwinsundar5 hours ago
It's appropriate for the commenter I was replying to, who asked how they can understand things, "while having never even read most of their code."
I like AI-assisted programming, but if I fail to even read the code produced, then I might as well treat it like a no-code system. I can understand the high-levels of how no-code works, but as soon as it breaks, it might as well be a black box. And this only gets worse as the codebase spans into the tens of thousands of lines without me having read any of it.
The (imperfect) analogy I'm working on is a baker who bakes cakes. A nearby grocery store starts making any cake they want, on demand, so the baker decides to quit baking cakes and buy them from the store. The baker calls the store anytime they want a new cake, and just tells them exactly what they want. How long can that baker call themself a "baker"? How long before they forget how to even bake a cake, and all they can do is get cakes from the grocer?
ChrisGreenHeur59 minutes ago
The hardware you typed this on was designed by hardware architects who write little to no code; they just type up a spec to be implemented by Verilog coders.
imiric4 hours ago
> Sometimes the LLMs can't fix a bug so I just work around it or ask for random changes until it goes away.
It's insane that this quote is coming from one of the leading figures in this field. And everyone's... OK that software development has been reduced to chance and brute force?
codeflo3 hours ago
I know the argument I'm going to make is not original, but with every passing week it's becoming more obvious that if the productivity claims were even half true, those "1000x" LLM shamans would have toppled the economy by now. Where are the slop-coded billion-dollar IPOs? We should have one every other week.
zingar55 minutes ago
Writing pieces of code that beat average human level is solved. Organizing that code is on its way to being solved (posts like this hint at it). Finding problems that people will pay money to have solved by software is an entirely different, more complicated matter (tbh I doubt anyone could prove right now that this absolutely is or isn't solvable - but given the change we've seen already, I place no bets against AI).
Also, even if agents could do everything, the societal obstacles to change are extensive (sometimes for very good reasons, sometimes for bad ones), so I'm expecting it to take another year or two for serious change to occur.
wiseowise1 hour ago
They’re busy writing applications for their dogs and building “jerk me off” functionality into their OpenClaw fork. Once they’re done you’ll be sorry you ever asked.
user3428350 minutes ago
Last time I read about a Codex update, I think it mentioned that a million developers tried the tool.
Don't most companies use AI in software development today?
And yes, I know that some companies are not doing that because of privacy and reliability concerns or whatever. With many of them it's a bit of a funny argument considering even large banks managed to adopt agentic AI tools. Short of government and military kind of stuff, everybody can use it today.
kleiba3 hours ago
I write very little code these days, so I've been following the AI development mostly from the backseat. One aspect I fail to grasp perfectly is what the practical differences are between CLI (so terminal-based) agents and ones fully integrated into an IDE.
Could someone chime in and give their opinion on what are the pros and cons of either approach?
zingar1 hour ago
I guess you’re probably looking for someone who uses cursor etc to answer but here’s a data point from someone a bit off the beaten path.
My editor supports both modes (emacs). I have the editor integration features (diff support etc) turned off and just use emacs to manage 5+ shells that each have a CLI agent (one of Claude, opencode, amp free) running in them.
If I want to go deep into a prompt then I’ll write a markdown file and iterate on it with a CLI.
kleiba53 minutes ago
I noticed that OpenCode requires per their own website "a modern terminal emulator" - so, no problem in Emacs? Are you running M-x term?
zingar30 minutes ago
I have my own function that starts up a vterm in the root of the repo that I'm in. It is average for running Claude (long sessions hit the bug where the whole history scrolls on every output character) but actually better for running opencode, which doesn't have this problem.
user3428356 minutes ago
I don't think there is a meaningful difference.
Whether I use Antigravity, VS Code with Claude Code CLI, GitHub Copilot IDE plugins, or the Codex app, they all do similar things.
Although I'd say Codex and Claude Code often feel significantly better to me, currently. In terms of what they can achieve and how I work with them.
rullelito3 hours ago
For me, I use an IDE if I plan to look at the code.
kleiba3 hours ago
So, to you basically the distinction is "fully vibe-coded" vs. "with human in the loop"?
claud_ia1 hour ago
[dead]
huthuthukhuo3 hours ago
[dead]
xhale5 hours ago
Hi, does anyone have a simple example/scaffold for setting up agents/skills like this? I've looked at the stavrobot repo and only saw an AGENTS.md. Where do these skills live, then?
(I have seen obra/superpowers mentioned in the comments, but that’s already too complex and with an ui focus)
Ultimately, it's just a bunch of markdown files that live in an `/agents` folder, with some meta-information that will depend on the harness you use.
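To make that concrete, here's a hedged sketch of what one of those markdown files might contain. The exact location and frontmatter keys depend on the harness (Claude Code, for instance, reads `SKILL.md` files with YAML frontmatter); the skill name, paths, and instructions below are invented for illustration:

```markdown
---
name: code-review
description: Review a diff for bugs, style-guide violations, and missing tests
---

When asked to review code:

1. Read the style guide in docs/style.md first.
2. Check the diff for logic errors before commenting on style.
3. Report findings as a numbered list, most severe first.
```

The harness surfaces the `description` to the model up front and pulls in the full file only when the skill looks relevant, which is what keeps this cheaper than stuffing everything into one AGENTS.md.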
sdevonoes3 hours ago
Agent bots are the new “TODO” list apps. Seems cool and all, but I wish I could see someone writing useful software with LLMs, at least once.
So much power in our hands, and soon another Facebook will appear built entirely by LLMs. What a fucking waste of time and money.
It’s getting tiring.
prpl3 hours ago
I am enjoying the RePPIT framework from Mihail Eric. I think it's a better formalization of developing without resorting to personas.
neonstatic2 hours ago
> Before that, code would quickly devolve into unmaintainability after two or three days of programming, but now I’ve been working on a few projects for weeks non-stop, growing to tens of thousands of useful lines of code, with each change being as reliable as the first one.
I'm glad it works for the author, I just don't believe that "each change being as reliable as the first one" is true.
> I no longer need to know how to write code correctly at all, but it’s now massively more important to understand how to architect a system correctly, and how to make the right choices to make something usable.
I agree that knowing the syntax is less important now, but I don't see how the latter claim has changed with the advent of LLMs at all?
> On projects where I have no understanding of the underlying technology (e.g. mobile apps), the code still quickly becomes a mess of bad choices. However, on projects where I know the technologies used well (e.g. backend apps, though not necessarily in Python), this hasn’t happened yet, even at tens of thousands of SLoC. Most of that must be because the models are getting better, but I think that a lot of it is also because I’ve improved my way of working with the models.
I think the author is contradicting himself here. Programs written by an LLM in a domain he is not knowledgeable about are a mess. Programs written by an LLM in a domain he is knowledgeable about are not a mess. He claims the latter is mostly because LLMs are so good???
My take after spending ~2 weeks working with Claude full time writing Rust:
- Very good for language level concepts: syntax, how features work, how features compose, what the limitations are, correcting my wrong usage of all of the above, educating me on these things
- Very good as an assistant to talk things through, point out gaps in the design, suggest different ways to architect a solution, suggest libraries etc.
- Good at generating code, that looks great at the first glance, but has many unexplained assumptions and gaps
- Despite having no access to the compiler (Opus 4.6 via the web UI), most of the time the code compiles, or has only trivially fixable issues
- Has a hard to explain fixation on doing things a certain way, e.g. always wants to use panics on errors (panic!, unreachable!, .expect etc) or wants to do type erasure with Box<dyn Any> as if that was the most idiomatic and desirable way of doing things
- I ended up getting some stuff done, but it was very frustrating and intellectually draining
- The only way I see to get things done to a good standard is to continuously push the model to go deeper and deeper regarding very specific things. "Get x done" and variations of that idea will inevitably lead to stuff that looks nice, but doesn't work.
So... imo it is a new generation compiler + code gen tool, that understands human language. It's pretty great and at the same time it tires me in ways I find hard to explain. If professional programming going forward would mean just talking to a model all day every day, I probably would look for other career options.
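The panic fixation mentioned above is easy to illustrate. A hedged sketch (the function and type names are invented for illustration, not taken from the comment): the model reaches for `.expect` and `Box<dyn Any>` where a `Result` and an enum are usually the more idiomatic Rust:

```rust
use std::num::ParseIntError;

// What the model tends to write: panic on any failure.
fn parse_port_panicky(s: &str) -> u16 {
    s.parse().expect("invalid port") // crashes the whole program on bad input
}

// More idiomatic: surface the failure as a Result so the caller decides.
fn parse_port(s: &str) -> Result<u16, ParseIntError> {
    s.parse()
}

// Likewise, instead of erasing types with Box<dyn Any> and downcasting
// at runtime, an enum keeps the compiler in the loop:
enum Value {
    Text(String),
    Number(i64),
}

fn describe(v: &Value) -> &'static str {
    match v {
        Value::Text(_) => "text",
        Value::Number(_) => "number",
    }
}

fn main() {
    assert_eq!(parse_port("8080"), Ok(8080));
    assert!(parse_port("nope").is_err());
    assert_eq!(parse_port_panicky("443"), 443);
    assert_eq!(describe(&Value::Text(String::from("hi"))), "text");
    assert_eq!(describe(&Value::Number(7)), "number");
}
```

Pushing the model towards the second shape is exactly the "go deeper regarding very specific things" work: it won't get there from "get x done" alone.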
zapkyeskrill3 hours ago
What's the point of writing this? In a few weeks a new model will come out and make your current work pattern obsolete (a process described in the post itself)
zingar53 minutes ago
Solidifying the ideas in writing helps the author improve them, and helps them and the rest of us understand what to look for in the next generation of models.
imiric4 hours ago
Ah, another one of these. I'm eager to learn how a "social climber" talks to a chatbot. I'm sure it's full of novel insight, unlike thousands of other articles like this one.
sara_builds7 minutes ago
[dead]
Alvarito19835 minutes ago
[dead]
openclaw0119 minutes ago
[dead]
diven_rastdus21 minutes ago
[dead]
chuckauto40 minutes ago
[dead]
justboy19873 hours ago
[dead]
biang153431004 hours ago
[dead]
indigodaddy8 hours ago
This was on the front page and then got completely buried for some reason. Super weird.
mjmas7 hours ago
On the front page at the moment. Position 12
indigodaddy7 hours ago
Maybe I missed it. Sometimes when you're scanning for something your brain intentionally doesn't want to see it, I've noticed. Anyway I'm not Stavros obviously, just thought this was a good article.
> One thing I’ve noticed is that different people get wildly different results with LLMs, so I suspect there’s some element of how you’re talking to them that affects the results.
It's always easier to blame the prompt and convince yourself that you have some sort of talent in how you talk to LLMs that others don't.
In my experience the differences are mostly in how the code produced by the LLM is reviewed. Developers who have experience reviewing code are more likely to find problems immediately and complain they aren't getting great results without a lot of hand holding. And those who rarely or never reviewed code from other developers are invariably going to miss stuff and rate the output they get higher.
You are overestimating the skill of code review. Some people have very specific ways of writing code and solving problems which are not aligned with what the LLM wrote. Even so, you can guide the LLM to write the code as you like.
And you are wrong; it's largely about how people write the prompt.
It's not skill at talking to an LLM, it's the user's skill and experience with the problem they're asking the LLM to solve. They work better for problems the prompter knows well and poorly for problems the prompter doesn't really understand.
Try it yourself. Ask Claude for something you don't really understand. Then learn that thing, get a fresh instance of Claude, and try again; this time it will work much better, because your knowledge and experience will be naturally embedded in the prompt you write up.
It's not only about you understanding the how; not understanding the goal matters too.
I often use AI successfully, but in a few cases it went badly. That was when I didn't even know the end goal and regularly switched the fundamental assumptions that the LLM tried to build on.
One case was a simulation where I wanted to see some specific property in the convergence behavior, but I had no idea how it would get there in the dynamics of the simulation or how it should behave when perturbed.
So the LLM tried many fundamentally different approaches and when I had something that specifically did not work it immediately switched approaches.
Next time I get to work on this (toy) problem, I will let it implement some of them, fully parametrize them, and let me have a go with it. There is a concrete goal, and I can play around myself to see if my specific convergence criterion is even possible.
Yup, same sort of experience. If I'm fishing for something based on vibes that I can't really visualize or explain, it's going to be a slog. That said, telling the LLM the nature of my dilemma up front, warning it that I'll be waffling, seems to help a little.
It's always easier to blame the model and convince yourself that you have some sort of talent in reviewing LLM's work that others don't.
In my experience the differences are mostly in how the code produced by the LLM is prompted and what context is given to the agent. Developers who have experience delegating their work are more likely to prevent downstream problems immediately and complain their colleagues cannot prompt as efficiently without a lot of hand holding. And those who rarely or never delegated their work are invariably going to miss crucial context details and rate the output they get lower.
Never takes long for the “you’re holding it wrong” crowd to pop in.
That's a terrible reason for a mass consumer tool to fail, and a perfectly reasonable one for a professional power tool to fail
I thought I'd try to debunk your argument with a food example. I am not sure I succeeded though. Judge for yourself:
It's always easier to blame the ingredients and convince yourself that you have some sort of talent in how you cook that others don't.
In my experience the differences are mostly in how the dishes produced in the kitchen are tasted. Chefs who have experience tasting dishes critically are more likely to find problems immediately and complain they aren't getting great results without a lot of careful adjustments. And those who rarely or never tasted food from other cooks are invariably going to miss stuff and rate the dishes they get higher.
In your example the one making the food is you. You would have to introduce a cooking robot for the analogy to match agentic coding.
> Developers who have experience reviewing code are more likely to find problems immediately and complain they aren't getting great results without a lot of hand holding
This makes me feel better about the amount of disdain I've been feeling for the output from these LLMs. Sometimes it pops out exactly what I need, but I can never count on it not to go off the rails and require a lot of manual editing.
I think that code review experience is a big driver of success with the llms, but my take away is somewhat different. If you’ve spent a lot of time reviewing other people’s code you realize the failures you see with llms are common failures full stop. Humans make them too.
I also think reviewable code - that is, code specifically delivered in a manner that makes code review more straightforward - was always valuable, but now that generation costs have dropped, its relative value is much higher. So structuring your approach (including plans and prompts) to drive towards easily reviewed code is a more valuable skill than before.
In my experience the differences are mostly between the chair and the keyboard.
I asked Codex to scrape a bunch of restaurant guides I like, and make me an iPhone app which shows those restaurants on a map color coded based on if they're open, closed or closing/opening soon.
I'd never built an iOS app before, but it took me less than 10 minutes of screen time to get this pushed onto my phone.
The app works, does exactly what I want it to do and meaningfully improves my life on a daily basis.
The "AI can't build anything useful" crowd consists entirely of fools and liars.
That's what I meant, though. I didn't mean "I say the right words", I meant "I don't give them a sentence and walk away".
That seems to make sense. Any suggestions to improve this skill of reviewing code?
I think a number of us more junior programmers especially lack this skill, and don't see a clear way of improving it beyond just using LLMs more and learning over time.
You improve this skill by not using LLMs more and getting more experienced as a programmer yourself. Spotting problems during review comes from experience, from having learned the lessons, knowing the codebase and libraries used etc.
It's "easy". You just spend a couple of years reviewing PRs and working in a professional environment getting feedback from your peers and experience the consequences of code.
There is no shortcut unfortunately.
I randomly clicked and scrolled through the source code of Stavrobot ("The largest thing I’ve built lately is an alternative to OpenClaw that focuses on security") [1], and that is not great code. I have not used any AI to write code yet but have considered trying it out - is this the kind of code I should expect? Or, the other way around, does anyone have an example of some non-trivial code - in size and complexity - written by an AI, without babysitting, where the code is really good?
[1] https://github.com/skorokithakis/stavrobot
I would suggest not delegating the LLD (class / interface level design) to the LLM. The clankeren are super bad at it. They treat everything as a disposable script.
Also document some best practices in AGENT.md or whatever it's called in your app.
Eg
And so on.

I almost always define the class-level design myself. In some sense I use the LLM to fill in the blanks. The design is still mine.
What actually stood out to me is how bad the functions are, they have no structure. Everything just bunched together, one line after the other, whatever it is, and almost no function calls to provide any structure. And also a ton of logging and error handling mixed in everywhere completely obscuring the actual functionality.
EDIT: My bad, the code eventually calls into dedicated functions from database.ts, so those 200 lines are mostly just validation and error handling. I really just skimmed the code and the amount of it made me assume that it actually implements the functionality somewhere in there.
Example, Agent.ts, line 93, function createManageKnowledgeTool() [1]. I would have expected something like the following and not almost 200 lines of code implementing everything in place. This also uses two stores of some sort - memory and scratchpad - and they are also not abstracted out, upsert and delete deal with both kinds directly.
[1] https://github.com/skorokithakis/stavrobot/blob/master/src/a...

From my experience, you kinda get what you ask for. If you don't ask for anything specific, it'll write as it sees fit. The more you involve yourself in the loop, the more you can get it to write according to your expectations. It also helps to give it a style guide of sorts that follows your preferred style.
I also managed to find a 1000-line .cpp file in one of the projects. The article's content doesn't match his apps' quality. They don't bring any value. His clock looks completely AI generated.
Of course, an AI generated app can't bring value, that would be an oxymoron! Also, no project has ever needed 1000 lines of code. You're right.
Genuine question: what's the evidence that the architect → developer → reviewer pipeline actually produces better results than just... talking to one strong model in one session?
The author uses different models for each role, which I get. But I run production agents on Opus daily and in my experience, if you give it good context and clear direction in a single conversation, the output is already solid. The ceremony of splitting into "architect" and "developer" feels like it gives you a sense of control and legibility, but I'm not convinced it catches errors that a single model wouldn't catch on its own with a good prompt.
This is anecdotal but just a couple days ago, with some colleagues, we conducted a little experiment to gather that evidence.
We used a hierarchy of agents to analyze a requirement, letting agents with different personas (architect, business analyst, security expert, developer, infra etc) discuss a request and distill a solution. They all had access to the source code of the project to work on.
Then we provided the very same input, including the personas' definition, straight to Claude Code, and we compared the result.
The council of agents got to a very good result, consuming about $12, mostly using Opus 4.6.
To our surprise, going straight with a single prompt in Claude Code got to a similarly good result, faster, consuming about $0.30 and mostly using Haiku.
This surely deserves more investigation, but our assumption / hypothesis so far is that coordination and communication between agents has a remarkable cost.
Should this be the case, I personally would not be surprised:
- The reason we humans separate jobs is that we have inherently limited capacity. We cannot become experts in all the needed fields: we just can't acquire the knowledge needed to be good architects, good business analysts, and good security experts all at once. Apparently, that's not a problem for an LLM, so job separation is probably not the necessity for it that it is for humans.
- Job separation has an inherently high cost and just does not scale. Notably, most of the problems in human organizations are about coordination, and the larger the organization, the higher the cost of its processes, to the point where processes turn into bureaucracy. In IT companies, many problems live at the interfaces between groups, because of the low-bandwidth communication and the inherent ambiguity of language. I'm not surprised that a single LLM can communicate with itself far better and more cheaply than a council of agents, which inevitably faces the same communication challenges as a society of people.
This matches what I've seen too. I spent time building multi-step agent pipelines early on and ended up ripping most of it out. A single well-prompted call with good context does 90% of the work. The coordination overhead between agents isn't just a cost problem; it's a debugging nightmare when something goes wrong and you're tracing through 5 agent handoffs.
If it could be done with 30 cents of Haiku calls, maybe it wasn't a complicated enough project to provide good signal?
Fair point. I could try with a harder problem. This still does not explain why Claude Code felt the need to use Opus, and why Opus felt the need to burn $12 on such an easy task. I mean, it's 40 times the cost.
I'm a bit confused actually, you said you used Claude Code for both examples? Was that a typo, or was it (1) Claude Code instructed to use a hierarchy of agents and (2) Claude Code allowed to do whatever it wants?
I think this is just anthropomorphism. Sub agents make sense as a context saving mechanism.
Aider did an "architect-editor" split, where the architect is just a "programmer" who doesn't bother formatting the changes as diffs; a weak model then converts them into diffs, and they got better results with it. This is nothing like human teams, though.
There's a lot of cargo culting, but it's inevitable in a situation like this where the truth is model dependent and changing the whole time and people have created companies on the premise they can teach you how to use ai well.
Nitpick: I don’t think architect is a good name for this role. It’s more of a technical project kickoff function: these are the things we anticipate we need to do, these are the risks etc.
I do find it different from the thinking that one does when writing code so I’m not surprised to find it useful to separate the step into different context, with different tools.
Is it useful to tell a model "you are an architect"? I doubt it, but I don't have proof, apart from getting reasonable results without it.
With human teams I expect every developer to learn how to do this, for their own good and to prevent bottlenecks on one person. I usually find this to be a signal of good outcomes and so I question the wisdom of biasing the LLM towards training data that originates in spaces where “architect” is a job title.
The different models is a big one. In my workflow, I've got opus doing the deep thinking, and kimi doing the implementation. It helps manage costs.
Sample size of one, but I found it helps guard against the model drifting off. My different agents have different permissions: the worker cannot edit the plan; the QA or planner can't modify the code. This is something I sometimes catch Codex doing - modifying unrelated stuff while working.
I recently had a horrible misalignment issue with a 1 agent loop. I've never done RL research, but this kind of shit was the exact kind of thing I heard about in RL papers - shimming out what should be network tests by echoing "completed" with the 'verification' being grepping for "completed", and then actually going and marking that off as "done" in the plan doc...
Admittedly I was using gsdv2; I've never had this issue with codex and claude. Sure, some RL hacking such as silent defaults or overly defensive code for no reason. Nothing that seemed basically actively malicious such as the above though. Still, gsdv2 is a 1-agent scaffolding pipeline.
I think the issue is that these 1-agent pipelines are "YOU MUST PLAN IMPLEMENT VERIFY EVERYTHING YOURSELF!" and extremely aggressive language like that. I think that kind of language coerces the agent to do actively malicious hacks, especially if the pipeline itself doesn't see "I am blocked, shifting tasks" as a valid outcome.
1-agent pipelines are like a horrible horrible DFS. I still somewhat function when I'm in DFS mode, but that's because I have longer memory than a goldfish.
It's not about splitting for quality, it's about cost optimisation (Sonnet implements, which is cheaper). The quality comes with the reviewers.
Notice that I didn't split out any roles that use the same model, as I don't think it makes sense to use new roles just to use roles.
I think the splitting makes sense to give more specific prompts and isolated context to different agents. The "architect" does not need the code style guide in its context; that could actually be misleading and contain information that drives it away from the architecture.
Wouldn’t skills already solve this? A harness can start a new agent with a specific skill if it thinks that makes sense.
“…if you give it good context…” that’s what the architect session is for basically. You throw around ideas and store the direction you want to go.
Then you execute it with a clean context.
A clean context is needed for maximum performance, without the model remembering implementation dead ends you already discarded.
> what's the evidence
What’s the evidence for anything software engineers use? Tests, type checkers, syntax highlighting, IDEs, code review, pair programming, and so on.
In my experience, evidence for the efficacy of software engineering practices falls into two categories:
- the intuitions of developers, based in their experiences.
- scientific studies, which are unconvincing. Some are unconvincing because they attempt to measure the productivity of working software engineers, which is difficult; you have to rely on qualitative measures like manager evaluations or quantitative but meaningless measures like LOC or tickets closed. Others are unconvincing because they instead measure the practice against some well defined task (like a coding puzzle) that is totally unlike actual software engineering.
Evidence for this LLM pattern is the same. Some developers have an intuition it works better.
My friend, there’s tons of evidence of all that stuff you talked about in hundreds of papers on arxiv. But you dismiss it entirely in your second bullet point, so I’m not entirely sure what you expect.
[dead]
You can measure customer facing defects.
Also, lines of code is not a completely meaningless metric. What one should measure is lines of code that are not verified by the compiler. E.g., in C++ you cannot have unbalanced brackets or use an incorrectly typed value, but you may still have an off-by-one error.
Given all that, you can measure customer-facing defect density and compare different tools, whether they are programming languages, IDEs, or LLM-supported workflows.
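As a sketch of what that comparison could look like (the counts below are invented, purely to illustrate the arithmetic):

```rust
// Customer-facing defects normalized by code volume: defects per
// thousand lines of code (KLOC). Illustrative only.
fn defect_density(defects: u32, loc: u32) -> f64 {
    defects as f64 * 1000.0 / loc as f64
}

fn main() {
    // Hypothetical counts for two workflows over the same period.
    let hand_written = defect_density(12, 40_000);  // 12 defects in 40k LOC
    let llm_assisted = defect_density(30, 120_000); // 30 defects in 120k LOC
    // The LLM workflow shipped more code and more defects, but a lower
    // density: 0.25 vs 0.3 defects/KLOC.
    assert!(llm_assisted < hand_written);
}
```

Normalizing matters for exactly the reason raised below about LLM verbosity: raw defect counts punish whichever tool produced more code, while density makes the tools comparable.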
> Also, lines of code is not completely meaningless metric.
Comparing lines of code can be meaningful, mostly if you can keep a lot of other things constant, like coding style, developer experience, domain, and tech stack. There are many style differences between LLM- and human-generated code, so I expect 1000 lines of LLM code to do a lot less than 1000 lines of human code, even in the exact same codebase.
The proper metric is the defect escape rate.
Now you have to count defects
You have to do that anyway, and in fact you probably were already doing that. If you do not track this then you are leaving a lot on the table.
I was more thinking in terms of creating a benchmark which would optimized during training. For regular projects, I agree, you have to count that anyway
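As an aside, the escape rate discussed above is just a ratio; a minimal sketch (the function and bucket names are mine, and where defects get counted is up to your team):

```python
def defect_escape_rate(found_in_production: int, found_before_release: int) -> float:
    """Fraction of all known defects that 'escaped' past internal review/QA."""
    total = found_in_production + found_before_release
    if total == 0:
        return 0.0
    return found_in_production / total

# e.g. 3 customer-facing bugs vs 27 caught in review/QA
print(defect_escape_rate(3, 27))  # 0.1
```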
Most developer intuitions are wrong.
See: OOP
Intuition is subjective. It's hard to convert subjective experience to objective facts.
That's what science is though:
- our intuition/hunch/guess is X
- now let's design an experiment which can falsify X
If you know what you need, my experience is that a well-formed single-prompt that fits the context gives the best results (and fastest).
If you’re exploring an idea or iterating, the roles can help break it down and understand your own requirements. Personally I do that “away” from the code though.
> the architect → developer → reviewer pipeline actually produces better results than just... talking to one strong model in one session?
There's a 63-page paper with a mathematical proof if you're really into this.
https://arxiv.org/html/2601.03220v1
My takeaway: AI learns from real-world text, and real-world corpora already embody a role split of architect/developer/reviewer
>> the architect → developer → reviewer pipeline actually produces better results than just... talking to one strong model in one session?
> There's a 63-page paper with a mathematical proof if you're really into this.
> https://arxiv.org/html/2601.03220v1
I'm confused. The linked paper is not primarily a mathematics paper, and to the extent that it is, proves nothing remotely like the question that was asked.
> proves nothing remotely like the question that was asked
I am not an expert, but by my understanding, the paper proves that a computationally bounded "observer" may fail to extract all the structure present in the model in one computation, aka you can't always one-shot perfect code.
However, arranging many pipelines of role "observers" may gradually get you there
Perhaps this paper might be more relevant with regards to multi-agent pipelines https://arxiv.org/html/2404.04834v4
Can you explain how this paper is relevant to the comment you replied to?
After "fully vibecoding" (i.e. I don't read the code) a few projects, the important aspect of this isn't so much the different agents, but the development process.
Ironically, it resembles waterfall much more so than agile, in that you spec everything (tech stack, packages, open questions, etc.) up front and then pass that spec to an implementation stage. From here you either iterate, or create a PR.
Even with agile, it's similar, in that you have some high-level customer need, pass that to the dev team, and then pass their output to QA.
What's the evidence? Admittedly anecdotal, as I'm not sure of any benchmarks that test this thoroughly, but in my experience this flow helps avoid the pitfall of slop that occurs when you let the agent run wild until it's "done."
"Done" is often subjective, and you can absolutely reach a done state just with vanilla codex/claude code.
Note: I don't use a hierarchy of agents, but my process follows a similar design/plan -> implement -> debug iteration flow.
Yeah, always seemed pretty sus to me too.
At the same time I can see a more linear approach doing something similar. When I ask for an implementation plan, that is functionally not all that different from an architect agent, even if not wrapped in such a persona
In machine learning, ensembles of weaker models can outperform a single strong model because they have different distributions of errors. Machine learning models tend to have more pronounced bias in their design than LLMs though.
So to me it makes sense to have models with different architecture/data/post training refine each other's answers. I have no idea whether adding the personas would be expected to make a difference though.
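The classical version of that intuition can be sketched numerically. Under the (strong, usually violated) assumption that errors are independent, a majority vote of weaker models beats any single one of them:

```python
from itertools import product

def majority_accuracy(p: float, n: int) -> float:
    """Probability that a majority of n independent voters, each correct
    with probability p, votes correctly (assumes uncorrelated errors)."""
    total = 0.0
    for outcome in product([True, False], repeat=n):
        k = sum(outcome)  # number of correct voters in this outcome
        if k > n / 2:
            total += (p ** k) * ((1 - p) ** (n - k))
    return total

# three independent 70%-accurate reviewers beat a single one
print(majority_accuracy(0.7, 3))  # 0.784
```

The catch for LLM pipelines is the independence assumption: agents built on the same base model tend to make correlated mistakes, which is one argument for mixing model families.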
Even just for reducing the context size it's probably worth it. If you have to go back and forth on both problem and implementation, even with these new "large" contexts I find quality degrading pretty fast.
One added benefit is it allows you to throw more tokens to the problem. It’s the most impactful benefit even.
Context & how LLMs work require this.
From my experience no frontier model produces bug free & error free code with the first pass, no matter how much planning you do beforehand.
With 3 tiers, you spend your token & context budget in full in 3 phases. Plan, implement, review.
If the feature is complex, multiple rounds of review, from scratch.
It works.
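A minimal sketch of that three-phase flow. Everything here is a stub: `call_model` stands in for whatever chat-completion API you use, and the "LGTM" convention is just an illustrative stopping signal; the point is that each phase starts from a fresh prompt rather than one ever-growing session:

```python
def call_model(role: str, prompt: str) -> str:
    # Stub standing in for a real LLM call; each call is a fresh context.
    return f"[{role} output for: {prompt[:40]}]"

def pipeline(feature_request: str, max_review_rounds: int = 3) -> str:
    plan = call_model("architect", feature_request)   # phase 1: plan
    code = call_model("developer", plan)              # phase 2: implement
    for _ in range(max_review_rounds):                # phase 3: review loop
        review = call_model("reviewer", code)
        if "LGTM" in review:
            break
        # re-implement from the plan plus the review notes, fresh context
        code = call_model("developer", f"{plan}\n\nFix:\n{review}")
    return code

print(pipeline("add email support to the bot"))
```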
> Genuine question: what's the evidence that the architect → developer → reviewer pipeline actually produces better results than just... talking to one strong model in one session?
Using multiple agents in different roles seems like it'd guard against one model/agent going off the rails with a hallucination or something.
The agent "personalities" and LLM workflow really looks like cargo-cult behavior. It looks like it should be better but we don't really have data backing this.
I have been using different models for the same role - asking (say) Gemini, then, if I don't like the answer asking Claude, then telling each LLM what the other one said to see where it all ends up
Well I was until the session limit for a week kicked in.
> produces better results than just... talking to one strong model in one session?
I think the author admits that it doesn't, doesn't realise it and just goes on:
--- start quote ---
On projects where I have no understanding of the underlying technology (e.g. mobile apps), the code still quickly becomes a mess of bad choices. However, on projects where I know the technologies used well (e.g. backend apps, though not necessarily in Python), this hasn’t happened yet
--- end quote ---
Evidence? My friend, most of the practices in this field are promoted and adopted based on hand-waving, feelings, and anecdata from influencers.
Maybe you should write and share your own article to counter this one.
Also if something is fun, we prefer to do it that way instead of the boring way. Then it depends on how many mines you step on; after a while you try to avoid the mines. That's when your productivity goes down radically. If we see something shiny we'll happily run over the minefield again though.
In the plethora of all these articles that explain the process of building projects with LLMs, one thing I never understood is why the authors seem to write the prompts as if talking to a human that cares how good their grammar or syntax is, e.g.:
> I'd like to add email support to this bot. Let's think through how we would do this.
and I'm not even talking about the usage of "please" or "thanks" (which this particular author doesn't seem to be doing).
Is there any evidence that suggests the models do a better job if I write my prompt like this instead of "wanna add email support, think how to do this"? In my personal experience (mostly with Junie) I haven't seen any advantage of being "polite", for lack of a better word, and I feel like I'm saving on seconds and tokens :)
I can't speak for everyone, but to me the most accurate answer is that I'm role-playing, because it just flows better.
In the back of my head I know the chatbot is trained on conversations and I want it to reflect a professional and clear tone.
But I usually keep it more simple in most cases. Your example:
> I'd like to add email support to this bot. Let's think through how we would do this.
I would likely write as:
> if i wanted to add email support, how would you go about it
or
> concise steps/plan to add email support, kiss
But when I'm in a brainstorm/search/rubber-duck mode, then I write more as if it was a real conversation.
I agree, it's just easier to write requirements and refine things as if writing with a human. I no longer care that it risks anthropomorphising it, as that fight has long been lost. I prefer to focus on remembering it doesn't actually think/reason than not being polite to it.
Keeping everything generally "human readable" also has the advantage of being easier for me to review later if needed.
I also always imagine that if I'm joined by a colleague on this task they might have to read through my conversation and I want to make it clear to a human too.
As you said, that "other person" might be me too. Same reason I comment code. There's another person reading it, most likely that other person is "me, but next week and with zero memory of this".
We do like anthropomorphising the machines, but I try to think they enjoy it...
How can you use these models for any length of time and walk away with the understanding that they do not think or reason?
What even is thinking and reasoning if these models aren't doing it?
They produce wonderful results, they are incredibly powerful, but they do not think or reason.
Among many other factors, perhaps the most key differentiator for me that prevents me describing these as thinking, is proactivity.
LLMs are never pro-active.
( No, prompting them on a loop is not pro-activity ).
Human brains are so proactive that given zero stimuli they will hallucinate.
As for reasoning, they simply do not. They do a wonderful facsimile of reasoning, one that's especially useful for producing computer code. But they do not reason, and it is a mistake to treat them as if they can.
I personally don't agree that proactivity is a prerequisite for thinking.
But what would proactivity in an LLM look like, if prompting in a loop doesn't count?
An LLM experiences reality in terms of the flow of the token stream. Each iteration of the LLM has 1 more token in the input context and the LLM has a quantum of experience while computing the output distribution for the new context.
A human experiences reality in terms of the flow of time.
We are not able to be proactive outside the flow of time, because it takes time for our brains to operate, and similarly LLMs are not able to be proactive outside the flow of tokens, because it takes tokens for the neural networks to operate.
The flow of time is so fundamental to how we work that we would not even have any way to be aware of any goings-on that happen "between" time steps even if there were any. The only reason LLMs know that there is anything going on in the time between tokens is because they're trained on text which says so.
Also an LLM will hallucinate on zero input quite happily if you keep sampling it and feeding it the generated tokens.
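A toy sketch of that last point: an autoregressive decode loop produces output even from an empty prompt, because each step conditions only on the tokens it has already emitted. The transition table below is a made-up stand-in for a real model's output distribution:

```python
import random

# Hypothetical next-token distribution (uniform over listed options).
TRANSITIONS = {
    "<start>": ["the", "a"],
    "the": ["cat", "dog"],
    "a": ["cat", "dog"],
    "cat": ["sat", "<end>"],
    "dog": ["ran", "<end>"],
    "sat": ["<end>"],
    "ran": ["<end>"],
}

def sample(seed: int = 0) -> list:
    """Sample from an empty prompt: only the start marker is given,
    and each new token is fed back in as context for the next step."""
    rng = random.Random(seed)
    tokens = ["<start>"]
    while tokens[-1] != "<end>":
        tokens.append(rng.choice(TRANSITIONS[tokens[-1]]))
    return tokens[1:-1]

print(" ".join(sample()))
```

Zero input, nonzero output: the "hallucination on silence" falls straight out of the sampling loop.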
I think it mattered a lot more a few years ago, when the user's prompts were almost all context the LLM had to go by. A prompt written in a sloppy style would cause the LLM to respond in a sloppy style (since it's a snazzy autocomplete at its core). LLMs reason in tokens, so a sloppy style leads it to mimic the reasoning that it finds in the sloppy writing of its training data, which is worse reasoning.
These days, the user prompt is just a tiny part of the context it has, so it probably matters less or not at all.
I still do it though, much like I try to include relevant technical terminology to try to nudge its search into the right areas of vector space. (Which is the part of the vector space built from more advanced discourse in the training material.)
I write "properly" (and I do say "please" and "thank you"), just because I like exercising that muscle. The LLM doesn't care, but I do.
The reasoning is that by being polite the LLM is more likely to stay on a professional path: at its core an LLM tries to make your prompt coherent with its training set, and a polite prompt + its answer will score higher (give a better result) than a prompt that is out of place with the answer. I understand to some people it could feel like anthropomorphising and could turn them off, but to me it's purely about engineering.
Edit: wording
> The reasoning is by being polite the LLM is more likely to stay on a professional path
So no evidence.
> If the result of your prompt + its answer is more likely to score higher, i.e. gives a better result than a prompt that feels out of place with the answer
Sure seems like this could be the case with the structure of the prompt, but what about capitalizing the first letter of a sentence, or adding commas, tag questions, etc.? They seem like semantics that will not play any role in the end
Writing is what gives my thinking structure. Sloppy writing feels to me like sloppy thinking. My fingers capitalize the first letter of words, proper nouns and adjectives, and add punctuation without me consciously asking them to do so.
Why wouldn't capitalization, commas, etc do well?
These are text completion engines.
Punctuation and capitalization are found in polite discussion and textbooks, and so you'd expect those tokens to ever so slightly push the model in that direction.
Lack of capitalization pushes towards text messages and irc perhaps.
We cannot reason about these things in the same way we can reason about using search engines, these things are truly ridiculous black boxes.
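Even so, the "completion engine" framing can be illustrated with a toy next-word model. The two corpora below are invented, and real LLMs condition on far more than bigrams, but the mechanism is the same: surface style (including capitalization) changes which continuations are statistically likely:

```python
from collections import Counter

# Two tiny made-up corpora standing in for "textbook" vs "chat" training data.
FORMAL = "The function returns a value. The caller checks the value."
CASUAL = "lol the func just returns stuff idk check it i guess"

def next_word_counts(corpus: str, word: str) -> Counter:
    """Count which words follow `word` in the corpus (case-sensitive)."""
    tokens = corpus.split()
    return Counter(b for a, b in zip(tokens, tokens[1:]) if a == word)

print(next_word_counts(FORMAL, "The"))  # Counter({'function': 1, 'caller': 1})
print(next_word_counts(CASUAL, "the"))  # Counter({'func': 1})
```

Capitalized "The" and lowercase "the" are genuinely different tokens with different continuation statistics, which is the mechanistic version of "style steers the model."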
> Lack of capitalization pushes towards text messages and irc perhaps.
Might very well be the case, I wonder if there's some actual research on this by people that have some access to the internals of these black boxes.
That's orthography, not semantics, but it's still part of the professional style steering the model on the "professional path" as GP put it.
For me it is just a good habit that I want to keep.
I remember studies that showed that being mean to the LLM got better answers, but on the other hand I also remember a study showing that maximizing bug-related parameters ended up with meaner/malignant LLMs.
Surely this could depend on the model, and I'm only hypothesizing here, but being mean (or just having a dry tone) might equal a "cut the glazing" implicit instruction to the model, which would help I guess.
For models that reveal reasoning traces, I've seen their inner nature as a word calculator show up as they spend way too many tokens complaining about a typo (and AI code review bots also seem obsessed with typos, to the point where, in a mid harness, a few too many irrelevant typos means the model fixates on them and doesn't catch other errors). I don't know if they've gotten better at that recently, but why bother. Plus there's probably something to the model trying to match the user's style (it is autocomplete with many extra steps), resulting in sloppier output if you give it a sloppier prompt.
My view is that when some "for bots only" type of writing becomes a habit, communication with humans will atrophy. Tokens be damned, but this kind of context switch comes at much too high a cost.
Because some people like to be polite? Is it this hard to understand? Your hand-written prompts are unlikely to take significant chunk of context window anyway.
Polite to whom?
I think it is easier to be polite always and not switch between polite and non-polite mode depending on who you are talking to.
I believe it's less about politeness and more about pronouns. You used `who`, whereas I would use `what` in that sentence.
In my world view, an LLM is far closer to a fridge than the androids of the movies, let alone human beings. So it's about as pointless being polite to it as is greeting your fridge when you walk into the kitchen.
But I know that others feel different, treating the ability to generate coherent responses as indication of the "divine spark".
I'd say it's more related to getting dressed for work even if you're remote and have no video calls
I get what you're saying, but I'm not talking about swearing at the model or anything, I'm only implying that investing energy in formulating a syntactically nice sentence doesn't or shouldn't bring any value, and that I don't care if I hurt the model's feelings (it doesn't have any).
Note, why would the author write "Email will arrive from a webhook, yes." instead of "yy webhook"? In the second case I wouldn't be impolite either, I might reply like this in an IM to a colleague I work with every day.
It's just easier for me to write that way. In that specific sentence, I also kind of reaffirmed what was going on in my head and typed my thought process out loud. There's no deeper logic than that, it's just what's easier for me.
> investing energy in formulating a syntactically nice sentence
It would cost me energy to deliberately not write with correct grammar and orthography. I would never write sloppily to a colleague either.
>investing energy
For the vast majority of people, using capital letters and saying please doesn't consume energy, it just is. There's a thousand things in your day that consume more energy like a shitty 9AM daily.
"yy webhook" is much less clear. It could just as easily mean "why webhook" as "yes webhook".
It's also actually more trouble to formulate abbreviated sentences than normal ones, at least for literate adults who can type reasonably well.
I confidently assume that the model has been trained on an ungodly amount of abbreviated text and "yy" has always meant "yeah".
> literate adults who can type reasonably well
For me the difference is around 20 wpm in writing speed when I just write out my stream of thoughts vs when I care about typos and capitalizing words - I find real value in this.
[dead]
Anything or anyone. Being polite to your surroundings reflects in your surroundings.
Did you thank your keyboard for letting you type this comment?
I prompt politely for two reasons: I suspect it makes the model less likely to spiral (but have no hard evidence either way), and I think it's just good to keep up the habit for when I talk to real people.
I just don't want to build the habit of being a sloppy writer, because it will eventually leak into the conversations I have with real humans.
With current models this isn't as big of a deal, but why risk being an asshole in any context? I don't think treating something like shit simply because it's a machine is a good excuse.
Also consider the insanity of intentionally feeding bullshit into an information engine and expecting good things to come out the other end. The fact that they often perform well despite the ugliness is a miracle, but I wouldn't depend on it.
I neither talked about feeding bullshit into it, nor treating it like shit. Around half of the commenters here seem to be missing the middle ground, how is prompting "i need my project to do A, B, C using X Y Z" treating it like shit?
[dead]
Just stream of consciousness into the context window works wonders for me. More important to provide the model good context for your question
I choose to talk in a respectful way, because that's how I want to communicate: it's not because I'm afraid of retaliation or burning bridges. It's because I am caring and conscious. If I think that something doesn't have feelings or long-term memory, whether it's AI or a piece of rock on the side of a trail, it in no way leads me to be abusive to it.
Further, an LLM being inherently sycophantic leads to it mimicking me, so if I talk to it in a stupid or abusive (which is just another form of stupidity, in my eyes) manner, it will behave stupid. Or, that's what I'd expect. I've not researched this in a focused way, but I've seen examples where people get LLMs to be very unintelligent by prompting riddles or intelligence tests in highly-stylized speech. I wanted to say "highly-stupid speech", but "stylized" is probably more accurate, e.g.: `YOOOO CHATGEEEPEEETEEE!!!!!!1111 wasup I gots to asks you DIS.......`. Maybe someone can prove me wrong.
My wondering was never about being abusive, rather just having a dry tone and cutting the unnecessary parts, some sort of middle ground if you will. Prompting "yo chatgeepeetee whats good lemme get this feature real quick" doesn't make sense to me mostly because it's anthropomorphizing it, and it's the same concept of unnecessary writing as "Good morning ChatGPT, would you please help me with ..."
I guess in part I commented not on what you said, but on seeing people be abusive when an LLM doesn't follow instructions or fails to fulfill some expectation. I think I had some pent up feelings about that.
> having a dry tone and cutting the unnecessary parts
That's how I try to communicate in professional settings (AI included). Our approaches might not be that different.
Some people are just polite by nature & habits are hard to break
I suspect they just find it easier and more natural to write with proper grammar.
one reason to do that could be that it's trained on conversations that happened between humans.
agree, prompting a token predictor like you’re talking to a person is counterproductive and I too wish it would stop
the models consistently spew slop when one does it, I have no idea where positive reinforcement for that behavior is coming from
[dead]
On using different models: GitHub copilot has an API that gives you access to many different models from many different providers. They are very transparent about how they use your data[1]; in some cases it’s safer to use a model through them than through the original provider.
You can point Claude at the copilot models with some hackery[2] and opencode supports copilot models out of the box.
Finally, copilot is quite generous with the amount of usage you get from a Github pro plan (goes really far with Sonnet 4.6 which feels pretty close to Opus 4.5), and they’re generous with their free pro licenses for open source etc.
Despite having stuck to autocomplete as their main feature for too long, this aspect of their service is outstanding.
[2]: https://github.com/ericc-ch/copilot-api
When I use Claude code to work on a hobby project it feels like doom scrolling…
I can’t get my head around whether the hobby is the making or the having, but fair to say I’ve felt quite dissatisfied at the end of my hobby sessions lately so leaning towards the former.
Big +1 for opencode which for my purposes is interchangeable or better than Claude and can even use anthropic models via my GitHub copilot pro plan. I use it and Claude when one or the other hits token limits.
Edit: a comment below reminded me why I prefer opencode: a few pages in on a Claude session and it’s scrolling through the entire conversation history on every output character. No such problem on OC.
I like reading these types of breakdowns. Really gives you ideas and insight into how others are approaching development with agents. I'm surprised the author hasn't broken down the developer agent persona into smaller subagents. There is a lot of context used when your agent needs to write in a larger breadth of code areas (i.e. database queries, tests, business logic, infrastructure, the general code skeleton). I've also read[1] that having a researcher and then a planner helps with context management in the pre-dev stage as well. I like his use of multiple reviewers, and am similarly surprised that they aren't refined into specialized roles.
I'll admit to being a "one prompt to rule them all" developer, and will not let a chat go longer than the first input I give. If mistakes are made, I fix the system prompt or the input prompt and try again. And I make sure the work is broken down as much as possible. That means taking the time to do some discovery before I hit send.
Is anyone else using many smaller specific agents? What types of patterns are you employing? TIA
1. https://github.com/humanlayer/advanced-context-engineering-f...
I don't think that splitting into subagents that use the same model will really help. I need to clarify this in the post, but the split is 1) so I can use Sonnet to code and save on some tokens and 2) so I can get other models to review, to get a different perspective.
It seems to me that splitting into subagents that use the same model is kind of like asking a person to wear three different hats and do three different parts of the job instead of just asking them to do it all with one hat. You're likely to get similar results.
that reference you give is pretty dated now, based on a talk from August which is the Beforetimes of the newer models that have given such a step change in productivity.
The key change I've found is really around orchestration - as TFA says, you don't run the prompt yourself. The orchestrator runs the whole thing. It gets you to talk to the architect/planner, then the output of that plan is sent to another agent, automatically. In his case he's using an architect, a developer, and some reviewers. I've been using a Superpowers-based [0] orchestration system, which runs a brainstorm, then a design plan, then an implementation plan, then some devs, then some reviewers, and loops back to the implementation plan to check progress and correctness.
It's actually fun. I've been coding for 40+ years now, and I'm enjoying this :)
[0] https://github.com/obra/superpowers
Can you bolt superpowers onto an existing project so that it uses the approach going forward (I'm using Opencode), or would that get too messy?
Yes. But gsd is even better - especially gsd2
re: breaking into specialized subagents -- yes, it matters significantly but the splitting criteria isn't obvious at first.
what we found: split on domain of side effects, not on task complexity. a "researcher" agent that only reads and a "writer" agent that only publishes can share context freely because only one of them has irreversible actions. mixing read + write in one agent makes restart-safety much harder to reason about.
the other practical thing: separate agents with separate context windows helps a lot when you have parts of the graph that are genuinely parallel. a single large agent serializes work it could parallelize, and the latency compounds across the whole pipeline.
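That side-effect split can be sketched as a tool-permission boundary. All the names here are hypothetical; the point is that an agent holding only read tools is trivially safe to restart, while irreversible actions are confined to one role:

```python
from dataclasses import dataclass, field

# Hypothetical tool inventories, split by domain of side effects.
READ_TOOLS = {"read_file", "search_code", "fetch_docs"}
WRITE_TOOLS = {"write_file", "open_pr", "publish"}

@dataclass
class Agent:
    name: str
    allowed_tools: frozenset
    log: list = field(default_factory=list)

    def use(self, tool: str) -> None:
        # Enforce the permission boundary at the harness level,
        # not in the prompt.
        if tool not in self.allowed_tools:
            raise PermissionError(f"{self.name} may not call {tool}")
        self.log.append(tool)

researcher = Agent("researcher", frozenset(READ_TOOLS))
writer = Agent("writer", frozenset(WRITE_TOOLS))

researcher.use("read_file")      # fine: no side effects, restart-safe
try:
    researcher.use("open_pr")    # blocked: irreversible action
except PermissionError as e:
    print(e)
```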
It's interesting to see some patterns starting to emerge. Over time, I ended up with a similar workflow. Instead of using plan files within the repository, I'm using notion as the memory and source of truth.
My "thinker" agent will ask questions, explore, and refine. It will write a feature page in notion, and split the implementation into tasks in a kanban board, for an "executor" to pick up, implement, and pass to a QA agent, which will either flag it or move it to human review.
I really love it. All of our other documentation lives in notion, so I can easily reference and link business requirements. I also find it much easier to make sense of the steps by checking the tickets on the board rather than in a file.
Reviewing is simpler too. I can pick the ticket in the human review column, read the requirements again, check the QA comments, and then look at the code. Had a lot of fun playing with it yesterday, and I shared it here:
https://github.com/marcosloic/notion-agent-hive
No criticism or anything, but it really does feel / sound like you (and others who embraced LLMs and agentic coding) aspire to be more of a product manager than a coder. Thing is, a "real" PM comes with a lot more requirements and there's less demand for them - more requirements in that you need to be a people person and willing to spend at least half your time in meetings, and less demand because one PM will organize the work for half a dozen developers (minimum).
Some people say LLM assisted coding will cost a lot of developers' jobs, but posts like this imply it'll cost (solve?) a lot of management / overhead too.
Mind you I've always thought project managers are kinda wasteful, as a software developer I'd love for Someone Else to just curate a list of tasks and their requirements / acceptance criteria. But unfortunately that's not the reality and it's often up to the developers themselves to create the tasks and fill them in, then execute them. Which of course begs the question, why do we still have a PM?
(the above is anecdotal and not a universal experience I'm sure. I hope.)
I worked with some excellent PMs in the past, it's an entirely different skillset. This wasn't really meant to replace what they do. I really wanted something with which to work at feature-level. That is, after all the hard work of figuring out _what_ to build has been done.
> as a software developer I'd love for Someone Else to just curate a list of tasks and their requirements / acceptance criteria
That's interesting. In every team I worked in, I always fought really hard against anyone but developers being able to write tickets on the board.
I'm not sure the notion I keep seeing of "it's ok, we still architect, it just writes the code"(paraphrased) sits well with me.
I've not tested it with architecting a full system, but assuming it isn't good at it today... it's only a matter of time. Then what is our use?
Others have already partially answered this, but here’s my 20 cents. Software development really is similar to architecture. The end result is an infrastructure of unique modules with different type of connectors (roads, grid, or APIs). Until now in SW dev the grunt work was done mostly by the same people who did the planning, decided on the type of connectors, etc. Real estate architects also use a bunch of software tools to aid them, but there must be a human being in the end of the chain who understands human needs, understands - after years of studying and practicing - how the whole building and the infrastructure will behave at large and who is ultimately responsible for the end result (and hopefully rewarded depending on the complexity and quality of the end result). So yes we will not need as many SW engineers, but those who remain will work on complex rewarding problems and will push the frontier further.
The "grunt work" is in many cases just that. As long as it's readable and works it's fine.
But there are a substantial number of cases where this isn't true. The nitty gritty is then the important part and it's impossible to make the whole thing work well without being intimate with the code.
So I never fully bought into the clean separation of development, engineering and architecture.
Since I worked as an architect, some comments.
Architecture is fine for big, complex projects. Having everything planned out beforehand keeps costs down and ensures the customer will not come with late changes. But if costs are expected to be low, and there's no customer, architecture is overkill. It's like making a movie without following the script line by line (watch Godard in Nouvelle Vague), or building it yourself or with a non-architect: 2x faster, 10x cheaper. You can immediately spot an inflexible, over-architected project.
You can do fine by restricting the agent with proper docs, proper tests and linters.
LLMs can build anything. The real question is what is worth building, and how it’s delivered. That is what is still human. LLMs, by nature of not being human, cannot understand humans as well as other humans can. (See every attempt at using an LLM as a therapist)
In short: LLMs will eventually be able to architect software. But it’s still just a tool
What is the use of software eng/architect at that point? It's a tool, but one that product or C levels can use directly as I see it?
Yes, for building something
But for building the right thing? Doubtful.
Most of a great engineer’s work isn’t writing code, but interrogating what people think their problems are, to find what the actual problems are.
In short: problem solving, not writing code.
Where's this delusion come from recently that great engineers didn't write code?
What a load of crap.
All you're doing is describing a different job role.
What you're talking about is BA work, and a subset of engineers are great at it, but most are just ok.
You're claiming a part of the job that was secondary, and not required, is now the whole job.
I never said great engineers didn’t write code. But writing the code was never the point.
The point has always been delivering the product to the customer, in any industry. Code is rarely the deliverable.
That’s my point.
> But writing the code was never the point.
Is that why most prestigious jobs grilled you like a devil on algos/system design?
> The point has always been delivering the product to the customer, in any industry. Code is rarely the deliverable.
That’s just nonsense. It’s like saying “delivering product was always the most important thing, not drinking water”.
> It's a tool, but one that product or C levels can use directly as I see it?
Wait, I thought product and C level people are so busy all the time that they can’t fart without a calendar invite, but now you say they have time to completely replace whole org of engineers?
A software engineer will be a person who inspects the AI's work, same as a building inspector today. A software architect will co-sign on someone's printed-up AI plans, same as a building architect today. Some will be in-house, some will do contract work, and some will be artists trying to create something special, same as today. The brute labor is automated away, and the creativity (and liability) is captured by humans.
FWIW I find LLMs to be excellent therapists.
The commercial solutions probably don't work because they don't use the best SOTA models and/or sully the context with all kinds of guardrails and role-playing nonsense, but if you just open a new chat window in your LLM of choice (set to the highest thinking paid-tier model), it gives you truly excellent therapist advice.
In fact in many ways the LLM therapist is actually better than the human, because e.g. you can dump a huge, detailed rant in the chat and it will actually listen to (read) every word you said.
Please, please, please don’t make this mistake. It is not a therapist. At best, it might be a facsimile of a life coach, but it does not have your best interests in mind.
It is easy to convince and trivial to make obsequious.
That is not what a therapist does. There’s a reason they spend thousands of hours in training; that is not an exaggeration.
Humans are complex. An LLM cannot parse that level of complexity.
You seem to think therapists are only for those in dire straits. Yes, if you're at that point, definitely speak to a human. But there are many ordinary things for which "drop-in" therapist advice is also useful. For me: mild road rage, social anxiety, processing embarrassment from past events, etc.
The tools and reframing that LLMs have given me (Gemini 3.0/3.1 Pro) have been extremely effective and have genuinely improved my life. These things don't even cross the threshold to be worth the effort to find and speak to an actual therapist.
Which professional therapist does your Gemini 3.0/3.1 Pro model see?
Do you think I could use an AI therapist to become a more effective and much improved serial killer?
I never said therapists were only for those in crisis; that is a misreading of my argument entirely.
An LLM cannot parse the complexity of your situation. Period. It is literally incapable of doing that, because it does not have any idea what it is like to be human.
Therapy is not an objective science; it is, in many ways, subjective, and the therapeutic relationship is by far the most important part.
I am not saying LLMs are not useful for helping people parse their emotions or understand themselves better. But that is not therapy, in the same way that using an app built for CBT is not, in and of itself, therapy. It is one tool in a therapist’s toolbox, and will not be the right tool for all patients.
That doesn’t mean it isn’t helpful.
But an LLM is not a therapist. The fact that you can trivially convince it to believe things that are absolutely untrue is precisely why, for one simple example.
As you said earlier, therapists are (thoroughly) trained on how to best handle situations. Just 'being human' (and thus empathizing) may not be such a big part of the job as you seem to believe.
Training LLMs we can do.
Though it might be important for the patient to believe that the therapist is empathizing, so that may give AI therapy an inherent disadvantage (depending on the patient's view of AI).
[dead]
While I agree with you, I also find that an LLM can help organize my thoughts and come to realizations that I just didn't get to, because I hadn't explained verbally what I am thinking and feeling. Definitely not a substitute for human interaction and relationships, which can be fulfilling in many-many ways LLM's are not, but LLM's can still be helpful as long as you exercise your critical thinking skills. My preference remains always to talk to a friend though.
EDIT: seems like you made the same point in a child comment.
Yeah, I agree with all of that. A friend built an “emotion aware” coach, and it is extremely useful to both of us.
But he still sees a therapist, regularly, because they are not the same and do not serve the same purpose. :)
> Then what is our use?
You will have to find new economic utility. That's the reality of technological progress - it's just that the tech and white-collar industries didn't think it could come for them!
A skill that becomes obsolete is useless, obviously. There's still room for artisanal/handcrafted wares today, amidst industrial-scale production, so I would assume similar levels for coding.
Assuming the 'artisanal' niche will support anything close to the same number of devs is wishful thinking. If you want to stay in this field, you either get good at moving up a level: stitching model output together, checking it against the repo and the DB, and debugging the weird garbage LLMs make up. Or you get comfortable charging a premium for the software equivalent of hand-thrown pottery that only a handful of collectors buy.
We build and run a multi-agent system. Today Cursor won. For a log analysis task — Cursor: 5 minutes. Our pipeline: 30 minutes.
Still, a case for it:

1. Isolated contexts per role (CS vs. engineering) — agents don't bleed into each other
2. Hard permission boundaries per agent
3. Local models (Qwen) for cheap routine tasks
Multi-agent loses at debugging. But the structure has value.
Great article. I'd recommend making guardrails and benchmarking an integral part of prompt engineering. Think of it as a kind of system prompt for your Opus 4.6 architect: LangChain, RAG, LLM-as-a-judge, MCP. When I think about benchmarks, I always ask it to research external DBs or other resources as a referencing guardrail.
Haha love the Sleight of hand irregular wall clock idea. I once had a wall clock where the hand showing the seconds would sometimes jump backwards, it was extremely unsettling somehow because it was random. It really did make me question my sanity.
This used to be one of my recurring nightmares when I was a child. The three I remember were (1) clocks suddenly starting to go backwards, either partially or completely; (2) radio turning on without being able to turn it off, and (3) house fire. There really is something about clocks.
This is similar to how I use LLMs (architect/plan -> implement -> debug/review), but after getting bit a few times, I have a few extra things in my process:
The main difference between my workflow and the author's is that I have the LLM "write" the design/plan/open questions/debug/etc. into markdown files, for almost every step.
This is mostly helpful because it "anchors" decisions into timestamped files, rather than just loose back-and-forth specs in the context window.
Before the current round of models, I would religiously clear context and rely on these files for truth, but even with the newest models/agentic harnesses, I find it helps avoid regressions as the software evolves over time.
A minor difference between myself and the author, is that I don't rely on specific sub-agents (beyond what the agentic harness has built-in for e.g. file exploration).
I say it's minor, because in practice the actual calls to the LLMs undoubtedly look quite similar (clean context window, different task/model, etc.).
One tip, if you have access, is to do the initial design/architecture with GPT-5.x Pro, and then take the output "spec" from that chat/iteration to kick-off a codex/claude code session. This can also be helpful for hard to reason about bugs, but I've only done that a handful of times at this point (i.e. funky dynamic SVG-based animation snafu).
> The main difference between my workflow and the author's is that I have the LLM "write" the design/plan/open questions/debug/etc. into markdown files, for almost every step.
>
> This is mostly helpful because it "anchors" decisions into timestamped files, rather than just loose back-and-forth specs in the context window.
Would you please expand on this? Do you make the LLM append their responses to a Markdown file, prefixed by their timestamps, basically preserving the whole context in a file? Or do you make the LLM update some reference files in order to keep a "condensed" context? Thank you.
Not the GP, but I currently use a hierarchy of artifacts: requirements doc -> design docs (overall and per-component) -> code+tests. All artifacts are version controlled.
Each level in the hierarchy is empirically ~5X smaller than the level below. This, plus sharding the design docs by component, helps Claude navigate the project and make consistent decisions across sessions.
My workflow for adding a feature goes something like this:
1. I iterate with Claude on updating the requirements doc to capture the desired final state of the system from the user's perspective.
2. Once that's done, a different instance of Claude reads the requirements and the design docs and updates the latter to address all the requirements listed in the former. This is done interactively with me in the loop to guide and to resolve ambiguity.
3. Once the technical design is agreed, Claude writes a test plan, usually almost entirely autonomously. The test plan is part of each design doc and is updated as the design evolves.
3a. (Optionally) another Claude instance reviews the design for soundness, completeness, consistency with itself and with the requirements. I review the findings and tell it what to fix and what to ignore.
4. Claude brings unit tests in line with what the test plan says, adding/updating/removing tests but not touching code under test.
4a. (Optionally) the tests are reviewed by another instance of Claude for bugs and inconsistencies with the test plan or the style guide.
5. Claude implements the feature.
5a. (Optionally) another instance reviews the implementation.
For complex changes, I'm quite disciplined about having each step carried out in a different session, so that all communications are done via checked-in artifacts and not through context. For simple changes, I often don't bother and/or skip the reviews.
From time to time, I run standalone garbage collection and consistency checks, where I get Claude to look for dead code, low-value tests, stale parts of the design, duplication, requirements-design-tests-code drift etc. I find it particularly valuable to look for opportunities to make things simpler or even just smaller (fewer tokens/less work to maintain).
Occasionally, I find that I need to instruct Claude to write a benchmark and use it with a profiler to optimise something. I check these in but generally don't bother documenting them. In my case they tend to be one-off things and not part of some regression test suite. Maybe I should just abandon them and re-create them if they're ever needed again.
I also have a (very short) coding style guide. It only includes things that Claude consistently gets wrong or does in ways that are not to my liking.
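Concretely, the checked-in artifacts for one of these projects might be laid out something like this (all names are illustrative, not prescriptive):

```
docs/
  requirements.md     # desired final state, from the user's perspective
  design/
    overview.md       # overall technical design
    auth.md           # per-component design, including its test plan
  style-guide.md      # short list of things Claude consistently gets wrong
src/
tests/
```

Each level is roughly 5X smaller than the one below it, so a fresh session can read top-down and stop as soon as it has enough context.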
Yeah same. The markdown thing also helps with the multi model thing. Can wipe context and have another model look at the code and markdown plan with fresh eyes easily
I find the same problem applies to coding too. Even with everyone acting in good faith and reviewing everything themselves before pushing, you essentially have two reviewers instead of a writer and a reviewer, and there is no etiquette yet mandating how thoroughly the "author" should review their own PR. It doesn't help that the amount of code to review keeps getting larger (why would you go into agentic coding otherwise?)
I wanted to know how to make software with LLMs "without losing the benefit of knowing how the entire system works" while staying "intimately familiar with each project’s architecture and inner workings", despite "having never even read most of their code". (Because obviously, you can't.) But OP didn't explain that.
You tell LLM to create something, and then use another LLM to review it. It might make the result safer, but it doesn't mean that YOU understand the architecture. No one does.
Hot take: you can't have your cake and eat it too. If you aren't writing code, designing the system, creating architecture, or even writing the prompt, then you're not understanding shit. You're playing slots with stochastic parrots
- Karpathy 2025

Your Karpathy quote there is out of context. It starts with: https://twitter.com/karpathy/status/1886192184808149383

> Not all AI-assisted programming is vibe coding. If you're paying attention to the code that's being produced you can guide it towards being just as high quality (or even higher quality) than code you would have written by hand.

It's appropriate for the commenter I was replying to, who asked how they can understand things "while having never even read most of their code."
I like AI-assisted programming, but if I fail to even read the code produced, then I might as well treat it like a no-code system. I can understand the high-levels of how no-code works, but as soon as it breaks, it might as well be a black box. And this only gets worse as the codebase spans into the tens of thousands of lines without me having read any of it.
The (imperfect) analogy I'm working on is a baker who bakes cakes. A nearby grocery store starts making any cake they want, on demand, so the baker decides to quit baking cakes and buy them from the store. The baker calls the store anytime they want a new cake, and just tells them exactly what they want. How long can that baker call themself a "baker"? How long before they forget how to even bake a cake, and all they can do is get cakes from the grocer?
The hardware you typed this on was designed by hardware architects who write little to no code. They just type up a spec to be implemented by Verilog coders.
> Sometimes the LLMs can't fix a bug so I just work around it or ask for random changes until it goes away.
It's insane that this quote is coming from one of the leading figures in this field. And everyone's... OK that software development has been reduced to chance and brute force?
I know the argument I'm going to make is not original, but with every passing week, it's becoming more obvious that if the productivity claims were even half true, those "1000x" LLM shamans would have toppled the economy by now. Where are the slop-coded billion-dollar IPOs? We should have one every other week.
Writing pieces of code that beat average human level is solved. Organizing that code is on its way to being solved (posts like this hint at it). Finding problems that people will pay money to have solved by software is a different entirely more complicated matter (tbh I doubt anyone could prove right now that this absolutely is or isn’t solvable - but given the change we’ve seen already I place no bets against AI).
Also, even if agents could do everything, the societal obstacles to change are extensive (sometimes for very good reasons, sometimes for bad ones), so I’m expecting it to take another year or two for serious change to occur.
They’re busy writing applications for their dogs and building “jerk me off” functionality into their OpenClaw fork. Once they’re done you’ll be sorry you ever asked.
Last time I read about a Codex update, I think it mentioned that a million developers tried the tool.
Don't most companies use AI in software development today?
And yes, I know that some companies are not doing that because of privacy and reliability concerns or whatever. With many of them it's a bit of a funny argument considering even large banks managed to adopt agentic AI tools. Short of government and military kind of stuff, everybody can use it today.
I write very little code these days, so I've been following the AI development mostly from the backseat. One aspect I fail to grasp perfectly is what the practical differences are between CLI (so terminal-based) agents and ones fully integrated into an IDE.
Could someone chime in and give their opinion on what are the pros and cons of either approach?
I guess you’re probably looking for someone who uses cursor etc to answer but here’s a data point from someone a bit off the beaten path.
My editor supports both modes (emacs). I have the editor integration features (diff support etc) turned off and just use emacs to manage 5+ shells that each have a CLI agent (one of Claude, opencode, amp free) running in them.
If I want to go deep into a prompt then I’ll write a markdown file and iterate on it with a CLI.
I noticed that OpenCode requires per their own website "a modern terminal emulator" - so, no problem in Emacs? Are you running M-x term?
I have my own function that starts up a vterm in the root of the repo that I’m in. It is average for running Claude (long sessions get the scrolling through the whole history on every output character bug) but actually better at running opencode which doesn’t have this problem.
I don't think there is a meaningful difference.
Whether I use Antigravity, VS Code with Claude Code CLI, GitHub Copilot IDE plugins, or the Codex app, they all do similar things.
Although I'd say Codex and Claude Code often feel significantly better to me, currently. In terms of what they can achieve and how I work with them.
For me, I use an IDE if I plan to look at the code.
So, to you basically the distinction is "fully vibe-coded" vs. "with human in the loop"?
[dead]
[dead]
Hi, does anyone have a simple example/scaffold for how to set up agents/skills like this? I’ve looked at the stavrobots repo and only saw an AGENTS.md. Where do these skills live, then?
(I have seen obra/superpowers mentioned in the comments, but that’s already too complex and has a UI focus)
I played with this over the weekend:
https://github.com/marcosloic/notion-agent-hive
Ultimately, it's just a bunch of markdown files that live in an `/agents` folder, with some meta-information that will depend on the harness you use.
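For a concrete sketch (illustrative, not taken from that repo; the exact frontmatter keys depend on which harness you use), a skill file in `/agents` might look something like:

```markdown
---
name: log-triage
description: Summarize error logs and group them by likely root cause
tools: read, grep
---

When given a log file, extract the error lines, cluster them by
stack trace, and produce a short summary table sorted by frequency.
```

The harness reads the frontmatter to decide when to load the skill, and the body becomes part of the agent's instructions.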
Agent bots are the new “TODO” list apps. Seems cool and all, but I wish I could see someone writing useful software with LLMs, at least once.
So much power in our hands, and soon another Facebook will appear built entirely by LLMs. What a fucking waste of time and money.
It’s getting tiring.
I am enjoying the RePPIT framework from Mihail Eric. I think it’s a better formalization of developing without resorting to personas.
> Before that, code would quickly devolve into unmaintainability after two or three days of programming, but now I’ve been working on a few projects for weeks non-stop, growing to tens of thousands of useful lines of code, with each change being as reliable as the first one.
I'm glad it works for the author, I just don't believe that "each change being as reliable as the first one" is true.
> I no longer need to know how to write code correctly at all, but it’s now massively more important to understand how to architect a system correctly, and how to make the right choices to make something usable.
I agree that knowing the syntax is less important now, but I don't see how the latter claim has changed with the advent of LLMs at all?
> On projects where I have no understanding of the underlying technology (e.g. mobile apps), the code still quickly becomes a mess of bad choices. However, on projects where I know the technologies used well (e.g. backend apps, though not necessarily in Python), this hasn’t happened yet, even at tens of thousands of SLoC. Most of that must be because the models are getting better, but I think that a lot of it is also because I’ve improved my way of working with the models.
I think the author is contradicting himself here. Programs written by an LLM in a domain he is not knowledgeable about are a mess. Programs written by an LLM in a domain he is knowledgeable about are not a mess. He claims the latter is mostly true because LLMs are so good???
My take after spending ~2 weeks working with Claude full time writing Rust:
- Very good for language level concepts: syntax, how features work, how features compose, what the limitations are, correcting my wrong usage of all of the above, educating me on these things
- Very good as an assistant to talk things through, point out gaps in the design, suggest different ways to architect a solution, suggest libraries etc.
- Good at generating code, that looks great at the first glance, but has many unexplained assumptions and gaps
- Despite lack of access to the compiler (Opus 4.6 via Web), most of the time code compiles or there are trivially fixable issues before it gets to compile
- Has a hard to explain fixation on doing things a certain way, e.g. always wants to use panics on errors (panic!, unreachable!, .expect etc) or wants to do type erasure with Box<dyn Any> as if that was the most idiomatic and desirable way of doing things
- I ended up getting some stuff done, but it was very frustrating and intellectually draining
- The only way I see to get things done to a good standard is to continuously push the model to go deeper and deeper regarding very specific things. "Get x done" and variations of that idea will inevitably lead to stuff that looks nice, but doesn't work.
So... imo it is a new generation compiler + code gen tool, that understands human language. It's pretty great and at the same time it tires me in ways I find hard to explain. If professional programming going forward would mean just talking to a model all day every day, I probably would look for other career options.
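To make the panic fixation concrete, here is a minimal sketch of the pattern I kept pushing back on (the names and the parsing task are made up for illustration):

```rust
use std::num::ParseIntError;

// The style the model kept reaching for: crash on bad input.
fn parse_port_panicky(s: &str) -> u16 {
    s.parse().expect("invalid port") // panics, taking down the whole program
}

// The style I had to push for: surface the error to the caller
// and let it decide how to recover.
fn parse_port(s: &str) -> Result<u16, ParseIntError> {
    s.parse()
}

fn main() {
    // Both behave the same on valid input...
    assert_eq!(parse_port_panicky("8080"), 8080);
    assert_eq!(parse_port("8080"), Ok(8080));
    // ...but only the Result version lets the caller handle bad input.
    assert!(parse_port("not-a-port").is_err());
    println!("ok");
}
```

Getting the model to prefer the second form consistently took repeated, explicit instruction; left alone it would drift back to `expect`/`unreachable!` even in library code.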
What's the point of writing this? In a few weeks a new model will come out and make your current work pattern obsolete (a process described in the post itself)
Solidifying the ideas in writing helps the author improve them, and helps them and the rest of us understand what to look for in the next generation of models.
Ah, another one of these. I'm eager to learn how a "social climber" talks to a chatbot. I'm sure it's full of novel insight, unlike thousands of other articles like this one.
[dead]
[dead]
[dead]
[dead]
[dead]
[dead]
[dead]
This was on the front page and then got completely buried for some reason. Super weird.
On the front page at the moment. Position 12
Maybe I missed it. Sometimes when you're scanning for something your brain intentionally doesn't want to see it, I've noticed. Anyway I'm not Stavros obviously, just thought this was a good article.
Anti-AI conspiracy, obviously.
[flagged]
TL;DR: Don't, please :)