> We had weeks to ship what ended up being a million lines of code... Five months later, the repository contains on the order of a million lines of code across application logic, infrastructure, tooling, documentation, and internal developer utilities. Over that period, roughly 1,500 pull requests have been opened and merged with a small team of just three engineers driving Codex. This translates to an average throughput of 3.5 PRs per engineer per day, and surprisingly the throughput has increased as the team has grown to now seven engineers. Importantly, this wasn’t output for output’s sake: the product has been used by hundreds of users internally, including daily internal power users.
That's an insane level of throughput. What's a good baseline? Prior to agentic coding, what's the typical number of PRs engineers were expected to push? Maybe 2-10?
Do people feel the software has gotten better in the last 6 months? The number of engineers is probably the same, so we should expect maybe 5x faster cycle in major software apps, but I don't see it. The AI apps do change very fast, but given it's a very new field, I'd expect as much. But outside of that, I don't see it.
Here's a fun one: Firefox lists its current count at about 2.5M LOC, from roughly 1M commits over the years.
You end up with about 2.5 lines added per commit, which is not ridiculous when you consider that most commits would be edits rather than pure additions.
Here, we have 1500 PRs and 1M LOC, which is about 670 added LOC per PR. Remember, that's not 670 lines total in the PR, but a net +670 after additions minus removals.
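For anyone who wants to redo the division, here is a minimal sketch using only the figures quoted above. The constant names are made up for illustration, and note that a Firefox commit and a PR in this project are not the same unit of work, so the ratio is only a rough indication.

```python
# Figures as quoted in the comment above (all approximate).
FIREFOX_LOC = 2_500_000
FIREFOX_COMMITS = 1_000_000
PROJECT_LOC = 1_000_000
PROJECT_PRS = 1_500


def net_lines_per_change(total_loc: int, changes: int) -> float:
    """Average net lines (additions minus removals) left behind per change."""
    return total_loc / changes


firefox_rate = net_lines_per_change(FIREFOX_LOC, FIREFOX_COMMITS)
project_rate = net_lines_per_change(PROJECT_LOC, PROJECT_PRS)

print(f"Firefox: {firefox_rate:.1f} net lines per commit")
print(f"Project: {project_rate:.0f} net lines per PR")
print(f"Ratio:   {project_rate / firefox_rate:.0f}x")
```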
Fun questions for attentive readers:
- What does a project growing at a rate of one full firefox-codebase worth of LOC per year look like, a decade down the line?
- What does the line count say about the verbosity of the tool, and what does it say about outcomes that the purpose of the project isn't clearly disclosed?
- Do we have reasons to care about LOC in a world where we don't write code manually? What happens to token usage numbers when the codebase is significantly larger?
- If it was confirmed that LLM usage blows up your line count, what's the implication for codebases that want to return to manual coding after months of usage? (Say, because the tool gets expensive).
> - Do we have reasons to care about LOC in a world where we don't write code manually? What happens to token usage numbers when the codebase is significantly larger?
Yes, at least to the extent that we care about context windows and tokens consumed by coding agents processing code that is ultimately irrelevant to their assigned task.
Anecdotally, I've found that keeping file sizes small is important for agentic coding, not just to maintain human readability but also to optimize agent performance. Agents generally load entire files rather than reading only the part relevant to their current assignment, as a human might, so smaller files limit the amount of incidental context they load while working a problem. That reduces input noise, so the LLM generates a tighter solution, which in turn reduces input noise for future solutions. Or at least this strategy avoids a death spiral into exploding context length.
I expect (but cannot currently prove) that keeping overall LOC down yields similar benefits even when file sizes are kept small because it spares the LLM from parsing potentially relevant files that prove irrelevant to its current task.
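A rough way to put numbers on "incidental context" is sketched below. It assumes the agent loads whole files, and it uses a crude 4-characters-per-token estimate rather than a real tokenizer; `CHARS_PER_TOKEN`, `estimate_tokens`, and `incidental_share` are hypothetical names, and the file contents are synthetic.

```python
import math

CHARS_PER_TOKEN = 4.0  # ASSUMPTION: crude rule of thumb, not a real tokenizer


def estimate_tokens(text: str) -> int:
    """Very rough token estimate derived from character count."""
    return math.ceil(len(text) / CHARS_PER_TOKEN)


def incidental_share(file_text: str, relevant_lines: range) -> float:
    """Fraction of a fully loaded file's tokens that lie outside the
    lines the task actually needs."""
    lines = file_text.splitlines(keepends=True)
    total = estimate_tokens(file_text)
    needed = "".join(lines[i] for i in relevant_lines if i < len(lines))
    return 0.0 if total == 0 else 1 - estimate_tokens(needed) / total


# Synthetic files: the task needs the same 50 lines in both cases.
big_file = "x = 1\n" * 2000
small_file = "x = 1\n" * 200

print(f"2000-line file: {incidental_share(big_file, range(50)):.1%} incidental")
print(f" 200-line file: {incidental_share(small_file, range(50)):.1%} incidental")
```

The absolute numbers mean little; the point is only that the same 50 relevant lines drag along far more unrelated tokens when they live in a large file.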
Seconded on smaller files. I feel like I tend to get better responses faster.
A notable flaw here is that I’ve not tried large vs small files in a large codebase. Most of my experimentation there has been on personal projects where even a small file contains a significant part of the project. I could see degradation when it has to load 5 files to figure out how something works.
Total LOC (tokens, really, literal lines probably don’t matter) is interesting as a factor. That might go some way towards explaining why LLMs are weirdly good at Clojure.
E.g., last I checked, Anthropic's one-shot performance on Clojure was about the same as Python or Go, despite Clojure almost certainly being less represented in training data. The combination of density and simple primitives might be easier for an LLM to wrangle, ameliorating the impact of a less popular language.
>E.g., last I checked, Anthropic's one-shot performance on Clojure was about the same as Python or Go, despite Clojure almost certainly being less represented in training data. The combination of density and simple primitives might be easier for an LLM to wrangle, ameliorating the impact of a less popular language.
There might be tons of confounding factors there. One that comes to mind is the quality of the data: it may well be that the average Clojure snippet is higher quality because of user demographics. Very few people start writing code with Clojure, whether in college or during bootcamps.
Does the Firefox LOC include ALL forms of text: infrastructure (which Firefox doesn’t have), documentation, developer scripts, tests, etc.? How is the test coverage of Firefox?
We've known for decades that output metrics like LOC/day are very bad measures of real productivity in software. But they seem to be back in vogue in the age of AI, because AI is so good at maxing these useless metrics, and we need to show how impressive our AI is and how impressive our usage of AI is.
They never specified what exactly the product was, without which it's impossible to judge the post.
For some reason most of the uses of "agents" are to build yet other AI products, it's turtles all the way down. Maybe that says more about the field of harnesses than it does about the power of "agents".
There is a sense in which it doesn’t matter at all; many of the limitations of agents in large codebases are just the context management challenges. So proving that you can cohere and progress at O(1m) is a useful scale observation. “Can I use agents in my 1m line codebase?”
There is of course another sense in which the output quality is the only thing that matters. “Can I use agents to build a 1m line codebase that I want to maintain going forward.”
I take this as being exclusively a tech demo of the former. Quality (feature velocity, bugs, scalability) is not demonstrated.
Feels like the active discovery going on is trying to understand what is computer vs what is AI, for every product.
Agents help a ton with the discovery, but the act of building a product needs a deeper level of thought and validation to make it actually better than what came before. So IMO what you see is people still learning what needs to be understood and crafted first hand to make a product better (including economics)
It feels like the update cadence has indeed sped up. But not necessarily quality.
Looking at MS Office, I notice a lot of small changes recently that are mostly annoying. Things like Word comments losing focus after you @-tag a colleague, needing to click the Outlook search field twice before you can enter text, and the Outlook mobile date picker losing its ability to show your and your attendees' availability.
So it looks like lots of throughput, but unfortunately breaking features that work. Or wasting time on things that don’t matter such as the status bar of OneDrive search circling around the input field.
I’ve been vibe coding a lot over the past year or so, and I think I’m going to stop. In fact, I sort of want to challenge myself: can I go back to the fork in the road, the old Copilot autocomplete workflow, and really maximize that? Be in the driver’s seat for most of the code being written, but find ways to use AI to really enhance the flow state and remove blockers. Tools only, minimal actual code generation.
One workflow I like is writing a comment for what I’m about to do and then waiting a few seconds and then tab through the auto-completions. Then I check what the agent came up with, make some edits, and then on to the next block. That works well, I feel in control but don’t have to type as much.
I do use Claude Code totally hands off too, however. Mostly for UI tasks. Like themifying CSS or data grids and CRUD with all the bells and whistles; I hate that stuff, and cc gets it done in minutes and mostly right. It’s also super nice to say things like “user profile in the upper right hand corner” without having to fight CSS.
/if it’s not clear, I hate dealing with css and related frameworks.
I would be very impressed with someone who's been vibecoding "a lot" for about a year who could then go back to being fully in the loop for even 50%. I would even say I'd expect withdrawal symptoms at that point.
The dopamine hits are core to why people even do vibecoding (or vibecoding-in-a-dress/spec-driven development) and why they tend to overestimate its output so much. Hell, it's core to all forms of LLM-assisted development (because it feels like magic), but most of the other forms are more value, less delusion.
The dopamine hit is real. I feel like that was identified early on by OpenAI and probably lit a fire to get ChatGPT into the hands of the public. B. F. Skinner (I think) is the one who narrowed in on variable ratio reward systems to maximize operant conditioning. An LLM, with hallucinations and imperfections, is the perfect variable ratio reward system. It’s no wonder they’re getting pushed so hard along with a consumption-based pricing model. Whether you’re a human, rat, plant, or bacteria, there’s no real defense against that kind of conditioning.
That I don’t actually find vibe coding satisfying is one of the many reasons I’m going back. I feel a little of what you’re talking about, but I’m a nerd. I like to code.
But I’m not dismissing your concern, because it is one of the reasons I’m making this decision. I’m a professional. I’m not just here to feel good; I’m here to do a good job over the course of a career. All in, when you put together writing good, maintainable software, learning, staying mentally sharp, and speed, vibe coding could be less effective and maybe even, in the aggregate, “slower”.
I have been building an entire operating system (not figuratively).
Prior to AI autocomplete I wrote 500 LOC a day, then with AI autocomplete I could do 2500 a day, and now 50k is pretty normal. Walking around tech week with my phone yielded 150k this week.
The average efficiency improvement is closer to something like 2-3x per Anthropic’s numbers and this is only the rate at which software can advance. Do you expect to notice if 12 months of software engineering on a project you’re following gets done in 6 months? I suspect not.
The root cause is that the acceleration is Pareto distributed, so the modern engineering team at the moment looks like one 10x engineer, one 5x engineer, and the rest are approximately 1.5x engineers.
This almost reeks of "I've never cleaned up our code base because there is too much code, and didn't even bother having agents/LLMs clean it up".
You almost never need a million lines of code, and that includes your software, infra, testing, and operational tools. You didn't ship the Linux kernel in 3 weeks and you know it. The code is already spaghetti; it achieves the basic functions OK, but it will get harder and harder to simplify, untangle, and maintain.
Even the Linux kernel doesn't need millions of lines of code; most of the actual LOC is device drivers, and you don't need all of them, you just need the ones for the devices you have.
As a point of reference, 1MLOC is about the size of the entire Python standard library including tests, as well as stuff like IDLE. (Well, the Python part of the code. There's about half that much again of C in Modules/ .)
> should expect maybe 5x faster cycle in major software apps
To what end and what would that even look like though? Enshittifying everything at maximum speed? The apps/platforms I use regularly - GitHub, Spotify, Google maps (just to name a few), have gotten noticeably shittier in recent times.
Confirmation bias. The internet complained about software updates for decades before LLMs became ubiquitous. I made a career fixing human slop written by domain experts.
We easily forget that the great majority of software engineering is fixing the mistakes of other highly capable software engineers.
It's just so easy to blame the machine instead of admitting no one here is an expert on anything and that people count their hits and not their misses. If they counted both, we would find their probability of making a mistake to be higher than that of a frontier coding agent.
It's a hard-headed crowd, and everyone, LLM-pilled or not, suffers from the Dunning-Kruger effect. All of us.
Just look at the comments. Everyone is perfect when they do things themselves.
>GitHub, Spotify, Google maps (just to name a few), have gotten noticeably shittier in recent times.
What if AI lets you create new versions of those tools, but without the enshittification?
I say that being in the "soaking" stage of using AI to rebuild a shitty software project in 70KLOC over about 2 weeks of spare time, so this may not be as theoretical as you might think.
> What if AI lets you create new versions of those tools, but without the enshittification?
I'm not sure I fully understand what you're saying here. Isn't the value of these tools almost entirely independent of their actual software? That is, we have many good open source, self-hostable forges (Forgejo, sr.ht, etc.), lots of great music player software (Jellyfin, Symphonium, etc.), and decent maps software (OsmAnd and Organic Maps). People use GitHub, Spotify, and Google Maps -- perhaps even _put up_ with their often bad/glitchy software -- because of network effects (all three) and content/licensing partnerships (Spotify/GMaps). That proprietary data isn't something AI can help you with, right?
It really depends on the use case. For example, my most starred GitHub repo is a tool to convert Spotify playlists to YouTube Music (that was done pre-AI). GitHub depends on what issues you have with it, what your use case is, and whether you can leverage some of the network effects via its API. Maps, same story.
It is likely better because AI agents make access to domain knowledge easier. However, I would wager that the problem is people don’t remember the code well. The problems are going to be long-term as the pace of change increases.
If you think about it, successful products rely on designing well-thought-out experiences and on customer discovery (see all the Forward-Deployed Engineer job listings at OpenAI), so code velocity becomes somewhat irrelevant.
If you’re solving the right problem and you’ve got a good team then competitive advantage comes from somewhere OUTSIDE of code velocity.
The more important question I think is does faster code yield more value long-term? At the moment, it’s like yeah we do 3.5 pull requests per day.
I’m thinking, great, good for you. You could also combine three pull requests into one and then you’re doing 1 per day. This is quantitative data that doesn’t really mean anything tangible.
I've been doing the same experiment in tsz[1] for a while now (the same past five months in fact) and I have come to very similar conclusions. Lots of harness to enforce good architecture splits. Lots of tests and CI.
My point in working on tsz is to learn how to do very big projects with AI. Eventually the same workflows and attitude can be leveraged to build customer product apps with UI as well. I see that OpenAI is leveraging automated browser testing and even videos as part of their workflow. I think as models get better this direction for making software will eventually make sense. I don't think we're there yet though. But at least, unlike OpenAI's vague claims, I can share the output with you to see!
Most of the solutions that offer a very high level of automation like Lovable are a bit too optimistic and solutions are not tightly coupled with lots of automated testing.
Have you been able to extract libraries or tools from this project yet? If so how was that experience?
That is, do you see yourself releasing a metric harness, or sub-projects that are the equivalent of ActiveRecord, zod, or similar open source tooling that frequently originates in a large in-house project and is then exported as a stand-alone tool, utility, library, or framework?
Because while AI can reimplement minor tools, its utility depends entirely on the existence of solid tools, libraries, and frameworks.
Can you share what type of project that was? On the spectrum from a database engine to cat picture sharing web site (very high demand for correctness vs very lax).
1. Yes! Many teams internally have adopted a lot of the same practices we outlined in the blog post. Ryan has also been spending time both internally and externally helping companies figure out how to do this in their code bases.
2. Hmm, kind of. There have definitely been issues the models can’t one shot. But we still use Codex to write all the actual code with human guidance.
3. More agents :) Some teams are experimenting with centralized Agent mediated integration queues, others use normal merge queues, many have local Codex threads that monitor CI to resolve and land conflicts or failures.
4. Today’s models and codex app. We started doing all this with gpt-5 and codex-cli. The tools today, 9 months later, are so much better than what we had then.
Have you built any tooling or products around all of this and deploying it somehow? I’d love to learn more and share notes, because we’ve been doing this too. About 3100+ PRs merged across our 4 person team in 4 months. Impossible without harness engineering, and I agree, the tools are getting even better.
Have you been satisfied with the quality of code generated by the model? Or did you have to tweak some rule file or skill to improve it? Or is human-readable code not even a goal at this point?
Does it yield good results?
I found that instead of docs it’s easier just to ask ai to read code.
I feel like this is the same as comments in code: they become outdated fast.
I don't really use "docs" for documentation. I've prompted Claude/Codex to always write a "log" and save it in-repo to track what it did and why.
I've found this to be really helpful, e.g. "you did this last week, and now some other thing is happening" or "you tried this approach before to solve alert X but it didn't work" -- except it can discover this itself.
I've also used it to store TODOs and plans. For example I might want to explore some idea and defer it for later, or some weekend have it execute on some tech debt I've put off. One last use case is asking "what did I work on in the last 2-3 weeks, is it healthy, and what additional quality checks can/should I do; is there any follow-up work?"
I haven’t actually noticed that, but I’m not sure why. Maybe because I specifically describe it to the agent as a work log rather than documentation? I’m not sure
it does not yield great results left unattended; it’ll start creating slop or hardcoding solutions
but over time, if you adjust your verification rubric, it’s not too bad and gets pretty good. if you do make it do TDD, it gets kinda crazy and you’ll have 2000-3000 tests after a while, or in my common case, 6000-7000 lines of code in single files (i usually have a cron to audit files for decomposition and create tickets)
i wouldn’t use it at my job yet, but it’s been fun to use for personal projects - it’s like modded minecraft automation or factorio
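A file audit like the one mentioned above could look roughly like this. This is a minimal sketch, not the commenter's actual script: the 500-line threshold, the suffix list, the skipped directories, and the ticket title format are all assumptions, and scheduling plus real ticket creation are left out.

```python
from pathlib import Path

MAX_LINES = 500  # ASSUMPTION: arbitrary threshold for "needs decomposition"
SOURCE_SUFFIXES = (".py", ".ts", ".go", ".rs")  # ASSUMPTION: adjust per repo
SKIP_DIRS = {".git", "node_modules", "dist", "build"}  # ASSUMPTION


def oversized_files(root: Path, max_lines: int = MAX_LINES) -> list[tuple[Path, int]]:
    """Return (path, line_count) for source files longer than max_lines,
    longest first."""
    found = []
    for path in sorted(root.rglob("*")):
        if not path.is_file() or path.suffix not in SOURCE_SUFFIXES:
            continue
        if any(part in SKIP_DIRS for part in path.relative_to(root).parts):
            continue
        with path.open(encoding="utf-8", errors="replace") as handle:
            line_count = sum(1 for _ in handle)
        if line_count > max_lines:
            found.append((path, line_count))
    return sorted(found, key=lambda item: item[1], reverse=True)


def ticket_titles(found: list[tuple[Path, int]]) -> list[str]:
    """Turn audit results into human-readable ticket titles."""
    return [f"Decompose {path} ({count} lines)" for path, count in found]
```

Calling `ticket_titles(oversized_files(Path("src")))` would give a list that a scheduled job could hand to whatever ticketing system is in use.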
I like the idea of saving the work done into files - helps to prevent the llm from redoing the same work. Maybe one day instead of code in a repo it will just be a list of prompts.
The other day I came across a video showing workers in an e-vape factory. They pick up a bunch of e-vapes from the conveyor belt (each bunch has 6 e-vapes, I think), stick them in their mouth, and vigorously vape all of them for about 5 seconds, then test the next bunch. Humans reviewing hundreds of lines of change in a PR written by AI is not very different.
You can do statistical testing of the e-vape line because you have specific criteria and well-defined tolerances that you can apply on a per-sample basis and that the factory meets with some acceptable 9s of reliability.
PRs are not like this because a single bad PR can be catastrophic for your business in a way that a single bad e-vape cannot.
I would also argue that the current output from the AIs when sampled by software engineers regularly doesn't meet the bar of quality we want in our product, hence the need to review every PR and fix a substantial fraction.
If you can start to bound the impact of changes and the outputs begin to be generally acceptable unsupervised, such that all you're doing is double checking that nothing has regressed in the factory, then the sampling approach can work.
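To make the sampling argument concrete, here is a toy model. It assumes each change is independently defective with some fixed probability, that each change is independently picked for review with a fixed probability, and that a review always catches the defect. None of these assumptions is realistic, and the 1% defect rate below is invented purely for illustration.

```python
def escape_probability(defect_rate: float, review_fraction: float, n_changes: int) -> float:
    """Probability that at least one defective change lands unreviewed.

    Toy model: each change is defective with probability `defect_rate`,
    is reviewed with probability `review_fraction`, and every review is
    assumed to catch the defect. All events are independent.
    """
    unreviewed_defect = defect_rate * (1.0 - review_fraction)
    return 1.0 - (1.0 - unreviewed_defect) ** n_changes


# Illustrative only: 1500 changes with a made-up 1% defect rate.
for fraction in (0.1, 0.5, 0.9, 1.0):
    p = escape_probability(defect_rate=0.01, review_fraction=fraction, n_changes=1500)
    print(f"review {fraction:.0%} of changes -> P(at least one escapes) = {p:.4f}")
```

Under these assumptions, anything short of reviewing every change makes an escaped defect very likely over 1500 changes, which is why sampling only becomes acceptable once the damage from a single bad change is bounded.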
I'm not an AI skeptic but I'm skeptical of the intent of this article. It makes great claims about agent-first engineering and tries to make a real case based on a real product, with real users, and a real team that's been growing — all without even saying what was built or showing it, just like every other AI hype article.
At the time we wrote the article we hadn’t released the product and weren’t ready to talk about it. It was an internal prototype that looked very much like the current Codex app.
This might work only if you have “infinite” compute and infinite tokens.
As someone that used the $20 plan, this pure agentic approach is impossible to do because I’d hit the limit fast and I would end up with less outcome.
What I found works incredibly well is to provide human-written code as a reference and ask it to extend it. So I scaffold the entire thing, architect it, and write a few code samples (controllers, services, models, components, database schema, how auth works, etc.) so the LLM can have a head start on its attention (pun intended).
I usually write a stub with a lot of details on how to implement it, something like higher-abstraction pseudocode. Then I ask the LLM to implement it.
When it fails, it is often better to undo all the changes, adjust the stub so it catches what failed before, and try again.
Or, commit the changes, and use a new fresh context and only address what went wrong.
-
Whenever I tried this agentic from scratch approach, I always end up disappointed; both on the outcome and on the limit that I hit before an hour even passed.
Instead of reading articles like this one end to end, I ask AI to read them in detail and prepare a new harness for me. The important part is not to do this in a single prompt, but to first create a detailed plan and let the model think deeply about each aspect. This approach lets me build the new harness without the missing didactic information you mentioned.
Basically, I am moving from “I build products without writing or reading the code” to “I build products without writing or reading the harness.”
Once the new implementation harness is prepared, I start it, but I keep the original session open. In that original session, “we” monitor the implementation harness from the outside: how effective it is, where the bottlenecks are, what breaks down, and what could be improved. From time to time, the monitoring session suggests changes to the implementation harness. We apply those changes, restart the harness, and monitor it again.
The overall approach is not to spend X hours understanding an article like this in detail, because another similar article will appear in 3 weeks. Instead, I take immediate action, learn on the fly, and replace the harness when a better pattern emerges. And yes, I still have to spend X hours on setting up, monitoring and fine tuning the new harness, but at the end I have the latest fancy "thing" working for me.
I do quite a lot of what this post describes in a reasonably large project. Here's what works for me:
- write gherkin features for new features; update them for enhancements; don't touch them for refactors. Label your PRs with these nouns.
- use pre-push hooks for type checks, linting, unit tests, and other quick, scriptable validations.
- make a VitePress subsite in your repo, have the agents maintain it - document important principles, architecture, etc.
- make a cli command which lists all pages along with the yaml frontmatter description so agents can choose what to read without blowing up the context window.
- use ddd and monorepo - write your logic in headless layers, and compose layers into apps. agents navigate layers very successfully.
- use zod (or your language equivalent) and contract-first API development; this is my favourite bit tbh, I use orpc
- make a single skill called "code" which describes the lifecycle: open a worktree, setup .env to guarantee no conflict with other agents (choose unused ports etc - docker is good here), write or update feature file (this is where you negotiate the spec), implement, validate (e.g. using playwright mcp), pre-push checks, push and wait for review, tear down and fast forward main
- testcontainers is great for ensuring multiple agents can run tests that don't conflict
Seriously I only have one skill that's it. Everything else is in the docs. I'm feeling very productive like this, in a "making good software" sense not a LoC sense.
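The "cli command which lists all pages along with the yaml frontmatter description" from the list above could be sketched as follows. This is an assumption-heavy illustration, not the commenter's tool: it uses only the standard library, handles only simple single-line `description: ...` entries rather than full YAML, and the function names are hypothetical.

```python
from pathlib import Path


def read_description(path: Path) -> str:
    """Return the `description` value from a leading `---` frontmatter
    block, or "" if there is none. Handles only simple `key: value` lines."""
    lines = path.read_text(encoding="utf-8").splitlines()
    if not lines or lines[0].strip() != "---":
        return ""
    for line in lines[1:]:
        if line.strip() == "---":  # end of frontmatter
            break
        key, sep, value = line.partition(":")
        if sep and key.strip() == "description":
            return value.strip().strip("\"'")
    return ""


def list_pages(docs_root: Path) -> list[tuple[str, str]]:
    """Return (relative_path, description) for every Markdown page."""
    return [
        (path.relative_to(docs_root).as_posix(), read_description(path))
        for path in sorted(docs_root.rglob("*.md"))
    ]


def format_index(pages: list[tuple[str, str]]) -> str:
    """One line per page, short enough for an agent to scan cheaply."""
    return "\n".join(f"{name}: {desc or '(no description)'}" for name, desc in pages)
```

An agent would call something like `print(format_index(list_pages(Path("docs"))))` and then open only the pages whose descriptions look relevant, instead of reading the whole docs tree.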
I agree with many of the points made by nimonian above (esp the one starting with 'make a single skill called "code" which describes the lifecycle'), based on my limited experience with these things.
That's a big ask. This kind of harness usually contains plenty of proprietary insights about their business. And also, nowadays, a good harness is a major competitive advantage.
Your hostile tone is unfortunate, especially since my post was actually friendly. I was just trying to point why it is very likely the OP won't give you what you're asking so you're not left confused if he ends up ghosting you.
Many people use the term harness to refer to the agent coding software (e.g. Opencode, Claude Code...); I use this term more broadly to refer to the environment (set of skills, system prompts, constraints, memory, hooks, etc.). What the OP is referring to is not just one giant skill. It's usually a comprehensive ecosystem of skills, bespoke tools to make certain agent tasks deterministic (e.g. localization), and so on.
I've seen someone post GitHub repos in this thread. These can be very useful, especially if you use the same tech stack, but you won't reach the level of productivity reported by successful teams unless you invest substantial time to build your own harness. The way to do so is progressively: start with something simple to address the need you have on day 1. Then turn recurring prompts into skills, turn recurring coding patterns and coding style recommendations into guidelines, turn repetitive tasks for which the LLM tends to build a Python script that it occasionally gets wrong into a deterministic tool documented in a skill, etc.
And after a couple of days, weeks, and months, you'll have a very dependable harness giving you optimal productivity, without needing to invest weeks of work upfront or take the fun out of agent-assisted coding.
I have an example of a side-project [1] where I think I naturally applied the best practices described in this article. My goal was to see if it's possible to code an entire project using a single agent (Claude).
To do this, I "simply" asked the agent, every time it encountered an issue, how to resolve it using a validation tool or script. I also asked it to code these tools during audits. As a result, I now have 30+ rules [2] for validating its commits. It's working pretty well now.
A lot of these blog posts are trying to catch on to the next buzzword, "harness". It's close to the productivity porn mindset that we witnessed 10-15 years ago, where creating the complicated system is more exciting than using the system for daily tasks.
I agree. I followed this article for a repo I'm working on, and I had a very hard time inferring how, specifically, they implemented "providers" and enforced import layers. A sample repo would've been nice.
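Since the article doesn't show how import layers were enforced, here is one possible way to do it. This is a guess at the general technique, not the authors' implementation, and it does not attempt to model their "providers". The layer names are hypothetical, and the check only looks at absolute imports in Python source.

```python
import ast

# ASSUMPTION: hypothetical layer names, lowest first. A layer may import
# from itself and from layers listed before it, never from layers after it.
LAYERS = ["domain", "services", "api"]


def imported_top_level_packages(source: str) -> set[str]:
    """Collect the top-level package of every absolute import in `source`."""
    found: set[str] = set()
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.Import):
            found.update(alias.name.split(".")[0] for alias in node.names)
        elif isinstance(node, ast.ImportFrom) and node.module and node.level == 0:
            # level > 0 means a relative import, which is skipped here.
            found.add(node.module.split(".")[0])
    return found


def layer_violations(module_layer: str, source: str) -> list[str]:
    """Return the higher layers that a module in `module_layer` imports."""
    rank = LAYERS.index(module_layer)
    higher_layers = set(LAYERS[rank + 1:])
    return sorted(imported_top_level_packages(source) & higher_layers)


sample = "import api.routes\nfrom services import billing\nimport json\n"
print(layer_violations("domain", sample))
```

Run from a pre-push hook or CI step over every file in a layer, a non-empty result would fail the check, which gives the agent a deterministic signal instead of a prose rule it may ignore.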
> We tried this early on — used ChatGPT as "project manager" to set up the entire harness before writing any code. After a week it produced 140+ docs of rules, architecture, frameworks. Zero lines of code. When we finally brought in another tool to review, the verdict was: "a perfectly secure empty safe." The harness was immaculate. There was just nothing inside it.
>
> Harness matters, but if you're not shipping code alongside it, you're just writing fiction.
I worry most about blindspots with this kind of approach. Let's say that this repository goes on for years, at which point the docs folder is several MB in size. Would Codex be able to think outside of the box? Or would the aggregate of the Markdown content fundamentally cover enough ground to prevent it from thinking of novel new approaches to existing problems?
You tell it to update the docs: not append. I've done the same thing with a readme in the root with links to the docs. After every commit, before the push, I have my agent "update all relevant and related docs, add or remove what's needed" or something to that extent. And it works remarkably well. I also have an append only change log it's supposed to add to. Between that, good commit messages, and comprehensive testing, I've built a homebrew OS and updating it is remarkably smooth. Runs a homebrew FTP and HTTP server and can run Wolfenstein. Working on DOOM right now. Close, but sound has been difficult.
Someone else in the comments said to have it make a static website with the info instead with clickable pages and sections so it reads only the content it needs to rather than dumping a long file into context windows. Although I suppose you can have a ToC in the readme too with multiple smaller markdown files as references.
Yep. You’ve got to have it update the docs. After a few sessions, if I forget to request this, opus starts rehashing the same tasks and finds that they are complete - and sometimes still won’t update those docs unless I ask.
Another tip is to condense the doc files into the minimal required. Sometimes I’ll end up with 5 to 6 floating around in various states of staleness. Condensing to 2-3 and removing completed tasks seems to help a lot
It’s not a self-coding machine. There is a human in the loop; they even added MORE engineers to the team of this project! 7 engineers should be able to collaborate with the AI to find good solutions to problems.
Another breathless sales pitch selling pickaxes to miners, but where's the gold? Where's the incredible product that the chatbots-talking-to-chatbots over git generating LOC heaps have actually _created_? I just don't see it.
What I still can't understand is why a massive amount of generated code is a flex. I don't feel that software has gotten a lot better in the past 3 years, only sloppier. It's surprising to me that people who know about reward hacking choose a simple objective like lines of code generated as a signal for quality. I'd argue you have to optimize for as few generated lines as possible, with readability for humans as a secondary objective. I suspect it's not seen as a problem by providers because more lines generated means more tokens used and hence more billed to customers.
And if I am working on an existing codebase, isn't a good commit often net negative between added and removed lines? I don't want to bloat my codebase but make it more polished and elegant. After reading that, I wonder if what they have done could have been accomplished with a far smaller LoC budget.
Lines of code has always been a terrible metric. But all else being equal it is a measure. If all else is not equal, which is usually the case, then it's not.
A lot of the focus has been on AI recently.
Three years ago we didn't have software where a non-software engineer can describe what they want in English and get working (-ish) software generated by other software? Is that not "software has gotten a lot better"?
Other than that I'm not sure how we measure "software has gotten better". New applications? More features? How do we measure sloppier? Is Google Maps suddenly taking you the wrong way more often? I'm not really doubting your subjective experience but seriously, how do you tell? I mean a doc is a doc and a spreadsheet is a spreadsheet.
We're also only about 10 months into models that are powerful enough to potentially make a bigger difference and we are still figuring out how to use them best.
> Lines of code has always been a terrible metric. But all else being equal it is a measure.
A terrible metric is _worse_ than no metric. A terrible metric can _only_ lead you in the wrong direction. "No metric" means saying we don't know, and that leads us to stop and reconsider. But we've taken "move fast and break things" as a mantra, and we'd rather run towards any direction than stay still.
Using LoC as a metric for quality of LLMs will promote LLMs that write more code. It's better to say we have no way to compare different LLMs than it is to say "let's use the LLMs that produced more LoC because at least we can measure that". We, as an industry, should be focusing on developing better metrics for quality, not on improving LLMs based on known-bad metrics. We should be turning to the computer scientists, not to the venture capitalists.
When a pundit talks about how many lines of code an LLM has created, we should lose all respect for them. It's as if someone talking about physics measured the phlogiston, or as if a doctor started measuring our skulls. We know these theories don't work, and anyone using them should be mocked.
I personally don't view coding agents making software as "software has gotten better": you are comparing a tool and the end result, and these are two different things. The agent you use going down and your product going down mean two different things to your customers. I will not deny that we have made incredible progress in coding and, hell, even design over the past 3.5 years; this technology is here to stay.
That being said while I agree that measuring better quality of software is vague (part of the reason it is hard for models as well), there are universal things I believe every engineer will agree on. Reliability, uptime, customer feedback, legibility of your engineering, performance, these are things we often optimized for. Google Maps is a bit of a strawman because neither of us (unless you work on it), knows how much agent code there is, I think it is likely that it's little since it was working fine prior to 2023. I could bring up github reliability as an example, given how much copilot usage they promote at MS, but once again only folks there know for certain. I do, however, see scores of various AI powered SAAS that looks like it is in a perpetual MVP state. I think you are right in that even if agents give us "good enough" results and we can swallow failure rates and our increasingly lesser understanding of what we, or more so model, created, then it is still progress overall, but this is progress not to human-AI collaboration but to AI-only engineering IMO, this is good or bad depending on how you view the future.
I'm a scientist and most of code I currently write is somewhere on the intersection of critical software and machine learning, squaring these two is not easy and I guess the way I was taught to reason about engineering informs my opinions on this. Maybe it's just a matter of time before codex can help here in an unconstrained manner as well, but I am skeptical at the moment.
I also can't help but notice they didn't mention how many tokens were burned, or how much that translates to in terms of cost over the 5 months at enterprise AI prices. I'm going to guess this wasn't a cheap demo.
> Is Google Maps suddenly taking you the wrong way more often?
Funny you mention that because I had that issue in a cab just yesterday. Google decided to drive us off the main road onto a series of small roads which happened to be a dead end. My guess is that the AI decided that this was a shorter road? A less busy road?
That being said, Google Maps has been gradually degrading. Most notably, its search function is quasi-broken now.
Well, to be fair, the amount of goalpost shifting that is going on is quite intense. AI not being able to work in a "serious" project, and being limited to "toy projects" has been a long standing critique.
But also, bigger projects need some amount of loc written and it's a bit silly to pretend that this is not the case or a bad thing.
So the answer to the question is roughly: Establishing that an agent can work in a large-ish code base is valuable, because 1) them not being able to do so has been a critique and 2) it's something that is required for a lot of software projects.
I'd argue LoC isn't actually a meaningless measure, but people use it the wrong way. The same program with the same features but less LoC is more likely to have a proper design and architecture, and is most likely easier to change and maintain in the future. Of course, only if it's less LoC because of proper design, not because you've folded everything to one line.
So if anything, we should find a way to aim for as little lines of code as possible. If you have two agents, and one can build exactly the same program as another, but with half the LoC, then most likely the first agent is better at software engineering and particularly software design.
Of course, as the author of an experiment that investigated exactly this, I'm slightly biased. Cursor's browser had millions of lines of code, which sounded weird to me based on the features and functionality it had. Meanwhile, I built the same thing but actually thinking about the design with the agent and ended up with ~20K lines of code instead.
I don’t think it’s solvable. And I think Anthropic etc know it. LLMs can only reconstitute things in its training data and they are so hungry they can’t do a good job in long lived codebase full of complexity and novelty. There’s never going to be enough similar code on the open internet.
> LLMs can only reconstitute things in its training data
Such as a 4D raytracing engine in Metal? Or integrating APIs for features first released months after their knowledge cut-off date?
LLMs have shown an ability to transfer "knowledge" and capabilities across domains, languages, and use-cases outside their training data.
Case in point: GPT-2 "learning" to translate English to French and vice versa despite non-English examples having been voluntarily (and almost entirely) removed from the dataset.
In "Language Models are Unsupervised Multitask Learners"[0]. Not sure whether it’s "the" GPT-2 paper.
3.7 Translation
> Performance on this task was surprising to us, since we deliberately removed non-English webpages from WebText as a filtering step. In order to confirm this, we ran a byte-level language detector on WebText which detected only 10MB of data in the French language […]
The "lines of code" at this point are basically the same thing as binary code that comes out of a compiler - something you almost never look at and certainly won't try to touch by hand.
The actual "code" is everything driving the harness.
The current problem with this is that the harness is not (yet) deterministic, so it's sort of like having a compiler where your output program works slightly differently every build, and then the compiler tries to just patch the binary programs when you recompile to minimise this problem, or even worse, disassembles the whole thing to figure out what it does, makes the change, and then recompiles it.
> Because the repository is entirely agent-generated, it’s optimized first for Codex’s legibility
I asked a question from a perspective of a human engineer, as in, I will have to read the code and understand, fix it once it breaks. OpenAI approach is opposite, even if it is breaking it is the agent that will be doing the fixing, millions of lines and inelegant designs don't matter because human readability doesn't matter. In any case you use more tokens so you fork over more money.
I will say, however, that IMHO there is objectively bad and good code in terms what it can do and performance, if I can do the same thing in 50 lines as opposed to 1000 lines, this difference still matters for the model. Smaller context usage, better approach that informs downstream generation.
This is the part I think we will see become more relevant.
I created docs-cli (pypi) to manage the index of specs as source code: the framework that goes with it will first create tests for as much as it can, so reproducibility becomes the goal, not readability.
>I suspect it's not seen as a problem by providers because more lines generated means more tokens used and hence more billing put out on customers.
I have also grown skeptical of token usage in order to run up my bill! But since I feel like it takes me MORE effort to write LESS lines of code myself, I'd expect a quick and dirty AI-generated solution to be MORE lines of code and cost LESS to generate than a concise/elegant solution in LESS lines of code.
The latest frontier models will write code better than you and more elegant, with less lines of code, in 100th of the time, with full test coverage. Hand coding is like writing out assembly/machine code rather than using a compiler.
This hasn’t been my experience. State of the art models available to the public still do all sorts of bandaids and bad hacks. Putting code where it doesn’t belong. Stapling types onto variables (in TypeScript) when abstractions/types already exist to use. I use it to generate code, but still have to review every line and have corrections/steering basically every time.
Insisting on writing code by hand when LLMs are available is not software engineering in 2026. Engineers find the most cost-effective solution for the problem at hand that meets the requirements.
Far from everybody is using it for massive code generation - code size is still very much a liability. I currently use it (ds4-flash, local, code isn't open to non-local) to modify and trim a certain internal project for a different use case. Tons of review passes, fuzz test generation and running, "give me your thoughts on this", etc, etc - basically iterating over and over and over. Funnily enough, my intrinsic knowledge of the codebase improves massively like this. It is involved, but the acceleration is real - and I don't have to worry about wrist nerves as much.
The only people I know that have LoC/token use/etc metrics imposed on them work for big corps where such things are (or used to be) en vogue.
I think it's a rebuttal against claims that LLMs are incompatible with large code bases. It's not so much a flex about the quality of the code, it's more a flex about the complexity of the code and the LLMs ability to deal with such complexity.
Whether or not that complexity is warranted is a different story.
The codebase may be bloated by a factor of 10 but if the costs associated with that are less than the costs of developing the software from a business standpoint the choice is clear.
Yeah so all of personal computing—text editing, SVG antialiasing, etc, fits in 20,000 LOC (VPRI's STEPS project) so a million lines of code is 50 reïnventions of personal computing. BUT: it is unlikely that humans would have solved this problem in 20 kLOC. Sussman said “we really don't know how to compute!” as his talk title and LLMs had to ossify some pre-existing voice as the forever programming habitus and it chose a persona that doesn't know how to program—because we don't—and now we are stuck with it. Claude is our tickets, our implementations, our documentation... And if you tell it “hey the node role should not have those permissions, that should be a service account” it will happily do the right thing, but it has no intrinsic sense of taste, and the error message it's trying to clear just says “the node role doesn't have that permission” and the system prompt says “keep it short, stupid”, and graybeards might be our last bulwark.
I've heard it said that measuring productivity of a software developer by lines of code added, is akin to measuring the productivity of an aerospace engineer by mass added.
It is a metric. It is often not a good metric. But it is easy to measure.
Indeed. The more you do to add complexity or generate without ensuring it isn't pointless code adds bloat. It is still probably more MVP than junk developers will do, but not ever anywhere near as great as someone who studies a programming language.
> It's surprising to me that people who know about reward hacking choose a simple objective like lines of code generated as a signal for quality.
The simple answer is that promoting locs as a relevant metric is also reward hacking. Is it easier to promote big loc counts as a key metric, or is it easier to prove agentic engineering against harder metrics?
On a more general note, software practice marketers have been pushing in that direction for quite a while. "You need cloud", "Here's how to do agile at scale", "microservice everything", etc.
> I suspect it's not seen as a problem by providers because more lines generated means more tokens used and hence more billing put out on customers.
Generating elegant code under more restrictions takes more thinking tokens and stronger adherence to instructions. So the naive view that they are doing it for billing is wrong.
It's not massive amount of code for the sake of it, it's that it built a complex large piece of software itself. If you compared the linux kernel LOC to some guys toy kernel on GitHub the LOC difference should give you an idea on the complexity. "Yeah why is Tolstoy flexing about his number of pages in his book, I wrote Spot goes to school and its only 20 pages"
People want to do X, so the metric is how much X can be done.
Everyone is over-complicating the explanation. The answer for "why are we fixating on this bad metric" is almost always the same pattern.
Broad audiences need simple metrics to talk about. If the metric itself requires nuance, it's hard to communicate and hard to reason about. It's easier to push the need for nuance from understanding the metric itself down the road to where the metric is applied, which allows everyone to ignore it in immediate conversation.
It is a very valid question. My intuition (no grounding) is that it comes down to model training. Optimizations have traditionally worked well in human-written software through the developer's experience, the use of architectural patterns, or a second or third pass of fine-tuning. In the case of model-written code (e/p one token at a time), the only possible architectural optimization is either a strict guardrail on which patterns to use for a specific implementation, or a second or third optimization pass. All of which burns more tokens, but can lead to better software.
I don’t think the flex here is the amount of code alone. Their goal is to show that AI can improve productivity, the number of lines is just the proxy to that. This article is a marketing piece after all.
Now someone can argue that lines of code are not a good proxy of engineering productivity, but I wouldn’t be surprised if the audience they target with this content is not the HN commenters of this thread.
Correct on the first part, partially correct on the second. LOC is a bad metric, but it is at least a legible one. Lots of people working on better ways to measure Software Productivity!
You can't really do that here, because one of the key arguments for this, and the one people in the thread focus on, is the "1/10th of the time" estimate. The comparison with humans is already here, albeit only as an estimate; no actual comparison has been done.
This is a problem of conflicting incentives that exists today in my opinion. Companies will market greater human-AI collaboration in science and engineering but focus on releasing things like this where it is clear that downstream goal is complete agent ownership over the product, from inception to testing to monitoring. Maybe the speculative future agents will use their own very efficient language to code that won't be readable for people at all. They focus on agent code being readable by agent in the article, as you've said. But in my mind in at least near future, there is a case where your prod will break, you won't be able to understand it or the attempted fixes. Maybe agent will fail to fix it at all and start a massive rewrite. In any case is this different from kicking technical debt down the road along with worse interpretability of what you have built?
I do think there is a way where agent can write great solid code that we can read, but with the way LLMs are built this requires something new in terms of reward that accounts for "taste" and constant refinement so it might take more than 1/10th of a time to produce something good.
Because sloppy code doesn't matter. He wrote that he completed it in 1/10th of the time. That's equivalent to 1/10th the engineers.
That is a business win. That is really all that matters in capitalism.
The flex is a direct insult to your face. He is shitting on the faces of all software engineers (me included). It is equivalent to saying we don't need you to code anymore. One man can produce 10x the code.
So why am i voting him up even though he's shitting on my face? Because what he says is true. I value honesty and people who say things like it is. Yes my identity as a software engineer is getting dismantled before my very eyes. But the solution to this problem isn't some delusional statement about not understanding what he's flexing about. We're not stupid. Everyone on this thread understands his flex. The difference is some people like you don't want to understand it.
Like seriously. He literally wrote it was completed in 1/10th of the time and you expect me to believe that YOU don't know what HE is flexing about? Be real. You're not stupid.
The real flex is delivering it in one-tenth of the time. Mentioning lines of code is mostly noise.
I’ve worked with 20-year-old codebases and products that grew organically over decades and still sit well below a million lines of code. Using LOC as some kind of health or success metric makes me more suspicious than impressed.
It's a huge flex if the alternative is no code at all. Reward hacking aside, LOC resonates with me in the sense that I've seen 10+ projects to fruition that wouldn't have even begun without an agentic harness and an LLM.
It's like the difference between doing stock price predictions with binary "up" or "down" histories and trying to figure out how to normalize actual price histories (basically impossible). The binary work gives a well-defined signal.
Yep, that's true: pre-grounding is very much worth it. If you just feed the agent a quick environment brief upfront instead of making it spam tool calls to figure out where it is, you save a lot of tokens.
Codex updates usually appear every few hours (I am not saying this is how often they're published, but that's my perception as a user). Often I update Codex just to see a new update within an hour or so.
Many times those updates are not properly tested. For example, in one update the model selector got completely changed,
then the next hotfix was pushed, which restored the original.
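For what it's worth, here is a minimal sketch of what such an environment brief could look like. The function name `environment_brief` and the choice of fields are my own assumptions, not anything from the article or from a specific harness; the idea is just to collect the facts an agent would otherwise discover through several tool calls and paste them into the first prompt.

```python
import os
import platform
import sys


def environment_brief(root=".", max_entries=20):
    """Build a short plain-text summary of the working environment.

    Hypothetical helper: the fields below are illustrative, pick whatever
    your agent keeps asking about.
    """
    entries = sorted(os.listdir(root))[:max_entries]
    lines = [
        f"cwd: {os.path.abspath(root)}",
        f"os: {platform.system()} {platform.release()}",
        f"python: {sys.version.split()[0]}",
        "top-level entries: " + ", ".join(entries),
    ]
    return "\n".join(lines)


if __name__ == "__main__":
    # Paste the output at the top of the agent's first message.
    print(environment_brief())
```

Keeping the listing capped (`max_entries`) matters, since the point is to save tokens rather than dump the whole tree.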
I find it so interesting "Agent legibility is the goal" picks up James C. Scott term (without defining it, so I assume that's what they mean) which is _not a good thing_. Legibility is a governance effort to box in life.
This matches quite verbatim for my cursor based agentic repo.
There isn't anything here that wasn't already experienced and factored into constructs in the repo.
And I also find that all of the bits created for an effective agentic engineering project match mainstream engineering best practices perfectly. That has been one of my primary reasons to go all in on agentic engineering; prior to this, applying best practices was always too costly and conflicted with the team's daily priorities.
I am at a major company that is essentially vibe coding. I’ve shipped about 100k LoC this entire half and am toward top 10% of my team. I find it likely that either
A. The code is absolute garbage and is speed for speed sake
B. They’re using an internal model that is a generation beyond GPT 5.5
I say this because we’ve attempted to do something similar using the latest gen Claude models and a significantly larger team. The code is probably along the lines of millions LoC but is an absolute mess because of vibing. There’s a price you pay for speed
Q1 - How much effort did you put into deterministic guardrails like AST linters, etc?
I find there’s a ton of slop unless hard guardrails are added, eg step 1 is just around syntax, step 2 is to enforce mental models
You still need someone steering direction and have a logically consistent idea of what you actually want to build
Q2 - I find that vibe coding really accelerates FE projects because it’s possible to run everything locally and check results
For pure distributed infra backend more investments have to be made into the devloop to be able to shift left the feedback loop and decouple it from humans or real deploys
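As a concrete illustration of the "AST linter" kind of guardrail from Q1, here is a small sketch using Python's standard `ast` module. The two rules (no bare `except:`, no calls to names in `BANNED_CALLS`) are examples I picked, not rules from the article; the point is only that the check is deterministic and can fail CI regardless of what the agent claims.

```python
import ast

# Illustrative rule set; which calls to ban is an assumption.
BANNED_CALLS = {"eval", "exec"}


class GuardrailVisitor(ast.NodeVisitor):
    """Collect (line, message) findings for a couple of hard rules."""

    def __init__(self):
        self.findings = []

    def visit_ExceptHandler(self, node):
        # `except:` with no exception type swallows everything.
        if node.type is None:
            self.findings.append((node.lineno, "bare except"))
        self.generic_visit(node)

    def visit_Call(self, node):
        # Only plain-name calls are checked, e.g. eval(...), not obj.eval(...).
        if isinstance(node.func, ast.Name) and node.func.id in BANNED_CALLS:
            self.findings.append((node.lineno, f"banned call: {node.func.id}"))
        self.generic_visit(node)


def lint(source):
    """Return sorted findings for a string of Python source."""
    visitor = GuardrailVisitor()
    visitor.visit(ast.parse(source))
    return sorted(visitor.findings)


SAMPLE = """\
try:
    value = eval(user_input)
except:
    value = None
"""

if __name__ == "__main__":
    for line, message in lint(SAMPLE):
        print(f"line {line}: {message}")
```

Syntax-level rules like these are "step 1"; enforcing mental models (step 2) usually needs project-specific checks on top.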
I like how they said they were spending 20% of their time addressing slop. Sounds like they’ve tried to automate the slop correction but it’s a good honest reminder.
Additionally it’s an internal tool, which is likely much more amenable to slop.
1. What’s the job satisfaction like day to day being an engineer on this project? How have they adapted to this way of working?
2. How much did it cost? Work is being done whilst the engineers sleep but if that 6 hours overnight task cost $300 and could have been done by a person in 2 hours is it a real saving?
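The cost question in point 2 can be made concrete with a back-of-the-envelope break-even. This sketch uses only the hypothetical numbers from the comment ($300 for the overnight run, 2 hours for a person) and deliberately ignores review time and the value of the work being done while nobody is at a keyboard, both of which are assumptions you would have to fill in yourself.

```python
def break_even_hourly_rate(agent_cost_usd, human_hours):
    """Hourly human cost above which the agent run is the cheaper option."""
    return agent_cost_usd / human_hours


if __name__ == "__main__":
    rate = break_even_hourly_rate(agent_cost_usd=300, human_hours=2)
    print(f"break-even: ${rate:.0f}/hour")
```

With those inputs the agent run only wins on cost if a fully loaded engineer hour is worth more than $150, which is why the missing token bill matters so much to the claim.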
> The diagram below shows the rule: within each business domain (e.g. App Settings), code can only depend “forward” through a fixed set of layers (Types → Config → Repo → Service → Runtime → UI). Cross-cutting concerns (auth, connectors, telemetry, feature flags) enter through a single explicit interface: Providers. Anything else is disallowed and enforced mechanically.
Can anyone give me a simplified explanation of what they’re saying here? Having some trouble understanding.
They're describing a layered architecture enforced by some script in CI.
For example, if you had a `backend`, `common`, and `frontend` package, you would be OK having backend/frontend depending on common, but you wouldn't want common depending on backend/frontend or backend/frontend depending on each other.
If you think about JavaScript, there is nothing stopping your dependency graph from becoming spaghetti. It sounds like they built static analysis to enforce rules.
Some languages have this built in like Java (Project Jigsaw), Go, and Rust. JavaScript, Python, etc. have no such feature.
It's really nothing special -- it has existed before. It just becomes a _lot_ more important with agents since they produce a lot of code, and it is good to have lots of static analysis when heavily utilizing agents.
They mention this in the article:
> This is the kind of architecture you usually postpone until you have hundreds of engineers. With coding agents, it’s an early prerequisite: the constraints are what allows speed without decay or architectural drift.
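A rough sketch of what such a CI check could look like, to make it less abstract. Everything here is assumed for illustration: the `domain.layer.name` module naming, the reading of "forward" as "a module may import from its own layer or an earlier one in the chain", and passing imports in as a plain dict instead of parsing source files. Cross-domain rules are not modelled at all.

```python
# Layer order from the quoted rule; lower index = earlier in the chain.
LAYERS = ["types", "config", "repo", "service", "runtime", "ui"]
RANK = {name: index for index, name in enumerate(LAYERS)}


def parse(module):
    """Return (domain, layer) for 'domain.layer.name', or None if it doesn't fit."""
    parts = module.split(".")
    if len(parts) < 2 or parts[1] not in RANK:
        return None
    return parts[0], parts[1]


def find_violations(imports):
    """imports maps each module to the modules it imports.

    Returns (importer, imported) pairs where a module reaches into a later
    layer of its own domain. Imports of 'providers' are always allowed.
    """
    violations = []
    for importer, targets in imports.items():
        source = parse(importer)
        if source is None:
            continue
        for target in targets:
            if target.split(".")[0] == "providers":
                continue  # the single explicit entry point for cross-cutting concerns
            destination = parse(target)
            if destination is None or destination[0] != source[0]:
                continue  # outside this sketch: third-party or cross-domain imports
            if RANK[destination[1]] > RANK[source[1]]:
                violations.append((importer, target))
    return violations


EXAMPLE = {
    "settings.service.save": ["settings.repo.store", "settings.types.model", "providers.auth"],
    "settings.repo.store": ["settings.ui.form"],  # repo reaching into ui: not allowed
}

if __name__ == "__main__":
    for importer, target in find_violations(EXAMPLE):
        print(f"{importer} must not import {target}")
```

In a real setup the `imports` dict would come from parsing the source tree, and the script would exit non-zero on any violation so the agent gets the failure back as feedback.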
I started using chatgpt for functions and checking, then for single file changes and checking, now for multiple changes and checking. I am at a point where the only changes I correct are architectural. So it may start to become smarter to learn how to see only the architectural directions while multiple agents work, test, and commit both on unit and against live deployment.
> To drive a PR to completion, we instruct Codex to review its own changes locally, request additional specific agent reviews both locally and in the cloud, respond to any human or agent given feedback, and iterate in a loop until all agent reviewers are satisfied (effectively this is a Ralph Wiggum Loop ).
OK, but you spent 1x tokens to generate, then another 1x to review locally, 1x for the local agent review, 1x for the cloud review, then ???x until all bots are satisfied.
You end up spending at least 5x the amount of tokens for, maybe, a prediction machine to find a discontinuity?
I would say a way better approach is 1.123x to generate code + tests + passing analysis tools + human review + 1x "simplify as much as possible", rather than letting the snake eat its own tail without boundaries.
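To put numbers on the comparison above, here is a toy calculation. The multipliers are the ones from this comment, not measurements, and the assumption that every review pass costs about as much as the original generation (1x) is mine.

```python
def loop_multiplier(review_passes=3, extra_rounds=1, cost_per_round=1.0):
    """Token cost relative to a single generation pass.

    1x for generation, 1x per fixed review pass (local self-review,
    local agent review, cloud review), plus any extra fix/review rounds.
    """
    return 1.0 + review_passes * 1.0 + extra_rounds * cost_per_round


def bounded_multiplier(generate=1.123, simplify=1.0):
    """The alternative from the comment: generate once, then one simplify pass."""
    return generate + simplify


if __name__ == "__main__":
    print(f"review loop, one extra round: {loop_multiplier():.3f}x")
    print(f"review loop, five extra rounds: {loop_multiplier(extra_rounds=5):.3f}x")
    print(f"bounded approach: {bounded_multiplier():.3f}x")
```

The loop has no upper bound because `extra_rounds` is decided by the reviewers, while the bounded approach stays near 2x by construction (human review time is not counted in tokens here).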
1 million lines of code aside, I feel like anyone who seriously thought about this would eventually run their own harness.
Just like .vimrc and .zshrc, the harness "code" itself can be easy and personal. Provided that it's built on working and existing construct such as tmux.
I understand that they’ve written zero lines of code for this application, but would it kill them to write a few lines of the blog post by hand?
Forcing readers to wade through an unceasing string of LLM clichés demonstrates the opposite of the point you’re trying to make—that the consumers of your work are worse off because you exercised no human judgment in creating it.
Let me start with a premise: I don't like where software engineering is heading, at all. I have never been unhappier to work in this field since AI came out. And no, it is not possible to opt out of AI, especially when your teammates are all great engineers whose productivity increased a lot without any drops in quality code-wise (in fact the opposite has happened). You need to keep up. But it's tiring and the fun/interesting parts are disappearing.
That being said, it's clear that harness engineering is the most important part of our job and that task is going to take increasingly more of our time. And thus having a glimpse of how an AI company handles it is by any means interesting.
Dear OpenAI, the target audience of your blog or at least of this blog post understands English pretty well. Why won't you give them a simple way to disable the shitty ai translation and read the original content? Why translate it at all in the first place?
EDIT: found the button, all the way down in the bottom of the page... I hate this so much, give me the original content, I will decide if and when I need translation
Given that we can code at 10x speed for at least half a year, one would expect to see at least some pieces of machine-created software with 5 years' worth of equivalent human engineering work.
I wonder why we as engineers aren't protesting AI in the same way that artists and people in film and television are. This post should instill the same terror that visual artists feel.
If you're a more senior person in tech, this post is effectively saying that a large portion of your skillset is about to become completely worthless. This goes beyond the skills involved in writing the code. Everything that you've learned over years about how to determine whether code is good or bad, and what practices make an engineering team effective is not just obsolete, it's fundamentally counter-productive because it assumes a slow, human-centric process that requires you to actually review and understand the code. Even your ability to mentor junior engineers is now obsolete, because all that experience you've built up is now worthless to them.
If this is the approach the industry takes, particularly when combined with a lack of interest in quality from the business (and let's face it, consumers have shown us that they're happy to pay for cheap crap), it's hard to see much of a future for software engineers. You don't need thousands of people with deep technical expertise, you need a handful of manager-types, who will focus on defining product and business requirements and configuring how the AI gets enough context to implement the requirements.
Maybe, if we're extremely lucky, there's so much demand for software that total employment doesn't fall off a cliff, but the nature of the work will change so much that many older, more expensive engineers will become unemployable. Those who remain will have to accept that the skills they spend decades developing are now worthless, that younger engineers no longer respect or listen to them, that the business no longer sees them as experts worthy of respect, but old fogies who grew up in a different world.
Joe Biden liked to say that a job is more than just a paycheck, it's part of your identity and your sense of self-worth. We're all very used to a certain level of respect (and commensurate remuneration). If you don't think that's true, compare how a software engineer is treated to how a warehouse worker is treated. What happens when we lose that?
>a large portion of your skillset is about to become completely worthless
I'm not convinced of that.
I watched a video of an architect using AI to create architectural drawings. It became very clear to me that he has a lot of skills and terminology that helped him produce something very specific, in a few minutes. I've been working on some home improvement stuff including a studio/shed and I've struggled to produce even something simple (currently trying to get a conversation packet on the roof trusses to take to the permit department to get started). Even with my high school architecture class.
After watching that I wonder how much of what I'm doing with AI that looks easy is because I have deep technical knowledge, plus 3 years of heavy work with AI.
This is the case now - I can explain to the AI that I want to re-factor a component to support different implementations using a strategy pattern, and I can get a similar outcome to what I would have written, just implemented a bit faster. My expertise brings value.
But that's not what this specific article is describing. The world this article is describing is one where you describe the business requirements, and you don't think about how it's implemented. You don't write the code, you don't review the code, you don't test the code. You give the AI business requirements and you give it access to sources of context (slack, meeting notes, etc). Every place where the human would act as a gate reduces throughput, so it should be eliminated through building harnesses and providing context.
What they're doing here is the equivalent of taking a factory where you have 2 process engineers and 100 operators, and replacing all the operators with robots. They want to automate the whole process of making the software and just leave the part that figures out how to make the automation work effectively.
In this world, the average software company doesn't need people who know how to write good software, because writing, reviewing, maintaining, and testing the software will be entirely automated. There will be a small number of people at companies like OpenAI that need to know how to write good software in order to supervise training the models, and there will be a small number of people at the software companies who have expertise in setting up the automation.
Just because I'm not typing "strcat(); strcpy(); sprintf()" doesn't mean I'm not thinking about problems. I'm still doing critical thinking all over my stack, and I don't see that going away. I'm just doing different thinking.
There are people who think, and AI just isn't going to change that. There are people who don't think, and they've existed long before AI. Back in the 90s when I worked at the phone company, man, I worked with some people who didn't do a lick of work (along with some really sharp people).
What is that protest going to get us? We'll convince or force business leaders to not use a cheaper/better tool and protect our jobs? And nobody else in the world is going to pivot either? And our companies will remain competitive?
Software engineers have always adapted to new technologies. New languages, frameworks, native apps, browser apps etc. So far this doesn't seem to be close to completely removing us from the loop.
If you are smart, educated, and can adapt, you'll figure it out. The economy has to find some stable equilibrium and it's not a zero sum game. Everyone in the economy getting a paycheck is also a consumer. With no consumers there is no business. The companies who are using AI and become more productive can do more things that before were not profitable but now are. Some of the people who are getting laid off are going to start new businesses and hire people. These things always cycle, and they basically have to.
It's the other way around, unfortunately. The senior engineers will still be useful for architecture and infrastructure considerations, as well as guiding the agents. It's the junior engineers that get nailed, because there's little incentive to hire one when an LLM does a better job immediately and costs less.
That's true now. But in the world of this article, it's also the senior engineers that get nailed. In the world of this article, all code is like what machine code or bytecode is now - it's designed to be used by the machine, not the human, because the expectation is that humans will rarely, if ever, touch it.
Individual voices aren't strong enough to drown the marketing machine.
Artists and writers are unionized, which is why they have a more powerful collective voice.
Second, there are enough people whose jobs are well paid and too cozy for them to dare rock the boat.
The economy and job market aren't hot enough at the moment for people to jump ship quickly, either.
Can you even be sure that you'll find a tech company that isn't jumping head first onto the AI hype train? Even politicians can't get enough of AI in their mouths.
I for one am not protesting because I know that this is bullshit marketing nonsense. Look at reliability metrics of OpenAI, they’re terrible. Everyone knew a long way ahead that it’s a scam, now they’re cranking up pricing and trying to rug pull. There will be a lot of developers who will come out very well once the stock tanks. That’s my two cents
But this is the only submission that's had any traction. Since the content is nearly the same for all submissions, it highlights how getting to the front page can be a bit random. (Though this is the only one that capitalized 'Leveraged' so maybe that's the secret)
> Over the past five months, our team has been running an experiment: building and shipping an internal beta of a software product with 0 lines of manually-written code.
This is such a common thing among software engineers nowadays that I was very surprised that OpenAI would open with that line as if it were mind blowing.
But then I saw it was published in February and OP is just reposting it to farm karma.
Everyone is criticizing the number of lines of code and the lack of attention that must certainly have been applied to generate that code and push it into production. What is being ignored is this awesome prompt that is almost certainly better than having no agents.md or plans.md or whatever you've come up with, to add validation steps for committed changes. You're still free to look at your code, the changes, and ask the agent to clean up. Try it. It's really nice.
> We had weeks to ship what ended up being a million lines of code... Five months later, the repository contains on the order of a million lines of code across application logic, infrastructure, tooling, documentation, and internal developer utilities. Over that period, roughly 1,500 pull requests have been opened and merged with a small team of just three engineers driving Codex. This translates to an average throughput of 3.5 PRs per engineer per day, and surprisingly the throughput has increased as the team has grown to now seven engineers. Importantly, this wasn’t output for output’s sake: the product has been used by hundreds of users internally, including daily internal power users.
That's an insane level of throughput. What's a good baseline? Prior to agentic coding, what was the typical number of PRs engineers were expected to push? Maybe 2-10?
Do people feel the software has gotten better in the last 6 months? The number of engineers is probably the same, so we should expect maybe a 5x faster cycle in major software apps, but I don't see it. The AI apps do change very fast, but given it's a very new field, I'd expect as much. Outside of that, I don't see it.
Here's a fun one: Firefox lists its current count at about 2.5M LOC, from roughly 1M commits over the years.
You end up with about 3 lines added per commit, which is not ridiculous when you consider that most would be edits rather than full additions.
Here, we have 1500 PRs and 1M LOC, which is about 650 added LOC per PR. Remember, not 650 lines total in the PR, but +650 balance after additions-removals.
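A quick sketch of that arithmetic, using only the round numbers quoted in this thread (the variable names are made up for illustration). Note that 1M / 1,500 comes out nearer 667 than 650, and 2.5M / 1M is 2.5 rather than 3; the order-of-magnitude gap is the point either way.

```python
# Back-of-the-envelope figures quoted in this thread (all approximate).
firefox_loc = 2_500_000      # "about 2.5M LOC"
firefox_commits = 1_000_000  # "roughly 1M commits"
project_loc = 1_000_000      # "on the order of a million lines"
project_prs = 1_500          # "roughly 1,500 pull requests"

# Net lines of growth per unit of change (additions minus removals).
firefox_net_per_commit = firefox_loc / firefox_commits
project_net_per_pr = project_loc / project_prs

print(f"Firefox: ~{firefox_net_per_commit:.1f} net lines per commit")
print(f"Project: ~{project_net_per_pr:.0f} net lines per PR")
print(f"Ratio:   ~{project_net_per_pr / firefox_net_per_commit:.0f}x")
```

A commit and a PR are not the same unit (a PR may bundle several commits), so the ratio is only a rough comparison.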
Fun questions for attentive readers:
- What does a project growing at a rate of one full firefox-codebase worth of LOC per year look like, a decade down the line?
- What does the line count say about the verbosity of the tool, and what does it say about outcomes that the purpose of the project isn't clearly disclosed?
- Do we have reasons to care about LOC in a world where we don't write code manually? What happens to token usage numbers when the codebase is significantly larger?
- If it was confirmed that LLM usage blows up your line count, what's the implication for codebases that want to return to manual coding after months of usage? (Say, because the tool gets expensive).
> - Do we have reasons to care about LOC in a world where we don't write code manually? What happens to token usage numbers when the codebase is significantly larger?
Yes, at least to the extent that we care about context windows and tokens consumed by coding agents processing code that is ultimately irrelevant to their assigned task.
Anecdotally, I've found keeping file sizes small has been important for agentic coding not just to maintain human readability, but also for optimizing agent performance, precisely because it limits the amount of incidental context they load while working a problem, because they generally load entire files rather than just parsing the part relevant to their current assignment as a human might. That smaller file size thus reduces input noise and the LLM generates a tighter solution, which in turn reduces input noise for future solutions. Or at least this strategy avoids a death spiral into exploding context length.
I expect (but cannot currently prove) that keeping overall LOC down yields similar benefits even when file sizes are kept small because it spares the LLM from parsing potentially relevant files that prove irrelevant to its current task.
Seconded on smaller files. I feel like I tend to get better responses faster.
A notable flaw here is that I’ve not tried large vs small files in a large codebase. Most of my experimentation there has been on personal projects where even a small file contains a significant part of the project. I could see degradation when it has to load 5 files to figure out how something works.
Total LOC (tokens, really, literal lines probably don’t matter) is interesting as a factor. That might go some way towards explaining why LLMs are weirdly good at Clojure.
E.g., last I checked, Anthropic's one-shot performance on Clojure was about the same as Python or Go despite Clojure almost certainly being less represented in training data. The combination of density and simple primitives might be easier for an LLM to wrangle, ameliorating the impact of a less popular language.
>E.g., last I checked, Anthropic's one-shot performance on Clojure was about the same as Python or Go despite Clojure almost certainly being less represented in training data. The combination of density and simple primitives might be easier for an LLM to wrangle, ameliorating the impact of a less popular language.
There might be tons of confounding factors there. One that comes to mind is the quality of the data: it may well be that the average Clojure snippet is higher quality because of user demographics. Very few people start writing code with Clojure, whether in college or during bootcamps.
Does the Firefox LOC include ALL forms of text: infrastructure (which Firefox doesn't have), documentation, developer scripts, tests, etc.? How is the test coverage of Firefox?
When I got to the 1M LOC I involuntarily paused feeling like this must be satire.
We've known for decades that output metrics like LOC/day are very bad measures of real productivity in software. But they seem to be back in vogue in the age of AI, because AI is so good at maxing these useless metrics, and we need to show how impressive our AI is and how impressive our usage of AI is.
They never specified what exactly the product was, without which it's impossible to judge the post.
For some reason most of the uses of "agents" are to build yet other AI products, it's turtles all the way down. Maybe that says more about the field of harnesses than it does about the power of "agents".
There is a sense in which it doesn’t matter at all; many of the limitations of agents in large codebases are just the context management challenges. So proving that you can cohere and progress at O(1m) is a useful scale observation. “Can I use agents in my 1m line codebase?”
There is of course another sense in which the output quality is the only thing that matters. “Can I use agents to build a 1m line codebase that I want to maintain going forward.”
I take this as being exclusively a tech demo of the former. Quality (feature velocity, bugs, scalability) is not demonstrated.
Feels like the active discovery going on is trying to understand what is computer vs what is AI, for every product.
Agents help a ton with the discovery, but the act of building a product needs a deeper level of thought and validation to make it actually better than what came before. So IMO what you see is people still learning what needs to be understood and crafted first hand to make a product better (including economics)
We’ll get there if more of us try
It feels like the update cadence has indeed sped up. But not necessarily quality.
Looking at MS Office I notice a lot of small changes recently that are mostly annoying. Things like Word comments losing the focus after you @-tagged a colleague, needing to click the Outlook search field twice before you can enter text, Outlook mobile date picker losing its ability to show your and attendee's availability.
So it looks like lots of throughput, but unfortunately breaking features that work. Or wasting time on things that don’t matter such as the status bar of OneDrive search circling around the input field.
I've been vibe coding a lot over the past year or so, and I think I'm going to stop. In fact, I sort of want to challenge myself: can I go back to the fork in the road, to the old Copilot autocomplete workflow, and really maximize that? Be in the driver's seat for most of the code being written, but find ways to use AI to enhance the flow state and remove blockers. Tools only, with minimal actual code generation.
One workflow I like is writing a comment for what I’m about to do and then waiting a few seconds and then tab through the auto-completions. Then I check what the agent came up with, make some edits, and then on to the next block. That works well, I feel in control but don’t have to type as much.
I do use Claude Code totally hands off too, however. Mostly for UI tasks, like theming CSS, or data grids and CRUD with all the bells and whistles. I hate that stuff, and CC gets it done in minutes and mostly right. It's also super nice to say things like "user profile in the upper right hand corner" without having to fight CSS.
/if it’s not clear, I hate dealing with css and related frameworks.
I would be very impressed with someone who's been vibecoding "a lot" for about a year who could then go back to being fully in the loop for even 50%. I would even say I'd expect withdrawal symptoms at that point.
The dopamine hits are core to why people even do vibecoding (or vibecoding-in-a-dress/spec-driven development) and why they tend to overestimate its output so much. Hell, it's core to all forms of LLM-assisted development (because it feels like magic), but most of the other forms are more value, less delusion.
The dopamine hit is real. I feel like that was identified early on by OpenAI and probably lit a fire to get ChatGPT into the hands of the public. B. F. Skinner (I think) is the one who narrowed in on variable ratio reward systems to maximize operant conditioning. An LLM, with hallucinations and imperfections, is the perfect variable ratio reward system. It's no wonder they're getting pushed so hard along with a consumption-based pricing model. Whether you're a human, rat, plant, or bacterium, there's no real defense against that kind of conditioning.
First hit on Google
https://www.simplypsychology.org/operant-conditioning.html
I actually don't find vibe coding satisfying, which is one of the many reasons I'm going back. I feel a little of what you're talking about, but I'm a nerd. I like to code.
But I'm not dismissing your concern, because it is one of the reasons I'm making this decision. I'm a professional. I'm not just here to feel good; I'm here to do a good job over the course of a career. All in, when you put together writing good, maintainable software, learning, staying mentally sharp, and speed, vibe coding could be less effective and maybe even, in the aggregate, "slower".
I have been building an entire operating system (not figuratively).
Prior to AI autocomplete I wrote 500 LOC a day, then with AI autocomplete I could do 2,500 a day, and now 50k is pretty normal. Walking around tech week with my phone yielded 150k this week.
The average efficiency improvement is closer to something like 2-3x per Anthropic’s numbers and this is only the rate at which software can advance. Do you expect to notice if 12 months of software engineering on a project you’re following gets done in 6 months? I suspect not.
The root cause is that the acceleration is pareto distributed so the modern engineering team at the moment looks like one 10x engineer, one 5x engineer, and the rest are approximately 1.5x engineers.
> ended up being a million lines of code
This almost reeks of "I've never cleaned up our code base because there is too much code, and didn't even bother having agents/LLM cleaning them up".
You almost never need a million lines of code - this includes your software, infra, testing and operational tools. You didn't ship the Linux kernel in 3 weeks and you know it. The code is already spaghetti; it achieves the basic functions OK, but it will get harder and harder to simplify, untangle, and maintain.
Even the linux kernel doesn't need millions of lines of code; most of the actual LOC is device drivers, and you don't need all of them, you just need the ones for the devices you have.
And Linux maintainers are actively pushing to radically cut down on the LOC by eliminating drivers etc.
As a point of reference, 1MLOC is about the size of the entire Python standard library including tests, as well as stuff like IDLE. (Well, the Python part of the code. There's about half that much again of C in Modules/ .)
Yeah I cannot see how "we shipped 1 million lines of code in three weeks" is... something to be proud of haha
They directly address routine code cleanup and regularly paying down technical debt near the end of the article.
I stand corrected, but the LOC being advertised still make me doubt the efficacy of their process.
> should expect maybe 5x faster cycle in major software apps
To what end and what would that even look like though? Enshittifying everything at maximum speed? The apps/platforms I use regularly - GitHub, Spotify, Google maps (just to name a few), have gotten noticeably shittier in recent times.
Confirmation bias. The internet has complained about software updates decades before LLMs became ubiquitous. I made a career fixing human slop by domain experts.
We easily forget that the great majority of software engineering is fixing the mistakes of other highly capable software engineers.
It's just so easy to blame the machine instead of admitting that no one here is an expert on anything and that people count their hits and not their misses. If they counted both, we would find their probability of making a mistake to be higher than a frontier coding agent's.
It's a hard-headed crowd, and everyone, LLM-pilled or not, suffers from the Dunning-Kruger effect. All of us.
Just look at the comments. Everyone is perfect when they do things themselves.
>GitHub, Spotify, Google maps (just to name a few), have gotten noticeably shittier in recent times.
What if AI lets you create new versions of those tools, but without the enshittification?
I say that being in the "soaking" stage of using AI to rebuild a shitty software project in 70KLOC over about 2 weeks of spare time, so this may not be as theoretical as you might think.
Oh I definitely agree that AI can and will help create great software.
It's just that creating great software isn't really the SV/VC/big tech business model or main goal.
> What if AI lets you create new versions of those tools, but without the enshittification?
I'm not sure I fully understand what you're saying here. Isn't the value of these tools almost entirely independent of their actual software? That is, we have many good open source, self-hostable forges (Forgejo, sr.ht, etc.), lots of great music player software (Jellyfin, Symphonium, etc.), and decent maps software (OsmAnd and Organic Maps). People use GitHub, Spotify, and Google Maps -- perhaps even _put up_ with their often bad/glitchy software -- because of network effects (all three) and content/licensing partnerships (Spotify/GMaps). That proprietary data isn't something AI can help you with, right?
It really depends on the use-case. For example, my most starred github repo is a tool to convert Spotify playlists to YouTube Music (that was done pre-AI). Github depends on what issues you have with it, what your use case is, and whether you can leverage some of the network effects via API from the github source. Maps, same story.
AI coders are great for making scrapers, possibly because AI companies use their own tools to make an awful lot of scrapers.
This is a lot tamer than what Claude Code's team claims tbf.
It is likely better because AI agents make access to domain knowledge easier. However, I would wager that the problem is people don’t remember the code well. The problems are going to be long-term as the pace of change increases.
If you think about it, successful products rely on well-thought-out experience design and customer discovery (see all the Forward-Deployed Engineer job listings at OpenAI), so code velocity becomes somewhat irrelevant.
If you’re solving the right problem and you’ve got a good team then competitive advantage comes from somewhere OUTSIDE of code velocity.
The more important question I think is does faster code yield more value long-term? At the moment, it’s like yeah we do 3.5 pull requests per day.
I’m thinking, great, good for you. You could also combine three pull requests into one and then you’re doing 1 per day. This is quantitative data that doesn’t really mean anything tangible.
I've been doing the same experiment in tsz[1] for a while now (the same past five months in fact) and I have come to very similar conclusions. Lots of harness to enforce good architecture splits. Lots of tests and CI.
My point in working on tsz is to learn how to do very big projects with AI. Eventually the same workflows and attitude can be leveraged to build customer product apps with UI as well. I see that OpenAI is leveraging automated browser testing and even videos as part of their workflow. I think that as models get better, this direction for making software will eventually make sense. I don't think we're there yet though. But at least, unlike OpenAI's vague claims, I can share the output with you to see!
Most of the solutions that offer a very high level of automation like Lovable are a bit too optimistic and solutions are not tightly coupled with lots of automated testing.
[1] https://github.com/tsz-org/tsz
Hello! I'm one of the three engineers who wrote this piece. Happy to answer questions.
Interesting write up!
Have you been able to extract libraries or tools from this project yet? If so how was that experience?
That is, do you see yourself releasing a metric harness, or sub-projects that are the equivalent of ActiveRecord, zod, or similar open source tooling that frequently originates in a large in-house project and is then exported as a stand-alone tool, utility, library, or framework?
Because while AI can reimplement minor tools, its utility entirely depends on the existence of solid tools, libraries, and frameworks.
Fantastic job!
Can you share what type of project that was? On the spectrum from a database engine to cat picture sharing web site (very high demand for correctness vs very lax).
Very cool article!
- are other teams adopting this approach? What are the blockers if not?
- have there been problems where the models alone were not enough to debug and the devs had to fix it themselves?
- as the rate of changes has increased with more devs how have you dealt with concurrent writers with merge conflicts?
- if there was anything you could change in the approach you started with, what would it be?
1. Yes! Many teams internally have adopted a lot of the same practices we outlined in the blog post. Ryan has also been spending time both internally and externally helping companies figure out how to do this in their code bases.
2. Hmm, kind of. There have definitely been issues the models can’t one shot. But we still use Codex to write all the actual code with human guidance.
3. More agents :) Some teams are experimenting with centralized Agent mediated integration queues, others use normal merge queues, many have local Codex threads that monitor CI to resolve and land conflicts or failures.
4. Today’s models and codex app. We started doing all this with gpt-5 and codex-cli. The tools today, 9 months later, are so much better than what we had then.
Have you built any tooling or products around all of this and deploying it somehow? I’d love to learn more and share notes, because we’ve been doing this too. About 3100+ PRs merged across our 4 person team in 4 months. Impossible without harness engineering, and I agree, the tools are getting even better.
Have you been satisfied with the quality of code generated by the model? Or did you have to tweak some rule file or skill to improve it? Or is human-readable code not even a goal at this point?
Were those em dashes you, or GPT
This mirrors exactly what I have been doing.
- Give Claude/Codex a way to verify its own work (browser, smoke tests, e2e tests, high-fidelity local environment)
- Keep all context (issue tracking, docs, ideas, plans, worklogs) in-repo (https://github.com/shepherdjerred/monorepo/tree/main/package...)
- Give Claude/Codex access to observability (Grafana, Prometheus, Tempo, PagerDuty)
- Have Claude/Codex follow good engineering guidelines like fail-fast, type safety, parse at boundaries
I haven't yet been able to achieve full autonomy due to cost and CI load on my homelab.
Does it yield good results? I found that instead of docs it's easier just to ask AI to read the code. I feel like this is the same as comments in code: they become outdated fast.
I don't really use "docs" for documentation. I've prompted Claude/Codex to always write a "log" and save it in-repo to track what it did and why.
I've found this to be really helpful, e.g. "you did this last week, and now some other thing is happening" or "you tried this approach before to solve alert X but it didn't work" -- except it can discover this itself.
https://github.com/shepherdjerred/monorepo/tree/main/package...
I've also used it to store TODOs and plans. For example I might want to explore some idea and defer it for later, or some weekend have it execute on some tech debt I've put off. One last use case is asking "what did I work on in the last 2-3 weeks, is it healthy, and what additional quality checks can/should I do; is there any follow-up work?"
I find that preserving logs that contain errors will confuse future sessions even if the errors were corrected at the time. Do you have that problem?
Essentially preserving logs extends the context window with all related problems.
I haven’t actually noticed that, but I’m not sure why. Maybe because I specifically describe it to the agent as a work log rather than documentation? I’m not sure
it does not yield great results left unattended; it'll start creating slop or hardcoding solutions
but over time, if you adjust your verification rubric, it's not too bad and gets pretty good. if you make it do TDD, it gets kinda crazy and you'll have 2000-3000 tests after a while, or, in my common case, 6000-7000 lines of code in single files (i usually have a cron to audit files for decomposition and create tickets)
i wouldn’t use it at my job yet, but it’s been fun to use for personal projects - it’s like modded minecraft automation or factorio
Static analysis can help here! Add CI checks for duplicated code or file length.
For test growth, maybe use a coverage tracker and remove redundant tests?
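A minimal sketch of the file-length check, assuming a plain Python script run as a CI step. The 500-line threshold, the suffix list, and the function names are all hypothetical choices for illustration, not anything described in the article.

```python
from pathlib import Path

MAX_LINES = 500  # hypothetical threshold; tune per project
SUFFIXES = {".py", ".ts", ".tsx", ".go", ".rs"}  # assumption: file types worth checking


def count_lines(path: Path) -> int:
    """Count lines in a text file, tolerating undecodable bytes."""
    with path.open("r", encoding="utf-8", errors="replace") as handle:
        return sum(1 for _ in handle)


def oversized_files(root: Path, max_lines: int = MAX_LINES) -> list[tuple[Path, int]]:
    """Return (path, line_count) for source files longer than max_lines."""
    offenders = []
    for path in sorted(root.rglob("*")):
        if path.is_file() and path.suffix in SUFFIXES:
            lines = count_lines(path)
            if lines > max_lines:
                offenders.append((path, lines))
    return offenders


def main(argv: list[str]) -> int:
    """Print offenders and return 1 if any exist, else 0."""
    root = Path(argv[1]) if len(argv) > 1 else Path(".")
    offenders = oversized_files(root)
    for path, lines in offenders:
        print(f"{path}: {lines} lines (limit {MAX_LINES})")
    return 1 if offenders else 0
```

Wiring `main` to the process exit code (for example `raise SystemExit(main(sys.argv))` at the bottom of the script) would make the CI job fail whenever a file crosses the limit. A real setup would probably also skip vendored and generated directories.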
I like the idea of saving the work done into files - helps to prevent the llm from redoing the same work. Maybe one day instead of code in a repo it will just be a list of prompts.
The other day I came across a video showing workers in an e-vape factory. They pick up a bunch of e-vapes from the conveyor belt (each bunch has 6 e-vapes, I think), stick them in their mouth, and vigorously vape all of them for about 5 seconds, then test the next bunch. Humans reviewing hundreds of lines of change in a PR written by AI is not very different.
You can do statistical testing of the e-vape line because you have specific criteria and well-defined tolerances that you can define on a per-sample basis and that the factory meets with some acceptable 9s of reliability.
PRs are not like this because a single bad PR can be catastrophic for your business in a way that a single bad e-vape cannot.
I would also argue that the current output from the AIs when sampled by software engineers regularly doesn't meet the bar of quality we want in our product, hence the need to review every PR and fix a substantial fraction.
If you can start to bound the impact of changes and the outputs begin to be generally acceptable unsupervised, such that all you're doing is double checking that nothing has regressed in the factory, then the sampling approach can work.
Very true. If a PR has 1000 lines, I would check only a handful of them and leave the rest for the test suite.
I'm not an AI skeptic but I'm skeptical of the intent of this article. It makes great claims about agent-first engineering and tries to make a real case based on a real product, with real users, and a real team that's been growing — all without even saying what was built or showing it, just like every other AI hype article.
At the time we wrote the article we hadn’t released the product and weren’t ready to talk about it. It was an internal prototype that looked very much like the current Codex app.
So, did this internal prototype ultimately end up being used to create/influence a real product, e.g. Codex app?
And this thread too is filled with users saying "I also have done this or that", but bar one user, nobody followed up with any link to anything.
This might work only if you have “infinite” compute and infinite tokens.
As someone who used the $20 plan, I found this pure agentic approach impossible because I'd hit the limit fast and end up with less output.
What I found to work incredibly well was to provide human-written code as a reference and ask it to extend it. So I scaffold the entire thing, architect it, and write a few code samples (controllers, services, models, components, database schema, how auth works, etc.) so the LLM can have a head start on its attention (pun intended).
I usually write a stub with a lot of detail on how to implement it, something like higher-abstraction pseudocode, then ask the LLM to implement it.
When it fails, it is often better to undo all the changes, adjust the stub so it catches what failed before, and try again.
Or, commit the changes, and use a new fresh context and only address what went wrong.
-
Whenever I tried this agentic from-scratch approach, I always ended up disappointed, both with the outcome and with the limit that I hit before an hour had even passed.
You are not going anywhere with the $20 plan.
Upgrade to $200/month and you should see more usage, but even for a hardcore user like me, one can never have enough.
I'm still very jealous of those guys that got 200x usage simply by RSVP'ing to openai party
I wish these breathless blog posts would actually try to be more didactic.
For example, actually doing a walkthrough of how to set up these allegedly super powered workflows and concrete demonstrations.
I'm not an AI skeptic. Rather, I don't want to miss out on any actual super powers.
Instead of reading articles like this one end to end, I ask AI to read them in detail and prepare a new harness for me. The important part is not to do this in a single prompt, but to first create a detailed plan and let the model think deeply about each aspect. This approach lets me build the new harness without the missing didactic information you mentioned.
Basically, I am moving from “I build products without writing or reading the code” to “I build products without writing or reading the harness.”
Once the new implementation harness is prepared, I start it, but I keep the original session open. In that original session, “we” monitor the implementation harness from the outside: how effective it is, where the bottlenecks are, what breaks down, and what could be improved. From time to time, the monitoring session suggests changes to the implementation harness. We apply those changes, restart the harness, and monitor it again.
The overall approach is not to spend X hours understanding an article like this in detail, because another similar article will appear in 3 weeks. Instead, I take immediate action, learn on the fly, and replace the harness when a better pattern emerges. And yes, I still have to spend X hours on setting up, monitoring and fine tuning the new harness, but at the end I have the latest fancy "thing" working for me.
I do quite a lot of what this post describes in a reasonably large project. Here's what works for me:
- write gherkin features for new features; update them for enhancements; don't touch them for refactors. Label your PRs with these nouns.
- use pre-push hooks for type checks, linting, unit tests, and other quick, scriptable validations.
- make a VitePress subsite in your repo and have the agents maintain it: document important principles, architecture, etc.
- make a cli command which lists all pages along with the yaml frontmatter description so agents can choose what to read without blowing up the context window.
- use ddd and monorepo - write your logic in headless layers, and compose layers into apps. agents navigate layers very successfully.
- use zod (or your language equivalent) and contract-first API development; this is my favourite bit tbh, I use orpc
- make a single skill called "code" which describes the lifecycle: open a worktree, setup .env to guarantee no conflict with other agents (choose unused ports etc - docker is good here), write or update feature file (this is where you negotiate the spec), implement, validate (e.g. using playwright mcp), pre-push checks, push and wait for review, tear down and fast forward main
- testcontainers is great for ensuring multiple agents can run tests that don't conflict
Seriously I only have one skill that's it. Everything else is in the docs. I'm feeling very productive like this, in a "making good software" sense not a LoC sense.
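For the "cli command which lists all pages along with the yaml frontmatter description" item, here is a rough sketch of what such a command might do. It is not the commenter's actual tool: the function names are invented, and the parser only handles a single-line `description:` entry rather than full YAML.

```python
from pathlib import Path


def read_description(path: Path) -> str:
    """Return the `description` value from a page's frontmatter, or ''.

    Deliberately minimal: it only understands a single-line
    `description: ...` entry between the opening and closing `---`.
    """
    lines = path.read_text(encoding="utf-8").splitlines()
    if not lines or lines[0].strip() != "---":
        return ""  # no frontmatter block
    for line in lines[1:]:
        if line.strip() == "---":
            break  # end of frontmatter
        key, sep, value = line.partition(":")
        if sep and key.strip() == "description":
            return value.strip().strip("\"'")
    return ""


def list_pages(docs_root: Path) -> list[tuple[str, str]]:
    """Return (relative_path, description) for every Markdown page, sorted."""
    pages = []
    for path in sorted(docs_root.rglob("*.md")):
        rel = path.relative_to(docs_root).as_posix()
        pages.append((rel, read_description(path)))
    return pages


def format_index(pages: list[tuple[str, str]]) -> str:
    """Render one short line per page so the listing stays cheap to read."""
    return "\n".join(f"{rel}: {desc or '(no description)'}" for rel, desc in pages)
```

Calling `print(format_index(list_pages(Path("docs"))))` (with `docs` as an assumed directory name) gives an agent a one-line-per-page index, so it can decide which pages to open instead of loading the whole docs tree into context.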
Can you share your skill please?
I agree with many of the points made by nimonian above (esp the one starting with 'make a single skill called "code" which describes the lifecycle'), based on my limited experience with these things.
I'm building a skill + CLI tool along those lines (for solo devs not corporates). Here is what my "lifecycle" type skill looks like right now: https://github.com/bitkentech/shipsmooth/blob/releases/dist/... (warning, heavily work in progress). You can see a demo here: https://shipsmooth.net/
I was not happy with the default code quality generated by Claude Code. So I've been adding some skill-file rules to address that, and so far happy with the results: https://github.com/bitkentech/shipsmooth/tree/main/skills/ex.... There was a similar one on HN yesterday called opencodereview: https://news.ycombinator.com/item?id=48406358
There are many such workflows out there! Matt Pocock gave a good talk about how he approaches it: https://www.youtube.com/watch?v=-QFHIoCo-Ko
That's a big ask. This kind of harness usually contains plenty of proprietary insights about their business. And also, nowadays, a good harness is a major competitive advantage.
Good thing I wasn't asking you.
Also, a skill is not a harness.
Your hostile tone is unfortunate, especially since my post was actually friendly. I was just trying to point out why it is very likely the OP won't give you what you're asking for, so you're not left confused if he ends up ghosting you.
Many people use the term harness to refer to the agent coding software (eg. Opencode, Claude Code...); I use this term more broadly to refer to the environment (set of skills, system prompts, constraints, memory, hooks etc...). What the OP is referring to is not just one giant skill. It's usually a comprehensive ecosystem of skills, bespoke tools to make certain agent tasks deterministic (eg localization), and so on.
I've seen someone post Github repos in this thread, these can be very useful especially if you use the same tech stack, but you won't reach the level of productivity reported by successful teams unless you invest substantial time to build your own harness. But the way to do so is to do it progressively: start with something simple to address the need you have on day 1. And then, turn recurring prompts into skills, turn recurring coding patterns and coding style recommendations into guidelines, turn repetitive tasks for which the LLM tends to build a python script that it occasionally gets wrong into a deterministic tool documented in a skill etc...
And after a couple of days, weeks, and months, you'll have a very dependable harness giving you optimal productivity, without needing to invest weeks of work upfront or take the fun out of agent-assisted coding.
Hope this helps.
I have an example of a side-project [1] where I think I naturally applied the best practices described in this article. My goal was to see if it's possible to code an entire project using a single agent (Claude).
To do this, I "simply" asked the agent, every time it encountered an issue, how to resolve it, using a validation tool or script. I also asked it to code these tools during audits. As a result, I now have over 30 rules [2] for validating its commits. It's working pretty well now.
[1] https://github.com/gildas-lormeau/rebuild-and-ruin (let the timer expire to see the "demo" mode)
[2] https://github.com/gildas-lormeau/rebuild-and-ruin/blob/a4c3...
A lot of these blog posts are trying to catch on to the next buzzword, "harness". It's close to the productivity porn mindset that we witnessed 10-15 years ago, where creating the complicated system is more exciting than using the system for daily tasks.
I agree. I followed this article for a repo I'm working on, and I had a very hard time inferring how, specifically, they implemented "providers" and enforced import layers. A sample repo would've been nice.
> We tried this early on: used ChatGPT as "project manager" to set up the entire harness before writing any code. After a week it produced 140+ docs of rules, architecture, frameworks. Zero lines of code. When we finally brought in another tool to review, the verdict was: "a perfectly secure empty safe." The harness was immaculate. There was just nothing inside it.
>
> Harness matters, but if you're not shipping code alongside it, you're just writing fiction.
we interviewed Ryan here: https://www.latent.space/p/harness-eng
and he gave a talk version of it in london: https://www.youtube.com/watch?v=am_oeAoUhew
This is a great interview! I had to switch from the transcript to watching it though - the transcript had many mistakes.
I worry most about blindspots with this kind of approach. Let's say that this repository goes on for years, at which point the docs folder is several MB in size. Would Codex be able to think outside of the box? Or would the aggregate of the Markdown content fundamentally cover enough ground to prevent it from thinking of novel new approaches to existing problems?
You tell it to update the docs: not append. I've done the same thing with a readme in the root with links to the docs. After every commit, before the push, I have my agent "update all relevant and related docs, add or remove what's needed" or something to that extent. And it works remarkably well. I also have an append only change log it's supposed to add to. Between that, good commit messages, and comprehensive testing, I've built a homebrew OS and updating it is remarkably smooth. Runs a homebrew FTP and HTTP server and can run Wolfenstein. Working on DOOM right now. Close, but sound has been difficult.
https://github.com/ESikich/smallos
Someone else in the comments said to have it make a static website with the info instead with clickable pages and sections so it reads only the content it needs to rather than dumping a long file into context windows. Although I suppose you can have a ToC in the readme too with multiple smaller markdown files as references.
Yep. You’ve got to have it update the docs. After a few sessions, if I forget to request this, opus starts rehashing the same tasks and finds that they are complete - and sometimes still won’t update those docs unless I ask.
Another tip is to condense the doc files into the minimal required. Sometimes I’ll end up with 5 to 6 floating around in various states of staleness. Condensing to 2-3 and removing completed tasks seems to help a lot
It’s not a self coding machine. There is human in the loop, they even added MORE engineers to the team of this project! 7 engineers should be able to collaborate with the AI to find good solutions to problems.
I think a lot of people are sleeping on the contents of this article. There are still valuable tidbits I'm going to be applying.
Another breathless sales pitch selling pickaxes to miners, but where's the gold? Where's the incredible product that the chatbots-talking-to-chatbots over git generating LOC heaps have actually _created_? I just don't see it.
They will have to open source it. Otherwise it is impossible for anyone outside of OAI to gain any insights - basically just a Boris at this point.
What I still can't understand is why a massive amount of generated code is a flex. I don't feel that software has gotten a lot better in the past 3 years, only sloppier. It's surprising to me that people who know about reward hacking choose a simple objective like lines of code generated as a signal for quality. I'd argue you have to optimize for as few lines generated as possible, while the secondary optimization should be readability for humans. I suspect it's not seen as a problem by providers because more lines generated means more tokens used and hence more billing put out on customers.
And if I am working on an existing codebase then isn't a good commit often a negative sum between added and removed lines? I don't want to bloat my codebase but make it more polished and elegant. After reading that I wonder if what they have done could have been accomplished for a far fewer LoC budget.
Lines of code has always been a terrible metric. But all else being equal it is a measure. If all else is not equal, which is usually the case, then it's not.
A lot of the focus has been on AI recently.
Three years ago we didn't have software where a non-software engineer can describe what they want in English and get working (-ish) software generated by other software? Is that not "software has gotten a lot better"?
Other than that I'm not sure how we measure "software has gotten better". New applications? More features? How do we measure sloppier? Is Google Maps suddenly taking you the wrong way more often? I'm not really doubting your subjective experience but seriously how do you tell? I mean a doc is a doc and a spreadsheet is a spreadsheet.
We're also only about 10 months into models that are powerful enough to potentially make a bigger difference and we are still figuring out how to use them best.
> Lines of code has always been a terrible metric. But all else being equal it is a measure.
A terrible metric is _worse_ than no metric. A terrible metric can _only_ lead you in the wrong direction. "No metric" means saying we don't know, and that leads us to stop and reconsider. But we've taken "move fast and break things" as a mantra, and we'd rather run towards any direction than stay still.
Using LoC as a metric for quality of LLMs will promote LLMs that write more code. It's better to say we have no way to compare different LLMs than it is to say "let's use the LLMs that produced more LoC because at least we can measure that". We, as an industry, should be focusing on developing better metrics for quality, not on improving LLMs based on known-bad metrics. We should be turning to the computer scientists, not to the venture capitalists.
When a pundit talks about how many lines of code an LLM has created, we should lose all respect for them. It's as if someone talking about physics measured the phlogiston, or as if a doctor started measuring our skulls. We know these theories don't work, and anyone using them should be mocked.
I personally don't view coding agents making software as "software has gotten better"; you are comparing a tool and the end result, and these are two different things. The agent you use going down and your product going down mean two different things to your customers. I will not deny that we made incredible progress in coding and hell, even design over the past 3.5 years; this technology is here to stay.
That being said while I agree that measuring better quality of software is vague (part of the reason it is hard for models as well), there are universal things I believe every engineer will agree on. Reliability, uptime, customer feedback, legibility of your engineering, performance, these are things we often optimized for. Google Maps is a bit of a strawman because neither of us (unless you work on it), knows how much agent code there is, I think it is likely that it's little since it was working fine prior to 2023. I could bring up github reliability as an example, given how much copilot usage they promote at MS, but once again only folks there know for certain. I do, however, see scores of various AI powered SAAS that looks like it is in a perpetual MVP state. I think you are right in that even if agents give us "good enough" results and we can swallow failure rates and our increasingly lesser understanding of what we, or more so model, created, then it is still progress overall, but this is progress not to human-AI collaboration but to AI-only engineering IMO, this is good or bad depending on how you view the future.
I'm a scientist and most of the code I currently write is somewhere on the intersection of critical software and machine learning. Squaring these two is not easy, and I guess the way I was taught to reason about engineering informs my opinions on this. Maybe it's just a matter of time before codex can help here in an unconstrained manner as well, but I am skeptical at the moment.
I also can't help but notice they didn't mention how many tokens were burned, or how much that translates to in terms of cost over the 5 months at enterprise AI prices. I'm going to guess this wasn't a cheap demo.
> Is Google Maps suddenly taking you the wrong way more often?
Funny you mention that because I had that issue in a cab just yesterday. Google decided to drive us off the main road onto a series of small roads which happened to be a dead end. My guess is that the AI decided that this is a shorter road? A less busy road?
That being said, Google Maps has been gradually degrading. Most notably, its search function is quasi-broken now.
Well, to be fair, the amount of goalpost shifting that is going on is quite intense. AI not being able to work in a "serious" project, and being limited to "toy projects" has been a long standing critique.
But also, bigger projects need some amount of loc written and it's a bit silly to pretend that this is not the case or a bad thing.
So the answer to the question is roughly: Establishing that an agent can work in a large-ish code base is valuable, because 1) them not being able to do so has been a critique and 2) it's something that is required for a lot of software projects.
Should we not be counting function points rather than LOC?
Lines of Code is a meaningless measure. It should also be easy to count function points using AI.
I'd argue LoC isn't actually a meaningless measure, but people use it the wrong way. The same program with the same features but less LoC is more likely to have a proper design and architecture, and is most likely easier to change and maintain in the future. Of course, only if it's less LoC because of proper design, not because you've folded everything to one line.
So if anything, we should find a way to aim for as little lines of code as possible. If you have two agents, and one can build exactly the same program as another, but with half the LoC, then most likely the first agent is better at software engineering and particularly software design.
Of course, as the author of an experiment that investigated exactly this, I'm slightly biased. Cursor's browser had millions of lines of code, which sounded weird to me based on the features and functionality it had. Meanwhile, I built the same thing but actually thinking about the design with the agent and ended up with ~20K lines of code instead.
Sure; But that's not the point that is argued about here.
(To state it in AI lingo:)
It's not about the best measure for "amount of code".
It's about whether "amount of code" is a good metric to begin with.
I don’t think it’s solvable. And I think Anthropic etc know it. LLMs can only reconstitute things in their training data and they are so hungry they can’t do a good job in a long-lived codebase full of complexity and novelty. There’s never going to be enough similar code on the open internet.
> LLMs can only reconstitute things in their training data
Such as a 4D raytracing engine in Metal? Or integrating APIs for features first released months after their knowledge cut-off date?
LLMs have shown an ability to transfer "knowledge" and capabilities across domains, languages, and use-cases outside their training data.
Case in point: GPT-2 "learning" to translate English to French and vice versa despite non-English examples having been voluntarily (and almost entirely) removed from the dataset.
Was this in the GPT2 paper?
In "Language Models are Unsupervised Multitask Learners"[0]. Not sure whether it’s "the" GPT-2 paper.
3.7 Translation
> Performance on this task was surprising to us, since we deliberately removed non-English webpages from WebText as a filtering step. In order to confirm this, we ran a byte-level language detector on WebText which detected only 10MB of data in the French language […]
[0]: https://cdn.openai.com/better-language-models/language_model...
The "lines of code" at this point are basically the same thing as binary code that comes out of a compiler - something you almost never look at and certainly won't try to touch by hand.
The actual "code" is everything driving the harness.
The current problem with this is that the harness is not (yet) deterministic, so it's sort of like having a compiler where your output program works slightly differently every build, and then the compiler tries to just patch the binary programs when you recompile to minimise this problem, or even worse, disassembles the whole thing to figure out what it does, makes the change, and then recompiles it.
I think the telling part is in this line:
> Because the repository is entirely agent-generated, it’s optimized first for Codex’s legibility
I asked a question from the perspective of a human engineer, as in, I will have to read the code, understand it, and fix it once it breaks. OpenAI's approach is the opposite: even if it is breaking, it is the agent that will be doing the fixing; millions of lines and inelegant designs don't matter because human readability doesn't matter. In any case you use more tokens so you fork over more money.
I will say, however, that IMHO there is objectively bad and good code in terms what it can do and performance, if I can do the same thing in 50 lines as opposed to 1000 lines, this difference still matters for the model. Smaller context usage, better approach that informs downstream generation.
This is the part I think we will see become more relevant.
I created docs-cli (pypi) to manage the index of specs as source code: the framework that goes with it will first create tests for as much as it can, so reproducibility becomes the goal, not readability.
https://github.com/ArtRichards/docs-cli
https://artrichards.github.io/agent-playbook-suite/blog/
>I suspect it's not seen as a problem by providers because more lines generated means more tokens used and hence more billing put out on customers.
I have also grown skeptical of token usage in order to run up my bill! But since I feel like it takes me MORE effort to write LESS lines of code myself, I'd expect a quick and dirty AI-generated solution to be MORE lines of code and cost LESS to generate than a concise/elegant solution in LESS lines of code.
The latest frontier models will write code better than you and more elegant, with less lines of code, in 100th of the time, with full test coverage. Hand coding is like writing out assembly/machine code rather than using a compiler.
This hasn’t been my experience. State of the art models available to the public still do all sorts of bandaids and bad hacks. Putting code where it doesn’t belong. Stapling types onto variable (in TypeScript) when abstractions/types already exist to use. I use it to generate code, but still have to review every line and have corrections/steering basically every time.
Maybe you have access to some other model?
> The latest frontier models will write code better than you and more elegant
They often do, but they often don’t. I regularly have to push for more elegant, or less lazy solutions.
This.
Insisting on writing code by hand when LLMs are available is not software engineering in 2026. Engineers find the most cost-effective solution for the problem at hand that meets the requirements.
Far from everybody is using it for massive code generation - code size is still very much a liability. I currently use it (ds4-flash, local, code isn't open to non-local) to modify and trim a certain internal project for a different use case. Tons of review passes, fuzz test generation and running, "give me your thoughts on this", etc, etc - basically iterating over and over and over. Funnily enough, my intrinsic knowledge of the codebase improves massively like this. It is involved, but the acceleration is real - and I don't have to worry about wrist nerves as much.
The only people I know that have LoC/token use/etc metrics imposed on them work for big corps where such things are (or used to be) en vogue.
I think it's a rebuttal against claims that LLMs are incompatible with large code bases. It's not so much a flex about the quality of the code, it's more a flex about the complexity of the code and the LLMs ability to deal with such complexity.
Whether or not that complexity is warranted is a different story.
The codebase may be bloated by a factor of 10 but if the costs associated with that are less than the costs of developing the software from a business standpoint the choice is clear.
Yeah so all of personal computing (text editing, SVG antialiasing, etc.) fits in 20,000 LOC (VPRI's STEPS project), so a million lines of code is 50 reïnventions of personal computing. BUT: it is unlikely that humans would have solved this problem in 20 kLOC. Sussman said “we really don't know how to compute!” as his talk title, and LLMs had to ossify some pre-existing voice as the forever programming habitus, and it chose a persona that doesn't know how to program, because we don't, and now we are stuck with it. Claude is our tickets, our implementations, our documentation... And if you tell it “hey the node role should not have those permissions, that should be a service account” it will happily do the right thing, but it has no intrinsic sense of taste, and the error message it's trying to clear just says “the node role doesn't have that permission” and the system prompt says “keep it short, stupid” and graybeards might be our last bulwark.
> VPRI's STEPS project
The what now? Search engines failed me here.
This is a description of Alan Kay’s STEPS project:
https://worrydream.com/refs/Kay_2007_-_STEPS_2007_Progress_R...
This is the final report:
https://tinlizzie.org/VPRIPapers/tr2012001_steps.pdf
I've heard it said that measuring productivity of a software developer by lines of code added, is akin to measuring the productivity of an aerospace engineer by mass added.
It is a metric. It is often not a good metric. But it is easy to measure.
Indeed. The more complexity you add, or the more you generate without ensuring the code isn't pointless, the more bloat you get. It is still probably more of an MVP than junk developers will produce, but never anywhere near as good as what someone who studies a programming language would write.
> It's surprising to me that people who know about reward hacking choose a simple objective like lines of code generated as a signal for quality.
The simple answer is that promoting locs as a relevant metric is also reward hacking. Is it easier to promote big loc counts as a key metric, or is it easier to prove agentic engineering against harder metrics?
On a more general note, software practice marketers have been pushing in that direction for quite a while. "You need cloud", "Here's how to do agile at scale", "microservice everything", etc.
> I suspect it's not seen as a problem by providers because more lines generated means more tokens used and hence more billing put out on customers.
To generate elegant code with more restrictions, it takes more thinking tokens and stronger adherence to instructions. So the naive view that they are doing it for billing is wrong.
It's not a massive amount of code for the sake of it; it's that it built a complex, large piece of software itself. If you compared the Linux kernel LOC to some guy's toy kernel on GitHub, the LOC difference should give you an idea of the complexity. "Yeah why is Tolstoy flexing about the number of pages in his book, I wrote Spot goes to school and it's only 20 pages"
People want to do X, so the metric is how much X can be done.
Everyone is over-complicating the explanation. The answer for "why are we fixating on this bad metric" is almost always the same pattern.
Broad audiences need simple metrics to talk about. If the metric itself requires nuance, it's hard to communicate and hard to reason about. It's easier to push the need for nuance from understanding the metric itself down the road to where the metric is applied, which allows everyone to ignore it in immediate conversation.
It is a very valid question. My intuition (no grounding) is that it comes down to the model training. Optimizations traditionally have worked well in human-written software through either the experience of the developer, usage of architectural patterns, or a second or third pass of fine-tuning. In the case of model-written code (one token at a time), the only possible architectural optimization is either a strict guardrail on patterns to use for a specific implementation OR giving it a second or third optimization pass. All of which burns more tokens, but can lead to better software.
I don’t think the flex here is the amount of code alone. Their goal is to show that AI can improve productivity, the number of lines is just the proxy to that. This article is a marketing piece after all.
Now someone can argue that lines of code are not a good proxy of engineering productivity, but I wouldn’t be surprised if the audience they target with this content is not the HN commenters of this thread.
Correct on the first part, partially correct on the second. LOC is a bad metric, but it is at least a legible one. Lots of people working on better ways to measure Software Productivity!
Don't compare human loc with machine loc.
Compare machine to machine (as these headlines come) and discount that by a factor.
You can't really do that here because one of the key arguments for this, as people in the thread focus on is "1/10th of time" estimate, the comparison with humans is here already, albeit it is just an estimate and no actual comparison has been done.
This is a problem of conflicting incentives that exists today in my opinion. Companies will market greater human-AI collaboration in science and engineering but focus on releasing things like this where it is clear that downstream goal is complete agent ownership over the product, from inception to testing to monitoring. Maybe the speculative future agents will use their own very efficient language to code that won't be readable for people at all. They focus on agent code being readable by agent in the article, as you've said. But in my mind in at least near future, there is a case where your prod will break, you won't be able to understand it or the attempted fixes. Maybe agent will fail to fix it at all and start a massive rewrite. In any case is this different from kicking technical debt down the road along with worse interpretability of what you have built?
I do think there is a way where agent can write great solid code that we can read, but with the way LLMs are built this requires something new in terms of reward that accounts for "taste" and constant refinement so it might take more than 1/10th of a time to produce something good.
Because sloppy code doesn't matter. He wrote he completed it in 1/10th the time. That's equivalent to 1/10th the engineers.
That is a business win. That is really all that matters in capitalism.
The flex is a direct insult to your face. He is shitting on the faces of all software engineers (me included). It is equivalent to saying we don't need you to code anymore. One man can produce 10x the code.
So why am I voting him up even though he's shitting on my face? Because what he says is true. I value honesty and people who say it like it is. Yes my identity as a software engineer is getting dismantled before my very eyes. But the solution to this problem isn't some delusional statement about not understanding what he's flexing about. We're not stupid. Everyone on this thread understands his flex. The difference is some people like you don't want to understand it.
Like seriously. He literally wrote it was completed in 1/10th of the time and you expect me to believe that YOU don't know what HE is flexing about? Be real. You're not stupid.
The real flex is delivering it in one-tenth of the time. Mentioning lines of code is mostly noise.
I’ve worked with 20-year-old codebases and products that grew organically over decades and still sit well below a million lines of code. Using LOC as some kind of health or success metric makes me more suspicious than impressed.
It's a huge flex if the alternative is no code at all. Reward hacking aside, LOC resonates with me in the sense that I've seen 10+ projects to fruition that wouldn't have even begun without an agentic harness and an LLM.
It's like the difference between doing stock price predictions with binary "up" or "down" histories and trying to figure out how to normalize actual price histories (basically impossible). The binary work gives a well-defined signal.
Yep, that's true, pre-grounding is very much worth it. If you just feed the agent a quick environment brief upfront instead of making it spam tool calls to figure out where it is, you save a lot of tokens.
Codex updates usually appear every few hours (I am not saying this is how often they're published, but that's my perception as a user). Often I update Codex just to see a new update within an hour or so.
Many times those updates are not properly tested; for example, in one update the model selector got completely changed.
Then the next hotfix was pushed, which restored the original.
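A sketch of what such an upfront brief could be, assuming you simply prepend its output to the agent's first prompt. The chosen fields and the `build_brief` name are illustrative, not taken from any particular tool:

```python
import platform
import sys
from pathlib import Path


def build_brief(root: Path, max_entries: int = 20) -> str:
    """Summarize the working environment so the agent need not probe for it."""
    entries = sorted(p.name + ("/" if p.is_dir() else "") for p in root.iterdir())
    lines = [
        f"OS: {platform.system()}",
        f"Python: {sys.version_info.major}.{sys.version_info.minor}",
        f"Project root: {root.resolve()}",
        "Top-level entries: " + ", ".join(entries[:max_entries]),
    ]
    if len(entries) > max_entries:
        # Keep the brief short; the point is to save tokens, not spend them.
        lines.append(f"({len(entries) - max_entries} more entries omitted)")
    return "\n".join(lines)
```

A handful of lines like this replaces several exploratory tool calls at the start of every session.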
Who needs a QA team when you can just test on users and iterate instantly /s
I find it so interesting that "Agent legibility is the goal" picks up James C. Scott's term (without defining it, so I assume that's what they mean), which is _not a good thing_. Legibility is a governance effort to box in life.
Legibility has a time and place, for example, global health data around pandemics.
This matches my Cursor-based agentic repo almost verbatim.
There isn't anything here that was not already experienced and factored into constructs in the repo.
And I also find that all of the bits created for an effective agentic engineering project match perfectly with mainstream engineering best practices. That has been one of my primary reasons to go all in on agentic engineering; prior to this, applying best practices was always too costly and conflicted with the team's daily priorities.
I am at a major company that is essentially vibe coding. I’ve shipped about 100k LoC this entire half and am toward top 10% of my team. I find it likely that either
A. The code is absolute garbage and is speed for speed's sake
B. They’re using an internal model that is a generation beyond GPT 5.5
I say this because we’ve attempted to do something similar using the latest gen Claude models and a significantly larger team. The code is probably along the lines of millions of LoC but is an absolute mess because of vibing. There’s a price you pay for speed
I think vibe coding is okay for a small internal tool, dashboard, etc. but it’s definitely a no-go for a production running service.
Q1 - How much effort did you put into deterministic guardrails like AST linters, etc?
I find there’s a ton of slop unless hard guardrails are added, eg step 1 is just around syntax, step 2 is to enforce mental models
You still need someone steering direction and have a logically consistent idea of what you actually want to build
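As a concrete example of a deterministic guardrail, here is a small AST-based check built on Python's standard `ast` module. The rule itself (flagging bare `except:` clauses) is just an illustrative choice, not something from the article:

```python
import ast


def find_bare_excepts(source: str, filename: str = "<string>") -> list[str]:
    """Return one message per bare `except:` clause found in the source."""
    tree = ast.parse(source, filename=filename)
    return [
        f"{filename}:{node.lineno}: bare 'except:' is not allowed"
        for node in ast.walk(tree)
        # An ExceptHandler with no exception type is a bare `except:`.
        if isinstance(node, ast.ExceptHandler) and node.type is None
    ]
```

Run from a pre-push hook or CI and fail the build when the list is non-empty; the agent then gets the same error every time instead of a reviewer's opinion.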
Q2 - I find that vibe coding really accelerates FE projects because it’s possible to run everything locally and check results
For pure distributed infra backend more investments have to be made into the devloop to be able to shift left the feedback loop and decouple it from humans or real deploys
I like how they said they were spending 20% of their time addressing slop. Sounds like they’ve tried to automate the slop correction but it’s a good honest reminder.
Additionally it’s an internal tool, which is likely much more amenable to slop.
This would be much more convincing if the repos, issue trackers, etc. were accessible.
I’d be interested to know two things:
1. What’s the job satisfaction like day to day being an engineer on this project? How have they adapted to this way of working?
2. How much did it cost? Work is being done whilst the engineers sleep, but if that 6-hour overnight task cost $300 and could have been done by a person in 2 hours, is it a real saving?
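The second question is simple arithmetic. With the hypothetical numbers above ($300 for the overnight run, 2 hours for a person), the agent run is only cheaper on direct cost if the engineer's fully loaded rate exceeds $150 per hour. The $100/hour rate below is an assumption for illustration only:

```python
def break_even_hourly_rate(agent_cost: float, human_hours: float) -> float:
    """Hourly rate above which the agent run is cheaper than the human."""
    return agent_cost / human_hours


def agent_saving(agent_cost: float, human_hours: float, hourly_rate: float) -> float:
    """Positive when the agent run costs less than the human doing the task."""
    return human_hours * hourly_rate - agent_cost


print(break_even_hourly_rate(300.0, 2.0))  # 150.0
print(agent_saving(300.0, 2.0, 100.0))     # -100.0, assuming a $100/hour engineer
```

This ignores that the overnight run does not occupy the engineer's working hours, which is arguably the real trade-off behind the question.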
> 1. What’s the job satisfaction like day to day being an engineer on this project? How have they adapted to this way of working?
The job satisfaction is looking at the bank account every time you feel your job sucks.
> The diagram below shows the rule: within each business domain (e.g. App Settings), code can only depend “forward” through a fixed set of layers (Types → Config → Repo → Service → Runtime → UI). Cross-cutting concerns (auth, connectors, telemetry, feature flags) enter through a single explicit interface: Providers. Anything else is disallowed and enforced mechanically.
Can anyone give me a simplified explanation of what they’re saying here? Having some trouble understanding.
They're describing a layered architecture enforced by some script in CI.
For example, if you had a `backend`, `common`, and `frontend` package, you would be OK having backend/frontend depending on common, but you wouldn't want common depending on backend/frontend or backend/frontend depending on each other.
If you think about JavaScript, there is nothing stopping your dependency graph from becoming spaghetti. It sounds like they built static analysis to enforce rules.
Some languages have this built in like Java (Project Jigsaw), Go, and Rust. JavaScript, Python, etc. have no such feature.
It's really nothing special -- it has existed before. It just becomes a _lot_ more important with agents since they produce a lot of code, and it is good to have lots of static analysis when heavily utilizing agents.
They mention this in the article:
> This is the kind of architecture you usually postpone until you have hundreds of engineers. With coding agents, it’s an early prerequisite: the constraints are what allows speed without decay or architectural drift.
As other commenters have clarified, it's about layering, separation of concerns etc. Goes by many names. One such terminology here: https://en.wikipedia.org/wiki/Hexagonal_architecture_(softwa.... DI frameworks use terminology like "Provider": https://en.wikipedia.org/wiki/Dependency_injection#Injectors
IIUC it's just strict separation of concerns
Eg UI cannot reach down and directly read config files
Configs must only be read by (I'm assuming) a storage interface layer called repo
There’s a strict directionality of dependency
Somewhat similar to ports and adapters but presumably more strictly enforced by deterministic linters
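For anyone who wants to see what "enforced mechanically" could look like, here is a minimal sketch of such a check. It is not the article's tooling. It assumes module names of the form `domain.layer.rest` (a hypothetical convention), reads the rule as "a layer may import from itself or any earlier layer in the chain", and lets anything import from `providers`. A real linter would extract the import edges from source instead of taking them as a list.

```python
# Layer order quoted in the article: Types -> Config -> Repo -> Service -> Runtime -> UI
LAYERS = ["types", "config", "repo", "service", "runtime", "ui"]
RANK = {name: index for index, name in enumerate(LAYERS)}
CROSS_CUTTING = "providers"  # the single interface for auth, telemetry, etc.


def layer_of(module: str) -> str:
    """Return the layer segment of a 'domain.layer.rest' module name (assumed convention)."""
    parts = module.split(".")
    return parts[1] if len(parts) > 1 else ""


def is_allowed(importer: str, imported: str) -> bool:
    """An import is allowed if it targets providers, the same layer, or an earlier layer."""
    src, dst = layer_of(importer), layer_of(imported)
    if dst == CROSS_CUTTING:
        return True
    if src not in RANK or dst not in RANK:
        return False  # unknown layers are rejected rather than silently accepted
    return RANK[dst] <= RANK[src]


def find_violations(edges):
    """Return the (importer, imported) pairs that break the layering rule."""
    return [(a, b) for a, b in edges if not is_allowed(a, b)]


edges = [
    ("settings.ui.page", "settings.service.api"),     # UI -> Service: fine
    ("settings.types.model", "settings.ui.page"),     # Types -> UI: violation
    ("settings.repo.store", "settings.providers.auth"),  # via Providers: fine
]
for importer, imported in find_violations(edges):
    print(f"{importer} must not import {imported}")
```

Run in CI, a non-empty violation list would fail the build, which is the "deterministic linter" part: the agent gets the same answer every time regardless of how it phrases the change.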
I started using chatgpt for functions and checking, then for single file changes and checking, now for multiple changes and checking. I am at a point where the only changes I correct are architectural. So it may start to become smarter to learn how to see only the architectural directions while multiple agents work, test, and commit both on unit and against live deployment.
> To drive a PR to completion, we instruct Codex to review its own changes locally, request additional specific agent reviews both locally and in the cloud, respond to any human or agent given feedback, and iterate in a loop until all agent reviewers are satisfied (effectively this is a Ralph Wiggum Loop).
https://ghuntley.com/loop/
OK, but you spent 1x tokens to generate, then another 1x to review locally, 1x for the local agent review, 1x for the cloud review, then ???x until all bots are satisfied.
You end up spending at least 5x the tokens so that a prediction machine can maybe find a discontinuity?
I would say a far better approach is 1.123x to generate code + tests + passing analysis tools + human review + 1x "simplify as much as possible", rather than letting the snake eat its own tail without boundaries.
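The multiplier argument can be sketched as a tiny cost model. Everything here is an assumption for illustration: every pass (generate, review, fix) is priced at the same 1x, there are three reviewers (self-review, local agent, cloud agent), and one fix pass sits between consecutive review rounds.

```python
def loop_multiplier(reviewers: int, rounds: int, fix_cost: float = 1.0) -> float:
    """Total tokens relative to a single generation pass (1x)."""
    generation = 1.0
    reviews = reviewers * rounds      # every reviewer re-reads the change each round
    fixes = fix_cost * (rounds - 1)   # one fix pass between consecutive rounds
    return generation + reviews + fixes


print(loop_multiplier(reviewers=3, rounds=1))  # 4.0
print(loop_multiplier(reviewers=3, rounds=2))  # 8.0
```

Under these assumptions a single clean round is already 4x, and any change that needs a second round lands at 8x, so "at least 5x" is roughly where the loop starts once anything gets sent back. The alternative above comes to about 2.1x in tokens (1.123x + 1x), with the human review paid in time instead.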
Codex pushed an update that made my old threads inaccessible. It takes a million lines to put out a half-baked CRUD app?
Leveraging a better way. No last mile.
https://github.com/space-bacon/SRT
1 million lines of code aside, I feel like anyone who seriously thought about this would eventually run their own harness.
Just like .vimrc and .zshrc, the harness "code" itself can be easy and personal, provided that it's built on a working, existing construct such as tmux.
Isn't this essentially normal AI usage and what everyone has been doing for 6 months?
I understand that they’ve written zero lines of code for this application, but would it kill them to write a few lines of the blog post by hand?
Forcing readers to wade through an unceasing string of LLM clichés demonstrates the opposite of the point you’re trying to make—that the consumers of your work are worse off because you exercised no human judgment in creating it.
But this is almost what we have been doing for the last 3 to 5 months, isn’t it?
Article is from February so that tracks
Well to a lot of people this is still a foreign concept.
why do you have “weeks” to ship what would take “months”?
> We intentionally chose this constraint so we would build what was necessary to increase engineering velocity by orders of magnitude.
I guess orders of magnitude ain’t what they used to be.
I don't really understand the naysayers here.
Let me start with a premise: I don't like where software engineering is heading, at all. I have never been unhappier to work in this field since AI came out. And no, it is not possible to opt out of AI, especially when your teammates are all great engineers whose productivity increased a lot without any drops in quality code-wise (in fact the opposite has happened). You need to keep up. But it's tiring and the fun/interesting parts are disappearing.
That being said, it's clear that harness engineering is the most important part of our job and that task is going to take increasingly more of our time. So a glimpse of how an AI company handles it is interesting by any measure.
Dear OpenAI, the target audience of your blog or at least of this blog post understands English pretty well. Why won't you give them a simple way to disable the shitty ai translation and read the original content? Why translate it at all in the first place?
EDIT: found the button, all the way down in the bottom of the page... I hate this so much, give me the original content, I will decide if and when I need translation
Given that we can code at 10x speed for at least half a year, one would expect to see at least some pieces of machine-created software with 5 years' worth of equivalent human engineering work.
Anyone know some?
Title should probably be marked with (February 2026).
Step 1: Be rich.
"Engineering"
These people are so delusional it feels like a mental disease by now.
I really hope no one gets hurt by all this slop code in the future by these wannabe engineers.
The world is now agent-first already?
I would never dare put that in production
I wonder why we as engineers aren't protesting AI in the same way that artists and people in film and television are. This post should instill the same terror that visual artists feel.
If you're a more senior person in tech, this post is effectively saying that a large portion of your skillset is about to become completely worthless. This goes beyond the skills involved in writing the code. Everything that you've learned over years about how to determine whether code is good or bad, and what practices make an engineering team effective is not just obsolete, it's fundamentally counter-productive because it assumes a slow, human-centric process that requires you to actually review and understand the code. Even your ability to mentor junior engineers is now obsolete, because all that experience you've built up is now worthless to them.
If this is the approach the industry takes, particularly when combined with a lack of interest in quality from the business (and let's face it, consumers have shown us that they're happy to pay for cheap crap), it's hard to see much of a future for software engineers. You don't need thousands of people with deep technical expertise, you need a handful of manager-types, who will focus on defining product and business requirements and configuring how the AI gets enough context to implement the requirements.
Maybe, if we're extremely lucky, there's so much demand for software that total employment doesn't fall off a cliff, but the nature of the work will change so much that many older, more expensive engineers will become unemployable. Those who remain will have to accept that the skills they spent decades developing are now worthless, that younger engineers no longer respect or listen to them, that the business no longer sees them as experts worthy of respect, but old fogies who grew up in a different world.
Joe Biden liked to say that a job is more than just a paycheck, it's part of your identity and your sense of self-worth. We're all very used to a certain level of respect (and commensurate remuneration). If you don't think that's true, compare how a software engineer is treated to how a warehouse worker is treated. What happens when we lose that?
>a large portion of your skillset is about to become completely worthless
I'm not convinced of that.
I watched a video of an architect using AI to create architectural drawings. It became very clear to me that he has a lot of skills and terminology that helped him produce something very specific, in a few minutes. I've been working on some home improvement stuff including a studio/shed and I've struggled to produce even something simple (currently trying to get a conversation packet on the roof trusses to take to the permit department to get started). Even with my high school architecture class.
After watching that I wonder how much of what I'm doing with AI that looks easy is because I have deep technical knowledge, plus 3 years of heavy work with AI.
This is the case now - I can explain to the AI that I want to re-factor a component to support different implementations using a strategy pattern, and I can get a similar outcome to what I would have written, just implemented a bit faster. My expertise brings value.
But that's not what this specific article is describing. The world this article is describing is one where you describe the business requirements, and you don't think about how it's implemented. You don't write the code, you don't review the code, you don't test the code. You give the AI business requirements and you give it access to sources of context (slack, meeting notes, etc). Every place where the human would act as a gate reduces throughput, so it should be eliminated through building harnesses and providing context.
What they're doing here is the equivalent of taking a factory where you have 2 process engineers and 100 operators, and replacing all the operators with robots. They want to automate the whole process of making the software and just leave the part that figures out how to make the automation work effectively.
In this world, the average software company doesn't need people who know how to write good software, because writing, reviewing, maintaining, and testing the software will be entirely automated. There will be a small number of people at companies like OpenAI that need to know how to write good software in order to supervise training the models, and there will be a small number of people at the software companies who have expertise in setting up the automation.
>where you describe the business requirements
That right there is what I'm talking about: that architect would write the requirements for a building way different than I would.
How do you keep your skills if you no longer engage in the activity that keeps them sharp?
See, I just don't get that angle.
Just because I'm not typing "strcat(); strcpy(); sprintf()" doesn't mean I'm not thinking about problems. I'm still doing critical thinking all over my stack, and I don't see that going away. I'm just doing different thinking.
There are people who think, and AI just isn't going to change that. There are people who don't think, and they've existed long before AI. Back in the 90s when I worked at the phone company, man, I worked with some people who didn't do a lick of work (along with some really sharp people).
What is that protest going to get us? We'll convince or force business leaders to not use a cheaper/better tool and protect our jobs? And nobody else in the world is going to pivot either? And our companies will remain competitive?
Software engineers have always adapted to new technologies. New languages, frameworks, native apps, browser apps etc. So far this doesn't seem to be close to completely removing us from the loop.
If you are smart, educated, and can adapt, you'll figure it out. The economy has to find some stable equilibrium and it's not a zero sum game. Everyone in the economy getting a paycheck is also a consumer. With no consumers there is no business. The companies who are using AI and become more productive can do more things that before were not profitable but now are. Some of the people who are getting laid off are going to start new businesses and hire people. These things always cycle, and they basically have to.
I don't have a crystal ball though.
It's the other way around, unfortunately. The senior engineers will still be useful for architecture and infrastructure considerations, as well as guiding the agents. It's the junior engineers that get nailed, because there's little incentive to hire one when a LLM does a better job immediately and costs less.
That's true now. But in the world of this article, it's also the senior engineers that get nailed. In the world of this article, all code is like what machine code or bytecode is now - it's designed to be used by the machine, not the human, because the expectation is that humans will rarely, if ever, touch it.
Individual voices aren't strong enough to drown the marketing machine.
Artists and writers are unionized, which is why they have a more powerful collective voice.
Second, there are enough people whose jobs are too well paid and too cozy for them to dare rock the boat.
The economy and job market aren't so hot either at the moment for people to quickly be able to jump ship.
Can you even be sure that you find a tech company that isn't jumping head first onto the AI hype train? Even politicians can't have enough of AI in their mouth.
I for one am not protesting because I know that this is bullshit marketing nonsense. Look at reliability metrics of OpenAI, they’re terrible. Everyone knew a long way ahead that it’s a scam, now they’re cranking up pricing and trying to rug pull. There will be a lot of developers who will come out very well once the stock tanks. That’s my two cents
engineers undervalue their own process
artists overvalue their own outputs
> in an agent-first world
casual gaslighting
digression:
It's interesting this was submitted to HN over 15 times since it was published in February: https://hn.algolia.com/?dateRange=all&page=0&prefix=false&qu...
But this is the only submission that's had any traction. Since the content is nearly the same for all submissions, it highlights how getting to the front page can be a bit random. (Though this is the only one that capitalized 'Leveraged' so maybe that's the secret)
time of day also matters
(OP here) The truth is much simpler. I mailed the mods and they bumped it up in the second chance pool [1] :)
1. https://news.ycombinator.com/item?id=26998309
Published Feb 11, 2026
Might as well be 2025.
> Over the past five months, our team has been running an experiment: building and shipping an internal beta of a software product with 0 lines of manually-written code.
This is such a common thing among software engineers nowadays that I was very surprised that OpenAI would open with that line as if it were mind blowing.
But then I saw it was published in February and OP is just reposting it to farm karma.
Everyone is criticizing the number of lines of code and the lack of attention that must certainly have been applied to generate that code and push it into production. What is being ignored is this awesome prompt that is almost certainly better than having no agents.md or plans.md or whatever you've come up with, to add validation steps for committed changes. You're still free to look at your code, the changes, and ask the agent to clean up. Try it. It's really nice.