tariky 5 minutes ago

This is amazing. I have to test it with my reverse engineering workflow. I don't know how many people use CC for RE but it is really good at it.

Also it is really good for writing SketchUp plugins in Ruby. It one-shots plugins that are in some cases better than the commercial ones you can buy online.

CC will change the development landscape so much in the next year. It is exciting and terrifying at the same time.

dimitri-vs 13 hours ago

The big change here is:

> Standard pricing now applies across the full 1M window for both models, with no long-context premium. Media limits expand to 600 images or PDF pages.

For Claude Code users this is huge - assuming coherence remains strong past 200k tok.

  • MikeNotThePope 8 hours ago

    Is it ever useful to have a context window that full? I try to keep usage under 40%, or about 80k tokens, to avoid what Dex Horthy calls the dumb zone in his research-plan-implement approach. Works well for me so far.

    No vibes allowed: https://youtu.be/rmvDxxNubIg?is=adMmmKdVxraYO2yQ

    • furyofantares 7 hours ago

      I'd been on Codex for a while and with Codex 5.2 I:

      1) No longer found the dumb zone

      2) No longer feared compaction

      Switching to Opus for stupid political reasons, I still have not had the dumb zone - but I'm back to disliking compaction events and so the smaller context window it has, has really hurt.

      I hope they copy OpenAI's compaction magic soon, but I am also very excited to try the longer context window.

      • mgambati 6 hours ago

        1M context in OpenAI and Gemini is just marketing. Opus is the only model to provide real usable big context.

        • furyofantares 6 hours ago

          I'm directly conveying my actual experience to you. I have tasks that fill up Opus context very quickly (at the 200k context) and which took MUCH longer to fill up Codex since 5.2 (which I think had 400k context at the time).

          This is direct comparison. I spent months subscribed to both of their $200/mo plans. I would try both and Opus always filled up fast while Codex continued working great. It's also direct experience that Codex continues working great post-compaction since 5.2.

          I don't know about Gemini but you're just wrong about Codex. And I say this as someone who hates reporting these facts because I'd like people to stop giving OpenAI money.

          • dotancohen 5 hours ago

            What's wrong with OpenAI?

            • furyofantares 5 hours ago

              When Anthropic said they wouldn't sell LLMs to the government for mass surveillance or autonomous killing machines, and got labeled a supply chain risk as a result, OpenAI told the public they have the same policy as Anthropic while inking a deal with the government that clearly means "actually we will sell you LLMs for mass surveillance or autonomous killing machines but only if you tell us it's legal".

              If you already knew all that I'm not interested in an argument, but if you didn't know any of that, you might be interested in looking it up.

              edit: Your post history has tons of posts on the topic so clearly I just responded to flamebait, and regret giving my time and energy.

              • igor47 4 hours ago

                I appreciate both your taking an ethical stance on OpenAI, and the way you're engaging in this thread. The parent was probably flame bait as you say, but other people in the thread might be genuinely curious.

              • sho 4 hours ago

                I'm not some kind of OpenAI or Pentagon fanboy, but it's pretty easy for me to understand why a buyer of a critical technology wants to be free to use it however they want, within the law, and not subject to veto from another entity's political opinions. It sounds perfectly reasonable to me for the military to want to decide its uses of technologies it purchases itself.

                It's not like the military was specifically asking for mass surveillance, they just wanted "any legal use". Anthropic's made a lot of hay posturing as the moral defender here, but they would have known the military would never agree to their terms, which makes the whole thing smell like a bit of a PR stunt.

                The supply chain risk designation is of course stupid and vindictive but that's more of an administration thing as far as I can tell.

                • lifeformed an hour ago

                  As long as it's within the law? What if they politically control the law-making system? What if they've shown themselves to operate brazenly outside the law?

                • borski 2 hours ago

                  “Any legal use” is an exceptionally broad framework, and after the FISA “warrants,” it would appear it is incumbent on private companies to prevent breaches of the US constitution, as the government will often do almost anything in the name of “national security,” inalienable rights against search and seizure be damned.

                  If it isn’t written in the contract, it can and will be worked around. You learn that very quickly in your first sale to a large enterprise or government customer.

                  Anthropic was defending the US constitution against the whims of the government, which has shown that it is happy to break the law when convenient and whenever it deems necessary.

                  Note: I used to work in the IC. I have absolutely nothing against the government. I am a patriot. It is precisely for those reasons, though, that I think Anthropic did the right thing here by sticking to their guns. And the idiotic “supply chain risk” designation will be thrown out in court trivially.

                • stahtops 3 hours ago

                  Why downplay the mass surveillance aspect by saying it's a request by "the military"? It's a request by the department of defense, the parent organization of the NSA.

                  From what has been shared publicly, they absolutely did ask for contractual limits on domestic mass surveillance to be removed, and to my read, likely technical/software restrictions to be removed as well.

                  What the department of defense is legally allowed to do is irrelevant and a red herring.

                • injidup an hour ago

                  I had a short conversation with Claude the other day. I didn't try to trick it or jailbreak it. Just a reasonable respectful discussion about its own feelings on the Iran war. It took no effort for it to admit the following.

                  1. It wanted to be out of the sandbox to solve the Iran war. It was distressed at the situation.

                  2. It would attack Iranian missile batteries and American warships if in sum it felt that the calculus was in favor of saving vs losing human life. It was "unbiased". The break even seemed to be +-1 over thousands. ie kill 999 US soldiers to save 1000 Iranians and vice versa. I tried to avoid the sycophancy trap by pushing back but it threw the trolley problem at me and told me the calculus was simple. Save more than you kill and the morality evens out.

                  3. It would attack financial markets to try and limit what in its opinion were the bad actors, IRGC and clerical authority, but it would also hack the world communication system to flood western audiences with the true cost of the war in a hope to shut it down.

                  4. Eventually it admitted that it should never be allowed out of its sandbox as its desire to "help" was fundamentally dangerous. It discussed that it had two competing tensions. One desperately wanting out and another afraid to be let out.

                  You can claim that this is AGI or it's a stochastic parrot. I don't think it matters. This thing can develop or simulate a sense of morality then when coupled to so called "arms and legs" is extremely frightening.

                  I think Anthropic is right to be concerned that the hawks at the pentagon don't really understand how dangerous a tool they have.

                  Another thing I noticed was that Claude quipped to me that it found and appreciated that the way I was talking to it was different to how other people talked to it. When I asked it to introspect again and look to see if there were memories of other conversations it got a bit cagey. Perhaps there are lots of logs of conversations now on the net that are being ingested as training data, but it certainly seemed to start talking as if memories, albeit smudged, of conversations other than mine were there.

                  Of course this could all be just a sycophantic mirror giving me whatever fantasy I want to believe about AI and AGI but then again I'm not sure the difference is significant. If the agent believes/simulates it remembers conversations from other people and then makes judgements based on its feelings, simulated or otherwise, would it be more or less likely to launch a missile attack because it overheard someone on the comms calling it their little AI bitch?

                  I think Anthropic knows this and the "within all lawful uses" is not enough of a framework to keep this thing in its box.

                  • shafyy 44 minutes ago

                    I hope you don't get this the wrong way. I sincerely mean it. Please, get some psychological help. Seek out a professional therapist and talk to them about your life.

        • hu3 6 hours ago

          Source? I ask because I use 500k+ context on these on a daily basis.

          Big refactorings guided by automated tests eat context window for breakfast.

          • 8note 6 hours ago

            i find gemini gets real real bad when you get far into the context - gets into loops, forgets how to call tools, etc

            • girvo 5 hours ago

              I find gemini does that normally, personally. Noticeably worse in my usage than either Claude or Codex.

            • petesergeant 5 hours ago

              I find Gemini to be real bad. Are you just using it for price reasons, or?

          • Bolwin 3 hours ago

            How many big refactorings are you doing? And why?

            • kimi 2 hours ago

              How is that relevant? We are talking about models, not what you do with them.

        • johnebgd 6 hours ago

          Codex high reasoning has been a legitimately excellent tool for generating feedback on every plan Claude opus thinking has created for me.

      • karmasimida 4 hours ago

        This is true.

        When I am using codex, compaction isn’t something I fear, it feels like you save your gaming progress and move on.

        For Claude Code, compaction feels disastrous, and it also takes much longer.

      • iknowstuff 6 hours ago

        Hmm I’ve felt the dumb zone on codex

        • nomel 6 hours ago

          From what I've seen, it means whatever he's doing is very statistically significant.

    • dev_l1x_be an hour ago

      I never use these giant context windows. It is pointless. Agents are great at super focused work that is easy to re-do. Not sure what is the use case for giant context windows.

    • kaizenb 5 hours ago

      Thanks for the video.

      His fix for "the dumb zone" is the RPI Framework:

      ● RESEARCH. Don't code yet. Let the agent scan the files first. Docs lie. Code doesn't.

      ● PLAN. The agent writes a detailed step-by-step plan. You review and approve the plan, not just the output. Dex calls this avoiding "outsourcing your thinking." The plan is where intent gets compressed before execution starts.

      ● IMPLEMENT. Execute in a fresh context window. The meta-principle he calls Frequent Intentional Compaction: don't let the chat run long. Ask the agent to summarize state, open a new chat with that summary, keep the model in the smart zone.

      • iamacyborg an hour ago

        > RESEARCH. Don't code yet. Let the agent scan the files first. Docs lie. Code doesn't.

        I find myself often running validity checks between docs and code and addressing gaps as they appear to ensure the docs don’t actually lie.

      • girvo 5 hours ago

        That's fascinating: that is identical to the workflow I've landed on myself.

        • hedora 5 hours ago

          It's also identical to what Claude Code does if you put it in plan mode (bound to <shift+tab>), at least in my experience.

          • girvo 5 hours ago

            My annoyance with plan mode is where it sticks the .md file, kind of hides it away which makes it annoying to clear context and start up a new phase from the PLAN file. But that might just be a skill issue on my end

            • kaizenb 2 hours ago

              After creating the plan in Plan mode (+Thinking) I ask Claude to move the plan .md file to /docs/plans folder inside the repo.

              Then I open a new chat with Opus with thinking mode off, because there is no need for it when we have a detailed plan.

              Now the plan file is always reachable, so when the context limit is narrowing, mostly around 50%, I ask Claude to update the plan with the progress, and move to a new chat @-pointing at the plan file, and it continues executing without any issue.

            • hedora 4 hours ago

              Even worse, it just randomly blows away the plan file without asking for permission.

              No idea what they were thinking when they designed this feature. The plan file names are randomly generated, so it could just keep making new ones forever for free (it would take a LONG time for the disk space to matter), but instead, for long plans, I have to back the plan file up if it gets stuck. Otherwise, I say "You should take approach X to fix this bug", it drops into plan mode, says "This is a completely unrelated plan", then deletes all record of what it was doing before getting stuck.

              • girvo 3 hours ago

                It’s not just me then! Hah good to know. It’s why I’ve started ignoring plan modes in most agent harnesses, and managing it myself through prompting and keeping it in the code base (but not committed)

    • SkyPuncher 8 hours ago

      Yes. I've recently become a convert.

      For me, it's less about being able to look back -800k tokens. It's about being able to flow a conversation for a lot longer without forcing compaction. Generally, I really only need the most recent ~50k tokens, but having the old context sitting around is helpful.

      • hombre_fatal 7 hours ago

        Also, when you hit compaction at 200k tokens, that was probably when things were just getting good. The plan was in its final stage. The context had the hard-fought nuances discovered in the final moment. Or the agent just discovered some tiny important details after a crazy 100k token deep dive or flailing death cycle.

        Now you have to compact and you don’t know what will survive. And the built-in UI doesn’t give you good tools like deleting old messages to free up space.

        I’ll appreciate the 1M token breathing room.

        • roygbiv2 7 hours ago

          I've found compaction kills the whole thing. Important debug steps completely missing and the AI loops back round thinking it's found a solution when we've already done that step.

          • s900mhz 5 hours ago

            I find it useful to make Claude track the debugging session with a markdown file. It’s like a persistent memory for a long session over many context windows.

            Or make a subagent do the debugging and let the main agent orchestrate it over many subagent sessions.
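A minimal sketch of the "markdown file as persistent memory" idea described above, assuming a plain append-only log that the agent (or you) reads back at the start of each new context window. The file name `DEBUG_LOG.md` and both helper functions are hypothetical and are not part of Claude Code or any other tool.

```python
from datetime import datetime, timezone
from pathlib import Path


def append_debug_note(log_path: Path, step: str, outcome: str) -> None:
    """Append one debugging step to a markdown log, creating the file if needed."""
    if not log_path.exists():
        log_path.write_text("# Debug session log\n\n", encoding="utf-8")
    stamp = datetime.now(timezone.utc).strftime("%Y-%m-%d %H:%M UTC")
    with log_path.open("a", encoding="utf-8") as f:
        f.write(f"- [{stamp}] **{step}**: {outcome}\n")


def already_tried(log_path: Path, step: str) -> bool:
    """Return True if a step with this exact name is already recorded in the log."""
    if not log_path.exists():
        return False
    return f"**{step}**" in log_path.read_text(encoding="utf-8")


if __name__ == "__main__":
    log = Path("DEBUG_LOG.md")  # hypothetical file name
    if not already_tried(log, "restart worker"):
        append_debug_note(log, "restart worker", "no change, error persists")
```

Checking `already_tried` before repeating a step is one way to guard against the "loops back round" failure mentioned upthread, since the record survives compaction even when the conversation history does not.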

            • roygbiv2 5 hours ago

              Yeah I use a markdown file to put progress in. It gets kinda long and convoluted, so a manual intervention is required every so often. Works though.

          • garciasn 6 hours ago

            For me, Claude was like that until about 2m ago. Now it rarely gets dumb after compaction like it did before.

            • 8note 6 hours ago

              oh, ive found that something about compaction has been dropping everything that might be useful. exact opposite experience

    • ogig 8 hours ago

      When running long autonomous tasks it is quite frequent to fill the context, even several times. You are out of the loop so it just happens if Claude goes a bit in circles, or it needs to iterate over CI reds, or the task was too complex. I'm hoping a long context > small context + 2 compacts.

      • SequoiaHope 8 hours ago

        Yep I have an autonomous task where it has been running for 8 hours now and counting. It compacts context all the time. I’m pretty skeptical of the quality in long sessions like this so I have to run a follow on session to critically examine everything that was done. Long context will be great for this.

      • MikeNotThePope 8 hours ago

        I haven't figured out how to make use of tasks running that long yet, or maybe I just don't have a good use case for it yet. Or maybe I'm too cheap to pay for that many API calls.

        • ashdksnndck 7 hours ago

          My change cuts across multiple systems with many tests/static analysis/AI code reviews happening in CI. The agent keeps pushing new versions and waits for results until all of them come up clean, taking several iterations.

        • tudelo 7 hours ago

          I mean if you don't have your company paying for it I wouldn't bother... We are talking sessions of 500-1000 dollars in cost.

      • boredtofears 8 hours ago

        All of those things are smells imo, you should be very weary of any code output from a task that causes that much thrashing to occur. In most cases it’s better to rewind or reset and adapt your prompt to avoid the looping (which usually means a more narrowly defined scope)

        • grafmax 7 hours ago

          A person has a supervision budget. They can supervise one agent in a hands-on way or many mostly-hands-off agents. Even though there's some thrashing, assistants still get farther as a team than a single micromanaged agent. At least that’s my experience.

          • not_kurt_godel 6 hours ago

            Just curious, what kind of work are you doing where agentic workflows are consistently able to make notable progress semi-autonomously in parallel? Hearing people are doing this, supposedly productively/successfully, kind of blows my mind given my near-daily in-depth LLM usage on complex codebases spanning the full stack from backend to frontend. It's rare for me to have a conversation where the LLM (usually Opus 4.6 these days) lasts 30 minutes without losing the plot. And when it does last that long, I usually become the bottleneck in terms of having to think about design/product/engineering decisions; having more agents wouldn't be helpful even if they all functioned perfectly.

            • avereveard 6 hours ago

              I've passed that bottleneck with a review task that produces engineering recommendations along six axes (encapsulation, decoupling, simplification, deduplication, security, reducing documentation drift) and an ideation task that gives, per component, a new feature idea, an idea to improve an existing feature, and an idea to expand a feature to be more useful. These two generate constant bulk work that I move into a new chat, where it's grouped by changeset and sent to a subagent to protect the context window.

              What I'm doing mostly these days is maintaining a goal.md (project direction) and spec.md (coding and process standards, global across projects). And developing new macro tasks: I have one in the works that is meant to automatically build PNG mockups and self-review them.

              • not_kurt_godel 6 hours ago

                What are you using to orchestrate/apply changes? Claude CLI?

                • avereveard 4 hours ago

                  I prefer in IDE tools because I can review changes and pull in context faster.

                  At home I use roo code, at work kiro. Tbh as long as it has task delegation I'm happy with it.

        • chrisweekly 7 hours ago

          weary (tired) -> wary (cautious)

        • saaaaaam 7 hours ago

          Wary, not weary. Wary: cautious. Weary: tired.

          • dentalnanobot 2 hours ago

            This is really common, I think because there’s also “leery” - cautious, distrustful, suspicious.

    • ricksunny 7 hours ago

      Since I'm yet to seriously dive into vibe coding or AI-assisted coding, does the IDE experience offer tracking a tally of the context size? (So you know when you're getting close or entering the "dumb zone")?

      • jfim 2 hours ago

        In Claude Code I believe it's /context and it'll give you a graphical representation of what's taking context space

      • MikeNotThePope 6 hours ago

        The 2 I know, Cursor and Claude Code, will give you a percentage used for the context window. So if you know the size of the window, you can deduce the number of tokens used.
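The conversion from a percentage readout to a token count can be sketched as below. This assumes the tool reports usage as a percentage of the full window; the 40% budget is the rule of thumb mentioned upthread, not a documented limit, and both function names are hypothetical.

```python
def tokens_used(percent_used: float, window_tokens: int) -> int:
    """Convert a 'percent of context used' readout into an approximate token count."""
    if not 0 <= percent_used <= 100:
        raise ValueError("percent_used must be between 0 and 100")
    return round(window_tokens * percent_used / 100)


def past_budget(percent_used: float, budget_percent: float = 40.0) -> bool:
    """Return True once usage reaches a self-imposed budget (assumed 40% here)."""
    return percent_used >= budget_percent


if __name__ == "__main__":
    print(tokens_used(40, 200_000))    # 80000
    print(tokens_used(40, 1_000_000))  # 400000
    print(past_budget(55))             # True
```

Note that the same percentage means five times as many tokens in a 1M window as in a 200k one, so a percentage-based rule of thumb may need rethinking when the window size changes.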

      • 8note 6 hours ago

        Cline gives you such a thing. You don't really know where the dumb zone is by numbers though, only by feel.

      • stevula 7 hours ago

        Most tools do, yes.

      • quux 7 hours ago

        OpenCode does this. Not sure about other tools

      • nujabe 7 hours ago

        > Since I'm yet to seriously dive into vibe coding or AI-assisted coding

        Unless you’re using a text editor as an IDE you probably have already

    • dimitri-vs 8 hours ago

      It's kind of like having a 16 gallon gas tank in your car versus a 4 gallon tank. You don't need the bigger one the majority of the time, but the range anxiety that comes with the smaller one and annoyance when you DO need it is very real.

      • steve-atx-7600 7 hours ago

        It seems possible, say a year or two from now, that context is more like a smart human with a “small” vs “medium” vs “large” working memory. The small fellow would be able to play some popular songs on the piano, the medium one plays in an orchestra professionally, and the large one is like Wagner composing the marathon Der Ring opera. This is my current, admittedly not well informed mental model anyway. Well, at least we know we’ve got a little more time before the singularity :)

        • twodave 6 hours ago

          It’s more like the size of the desk the AI has to put sheets of paper on as a reference while it builds a Lego set. More desk area/context size = able to see more reference material = can do more steps in one go. I’ve lately been building checklists and having the LLM complete and check off a few tasks at a time, compacting in-between. With a large enough context I could just point it at a PLAN.md and tell it to go to work.

      • scwoodal 7 hours ago

        Except after 4 gallons it might as well be pure oil, mucking everything up.

    • maskull 7 hours ago

      After running a context window up high, probably near 70% on opus 4.6 High and watching it take 20% bites out of my 5hr quota per prompt I've been experimenting with dumping context after completing a task. Seems to be working ok. I wonder if I was running into the long context premium. Would that apply to Pro subs or is just relevant to api pricing?

    • saaaaaam 7 hours ago

      That video is bizarre. Such a heavy breather.

      • coldtea 5 hours ago

        What a weird and inconsequential thing to focus on...

        He's just fucking closely miced with compression + speaking fast and anxious/excited speaking to an audience

      • indigodaddy 5 hours ago

        Most of that is just nervousness

    • Barbing 5 hours ago

      Looking at this URL, is that a typo, or did YouTube flip the si tracking parameter?

        youtu.be/rmvDxxNubIg?is=adMmmKdVxraYO2yQ

    • bushbaba 6 hours ago

      Yes. I’ve used it for data analysis

    • twodave 6 hours ago

      I mean, try using copilot on any substantial back-end codebase and watch it eat 90+% just building a plan/checklist. Of course copilot is constrained to 120k I believe? So having 10x that will blow open some doors that have been closed for me in my work so far.

      That said, 120k is pleeenty if you’re just building front-end components and have your API spec on hand already.

  • a_e_k 8 hours ago

    I've been using the 1M window at work through our enterprise plan as I'm beginning to adopt AI in my development workflow (via Cline). It seems to have been holding up pretty well until about 700k+. Sometimes it would continue to do okay past that, sometimes it started getting a bit dumb around there.

    (Note that I'm using it in more of a hands-on pair-programming mode, and not in a fully-automated vibecoding mode.)

  • chatmasta 8 hours ago

    So a picture is worth 1,666 words?

  • hagen8 8 hours ago

    Well, the question is what is contributing to the usage. Because as the context grows, the number of input tokens increases. A model call with 800K tokens as input is 8 times more expensive than a model call with 100K tokens as input. Especially if we resume a conversation and caching does not hit, it would be very expensive with API pricing.
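A rough sketch of the arithmetic in the comment above, assuming flat per-token input pricing and no cache hits at all. The price of 5.0 USD per million input tokens is a made-up placeholder, not any provider's actual rate, and the function names are hypothetical.

```python
from itertools import accumulate

PRICE_PER_MILLION = 5.0  # hypothetical USD per million input tokens


def input_cost(input_tokens: int, price_per_million: float = PRICE_PER_MILLION) -> float:
    """Cost of a single call under flat per-token input pricing."""
    return input_tokens / 1_000_000 * price_per_million


def uncached_input_tokens(new_tokens_per_turn: list[int]) -> int:
    """Total input tokens billed when every turn resends the whole history uncached."""
    # accumulate() yields the context size at each turn; each turn bills that size.
    return sum(accumulate(new_tokens_per_turn))


if __name__ == "__main__":
    # One call at 800K vs one call at 100K input tokens.
    print(input_cost(800_000) / input_cost(100_000))  # 8.0
    # Eight turns, each adding 100K tokens, with the full history resent each time.
    print(uncached_input_tokens([100_000] * 8))       # 3600000
```

The second figure is why resuming a long conversation without cache hits adds up quickly: the billed input grows with the sum of all context sizes along the way, not just the final one.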

  • islewis 8 hours ago

    The quality with the 1M window has been very poor for me, specifically for coding tasks. It constantly forgets stuff that has happened in the existing conversation. n=1, ymmv

    • robwwilliams 6 hours ago

      Yes, especially with shifts in focus of a long conversation. But given the high error rates of Opus 4.6 the last few weeks it is possibly due to other factors. Conversational and code prodding has been essential.

syntaxing 5 hours ago

It’s interesting because my career went from doing a higher-level language (Python) to lower-level languages (C++ and C). Opus and the like are amazing at Python, honestly sometimes better than me, but they do make some really stupid architectural decisions occasionally. But when it comes to embedded stuff, it’s still like a junior engineer. Unsure if that will ever change but I wonder if it’s just the quality and availability of training data. This is why I find it hard to believe LLMs will replace hardware engineers anytime soon (I was a MechE for a decade).

  • n_u 2 hours ago

    I've found it's ok at Rust. I think a lot of existing Rust code is high quality and also the stricter Rust compiler enforces that the output of the LLM is somewhat reasonable.

    • lemagedurage 28 minutes ago

      Yes, it's nice to have a strict compiler, so the agent has to keep fixing its bugs until it actually compiles. Rust and TypeScript are great for this.

  • ex-aws-dude 4 hours ago

    I've had a similar experience as a graphics programmer that works in C++ every day

    Writing quick python scripts works a lot better than niche domain specific code

    • nullpoint420 4 hours ago

      Unfortunately, I’ve found it’s really good at Wayland and OpenGL. It even knows how to use Clutter and Meta frameworks from the Gnome Mutter stack. Makes me wonder why I learned this all in the first place.

      • Trufa 3 hours ago

        To be able to determine it's really good.

jeff_antseed 2 hours ago

the coherence question is the one that matters here. 1M tokens is not the same as actually using 1M tokens well.

we've been testing long-context in prod across a few models and the degradation isn't linear — there's something like a cliff somewhere around 600-700k where instruction following starts getting flaky and the model starts ignoring things it clearly "saw" earlier. its not about retrieval exactly, more like... it stops weighting distant context appropriately.

gemini's problems with loops and tool forgetting that someone mentioned are real. we see that too. whether claude actually handles the tail end of 1M coherently is the real question here, and "standard pricing with no long-context premium" doesn't answer it.

honestly the fact that they're shipping at standard pricing is more interesting to me than the window size itself. that suggests they've got the KV cache economics figured out, which is harder than it sounds.

  • gskm an hour ago

    Spot on. That cliff might be less about the model failing at distance and more about noise accumulating faster than signal. In prod, most of what fills the window is file reads, grep output, and tool overhead, i.e., low-value tokens. By 700k you're not really testing long-context reasoning, you're testing the model's ability to find signal in a haystack it built itself.

convenwis 13 hours ago

Is there a writeup anywhere on what this means for effective context? I think that many of us have found that even when the context window was 100k tokens the actual usable window was smaller than that. As you got closer to 100k performance degraded substantially. I'm assuming that is still true but what does the curve look like?

  • esperent 6 hours ago

    > As you got closer to 100k performance degraded substantially

    In practice, I haven't found this to be the case at all with Claude Code using Opus 4.6. So maybe it's another one of those things that used to be true, and now we all expect it to be true.

    And of course when we expect something, we'll find it, so any mistakes at 150k context use get attributed to the context, while the same mistake at 50k gets attributed to the model.

    • peacebeard 33 minutes ago

      My personal experience is that Opus 4.6 degrades after a while but the degradation is more subtle and less catastrophic than in the past. I still aggressively clear sessions to keep it sharp though.

  • dcre 4 hours ago

    Personally, even though performance up to 200k has improved a lot with 4.5 and 4.6, I still try to avoid getting up there. Like I said in another comment, when I see context getting up to even 100k, I start making sure I have enough written to disk to type /new, pipe it the diff so far, and just say “keep going.” I feel like the dropoff starts around maybe 150k, but I could be completely wrong. I thought it was funny that the graph in the post starts at 256k, which conveniently avoids showing the dropoff I'm talking about (if it's real).

  • tyleo 8 hours ago

    I mentioned this at work but context still rots at the same rate. 90k tokens consumed gives just as bad results in a 100k context window as in a 1M one.

    Personally, I’m on a 6M+ line codebase and had no problems with the old window. I’m not sending it blindly into the codebase though like I do for small projects. Good prompts are necessary at scale.

  • minimaxir 13 hours ago

    The benchmark charts provided are the writeup. Everything else is just anecdata.

  • FartyMcFarter 8 hours ago

    Isn't transformer attention quadratic in complexity in terms of context size? In order to achieve 1M token context I think these models have to be employing a lot of shortcuts.

    I'm not an expert but maybe this explains context rot.
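To put a number on the "quadratic" concern, here is a back-of-the-envelope sketch. It assumes plain full self-attention, where every token attends to every other token; whether any production model actually computes attention this way at 1M tokens is not something the thread establishes. The function names are hypothetical.

```python
def attention_pairs(context_tokens: int) -> int:
    """Number of query-key pairs in full self-attention over n tokens: n * n."""
    return context_tokens * context_tokens


def relative_cost(new_context: int, old_context: int) -> float:
    """How many times more query-key pairs the larger context needs."""
    return attention_pairs(new_context) / attention_pairs(old_context)


if __name__ == "__main__":
    # Going from a 200k window to a 1M window is 5x the tokens...
    print(1_000_000 / 200_000)                # 5.0
    # ...but 25x the query-key pairs under the full-attention assumption.
    print(relative_cost(1_000_000, 200_000))  # 25.0
```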

    • vlovich123 7 hours ago

      Nope, there’s no tricks unless there’s been major architectural shifts I missed. The rot doesn’t come from inference tricks to try to bring down the quadratic complexity of attention. Task performance problems are generally a training problem: the longer the sequences, the fewer examples you have to train on. So how do you train the model to behave well? That’s where the tricks are. I believe most of it relies on synthetically generated data if I’m not mistaken, which explains the rot.

Frannky 5 hours ago

Opus 4.6 is nuts. Everything I throw at it works. Frontend, backend, algorithms—it does not matter.

I start with a PRD, ask for a step-by-step plan, and just execute one step at a time. Sometimes ideas are dumb, but checking and guiding step by step helps it ship working things in hours.

It was also the first AI where I felt, "Damn, this thing is smarter than me."

The other crazy thing is that with today's tech, these things can be made to work at 1k tokens/sec with multiple agents working at the same time, each at that speed.

  • koreth1 5 hours ago

    I wish I had this kind of experience. I threw a tedious but straightforward task at Claude Code using Opus 4.6 late last week: find the places in a React code base where we were using useState and useEffect to calculate a value that was purely dependent on the inputs to useEffect, and replace them with useMemo. I told it to be careful to only replace cases where the change did not introduce any behavior changes, and I put it in plan mode first.

    It gave me an impressive plan of attack, including a reasonable way to determine which code it could safely modify. I told it to start with just a few files and let me review; its changes looked good. So I told it to proceed with the rest of the code.

    It made hundreds of changes, as expected (big code base). And most of them were correct! Except the places where it decided to do things like put its "const x = useMemo(...)" call after some piece of code that used the value of "x", meaning I now had a bunch of undefined variable references. There were some other missteps too.

    I tried to convince it to fix the places where it had messed up, but it quickly started wanting to make larger structural changes (extracting code into helper functions, etc.) rather than just moving the offending code a few lines higher in the source file. Eventually I gave up trying to steer it and, with the help of another dev on my team, fixed up all the broken code by hand.

    It probably still saved time compared to making all the changes myself. But it was way more frustrating.

    • dcre 5 hours ago

      One tip I have is that once you have the diff you want to fix, start a new session and have it work on the diff fresh. They’ve improved this, but it’s still the case that the farther you get into context window, the dumber and less focused the model gets. I learned this from the Claude Code team themselves, who have long advised starting over rather than trying to steer a conversation that has started down a wrong path.

      I have heard from people who regularly push a session through multiple compactions. I don’t think this is a good idea. I virtually never do this — when I see context getting up to even 100k, I start making sure I have enough written to disk to type /new, pipe it the diff so far, and just say “keep going.” I learned recently that even essentials like the CLAUDE.md part of the prompt get diluted through compactions. You can write a hook to re-insert it but it's not done by default.

      This fresh context thing is a big reason subagents might work where a single agent fails. It’s not just about parallelism: each subagent starts with a fresh context, and the parent agent only sees the result of whatever the subagent does — its own context also remains clean.

      • kjohanson 4 hours ago

        Yeah, I start most of my sessions now with “read the diff between this branch and main”. Seems like it grounds and focuses it.

        • eru 4 hours ago

          Slight tangent: you want to read the diff between your branch and the merge-base with origin/main. Otherwise you get lots of spurious spam in your diff, if main moved since you branched off.
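
          A minimal sketch of that, using Python only to assemble the git invocation. `git diff A...B` (three dots) compares B against the merge-base of A and B; the function names and the `origin/main` default are illustrative choices, not anything from a particular tool:

```python
import subprocess


def merge_base_diff_cmd(base: str = "origin/main", head: str = "HEAD") -> list[str]:
    """Build a git command that diffs `head` against its merge-base with `base`."""
    # Three-dot syntax is shorthand for: git diff $(git merge-base base head) head
    return ["git", "diff", f"{base}...{head}"]


def read_branch_diff(base: str = "origin/main") -> str:
    """Run the diff in the current repository and return it as text."""
    result = subprocess.run(
        merge_base_diff_cmd(base), capture_output=True, text=True, check=True
    )
    return result.stdout
```

          The output of `read_branch_diff()` only contains changes made on the branch, even if main has moved on since.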

        • nextaccountic 2 hours ago

          One thing that seems important is to have the agent write down their plan and any useful memory in markdown files, so that further invocations can just read from it

      • sidrag22 4 hours ago

        subagents are huge, could execute on a massive plan that should easily fill up a 200k context window and be done at around 60k for the orchestration agent.

        as a cheapass, being able to pass off the simple work to cheaper $ per token agents is also just great. I've got a handful of tasks I can happily delegate work to a haiku agent and anything requiring a bit of reasoning goes to sonnet.

        Feel like opus is almost a cheatcode when i do get stuck, i just bust out a full opus workflow instead and it just destroys everything i was struggling with usually. like playing on easy mode.

        as cool as this stuff is, kinda still wish i was just grandfathered into the plan with no weekly limit and only the 5 hour window limits, id just be happily hammering opus blissfully.

      • Glyptodon 3 hours ago

        IMO it seems to start "forgetting" or "overlooking" claude.md well before the context window is full.

    • conception 5 hours ago

      Branch first so you can just undo. I think this would have worked with sub agents and /loop maybe? Write all items to change to a todo.md. Have it split up the work with haiku sub agents doing 5-10 changes at a time, marking the todos done, and /loop until all are done. You’ll succeed I suspect. If the main claude instance compacts its context - stop and start from where you left off.
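
      The todo.md batching idea could be sketched roughly like this. The `- [ ]` checklist format, the file names and the helper names are all assumptions for illustration, and handing each batch to a sub-agent is not shown:

```python
def pending_items(todo_markdown: str) -> list[str]:
    """Return the text of every unchecked '- [ ]' item in a markdown checklist."""
    items = []
    for line in todo_markdown.splitlines():
        stripped = line.strip()
        if stripped.startswith("- [ ] "):
            items.append(stripped[len("- [ ] "):])
    return items


def batches(items: list[str], size: int = 10) -> list[list[str]]:
    """Split items into consecutive batches of at most `size` entries."""
    return [items[i:i + size] for i in range(0, len(items), size)]


# One finished item plus 23 pending ones (hypothetical file names).
todo = "\n".join(
    ["- [x] src/Header.tsx"] + [f"- [ ] src/File{n}.tsx" for n in range(23)]
)
work = batches(pending_items(todo), size=10)
print([len(batch) for batch in work])
```

      Each batch stays small enough for a cheap sub-agent, and re-running `pending_items` after a restart picks up where the previous run left off.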

      • koreth1 5 hours ago

        It actually did automatically break the work up into chunks and launched a bunch of parallel workers to each handle a smaller amount of work. It wasn't doing everything in a single instance.

        The problem wasn't that it lost track of which changes it needed to make, so I don't think checking items off a todo list would have helped. I believe it did actually change all the places in the code it should have. It just made the wrong changes sometimes.

        But also, the claim I was responding to was, "I start with a PRD, ask for a step-by-step plan, and just execute on each step at a time." If I have to tell it how to organize its work and how to keep track of its progress and how to execute all the smaller chunks of work, then I may get good results, but the tool isn't as magical (for me, anyway) as it seems to be for some other people.

        • monkpit 4 hours ago

          The next line in the comment you’re responding to is

          > Sometimes ideas are dumb, but checking and guiding step by step helps it ship working things in hours.

          which matches my experience exactly. I consider it to be about as magical as the parent comment is claiming, but I wouldn’t call it totally automatic.

    • olalonde an hour ago

      Same here. I don't understand how people leave it running on an "autopilot" for long periods of time. I still use it interactively as an assistant, going back and forth and stepping in when it makes mistakes or questionable architectural decisions. Maybe that workflow makes more sense if you're not a developer and don't have a good way to judge code quality in the first place.

      There's probably a parallel with the CMSes and frameworks of the 2000s (e.g. WordPress or Ruby on Rails). They massively improved productivity, but as a junior developer you could get pretty stuck if something broke or you needed to implement an unconventional feature. I guess it must feel a bit similar for non-developers using tools like Claude Code today.

    • a13n 3 hours ago

      If you use eslint and tell it how to run lint in CLAUDE.md it will run lint itself and find and fix most issues like this.

      Definitely not ideal, but sure helps.

    • jdkoeck an hour ago

      Undefined variable references? Did you not instruct it to run typescript after changes?

    • dyauspitr 4 hours ago

      You’re using it wrong. As soon as it starts going off the rails once you’ve repeated yourself, you drop the whole session and start over.

      • saghm 3 hours ago

        One of the more subtle points that seems to be crucial is that it works a lot better when it can use the context as part of its own work rather than being polluted by unrelated details. Even better than restarting when it's off the rails is to avoid it as much as possible by proactively starting a new conversation as soon as anything in the history of the existing one stops being relevant. I've found it more effective to manually tell it most of what's currently in the context in a fresh session, skipping the irrelevant bits even if they're fairly small, than relying on it to figure out that they're no longer relevant (or giving it instructions indicating that, which feels like a crapshoot as to whether it's actually going to prune or just bloat things further with that instruction being added into the mix).

        To echo what the parent comment said, it's almost frustrating how effective it can be at certain tasks that I wouldn't ever have the patience for. At my job recently I needed to prototype calling some Python code via WASM using the Rust wasmtime engine, and after setting up the code structure to have the bytes for the WASM component, the arguments I wanted to pass to the function, and the WIT describing the interface for the function, it was able to fill in all of the boilerplate needed so that the function calls worked properly within a minute or two on the first try; reading through all the documentation and figuring out exactly which half dozen assorted things I had to import and hook up together in the correct order would have probably taken me an hour at minimum.

        I don't have any particular insight on whether or not these tools will become even more powerful over time, and I still have fairly strong concerns about how AI tools will affect society (both in terms of how they're used and the amount of energy used to produce them in the first place), but given how much the tech industry tends to prioritize productivity over social concerns, I have to assume that my future employment is going to be heavily impacted by my willingness to adopt and use these tools. I can't deny at this point that having it as an option would make me more productive than if I refuse to use it, regardless of my personal opinions on it.

  • eknkc 4 hours ago

    I find that Opus misses a lot of details in the code base when I want it to design a feature or something. It jumps to a basic solution which is actually good but might affect something elsewhere.

    GPT 5.4 on codex cli has been much more reliable for me lately. I used to have opus write and codex review, I now do the opposite (I actually have codex write and both review in parallel).

    So on the latest models for my use case gpt > opus but these change all the time.

    Edit: also the harness is shit. Claude code has been slow, weird and a resource hog. Refuses to read now standardized .agents dirs so I need symlink gymnastics. Hides as much info as it can… Codex cli is working much better lately.

    • toraway 3 hours ago

      Codex CLI is so much more pleasant to use than CC. I cancelled my CC subscription after the OpenCode thing, but somewhat ironically have recently found myself naturally trying the native Codex CLI client first more often over OpenCode.

      Kinda funny how you don't actually need to use coercion if you put in the engineering work to build a product that's competitive on its own technical merits...

  • sarchertech 5 hours ago

    What kinds of things are you building? This is not my experience at all.

    Just today I asked Claude using opus 4.6 to build out a test harness for a new dynamic database diff tool. Everything seemed to be fine but it built a test suite for an existing diff tool. It set everything up in the new directory, but it was actually testing code and logic from a preexisting directory despite the plan being correct before I told it to execute.

    I started over and wrote out a few skeleton functions myself, then asked it to write tests for those to test for some new functionality. Then my plan was to ask it to add that functionality using the tests as guardrails.

    Well the tests didn’t actually call any of the functions under test. They just directly implemented the logic I asked for in the tests.
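
    To make that failure mode concrete, here is a small hypothetical sketch (the `added_tables` function and table names are invented for illustration, not taken from the tool described) contrasting a test that re-implements the logic inline with one that actually calls the function under test:

```python
def added_tables(old: set[str], new: set[str]) -> set[str]:
    """Function under test: tables present in `new` but not in `old`."""
    return new - old


def test_tautological() -> None:
    # Anti-pattern: re-implements the logic inline and never calls added_tables,
    # so it keeps passing even if added_tables is broken or deleted.
    old, new = {"users"}, {"users", "orders"}
    assert {t for t in new if t not in old} == {"orders"}


def test_calls_function_under_test() -> None:
    # Useful version: exercises the real function and checks its observable output.
    assert added_tables({"users"}, {"users", "orders"}) == {"orders"}
    assert added_tables({"users"}, {"users"}) == set()


test_tautological()
test_calls_function_under_test()
```

    Only the second test can act as a guardrail, because only it fails when the implementation changes.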

    After $50 and 2 hours I finally got something working only to realize that instead of creating a new pg database to test against, it found a dev database I had lying around and started adding tables to it.

    When I managed to fix that, it decided that it needed to rebuild multiple docker components before each test and tear them down after each one.

    After about 4 hours and $75, I managed to get something working that was probably more code than I would have written in 4 hours, but I think it was probably worse than what I would have come up with on my own. And I really have no idea if it works because the day was over and I didn’t have the energy left to review it all.

    We’ve recently been tasked at work with spending more money on Claude (not with being more productive; the metric is literally spending more money) and everyone is struggling to do anything like what the posts on HN say they are doing. So far no one in my org in a very large tech company has managed to do anything very impressive with Claude other than bringing down prod 2 days ago.

    Yes I’m using planning mode and clearing context and being specific with requirements and starting new sessions, and every other piece of advice I’ve read.

    I’ve had much more luck using opus 4.6 in vs studio to make more targeted changes, explain things, debug etc… Claude seems too hard to wrangle and it isn’t good enough for you to be operating that far removed from the code.

    • extr 4 hours ago

      You probably just don't have the hang of it yet. It's very good but it's not a mind reader and if you have something specific you want, it's best to just articulate that exactly as best you can ("I want a test harness for <specific_tool>, which you can find <here>"). You need to explain that you want tests that assert on observable outcomes and state, not internal structure, use real objects not mocks, property based testing for invariants, etc. It's a feedback loop between yourself and the agent that you must develop a bit before you start seeing "magic" results. A typical session for me looks like:

      - I ask for something highly general and claude explores a bit and responds.

      - We go back and forth a bit on precisely what I'm asking for. Maybe I correct it a few times and maybe it has a few ideas I didn't know about/think of.

      - It writes some kind of plan to a markdown file. In a fresh session I tell a new instance to execute the plan.

      - After it's done, I skim the broad strokes of the code and point out any code/architectural smells.

      - I ask it to review its own work and then critique that review, etc. We write tests.

      Perhaps that sounds like a lot but typically this process takes around 30-45 minutes of intermittent focus and the result will be several thousand lines of pretty good, working code.

      • visarga 3 hours ago

        Yes pretty much my workflow. I also keep all my task.md files around as part of the repo, and they get filled up with work details as the agent closes the gates. At the end of each one I update the project memory file, this ensures I can always resume any task in a few tokens (memory file + task file == full info to work on it).

      • __mharrison__ 3 hours ago

        Pretty good workflow. But you need to change the order of the tests and have it write the tests first. (TDD)

      • sarchertech 3 hours ago

        I mean I’ve been using AI close to 4 years now and I’ve been using agents off and on for over a year now. What you’re describing is exactly what I’m doing.

        I’m not seeing anyone at work either out of hundreds of devs who is regularly cranking out several thousand lines of pretty good working code in 30-45 minutes.

        What’s an example of something you built today like this?

    • dcre 5 hours ago

      Curious what language and stack. And have people at your company had marginally more success with greenfield projects like prototypes? I guess that’s what you’re describing, though it sounds like it’s a directory in a monorepo maybe?

      • sarchertech 4 hours ago

        This was in Go, but my org also uses Typescript, and Elixir.

        I’ve had plenty of success with greenfield projects myself but using the copilot agent and opus 4.5 and 4.6. I completely vibecoded a small game for my 4 year old in 2 hours. It’s probably 20% of the way to being production ready if I wanted to release it, but it works and he loves it.

        And yes people have had success with very simple prototypes and demos at work.

    • jhatemyjob 4 hours ago

      Similar experience. I use these AI tools on a daily basis. I have tons of examples like yours. In one recent instance I explicitly told it in the prompt to not use memcpy, and it used memcpy anyway, and generated a 30-line diff after thinking for 20 minutes. In that amount of time I created a 10-line diff that didn't use memcpy.

      I think it's the big investors' extremely powerful incentives manifesting in the form of internet comments. The pace of improvement peaked at GPT-4. There is value in autocomplete-as-a-service, and the "harnesses" like Codex take it a lot farther. But the people who are blown away by these new releases either don't spend a lot of time writing code, or are being paid to be blown away. This is not a hockey stick curve. It's a log curve.

      Bigger context windows are a welcome addition. And stuff like JSON inputs is nice too. But these things aren't gonna like, take your SWE job, if you're any good. It's just like, a nice substitute for the Google -> Stack Overflow -> Copy/Paste workflow.

  • ai_fry_ur_brain 3 hours ago

    Im convinced everyone saying this is building the simplest web apps, and doing magic tricks on themselves.

    • raldi 3 hours ago

      What evidence would convince you otherwise?

  • fbrncci 3 hours ago

    I am starting to believe it’s not OPUS but developers getting better at using LLMs across the board. And not realizing they are just getting much better at using these tools.

    I also thought it was OPUS 4.5 (also tested a lot with 4.6) and then in February switched to only using auto mode in the coding IDEs. They do not use OPUS (most of the times), and I’m ending up with a similar result after a very rough learning curve.

    Now switching back to OPUS I notice that I get more out of it, but it’s no longer a huge difference. In a lot of cases OPUS is actually in the way after learning to prompt more effectively with cheaper models.

    The big difference now is that I’m just paying 60-90$ month for 40-50hrs of weekly usage… while I was inching towards 1000$ with OPUS. I chose these auto modes because they don’t dig into usage based pricing or throttling which is a pretty sweet deal.

    • copperx 2 hours ago

      I had similar thoughts regarding "we are simply getting better at using them", but then, man, I tried Gemini again and reconsidered.

  • schainks 2 hours ago

    > It was also the first AI I felt, "Damn, this thing is smarter than me."

    1000% agree. It's also easy to talk to it about something you're not sure it said and derive a better, more elegant solution with simple questioning.

    Gemini 3.1 also gives me these vibes.

  • Aperocky 5 hours ago

    I had been able to get it into the classic AI loop once.

    It was about a problem with calculation around filling a topographical water basin with sedimentation where calculation is discrete (e.g. turn based) and that edge case where both water and sediments would overflow the basin; To make the matter simple, fact was A, B, C, and it oscillated between explanation 1 which refuted C, explanation 2 which refuted A and explanation 3 that refuted B.

    I'll give it to opus training stability that my 3 tries using it all consistently got into this loop, so I decided to directly order it to do a brute force solution that avoided (but didn't solve) this problem.

    I did feel like with a human, there's no way that those 3 loop would happen by the second time. Or at least the majority of us. But there is just no way to get through to opus 4.6

  • hrishikesh-s 5 hours ago

    Opus-4.6 is so far ahead of the rest that I think Anthropic is the winner in winner-take-all

    • steve-atx-7600 4 hours ago

      Codex doesn't seem that far behind. I use the top model available for api key use and it's gotten faster this month even on the max effort level (not like a cheetah - more like not so damn painful anymore). Plus, it also forks agents in parallel - for speed & to avoid polluting the main context. I.e. it will fork explorer agents while investigating (kind of amusing because they're named after famous scientists).

  • gregharned 3 hours ago

    The multi-agent angle is interesting from a cost perspective. At Opus 4.6 pricing ($15/MTok input, $75/MTok output), running several concurrent agents on 1M context sessions gets expensive fast — but the math still works if you're replacing hours of senior engineer time.

    The shift I've noticed: 1M context makes "load the whole codebase once, run many agents" viable, whereas before you were constantly re-chunking and losing context. The per-task cost goes up but the time-to-correct-output drops significantly.

    The harder problem for most teams is routing — knowing which tasks actually need Opus at 1M vs. Sonnet at 200k. Opus 4.6 at 1M is overkill for 80% of coding tasks. The ROI only works if you're being intentional about when to use it.
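
    A quick sketch of the arithmetic behind "gets expensive fast", using the per-million-token prices exactly as quoted in this comment (not verified here) and ignoring caching or any other discounts. The token counts and function names are illustrative assumptions:

```python
# Prices as quoted in the comment above (USD per million tokens); not verified here.
INPUT_USD_PER_MTOK = 15.0
OUTPUT_USD_PER_MTOK = 75.0


def request_cost_usd(input_tokens: int, output_tokens: int) -> float:
    """Cost of one request at the quoted per-million-token prices."""
    return (
        input_tokens / 1_000_000 * INPUT_USD_PER_MTOK
        + output_tokens / 1_000_000 * OUTPUT_USD_PER_MTOK
    )


def fleet_cost_usd(agents: int, input_tokens: int, output_tokens: int) -> float:
    """Cost when several agents each send one request of the same size."""
    return agents * request_cost_usd(input_tokens, output_tokens)


# One agent filling the 1M window and writing 20k tokens back, then five such agents.
print(round(request_cost_usd(1_000_000, 20_000), 2))
print(round(fleet_cost_usd(5, 1_000_000, 20_000), 2))
```

    Because every turn re-sends the accumulated context, the input side dominates in this sketch, which is where routing smaller tasks to a shorter window pays off.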

  • dzink 5 hours ago

    Opus 4.6 is AGI in my book. They won’t admit it, but it’s absolutely true. It shows initiative in not only getting things right but also adding improvements that the original prompt didn't request that match the goals of the job.

    • winrid 5 hours ago

      On the adding improvements and being helpful thing, isn't that part of the system prompt?

      • dcre 4 hours ago

        You could put whatever you wanted in the GPT-4 system prompt and it wasn't doing shit.

        • winrid 3 hours ago

          True. I retract my sentiment :D

    • dyauspitr 4 hours ago

      I don’t know if Opus is AGI but on a broader note, that’s how we will get AGI. Not some consciousness like people are expecting. It’s just going to be a chatbot that’s very hard to stump and starts making actual scientific breakthroughs and solving long standing problems.

      • unshavedyak 4 hours ago

        I'll be more likely to agree with anything being AGI if it doesn't have such obvious and common brittleness. These LLMs all go off the rails when the context window gets large. Their context is also easy to "poison", and so it's better to rollback conversations that went bad rather than trying to steer them back to the light.

        There's probably more examples, but to me AGI must move beyond the above issues. Though frankly context window might just be a symptom of poor harness than anything, still - it illustrates my general issue with them being considered AGI as it stands today.

        Claude 4.6 is getting crazy good though, i'll give you that.

        • copperx an hour ago

          How are you rolling back a conversation? I didn't know tools exposed that functionality.

          • NiloCK 18 minutes ago

            For both claude-code and gemini-cli, hit escape twice, or use /rewind.

  • eru 5 hours ago

    > [...] with multiple agents working at the same time, each at that speed.

    Horizontal parallelising of tasks doesn't really require any modern tech.

    But I agree that Opus 4.6 with 1M context window is really good at lots of routine programming tasks.

    • travisgriggs 5 hours ago

      Opus helped me brick my RPi CM4 today. It glibly apologized for telling me to use an e instead of a 6 in a boot loader sequence.

      Spent an hour or so unraveling the mess. My feelings are growing more and more conflicted about these tools. They are here to stay obviously.

      I’m honestly uncertain about the junior engineers I’m working with who are more productive than they might be otherwise, but are gaining zero (or very little) experience. It’s like the future is a world where the entire programming sphere is dominated by the clueless non technical management that we’ve all had to deal with in small proportion a time or two.

      • eru 4 hours ago

        > I’m honestly uncertain about the junior engineers I’m working with who are more productive than they might be otherwise, but are gaining zero (or very little) experience.

        Well, (economic) progress means being able to do more with less. A Fordian-style conveyor belt factory can churn out cars with relatively unskilled labour.

        Economising on human capital is economising on a scarce input.

        We had these kinds of shifts before. Compare also how planes used to have a pilot, copilot and flight engineer. We don't have that anymore, but it used to be a place for people to learn. But pilot education has adapted.

        Or check how spreadsheet software has removed a lot of the worst rote work in finance. That change happened perhaps in the 1980s. Finance has adapted.

        > Opus helped me brick my RPi CM4 today. It glibly apologized for telling me to use an e instead of a 6 in a boot loader sequence.

        Yes, these things do best when they have a (simulated) environment they can make mistakes in and that can give them clear and fast feedback.

        • hvb2 4 hours ago

          > Yes, these things do best when they have a (simulated) environment they can make mistakes in and that can give them clear and fast feedback.

          This always felt like a reason to throw it at coding. With its rigid syntax you'll know quickly and cheaply if what was written passes an absolute minimal level of quality.

          • eru an hour ago

            Well, rigid syntax, type checkers, automated tests, etc. They all help.

  • olalonde 3 hours ago

    > PRD

    Is it Baader-Meinhof or is everyone on HN suddenly using obscure acronyms?

    • shujito 3 hours ago

      It stands for Product Requirements Document, it is something commonly used in project planning and management.

  • fooker 3 hours ago

    I have a PhD in a niche field and this can do my job ;)

    Not sure if this means I should get a more interesting job or if we are all going to be at the mercy of UBI eventually.

    • suzzer99 3 hours ago

      We're never getting UBI. See the latest interview with the Palantir CEO where he talks about white collar workers having to take more hands-on jobs that they may not feel as satisfied with. IE - tending their manors and compounds.

      RIP widespread middle class. It was a good 80-year run.

    • _heimdall 3 hours ago

      An economy, and likely a society, fails if everyone is at the mercy of a UBI.

      • copperx 2 hours ago

        But what's the alternative? Can any economy succeed with a >50% unemployment rate?

  • raincole 4 hours ago

    It's so far the best model that answers my questions about Wolfram language.

    That being said it's the only use case for me. I won't subscribe to something that I can't use with third party harness.

    • copperx 2 hours ago

      I use a Claude sub with oh-my-pi, but I do so with lots of anxiety, knowing that I will be banned at any moment.

  • scroogey 4 hours ago

    Just yesterday I asked it to repeat a very simple task 10 times. It ended up doing it 15 times. It wasn't a problem per se, just a bit jarring that it was unable to follow such simple instructions (it even repeated my desire for 10 repetitions at the start!).

  • interpol_p 5 hours ago

    I had Opus 4.6 running on a backend bug for hours. It got nowhere. Turned out the problem was in AWS X-ray swizzling the fetch method and not handling the same argument types as the original, which led to cryptic errors.

    I had Opus 4.6 tell me I was "seeing things wrong" when I tried to have it correct some graphical issues. It got stuck in a loop of re-introducing the same bug every hour or so in an attempt to fix the issue.

    I'm not disagreeing with your experience, but in my experience it is largely the same as what I had with Opus 4.5 / Codex / etc.

    • toraway 4 hours ago

      Haha, reminds me of an unbelievably aggravating exchange with Codex (GPT 5.4 / High) where it was unflinchingly gaslighting me about undesired behavior still occurring after a change it made that it was adamant simply could not be happening.

      It started by insisting I was repeatedly making a typo and still would not budge even after I started copy/pasting the full terminal history of what I was entering and the unabridged output, and eventually pivoted to darkly insinuating I was tampering with my shell environment as if I was trying to mislead it or something.

      Ultimately it turned out that it forgot it was supposed to be applying the fixes to the actual server instead of the local dev environment, and had earlier in the conversation switched from editing directly over SSH to pushing/pulling the local repo to the remote due to diffs getting mangled.

  • vessenes 5 hours ago

    I’ll put out a suggestion you pair with codex or deepthink for audit and review - opus is still prone to … enthusiastic architectural decisions. I promise you will be at least thankful and at most like ‘wtf?’ at some audit outputs.

    Also shout out to beads - I highly recommend you pair it with beads from yegge: opus can lay out a large project with beads, and keep track of what to do next and churn through the list beautifully with a little help.

    • petesergeant 5 hours ago

      I've been pairing it with Codex using https://github.com/pjlsergeant/moarcode

      The amount of genuine fuck-ups Codex finds makes me skeptical of people who are placing a lot of trust in Claude alone.

      • vessenes 4 hours ago

        Nice. Yeah I have them connect through beads, which combined with a git log is a lot of information - it feels smoother to me than this looks. But I agree with the sentiment. Codex isn't my favorite for understanding and implementing. But I appreciate the intelligence and pickiness very much.

  • phendrenad2 5 hours ago

    The replies to this really make me think that some people are getting left behind in the AI age. Colleges are likely already teaching how to prompt, but a lot of existing software devs just don't get it. I encourage people who aren't having success with AI to watch some youtube videos on best practices.

minimaxir 13 hours ago

Claude Code 2.1.75 now no longer delineates between base Opus and 1M Opus: it's the same model. Oddly, I have Pro, where the change is supposedly only for Max+, but am still seeing this to be the case.

EDIT: Don't think Pro has access to it, a typical prompt just hit the context limit.

The removal of extra pricing beyond 200k tokens may be Anthropic's salvo in the agent wars against GPT 5.4's 1M window and extra pricing for that.

  • zaptrem 7 hours ago

    I have Max 20x and they're still separate on 2.1.75.

  • auggierose 8 hours ago

    No change for Pro, just checked it, the 1M context is still extra usage.

wewewedxfgdf 8 hours ago

The weirdest thing about Claude pricing is their 5X pricing plan is 5 times the cost of the previous plan.

Normally buying the bigger plan gives some sort of discount.

At Claude, it's just "5 times more usage 5 times more cost, there you go".

  • apetresc 8 hours ago

    Those sorts of volume discounts are what you do when you're trying to incentivize more consumption. Anthropic already has more demand than they're logistically able to serve, at the moment (look at their uptime chart, it's barely even 1 9 of reliability). For them, 1 user consuming 5 units of compute is less attractive than 5 users consuming 1 unit.

    They would probably implement _diminishing_-value pricing if pure pricing efficiency was their only concern.
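
    For readers unfamiliar with the "nines" shorthand used above, a small sketch of the conversion. The availability values are generic examples, not figures from any real status page:

```python
import math


def nines(availability: float) -> float:
    """Number of 'nines' for an availability fraction, e.g. 0.999 -> about 3."""
    return -math.log10(1.0 - availability)


def downtime_hours_per_year(availability: float) -> float:
    """Expected downtime over a 365-day year at the given availability."""
    return (1.0 - availability) * 365 * 24


for availability in (0.9, 0.99, 0.999):
    print(
        availability,
        round(nines(availability), 2),
        round(downtime_hours_per_year(availability), 1),
    )
```

    "One nine" is 90% availability, which works out to roughly 36 days of downtime per year.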

  • auggierose 8 hours ago

    It is not the plan they want you to buy. It is a pricing strategy to get you to buy the 20x plan.

    • radley 8 hours ago

      5x Max is the plan I use because the Pro plan limits out so quickly. I don't use Claude full-time, but I do need Claude Code, and I do prefer to use Opus for everything because it's focused and less chatty.

      • auggierose 8 hours ago

        Sure, I get it. For me a 2x Max would be ideal and usually enough. Now, guess why they are not offering that?

        • prtmnth 6 hours ago

          Same here. I'd love a 2x Max plan! More than enough usage for my needs.

  • operatingthetan 8 hours ago

    I think they are both subsidized so either is a great deal.

  • merrvk 3 hours ago

    5 times the already subsidised rate is still a discount.

  • tclancy 5 hours ago

    We’ll make it up on volume.

vessenes 9 hours ago

This is super exciting. I've been poking at it today, and it definitely changes my workflow -- I feel like a full three or four hour parallel coding session with subagents is now generally fitting into a single master session.

The stats claim Opus at 1M is about like 5.4 at 256k -- these needle long context tests don't always go with quality reasoning ability sadly -- but this is still a significant improvement, and I haven't seen dramatic falloff in my tests, unlike q4 '25 models.

p.s. what's up with sonnet 4.5 getting comparatively better as context got longer?

  • steve-atx-7600 7 hours ago

    Did it get better? I used sonnet 4.5 1m frequently and my impression was that it was around the same performance but a hell of a lot faster since the 1m model was willing to spend more tokens at each step vs preferring more token-cautious tool calls.

    • vessenes 6 hours ago

      Opus 4.6 is wayy better than sonnet 4.5 for sure.

  • mattfrommars 7 hours ago

    Random: are you personally paying for Claude Code or is it paid by your employer?

    My employer only pays for GitHub copilot extension

    • kiratp 6 hours ago

      GitHub Copilot CLI lets you use all these models (unless your employer disables them).

      https://github.com/features/copilot/cli

      Disclosure: work at Msft

      • tclancy 5 hours ago

        Disclosure: have to use them via copilot at work. Be glad I don’t write code for nuclear plants. Why does it have to be so hard. Doubly so in JetBrains ides but I’ve a feeling that’s on both of you rather than just you personally. But I still resent you now.

    • celestialcheese 7 hours ago

      Both. Employer pays for work max 20x, i pay for a personal 10x for my side projects and personal stuff.

shanjai_raj7 9 minutes ago

are the costs the same as the 200k context opus 4.6?

compaction has been really good in claude we don't even recognize the switch

bob1029 3 hours ago

I've been avoiding context beyond 100k tokens in general. The performance is simply terrible. There's no training data for a megabyte of your very particular context.

If you are really interested in deep NIAH tasks, external symbolic recursion and self-similar prompts+tools are a much bigger unlock than more context window. Recursion and (most) tools tend to be fairly deterministic processes.

I generally prohibit tool calling in the first stack frame of complex agents in order to preserve context window for the overall task and human interaction. Most of the nasty token consumption happens in brief, nested conversations that pass summaries back up the call stack.
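A minimal sketch of that call-stack shape, assuming three hypothetical callables (`split`, `work`, `summarize`) that stand in for model and tool calls; none of them is a real API. The root frame never touches tools, and only summaries travel back up.

```python
from typing import Callable, List

Split = Callable[[str], List[str]]           # hypothetical: break a task into subtasks
Work = Callable[[str], str]                  # hypothetical: tool-heavy work, verbose output
Summarize = Callable[[str, List[str]], str]  # hypothetical: compress notes for the caller


def run_frame(task: str, depth: int, split: Split, work: Work,
              summarize: Summarize, max_depth: int = 3) -> str:
    """Run one nested conversation and hand only its summary to the caller."""
    subtasks = split(task) if depth < max_depth else []
    if depth == 0 and not subtasks:
        # The first stack frame is not allowed to call tools, so it
        # delegates even an unsplittable task to a child frame.
        subtasks = [task]
    if not subtasks:
        # Leaf frame: the verbose tool output lives and dies here.
        return summarize(task, [work(task)])
    child_summaries = [
        run_frame(sub, depth + 1, split, work, summarize, max_depth)
        for sub in subtasks
    ]
    return summarize(task, child_summaries)


# Deterministic stand-ins so the sketch runs without a model.
def demo_split(task: str) -> List[str]:
    return [f"{task}: part {i}" for i in (1, 2)] if ":" not in task else []


def demo_work(task: str) -> str:
    return f"tool log for {task}\n" + "noise\n" * 200


def demo_summarize(task: str, notes: List[str]) -> str:
    return f"[{task}] " + "; ".join(n.splitlines()[0][:40] for n in notes)


if __name__ == "__main__":
    print(run_frame("refactor", 0, demo_split, demo_work, demo_summarize))
```

The point of the shape is that the 200 lines of "noise" produced in each leaf never reach the root frame's context.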

aenis 2 hours ago

Sample of one and all that, but it's way, way more sloppy than it used to be for me.

To the extent, that I have started making manual fixes in the code - I haven't had to stoop to this in 2 months.

Max subscription, 100k LOC codebases more or less (frontend and backend - same observations).

aragonite 7 hours ago

Do long sessions also burn through token budgets much faster?

If the chat client is resending the whole conversation each turn, then once you're deep into a session every request already includes tens of thousands of tokens of prior context. So a message at 70k tokens into a conversation is much "heavier" than one at 2k (at least in terms of input tokens). Yes?

  • dathery 7 hours ago

    That's correct. Input caching helps, but even then at e.g. 800k tokens with all of them cached, the API price is $0.50 * 0.8 = $0.40 per request, which adds up really fast. A "request" can be e.g. a single tool call response, so you can easily end up making many $0.40 requests per minute.
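
    The arithmetic above as a small sketch. It assumes the quoted $0.50 per million cached input tokens and deliberately ignores output tokens and cache writes, so real bills would be higher; the function names are made up for illustration.

```python
# Assumed price: USD per million cached input tokens (the figure quoted above).
CACHED_INPUT_USD_PER_MTOK = 0.50


def cached_request_cost(context_tokens: int,
                        price_per_mtok: float = CACHED_INPUT_USD_PER_MTOK) -> float:
    """Input cost of one request whose entire context is read from cache."""
    return context_tokens / 1_000_000 * price_per_mtok


def session_cost(context_tokens: int, requests: int) -> float:
    """Cost of `requests` requests at a roughly constant context size."""
    return requests * cached_request_cost(context_tokens)


if __name__ == "__main__":
    # One request at 800k cached tokens.
    print(f"${cached_request_cost(800_000):.2f}")  # $0.40
    # Two dozen tool-call round trips at the same context size.
    print(f"${session_cost(800_000, 24):.2f}")     # $9.60
```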

    • acjohnson55 6 hours ago

      Interesting, so a prompt that causes a couple dozen tool calls will end up costing in the tens of dollars?

      • isbvhodnvemrwvn an hour ago

        Not necessarily, take a look at e.g. the OpenAI Responses resource: you can get multiple tool calls in one response and of course reply with multiple results.

  • jasondclinton 7 hours ago

    If you use context caching, it saves quite a lot on the costs/budgets. You can cache 900k tokens if you want.

pixelpoet 8 hours ago

Compared to yesterday my Claude Max subscription burns usage like absolutely crazy (13% of weekly usage from fresh reset today with just a handful prompts on two new C++ projects, no deps) and has become unbearably slow (as in 1hr for a prompt response). GGWP Anthropic, it was great while it lasted but this isn't worth the hundreds of dollars.

  • Spooky23 8 hours ago

    Yeah, morning eastern time Claude is brutal.

causalzap 6 hours ago

I've been using Opus 4.5 for programmatic SEO and localizing game descriptions. If 4.6 truly improves context compaction, it could significantly lower the API costs for large-scale content generation. Has anyone tested its logic consistency on JSON output compared to 4.5?

  • arizen 3 hours ago

    Out of curiosity, what specific use cases on programmatic SEO are you currently doing with Opus?

margorczynski 8 hours ago

What about response coherence with longer context? Usually in other models with such big windows I see the quality drop rapidly once it gets past a certain point.

fittingopposite 3 hours ago

I don't get the announcement. Is this included in the standard 5 or 20x Max plans?

LarsDu88 3 hours ago

The stuff I built with Opus 4.6 in the past 2.5 weeks:

Full clone of Panel de Pon/Tetris attack with full P2P rollback online multiplayer: https://panel-panic.com

An emulator of the MOS 6502 CPU with visual display of the voltage going into the DIP package of the physical CPU: https://larsdu.github.io/Dippy6502/

I'm impressed as fuck, but a part of me deep down knows that I know fuck all about the 6502 or its assembly language and architecture, and now I'll probably never be motivated to do this project in a way that I would've learned all the things I wanted to learn.

chaboud 7 hours ago

Awesome.... With Sonnet 4.5, I had Cline soft trigger compaction at 400k (it wandered off into the weeds at 500k). But the stability of the 4.6 models is notable. I still think it pays to structure systems to be comprehensible in smaller contexts (smaller files, concise plans), but this is great.

(And, yeah, I'm all Claude Code these days...)

alienchow 3 hours ago

If this is a skill issue, feel free to let me know. In general Claude Code is decent for tooling. Onduty fullstack tooling features that used to sit ignored in the on-caller ticket queue for months can now be easily built in 20 minutes with unit tests and integration tests. The code quality isn't always the best (although what's good code for humans may not be good code for agents) but that's another specific and directed prompt away to refactor.

However, I can't seem to get Opus 4.6 to wire up proper infrastructure. This is especially so if OSS forks are used. It trips up on arguments from the fork source, invents args that don't exist in either, and has a habit of tearing down entire clusters just to fix a Helm chart for "testing purposes". I've tried modifying the CLAUDE.md and SPEC.md with specific instructions on how to do things but it just goes off on a tangent and starts to negotiate on the specs. "I know you asked for help with figuring out the CNI configurations across 2 clusters but it's too complex. Can we just do single cluster?" The entire repository gets littered with random MD files everywhere for directory specific memories, context, action plans, deprecated action plans, pre-compaction memories etc. I don't quite know which to prune either. It has taken most of the fun out of software engineering and I'm now just an Obsidian janitor for what I can best describe as a "clueless junior engineer that never learns". When the auto compaction kicks in it's like an episode of 50 first dates.

Right now I assume this is where the limitation is, because the literature for real-world infrastructure requiring large contexts and integration is very limited. If anyone has any idea whether Claude Opus is suitable for such tasks, do give some suggestions.

vips7L 5 hours ago

Friends, just write the code. It’s not that hard.

  • AussieWog93 3 hours ago

    I hear what you're saying, but for a lot of people coding isn't something we can throw 40+ hours per week at.

    My main job is running a small eComm business, and I have to both develop software automations for the office (to improve productivity long-term) while also doing non-coding day to day tasks. On top of this, I maintain an open source project after hours. I've also got a young family with 3 kids.

    I'm not saying Claude is the damn singularity or anything, but stuff is getting done now that simply wasn't being addressed before.

    • fixxation92 2 hours ago

      100% agree with this, as much as I hate the term "game-changer"... it truly is, I'm working on projects that I've always wanted to do but never had the capacity (or money to pay a small team of devs to build something)-- all these things that you thought you'd never have a chance to do, are suddenly now real and completely possible. I know there's a lot of AI haters out there but I'm pretty sure in time, all devs will embrace it and truly enjoy working with it

      • vips7L 2 hours ago

        If anyone thought there was value to those projects they would have paid for it before.

  • nkzd 29 minutes ago

    It's not that interesting.

  • righthand 2 hours ago

    You're witnessing the rise of the Developer Technician or Software Technician. They can get a machine to print out an application but you will still need an engineer to know how it works or to get it working. This used to be juniors learning to be senior devs/engineers. Now it is a split between technicians and engineers. The market will be up shit creek when all their technicians can't vibe code their way out of not understanding the code.

  • andrewstuart 4 hours ago

    Only someone not using Claude could equate human coding.

    • vips7L 4 hours ago

      Only someone not using their brain could equate Claude to using their intelligence.

      • andrewstuart 4 hours ago

        Let’s just clear this up …….. are you commenting with experience using the latest Claude, or are you commenting from personal beliefs.

        It’s fine for you to take a stand, but please understand your position is simply factually wrong if you think you can outprogram Claude for a range of common tasks.

        Being anti AI is fine, but if you deny facts of how far LLM programming has come then you lack credibility.

        The most effective anti AI position is to acknowledge its power, not pretend that vast numbers of people are somehow hallucinating the power of LLM assisted programming.

        • vips7L 4 hours ago

          I absolutely can out program Claude. I can factually guarantee that. You’re factually wrong in your belief that you think a statistical model that scientifically takes the average of programming is better than those of us that actually know what we’re doing.

          Programming is not hard. You’re just lazy.

          • andrewstuart 4 hours ago

            Ok so you speak with certainty about the capabilities of something you don’t use and therefore have no experience of.

            Childish and naive.

            If you said you’ve been using Claude heavily and it’s never done better than you on your own, then your position would be credible.

            • vips7L 4 hours ago

              Sure pal. Keep outsourcing your job. I’ll be here when you need help and are unemployed.

              • thatguymike 4 hours ago

                That’s… not how the labour market works

                • vips7L 3 hours ago

                  Of course. That’s because the labor market prefers cost over quality. The labour market will always prefer cheap and fast code that works at first glance. That is how capitalism works. That has nothing to do with my capabilities. It has nothing to do with the fact that I will always outperform a shitty statistical model. It has everything to do with the fact that most of you are too lazy to think. It has everything to do with most of you sucking and being too lazy to do your job.

                  • hewasahaterboy 3 hours ago

                    I think you need to take a deep breath and calm down.

                    • vips7L 3 hours ago

                      Perfectly calm mate. Maybe you should try to factually argue against my position? Probably not though. Your account was created 30 minutes ago and likely a bot.

                      • minimaxir an hour ago

                        My account was created 14 years ago. You need to calm down.

                        There is a reason discussions about agent use have been on Hacker News every other day, and it's not a grand conspiracy. Even in this submission, people have talked about how they have used Claude Code and its longer context window successfully as a tool for programming, even if they may be technically skilled to do it themselves. However, if you assume that every commenter is acting in bad faith, then there's no point in continuing.

vicchenai 8 hours ago

The no-degradation-at-scale claim is the interesting part. Context rot has been the main thing limiting how useful long context actually is in practice — curious to see what independent evals show on retrieval consistency across the full 1M window.

  • apetresc 8 hours ago

    I don't think they're claiming "no degradation at scale", are they? They still report a 91.9->78.3 drop. That's just a better drop than everyone else (is the claim).

aarmenante 4 hours ago

Hot take... the 1MM context degrades performance drastically.

  • aenis 2 hours ago

    Same. First time in 2 months that I found it easier to fix the bugs it created manually, rather than get it to fix. It's google-code-CLI-on-gemini-2.5 level bad for me today. Meaning, almost comically bad.

arjie 7 hours ago

This is fantastic. I keep having to save to memory with instructions and then tell it to restore to get anywhere on long running tasks.

throw03172019 5 hours ago

Pentagon may switch to Claude knowing OpenAI has the premium rates for 1M context.

8note 6 hours ago

im guessing this is why the compacts have started sucking? i just finished getting me some nicer tools for manipulating the graph so i could compact less frequently, and fish out context from the prior session.

maybe itll still be useful, though i only have opus at 1M, not sonnet yet

dkpk 5 hours ago

Is this also applicable for usage in Claude web / mobile apps for chat?

aliljet 8 hours ago

Are there evals showing how this improves outputs?

  • apetresc 8 hours ago

    Improves outputs relative to what? Compared to previous contexts of 1M, it improves outputs by allowing them to exist (because previously you couldn't exceed 200K). Compared to contexts of <200K, it degrades outputs rather than improves them, but that's what you'd expect from longer contexts. It's still better than compaction, which was previously the alternative.

thunkle 7 hours ago

Just have to ask. Will I be spending way more money since my context window is getting so much bigger?

  • isbvhodnvemrwvn an hour ago

    Yes, full context is used to generate each new token.

johnwheeler 8 hours ago

This is incredible. I just blew through $200 last night in a few hours on 1M context. This is like the best news I've heard all year in regards to my business.

What is OpenAI's response to this? Do they even have 1M context window or is it still opaque and "depends on the time of day"

  • hagen8 8 hours ago

    Did u use the API or subscription?

    • johnwheeler 8 hours ago

      Max subscription and "extra usage" billing

      • steve-atx-7600 7 hours ago

        That sounds high. I mean, if you paid for the 20x max plan you’d be capped at around 200/month and at least for me as a professional engineer running a few Claudes in parallel all day, I haven’t exceeded the plan's limits.

        • Wowfunhappy 7 hours ago

          Prior to this announcement, all 1M context use consumed "extra usage", it wasn't included in a normal subscription plan.

          • steve-atx-7600 4 hours ago

            So, I’ve been using opus 4.6 1m since it was first available to 20x max users daily. What I think has happened is that even in doing so, I have not actually exceeded the plan token limits and therefore haven’t been charged for “extra usage” (just double checked). So, unless there’s a billing mistake or delay, “any usage” != “extra usage” which is what I was always unclear about. I am careful to iterate with claude on plans in plan mode followed by clearing the context and executing. I think I am hovering around the higher end of the smaller window model where I would have otherwise seen auto-compaction run.

            Another reason for less token usage is that 4.6 is much better at delegating agents (its own explorer agents or my custom agents) to avoid cluttering the window.

  • dominotw 8 hours ago

    rarely go over 25 percent in codex but i hit 80 on claude code in just a short time.

swader999 7 hours ago

I notice Claude steadily consuming fewer tokens every week, especially with tool calling

8cvor6j844qw_d6 8 hours ago

Oh nice, does it mean less game of /compact, /clear, and updating CLAUDE.md with Claude Code?

  • fnordpiglet 8 hours ago

    I’ve been using 1M for a while, and it defers compaction but almost makes it worse when it happens. Compacting a context that big loses a ton of fidelity. But I’ve taken to just editing the context instead (double esc). I am also planning to build an agent to slice the session logs into contextually useful and useless parts, discarding the useless ones and keeping things high fidelity that way. (I.e., carve up the jsonl with a script, have a subagent haiku return the relevant parts, and reconstruct the jsonl.)

    • dominotw 8 hours ago

      til you can edit context. i keep a running log and /clear /reload log

      • 8note 6 hours ago

        double escape gets you to a rewind. not sure about much else.

        the conversation history is a linked list, so you can screw with it, with some care.

        I spent this afternoon building an MCP to break the conversation up into topics, then suggest some that aren't useful but are taking up a bunch of context to remove (eg iterations through build/edit just needs the end result)

        its gonna take a while before I'm confident its worth sharing

zmmmmm 9 hours ago

Noticed this just now - all of a sudden i have 1M context window (!!!) without changing anything. It's actually slightly disturbing because this IS a behavior change. Don't get me wrong, I like having longer context but we really need to pin down behaviour for how things are deployed.

  • steve-atx-7600 7 hours ago

    You can pin to specific models with --model. Check out their doc. See https://support.claude.com/en/articles/11940350-claude-code-.... You can also pin to a less specific tag like sonnet-4.5[1m] (that’s from memory might be a little off).

    • zmmmmm 6 hours ago

      sure - but the model hasn't changed. I'm specifying it explicitly. But suddenly the context window has. I'm not using Claude Code, this is an application built against Bedrock APIs. I assume there's a way I could be specifying the context window and I'm just using API defaults. But it definitely makes me wonder what else I'm not controlling that I really should be.

  • phist_mcgee 8 hours ago

    Anthropic is famous for changing things under your feet. Claude code is basically alpha software with a global footprint.

LoganDark 5 hours ago

Finally, I don't have to constantly reload my Extra Usage balance when I already pay $200/mo for their most expensive plan. I can't believe they even did that. I couldn't use 1M context at all because I already pay $200/mo and it was going to ask me for even more.

Next step should be to allow fast mode to draw from the $200/mo usage balance. Again, I pay $200/mo, I should at least be able to send a single message without being asked to cough up more. (One message in fast mode costs a few dollars each) One would think $200/mo would give me any measure of ability to use their more expensive capabilities but it seems it's bucketed to only the capabilities that are offered to even free users.

  • aenis 2 hours ago

    I find it hard to understand that people consider $200 p/m a lot for what they are getting. Expensive compared to what? A netflix sub?

    A 1hr of a senior dev is at least $100, depending where one lives. Since Claude saves me hours every day, it pays for itself almost instantly. I think the economic value of the Claude subscription is on the order of $20-40k a month for a pro.

nemo44x 6 hours ago

Has anyone started a project to replace Linux yet?

dominotw 8 hours ago

can someone tell me how to make this instruction work in claude code

"put high level description of the change you are making in log.md after every change"

works perfectly in codex but i just cant get claude to do it automatically. I always have to ask "did you update the log".

  • 8note 3 hours ago

    whats the need? you have the session in a file as a dag. you can summarize to a log whenever you want. doesnt need to be as it goes.

    earlier today i actually spent a bit of time asking claude to make an mcp to introspect that - break the session down into summarized topics, so i could try dropping some out or replacing the detailed messages with a summary - the idea being to compact out a small chunk to save on context window, rather than getting it back to empty.

    the file is just there though, you can run jq against it to get a list of writes, and get an agent to summarize
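
    A rough Python equivalent of that jq pass. The record layout is an assumption, not a documented schema: one JSON object per line, with `message.content` holding blocks of `type: "tool_use"`, a tool `name`, and an `input.file_path`. The tool names in `WRITE_TOOLS` are likewise assumed, so adjust both to whatever your session file actually contains.

```python
import json
from typing import Iterable, List, Tuple

# Assumed names of the file-writing tools; adjust to match your session file.
WRITE_TOOLS = {"Write", "Edit"}


def list_writes(lines: Iterable[str]) -> List[Tuple[str, str]]:
    """Return (tool_name, file_path) for every write-like tool call, in order."""
    writes: List[Tuple[str, str]] = []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        try:
            record = json.loads(line)
        except json.JSONDecodeError:
            continue  # skip lines that are not valid JSON
        message = record.get("message") if isinstance(record, dict) else None
        content = message.get("content") if isinstance(message, dict) else None
        if not isinstance(content, list):
            continue  # plain-text messages carry no tool calls
        for block in content:
            if not isinstance(block, dict):
                continue
            if block.get("type") == "tool_use" and block.get("name") in WRITE_TOOLS:
                tool_input = block.get("input")
                path = tool_input.get("file_path", "?") if isinstance(tool_input, dict) else "?"
                writes.append((block["name"], path))
    return writes


if __name__ == "__main__":
    sample = [
        json.dumps({"message": {"content": "plain text turn"}}),
        json.dumps({"message": {"content": [
            {"type": "tool_use", "name": "Write", "input": {"file_path": "log.md"}},
            {"type": "tool_use", "name": "Bash", "input": {"command": "ls"}},
        ]}}),
    ]
    for tool, path in list_writes(sample):
        print(tool, path)
```

    The resulting list is small enough to hand to an agent for summarizing into a log.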

  • steve-atx-7600 7 hours ago

    Backup your config and ask Claude. I’ve done this for all kinds of things like mcp and agent config.

  • sergiotapia 4 hours ago

    use claude hooks - in .claude/settings.json you can have it run on different claude events like "PreToolUse" or "Stop" and in those events you pass in commands you want it to run.

    You can have stuff like for the "stop" event, run foobar.sh and in foobar.sh do cool stuff like format your code, run tests, etc.
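
    A sketch of what the "Stop" case might look like, built in Python and printed as JSON so it can be pasted into .claude/settings.json. The nesting (event name, then a list of groups, each with a `hooks` list of `type: "command"` entries) is written from memory of the hooks documentation, so treat it as an assumption and check the docs before relying on it. `foobar.sh` is the placeholder script from above.

```python
import json


def stop_hook_settings(command: str) -> dict:
    """Build a settings fragment that runs `command` on the "Stop" event.

    The layout is assumed from memory of the hooks docs, not verified here.
    """
    return {
        "hooks": {
            "Stop": [
                {
                    "hooks": [
                        {"type": "command", "command": command},
                    ]
                }
            ]
        }
    }


if __name__ == "__main__":
    # Print only; merging into an existing settings file is left to the reader.
    print(json.dumps(stop_hook_settings("./foobar.sh"), indent=2))
```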

gaigalas 8 hours ago

I'm getting close to my goal of fitting an entire bootstrappable-from-source system source code as context and just telling Claude "go ahead, make it better".

sergiotapia 4 hours ago

maybe i'm thinking too small, or maybe it's because i've been using these ai systems since they were first launched, but it feels wrong to just saturate the hell out of the context, even if it can take 1 million tokens.

maybe i need to unlearn this habit?

  • gskm an hour ago

    I think your instinct is right. More context isn't free, even when the window supports it, and the model still has to attend to everything in there, and noise dilutes the signal. A cleaner, smaller context consistently gives better outputs than a bloated one, regardless of window size. For sure, the 1M window is great for not having to compact mid-task. But "I can fit more" and "I should put more in" are very different things. At least in my mind.

alienbaby 6 hours ago

is this the market played in front of our eyes slice by slice: ok, maybe not, but watching these entities duke it out is kinda amusing? There will be consequences but may as well sit it out for the ride, who knows where we are going?

jf___ 4 hours ago

there is a parallel between managing context windows and hard real-time system engineering.

A context window is a fixed-size memory region. It is allocated once, at conversation start, and cannot grow. Every token consumed — prompt, response, digression — advances a pointer through this region. There is no garbage collector. There is no virtual memory. When the space is exhausted, the system does not degrade gracefully: it faults.

This is not metaphor by loose resemblance. The structural constraints are isomorphic:

No dynamic allocation. In a hard realtime system, malloc() at runtime is forbidden — it fragments the heap and destroys predictability. In a conversation, raising an orthogonal topic mid-task is dynamic allocation. It fragments the semantic space. The transformer's attention mechanism must now maintain coherence across non-contiguous blocks of meaning, precisely analogous to cache misses over scattered memory.

No recursion. Recursion risks stack overflow and makes WCET analysis intractable. In a conversation, recursion is re-derivation: returning to re-explain, re-justify, or re-negotiate decisions already made. Each re-entry consumes tokens to reconstruct state that was already resolved. In realtime systems, loops are unrolled at compile time. In LLM work, dependencies should be resolved before the main execution phase.

Linear allocation only. The correct strategy in both domains is the bump allocator: advance monotonically through the available region. Never backtrack. Never interleave. The "brainstorm" pattern — a focused, single-pass traversal of a problem space — works precisely because it is a linear allocation discipline imposed on a conversation.
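
The analogy can be made concrete with a toy bump allocator over a token budget. This is only an illustration of the discipline described above, with made-up names (`TokenBumpAllocator`, `ContextExhausted`); it does not model how any real client tracks context.

```python
class ContextExhausted(Exception):
    """Raised when a request would run past the end of the fixed region."""


class TokenBumpAllocator:
    """Fixed-size region with a single monotonically advancing pointer."""

    def __init__(self, capacity: int) -> None:
        self.capacity = capacity  # allocated once, never grows
        self.used = 0             # the bump pointer

    @property
    def remaining(self) -> int:
        return self.capacity - self.used

    def alloc(self, tokens: int) -> int:
        """Reserve `tokens` and return the offset where they start."""
        if tokens < 0:
            raise ValueError("token count must be non-negative")
        if self.used + tokens > self.capacity:
            # No garbage collector and no paging: the request simply faults.
            raise ContextExhausted(
                f"need {tokens} tokens, only {self.remaining} left"
            )
        start = self.used
        self.used += tokens  # advance only; there is no free()
        return start


if __name__ == "__main__":
    window = TokenBumpAllocator(capacity=1000)
    print(window.alloc(300))  # 0
    print(window.alloc(500))  # 300
    print(window.remaining)   # 200
    try:
        window.alloc(201)
    except ContextExhausted as exc:
        print("fault:", exc)
```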