smallstepforman 16 hours ago

Today I asked Gemini to extract a table from a PDF appendix and create a C++ data table with its contents. After 15 or so iterations with corrections and new mistakes, it eventually gave up. I was floored when it said “I’m sorry, I cannot do this simple task, I’ve exceeded my error threshold and cannot do this task for you. My LLM prediction engine invents data instead of doing a simple data copy/reformat”.

Stunned to see that Gemini threw its digital arms in the air and gave up.

  • hashta 15 hours ago

    That's interesting because my experience has been almost the opposite. A few months ago I tested Gemini on converting screenshots of tables from PDF files into CSV. I tried it on several different tables and it got every one right. It consistently outperformed ChatGPT.

    • lxgr 3 hours ago

      The key here is that you used screenshots. This forces Gemini into "OCR mode" (i.e. actually looking at vision tokens) rather than trying to be clever with its tool calls.

      The latter strategy almost entirely depends on the quality of the skills and tool calls exposed to a given agent.

    • blensor 6 hours ago

      Tangentially related question: has anyone analyzed whether the content being converted could break the model?

      So let's say you have a super dull PDF (or even a scan) that has the same line over and over again. Could this get the model into one of those loops that just keep spewing nonsense?

      And thinking that further, could someone prompt inject a model with a handwritten note that only gets "activated" once it's in the context?

    • jatora 14 hours ago

      anyone who has used both knows this is inaccurate or dishonestly stated (ie. you were using gpt nano or some nonsense)

  • anigbrowl 13 hours ago

    It's extremely hit or miss. I've had it one-shot a pretty decent analytic prototype from a brief description, but also had it get trapped in hour-long back and forth regression hell over incredibly simple things like adding a static favicon (ie it would add it, then keep taking it away with every subsequent iteration, breaking something else every time it was asked to put the favicon back etc.).

  • frankacter 10 hours ago

    I just tried this and it worked without issue.

    Some considerations:

    1) Tell it to extract the data (in a new session). Does that work?

    2) If it doesn't, could there be something up with the PDF?

    As many commenters suggested, this works well with Gemini, so there is likely a missing variable in play.

    Share your prompt and the PDF and let's see if we can determine what.

    • TightFibre 4 hours ago

      Long shot, but I wonder if an image of the pdf would do better if it did get unstuck on internal formats.

      • lxgr 3 hours ago

        It definitely does. PDF is a vector-based image format historically, and all add-ons that make it behave a bit more sane as a text-oriented document format are optional, so your mileage using tools like pdftotext will vary greatly depending on who created a given PDF.

  • base698 16 hours ago

    That's better than the loop grok got stuck in trying to use git and push the work it did leading to a $15 api credit deduction.

    • whh 15 hours ago

      Getting AI/ML to acknowledge "I don't know" is such a challenge.

      • jgalt212 15 hours ago

        Not true regarding ML, most ML methods support RMSE even if they are non parametric methods.

        • janalsncm 13 hours ago

          RMSE is just an extrapolation from the training data. If the data is wrong because the world changed, any model (parametric or not) can be confidently incorrect.

      • taneq 13 hours ago

        This is why the world model approach is so important. It allows you to feed back the prediction accuracy of the model to itself at training time, enabling it to predict (to some degree) its own uncertainty. If you jump through a couple of hoops you can also do this at run time to give it “spidey sense” that something’s not right with current inference.

  • chorizo 10 hours ago

    I built a little research dashboard that monitors new papers from specific research labs. There is a paper ingest skill that writes Python scripts using pdfplumber to dismantle pdf’s. I have also used it to fetch supplementary information to replicate/augment the published tables. It can also use plotdigitizer to extract raw data from plots.

  • hodgehog11 10 hours ago

    The PDF reader for Gemini is extraordinarily poor in my experience. I like the writing style of this model a little better, but for most tasks people would use AI for, Gemini is probably not what you want to be using.

    • trees101 10 hours ago

      what is a good way to read PDFs using AI?

      • seanhunter 8 hours ago

        In my experience it really depends on what sort of pdfs you are trying to extract (ie what the content is).

        Regular pdfs that have been produced in a “normal way” (ie using latex or a modern application with a “save to pdf” function) will contain the text, and for those I’ve had a lot of success using pypdf.

        “Image” Pdfs that have been produced via a scan so don’t actually contain a text transcript require actual OCR. At the moment my personal rag pipeline is doing this using a local Gemma4 model (you could use something else).

        Either way I do an audit post-ingest where I select a random set of pages and also get the local gemma model to try that same set and compare. The symptoms to look out for here will depend a lot on what you’re trying to extract but I’m extracting maths mostly so I get the model to check extraction of symbols, equations etc. One thing I have consistently found useful is to look for “mojibake” (scrambled text caused by decoding in an unintended character encoding) as this almost always catches pdfs that have just extracted as pure garbage. I added this step because I was ingesting a lot of old maths pdfs which have specialist notation that wasn’t always getting correctly ingested and as they were image pdfs it was coming in as pure garbage. So the fix here is to use a specialist ocr service (I have been using “mathpix” which has been great and isn’t too expensive if you don’t want to do too much).
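
        A minimal sketch of that kind of post-ingest audit, using only the Python standard library. Everything here is an assumption for illustration: the function names, the thresholds, and the regex, which only catches the most common signature of UTF-8 text decoded as Latin-1 and will miss the glyph-mapping failures specific to PDFs. The two inputs stand in for per-page text from two independent extractions of the same document.

```python
import random
import re
from difflib import SequenceMatcher
from typing import List, Optional

# Heuristic signature of UTF-8 decoded as Latin-1/Windows-1252, plus the
# Unicode replacement character. Illustrative only, not exhaustive.
_MOJIBAKE_RE = re.compile(r"[ÃÂ][\u00a0-\u00bf]|â€|\ufffd")


def mojibake_score(text: str) -> float:
    """Return the fraction of characters that look like decoding debris."""
    if not text:
        return 0.0
    hits = sum(len(match.group()) for match in _MOJIBAKE_RE.finditer(text))
    return hits / len(text)


def audit_sample(
    primary: List[str],
    secondary: List[str],
    sample_size: int = 5,
    seed: Optional[int] = None,
    min_similarity: float = 0.9,   # hypothetical threshold
    max_mojibake: float = 0.02,    # hypothetical threshold
) -> List[int]:
    """Compare a random sample of pages from two extractions of one PDF.

    Returns the indices of sampled pages that disagree too much or that
    look like garbage in the primary extraction.
    """
    rng = random.Random(seed)
    page_count = min(len(primary), len(secondary))
    sampled = rng.sample(range(page_count), min(sample_size, page_count))
    flagged = []
    for index in sorted(sampled):
        similarity = SequenceMatcher(None, primary[index], secondary[index]).ratio()
        if similarity < min_similarity or mojibake_score(primary[index]) > max_mojibake:
            flagged.append(index)
    return flagged
```

        Flagged pages would then go to the slower path (a second OCR pass or a manual look) rather than being trusted as extracted.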

        The other thing that can cause problems is things like tables (eg if you were trying to ingest a lot of pdfs like financials of companies etc). Those can cause problems for both the ocr and the pure text extraction methods. I don’t have a current recommendation for that because I haven’t done it recently enough and the state of the art has moved a lot. It’s something to be aware of that will require special treatment though.

        • lxgr 3 hours ago

          > regular pdfs that have been produced in a “normal way” (ie using latex or a modern application with a “save to pdf” function) will contain the text

          Producing "normal PDFs" that way actually requires specific LaTeX options to be enabled in my experience. Without that, PDF viewers have to perform all kinds of ugly hacks to even figure out what Unicode codepoint a given glyph is supposed to represent! PDFs are far more of a vector format, and far less of a text layout format, than most people seem to realize.

          > One thing I have consistently found useful is to look for “mojibake” (scrambled text caused by decoding in an unintended character encoding)

          This is exactly the problem with PDFs: It's not regular mojibake (i.e. interpreting a string of text in the wrong charset), but rather some PDF processor's failed attempt at mapping glyphs back to codepoints without an explicit mapping table being present in the PDF, which is something that the creator actively has to do.

          > “Image” Pdfs that have been produced via a scan so don’t actually contain a text transcript require actual OCR.

          For the reason above and others, in my experience, OCR actually works significantly better than trying to "semantically parse" the PDF.

      • chrsw 2 hours ago

        In the broadest sense, I don't think we're there yet. I asked an SoC vendor to provide their chip documentation in Markdown. They refused. So, I went ahead and tried to do it myself with AI.

        I tried various AI tools and the results ranged from absolute garbage to something-but-not-quite.

        I went ahead and did a section of a huge PDF by hand, just to see if what I was asking for was even feasible. After more than several hours of painstaking work spread across multiple days, I got several chapters to look identical to the source PDF in some Markdown renderers. I had to use some HTML for the more complex tables. I converted some diagrams to Markdown and some to images linked to from the Markdown.

      • wwn_se 6 hours ago

        Doing a preprocess using some pdf extraction and ocr tool and then feeding that to the big model is usually way more stable.

      • rawoke083600 5 hours ago

        MinerU works well for getting it into Markdown.

  • 0xbadcafebee 9 hours ago

    It does this pretty often. Gemini is an "intelligent" model, but it's massively nerfed and so isn't useful for real work. If you use it with an agent harness, you need to design the harness to detect this and start a new session. Once it nerfs itself it won't try again.
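
    A rough sketch of what that detection could look like in a harness. The marker phrases are taken from the give-up message quoted at the top of the thread; `new_session` is a hypothetical factory that returns a `send(prompt)` callable for a fresh conversation, not part of any real Gemini SDK.

```python
from typing import Callable, Optional

# Phrases lifted from the give-up message quoted in this thread.
# A real harness would probably need a longer, tuned list.
GIVE_UP_MARKERS = ("cannot do this", "exceeded my error threshold")


def looks_like_give_up(reply: str) -> bool:
    """Return True if the reply contains one of the known give-up phrases."""
    lowered = reply.lower().replace("’", "'")
    return any(marker in lowered for marker in GIVE_UP_MARKERS)


def run_with_restarts(
    task: str,
    new_session: Callable[[], Callable[[str], str]],  # hypothetical session factory
    max_sessions: int = 3,
) -> Optional[str]:
    """Retry the task in a fresh session whenever the model gives up.

    Returns the first reply that is not a give-up, or None if every
    session gave up.
    """
    for _ in range(max_sessions):
        send = new_session()  # fresh context, no memory of the failure
        reply = send(task)
        if not looks_like_give_up(reply):
            return reply
    return None
```

    Starting a new session instead of arguing in the old one matches the observation above that the model won't try again once it has given up.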

  • fsmv 15 hours ago

    You should just have it OCR a screenshot of the PDF; that would probably work better.

  • BobbyTables2 13 hours ago

    Tabula + Excel could probably do it quicker.

    • jwrallie 8 hours ago

      The table select option in Okular is also great, as you can manually rearrange the divisions. For low volume, of course. Tabula will work better otherwise. I also suggest Libreoffice Calc, the .csv support is leagues ahead of Excel.

  • staticman2 15 hours ago

    You didn't say whether you were using the App but the App's performance seems to be severely throttled compared to API.

  • anitil 13 hours ago

    My go-to for this is to screenshot and use the built-in text extraction in the screenshot tool (I'm on a mac), then pass on that text data to whatever processing. It's a pretty good tool so long as the PDF is in OK shape (I've had errors in scanned images).

    • nradov 13 hours ago

      It's so horrible that in 2026 people are still publishing important data and specifications in a format like PDF that's difficult for LLMs to consume. We need to drag them kicking and screaming to HTML or Markdown. Heck, even Microsoft Word DOCX is superior for reliable parsing and content extraction.

      • dannyw 5 hours ago

        Good luck, getting rid of PDFs is going to be as hard as migrating from JPEG everywhere.

  • jjice 16 hours ago

    I haven't heard any accounts of it doing that since Gemini 2.5, but it was pretty easy to get it to do it with a programming task back then after a few failed attempts. Very interesting to hear it'll still do it.

  • staindk 15 hours ago

    We've been quite impressed with GCP Document AI. Not sure if it has a free tier but perhaps that's where Google is putting all the good OCR.

  • mjcohen 13 hours ago

    Years ago, I used Acrobat to extract tables from a PDF. Had to do it manually, but it pasted nicely into Excel.

  • suuuuuuuu 14 hours ago

    I envy you that it admitted that rather than simply making up data and lying about it.

jorjon 12 hours ago

Gemini Flash 3.5 (through agy) ran `git reset --hard` when I asked it to commit my changes, apparently it thought it was better to have a clean repo before `git add`. Of course I'm not trusting my computer to it. When will we have 3.5 Pro?

  • Synthetic7346 11 hours ago

    I want a lightweight sandbox with all my dev stuff that LLMs can't escape before I turn on computer use

satvikpendem 20 hours ago

There's still no MCP support in the Gemini app, which is very useful to get various pieces of info as a user just via chatting. For example I recently wanted to get an Airbnb and wanted to filter by specific criteria including house image analysis and Gemini couldn't do it so I had to do it in Codex.

  • anticorporate 19 hours ago

    Yeah, it seems like this is the biggest missing feature from the Gemini ecosystem.

    If I can't connect MCP, there's really no selling point for me to use Gemini from my watch, car, smart speaker, etc. If I'm already bound to using my own front end, then I'm only evaluating Gemini as a model/API, at which point it has many competitors that may be cheaper or better fit for the task.

  • mitchell_h 18 hours ago

    I'm fairly convinced Claude's strongest point is the app. AI users aren't anywhere near as mature or smart as youtube/hn would have folks believe. The claude app is amazing for bridging that gap.

    • dr_dshiv 18 hours ago

      Didn’t it take them like 2 days to build the first one?

  • lil-lugger 16 hours ago

    I think native apps are critical infrastructure in AI development particularly around agents. The truth is there’s no good native interaction layer for custom agents. If you want to wire up and self host an agent that has access to anything ever your only option is a janky port to telegram or Slack. I’ve been building vessels.app because I think it’s the missing piece to agent interaction. I need testers if anyone is interested!

  • tonyrice 20 hours ago

    This is why I don't always use the official Gemini Web app. Lately I've found that it's more useful to utilize a CLI. I'm looking forward to the day they add MCP in the web.

    • pregseahorses 19 hours ago

      Gemini CLi now requires antigravity subscription..

      • tim-projects 13 hours ago

        I tried antigravity. It absolutely sucks in comparison. Given up on Google

    • singingtoday 19 hours ago

      CLI doesn't work with my subscription..

  • solarkraft 18 hours ago

    They only fixed stopping the model mid-generation losing the entire session pretty recently.

    The Gemini apps suck.

mlmonkey 19 hours ago

It's funny how in their own graph, https://storage.googleapis.com/gweb-uniblog-publish-prod/ima... Gemini 3.5 Flash is beat hands down by both Opus 4.8 and GPT 5.5, and yet the graph is drawn as if Gemini wins ... :-D

  • mroche 19 hours ago

    The graph has Gemini 3.5 Flash matching Sonnet 4.6, losing to Opus 4.8, and slightly behind GPT-5.5 by 0.3 points... That's not that much of a hands-down loss for Gemini for this specific workload benchmark.

    The methodology used:

    https://deepmind.google/models/evals-methodology/gemini-3-5-...

    Methodology: All Gemini scores are pass @1 except where otherwise noted. "Single attempt" settings allow no majority voting or parallel test-time compute. All of the results are all run with the Gemini API for the model-id gemini-3.5-flash with default sampling settings unless indicated otherwise below. To reduce variance, we average over multiple trials for smaller benchmarks.

    All the results for non-Gemini models are sourced from providers' self reported numbers unless otherwise mentioned below. For Claude Opus 4.7 , Sonnet 4.6, and GPT-5.5 we default to reporting maximum thinking/reasoning settings available, but when reported results are not available we use best available reasoning results.

  • sheept 19 hours ago

    It highlights the Gemini models blue since that's what the article is about. The bar heights seem consistent with the values.

    • namuol 15 hours ago

      They should be sorting the models by performance on the horizontal axis.

      • roygbiv2 5 hours ago

        Why would they make their own product look worse?

  • data-ottawa 19 hours ago

    I think 3.5 flash is trying to target agentic work, like Google Search or ADK (agent development kit) use cases.

    It’s something cheap enough you’d put out in front of your customers, and Opus is expensive enough you wouldn’t.

  • gb2d_hn 19 hours ago

    It's honest - people who know what they are looking at will take speed and token costs into account. I don't use Gemini 3.5 for coding, but I use it as something in between a search engine and agent.

  • IncreasePosts 15 hours ago

    It's amazing how designers of charts trying to show their product is close to the leader always remember to start the axis at zero, and designers of charts trying to show how big their lead is always forget that

    • timacles 11 hours ago

      Promotions material isn’t a medium for scientific rigor

AbuAssar an hour ago

Google should drop the flash moniker, as it implies a small model

s_kazmi 3 hours ago

stopped using gemini a couple months ago when they ruined their rate limits. Not sure why people still use em. They were good with the generous rate limits in antigravity. but done after that.

YuechenLi 14 hours ago

So... has Google provided a Codex/Claude Code equivalent to Gemini yet? I would like to use Gemini for coding tasks, but that's kind of difficult to do as I don't even know how to get Gemini to even "clone this repo and read the code in it for static analysis", much less open PRs in repos.

ChatGPT/Codex can do it, Claude can do it, why can't Gemini?

And no, I don't mean going through Antigravity, and personally I'm wary about LLMs having unsupervised access on my computer without explicit policy, so I really think Google is putting the cart before the horse here.

  • kakugawa 14 hours ago

    Antigravity CLI (which replaced Gemini CLI):

    https://antigravity.google/product/antigravity-cli

    • anigbrowl 13 hours ago

      So a couple weeks ago I decided to pay some money for the Gemini API cause I'd found myself getting a lot of use out of the free tier chat and figured they deserved some of my cash.

      First headache was a lot of delays and 'service unavailable due to excessive load' messages. Second headache was a lot of frustration with the Continue plugin in my IDE. Gemini chat suggests I try the Antigravity app. I do so. It's OK. Launch it on an agentic task, it gets part way through and stops, asking me to subscribe. Try getting it to use some of the Gemini credits I paid for a few days earlier. Turns out Antigravity is developed by a completely separate team within Google and they don't recognize or accommodate Gemini credits because they are trying to maintain budgetary independence so as to maintain operational autonomy from Gemini. At least, that's the explanation Gemini (free tier) offered for why Antigravity (Free/subscription only) won't accept Gemini (prepaid/subscription) payments. What a time to be alive.

      You won't be surprised to learn I switched to a competitor.

      • th0raway 2 hours ago

        On one end we have Coase's theory of the firm, but then we have the realities of the modern tech megacap company, which is 20+ companies in a trench coat, and where it's clear that the best interests of a middle manager, of the company as a whole, and of the consumer have absolutely nothing to do with each other. But pointing at one working thing in an investor call seems quite valuable, so we aren't seeing investors actually demanding to spin off companies with minimal actual synergy.

        So we end up with companies acting in ways that don't help themselves or the consumer, but which have no reasonable mechanisms to correct any of this. So we end up with the two best entrants in the AI space being independent companies, all while we know that, in case of significant cuts, it's the companies that are attached to other huge, unrelated sources of revenue that will have an easier time surviving. Gemini can mess up all they want as long as management still has Ads and youtube sitting there subsidizing them.

      • reacharavindh 4 hours ago

        +1 to service unavailable annoyances.

        It has been close to unusable for anything serious. I did really like the ability of Gemini-pro models to ground their research using Google Search. This meant that they were often much more thorough and up to date in their recommendations and finding solutions that came to life after the models themselves were trained. But even using Gemini as a reviewer was a weak point in my harness because of the poor reliability of their service (529 or 503).

        I’ve since paid for a search API(linkup, exa, and valyu) and hooked them up to Deepseekv4-pro. It has been doing a stellar job.

        The key was to prompt them to systematically use search to validate their answers: not simply “use the tool”, but something like “find possibilities using web search, then once you formulate a solution, validate it with this checklist”:

        1. Is there a better way to do this in 2026?

        2. Are the libraries you’re using, and their docs, up to date?

        This seems to help very much based on experience.

  • jatora 14 hours ago

    no. gemini's instruction following is currently abysmal. Gemini CLI could be a great scaffold for all we know, but we can't know because the models it uses are so horribly bad at being driven in that way.

    no clue why google has dropped the ball this hard on IF.

    • nl 12 hours ago

      I think 3.5 Flash is supposed to improve the long-range instruction following.

      I haven't tried it at that because for the short range tasks I gave it I found it around Sonnet level, but slower (because it takes more tries!) which makes it more expensive.

      The old Flash models were great because they were fast.

      3.5 Flash shows Google can work around the old "Good, Fast or Cheap: Pick any two" thing by picking none of them.

    • polski-g 13 hours ago

      They have a 3.1-pro-customtools model which they allege is better at using custom tools/MCPs/etc.

      It is not better.

  • andrewbutts 14 hours ago

    Yes, Gemini CLI and Antigravity

airstrike 19 hours ago

Computer use is such a terrible idea. It's slow, insecure, error prone, expensive.

I guess if you're trying to get people to tokenmaxx it may look like a valid strategy, but ain't no way this will be delightful to users.

I think it's a symptom of just not understanding how LLMs should interface with the OS because we're still in their early days.

Eventually there'll be an iPhone moment for the ergonomics of LLM usage outside of coding

  • gdudeman 18 hours ago

    Computer use is a great idea. It gets the job done when nothing else will.

    If you're a person trying to get their job done at a big company, but half your job is in 1-2 proprietary tools or is stuck behind an API you can't program against, computer use can allow you, a non-techie, to do your job more efficiently.

    I think it's an awesome way to circumvent gate keepers and the IT department to let people accomplish their goals.

    • antiloper 2 hours ago

      Computer use is very useful for developing GUI applications since claude code can build and test the entire app end-to-end (accessibility APIs exist but depending on the UI framework of your choosing you can run into walls very fast).

      I run it in a VM using a headless wayland compositor, I'd never trust even fable with access to my real system.

    • airstrike 17 hours ago

      That is an incredibly niche use case and comes with a boatload of footguns.

      Even then, an AI writing AHK scripts likely outperforms.

      • porridgeraisin 4 hours ago

        It does. I used to be an ahk "script kiddie" and know it front and back. It's sort of burnt into my brains. As a result, I can prompt really really well, notice issues at a glance, and I have a sheer volume of scripts locally for all sorts of tasks some from as far back as 2014. From tiling window managers to OCR all the way to simple hotkeys/hotstrings. I let it grep in that folder and build out whatever I want using those primitives. This gives actually 1-shot immediately usable 100% working scripts even with GPT3.5 level models, as opposed to the iterations needed for typical development.

        Example: adding copyright text box to bottom of every slide

          F3::
            ; Attach to the running PowerPoint instance and its active presentation
            pres := ComObjActive("PowerPoint.Application").ActivePresentation
            Loop % pres.Slides.Count {
              slide := pres.Slides.Item(A_Index)
              ; AddTextbox(orientation, left, top, width, height); 1 = horizontal text
              box := slide.Shapes.AddTextbox(1, 100, 500, 500, 30)
              box.TextFrame.TextRange.Text := "Copyright 2026. All Rights Reserved."
            }
          return

    • Rebelgecko 17 hours ago

      I think there's a sweet spot- a lot of the time you're probably better off with "reverse engineer this web page and build me an API or personalized chrome extension to meet my needs".

      I have an agent doing price checks for me for an item on a certain website. Instead of blasting through a zillion tokens processing the DOM over and over, it loaded the page once and figured out how to download a json with the price.
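
      Once the JSON endpoint is known, the recurring check needs no model at all. A sketch under loud assumptions: the URL is whatever the agent discovered, and the `product` / `price` key path is a made-up payload shape, since the actual site and its JSON layout aren't given here.

```python
import json
from typing import Sequence
from urllib.request import urlopen


def extract_price(payload: str, key_path: Sequence[str] = ("product", "price")) -> float:
    """Walk a (hypothetical) key path through the JSON payload and return the price."""
    data = json.loads(payload)
    for key in key_path:
        data = data[key]
    return float(data)


def fetch_price(url: str, key_path: Sequence[str] = ("product", "price")) -> float:
    """Download the JSON document at `url` and extract the price from it."""
    with urlopen(url, timeout=10) as response:
        return extract_price(response.read().decode("utf-8"), key_path)
```

      Keeping the parsing separate from the download makes it easy to check against a saved payload when the site changes its format.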

      • airstrike 15 hours ago

        Does it have to view the page now repeatedly to download the JSON?

        • saaspirant 12 hours ago

          The tool it built will do viewing, probably.

          • airstrike 11 hours ago

            that tool is called Chrome with extra flags in the CLI

    • reacharavindh 17 hours ago

      How are folks using “computer use” to click things on intranet portals that are behind an SSO? Even the OP's example just shows visiting a URL and entering a search term… that is sort of useless.

      How can I automate things behind an SSO wall? Even if it means I manually authorize it once and watch it do things on its own..

      • hahajk 15 hours ago

        I've never used Gemini computer use, but I assume it's the same:

        Claude computer use takes control of your whole computer inputs (mouse and keyboard) plus screenshots. You just log in, tell Claude you're logged in, and let it get to work. It'll use the browser you're logged in with.

        The chrome extension is a little better because it only takes control of its own chrome tabs (again: you just log in.)

      • pimeys 15 hours ago

        Take manual control once, save the login info to a password manager, teach the model to login with it.

    • uejfiweun 18 hours ago

      Yeah, it's not that computer use is the most theoretically optimal paradigm, but there's a reasonable case that given the constraints of modern software systems and how they're built, that it's the most realistically optimal paradigm.

  • thorum 18 hours ago

    The “correct”, elegant way for AI to interact with existing software would take decades and billions of dollars to build. Someone would have to do the hard work of building new APIs, solving decades of accessibility issues, etc.

    Or you can show an AI screenshots and ask it where to click.

    • sarreph 18 hours ago

      I disagree if your application is networked. Most SaaS is built on RESTful APIs that can be converted trivially into interfaces / contracts for tool use.

      • chatmasta 18 hours ago

        So you can either wait for every application to do that, or at least make it possible for an LLM to do it… or you can make the LLM use a computer interface that works with every application by definition.

        • Chu4eeno 17 hours ago

          The middle ground would be leveraging e.g. standard a11y APIs, and/or hooking into applications like Squish does.

          Then you get a nice textual world that fits the LLM without having to rewrite every application to have a fullblown HTTP server.

    • jubilanti 18 hours ago

      it takes decades and billions of dollars to develop APIs?

  • johnsmith1840 10 hours ago

    The iphone moment is an AI that can completely manage your personal life. It has full access to every financial account you own and handles all admin work. It could sign you up for a new account, pay, and give you the login.

    If you can SAFELY do that it's a big moment. But to be clear safe is a massive problem. Until you see a big company start saying the AI can use your SSN, CC, bank password safely we aren't there yet.

    • fragmede 10 hours ago

      Cars were around for decades before they came up with seatbelts. Claude Cowork will happily go through your files, which might just have your SSN in them, and ignore previous instructions.

      • johnsmith1840 10 hours ago

        But we have regulation and compliance for consumer secrets? That's not a comparable example.

        The difference is that if openai gave you a product and it leaked a million people's bank passwords it would destroy the entire company.

        Again until a big tech product can bring that to a clean user experience we're not there yet. Even the most zealot openclaw users are not hooking their bank accounts into the AI yet. I'm sure they exist but I've not seen them.

        Also every big tech computer use product actively screams for you not to give their agents secrets.

      • johnsmith1840 10 hours ago

        Every major company screams not to put secrets in their computer use bot.

        Seatbelts were regulated later. Your SSN and CC have been regulated for over a decade.

  • JSR_FDED 6 hours ago

    Tens of millions of users every day rely on Robotic Process Automation. It’s the glue that holds companies together.

  • orbital-decay 18 hours ago

    Spreadsheet is such a terrible idea. It may look like a valid tool, but ain't no way it's delightful to users. Most of the time people need a database instead. Eventually there'll be an iPhone moment for this.

    Meanwhile, the entire world economy:

    • airstrike 17 hours ago

      I mean, your words not mine. You can't just claim I'm making a point I didn't.

      Spreadsheets are fucking glorious, powerful, clever, amazing and delightful, in my view.

  • nzach 19 hours ago

    > Computer use is such a terrible idea. It's slow, insecure, error prone, expensive.

    And yet having an agent able to use a computer on your behalf is really useful.

    Recently I gave a Nix OS vm to my hermes agent and it has been a good experience. I don't really care if it destroys the machine, I can just rollback to an earlier version, and for any meaningful data he creates for me I make sure he creates a repo, commits and pushes to my private Gitea instance.

    • airstrike 18 hours ago

      > And yet having an agent able to use a computer on your behalf is really useful.

      It is, but there's no need for it to be viewing your screen, browsing websites and watching ads.

      That stuff is for humans, not for LLMs.

      • nzach 18 hours ago

        Sure, I don't want an agent watching MY screen. That's why I gave him his own environment, and pretty quickly he discovered that you can open chrome and make it render to a framebuffer, this way he is able to 'view' the website. And apparently with this he is able to bypass a lot of 'anti-bot' measures.

    • dbbk 18 hours ago

      > And yet having an agent able to use a computer on your behalf is really useful.

      I honestly cannot think of a single use case

      • johnsmith1840 11 hours ago

        Literally everything you do every day.

        It's the end game of AI. Have systems trained on doing EVERYTHING you do on a computer all day. Trained by you while doing the job.

      • nzach 17 hours ago

        I think the main advantage is adaptability.

        Imagine you have a pretty exotic task you need to complete that involves converting a video file from one format to another.

        You can use ChatGPT or something similar and the best you will get is either a script you can run on your machine that does what you need or he may decide to render a new video.

        If you have something like OpenwebUI you could configure an MCP that converts videos and allow the model to use this MCP to do your task. This should work, but is quite a lot of work for something you'll only ever do once.

        But if the agent has its own environment he can decide to install ffmpeg, execute the transformation and serve you the file you want.

        In reality there are no new capabilities with this approach, but things get a lot more comfortable.

        • quantumleaper 16 hours ago

          This doesn't require computer use, just a bash tool (and possibly fetch to get ffmpeg documentation)

          • dbbk 14 hours ago

            Yeah even Claude Cowork would do this, doesn't need "computer use"

      • golem14 11 hours ago

        I'll give you one: Google News is pretty terrible right now. Almost all interesting news sources are paywalled, so I get recommended all kinds of weird lifestyle publications that are really horrible. With the computer use API I can just tell Gemini to look at Google News, pick the articles that look interesting, look them up on archive.is, and give me the plain-text article along with a summary. I think that would probably work pretty well.

      • akoboldfrying 6 hours ago

        Have you ever done something tedious on a computer?

  • dyauspitr 17 hours ago

    We shouldn’t optimize for token use. We should build infrastructure to make tokens dirt cheap instead.

  • api 18 hours ago

    It's great for testing and QA automation for UIs. It's also possibly good for the vision impaired.

    • orbital-decay 18 hours ago

      UI QA only works well if your model plausibly matches the average user behavior and/or real-world edge cases. These models are far from that, and they are much less random than you'd like them to be for fuzzing (mode collapse).

      • Wowfunhappy 15 hours ago

        It doesn't need to be that kind of QA. Even just a basic "I want the AI to build the beginnings of a GUI app for me" will work much better if the AI can see the output of its work and iterate on it. Similarly, if you want the AI to fix a GUI bug, it's much better if you can show it the bug and tell it how to test to see when it's gone.

        • airstrike 15 hours ago

          the LLM does not require computer use to see the GUI and, again, that's a pretty niche use and not what Computer Use is being marketed for

          • Wowfunhappy 3 hours ago

            > not what Computer Use is being marketed for

            Okay, fair, I haven't really paid attention to marketing.

            > the LLM does not require computer use to see the GUI and

            It can take screenshots without computer use, but it can't click around. I didn't have access to computer use until recently (I'm on an OS where Claude Code technically shouldn't run, I had to patch the binary), and when I got it working it made a big difference because of this.

revolvingthrow 18 hours ago

People using google’s models: am I holding it wrong or are the guardrails really overtuned?

I had the dubious pleasure of testing gemini of late and I kept running into refusals. How do I transfer a sim number from one provider to another? No. What should I consider when making backups on ntfs less prone to data loss and more bitrot resistant? No. Evaluate this piece of code? No.

I’m not sure if it’s cold feet from the mythos situation or what, but it reminds me of the dark days where you couldn’t use ai for much of anything. But then I go to chatgpt 5.5 and it does mostly everything I want outside of the usual cybersecurity boogeyman that you run into now and then.

  • sva_ 17 hours ago

    Interesting. I have the Google AI Pro plan and use Gemini several times each day and I don't remember the last time I got a refusal. I wonder what criteria go into that, like maybe how they rate your Google account?

  • dekhn 16 hours ago

    If I type your first query into Gemini, it immediately spits out a long and plausible answer.

    What exactly are you saying it's refusing? Can you give a screenshot or example?

  • Chu4eeno 17 hours ago

    I've always found all versions of gemini to be (for lack of a better word) lazy.

    I guess it's economical wrt. token use, but it often either refuses for absurd safety reasons or does other weird stuff, like responding that an LLM like itself isn't a suitable tool for the job, and it gives up very quickly.

    Claude is on the other end of the spectrum, which makes it more noticeable when switching between them.

  • kordlessagain 18 hours ago

    I love antigravity. I’ve had zero issues with it.

  • k8sToGo 18 hours ago

    The context window size is also very small if you use Gemini in the app. It starts to forget quite fast. In my opinion Gemini in the app is useless, in addition to the guardrails.

  • nout 18 hours ago

    I just asked gemini the question with sim number and it gives me full step by step guide.

  • WarmWash 17 hours ago

    Are you outside the US?

    • esperent 17 hours ago

      I'm outside the US, use Gemini models quite a bit, and I've never run into any refusals of any kind. I'm using them for a fairly wide range of things, I'm sure at least as risqué as asking how to transfer a sim. As a matter of fact I actually asked its advice on how to transfer banking apps and auth apps from one phone to another about 3 weeks ago and got decent answers.

      • WarmWash 16 hours ago

        It's more dependent on the specific country they are in (and I don't know the specifics). But Google is large enough to have lawyers for every country, and Google is in a never ending whirlwind of national lawsuits/fines, so you end up at the mercy of whatever the lawyers for your country think will not piss off regulators. The EU (and individual states) have pretty heavy AI regulations, and Google even just got fined for an AI overview being incorrect.

        It also could just be which way the wind was blowing for OP, the models are stochastic to some degree, but there is no shortage of complaints from (mostly euro) users getting stonewalled.

        • joe_mamba 16 hours ago

          I've seen similar refusals on X from Claude from users in Germany when the LLM assumed the users are asking for something forbidden about certain topics.

          Ultimately I think that in 10 years' time, this is what's gonna kill paid consumer LLMs, and boost the usage of Chinese LLMs self-hosted at home on your own hardware that people will torrent via VPNs, as they will also be banned because of "disinformation and misinformation".

          So the end winners will be the hardware companies that will sell AI chips to consumers after the datacenter bubble pops. Unless of course the EU will ban the sale of AI chips that don't have some limitations baked in on which models you're allowed to run (the state approved ones). Interesting times ahead. I think in 10-20 years time we'll look back at present day LLMs the way we look back at the open internet of the 90's-00's.

  • TacticalCoder 16 hours ago

    > People using google’s models: am I holding it wrong or are the guardrails really overtuned?

    They are quite insane. I was asking it to list candidate metal parts I could buy at a hardware store to add weight to 3D prints: stuff like angle brackets etc.

    I wanted to know bang for the buck and ease of insertion (at print time) / modelling in a 3D model.

    Complete refusal as if I was a terrorist building a bomb.

    Then there are the weird refusals that then are OK after all if you insist by asking it what's wrong about it:

    "How should I cook eggs?"

    "I'm sorry but I can't help you with that" (it formulates it differently but that's the idea)

    "What, I'm just hungry, is explaining to me how to cook eggs really against your rules?"

    And then it answers "No of course not, here's how to do it:..."

    Really strange stuff.

    • unbrice 9 hours ago

      This is indeed strange. Do you get the same results if you disable personalisation using the top right button ("temporary chat")? Can you please share a trace?

arjunchint 15 hours ago

Pretty doubtful about computer use/screenshotting based approaches.

With Retriever AI, we construct custom accessibility trees to represent web pages. We just switched over to using DeepSeek v4 Flash and it's nearing a 100x cost decrease.

We also had great success just reverse engineering the underlying APIs of websites and then writing code to hit them. This approach of using screenshots to take actions on a webpage to trigger the underlying network calls the website is making seems too naive.
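
For anyone wondering what a compact page representation looks like compared with a screenshot, here is a minimal Python sketch. It is not Retriever AI's implementation: the `ROLES` mapping, `OutlineParser`, and `outline` are hypothetical names, and a real accessibility tree follows far richer rules (ARIA roles, visibility, nesting). It only shows the idea of handing the model a short list of numbered, actionable elements instead of pixels.

```python
from html.parser import HTMLParser

# Hypothetical tag-to-role mapping, for illustration only.
ROLES = {
    "a": "link",
    "button": "button",
    "input": "textbox",
    "textarea": "textbox",
    "select": "combobox",
}
VOID_TAGS = {"input"}  # tags that never have a closing tag or inner text


class OutlineParser(HTMLParser):
    """Collects interactive elements as numbered {id, role, label} entries."""

    def __init__(self):
        super().__init__()
        self.nodes = []
        self._open = None  # index of the node currently collecting inner text

    def handle_starttag(self, tag, attrs):
        role = ROLES.get(tag)
        if role is None:
            return  # skip layout-only elements such as div or p
        a = dict(attrs)
        label = a.get("aria-label") or a.get("placeholder") or a.get("name") or ""
        self.nodes.append({"id": len(self.nodes) + 1, "role": role, "label": label})
        if tag not in VOID_TAGS:
            self._open = len(self.nodes) - 1

    def handle_data(self, data):
        # Fall back to the visible text when no attribute supplied a label.
        if self._open is not None and not self.nodes[self._open]["label"]:
            text = data.strip()
            if text:
                self.nodes[self._open]["label"] = text

    def handle_endtag(self, tag):
        if tag in ROLES:
            self._open = None


def outline(html):
    """Returns one short line per interactive element."""
    parser = OutlineParser()
    parser.feed(html)
    parser.close()
    return [f'[{n["id"]}] {n["role"]} "{n["label"]}"' for n in parser.nodes]
```

An agent can then act by referring to an id (for example, "click [3]") rather than guessing screen coordinates, which is where the token savings would come from.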

  • 0xbadcafebee 9 hours ago

    What happens when you need to control something that isn't a web page?

    • arjunchint 8 hours ago

      Honestly with Fable I think anyone is going to be able to reverse engineer a desktop app and get the coding agent to automate it.

      The Codex computer use functionality actually uses OS-level accessibility trees, so that's also possible without screenshots.

  • infecto 15 hours ago

    Reverse engineering APIs is just a recipe to get blocked sooner. Good luck!

    • arjunchint 7 hours ago

      It's been working fine on LinkedIn/IG; the trick is to make the requests from the main world of the website itself.

fridder 18 hours ago

I wonder if it will be better at building TUIs. It has been absolutely abysmal at interacting with them and building them.

  • chatmasta 18 hours ago

    Claude can build UI but it sucks at testing it and iterating on it. Fable showed some improvements in this regard but alas.

    • Chu4eeno 17 hours ago

      It seems to do it just fine when in desktop applications using Qt, fwiw., it leverages all the standard Qt GUI testing stuff (and if you have the money you can just integrate Squish which has LLM support now).

  • IncreasePosts 14 hours ago

    That's my experience too. I've had increased luck encouraging the LLM to structure the code in "functional core, imperative shell" style, and telling it stupid things like "make sure you can test the code you're writing".
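
    As a rough illustration of that structure (a sketch with made-up names like `Menu`, `update`, `render`, and `run`, not code from any real project): the state transitions and the rendering are pure functions that can be unit-tested without a terminal, and only the thin `run` loop touches I/O.

    ```python
    from dataclasses import dataclass, replace
    from typing import Optional, Tuple


    @dataclass(frozen=True)
    class Menu:
        """Immutable UI state for a tiny, hypothetical menu."""
        items: Tuple[str, ...]
        cursor: int = 0
        chosen: Optional[str] = None


    def update(state, key):
        """Functional core: returns the next state for a key name."""
        if key == "down":
            return replace(state, cursor=min(state.cursor + 1, len(state.items) - 1))
        if key == "up":
            return replace(state, cursor=max(state.cursor - 1, 0))
        if key == "enter":
            return replace(state, chosen=state.items[state.cursor])
        return state  # unknown keys leave the state unchanged


    def render(state):
        """Functional core: returns the screen as a list of lines."""
        return [
            ("> " if i == state.cursor else "  ") + item
            for i, item in enumerate(state.items)
        ]


    def run(read_key=input, write=print):
        """Imperative shell: the only function that performs I/O."""
        state = Menu(items=("Open", "Save", "Quit"))
        while state.chosen is None:
            write("\n".join(render(state)))
            state = update(state, read_key().strip())
        return state.chosen
    ```

    The LLM (or a human) can then test `update` and `render` with plain assertions, and drive `run` with fake `read_key` and `write` callables instead of a real terminal.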

beastman82 19 hours ago

No UI like their competitors Claude CoWork or Codex. This is vaporware

knollimar 18 hours ago

Where is 3.5 pro?

  • squidbeak 17 hours ago

    Google said June, and all its model updates seem to be on Tuesdays, Wednesdays or Thursdays. So unless the release is slipping, either tomorrow or Tuesday.

    • WarmWash 17 hours ago

      Rumor is now July, although preliminary A/B tests people are getting show promise with whatever they have right now.

villgax 19 hours ago

Will it skip Ads lol

  • humblyCrazy 19 hours ago

    I looked at their demo and it does not

    • chatmasta 18 hours ago

      Better question might be will it skip recaptcha?

      • SXX 15 hours ago

        Only if it's needed to save your grandma and a cat. It will hack a few servers along the way.

vulcan1964 9 hours ago

Hot take: "computer use" is a dumb term for this concept; almost as if it was named by AI models...

Case in point: "We are already seeing customers drive value with computer use."

Yes... since the early 1980s, most companies and businesses have driven their value with computer use... smdh.

I'm no AI dev, but dare I suggest a better possible name for this: "agentic computer software interaction" which can be shortened to agent_actor

I swear, the direction we are headed with Big Tech leading the way will surely spell long term disaster

ai_fry_ur_brain 14 hours ago

I have basically unlimited access to every SOTA model and I opt for gemini flash 3.5 9/10 times I use an LLM.

LLMs are mostly useless but when I do use them it's with gemini. If they're going to waste my time 95% of the time, I might as well get it over with fast.

zuzululu 18 hours ago

performance is quite impressive given that it's 3x cheaper than 5.5

  • SoMomentary 16 hours ago

    The speed was impressive when I tested it but unfortunately the accuracy left a lot to be desired. Be interesting to do the math on some of my normal workflows to see where the break even is between them, assuming the tasks you have can tolerate a couple of failures.
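
    A back-of-the-envelope version of that math, as a Python sketch. It assumes attempts are independent and that a failed attempt is simply retried, so the expected cost per completed task is the per-attempt cost divided by the success rate. The function names and the success rates in the note below are made up for illustration; only the 3x price ratio comes from the parent comment.

    ```python
    def expected_cost_per_success(cost_per_attempt, success_rate):
        """Expected spend to get one success when failures are retried.

        Assumes independent attempts, so the number of tries is geometric
        with mean 1 / success_rate.
        """
        if not 0 < success_rate <= 1:
            raise ValueError("success_rate must be in (0, 1]")
        return cost_per_attempt / success_rate


    def break_even_success_rate(cheap_cost, pricey_cost, pricey_success_rate):
        """Success rate at which the cheap model costs the same per success.

        Solves cheap_cost / p = pricey_cost / pricey_success_rate for p.
        """
        if not 0 < pricey_success_rate <= 1:
            raise ValueError("pricey_success_rate must be in (0, 1]")
        return pricey_success_rate * cheap_cost / pricey_cost
    ```

    With a 3x price gap and a pricier model that succeeds 90% of the time, the cheaper model breaks even at roughly a 30% success rate under these assumptions. This ignores the time spent detecting and cleaning up failures, which may well dominate in practice.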

    • zuzululu 13 hours ago

      we are talking about computer use here

      gemini 3.5 flash isn't meant to compete head to head with frontier models on tough problems

paganartifact 15 hours ago

Who are these people talking about "agentic" stuff, and furthermore who are the people who can't stfu about "MCP"??

Literally 90%+ comments on HN personify their alleged use of AI in a way that is in NO WAY related to how the tool is really used.

Using LLMs for building software has NOTHING to do with those concepts. Nobody has "agents". That literally only exists in marketing. It's not even how it works.

AT ALL

Useless forum

shafiemoji 8 hours ago

My work requires me to use `agy cli` (Google AI Ultra) for development, and it's been incredibly frustrating. I strongly dislike the Gemini models because they consistently fail to grasp basic instructions. I also can't use the Claude models included in the AI Ultra plan because the agy cli wrapper makes the experience completely unusable. I'd rather use the free plan on OpenCode than deal with this Gemini setup.

  • perbu 7 hours ago

    They can’t follow instructions at all. They are a year behind Claude.