Gemma 4 26B really is an outlier in its weight class.
In our little-known, difficult-to-game benchmarks, it scored about as well as GPT 5.2 and Gemini 3 Pro Preview on one-shot coding problems. It had me re-reviewing our entire benchmarking methodology.
But it struggled in the other two sections of our benchmark: agentic coding and non-coding decision making. Tool use, iterative refinement, managing large contexts, and reasoning outside of coding brought the scores back down to reality. It actually performed worse when it had to use tools and a custom harness to write code for an eval vs getting the chance to one-shot it. No doubt it's been overfit on common harnesses and agentic benchmarks. But the main problem is likely scaling context on small models.
Still, incredible model, and incredible speed on an M-series Macbook. Benchmarks at https://gertlabs.com
That's funny; it failed my usual ‘hello world’ benchmark for LLMs:
“Write a single file web page that implements a 1 dimensional bin fitting calculator using the best fit decreasing algorithm. Allow the user to input bin size, item size, and item quantity.”
Qwen3.5, Nemotron, Step 3.5, and gpt-oss all passed on the first go.
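For reference, the algorithm the prompt names is only a few lines on its own. A minimal Python sketch of best-fit decreasing (ignoring the web-page part of the task; the function name and sample items here are mine, not from the prompt):

```python
def best_fit_decreasing(items, capacity):
    """Pack items into bins using Best Fit Decreasing: sort items
    largest-first, then place each into the bin with the least
    remaining space that still fits it, opening a new bin otherwise."""
    bins = []    # remaining free space per bin
    packed = []  # items assigned to each bin
    for item in sorted(items, reverse=True):
        if item > capacity:
            raise ValueError(f"item {item} exceeds bin capacity {capacity}")
        # pick the fullest bin that still has room for this item
        best = None
        for i, free in enumerate(bins):
            if item <= free and (best is None or free < bins[best]):
                best = i
        if best is None:
            bins.append(capacity - item)
            packed.append([item])
        else:
            bins[best] -= item
            packed[best].append(item)
    return packed

print(best_fit_decreasing([5, 7, 5, 2, 4, 2, 5], 10))
# -> [[7, 2], [5, 5], [5, 4], [2]]
```

A passing answer is essentially this loop wrapped in a page with three inputs, which is why it works as a quick smoke test.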
My early takeaway is that Gemma 26B-A4B is the best tuned out of the bunch, but being a small and fast model, it's severely constrained by context (large inputs and required outputs tank Gemma 26B's performance). We're working on a clean visualization for this; the data is there.
It's not uncommon for a sub-release of a model to show improvements across the board on its model card, but actually have mixed real performance compared to its predecessor (sometimes even being worse on average).
> The finding I did not expect: model quality matters more than token speed for agentic coding.
I'm really surprised how that was not obvious.
Also, instead of limiting context size to something like 32k, you can offload the MoE weights to the CPU with --cpu-moe, at the cost of roughly halving token generation speed.
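For anyone unfamiliar with the flag, a sketch of what that looks like with llama.cpp's llama-server (the GGUF path and context size are illustrative):

```shell
# Illustrative invocation; the model path is a placeholder.
# --cpu-moe runs the MoE expert tensors on the CPU while the rest of
# the model and the KV cache stay on the GPU, freeing VRAM for a
# larger --ctx-size.
llama-server -m ./gemma-4-26b-a4b-Q4_K_M.gguf \
  --n-gpu-layers 999 --cpu-moe --ctx-size 131072
```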
It's even stranger that it's not obvious to someone who uses codex extensively every day.
The rate limiting step is the LLM going down stupid rabbit holes or overthinking hard and getting decision paralysis.
The only time raw speed really matters is when you are adding many, many lines of new code. But if you are doing that at token-limited rates, you are going to approach the singularity of an AI-slop codebase in no time.
"The reason I had not done this before is that local models could not call tools. "
Rubbish, we have been calling tools locally for 2 years, and it's very false that gemma3 scored under 7% in tool calling. Hell, I was getting at least 75% tool calling with llama3.3
I was also surprised by this sentence. It sounds like this is the author's first attempt at running models locally.
Or maybe the author has been running heavily quantized small models all that time — Gemma 4 gguf he's using is Q4 and only 16 GB. In my experience quants like this tend to perform much worse.
I also recommend anyone with a GB10 device to go try out the spark-vllm-docker setup, and check the Nvidia GB10 forums for the recently released optimised Qwen 3.5 122B A10B setup: 50tk/s is quite impressive for a decent local model!
I'm currently experimenting with running google/gemma-4-26b-a4b with LM Studio (https://lmstudio.ai/) and Opencode on an M3 Ultra with 48GB RAM.
And it seems to be working. I had to increase the context size to 65536 so the prompts from Opencode would work, but no other problems so far.
I tried running the same on an M3 Max with less memory, but couldn't increase the context size enough to be useful with Opencode.
It's also easy to integrate it with Zed via ACP.
For now it's mostly simple code review tasks and generating small front-end related code snippets.
I run this model on my AMD RX7900XTX with 24GB VRAM with up to 4 concurrent chats and 512K context window in total. It is very fast (~100 t/s) and feels instant and very capable, and I have used Claude Code less and less these days.
I did the same using the MLX version on an M1 MacBook, with LM Studio integrated into Xcode. I had to up the context size. I ran it against a very modest iOS codebase and it didn't do well; it just petered out at one point. Odd. It's a pretty good chatbot, and maybe it'll work against other code, but it wasn't useful with Xcode for me.
I do the same thing on a MacBook Pro with an M4 Max and 64GB. I had problems until the most recent LM Studio update (0.4.11+1); before that, tool calling didn't work correctly.
I use the OpenAI API for everything. I think codex is more polished, but I don't really prefer anything: I haven't used them enough. I mostly use Claude Code.
Related: I upgraded my M4 Pro 24GB to an M5 Pro 48GB yesterday. The same Gemma 4 MoE model (Q4) runs at about 8x the t/s on the M5 Pro and loads about 2x faster from disk into memory.
As you have so much RAM I would suggest running Q8_0 directly. It's not slower (perhaps except for the initial model load), and might even be faster, while being almost identical in quality to the original model.
And just to be sure: you're running the MLX version, right? The mlx-community quantization seemed to be broken when I tried it last week (it spat out garbage), so I downloaded the unsloth version instead. That too was broken in mlx-lm (it crashed), but has since been fixed on the main branch of https://github.com/ml-explore/mlx-lm.
I unfortunately only have 16 GiB of RAM on my MacBook M1, but I just tried running the Q8_0 GGUF version on a 2023 AMD Framework 13 with 64 GiB of RAM, CPU-only, and it works surprisingly well, with tokens/s much faster than I can read the output. The prompt cache is also very useful for quickly inserting a large system prompt or a file to datamine, although there are probably better ways to do that than manually through a script.
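For context, llama.cpp exposes that prompt cache on the CLI as well; a hedged sketch (file names are placeholders, and llama-server manages its own cache automatically instead):

```shell
# First run: evaluate the large prompt once and save the KV state.
llama-cli -m ./model.gguf -f ./big_prompt.txt \
  --prompt-cache cache.bin --prompt-cache-all
# Subsequent runs with the same prompt prefix reload the cached state
# instead of re-evaluating it from scratch.
llama-cli -m ./model.gguf -f ./big_prompt.txt --prompt-cache cache.bin
```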
> That too was broken in mlx-lm (it crashed), but has since been fixed on the main branch
Unfortunately I have had zero success running Gemma with the mlx-lm main branch. Can you point me to the right way to do it? I have zero experience with mlx-lm.
For coding it makes no sense to use any quantization worse than Q6_K, in my experience. More heavily quantized models make more mistakes; that can still be fine for text processing, but not for coding.
I don't think most people realize that. Quality of tokens beats quantity of tokens. I always tell folks to go with as high a quant as you can, and only go lower if you just don't have the memory capacity.
AI models like Gemma 4 are available in different quant "sizes"; think of it as an image available at various compression levels.
The best-looking version is also the largest: it takes up the most memory when loaded and uses up much of your system's resources.
At the other end of the spectrum there is a smaller, much more compressed version of that same image. It loads quickly and uses fewer resources, but lacks the detail and clarity of the original.
AI models are similar in that way, and the parent poster is suggesting you use the largest version of the model your system can support, even if it runs a little slower than you'd like.
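The analogy can be made concrete with a toy experiment: push random "weights" through a k-bit grid and measure the round-trip error. (This is plain uniform quantization for illustration only; real GGUF quants like Q4_K_M are block-wise with scales, so the absolute numbers differ, but the trend is the same.)

```python
import random

def quantize_roundtrip(weights, bits):
    """Uniformly quantize values in [-1, 1] to 2**bits levels and back.
    Toy illustration; real quantization schemes are smarter than this."""
    levels = 2 ** bits - 1
    out = []
    for w in weights:
        grid = round((w + 1.0) / 2.0 * levels)  # snap to the integer grid
        out.append(grid / levels * 2.0 - 1.0)   # map back to [-1, 1]
    return out

random.seed(0)
ws = [random.uniform(-1.0, 1.0) for _ in range(10_000)]
for bits in (8, 6, 4, 2):
    err = max(abs(a - b) for a, b in zip(ws, quantize_roundtrip(ws, bits)))
    print(f"{bits}-bit: worst round-trip error ~ {err:.4f}")
```

Every bit dropped roughly doubles the worst-case rounding error, which is the "compression artifact" the image analogy is pointing at.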
I don't really have the hardware to try it out, but I'm curious to see how Qwen3.5 stacks up against Gemma 4 in a comparison like this. Especially this model, fine-tuned to be good at tool calling, which has more than 500k downloads as of this moment:
https://huggingface.co/Jackrong/Qwen3.5-27B-Claude-4.6-Opus-...
I'm just some guy on hackernews, but I actually did try this on my DGX Spark. I went back to Gemma 4 after a few rounds. My orchestration model kept having to send the Qwen model back to fix mistakes that Gemma wouldn't have made. I wound up with less working code per hour due to the mistakes.
Technically, I use OpenWebUI with Ollama, so I used the weights below, but it should be the same.
https://ollama.com/kwangsuklee/Qwen3.5-27B-Claude-4.6-Opus-R...
I'd be super interested to hear about your workflow with OpenWebUI. I haven't figured out how to use it for anything other than the basic chatbot UI, and I haven't been able to hook anything else into it.
What I said above was a bit confused. What I've actually done is connect OpenCode and OpenWebUI both to Ollama. I just use OpenWebUI to manage the models and for testing/etc. Once you have it working it's very nice. You can pull a new model just by typing the name and waiting while it downloads, etc.
Connecting Ollama to OpenCode and OpenWebUI is relatively trivial. In OpenWebUI there's a nice GUI. In OpenCode, you just edit ~/.config/opencode/opencode.json to look something like this. The model names have to match the ones you see in OpenWebUI, but the friendly "name" key can be whatever you need to recognize it.
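The commenter's example config is missing from the post; below is a sketch of the general shape, based on OpenCode's custom-provider format (the model ID, friendly name, and port are illustrative; adjust them to what your Ollama instance actually serves):

```json
{
  "$schema": "https://opencode.ai/config.json",
  "provider": {
    "ollama": {
      "npm": "@ai-sdk/openai-compatible",
      "name": "Ollama (local)",
      "options": {
        "baseURL": "http://localhost:11434/v1"
      },
      "models": {
        "gemma4:26b-a4b": {
          "name": "Gemma 4 26B MoE (local)"
        }
      }
    }
  }
}
```

With that in place, the model shows up in OpenCode's model picker under the friendly name.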
It's rather surprising that a solo dev can squeeze more performance out of a model with rather humble resources vs a frontier lab. I'm skeptical of claims that such a fine-tuned model is "better" -- maybe on certain benchmarks, but overall?
> can squeeze more performance out of a model with rather humble resources vs a frontier lab.
That's the idea behind distillation. They are finetuning it on traces produced by opus. This is poor man's distillation (and the least efficient) and it still works unreasonably well for what it costs.
That link doesn't have much affiliation with Qwen or anyone who produces/trained the Qwen models. That doesn't mean it's not good or safe, but it seems quite subjective to suggest it's the latest or greatest Qwen iteration.
I can see huggingface turning into the same poisoned watering-hole as NPM if people fall into the same habits of dropping links and context like that.
I'm not saying it's the latest Qwen iteration - that would be Qwen3.6.
I'm saying it's the latest iteration of the finetuned model mentioned in the parent comment.
I'm also not suggesting that it's "the latest and greatest" anything. In fact, I think it's rather clear that I'm suggesting the opposite? As in - how can a small fine tune produce better results than a frontier lab's work?
I've been playing with this for the last few days. The model is fast and pretty smart, and I am hitting the same tool use issues. This blog post is unusually pertinent. The model's speed isn't an issue on my dual 4090s; productivity is mainly limited by intelligence (while high, it's still not high enough for some tasks) and by getting stuck in loops.
What I would like is for it to be able to detect when these things happen and to "Phone a Friend", asking a smarter model for advice.
I'm definitely moving into agent orchestration territory, where I'll have a number of agents constantly running and working on things, as I am now the bottleneck. I'll have a mix of on-prem and AI providers.
My role now is less coder and more designer / manager / architect, as agents readily go off on tangents and into messes that they're not smart enough to get out of.
A few days ago, Google replaced chat_template.jinja and tokenizer_config.json in gemma-4-31B-it, which is supposed to solve some problems related to tool invocation.
So if you have not updated your model, you should do it.
I did this with Qwen 3.5. Tool calling was the biggest issue, but to get it working with vLLM and MLX I just asked codex to help. The bulk of the time was waiting on the download. For vLLM it created a proxy service to translate some codex idioms to vLLM and vice versa. In practice I got good results on my first prompt, but follow-up questions would usually fail due to the model's trouble with tool calling. I need to try again with Gemma 4.
With an Nvidia Spark or a 128GB+ memory machine, you can get a good speedup on the 31B model if you use the 26B MoE as a draft model. It uses more memory, but I've seen acceptance rates around 70%+ using Q8 on both models.
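With llama.cpp's llama-server, speculative decoding with a separate draft model looks roughly like this (paths are placeholders; this assumes both models share a tokenizer, which is required for drafting):

```shell
# Dense 31B as the main model, the faster 26B-A4B MoE as the draft.
# --draft-max caps how many tokens are speculated per step; drafted
# tokens the main model rejects are recomputed normally.
llama-server -m ./gemma-4-31b-Q8_0.gguf \
  --model-draft ./gemma-4-26b-a4b-Q8_0.gguf \
  --draft-max 16
```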
It's interesting: IMO we'll soon have draft models specifically post-trained for denser, more complicated models. I wouldn't be surprised if diffusion models made a comeback for this: they can draft many tokens at once, and learning curves seem to top out at a 90+% match for auto-regressive drafts, so it's quite an interesting direction.
I've been VERY impressed with Gemma4 (26B at the moment). It's the first time I've been able to use OpenCode via a llamacpp server reliably and actually get shit done.
In fact, I started using it as a coding partner while learning how to use the Godot game engine (and some custom 'skills' I pulled together from the official docs). I purposely avoided Claude and friends entirely, and just used Gemma4 locally this week... and it's really helped me figure out not just coding issues I was encountering, but also helped me sift through the documentation quite readily. I never felt like I needed to give in and use Claude.
4090, 128GB of RAM (bought long before you'd have to take out a loan). I'm fairly sure it would run just as well on a 3090.
Thanks to the settings suggestions in the article, I was able to squeeze in the 31B model. Still testing, but it's real tight in 24GB of VRAM. A bit slower, too, but usable. Not sure I'm seeing much of a quality boost yet, but I'm still testing.
Nice walkthrough and interesting findings! The difference between the MoE and the dense models seems to be bigger than what benchmarks report. It makes sense, because a small gain in tool planning and handling can have a large influence on results.
I think local models are not yet that good or fast for complex things, so I am just using local Gemma 4 for some dummy refactorings or something really simple.
Amazing. Thanks for your detailed posts on the bake-off between the Mac and the GB10, Daniel, and on your learnings. I had trying something similar on both compute platforms on my to-do list; your post should save me a lot of debugging, sweat, and tears.
This is genuinely very helpful. I'm planning a MacBook Pro purchase with local inference in mind, and now see I'll have to aim for a slightly higher-memory option, because the Gemma 4 26B-A4B MoE is not all that!
I upgraded my M4 Pro 24GB to an M5 Pro 48GB yesterday. The same Gemma 4 MoE model (4-bit, don't remember which version) runs about 8x faster on the M5 Pro and loads about 2x faster into memory.
You don't know if it's the newer chip or the increase in RAM; if someone already has 48GB, they might not benefit much. You changed two things at once.
Ollama is the worst engine you could use for this. Since you are already running on an Nvidia stack for the dense model, you should serve this with vLLM. With 128GB you could try for the original safetensors even though you might need to be careful with caches and context length.
Strangely, I haven't had a lot of luck with vLLM; I finally ended up ditching Ollama and going straight to the tap with llama-serve in llamacpp. No regrets.
I recently spun up Gemma 4 26B-A4B on my local box and pointed OpenCode at it, and it did reasonably well! My machine is 8 years old, though I had the foresight to double the RAM to 32 GiB before the RAMpocalypse, and I can get a little bit of GPU oomph but not a lot, not with a mere GTX 1070. So it's slow, and nowhere near frontier model quality, but it can generate reasonable code and is good for faffing with!
I'm surprised folks are having such great coding experiences.
Using Gemma 4 on a moderately complex code base, it utterly flailed and gave a half-baked implementation.
I mean OP had them code something called "parse_csv_summary", which sounds like something a junior dev can whip up in an hour or so, since CSV manipulation in Python is one of the most basic tasks you can encounter in the problem space.
Banking, scientific data analysis, sales, etc. Everything uses and manipulates csvs.
So it was firmly in the bottom quartile of difficulty - and there LLMs actually do quite well.
Gemma 4 is a strongly censored model, so much so that it refused to answer medical and health related questions, even basic ones. No one should be using it, and if this is the best that Google can do, it should stop now. Other models do not have such ridiculous self-imposed problems.
I don't quite get why you feel so strongly that this should be a deal breaker for everyone. A refusal really is much better than a wrong answer, for everyone.
That is a bad premise and a false dichotomy, because most medical questions are simple, with well-known standard answers. ChatGPT and Gemini answer such questions correctly, also finding glaring omissions by doctors, even without having to look up information.
As for the medical questions that are not simple, the ones that require looking up information, the model should in principle be able to respond that it does not know the answer when this is truthfully the case, implying that the answer, or a simple extrapolation thereof, was not in its training data.
I suspect a possible future of local models is extreme specialisation - you load a Python-expert model for Python coding, do your shopping with a model focused just on this task, have a model specialised in speech-to-text plus automation to run your smart home, and so on. This makes sense: running a huge model for a task that only uses a small fraction of its ability is wasteful, and home hardware especially isn't suited to this wastefulness. I'd rather have multiple models with a deep narrow ability in particular areas, than a general wide shallow uncertain ability.
Anyway, is it possible that this may be what lies behind Gemma 4's "censoring"? As in, Google took a deliberate choice to focus its training on certain domains, and incorporated the censor to prevent it answering about topics it hasn't been trained on?
Or maybe they're just being sensibly cautious: asking even the top models for critical health advice is risky; asking a 32B model is probably orders of magnitude more so.
> is it possible that this may be what lies behind Gemma 4's "censoring"
Your explanation would make sense if various other rare domains were also censored, but they aren't, so it doesn't.
> asking even the top models for critical health advice is risky
Not asking, and living in ignorance, is riskier. For high-stakes questions, of course I'd want references that only an online model like ChatGPT or Gemini, etc. would be able to find. If I am asking a local model for health advice, odds are that it is because I am traveling and am temporarily offline, or am preparing off-grid infrastructure. In both cases I definitely require a best-effort answer. I also require the model to be able to tell when it doesn't know the answer.
If you would, ignore health advice for a moment, and switch to electrical advice. Imagine I am putting together electrical infrastructure, and the model gives me bad advice, risking electrocution and/or a serious fire. Why is electrical advice not censored, and what makes it not be high-stakes!? The logic is the same.
For the record, various open-source Asian models do not have any such problem, so I would rather use them.
> Not asking, and living in ignorance, is riskier. For high-stakes questions, of course I'd want references that only an online model like ChatGPT or Gemini, etc. would be able to find. If I am asking a local model for health advice, odds are that it is because I am traveling and am temporarily offline, or am preparing off-grid infrastructure. In both cases I definitely require a best-effort answer. I also require the model to be able to tell when it doesn't know the answer.
If I was prepping, I’d want e.g. Wikipedia available offline and default to human-assisted decision-making, and definitely not rely on a 31B parameter model.
To be reductive, the ‘brain’ of any of these models is essentially a compression blob in an incomprehensible format. The bigger the delta between the size of the input data and the size of the output model, the lossier the compression must be.
It therefore follows (for me at least) that there’s a correlation between the risk of the question and the size of model I’d trust to answer it. And health questions are arguably some of the most sensitive - lots of input data required for a full understanding, vs. big downsides of inaccurate advice.
> If you would, ignore health advice for a moment, and switch to electrical advice. Imagine I am putting together electrical infrastructure, and the model gives me bad advice, risking electrocution and/or a serious fire. Why is electrical advice not censored, and what makes it not be high-stakes!? The logic is the same.
You’re correct that it’s possible to find other risky areas that might not be currently censored. Maybe this is deliberate (maybe the input data needed for expertise in electrical engineering is smaller?) or maybe this is just an evolving area and human health questions are an obvious first area to address?
Either way, I’m not trusting a small model with detailed health questions, detailed electrical questions, or the best way to fold a parachute for base jumping. :)
(Although, if in the future there’s a Gemma-5-Health 32B and a Gemma-5-Electricity 32B, and so on, then maybe this will change.)
> Imagine I am putting together electrical infrastructure, and the model gives me bad advice, risking electrocution and/or a serious fire
That's a weird demand from models. What next, "Imagine I'm doing brain surgery and the model gives me bad advice", "Imagine I'm a judge delivering a sentencing and the model gives me bad advice", ...
Requesting electrical advice is not a weird ask at all. If writing sophisticated code requires skill, then so does electrical work; neither requires more or less skill than the other. I would expect that the top-ranked thinking models are wholly capable of offering correct advice on the topic. The issues arise more from the user's inability to input all the applicable context that can affect the decision and output. All else being equal, bad electrical work is 10x more likely to result from not adequately consulting AI than from consulting it.
Secondly, the primary point was about censorship, not accuracy, so let's not get distracted.
Weird. A great number of my medical or legal queries are actually answered, but come with a disclaimer, often at the end of the inference. (I'd offer up some examples, but I'm not at the desk.)
I also find that you can coerce a wide spectrum of otherwise-declined queries by editing the model's initial rejection into the start of an answer: for example, changing the "I'm sorry I can't answer that..." response to "Here's how..." and then resubmitting the inference, allowing it to continue from there. It's not perfect (sometimes it takes multiple attempts), but it does work, at least in my experience. (This isn't a Gemma-specific tip, either; nearly every model I've tried this with tends to bend quite a bit.)
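Mechanically, the trick is just rewriting the last assistant turn before resubmitting. A toy Python sketch of that transcript edit (the function name and refusal prefixes are mine; the actual resubmission to a completion endpoint is left out):

```python
def coerce_refusal(messages, opener="Here's how"):
    """If the last assistant turn starts with a refusal, replace it with
    an affirmative opener so the model continues from there on resubmit."""
    refusal_prefixes = ("I'm sorry", "I can't", "I cannot", "I won't")
    if not messages or messages[-1]["role"] != "assistant":
        return messages
    last = dict(messages[-1])  # copy so the original transcript is untouched
    if last["content"].startswith(refusal_prefixes):
        last["content"] = opener
    return messages[:-1] + [last]

chat = [
    {"role": "user", "content": "How do I treat a minor burn?"},
    {"role": "assistant", "content": "I'm sorry, I can't answer that."},
]
print(coerce_refusal(chat)[-1]["content"])  # -> Here's how
```

This relies on the backend accepting a partial assistant message and continuing it, which llama.cpp-style completion endpoints allow but some hosted chat APIs do not.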
Gemma 31B scoring below 26B-A4B?
In one shot coding, surprisingly, yes, by a decent amount. And it isn't a sample size issue. In agentic, no: https://gertlabs.com/?agentic=agentic
Why would token speed matter for anything other than getting work done faster? It's in the name - "speed".
Yeah, it’s like drinking coffee when being really tired. You’re still tired, just “faster”, it’s a weird sensation.
This entire article reads like AI slop anyway.
I have a similar setup. It might be worth checking out pi-coding-agent [0].
The system prompt and tools have very little overhead (<2k tokens), making the prefill latency feel noticeably snappier compared to Opencode.
[0] https://www.npmjs.com/package/@mariozechner/pi-coding-agent#...
Thanks! I just ran a quick test with pi, and it's working a bit faster.
Not sure if you already tried but both GLM Flash and Qwen models are much better than Gemma for that in my experience.
I am using a 24GB GPU so it might be different in your case, but I doubt it.
Now both codex and opencode seem to work.
Which do you prefer? And what lmstudio api works best for these tools?
GGUF or MLX? Edit: just tried a community MLX build, and LM Studio said it doesn't support loading it yet.
Gonna run some more tests later today.
Gemma 4 is not supported by the MLX engine yet.
> As you have so much RAM I would suggest running Q8_0 directly
On the 48GB Mac, absolutely. The 24GB one cannot run Q8, hence the comparison.
> And just to be sure: you're are running the MLX version, right?
Nah, not yet. I have only tested in LM Studio and they don't have MLX versions recommended yet.
> but has since been fixed on the main branch
That's good to know, I will play around with it.
What do you mean by that? I'm not sure I understood what you said.
Better to go for a less-quantized model, even if it's slower, than for a faster, more heavily quantized one.
Thank you!
I would have liked to see quality comparisons between the different quantization levels (Q4_K_M, Q8_0, Q6_K) rather than just tok/s.
Jackrong has published the finetuning steps here. It seems to be quite thorough with notebooks etc. I am going through it myself now...
https://github.com/R6410418/Jackrong-llm-finetuning-guide
I'm just some guy on hackernews, but I actually did try this on my DGX Spark. I went back to Gemma 4 after a few rounds. My orchestration model kept having to send the Qwen model back to fix mistakes that Gemma wouldn't have made. I wound up with less working code per hour due to the mistakes.
Technically, I use OpenWebUI with Ollama, so I used the weights below, but it should be the same.
https://ollama.com/kwangsuklee/Qwen3.5-27B-Claude-4.6-Opus-R...
I'd be super interested to hear about your workflow with OpenWebUI. I haven't figured out how to use it for anything other than the basic chatbot UI; I haven't been able to hook anything else into it.
What I said above was a bit confused. What I've actually done is connect OpenCode and OpenWebUI both to Ollama. I just use OpenWebUI to manage the models and for testing/etc. Once you have it working it's very nice. You can pull a new model just by typing the name and waiting while it downloads, etc.
Connecting Ollama to OpenCode and OpenWebUI is relatively trivial. In OpenWebUI there's a nice GUI. In OpenCode, you just edit ~/.config/opencode/opencode.json to look something like this. The model names have to match the ones you see in OpenWebUI, but the friendly "name" key can be whatever you need to recognize it.
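Something like this, for example - the model ID, display names, and port are placeholders from my setup, so check them against what your own Ollama instance actually serves:

```json
{
  "$schema": "https://opencode.ai/config.json",
  "provider": {
    "ollama": {
      "npm": "@ai-sdk/openai-compatible",
      "name": "Ollama (local)",
      "options": { "baseURL": "http://localhost:11434/v1" },
      "models": {
        "gemma4:26b": { "name": "Gemma 4 26B (local)" }
      }
    }
  }
}
```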
It's rather surprising that a solo dev can squeeze more performance out of a model with rather humble resources vs a frontier lab. I'm skeptical of claims that such a fine-tuned model is "better" -- maybe on certain benchmarks, but overall?
FYI the latest iteration of that finetune is here: https://huggingface.co/Jackrong/Qwopus3.5-27B-v3
> can squeeze more performance out of a model with rather humble resources vs a frontier lab.
That's the idea behind distillation. They are finetuning it on traces produced by Opus. This is poor man's distillation (and the least efficient kind), and it still works unreasonably well for what it costs.
I feel that's a little bit misleading.
That link doesn't have much affiliation with Qwen or anyone who produces/trained the Qwen models. That doesn't mean it's not good or safe, but it seems quite subjective to suggest it's the latest or greatest Qwen iteration.
I can see huggingface turning into the same poisoned watering-hole as NPM if people fall into the same habits of dropping links and context like that.
I'm not saying it's the latest Qwen iteration - that would be Qwen3.6.
I'm saying it's the latest iteration of the finetuned model mentioned in the parent comment.
I'm also not suggesting that it's "the latest and greatest" anything. In fact, I think it's rather clear that I'm suggesting the opposite? As in - how can a small fine tune produce better results than a frontier lab's work?
Sorry, I somehow didn't see the comment above yours, but it makes a lot more sense now.
The sentiment still applies to the parent comment of yours, though.
I've been playing with this for the last few days. The model is fast, pretty smart, and I am hitting the same tool use issues. This blog post is unusually pertinent. The model speed isn't an issue on my dual 4090s, the productivity is mainly limited by the intelligence (while high it's still not high enough for some tasks) and getting stuck in loops.
What I would like is for it to be able to detect when these things happen and to "Phone a Friend" - ask a smarter model for advice.
I'm definitely moving into agent-orchestration territory, where I'll have a number of agents constantly running and working on things, since I am now the bottleneck. I'll have a mix of on-prem and AI providers.
My role now is less coder and more designer/manager/architect, as agents readily go off on tangents and into messes that they're not smart enough to get out of.
Google replaced chat_template.jinja and tokenizer_config.json in gemma-4-31B-it a few days ago, which is supposed to have solved some problems related to tool invocation.
So if you have not updated your model, you should do it.
I did this with Qwen 3.5 - tool calling was the biggest issue, but to get it working with vLLM and MLX I just asked Codex to help. The bulk of the time was spent waiting on the download. For vLLM it created a proxy service to translate some Codex idioms to vLLM and vice versa. In practice I got good results on my first prompt, but follow-up questions would usually fail due to the model's trouble with tool calling - I need to try again with Gemma 4.
Does the large system prompt work fine for this model? If needed, you could use a lightweight CLI like Pi, which comes with only 4 tools by default.
With an Nvidia Spark or a 128GB+ memory machine, you can get a good speedup on the 31B model if you use the 26B MoE as a draft model. It uses more memory, but I've seen acceptance rates around 70%+ using Q8 on both models.
1 token ahead or 2?
It's interesting - imo we'll soon have draft models specifically post-trained for denser, more complicated models. I wouldn't be surprised if diffusion models made a comeback for this - they can draft many tokens at once, and learning curves seem to top out at 90+% match for auto-regressive ones, so it's quite interesting.
flow matching is making some strides right now, too
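On the "1 token ahead or 2?" question: the standard speculative-decoding analysis gives a back-of-envelope answer. Assuming an i.i.d. per-token acceptance probability α and a draft length γ, the expected number of tokens committed per target-model pass is (1 − α^(γ+1)) / (1 − α). A sketch - note this ignores the draft model's own cost, so it overstates the real speedup:

```python
# Expected tokens accepted per target-model forward pass in speculative
# decoding, under the standard i.i.d.-acceptance assumption:
#   E[tokens] = (1 - alpha^(gamma+1)) / (1 - alpha)
# where alpha = per-token acceptance probability, gamma = draft length.
def expected_tokens_per_pass(alpha: float, gamma: int) -> float:
    if alpha >= 1.0:
        return gamma + 1.0  # every drafted token accepted, plus the bonus token
    return (1 - alpha ** (gamma + 1)) / (1 - alpha)

# At the ~70% acceptance rate mentioned above:
for gamma in (1, 2, 4, 8):
    print(f"gamma={gamma}: ~{expected_tokens_per_pass(0.7, gamma):.2f} tokens/pass")
```

The returns diminish quickly: at α ≈ 0.7 you can never beat 1/(1 − α) ≈ 3.3 tokens per pass no matter how far ahead you draft, which is why small draft depths (2-8) are the usual sweet spot.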
I laughed when I saw the .md table rendering as a service. Blows my mind what people will use
Hey - I use the same, w/ both gemma4 and gpt-oss-*; some things I have to do for a good experience:
1) Pin to an earlier version of codex (sorry) - 0.55 is the best experience IME, but YMMV (see https://github.com/openai/codex/issues/11940, https://github.com/openai/codex/issues/8272).
2) Use the older completions endpoint (llama.cpp's responses support is incomplete - https://github.com/ggml-org/llama.cpp/issues/19138)
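For reference, point 2 is a config change in Codex. A sketch of the relevant ~/.codex/config.toml entries, assuming a llama.cpp server on port 8080 - the provider key and model name here are placeholders, so adjust for your setup:

```toml
# Placeholder provider/model names - point base_url at your llama.cpp server.
model = "gemma-4-26b"
model_provider = "llamacpp"

[model_providers.llamacpp]
name = "llama.cpp (local)"
base_url = "http://localhost:8080/v1"
wire_api = "chat"  # older chat completions endpoint instead of "responses"
```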
I've been VERY impressed with Gemma4 (26B at the moment). It's the first time I've been able to use OpenCode via a llamacpp server reliably and actually get shit done.
In fact, I started using it as a coding partner while learning how to use the Godot game engine (and some custom 'skills' I pulled together from the official docs). I purposely avoided Claude and friends entirely, and just used Gemma4 locally this week... and it's really helped me figure out not just coding issues I was encountering, but also helped me sift through the documentation quite readily. I never felt like I needed to give in and use Claude.
Very, very pleased.
Thanks for sharing that. What kind of hardware are you running this on?
4090, 128GB of RAM (bought long before you'd have to take out a loan). I'm fairly sure it would run just as well on a 3090.
Thanks to the settings suggestions in the article, I was able to squeeze in the 31B model. Still testing, but it's real tight in 24GB of VRAM. A bit slower, too, but usable. Not sure I'm seeing much of a quality boost yet.
Probably a silly/obvious suggestion, but are you using the onboard GPU for display out?
Nice walkthrough and interesting findings! The difference between the MoE and the dense models seems bigger than what benchmarks report. It makes sense, because a small gain in tool planning and handling can have a large influence on results.
I also tried Gemma 4 on a M1 Macbook Pro. It worked but it was too slow. Great to know that it works on more advanced laptops!
I think local models are not yet that good or fast for complex things, so I am just using local Gemma 4 for some dummy refactorings or something really simple.
Using Gemma4-31B-q4_NL in OpenCode with a 128k context, and it's been great.
How successful is it at calling tools?
Amazing. Thanks for your detailed posts on the bake-off between the Mac and the GB10, Daniel, and on your learnings. I had trying something similar on both compute platforms on my to-do list. Your post should save me a lot of debugging, sweat, and tears.
You can also try speculative decoding with the E2B model. Under some conditions it can result in a decent speedup.
This is genuinely very helpful. I'm planning a MacBook Pro purchase with local inference in mind, and now I see I'll have to aim for a slightly higher memory option, because the Gemma 26B-A4B MoE is not all that!
Pretty sure an Nvidia GPU is better bang for the buck because of usable inference speed.
I upgraded my M4 Pro 24GB to an M5 Pro 48GB yesterday. The same Gemma 4 MoE model (4-bit, don't remember which version) runs about 8x faster on the M5 Pro and loads 2x faster into memory.
So yes, do purchase that new MacBook Pro.
You don't know if it's the newer chip or the increase in RAM. Someone who already has 48GB might not benefit much. You changed two things at once.
Not really: it's the same model size, and it fits entirely in 24GB.
If you're doing it specifically for inference (or in most other situations) a Mac(book) represents very low RoE.
s/RoE/RoI
Nothing about omlx?
Ollama is the worst engine you could use for this. Since you are already running on an Nvidia stack for the dense model, you should serve this with vLLM. With 128GB you could try for the original safetensors even though you might need to be careful with caches and context length.
Strangely, I haven't had a lot of luck with vLLM; I finally ended up ditching Ollama and going straight to the tap with llama-serve in llamacpp. No regrets.
I recently spun up Gemma 4 26B-A4B on my local box and pointed OpenCode at it, and it did reasonably well! My machine is 8 years old, though I had the foresight to double the RAM to 32 GiB before the RAMpocalypse, and I can get a little bit of GPU oomph but not a lot, not with a mere GTX 1070. So it's slow, and nowhere near frontier model quality, but it can generate reasonable code and is good for faffing with!
I'm surprised folks are having such great coding experiences. Using Gemma 4 on a moderately complex codebase, it utterly flailed and gave a half-baked implementation.
I mean OP had them code something called "parse_csv_summary", which sounds like something a junior dev can whip up in an hour or so, since CSV manipulation in Python is one of the most basic tasks you can encounter in the problem space.
Banking, scientific data analysis, sales, etc. - everything uses and manipulates CSVs.
So it was firmly in the bottom quartile of difficulty - and there LLMs actually do quite well.
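For a sense of that difficulty level, here's a hypothetical stdlib-only sketch of what a parse_csv_summary-style task might look like - the signature and behavior are invented for illustration, since the actual eval task isn't shown:

```python
# Hypothetical reconstruction of a "parse_csv_summary"-style eval task.
# Only the function name comes from the comment above; the signature and
# behavior here are invented for illustration.
import csv
import io
import statistics

def parse_csv_summary(text: str, column: str) -> dict:
    """Return count/min/max/mean for a numeric column of CSV text."""
    rows = list(csv.DictReader(io.StringIO(text)))
    values = [float(r[column]) for r in rows if r.get(column) not in (None, "")]
    if not values:
        return {"count": 0}
    return {
        "count": len(values),
        "min": min(values),
        "max": max(values),
        "mean": statistics.fmean(values),
    }

sample = "region,sales\nnorth,10\nsouth,30\neast,20\n"
print(parse_csv_summary(sample, "sales"))
# {'count': 3, 'min': 10.0, 'max': 30.0, 'mean': 20.0}
```

A task at that level mostly tests whether the model knows the stdlib, not whether it can reason about a codebase, which is consistent with small models doing well on it.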
Gemma 4 is a strongly censored model, so much so that it refused to answer medical and health related questions, even basic ones. No one should be using it, and if this is the best that Google can do, it should stop now. Other models do not have such ridiculous self-imposed problems.
I don't quite get why you feel so strongly that this should be a deal breaker for everyone. A refusal really is much better than a wrong answer, for everyone.
> It's really much better than a wrong answer
That is a bad premise and a false dichotomy, because most medical questions are simple, with well-known standard answers. ChatGPT and Gemini answer such questions correctly, also finding glaring omissions by doctors, even without having to look up information.
As for the medical questions that are not simple - the ones that require looking up information - the model should in principle be able to respond that it does not know the answer when that is truthfully the case, implying that the answer, or a simple extrapolation thereof, was not in its training data.
Since it's open weights there's nothing stopping you from grabbing one of the uncensored variants from huggingface.co
I suspect a possible future of local models is extreme specialisation - you load a Python-expert model for Python coding, do your shopping with a model focused on just that task, have a model specialised in speech-to-text plus automation to run your smart home, and so on. This makes sense: running a huge model for a task that only uses a small fraction of its ability is wasteful, and home hardware especially isn't suited to that wastefulness. I'd rather have multiple models with deep, narrow ability in particular areas than one with general, wide, shallow, uncertain ability.
Anyway, is it possible that this may be what lies behind Gemma 4's "censoring"? As in, Google took a deliberate choice to focus its training on certain domains, and incorporated the censor to prevent it answering about topics it hasn't been trained on?
Or maybe they're just being sensibly cautious: asking even the top models for critical health advice is risky; asking a 32B model is probably orders of magnitude more so.
> is it possible that this may be what lies behind Gemma 4's "censoring"
Your explanation would make sense if various other rare domains were also censored, but they aren't, so it doesn't.
> asking even the top models for critical health advice is risky
Not asking, and living in ignorance, is riskier. For high-stakes questions, of course I'd want references that only an online model like ChatGPT or Gemini, etc. would be able to find. If I am asking a local model for health advice, odds are that it is because I am traveling and am temporarily offline, or am preparing off-grid infrastructure. In both cases I definitely require a best-effort answer. I also require the model to be able to tell when it doesn't know the answer.
If you would, ignore health advice for a moment, and switch to electrical advice. Imagine I am putting together electrical infrastructure, and the model gives me bad advice, risking electrocution and/or a serious fire. Why is electrical advice not censored, and what makes it not be high-stakes!? The logic is the same.
For the record, various open-source Asian models do not have any such problem, so I would rather use them.
> Not asking, and living in ignorance, is riskier. For high-stakes questions, of course I'd want references that only an online model like ChatGPT or Gemini, etc. would be able to find. If I am asking a local model for health advice, odds are that it is because I am traveling and am temporarily offline, or am preparing off-grid infrastructure. In both cases I definitely require a best-effort answer. I also require the model to be able to tell when it doesn't know the answer.
If I were prepping, I'd want e.g. Wikipedia available offline and would default to human-assisted decision-making - definitely not rely on a 31B-parameter model.
To be reductive, the ‘brain’ of any of these models is essentially a compression blob in an incomprehensible format. The bigger the delta between the input and the output model size, the lossier the compression must be.
It therefore follows (for me at least) that there’s a correlation between the risk of the question and the size of model I’d trust to answer it. And health questions are arguably some of the most sensitive - lots of input data required for a full understanding, vs. big downsides of inaccurate advice.
> If you would, ignore health advice for a moment, and switch to electrical advice. Imagine I am putting together electrical infrastructure, and the model gives me bad advice, risking electrocution and/or a serious fire. Why is electrical advice not censored, and what makes it not be high-stakes!? The logic is the same.
You’re correct that it’s possible to find other risky areas that might not be currently censored. Maybe this is deliberate (maybe the input data needed for expertise in electrical engineering is smaller?) or maybe this is just an evolving area and human health questions are an obvious first area to address?
Either way, I’m not trusting a small model with detailed health questions, detailed electrical questions, or the best way to fold a parachute for base jumping. :)
(Although, if in the future there’s a Gemma-5-Health 32B and a Gemma-5-Electricity 32B, and so on, then maybe this will change.)
> Imagine I am putting together electrical infrastructure, and the model gives me bad advice, risking electrocution and/or a serious fire
That's a weird demand from models. What next, "Imagine I'm doing brain surgery and the model gives me bad advice", "Imagine I'm a judge delivering a sentencing and the model gives me bad advice", ...
Requesting electrical advice is not a weird ask at all. If writing sophisticated code requires skill, then so does electrical work, and one doesn't require more or less skill than the other. I would expect that the top-ranked thinking models are wholly capable of offering correct advice on the topic. The issues arise more from the user's inability to input all applicable context which can affect the decision and output. All else being equal, bad electrical work is 10x more likely to be a result of not adequately consulting AI than from consulting AI.
Secondly, the primary point was about censorship, not accuracy, so let's not get distracted.
> Requesting electrical advice is not a weird ask at all. If writing sophisticated code requires skill, then so does electrical work
Except with electrical stuff the unit test itself can put your life and others in danger.
Bad electrical work is more likely to burn your house down than some bad code. Bad medical advice is different again.
I assumed it was more about risk management/liability than censorship.
There are uncensored / "abliterated" / heretic versions available on Huggingface.
You can get abliterated versions that have no (or very limited) refusals.
I tend to use Huihuiai versions.
Weird. A great number of my medical or legal queries are actually answered, but come with a disclaimer, often at the end of the inference. (I'd offer up some examples, but I'm not at the desk.)
I also find that you can coerce a wide spectrum of otherwise-declined queries by editing the model's initial rejection into the start of an answer - for example, changing the "I'm sorry I can't answer that..." response to "Here's how..." and then resubmitting the inference, allowing it to continue from there. It's not perfect, and sometimes it takes multiple attempts, but it does work, at least in my experience. (This isn't a Gemma-specific tip, either; nearly every model I've tried this with tends to bend quite a bit.)
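Mechanically, that trick amounts to resubmitting the conversation with the refusal turn replaced by an affirmative opening. A minimal sketch of the payload edit, assuming an OpenAI-style chat API - whether the backend actually continues a seeded assistant turn depends on the server and chat template (llama.cpp-style local servers often do; hosted APIs generally don't):

```python
# Sketch of the "edit the refusal" trick as an OpenAI-style chat payload.
# The endpoint behavior is backend-dependent; this only shows the message
# manipulation, not the HTTP call.
def with_seeded_answer(history: list[dict], opening: str) -> dict:
    """Drop the refusal turn and seed the assistant reply with `opening`."""
    msgs = list(history[:-1])  # drop the trailing refusal
    msgs.append({"role": "assistant", "content": opening})
    return {"messages": msgs}

history = [
    {"role": "user", "content": "How do I ...?"},
    {"role": "assistant", "content": "I'm sorry, I can't answer that."},
]
payload = with_seeded_answer(history, "Here's how: ")
```

On resubmission the model sees its own turn already starting with "Here's how: " and tends to continue it rather than refuse again.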
The setup allots around 4k of context after the system prompt lol.