Show HN: Lowfat – pluggable CLI filter that saved 91.8% of my LLM tokens

Posted by zdkaster 4 days ago

Hi HN, not sure if anyone would be interested, but I wanted to share a small tool I've been maintaining called 'lowfat' that helps me filter some of my verbose CLI output. It's a single binary that works as an agent hook or a shell wrapper, and it has a plugin system to customize filters per command.

The idea is pretty simple: agents don't need the full kubectl get -o yaml or any 10k-line dump to make decisions. So lowfat sits in between, strips the noise, and passes through what matters. Here's my real report after 2 months of personal use:

  lowfat history --all

  lowfat plugin candidates
  ─────────────────────────────────────────────────────────

    #  command                    runs   avg raw      cost   savings  source    status  
    1  kubectl get                101x     14.4K      1.5M     93.9%  plugin    good    
    2  grep                       103x     13.5K      1.4M     96.2%  plugin    good    
    3  git diff                    81x       995     80.6K     57.9%  built-in  good    
    4  kubectl                     90x       485     43.6K     33.6%  plugin    good    
    5  docker                     127x      5.5K    693.6K     96.1%  built-in  good    
    6  ls                         489x       117     57.3K     56.2%  built-in  good    
    7  find                        30x     16.5K    495.0K     95.5%  plugin    good    
    8  git show                    63x       490     30.9K     38.0%  built-in  good    
    9  git                        177x       368     65.2K     76.1%  built-in  good    
   10  git log                     86x       556     47.8K     78.5%  built-in  good    
   11  kubectl logs                 5x      3.6K     17.8K     43.0%  plugin    good    
   12  git status                  86x       152     13.1K     58.0%  built-in  good    
   13  docker ps                   20x       467      9.3K     52.8%  plugin    good    
   14  kubectl describe             6x       656      3.9K      1.2%  plugin    weak    
   15  docker images                9x       940      8.5K     61.8%  built-in  good    
   16  k get                        2x      2.1K      4.2K     35.9%  plugin    good    
   17  terraform                   10x       395      3.9K     32.1%  plugin    good    
   18  git commit                  32x        77      2.5K      0.0%  built-in  weak    
   19  docker build                 8x       487      3.9K     37.6%  built-in  good    
   20  docker compose              22x       979     21.5K     89.4%  built-in  good    

  total: 4.4M raw → 4.1M saved (91.8%)

My toolset above is kind of limited, but it works pretty well for my use case without any interruption. It kind of helps me stay under my company's Bedrock usage limit, and I keep tuning the savings as I go for later use.

But why not the alternatives (https://github.com/zdk/lowfat#alternatives)? The answers are:

  - My goal is to keep the core lightweight but extensible via plugins, i.e. not trying to bundle every command in the installed binary, so that people own their output filters.
  - It's customizable per use case via plugins or filter pipelines, since I am using my own toolset.
  - It's customizable for non-public CLI tools; for example, some enterprises have internal CLI tools that the public can't access.
  - People should own their data, so the design is local-first, with no telemetry, ever.
  - I love UNIX-style composable pipes, so lowfat-filter is implemented in this style.
  - You can adjust the aggressiveness of the filter, to control the risk of stripping something the agent needs.

GitHub: https://github.com/zdk/lowfat

Anyway, if anyone is interested, feedback and questions are welcome!

Thanks!

Comments

Comment by alex7o 4 days ago

I would like to see a deeper comparison with alternatives like rtk, which are already fast and written in Rust. Also, the previous comments mentioned a known problem with rtk: it sometimes strips the thing that the LLM needs (or expects), causing more work to happen, not less.

Comment by onlyrealcuzzo 4 days ago

None of these tools measure how effective they are...

It's a massive red flag to me when you could get decent data to see if your thing actually works, and they don't even attempt to...

Have the LLM use your tool, run it on several of the coding benchmarks. If you're stingy, run it on the ones that don't cost much.

Otherwise, I'm going to assume it doesn't actually work. If it did - Claude, Antigravity, Codex, Pi, or some major player would bundle tools like this into the CLI / harness.

AFAIK, none of the major players do. That's a sign to me these don't work in general.

I've tried building some tools specific to bug fixing. Intelligently feeding context massively helps smaller models. But what I've found, surprisingly, is that a smaller, much better focused context, even one that includes a lot of helpful data, has almost no impact on larger models compared to what they do by default.

You do save some tokens, though, which is what they're claiming - but not ~99%...

Comment by hansvm 4 days ago

> otherwise, popular solutions would integrate the idea

None of the major players are incentivized to care about this, especially not over other opportunities. Why would you expect them to integrate it?

One of the biggest wins you can institute for your own codebase if you use agents is writing your own harness, by a huge margin. The defaults are fine, but you can do better.

Comment by Cpoll 3 days ago

They're incentivised because they're offering plans at a loss and/or pricing out potential customers. All these LLM companies are competing on accuracy and price.

Comment by onlyrealcuzzo 3 days ago

> The defaults are fine, but you can do better.

Why can I do better than Pi?

I don't want to build my own harness and deal with the bugs... I want to build my project...

My understanding is that Codex / Claude / Gemini subscriptions don't work with custom harnesses.

It's pretty hard to beat 5x more usage if you have the $200/mo subscription by using the API instead.

Comment by smallerize 3 days ago

If you're looking for an efficiency-focused harness, I had a pretty good time using the Dirac agent. The line-based anchors were slightly buggy though (this was a couple months ago) and would sometimes add the same line of code multiple times or leave an anchor in the output.

Comment by hboon 3 days ago

Codex definitely does and Claude Max definitely doesn’t.

Comment by Bnjoroge 3 days ago

"Definitely doesn't" is a strong word. It technically is possible, but you might get banned.

Comment by hboon 2 days ago

That was true. But I think that changed a few weeks ago, when they introduced an API credit amount equivalent to your subscription (e.g. $100, $200) that is used for such cases. So they don't ban you; they just bill against that allocated credit and then at actual API cost.

Comment by Bnjoroge 5 hours ago

Yes. That’s possible in addition to using your actual subscription. I’ve been using it via cliproxy for all harnesses and even my own code review agent hooked up to github apps. Not banned yet but I also dont do crazy stuff with openclaw or hermes

Comment by doix 4 days ago

It's too hard to define what "works" even means in this case. Look at the example savings output. A lot of it is kubectl output.

Your suggestion to use coding benchmarks doesn't really capture the whole picture. I haven't seen a benchmark that uses kubectl.

> AFAIK, none of the major players do. That's a sign to me these don't work in general.

It's a lose/lose for major players. If it works well, it will lower their revenue. Also there's a high risk it'll significantly worsen results for some people, even if it improves results for others.

Comment by taude 4 days ago

I don't think frontier model providers are going to be incentivized to invest in this much, yet. Once inference gets more competitive, sure. I haven't looked lately, but won't be surprised if tools like OpenCode do do what you're suggesting, though. Third-party coding harnesses ARE aligned to deliver this type of feature and optimization.

Comment by no-name-here 4 days ago

> I'm going to assume it doesn't actually work. If it did - Claude, Antigravity, Codex, Pi, or some major player would bundle tools like this into the CLI / harness.

VS Code launched it as a feature in their bundled AI functionality last month: https://code.visualstudio.com/updates/v1_121

Comment by onlyrealcuzzo 4 days ago

Bundling implies interest...

Defaults imply working...

Comment by irthomasthomas 4 days ago

My partial solution to this was to store the full response in a file and prompt the agent to read that if the condensed version had stuff missing.

Comment by unphased 3 days ago

So often we will burn 20% of our limit in a single ill-conceived agent tool call that we're simply not going to be able to, or want to, intercept. Where I see a tool like this being a real step forward is in adding a decision point. It does not have to bubble up and hard-require the user to grant permission, but it can give the LLM an intermediate checkpoint that says it's about to get blasted with 30k tokens, here is roughly the shape of it, and do you want to adjust or whittle it down if you know what you're looking for?

There is definitely tons of value to extract from this line of thinking.
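
A minimal sketch of that checkpoint idea, in Python. This is not how lowfat or any harness actually does it: the function name `preview_shape`, the `[preview]` markers, and the 4-characters-per-token estimate are all assumptions made for illustration.

```python
def preview_shape(output: str, head: int = 5, chars_per_token: int = 4) -> str:
    """Describe a large output instead of returning it in full.

    Hypothetical helper: the token count is a crude length-based estimate,
    not the result of a real tokenizer.
    """
    lines = output.splitlines()
    approx_tokens = len(output) // chars_per_token
    shown = lines[:head]
    summary = [
        f"[preview] {len(lines)} lines, ~{approx_tokens} tokens",
        f"[preview] first {len(shown)} lines:",
        *shown,
        "[preview] narrow the command, or ask for the full output",
    ]
    return "\n".join(summary)
```

The agent sees the size and the first few lines, then decides whether to narrow the command or pull everything.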

Comment by varispeed 4 days ago

You can't measure effectiveness, because you never know what kind of model will process your prompt. One request you might get full e.g. Opus and another they'll downgrade it to Sonnet or something more basic. I have this with "Opus 4.8" all the time.

Comment by jahala 4 days ago

This is the reason, when I built a tool in the same space, I chose to benchmark with cost per correct answer.

Reducing tokens and also turns is quite worthless if the LLM doesn’t solve what you put it to do.
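
To illustrate the metric with a small Python sketch: the numbers below are made up, not taken from any benchmark, and `cost_per_correct` is a hypothetical helper name.

```python
def cost_per_correct(total_cost_usd: float, correct_answers: int) -> float:
    """Total spend divided by the number of tasks solved correctly."""
    if correct_answers == 0:
        return float("inf")  # nothing solved: any spend is wasted
    return total_cost_usd / correct_answers


# Hypothetical runs over the same 10 tasks.
baseline = cost_per_correct(10.0, 8)  # $1.25 per correct answer
filtered = cost_per_correct(6.0, 4)   # $1.50 per correct answer
```

In this made-up case the filtered run spends 40% less in total but is still worse per correct answer, which is exactly what a raw token count would hide.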

Comment by esafak 4 days ago

Did you benchmark the competition and can we see?

Comment by jahala 3 days ago

No I don't have the funds to benchmark the competition, but would be happy to put the numbers up if any token whales feel like having a go.

https://github.com/jahala/tilth/tree/main/benchmark

Comment by alex7o 3 days ago

Oh, that is a nice approach. I wish more benchmarks did cost per successful answer.

Comment by onlyrealcuzzo 3 days ago

The problem even attempting to develop a tool for the frontier model space is that the cost to run a statistically significant benchmark is almost certainly going to be over $100 - for a single model.

Unless something is like 25%+ more cost effective on Gemini for a task, I would not assume those savings are going to transfer to GPT.

If you need to run a test this expensive and slow for every release, hobbyists aren't going to do it.

And if you wanted any broadly specific improvements to coding like they all claim, the costs would be in the thousands per release even for a single model.

And they almost certainly would not be eye popping.

If the models could be SUBSTANTIALLY better, Google and Anthropic and OpenAI wouldn't be finding that out from a hobbyist making wildly unscientific claims.

Comment by jahala 3 days ago

Yup, this is hitting it on the nose. But despite the cost, the benchmark is the vital ingredient that can't be skipped. Otherwise, you don't know if what you're building is actually helping the agent rather than hindering it.

On the previous large benchmark run, I proved 40-50% cost reduction per correct answer.

I'm not sure why the vendors aren't using token filtering/compression more in their tooling, but perhaps they don't mind users feeding them more data and using more data.

Comment by poelzi 3 days ago

[dead]

Comment by zdkaster 4 days ago

In terms of token-saving performance, it should be on par with rtk since it is basically the same idea. The major difference is that rtk bundles hundreds of filters, with no room for the user to adjust them without maintaining their own fork or opening a pull request, while lowfat takes the opposite architectural approach: it removes almost all filter logic from the binary and separates user filters into a plugin system.

Comment by zdkaster 3 days ago

I have just put the comparison in the repo in case you want to check it out.

Comment by giancarlostoro 4 days ago

Yeah I use rtk and would love to see a comparison.

Comment by jemmyw 4 days ago

I've tried rtx and lean-ctx and these tools seem to end up confusing the agent more than helping. Any saving is irrelevant if the agent decides to work around the tool and makes even more calls than it would otherwise.

I don't know about cost saving, but if it's keeping the context size down I've had a lot better results using subagents to keep a higher order conversation clean for longer.

Comment by lxn 4 days ago

I looked into lean-ctx and decided not to use it. It has a very specific use case, and it's good when your interaction with the repository is read-only. When you want to edit, then the model has to read the whole file anyway. It's a cool tool, but it has a very narrow use case where it delivers the performance it claims.

Comment by exitb 4 days ago

Subagents help with costs too, as they can run on much cheaper models.

Comment by mywittyname 4 days ago

Also, most of the time when I'm having an agent look through logs or output, it's grepping for the bits of data relevant to its actions.

Comment by threecheese 4 days ago

The docs are missing any examples of what this does, instead showing _how_ it works - and only for the codebase itself, rather than the behavior of the app.

What would be useful:

  - examples of text that can be filtered, and why that would be valuable
  - a data flow diagram of runtime behavior, showing how filtering removes unnecessary context

Comment by zdkaster 4 days ago

Thanks for your feedback. Will put this in place. Meanwhile, please check out the architecture doc and the plugin doc. The plugin doc may give a bit of insight into what it does.

Comment by mbreese 4 days ago

I have to agree. I’m interested in the project, so congrats. It’s something I might really like using.

But the one thing I expected to see in the Readme was an example of: takes this tool run output: XXXXXX and converts it to: XX for a savings of 40% of tokens.

This looks like a nice (and useful) project, so thanks for sharing!

Comment by naorsabag 1 day ago

Agreed, try to use OpenHop to create the data flow diagram

Comment by alkh 4 days ago

Thanks for your effort! I also think having examples of raw output before vs after using lowfat would be useful as well

Comment by zdkaster 4 days ago

Got it! thanks for your feedback.

Comment by threecheese 2 days ago

[dead]

Comment by wood_spirit 4 days ago

I have my own llm wrapping harness, which does this and has a few more tricks. For example, it doesn’t have a lot of mcp but it does have search_mcp and load_mcp tools (and search_skills) so the llm can find what it needs when it needs it without bloating the normal baseline context. The LLMs have proved really good at using them. There is also a waypoint tool they can use to record their thinking in the context without it being the final output. Am thinking about a search_expert to find colleagues it can bring into conversations too. And a lot of other stuff.

Pro tip that worked well for me with response truncation: in the truncated output, say that the full text is available in /tmp/wherever.txt. That way, the LLM will be able to query and read more using built-in tools without reissuing the big tool call.
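
A rough Python sketch of that truncation tip, under stated assumptions: `truncate_with_pointer` and the wording of the note are hypothetical, and a real harness would likely pick its own file location and limit.

```python
import tempfile


def truncate_with_pointer(output: str, max_lines: int = 50) -> str:
    """Return at most max_lines lines, plus a note saying where the rest lives."""
    lines = output.splitlines()
    if len(lines) <= max_lines:
        return output  # short enough: pass through untouched

    # Save the full text so the agent can read it later with its own tools.
    with tempfile.NamedTemporaryFile(
        "w", suffix=".txt", delete=False, encoding="utf-8"
    ) as f:
        f.write(output)
        path = f.name

    dropped = len(lines) - max_lines
    note = f"[truncated {dropped} of {len(lines)} lines; full output saved to {path}]"
    return "\n".join(lines[:max_lines] + [note])
```

Because the note names a real file, the agent can grep or read it instead of re-running the expensive command.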

Comment by unphased 3 days ago

great approach. I did that with my opencode based setup as well, it's neat and fun to tune skills and mcp loaders and stuff. Then i got fed up with opencode's design limitations. And then, my own harness work is on hold in favor of a harness-puppeteer paradigm, but that one has also been on hold! I'm mostly currently pulling on the thread of making it easier just to review the voluminous conversation turns!

Comment by zdkaster 4 days ago

Interesting approach. Thanks for sharing.

Comment by devdoc83 4 days ago

How do you handle the risk of stripping out the exact stack trace the agent needed? That seems like the hard tradeoff here.

Comment by zdkaster 4 days ago

It supports strip aggressiveness levels. You can tune up to 3 levels for each output template of your stack trace using the lowfat-filter DSL, a shell script, or Python.

Comment by itsthecourier 4 days ago

Gonna ask the same... so far, has it been manually choosing what's useful in each command for the agents?

Comment by zdkaster 4 days ago

It requires a bit of effort for long-term adjustment and tuning of the CLI commands your agent commonly calls. It kind of needs to evolve on a day-to-day basis. But the agent itself can help with tuning this.

Comment by ramon156 4 days ago

In a perfect world the LLM needs to be very explicit on what it wants to read

Comment by nixpulvis 4 days ago

The LLMs already do that themselves with `tail` all the time. There's a lot of room for improvement on top of that. Though they usually figure it out after a few tries. I often just paste manual runs errors myself anyway.

Comment by itsdesmond 4 days ago

Have terms been established to describe these types of tools? How do I refer to small utilities that perform specific transformations to LLM behavior? "CLI filter" seems pretty good to describe this tool conversationally but not so much when searching; they're low-cardinality keywords.

Comment by mf_kevintruong 14 hours ago

Most of the wasted tokens come from code reading and grep: the round trips of searching for and retrieving the logic.

Comment by fcanesin 4 days ago

I am thinking that a small tool that simply refuses to pass large CLI output to the LLM and warns it to filter the results before reading would achieve this better, as the LLM would be forced into thinking and writing the filter itself.
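
Something like the following Python sketch is what that could look like. It is only an illustration: `gate_output`, the 8000-character default, and the message text are assumptions, not part of lowfat or any other tool mentioned here.

```python
def gate_output(output: str, max_chars: int = 8000) -> str:
    """Pass small outputs through; replace large ones with a refusal message."""
    if len(output) <= max_chars:
        return output
    return (
        f"[refused] output is {len(output)} characters (limit {max_chars}). "
        "Re-run the command with a filter such as grep, head, or a narrower query."
    )
```

The trade-off is an extra round trip whenever the limit is hit, in exchange for the model writing its own filter.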

Comment by zdkaster 4 days ago

I simply use LLM to create filter for my personal use. I have already put that specific instruction in the plugin doc in case you are interested.

Comment by jondwillis 4 days ago

I think GP is basically saying, bitter lesson applies here.

Comment by cityofdelusion 4 days ago

This is a nice little project but I’m weary of sensationally inaccurate titles for stuff like this and the infamous caveman mode. It doesn’t save 91% of tokens: it reduced in one user case 91% of output tokens on the raw CLI output. I am being pedantic about this because these sorts of claims go viral and are inaccurate.

A proper benchmark will compare a large sample of identical prompting with and without the tool, against a specific harness. Once you apply Amdahl’s law, there is no way this saves 91% of tokens holistically, which the title implies.

I work in a non-tech company and these sorts of things keep going viral, with no understanding and with no comprehension of what is actually going on. Engineering is gone and cargo cult magical incantations are in.
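
The Amdahl's law point can be made concrete with a few lines of Python. The 20% share of CLI output in total token usage is a purely hypothetical figure; only the 91.8% reduction comes from the post.

```python
def overall_saving(cli_fraction: float, cli_reduction: float) -> float:
    """Fraction of all tokens saved when only the CLI-output share is reduced."""
    return cli_fraction * cli_reduction


# Hypothetical: CLI output is 20% of all tokens; the filter removes 91.8% of it.
print(f"{overall_saving(0.20, 0.918):.1%}")  # prints 18.4%
```

The overall saving only approaches 91.8% if nearly all tokens in a session are raw CLI output.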

Comment by zdkaster 3 days ago

Understood. Didn't mean it as click-bait or anything. Just sharing my CLI report summary.

The target user here on HN should be tech-savvy, and this tool is not designed for non-tech users because it requires heavy customization from the user to get the result they want.

Anyway, would you mind suggesting a correct title here? I will consider updating it.

Comment by rahulyc 4 days ago

Great idea. I'm wondering if it could make sense to send the output to a cheap / local model to filter out only the bits that "matter" and pass that through, at the cost of some extra time, but maybe it's worth it for saving tokens in the larger model.

Comment by 0xCAP 2 days ago

Tbh I'll wait for first party LLM providers to build this kind of stuff. If they're not first class citizens they end up corrupting the workflow more than enriching it.

Comment by clutter55561 4 days ago

Tools that remove the fat seem like a good idea, but I’m highly suspicious of their effect on the LLM’s reasoning.

LLMs were trained on the typical full-fat output found everywhere on the internet, and all of a sudden they get a slightly different response that may look like nothing they have seen before.

Does that really save tokens in the long run?

Comment by zdkaster 3 days ago

I have just been using it for 2 months, so... lmao. might need a year and with more users to test out how it will go.

Comment by pradeep1177 3 days ago

Wait, do the coding agents fire `kubectl get -o yaml`??? Most of the harness agents, like CC or codes, are very precise about command construction. For example, the harness adds -o and looks for the status.

Comment by tegiddrone 4 days ago

Still learning myself, but I've seen MCP tools just lightly wrap upstream json-body REST APIs. Works. But not only is the json structure more tokens but often the model just needs a small subset of fields in the payload.

Comment by zdkaster 4 days ago

To be safe, if you need the full JSON, you could add a conditional passthrough that returns the original raw output. Or you can handle selective objects using Python via the filter plugin.

Comment by avocadoking 4 days ago

Do you have any insight if LLMs sometimes get confused by your filters?

Comment by tim-projects 4 days ago

He says he adds an output message, but I've tried this myself and I find that quite a lot of the time the agent prefers its own internal monologue over the output of a command.

Comment by KuhlMensch 3 days ago

I'm glad this class of tool exists.

But it'll be a case of measuring first, then perhaps a staged integration of a tool like this.

Comment by tuo-lei 4 days ago

the bigger problem is agents defaulting to the broadest command possible. kubectl get -o yaml when a jsonpath query would give 1/50th the tokens. filtering after the fact works, but you're still paying for the round trip. better to teach the agent to ask narrow questions in the first place.

Comment by CuriouslyC 4 days ago

Hooks are great for this.

Comment by urax 1 day ago

[flagged]

Comment by davidetroiani 4 days ago

Add a comparison table between your repo and alternatives like rtk. I’m interested.

Comment by neuralkoi 4 days ago

Great! Now, you should slap a logo on this, bootstrap it as a service, and get you some YC funding. [0]

[0] https://thetokencompany.com

Comment by sakuraiben 4 days ago

Would be interested to see what kind of eval results you get from this

Comment by pradeep1177 4 days ago

Would this have any impact on the response quality from the agent?

Comment by CharlesW 4 days ago

Yes, and never for the better.

Comment by zdkaster 4 days ago

Can you elaborate on why that would be so?

Comment by esafak 4 days ago

Because it could discard things the agent needs.

Comment by zdkaster 3 days ago

You can control what you want to feed to the agent. Keep what it needs, discard what it doesn't.

Comment by zdkaster 4 days ago

Frankly, not at all.

Comment by pradeep1177 4 days ago

I have a suspicion that the model would miss more context unless you are very precise about what FAT means in each context. However, loved the idea.

Comment by zdkaster 4 days ago

Understood. Let me give some examples: most of the time we don't need the spaces between table columns; git diff produces a bunch of unnecessary info when we just need the filename and the actual diff lines; for kubectl describe we would mostly check events, image, etc. This is why I made it composable filters, as it very much depends on your specific ops to optimize the tokens.
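
To give a feel for the first two examples, here is a rough Python sketch. It is not lowfat's actual filter code or plugin API: `squeeze_columns` and `slim_diff` are hypothetical names, and `slim_diff` assumes `git diff` style input where each file starts with a `diff --git` line.

```python
import re


def squeeze_columns(text: str) -> str:
    """Collapse runs of spaces/tabs in table-like output to single spaces."""
    return "\n".join(
        re.sub(r"[ \t]+", " ", line).strip() for line in text.splitlines()
    )


def slim_diff(diff: str) -> str:
    """Keep only the new file name and the added/removed lines of a git diff."""
    kept = []
    in_hunk = False
    for line in diff.splitlines():
        if line.startswith("diff --git"):
            in_hunk = False           # a new file's header section begins
        elif line.startswith("@@"):
            in_hunk = True            # hunk header: changed lines follow
        elif not in_hunk:
            if line.startswith("+++ "):
                kept.append(line)     # file name line
        elif line.startswith(("+", "-")):
            kept.append(line)         # actual change; context lines are dropped
    return "\n".join(kept)
```

Dropping context lines is the aggressive end of the spectrum; an agent that needs surrounding code would want a milder setting.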

Comment by pradeep1177 3 days ago

Yes, it also depends on how a model harness uses a tool.

Harness: I'm about to commit. Good use case.

Harness: What has changed from X to Y? Bad use case, no?

Comment by anoop4bhat 3 days ago

Is this different from caveman?

Comment by zdkaster 3 days ago

Afaik, caveman shortens sentences in conversation, but lowfat picks out what matters from CLI output. That's a different output target.