LM Studio 0.4
Posted by jiqiren 23 hours ago
Comments
Comment by tarruda 6 hours ago
Comment by embedding-shape 3 hours ago
Comment by roger_ 4 hours ago
Comment by mycall 3 hours ago
Comment by tarruda 1 hour ago
Comment by embedding-shape 3 hours ago
Both have their places and are complementary, rather than competitors :)
Comment by syntaxing 22 hours ago
Comment by Imustaskforhelp 4 hours ago
I think I'm fairly technical, but I still prefer how simple Ollama is. I know all the complaints about Ollama, though, and I'm really just wishing for a better alternative for the most part.
Maybe just a direct layer on top of vllm or llama.cpp itself?
Comment by embedding-shape 3 hours ago
My dream would be something like vLLM, but without all the Python mess, packaged as a single binary that has both an HTTP server and a desktop GUI, and can browse/download models. Llama.cpp is like 70% there, but there's a large performance difference between llama.cpp and vLLM for the models I use.
Comment by PlatoIsADisease 20 hours ago
I had used oobabooga back in the day and found ollama unnecessary.
Comment by embedding-shape 6 hours ago
One decision that was/is very integral to their architecture is trying to copy how Docker handled registries and storage of blobs. Docker images have layers, so the registry could store one layer that is reused across multiple images, as one example.
Ollama did this too, but I'm unsure why. I know the author used to work at Docker, but almost no data from weights can be shared that way, so instead of just storing "$model-name.safetensors/.gguf" on disk, Ollama splits it up into blobs, has its own index, and so on. For seemingly no gain, except making it impossible to share weights between multiple applications.
I guess business-wise, it made it easier for them to now push people toward their "cloud models" so they earn money, because it's just another registry the local client connects to. But it also means Ollama isn't just about running local models anymore, because that doesn't make them money, so all their focus is now on their cloud instead.
At least as an LM Studio, llama.cpp and vLLM user, I can have one directory with weights shared between all of them (granted the weight format works in all of them), and if I want to use Ollama, it of course can't use that same directory and will by default store things its own way.
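As a rough sketch (the paths and model names here are just illustrative), sharing one weights directory between the two engines looks like this:
llama-server -m /srv/models/Qwen2.5-7B-Instruct-Q4_K_M.gguf --port 8080   # llama.cpp reads the GGUF file directly
vllm serve /srv/models/Qwen2.5-7B-Instruct --port 8000                    # vLLM reads the safetensors directory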
Comment by plagiarist 3 hours ago
What I want is to have a directory with models and bind-mount it read-only into inference containers. But Ollama would force me to either prime the pump by importing with Modelfiles (where do I even get these?) every time I start the container, or store their specific version of the files.
Trying out vLLM and llama.cpp was my next step in this; I'm glad to hear you're able to share a directory between them.
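For what it's worth, a minimal sketch of the read-only bind-mount approach with llama.cpp's server container (the image tag and paths are illustrative; check the project's published images):
docker run --rm -p 8080:8080 \
  -v /srv/models:/models:ro \
  ghcr.io/ggml-org/llama.cpp:server \
  -m /models/Qwen2.5-7B-Instruct-Q4_K_M.gguf --host 0.0.0.0 --port 8080
# The container only ever sees /models read-only; no importing or re-packing of weights needed.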
Comment by embedding-shape 2 hours ago
Yeah, that's basically what I'm doing, plus over the network (via Samba). My weights all live on a separate host, which has two Samba shares, one with write access and one read-only. The write one is mounted on my host, and the container where I run the agent mounts the read-only one (and has the source code it works on copied over to the container on boot).
The directory that LM Studio ends up creating and maintaining for the weights works with most of the tooling I come across, except of course Ollama.
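A sketch of the two-share setup (the host, share names and credentials file are illustrative):
# Writable share on the workstation that manages/downloads weights:
sudo mount -t cifs //nas/models-rw /mnt/models -o rw,credentials=/etc/samba/creds
# Read-only share on the box running the agent container:
sudo mount -t cifs //nas/models-ro /mnt/models -o ro,credentials=/etc/samba/creds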
Comment by fud101 10 hours ago
Nothing, it was always going to be a rug pull. They leeched off llama.cpp.
Comment by garyfirestorm 4 hours ago
Comment by embedding-shape 3 hours ago
Maybe it is today, but initially Ollama was only a CLI, so obviously not for "non-technical people" who would have no idea how to even use a terminal. If you hang out in the Ollama Discord (unlikely, as the mods are very ban-happy), you'd constantly see people asking for very trivial help, like how to enter commands in the terminal, and the community stringing them along instead of just directing them to LM Desktop or something that would be much better for that type of user.
Comment by stuaxo 3 hours ago
I can be in a non-technical team, and put the LLM code inside docker.
The local dev instruction is to install ollama and use it to pull the models and set some env vars.
The same code can point at bedrock when deployed there.
At the time I wrote that, using straight llama.cpp wasn't as straightforward.
Comment by embedding-shape 2 hours ago
> Exactly. I can be in a non-technical team, and put the blah inside blah. The blah is to install blah and use it to blah and blah. The same blah can point at blah when blah there. Using blah at the time I wrote that it wasn't as straightforward.
I think when people say "non-technical", it feels like they're talking about "people who work in tech startups but aren't developers" rather than people who actually aren't technical at all, the ones who don't know the difference between a "desktop" and a "browser", for example. Where you tell them to press any key, and they reply with "What key is that?".
Comment by TomMasz 4 hours ago
Comment by minimaxir 22 hours ago
Comment by Helithumper 22 hours ago
`lms chat` has existed, `lms daemon up` / "llmster" is the new command.
Comment by embedding-shape 22 hours ago
Ah, this is great, been waiting for this! I naively created some tooling on top of the API from the desktop app after seeing they had a CLI, then once I wanted to deploy and run it on a server, I got very confused when I realized the desktop app actually installs the CLI, and the CLI requires the desktop app to be running.
Great that they finally got it working fully headless now :)
Comment by secult 7 hours ago
Comment by thousand_nights 22 hours ago
Comment by konart 22 hours ago
"looks like a toy" has very little to do with its use anyway.
Comment by keyle 9 hours ago
Comment by chris_st 3 hours ago
Comment by embedding-shape 3 hours ago
I've only used LM Desktop on Linux and Windows, and I've never seen anything asking for elevated permissions.
Comment by pzo 6 hours ago
Comment by embedding-shape 6 hours ago
Compared to models downloaded with LM Studio, which are just directories with the weights as published: you just point llama.cpp/$tool-of-choice at them and it works.
Comment by arajnoha 2 hours ago
Comment by hnlmorg 8 hours ago
The impression I get is that LM Studio is basically an Ollama-type of solution but with an IDE included -- is that a fair approximation?
Things change so fast in the AI space that I really cannot keep up :(
Comment by martinald 8 hours ago
Comment by james_marks 6 hours ago
Without much background, you're finding models, chatting with them, and you have an OpenAI-compatible API w/ logging. Haven't seen the new version, but LM Studio was already pretty great.
Comment by anhner 8 hours ago
Comment by hnlmorg 8 hours ago
Comment by atwrk 8 hours ago
llama.cpp is the actual engine running the LLMs; Ollama is a wrapper around it.
Comment by embedding-shape 6 hours ago
How far did they get with their own inference engine? I seem to recall for the launch of Gemma (or some other model), they also launched their own Golang backend (I think), but never heard anything more about it. I'm guessing they'll always use llama.cpp for anything before that, but did they continue iterating on their own backend and how is it today?
Comment by jiqiren 23 hours ago
Comment by observationist 22 hours ago
Thanks for the updates!
Comment by nubg 20 hours ago
Comment by anon373839 6 hours ago
Comment by doanbactam 13 hours ago
Comment by saberience 22 hours ago
I get that I can run local models, but all the paid for (remote) models are superior.
So is the use-case just for people who don’t want to use big tech’s models? Is this just for privacy conscious people? Or is this just for “adult” chats, ie porn bots?
Not being cynical here, just wanting to understand the genuine reasons people are using it.
Comment by biddit 22 hours ago
I've invested heavily in local inference. For me, it's a mixture of privacy, control, stability, and cognitive security.
Privacy - my agents can work on tax docs, personal letters, etc.
Control - I do inference steering with some projects: constraining which token can be generated next at any point in time. Not possible with API endpoints. (A rough sketch of what I mean is below, after this list.)
Stability - I had many bad experiences with frontier labs' inference quality shifting within the same day, likely due to quantization under system load. Worse, they retire models, update their own system prompts, etc. They're not stable.
Cognitive Security - This has become more important as I rely more on my agents for performing administrative work. This is intermixed with the Control/Stability concerns, but the focus is on whether I can trust it to do what I intended it to do, and that it's acting on my instructions, rather than the labs'.
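To make the "Control" point concrete, here's a rough sketch of constrained decoding against a local llama.cpp server, whose /completion endpoint accepts a GBNF grammar (the prompt and grammar are illustrative):
curl http://localhost:8080/completion -H "Content-Type: application/json" -d '{
  "prompt": "Answer yes or no: is the invoice total above $500?\n",
  "n_predict": 4,
  "grammar": "root ::= (\"yes\" | \"no\")"
}'
# The grammar masks the candidate tokens at every step, so the model can only ever emit "yes" or "no".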
Comment by metalliqaz 20 hours ago
Comment by samarthr1 9 hours ago
My computer is now worth more than when I bought it
Comment by dragonwriter 20 hours ago
Running weights available models.
> I get that I can run local models, but all the paid for (remote) models are superior.
If that's clearly true for your use cases, then maybe this isn’t for you.
> So is the use-case just for people who don’t want to use big tech’s models?
Most weights available models are also “big tech’s”, or finetunes of them.
> Is this just for privacy conscious people? Or is this just for “adult” chats, ie porn bots?
Sure, those are among the use cases. And there can be very good reasons to be concerned about privacy in some applications. But they aren’t the only reasons.
There’s a diversity of weights-available models available, with a variety of specialized strengths. Sure, for general use, the big commercial models may generally be more capable, but they may not be optimal for all uses (especially when cost effectiveness is considered, given that capable weights-available models for some uses are very lightweight.)
Comment by PeterStuer 10 hours ago
So yes, the tradeoff is security vs capability. The former always comes at a cost.
Comment by maxkfranz 13 hours ago
Some non-programming use cases are interesting though, e.g. text to speech or speech to text.
Run a TTS model overnight on a book, and in the morning you’ll get an audiobook. With a simple approach, you’d get something more like the old books on tape (e.g. no chapter skipping), but regardless, it’s a valid use case.
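As a sketch, assuming a local TTS server that exposes an OpenAI-compatible /v1/audio/speech endpoint (the port, model and voice names are placeholders), the overnight batch can be as simple as:
for f in chapters/*.txt; do
  curl -s http://localhost:8880/v1/audio/speech \
    -H "Content-Type: application/json" \
    -d "$(jq -n --rawfile text "$f" '{model: "local-tts", voice: "narrator", input: $text}')" \
    --output "audio/$(basename "$f" .txt).mp3"
done
# One audio file per chapter; concatenate or tag them afterwards if you want chapter navigation.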
Comment by numpad0 6 hours ago
Comment by nxobject 10 hours ago
Comment by reactordev 22 hours ago
I exclusively run local models. On par with Opus 4.5 for most things. gpt-oss is pretty capable. Qwen3 as well.
Comment by nubg 21 hours ago
?
Are you asking it for capital cities or what?
Comment by reactordev 20 hours ago
I’m asking it to write C code
Comment by konart 22 hours ago
Comment by hickelpickle 21 hours ago
That's what convinced me they're ready to do real work. Are they going to replace Claude Code? Not currently. But it's insane to me that such a small model can follow those explicit directions and consistently perform that workflow.
During that experimentation, even when I didn't spell out the SQL explicitly, it was able to craft the queries on its own from just a text description, and it has no issue navigating the CLI and file system doing basic day-to-day things.
I'm sure there are a lot of people doing "adult" things, but my interest was sparked because they're finally at the level where they can be a tool in a homelab, and LLM usage limits are no longer subsidized like they used to be. Not to mention I'm really disillusioned with big tech having my data, or with exposing a tool that makes API calls to them and can then take actions on my system.
I'll still keep using Claude Code for day-to-day coding. But for small system-based tasks I plan on moving to local LLMs. Their capabilities have inspired me to write my own agentic framework to see what workflows can be put together just for management and automation of day-to-day tasks. Ideally it would be nice to just chat with an LLM and tell it to add an appointment or a call at x time, or make sure I do it that day, and have it read my schedule, remind me at a chill time of my day to make the call, and then check up that I followed through. I also plan on seeing if I can set it up to remind me and help me practice the mindfulness and general stress management I should do. Sure, a simple reminder might work, but as someone with ADHD who easily forgets reminders as soon as they pop up if I can't get to them right away, being pestered by an agent that wakes up and engages with me seems like it might be an interesting workflow.
And the hacker aspect: now that they're capable, I really want to mess around with persistent knowledge in databases and making them intercommunicate and work together. I might even give them access to rewrite themselves and access the application at runtime with a Lisp. But to me, local LLMs have gotten to the point where they're fun and not annoying. I can run a model that is better than ChatGPT 3.5 for the most part; its knowledge is more distilled and narrower, but for what they do understand, their correctness is much better.
Comment by tiderpenger 22 hours ago
Comment by mk89 22 hours ago
For me the main BIG deal is that cloud models have online search embedded etc, while this one doesn't.
However, if you don't need that (e.g., translation, summarizing text, writing code), it's probably good enough.
Comment by prophesi 21 hours ago
Comment by mk89 21 hours ago
Comment by dragonwriter 20 hours ago
Models do not have online search embedded; they have tool-use capabilities (possibly with specialized training for a web search tool). That's true of many open and weights-available models, and they are run with harnesses that support tools and provide a web search tool (LM Studio is such a harness, and it can easily be supplied with a web search tool).
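For example (a sketch, not LM Studio's exact wiring; the endpoint, port and tool name are illustrative), supplying a web search tool over an OpenAI-compatible API just means the harness declares it and then executes whatever tool call the model emits:
curl http://localhost:1234/v1/chat/completions -H "Content-Type: application/json" -d '{
  "model": "your-local-model",
  "messages": [{"role": "user", "content": "What changed in LM Studio 0.4?"}],
  "tools": [{
    "type": "function",
    "function": {
      "name": "web_search",
      "description": "Search the web and return result snippets",
      "parameters": {"type": "object", "properties": {"query": {"type": "string"}}, "required": ["query"]}
    }
  }]
}'
# The harness (not the model) runs the web_search call and feeds the results back as a tool message.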
Comment by nunodonato 20 hours ago
Comment by mark_l_watson 19 hours ago
Comment by PlatoIsADisease 20 hours ago
But then I decided I'm just a chemical reaction and a product of my environment, so I gave chatGPT all my dirt anyway.
But before, I cared about my privacy.
Comment by anon373839 6 hours ago
That doesn’t address the practical significance of privacy, though. The real risk isn’t that OpenAI employees will read your chats for personal amusement. The risk is that OpenAI will exploit the secrets you’ve entrusted to them, to manipulate you, or to enable others to manipulate you.
The more information an unscrupulous actor has about you, the more damage they can do.
Comment by gostsamo 11 hours ago
Comment by marak830 19 hours ago
Without a ton of hassle, I cannot do that with a public model (without paying API pricing).
My responses may be slower, but I know the historical context is going to be there. As well as the model overrides.
In addition, I can bolt on modules as I feel like it (voice, avatar, silly tavern, to list a few).
I get to control my model by selecting specific ones for tasks, I can upgrade as they are released.
These are the reasons I use local.
I do use Claude as a coding junior so I can assign tasks and review them, purely because I do not have something that can replicate that locally on my setup (hardware-wise; and from what I have read, local coding models are not matching Claude yet).
That's more than likely a temporary issue (years, not weeks, given the expense of things and the state of open models specialising in coding).
Comment by anonym29 22 hours ago
You don't need LM Studio to run local models; it just was (formerly) a nice UI to download and manage HF models and llama.cpp updates, and to quickly and easily switch manually between CPU / Vulkan / ROCm / CUDA backends (depending on your platform).
Regarding your actual question, there are several reasons.
First off, your allusion to privacy - absolutely, yes, some people use it for adult role-play. However, consider the more productive motivations for privacy too: a lot of businesses have trade secrets they may want to discuss or work on with local models without ever releasing that information to cloud providers, no matter how much those cloud providers pinky-promise to never peek at it. Google, Microsoft, Meta, et al. have consistently demonstrated that they do not value or respect customer privacy expectations, and that they will eagerly comply with illegal, unconstitutional NSA conspiracies to facilitate bulk collection of customer information/data. There is no reason to believe Anthropic, OpenAI, Google, or xAI would act any differently today. In fact, there is already a standing court order forcing OpenAI to preserve all customer communications, in a format that can be delivered to the court (i.e. plaintext, or encryption at rest + willingness to provide decryption keys to the court), in perpetuity (https://techstartups.com/2025/06/06/court-orders-openai-to-p...)
There are also businesses which have strict, absolute needs for 24/7 availability and low latency, which remote APIs have never offered. Even if the remote APIs were flawless, and even if the businesses have a robust multi-WAN setup with redundant UPS systems, network downtime or even routing issues are more or less an inevitable fact of life, sooner or later. Having local models means you have inference capability as long as you have electricity.
Consider, too, the integrity front: frontier labs may silently modify API-served models to be lower quality for heavy users, with little means of detection by end users (multiple labs have been suspected/accused of this; a lack of proof isn't evidence that it didn't happen), or API-served models can be modified over time to patch behaviors that may have previously been relied upon for legitimate workloads (imagine a red team that used a jailbreak to get a model to produce code for process hollowing, for instance). This second example absolutely has happened with almost every inference provider.
The open weight local models also have zero marginal cost besides electricity once the hardware is present, unlike PAYG API models, which create financial lock-in and dependency that is in direct contrast with the financial interests of the customers. You can argue about the amortized costs of hardware, but that's a decision for the customer to make using their specific and personal financial and capex / hardware information that you don't have at the end of the day.
Further, the gap between frontier open weight models and frontier proprietary models has been rapidly shrinking and continues to. See Kimi K2.5, Xiaomi MiMo v2, GLM 4.7, etc. Yes, Opus 4.5, Gemini 3 Pro, GPT-5.2-xhigh are remarkably good models and may beat these at the margin, but most work done via LLMs does not need the absolute best model; many people will opt for a model that gets 95% of the output quality of the absolute frontier model when it can be had for 1/20th the cost (or less).
Comment by pram 13 hours ago
Comment by PeterStuer 2 hours ago
Comment by khimaros 21 hours ago
Comment by adastra22 21 hours ago
Comment by jckahn 20 hours ago
Comment by PeterStuer 2 hours ago
https://www.librechat.ai/docs/configuration/librechat_yaml/a...
Comment by echelon 12 hours ago
Is this like "OpenRouter" where they don't have any of the core product actually available?
Comment by tildef 7 hours ago
>> You agree that You will not permit any third party to, and You will not itself:[..] (e) reverse engineer, decompile, disassemble, or otherwise attempt to derive the source code for the Software[..]
Comment by ssalka 21 hours ago
https://github.com/lmstudio-ai/lmstudio-bug-tracker/issues/1...
Comment by dmd 21 hours ago
Comment by jermaustin1 21 hours ago
Comment by sfifs 20 hours ago
Comment by maxkfranz 13 hours ago
You don’t want some random person to find your LMStudio service and then point their Opencode at it.
Comment by makeramen 21 hours ago
Comment by whalesalad 12 hours ago
An analog could be car infotainment systems: don't give me your half-baked, shitty infotainment, I have CarPlay, let me use it.
Comment by Nijikokun 21 hours ago
Comment by neves 7 hours ago
Comment by embedding-shape 6 hours ago
Comment by auscompgeek 7 hours ago
Comment by ai_critic 20 hours ago
Comment by alasr 19 hours ago
With lms, LM Studio's frontend GUI/desktop application and its backend LLM API server (serving the OpenAI-compatible API endpoints) are tightly coupled: stopping LM Studio's GUI/desktop application will also stop LM Studio's backend LLM API server.
With llmsterm, they've now been decoupled; it (llmsterm) enables one, as the LM Studio announcement says, to "deploy on servers, deploy in CI, deploy anywhere" (where having a GUI/desktop application doesn't make sense).
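A rough sketch of the headless flow (command names follow the announcement and the existing lms CLI; exact flags and the model identifier are placeholders and may differ by version):
lms daemon up                      # start the headless daemon, no desktop app required
lms get qwen2.5-7b-instruct        # download a model
lms load qwen2.5-7b-instruct       # load it into the daemon
lms server start --port 1234       # expose the OpenAI-compatible endpoints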
Comment by chocobaby15 21 hours ago
Comment by embedding-shape 2 hours ago
Comment by behnamoh 22 hours ago
but it's a bit too little, too late. People running this can probably already set up llama.cpp pretty easily.
LM Studio also has some overhead, like Ollama; llama.cpp or MLX alone are always faster.
Comment by Der_Einzige 13 hours ago
The closest thing we have to an LLM front end where you can actually CONTROL your model (i.e. advanced sampling settings) is oobabooga/sillytavern - both ultimately UIs designed mostly for "roleplay/cooming". It's the same shit with image gen and ComfyUI too!!!
LM Studio purported to be something like those two, but it has NEVER properly supported even a small fraction of the settings that LLMs use, and thus it's DOA for prosumer/pros.
I'm glad that claude code and moltbot are killing this whole genre of Software since apparently VC backed developers can't be trusted to make it.
Comment by redrove 12 hours ago
Comment by Der_Einzige 6 hours ago
Comment by echelon 12 hours ago
https://github.com/storytold/artcraft
Roadmap: Auth with all frontier AI image/video model providers, FAL, other aggregators. Focus on tangible creation rather than node graphs (for now).
I'm a filmmaker, so I'm making this for my studio and colleagues.
Comment by MarginalGainz 6 hours ago
Comment by anonym29 22 hours ago
Comment by nunodonato 22 hours ago
Comment by webdevver 21 hours ago
although, as an amd user, he should know that both vulkan and rocm backends have equal propensity to crap the bed...
Comment by anonym29 22 hours ago
Comment by ffftttfffttt 21 hours ago
Comment by webdevver 21 hours ago
Comment by anonym29 21 hours ago
To your point though, if the successors to Strix Halo, Serpent Lake (x86 intel CPU + Nvidia iGPU) and Medusa Halo (x86 AMD CPU + AMD iGPU) come in at a similar price point, I'll probably go with Serpent Lake, given the specs are otherwise similar (both are looking at 384-bit unified memory bus to LPDDR6 with 256GB unified memory options). CUDA is better than ROCm, no argument there.
That said, this has nothing to do with the (now resolved) issue I was experiencing with LM Studio not respecting existing Developer Mode settings with this latest update. There are good reasons to want to switch between different back-ends (e.g. debugging whether early model release issues, like those we saw with GLM-4.7-Flash, are specific to Vulkan - some of them were in that specific example). Bugs like that do exist, but I've had even fewer stability issues on Vulkan than I've had on CUDA on my 4080.
Comment by webdevver 18 hours ago
sure you can load big(-ish) models on it, but if you're getting <10 tokens per second, that severely limits how useful it is.
Comment by anonym29 16 hours ago
Even moderately large and capable models like gpt-oss:120b and Qwen3-Next-80B have pretty good TG speeds - think 50+ tok/s TG on gpt-oss:120b.
PP is the main thing that suffers due to memory bandwidth, particularly for very long PP stretches on typical transformers models, per the quadratic attention needs, but like I said, with KV caching, not a big deal.
Additionally, newer architectures like hybrid linear attention (Qwen3-Next) and hybrid mamba (Nemotron) exhibit much less PP degradation over longer contexts, not that I'm doing much long context processing thanks to KV caching.
My 4080 is absolutely several times faster... on the teeny tiny models that fit on it. Could I have done something like a 5090 or dual 3090 setup? Sure. Just keep in mind I spent considerably less on my entire Strix Halo rig (a Beelink GTR 9 Pro, $1980 w/ coupon + pre-order pricing) than a single 5090 ($3k+ for just the card, easily $4k+ for a complete PCIe 5 system), it draws ~110W on Vulkan workloads, and idles below 10W, taking up about as much space as a Gamecube. Comparing it to an $8500 RTX 6000 Pro is a completely nonsensical comparison and was outside of my budget in the first place.
Where I will absolutely give your argument credit: for AI outside of LLMs (think genAI, text2img, text2vid, img2img, img2vid, text2audio, etc), Nvidia just works while Strix Halo just doesn't. For ComfyUI workloads, I'm still strictly using my 4080. Those aren't really very important to me, though.
Also, as a final note, Strix Halo's theoretical MBW is 256 GB/s, I routinely see ~220 GB/s real world, not 200 GB/s. Small difference when comparing to GDDR7 on a 512 bit bus, but point stands.
Comment by snvzz 11 hours ago
Comment by huydotnet 22 hours ago
Comment by anonym29 21 hours ago
On your inference machine:
you@yourbox:~/Downloads/llama.cpp/bin$ ./llama-server -m <path/to/your/model.gguf> --alias <your-alias> --jinja --ctx-size 32768 --host 0.0.0.0 --port 8080 -fa on
Obviously, feel free to change your port, context size, flash attention, other params, etc.
Then, on the system you're running Claude Code on:
export ANTHROPIC_BASE_URL=http://<ip-of-your-inference-system>:<port>
export ANTHROPIC_AUTH_TOKEN="whatever"
export CLAUDE_CODE_DISABLE_NONESSENTIAL_TRAFFIC=1
claude --model <your-alias> [optionally: --system "your system prompt here"]
Note that the auth token can be whatever value you want, but it does need to be set, otherwise a fresh CC install will still prompt you to login / auth with Anthropic or Vertex/Azure/whatever.
Comment by huydotnet 21 hours ago