Llama.vim – Local LLM-assisted text completion

https://github.com/ggml-org/llama.vim

By kgwgk

ggerganov | 9 comments | 2 weeks ago
Hi HN, happy to see this here!

I highly recommend taking a look at the technical details of the server implementation that enables large context usage with this plugin - I think it is interesting and has some cool ideas [0].

Also, the same plugin is available for VS Code [1].

Let me know if you have any questions about the plugin - happy to explain. Btw, the performance has improved compared to what is seen in the README videos thanks to client-side caching.

[0] - https://github.com/ggerganov/llama.cpp/pull/9787

[1] - https://github.com/ggml-org/llama.vscode

amrrs | 3 comments | 2 weeks ago
For those who don't know, he is the gg of `gguf`. Thank you for all your contributions! Literally the core of Ollama, LMStudio, Jan and multiple other apps!
kennethologist | 0 comments | 2 weeks ago
A. Legend. Thanks for having DeepSeek available so quickly in LM Studio.
sergiotapia | 0 comments | 2 weeks ago
well hot damn! killing it!
bangaladore | 1 comment | 2 weeks ago
Quick testing on vscode to see if I'd consider replacing Copilot with this. The biggest showstopper for me right now is that the output length is quite short. The default length is set to 256, but even if I up it to 4096, I'm not getting any larger chunks of code.

Is this because of a max latency setting, or the internal prompt, or am I doing something wrong? Or is it only really meant to autocomplete lines and not blocks like Copilot will?

Thanks :)

ggerganov | 1 comment | 2 weeks ago
There are 4 stopping criteria atm:

- Generation time exceeded (configurable in the plugin config)

- Number of tokens exceeded (not the case since you increased it)

- Indentation - stops generating if the next line has shorter indent than the first line

- Small probability of the sampled token

Most likely you are hitting the last criterion. It's something that should be improved in some way, but I am not very sure how. Currently, it uses a very basic token sampling strategy with custom threshold logic to stop generating when the token probability is too low. Likely this logic is too conservative.
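
A toy sketch of that last criterion (not the plugin's actual code; the token stream, probabilities and threshold below are invented):

    #include <cstdio>
    #include <string>
    #include <vector>

    // Toy illustration of probability-based stopping: accept greedily sampled
    // tokens until the model's confidence in the next token drops below a
    // threshold. All values here are made up for illustration.
    struct Sampled { std::string text; float prob; };

    int main() {
        std::vector<Sampled> stream = {
            {"for ", 0.92f}, {"(int ", 0.88f}, {"i = 0; ", 0.81f},
            {"i < n; ", 0.77f}, {"++i) ", 0.74f}, {"{\n", 0.69f},
            {"    frobnicate(i);\n", 0.08f},  // model is unsure here
        };

        const float p_min = 0.10f;  // hypothetical confidence cutoff
        std::string suggestion;
        for (const auto & t : stream) {
            if (t.prob < p_min) break;  // stop: low-probability token
            suggestion += t.text;
        }
        std::printf("suggestion:\n%s\n", suggestion.c_str());
    }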

bangaladore | 0 comments | 2 weeks ago
Hmm, interesting.

I didn't catch T_max_predict_ms and upped that to 5000ms for fun. Doesn't seem to make a difference, so I'm guessing you are right.

eklavya | 0 comments | 2 weeks ago
Thanks for sharing the vscode link. After trying it, I have disabled the continue.dev extension and ollama. For me this is wayyyyy faster.
jerpint | 0 comments | 2 weeks ago
Thank you for all of your incredible contributions!
liuliu | 1 comment | 2 weeks ago
KV cache shifting is interesting!

Just curious: how much of your code nowadays is completed by the LLM?

ggerganov | 3 comments | 2 weeks ago
Yes, I think it is surprising that it works.

I think a fairly large amount, though I can't give a good number. I have been using Github Copilot from the very early days, and with the release of Qwen Coder last year I fully switched to using local completions. I don't use the chat workflow to code though, only FIM.

menaerus | 1 comment | 2 weeks ago
Interesting approach.

Am I correct to understand that you're basically minimizing the latencies and required compute/mem-bw by avoiding the KV cache? And encoding the (local) context in the input tokens instead?

I ask this because you set the prompt/context size to 0 (--ctx-size 0) and the batch size to 1024 (-b 1024). The former would mean that llama.cpp will only use the context that is already encoded in the model itself and no local (code) context besides the one provided in the input tokens, but perhaps I misunderstood something.

Thanks for your contributions and obviously the large amount of time you take to document your work!

ggerganov | 1 comment | 2 weeks ago
The primary tricks for reducing the latency are around context reuse, meaning that the computed KV cache of tokens from previous requests is reused for new requests and thus computation is saved.

To get high-quality completions, you need to provide a large context of your codebase so that the generated suggestion is more in line with your style and implementation logic. However, naively increasing the context will quickly hit a computation limit because each request would need to compute (a.k.a. prefill) a lot of tokens.

The KV cache shifting used here is an approach to reuse the cache of old tokens by "shifting" them to new absolute positions in the new context. This way, a request that would normally require a context of, let's say, 10k tokens could be processed more quickly by computing just, let's say, 500 tokens and reusing the cache of the other 9.5k tokens, thus cutting the compute ~10 fold.

The --ctx-size 0 CLI arg simply tells the server to allocate memory buffers for the maximum context size supported by the model. For the case of Qwen Coder models, this corresponds to 32k tokens.

The batch sizes are related to how much local context around your cursor to use, along with the global context from the ring buffer. This is described in more detail in the links, but simply put: decreasing the batch size will make the completion faster, but with lower quality.
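
As a rough sketch of the accounting described above (not the server's actual implementation; the numbers mirror the 10k / 9.5k / 500 example):

    #include <cstdio>

    // Conceptual sketch of context reuse via KV cache shifting (not llama.cpp
    // internals): a run of tokens cached from a previous request reappears in
    // the new prompt at a different offset, so its cache entries are kept and
    // their positions are shifted by a delta; only the remaining tokens are
    // prefilled.
    int main() {
        const int n_prompt  = 10000;  // tokens in the new request
        const int reuse_len = 9500;   // cached run found again in the new prompt
        const int old_start = 0;      // where that run sat in the old context
        const int new_start = 500;    // where it sits in the new prompt

        const int shift      = new_start - old_start;  // delta applied to cached positions
        const int to_prefill = n_prompt - reuse_len;   // tokens that still need compute

        std::printf("shift cached positions by %d, prefill %d of %d tokens\n",
                    shift, to_prefill, n_prompt);
    }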

menaerus | 1 comment | 2 weeks ago
Ok, so --ctx-size with a value != 0 means that we can override the default model context size. Since for obvious computation cost reasons we cannot use a fresh 32k context for each request, the trick is to use the 1k context (a batch that includes local and semi-local code) and enrich it with previous model responses by keeping them in the KV cache and feeding them from there? And to increase the correlation between the current request and previous responses, you do the shifting in the KV cache?
ggerganov | 1 comment | 2 weeks ago
Yes, exactly. You can set --ctx-size to a smaller value if you know that you will not hit the limit of 32k - this will save you VRAM.

To control how much global context to keep in the ring buffer (i.e. the context that is being reused to enrich the local context), you can adjust the "ring_n_chunks" and "ring_chunk_size" settings. With the default settings, this amounts to about 8k tokens of context on our codebases when the ring buffer is full, which is a conservative setting. Increasing these numbers will make the context bigger and improve the quality, but will affect the performance.

There are a few other tricks to reduce the compute for the local context (i.e. the 1k batch of tokens), so that in practice, a smaller amount is processed. This further saves compute during the prefill.
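
Roughly, the ring buffer behaves like this (an illustrative sketch only; the capacity and chunk contents are made up, and the real knobs are the ring_n_chunks / ring_chunk_size settings above):

    #include <cstddef>
    #include <cstdio>
    #include <deque>
    #include <string>

    // Sketch of a ring buffer of global-context chunks: keep the N most
    // recently collected chunks of code and evict the oldest when full.
    struct ChunkRing {
        std::size_t max_chunks;
        std::deque<std::string> chunks;

        void push(std::string chunk) {
            if (chunks.size() == max_chunks) chunks.pop_front();  // evict oldest
            chunks.push_back(std::move(chunk));
        }
    };

    int main() {
        ChunkRing ring{3, {}};
        ring.push("chunk from file A");
        ring.push("chunk from file B");
        ring.push("chunk from file C");
        ring.push("chunk from file D");  // evicts "chunk from file A"
        for (const auto & c : ring.chunks) std::printf("%s\n", c.c_str());
    }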

menaerus | 0 comments | 2 weeks ago
Since qwen 2.5 turbo with 1M context size is advertised to be able to crunch ~30k LoC, I guess we can then say that the 32k qwen 2.5 model is capable of ~960 LoC, and therefore a 32k model with an upper bound of context set to 8k is capable of ~250 LoC?

Not bad.

gloflo | 2 comments | 2 weeks ago
What is FIM?
jjnoakes | 0 comments | 2 weeks ago
Fill-in-the-middle. If your cursor is in the middle of a file instead of at the end, then the LLM will consider text after the cursor in addition to the text before the cursor. Some LLMs can only look before the cursor; for coding, ones that can FIM work better (for me at least).
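
Roughly, a FIM request packs both sides of the cursor into a single prompt; for Qwen2.5-Coder-style models it looks something like this (the exact special-token names vary by model):

    <|fim_prefix|>code before the cursor<|fim_suffix|>code after the cursor<|fim_middle|>

and the model's output is whatever belongs in the middle.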
rav | 0 comments | 2 weeks ago
FIM is "fill in middle", i.e. completion in a text editor using context on both sides of the cursor.
LoganDark | 0 comments | 2 weeks ago
llama.cpp supports FIM?
attentive | 1 comment | 2 weeks ago
Is it correct to assume this plugin won't work with ollama?

If so, what's ollama missing?

mistercheph | 0 comments | 2 weeks ago
This plugin is designed specifically for the llama.cpp server API. If you want Copilot-like features with ollama, you can use an ollama instance as a drop-in replacement for GitHub Copilot with this plugin: https://github.com/bernardo-bruning/ollama-copilot

There is also https://github.com/olimorris/codecompanion.nvim which doesn't have text completion, but supports a lot of other AI editor workflows that I believe are inspired by Zed and supports ollama out of the box

nancyp | 1 comment | 2 weeks ago
TIL: VIM has its own language. Thanks Georgi for LLAMA.cpp!
nacs | 0 comments | 2 weeks ago
Vim is incredibly extensible.

You can use C or Vimscript, but editors like Neovim support Lua as well, which makes it really easy to write plugins.

halyconWays | 0 comments | 2 weeks ago
Please make one for Jetbrains' IDEs!
eigenvalue | 3 comments | 2 weeks ago
This guy is a national treasure and has contributed so much value to the open source AI ecosystem. I hope he’s able to attract enough funding to continue making software like this and releasing it as true “no strings attached” open source.
nacs | 0 comments | 2 weeks ago
> This guy is a national treasure

Agreed but he's an international treasure (his Github profile states Bulgaria).

feznyng | 1 comment | 2 weeks ago
They have: https://ggml.ai/ under the Company heading.
cosmojg | 1 comment | 2 weeks ago
Georgi Gerganov is the "gg" in "ggml"
acters | 0 comments | 5 days ago
Also the gg in gguf
frankfrank13 | 0 comments | 2 weeks ago
Hard agree. This alone replaces GH Copilot/Cursor ($10+ a month)
estreeper | 2 comments | 2 weeks ago
Very exciting - I'm a long-time vim user but most of my coworkers use VSCode, and I've been wanting to try out in-editor completion tools like this.

After using it for a couple hours (on Elixir code) with Qwen2.5-Coder-3B and no attempts to customize it, this checks a lot of boxes for me:

  - I pretty much want fancy autocomplete: filling in obvious things and saving my fingers the work, and these suggestions are pretty good
  - the default keybindings work for me, I like that I can keep current line or multi-line suggestions
  - no concerns around sending code off to a third-party
  - works offline when I'm traveling
  - it's fast!
So that I don't need to remember how to run the server, I'll probably set up a script that checks whether it's running, starts it in the background if not, and then runs vim, and alias vim to use that. I looked in the help documents but didn't see a way to disable the "stats" text after the suggestions, though I'm not sure it will bother me that much.
ggerganov | 1 comment | 2 weeks ago
Appreciate the feedback!

Currently, there isn't a user-friendly way to disable the stats from showing apart from modifying the "'show_info': 0" value directly in the plugin implementation. These things will be improved with time and will become more user-friendly.

A few extra optimizations will soon land which will further improve the experience:

- Speculative FIM

- Multiple suggestions

tomnipotent | 0 comments | 2 weeks ago
First extension I've used that perfectly autocompletes Go method receivers.

First tab completes just "func (t *Type)" so then I can type the first few characters of something I'm specifically looking for or wait for the first recommendation to kick in. I hope this isn't just a coincidence from the combination of model and settings...

douglee650 | 0 comments | 2 weeks ago
So I assume you have tried vscode vim mode. Would love to hear your thoughts. Are you on Mac/Linux or Windows?
msoloviev | 0 comments | 2 weeks ago
I wonder how the "ring context" works under the hood. I have previously had (and recently messed around with again) a somewhat similar project designed for a more toy/exploratory setting (https://github.com/blackhole89/autopen - demo video at https://www.youtube.com/watch?v=1O1T2q2t7i4), and one of the main problems to address definitively is the question of how to manage your KV cache cleverly so you don't have to constantly perform too much expensive recomputation whenever the buffer undergoes local changes.

The solution I came up with involved maintaining a tree of tokens branching whenever an alternative next token was explored, with full LLM state snapshots at fixed depth intervals so that the buffer would only have to be "replayed" for a few tokens when something changed. I wonder if there are some mathematical properties of how the important parts of the state (really, the KV cache, which can be thought of as a partial precomputation of the operation that one LLM iteration performs on the context) work that could have made this more efficient, like to avoid saving full snapshots or perhaps to be able to prune the "oldest" tokens out of a state efficiently.
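
For illustration, the snapshot/replay accounting looks roughly like this (a toy sketch; the interval and positions are made up):

    #include <cstdio>

    // With a full state snapshot every K tokens, an edit at token position p
    // only requires restoring the nearest snapshot at or before p and replaying
    // the few tokens in between, instead of recomputing the whole prefix.
    int main() {
        const int K        = 64;    // snapshot interval in tokens
        const int edit_pos = 1000;  // token index where the buffer changed

        const int snap_pos = (edit_pos / K) * K;   // nearest snapshot <= edit_pos
        const int replay   = edit_pos - snap_pos;  // tokens replayed to rebuild state there

        std::printf("restore snapshot at %d, replay %d tokens (instead of %d from scratch)\n",
                    snap_pos, replay, edit_pos);
    }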

(edit: Georgi's comment that beat me by 3 minutes appears to be pointing at information that would go some way to answer my questions!)

h14h | 2 comments | 2 weeks ago
A little bit of a tangent, but I'm really curious what benefits could come from integrating these LLM tools more closely with data from LSPs, compilers, and other static analysis tools.

Intuitively, it seems like you could provide much more context and better output as a result. Even better would be if you could fine-tune LLMs on a per-language basis and ship them alongside typical editor tooling.

A problem I see w/ these AI tools is that they work much better with old, popular languages, and I worry that this will grow as a significant factor when choosing a language. Anecdotally, I see far better results when using TypeScript than Gleam, for example.

It would be very cool to be able to install a Gleam-specific model that could be fed data from the LSP and compiler, and wouldn't constantly hallucinate invalid syntax. I also wonder if, with additional context & fine-tuning, you could make these models smaller and more feasible to run locally on modest hardware.

sdesol | 0 comments | 2 weeks ago
> work much better with old, popular languages

I think this will improve significantly over time when hardware becomes cheaper. As long as newer languages can map to older languages (syntax/function-wise), we should be able to generate enough synthetic data to make working with lesser-known languages easier.

rabiescow | 0 comments | 2 weeks ago
you are free to contribute your own gleam llms. They're only as good as their inputs, so if there are very few publicly available packages for a certain language, that's all the input they get...
mijoharas | 2 comments | 2 weeks ago
Can anyone compare this to Tabbyml?[0] I just set that up yesterday for emacs to check it out.

The context gathering seems very interesting[1], and very vim-integrated, so I'm guessing there isn't anything very similar for Tabby. I skimmed the docs and saw some stuff about context for the Tabby chat feature[2], which I'm not super interested in using (even if adding docs sounds nice), but nothing obvious for the autocompletion[3].

Does anyone have more insight or info to compare the two?

As a note, I quite like that the LLM context here "follows" what you're doing. It seems like a nice idea. Does anyone know if anyone else does something similar?

[0] https://www.tabbyml.com/

[1] https://github.com/ggerganov/llama.cpp/pull/9787#issue-25729... "global context onwards"

[2] https://tabby.tabbyml.com/docs/administration/context/

[3] https://tabby.tabbyml.com/docs/administration/code-completio...

mijoharas | 0 comments | 2 weeks ago
Ahhh, it seems that tabby does use RAG and context providers for the code completion. Interesting:

> During LLM inference, this context is utilized for code completion

Hmmm... I wonder what's better. As I'm coding I jump to and search relevant parts of the codebase to build up my own context for solving the problem, and I expect that's likely better than RAG. Llama.vim seems to follow this model, while tabby could theoretically get at things I'm not looking at/haven't looked at recently...

ghthor | 1 comment | 2 weeks ago
I’ve been using Tabby happily since May 2023
mijoharas | 0 comments | 2 weeks ago
Have you setup the context providers, and do you find that important for getting good completions?
dingnuts | 9 comments | 2 weeks ago
Is anyone actually getting value out of these models? I wired one up to Emacs and the local models all produce a huge volume of garbage output.

Occasionally I find a hosted LLM useful but I haven't found any output from the models I can run in Ollama on my gaming PC to be useful.

It's all plausible-looking but incorrect. I feel like I'm taking crazy pills when I read about others' experiences. Surely I am not alone?

remexre | 2 comments | 2 weeks ago
I work on compilers. A friend of mine works on webapps. I've seen Cursor give him lots of useful code, but it's never been particularly useful on any of the code of mine that I've tried it on.

It seems very logical to me that there'd be orders of magnitude more training data for some domains than others, and that existing models' skill is not evenly distributed cross-domain.

dkga | 0 comments | 2 weeks ago
This. Also across languages. For example, I suppose there is a lot more content in Python and JavaScript than in AppleScript. (And to be fair, not a lot of the Python suggestions I receive are actually mind-blowingly good.)
q0uaur | 3 comments | 2 weeks ago
i'm still patiently waiting for an easy way to point a model at some documentation, and make it actually use that.

My usecase is gdscript for godot games, and all the models i've tried so far use godot 2 stuff that's just not around anymore, even if you tell it to use godot 4 it gives way too much wrong output to be useful.

I wish i could just point it at the latest godot docs and have it give up to date answers. but seeing as that's still not a thing, i guess it's more complicated than i expect.

psytrx | 0 comments | 2 weeks ago
There's llms.txt [0], but it's not gaining much popularity.

My web framework of choice provides these [1], but they can't be injected into the LLM context without a fair bit of fuss. It would be a game changer if more LLM tools implemented them.

[0] https://llmstxt.org/ [1] https://svelte.dev/docs/llms

doctoboggan | 0 comments | 2 weeks ago
It's definitely a thing already. Look up "RAG" (Retrieval Augmented Generation). Most of the popular closed source companies offer RAG services via their APIs, and you can also do it with local llms using open-webui and probably many other local UIs.
mohsen1 | 0 comments | 2 weeks ago
Cursor can follow links
fovc | 2 comments | 2 weeks ago
> I feel like I'm taking crazy pills when I read about others' experiences. Surely I am not alone?

You're not alone :-) I asked a very similar question about a month ago: https://news.ycombinator.com/item?id=42552653 and have continued researching since.

My takeaway was that autocomplete, boilerplate, and one-off scripts are the main use cases. To use an analogy, I think the code assistants are more like an upgrade from a handsaw to power tools and less like hiring a carpenter. (Which is not what the hype engine will claim.)

For me, only the one-off script (write-only code) use-case is useful. I've had the best results on this with Claude.

Emacs abbrevs/snippets (+ choice of language) virtually eliminate the boilerplate problem, so I don't have a use for assistants there.

For autocomplete, I find that LSP completion engines provide 95% of the value for 1% of the latency. Physically typing the code is a small % of my time/energy, so the value is more about getting the right names, argument order, and other fiddly details I may not remember exactly. But I find that LSP-powered autocomplete and tooltips largely solve those challenges.

sdesol | 1 comment | 2 weeks ago
> like an upgrade from handsaw to power tools and less like hiring a carpenter. (Which is not what the hype engine will claim).

I 100% agree with the not hiring a carpenter part, but we need a better way to describe the improvement over just a handsaw. If you have domain knowledge, it can become an incredible design aid/partner. Here is a real-world example of how it is changing things for me.

I have a TreeTable component which I built 100% with LLM and when I need to update it, I just follow the instructions in this chat:

http://beta.gitsense.com/?chat=dd997ccd-5b37-4591-9200-b975f...

Right now, I am thinking about adding folders to organize chats, and here is the chat with DeepSeek for that feature:

http://beta.gitsense.com/?chat=3a94ce40-86f2-4e68-b5d7-88d33...

I'm thoroughly impressed as it suggested data structures and more for me to think about. And here I am asking it to review what was discussed to make the information easier to understand.

http://beta.gitsense.com/?chat=8c6bf5db-49a7-4511-990c-5e6ad...

All of this cost me less than a penny. I'm still waiting for my Anthropic API limit to reset and I'm going to ask Sonnet for feedback as well, and I figure that will cost me 5 cents.

I fully understand the not hiring a carpenter part, but I think what LLMs bring to the table is SO MUCH more than an upgrade to a power tool. If you know what you need and can clearly articulate it well enough, there really is no limit to what you can build with proper instructions, provided the solution is in its training data and you have a good enough BS detector.

strogonoff | 1 comment | 2 weeks ago
> If you know what you need and can clearly articulate it well enough, there really is no limit to what you can build with proper instructions, provided the solution is in its training data and you have a good enough BS detector.

In other words: you must already know how to do what you are asking the LLM to do.

In other words: it may make sense if typing speed is your bottleneck and you are dealing with repetitive tasks that have well been solved many times (i.e., you want an advanced autocomplete).

This basically makes it useless for me. Typing speed is not a bottleneck, I automate or abstract away repetition, and I seek novel tasks that have not yet been well solved—or I just reuse those existing solutions (maybe even contributing to respective OSS projects).

In the cases where something new was needed in areas that I don't know well, it completely failed me. NB: I never actually used it myself, I only gave in to a suggestion by a friend (whom LLMs reportedly help) to use his LLM-wrangling skills in a thorny case.

sdesol | 1 comment | 2 weeks ago
> In other words: you must already know how to do what you are asking the LLM to do.

Those that will benefit the most will be senior developers. They might not know the exact problem or language, but they should know enough to guide the LLM.

> In other words: it may make sense if typing speed is your bottleneck and you are dealing with repetitive tasks that have well been solved many times (i.e., you want an advanced autocomplete).

I definitely use an LLM as a typist and I love it. I've come to a point now where I mentally ask myself, "Will it take more time to do it myself or to explain it?" Another factor is cost, as you can rack up a bill pretty quickly with Claude Sonnet if you ask it to generate a lot of code.

But honestly, what I love about integrating an LLM into my workflow is that I'm better able to capture and summarize my thought process. I've also found LLMs can better articulate my thoughts most of the time. If you know how to prompt an LLM, it almost feels like you are working with a knowledgeable colleague.

> I never actually used it myself, I only gave into a suggestion by a friend (whom LLM reportedly helps) to use his LLM wrangling skills in a thorny case.

LLMs are definitely not for everyone, but I personally cannot see myself coding without LLMs now. Just asking for variable name suggestions is pretty useful. Or describing something vague and having it properly articulate my thoughts is amazing. I think we like to believe what we do is rather unique, but I think a lot of things that we need to do have already been done. Whether it is in the training data is another thing, though.

strogonoff | 0 comments | 2 weeks ago
> They might not know the exact problem or language, but they should know enough to guide the LLM.

I was in this exact situation. I worked in an unfamiliar area with a hardware SDK in C that I needed to rewrite for my runtime, or at least call its C functions from my runtime, or at least understand how the poorly written (but working) example SDK invocation works in C by commenting it. The LLMs failed to help with any of that; they produced code that was 1) incorrect (literally doing the opposite of what's expected) and 2) full of obvious comments and missing implementations (like a "cleanup if needed" comment in the empty deinit function).

Later it turned out there is actually an SDK for my runtime, I just failed to find it at first, so the code the LLM could use or tell me about actually existed (just not very easy to find).

Those were two top LLMs as of December 2024. It left me unimpressed.

I don’t think I would be compelled to guide them; once I understood how the code works, it was faster to just write it or read the relevant reference.

My friend, who volunteered to waste those precious tokens to help with my project, does use chatbots a lot while coding, but he’s more of an intermediate than senior developer.

> Just asking for variable name suggestions is pretty useful.

I can’t see myself asking anyone, much less an LLM, for the name of a variable. I am known to ask about and/or look up, say, subject domain terminology that I then use when naming things, but to name things well you first need to have a full picture of what you are making. Our job is to have one…

barrell | 0 comments | 2 weeks ago
I think you make a very good point about your existing devenv. I recently turned off GitHub copilot after maybe 2 years of use — I didn’t realize how often I was using its completions over LSPs.

Quality of Life went up massively. LSPs and nvim-cmp have come a long way (although one of these days I’ll try blink.cmp)

sangnoir | 0 comments | 2 weeks ago
> Is anyone actually getting value out of these models?

I've found incredible value in having LLMs help me write unit tests. The quality of the test code is far from perfect, but AI tooling - Claude Sonnet specifically - is good at coming up with reasonable unit test cases after I've written the code under test (sue me, TDD zealots). I probably have to fix 30% of the tests and expand the test cases, but I'd say it cuts the number of test code lines I author by more than 80%. This has decreased the friction so much, I've added Continuous Integration to small, years-old personal projects that had no tests before.

I've found lesser value with refactoring and adding code docs, but that's more of autocomplete++ using natural language rather than AST-derived code.

coder543 | 4 comments | 2 weeks ago
> I wired one up

“One”? Wired up how? There is a huge difference between the best and worst. They aren’t fungible. Which one? How long ago? Did it even support FIM (fill in middle), or was it blindly guessing from the left side? Did the plugin even gather appropriate context from related files, or was it only looking at the current file?

If you try Copilot or Cursor today, you can experience what “the best” looks like, which gives you a benchmark to measure smaller, dumber models and plugins against. No, Copilot and Cursor are not available for emacs, as far as I know… but if you want to understand if a technology is useful, you don’t start with the worst version and judge from that. (Not saying emacs itself is the worst… just that without more context, my assumption is that whatever plugin you probably encountered was probably using a bottom tier model, and I doubt the plugin itself was helping that model do its best.)

There are some local code completion models that I think are perfectly fine, but I don’t know where you will draw the line on how good is good enough. If you can prove to yourself that the best models are good enough, then you can try out different local models and see if one of those works for you.

Lanedo | 0 comments | 2 weeks ago
There is https://github.com/copilot-emacs/copilot.el that gets copilot to work in emacs via JS glue code and binaries provided by copilot.vim.

I hacked up a slim alternative localpilot.js layer that uses llama-server instead of the copilot API, so copilot.el can be used with local LLMs, but I find the copilot.el overlays kinda buggy... It'd probably be better to instead write a llamapilot.el for local LLMs from scratch for emacs.

b5n | 0 comments | 2 weeks ago
Emacs has had multiple LLM integration packages available for quite a while (relative to the rise of LLMs). `gptel` supports multiple providers including anthropic, openai, ollama, etc.

https://github.com/karthink/gptel

yoyohello13 | 0 comments | 2 weeks ago
whimsicalism | 0 comments | 2 weeks ago
there's avante.nvim
colonial | 0 comments | 2 weeks ago
I do not, but I suspect that's mainly because I far and away prefer statically typed languages with powerful LSP implementations.

Copilot, Ollama, and the others have all been strictly inferior to rust-analyzer. The suggested code is often just straight up invalid and takes just long enough to be annoying. Compare that to just typing '.'/'::' + a few characters to fuzzy-select what I'm looking for + enter.

ETA: Both did save me a few seconds here and there when obvious "repetition with a few tweaks each line" was involved, but to me that is not worth a monthly subscription or however much wall power my GPU was consuming.

codingdave | 0 comments | 2 weeks ago
Yep - I don't get a ton of value out of autocompletions, but I get decent value from asking an LLM how they would approach more complex functions or features. I rarely get code back that I can copy/paste, but reading their output is something I can react to - whether it is good or bad, just having a starting point speeds up the design of new features vs. me burning time creating my first/worst draft. And that is the goal here, isn't it? To get some productivity gains?

So maybe it is just a difference in perspective? Even incorrect code and bad ideas can still be helpful. It is only useless if you expect them to hand you working code.

whimsicalism | 0 comments | 2 weeks ago
i don't find value from the models that it makes economical sense to self-host. i do get value out of a llama 70b, for instance, though
righthand | 0 comments | 2 weeks ago
Honestly just disabled my TabNine plugin and have found that LSP server is good enough for 99% of what I do. I really don’t need hypothetical output suggested to me. I’m comfortable reading docs though so others may feel different.
tomr75 | 0 comments | 2 weeks ago
try cursor
frankfrank13 | 0 comments | 2 weeks ago
Is this more or less the same as your VSCode version? (https://github.com/ggml-org/llama.vscode)
binary132 | 0 comments | 2 weeks ago
I am curious to see what will be possible with consumer grade hardware and more improvements to quantization over the next decade. Right now, even a 24GB gpu with the best models isn’t able to match the barely acceptable performance of hosted services I’m not willing to even pay $20 a month for.
mohsen1 | 0 comments | 2 weeks ago
Terminal coding FTW!

And when you're really stuck you can use DeepSeek R1 for a deeper analysis in your terminal using `askds`

https://github.com/bodo-run/askds

opk | 5 comments | 2 weeks ago
Has anyone actually got this llama stuff to be usable on even moderate hardware? I find it just crashes because it doesn't find enough RAM. I've got 2G of VRAM on an AMD graphics card and 16G of system RAM and that doesn't seem to be enough. The impression I got from reading up was that it worked for most Apple stuff because the memory is unified and other than that, you need very expensive Nvidia GPUs with lots of VRAM. Are there any affordable options?
horsawlarway | 0 comments | 2 weeks ago
Yes. Although I suspect my definition of "moderate hardware" doesn't really match yours.

I can run 2b-14b models just fine on the CPU on my laptop (framework 13 with 32gb ram). They aren't super fast, and the 14b models have limited context length unless I run a quantized version, but they run.

If you just want generation and it doesn't need to be fast... drop the $200 for 128gb of system ram, and you can run the vast majority of the available models (up to ~70b quantized). Note - it won't be quick (expect 1-2 tokens/second, sometimes less).

If you want something faster in the "low end" range still - look at picking up a pair of Nvidia p40s (~$400) which will give you 16gb of ram and be faster for 2b to 7b models.

If you want to hit my level for "moderate", I use 2x3090 (I bought refurbed for ~$1600 a couple years ago) and they do quite a bit of work. Ex - I get ~15t/s generation for 70b 4 quant models, and 50-100t/s for 7b models. That's plenty usable for basically everything I want to run at home. They're faster than the m2 pro I was issued for work, and a good chunk cheaper (the m2 was in the 3k range).

That said - the m1/m2 macs are generally pretty zippy here, I was quite surprised at how well they perform.

Some folks claim to have success with the k80s, but I haven't tried and while 24g vram for under $100 seems nice (even if it's slow), the linux compatibility issues make me inclined to just go for the p40s right now.

I run some tasks on much older hardware (ex - willow inference runs on an old 4gb gtx 970 just fine)

So again - I'm not really sure we'd agree on moderate (I generally spend ~$1000 every 4-6 years to build a machine to play games, and the machine you're describing would match the specs for a machine I would have built 12+ years ago)

But you just need literal memory. Bumping to 32gb of system ram would unlock a lot of stuff for you (at low speeds) and costs $50. Bumping to 124gb only costs a couple hundred, and lets you run basically all of them (again - slowly).

zamadatix | 1 comment | 2 weeks ago
2G is pretty low and the sizes of things you can get to run fast on that setup probably aren't particularly attractive. "Moderate hardware" varies, but you can grab a 12 GB RTX 3060 on ebay for ~$200. You can get a lot more RAM for $200, but it'll be so slow compared to the GPU that I'm not sure I'd recommend it if you actually want to use things like this interactively.

If "moderate hardware" is your average office PC then it's unlikely to be very usable. Anyone with a gaming GPU from the last several years should be workable though.

horsawlarway | 0 comments | 2 weeks ago
I'll second this, actually - $250 for a 12gb rtx 3060 is probably a better buy than $400 for 2xp40s for 16gb.

It'd been a minute since I checked refurb prices and $250 for the rtx 3060 12gb is a good price.

Easier on the rest of the system than a 2x card setup, and is probably a drop in replacement.

bhelkey | 0 comments | 2 weeks ago
Have you tried Ollama [1]? You should be able to run an 8b model in RAM and a 1b model in VRAM.

[1] https://news.ycombinator.com/item?id=42069453

basilgohar | 0 comments | 2 weeks ago
I can run 7B models with Q4 quantization on a 7000 series AMD APU without GPU acceleration quite acceptably fast. This is with DDR5-5600 RAM, which is the current roadblock for performance.

Larger models work but slow down. I do have 64GB of RAM but I think 32 could work. 16GB is pushing it, but should be possible if you don't have anything else open.

Memory requirements depend on numerous factors. 2GB VRAM is not enough for most GenAI stuff today.

whimsicalism | 0 comments | 2 weeks ago
2g of vram is pretty bad...
mrinterweb | 0 comments | 2 weeks ago
Been using this for a couple hours, and this is really nice. It is a great alternative to something like Github Copilot. Appreciate how simple and fast this is.
colordrops | 0 comments | 2 weeks ago
I've seen several posts and projects like this. Is there a summary/comparison somewhere of the various ways of running local completion/copilot?
s-skl | 0 comments | 2 weeks ago
Really awesome work! Does anyone know what tool/terminal configuration he's using in the video demo to embed CPU/GPU usage in the terminal that way? Much appreciated :)
jerpint | 2 comments | 2 weeks ago
It’s funny because I actually use vim mostly when I don’t want LLM assisted code. Sometimes it just gets in the way.

If I do, I load up cursor with vim bindings.

rapind | 1 comment | 2 weeks ago
What I'd really like is a non-intrusive LLM that I can put on a third monitor, which sees what I'm working on while I work and tries to mirror it and make suggestions WITHOUT interjecting itself into my editor. Then I just want a mappable prompt command (like <leader>P) where I can provide feedback / suggestions (or maybe I just do that via mic).

This way I can choose to just ignore it for the most part, but if I see something interesting, I can refine it, prompt, cherry pick, etc. I really don't want autocomplete for anything.

survirtual | 1 comment | 2 weeks ago
Hey I was working on something like this, because I wanted something similar. I had it confined to just the terminal though, because I don't want just a vim helper, I want an everything helper.

I was going to have the helper generate suggestions in a separate window, which could be moved to any monitor or screen. It would make suggestions and autocompletes, and you can chat with it to generate commands etc on the fly.

Maybe I will pick it up again soon.

rapind | 0 comments | 2 weeks ago
That sounds pretty great! Kind of like an AI pair.
renewiltord | 1 comment | 2 weeks ago
Funny. I think the most common usage I have is using it at the command line to write commands, with vim set as my EDITOR, so the AI completion really helps.

This will help for offline support (on planes and such).

qup | 2 comments | 2 weeks ago
Can you more specifically talk about how you use this, like with a small example?
renewiltord | 0 comments | 2 weeks ago
Yes, I mentioned it here first and haven't changed it since (Twitter link and links include video of use)

https://news.ycombinator.com/item?id=34769611

Which leads (used to lead?) here https://wiki.roshangeorge.dev/index.php/AI_Completion_In_The...

VMG | 0 comments | 2 weeks ago
- composing a commit message

- anything bash script related

morcus | 2 comments | 2 weeks ago
Looking for advice from someone who knows about the space - suppose I'm willing to go out and buy a card for this purpose, what's a modestly priced graphics card with which I can get somewhat usable results running local LLMs?
loudmax | 0 comments | 2 weeks ago
The bottleneck for running LLMs on consumer grade equipment is the amount of VRAM your GPU has. VRAM is RAM that's physically built into the unit and it has much higher memory bandwidth than regular system RAM.

Obviously, newer GPUs will run faster than older GPUs, but you need more VRAM to be able to run larger models. A small LLM that fits into an RTX 4060's 8GB of VRAM will run faster there than it would on an older RTX 3090. But the 3090 has 24GB of VRAM, so it can run larger LLMs that the 4060 simply can't handle.

Llama.cpp can split your LLM onto multiple GPUs, or split part of it onto the CPU using system RAM, though that last option is much much slower. The more of the model you can fit into VRAM, the better.

The Apple M-series macbooks have unified memory, so the GPU has higher bandwidth access to system RAM than would be available over a PCIe card. They're not as powerful as Nvidia GPUs, but they're a reasonable option for running larger LLMs. It's also worth considering AMD and Intel GPUs, but most of the development in the ML space is happening on Nvidia's CUDA architecture, so bleeding edge stuff tends to be Nvidia first and other architectures later, if at all.

estreeper | 0 comments | 2 weeks ago
This is a tough question to answer, because it depends a lot on what you want to do! One way to approach it may be to look at what models you want to run and check the amount of VRAM they need. A back-of-the-napkin method taken from here[0] is:

    VRAM (GB) = 1.2 * number of parameters (in billions) * bits per parameter / 8
The 1.2 is just an estimation factor to account for the VRAM needed for things that aren't model parameters.

Because quantization is often nearly free in terms of output quality, you should usually look for quantized versions. For example, Llama 3.2 uses 16-bit parameters but has a 4-bit quantized version, and looking at the formula above you can see that will allow you to run a 4x larger model.
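
As a rough worked example of the formula above (ballpark only), a 7B model at 4-bit quantization comes out to about

    VRAM ≈ 1.2 * 7 * 4 / 8 ≈ 4.2 GB

while the same 7B model at 16 bits would be closer to 1.2 * 7 * 16 / 8 ≈ 16.8 GB.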

Having enough VRAM will allow you to run a model, but performance is dependent on a lot of other factors. For a much deeper dive into how all of this works along with price/dollar recommendations (though from last year!), Tim Dettmers wrote this excellent article: https://timdettmers.com/2023/01/30/which-gpu-for-deep-learni...

Worth mentioning for the benefit of those who don't want to buy a GPU: there are also models which have been converted to run on CPU.

[0] https://blog.runpod.io/understanding-vram-and-how-much-your-...

cfiggers | 0 comments | 2 weeks ago
Do people with "Copilot+ PCs" get benefits running stuff like this from the much-vaunted AI coprocessors in e.g. Snapdragon X Elite chips?
awwaiid | 1 comment | 2 weeks ago
The blinking cursor in demo videos is giving me heart palpitations! But this is super cool. It makes me wonder how Linux is doing on M* hardware.
amelius | 0 comments | 2 weeks ago
> It makes me wonder how Linux is doing on M* hardware.

Not so great. Zero legal status. No blessing by Apple. Can be taken down whenever Apple execs decide to.

amelius | 1 comment | 2 weeks ago
This looks very interesting. Can this be trained on the user's codebase, or is the idea that everything must fit inside the context buffer?
larodi | 0 comments | 2 weeks ago
Models don’t get trained on your codebase in scenarios like Continue and Copilot-assisted completion. It’s RAG at best, and it depends on how the RAG is built.

So it all depends on what fits in the context, and it will stay that way for the foreseeable future, as training is far more expensive than RAG.

entelechy0 | 0 comments | 2 weeks ago
I use this on-and-off again. It is nice that I can flip between this and Copilot by commenting out one line in my init.lua