ggerganov | 9 comments | 2 weeks ago
I highly recommend taking a look at the technical details of the server implementation that enables large context usage with this plugin - I think it is interesting and has some cool ideas [0].
Also, the same plugin is available for VS Code [1].
Let me know if you have any questions about the plugin - happy to explain. Btw, the performance has improved compared to what is seen in the README videos thanks to client-side caching.
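If you want to poke at the server directly, here is a minimal sketch of the kind of FIM request the plugin sends to llama-server's /infill endpoint. The field names (input_prefix, input_suffix, n_predict) and the response key are my assumptions based on the llama.cpp server docs, and the port is just an example; check the README for your server version.

    import json
    import urllib.request

    # Minimal FIM request sketch (assumed field names; verify against the
    # llama.cpp server README for your version). The server is assumed to be
    # running locally with a FIM-capable model such as Qwen2.5-Coder.
    payload = {
        "input_prefix": "def fib(n):\n    ",     # code before the cursor
        "input_suffix": "\n\nprint(fib(10))\n",  # code after the cursor
        "n_predict": 64,                         # cap on generated tokens
    }

    req = urllib.request.Request(
        "http://127.0.0.1:8012/infill",          # example port, not canonical
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        print(json.loads(resp.read())["content"])  # completion text (assumed key)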
amrrs | 3 comments | 2 weeks ago
kennethologist | 0 comments | 2 weeks ago
sergiotapia | 0 comments | 2 weeks ago
bangaladore | 1 comment | 2 weeks ago
Is this because of a max latency setting, or the internal prompt, or am I doing something wrong? Or is it only really made to autocomplete lines and not blocks like Copilot will?
Thanks :)
ggerganov | 1 comment | 2 weeks ago
- Generation time exceeded (configurable in the plugin config)
- Number of tokens exceeded (not the case since you increased it)
- Indentation - stops generating if the next line has a shorter indent than the first line
- Small probability of the sampled token
Most likely you are hitting the last criterion. It's something that should be improved in some way, but I am not very sure how. Currently, it is using a very basic token sampling strategy with custom threshold logic to stop generating when the token probability is too low. Likely this logic is too conservative.
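For the curious, a toy sketch of what such a probability-threshold stop looks like in principle (this is not the actual server code, which also handles the time, token-count, and indentation criteria above):

    # Toy sketch of a low-probability stop criterion, not llama.cpp's code.
    # `sample_next` is a hypothetical callback that returns the next token
    # and the probability the model assigned to it.
    def generate_suggestion(sample_next, context, p_min=0.05, max_tokens=128):
        out = []
        for _ in range(max_tokens):
            token, prob = sample_next(context + out)
            if prob < p_min:   # sampled token is too unlikely -> stop the suggestion
                break
            out.append(token)
        return out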
bangaladore | 0 comments | 2 weeks ago
I didn't catch T_max_predict_ms and upped that to 5000ms for fun. Doesn't seem to make a difference, so I'm guessing you are right.
eklavya | 0 comments | 2 weeks ago
jerpint | 0 comments | 2 weeks ago
liuliu | 1 comment | 2 weeks ago
Just curious: how much of your code nowadays is completed by an LLM?
ggerganov | 3 comments | 2 weeks ago
I think a fairly large amount, though I can't give a good number. I have been using GitHub Copilot from the very early days, and with the release of Qwen Coder last year I have fully switched to using local completions. I don't use the chat workflow to code though, only FIM.
menaerus | 1 comment | 2 weeks ago
Am I correct to understand that you're basically minimizing the latencies and required compute/mem-bw by avoiding the KV cache? And encoding the (local) context in the input tokens instead?
I ask this because you set the prompt/context size to 0 (--ctx-size 0) and the batch size to 1024 (-b 1024). The former would mean that llama.cpp will only be using the context that is already encoded in the model itself and no local (code) context besides the one provided in the input tokens, but perhaps I misunderstood something.
Thanks for your contributions and obviously the large amount of time you take to document your work!
ggerganov | 1 comment | 2 weeks ago
To get high-quality completions, you need to provide a large context of your codebase so that the generated suggestion is more in line with your style and implementation logic. However, naively increasing the context will quickly hit a computation limit, because each request would need to compute (a.k.a. prefill) a lot of tokens.
The KV cache shifts used here are an approach to reuse the cache of old tokens by "shifting" them to new absolute positions in the new context. This way, a request that would normally require a context of, let's say, 10k tokens can be processed much more quickly by computing just, say, 500 tokens and reusing the cache for the other 9.5k tokens, thus cutting the compute ~10-fold.
The --ctx-size 0 CLI arg simply tells the server to allocate memory buffers for the maximum context size supported by the model. For the case of Qwen Coder models, this corresponds to 32k tokens.
The batch sizes are related to how much local context around your cursor to use, along with the global context from the ring buffer. This is described in more detail in the links, but simply put: decreasing the batch size will make the completion faster, but with less quality.
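To put rough numbers on the saving described above, a tiny illustrative calculation (the figures are the ones from this comment, not measurements):

    # Illustrative only: with KV cache shifting, only the genuinely new
    # tokens need a fresh prefill pass.
    ctx_tokens    = 10_000   # tokens the request would naively prefill
    reused_tokens = 9_500    # tokens whose cached KV entries are shifted/reused
    new_tokens    = ctx_tokens - reused_tokens

    print(f"fresh prefill needed for {new_tokens / ctx_tokens:.0%} of the context")  # -> 5%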
menaerus | 1 comment | 2 weeks ago
ggerganov | 1 comment | 2 weeks ago
To control how much global context to keep in the ring buffer (i.e. the context that is being reused to enrich the local context), you can adjust "ring_n_chunks" and "ring_chunk_size". With the default settings, this amounts to about 8k tokens of context on our codebases when the ring buffer is full, which is a conservative setting. Increasing these numbers will make the context bigger, which will improve the quality but will affect the performance.
There are a few other tricks to reduce the compute for the local context (i.e. the 1k batch of tokens), so that in practice, a smaller amount is processed. This further saves compute during the prefill.
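If it helps to picture the ring buffer, here is a conceptual sketch (hypothetical helper, not the plugin's code): a bounded list of recently seen code chunks where the oldest chunk is evicted once the limit is reached. The values are illustrative, not necessarily the plugin defaults.

    from collections import deque

    # Conceptual sketch only; the plugin is more careful about what it
    # gathers and when. Values below are illustrative.
    RING_N_CHUNKS   = 16   # max number of chunks kept around
    RING_CHUNK_SIZE = 64   # lines per chunk

    ring = deque(maxlen=RING_N_CHUNKS)  # oldest chunk is dropped automatically

    def remember(lines):
        """Hypothetically called when the editor visits or yanks some code."""
        for i in range(0, len(lines), RING_CHUNK_SIZE):
            ring.append(lines[i:i + RING_CHUNK_SIZE])

    def extra_context():
        # joined and sent along with the local prefix/suffix around the cursor
        return "\n".join("\n".join(chunk) for chunk in ring)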
menaerus | 0 comments | 2 weeks ago
Not bad.
gloflo | 2 comments | 2 weeks ago
jjnoakes | 0 comments | 2 weeks ago
rav | 0 comments | 2 weeks ago
LoganDark | 0 comments | 2 weeks ago
attentive | 1 comment | 2 weeks ago
If so, what's ollama missing?
mistercheph | 0 comments | 2 weeks ago
There is also https://github.com/olimorris/codecompanion.nvim which doesn't have text completion, but supports a lot of other AI editor workflows that I believe are inspired by Zed and supports ollama out of the box
nancyp | 1 comment | 2 weeks ago
nacs | 0 comments | 2 weeks ago
You can use C or Vimscript, but programs like Neovim support Lua as well, which makes it really easy to make plugins.
halyconWays | 0 comments | 2 weeks ago
eigenvalue | 3 comments | 2 weeks ago
nacs | 0 comments | 2 weeks ago
Agreed but he's an international treasure (his Github profile states Bulgaria).
feznyng | 1 comment | 2 weeks ago
cosmojg | 1 comment | 2 weeks ago
acters | 0 comments | 5 days ago
frankfrank13 | 0 comments | 2 weeks ago
estreeper | 2 comments | 2 weeks ago
After using it for a couple hours (on Elixir code) with Qwen2.5-Coder-3B and no attempts to customize it, this checks a lot of boxes for me:
- I pretty much want fancy autocomplete: filling in obvious things and saving my fingers the work, and these suggestions are pretty good
- the default keybindings work for me, I like that I can keep current line or multi-line suggestions
- no concerns around sending code off to a third-party
- works offline when I'm traveling
- it's fast!
So that I don't need to remember how to run the server, I'll probably set up a script that checks if it's running, starts it in the background if not, and then runs vim, and alias vim to use that. I looked in the help documents but didn't see a way to disable the "stats" text after the suggestions, though I'm not sure it will bother me that much.
ggerganov | 1 comment | 2 weeks ago
Currently, there isn't a user-friendly way to disable the stats from showing apart from modifying the "'show_info': 0" value directly in the plugin implementation. These things will be improved with time and will become more user-friendly.
A few extra optimizations will soon land which will further improve the experience:
- Speculative FIM
- Multiple suggestions
tomnipotent | 0 comments | 2 weeks ago
First tab completes just "func (t *Type)" so then I can type the first few characters of something I'm specifically looking for or wait for the first recommendation to kick in. I hope this isn't just a coincidence from the combination of model and settings...
douglee650 | 0 comments | 2 weeks ago
msoloviev | 0 comments | 2 weeks ago
The solution I came up with involved maintaining a tree of tokens branching whenever an alternative next token was explored, with full LLM state snapshots at fixed depth intervals so that the buffer would only have to be "replayed" for a few tokens when something changed. I wonder if there are some mathematical properties of how the important parts of the state (really, the KV cache, which can be thought of as a partial precomputation of the operation that one LLM iteration performs on the context) work that could have made this more efficient, like to avoid saving full snapshots or perhaps to be able to prune the "oldest" tokens out of a state efficiently.
(edit: Georgi's comment that beat me by 3 minutes appears to be pointing at information that would go some way to answer my questions!)
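In case it is useful to someone, a bare-bones sketch of the snapshot scheme described above (hypothetical model interface with .state and .feed(), not the code from that project): the full state is copied every SNAPSHOT_EVERY tokens, and re-evaluating a branch means restoring the nearest earlier snapshot and replaying only the handful of tokens after it.

    import copy

    SNAPSHOT_EVERY = 8  # copy the full model state every N tokens of depth

    class TokenTree:
        """Sketch only. The model is assumed to expose a copyable `.state`
        and a `.feed(token)` method that advances it by one token."""
        def __init__(self, model):
            self.model = model
            self.nodes = []  # (token, snapshot_or_None, parent_index_or_None)

        def append(self, token, parent=None):
            # caller is expected to have replayed the model to `parent` first
            self.model.feed(token)
            depth = self._depth(parent) + 1
            snap = copy.deepcopy(self.model.state) if depth % SNAPSHOT_EVERY == 0 else None
            self.nodes.append((token, snap, parent))
            return len(self.nodes) - 1

        def replay_to(self, idx):
            # walk up to the nearest snapshotted ancestor, then replay forward
            pending = []
            while idx is not None and self.nodes[idx][1] is None:
                pending.append(self.nodes[idx][0])
                idx = self.nodes[idx][2]
            if idx is not None:
                self.model.state = copy.deepcopy(self.nodes[idx][1])
            # else: no snapshot on this path; assume the model was just reset
            for tok in reversed(pending):  # at most SNAPSHOT_EVERY - 1 tokens
                self.model.feed(tok)

        def _depth(self, idx):
            d = 0
            while idx is not None:
                d += 1
                idx = self.nodes[idx][2]
            return d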
h14h | 2 comments | 2 weeks ago
Intuitively, it seems like you could provide much more context and better output as a result. Even better would be if you could fine-tune LLMs on a per-language basis and ship them alongside typical editor tooling.
A problem I see w/ these AI tools is that they work much better with old, popular languages, and I worry that this will grow as a significant factor when choosing a language. Anecdotally, I see far better results when using TypeScript than Gleam, for example.
It would be very cool to be able to install a Gleam-specific model that could be fed data from the LSP and compiler, and wouldn't constantly hallucinate invalid syntax. I also wonder if, with additional context & fine-tuning, you could make these models smaller and more feasible to run locally on modest hardware.
sdesol | 0 comments | 2 weeks ago
I think this will improve significantly over time as hardware becomes cheaper. As long as newer languages can map to older languages (syntax/function-wise), we should be able to generate enough synthetic data to make working with lesser-known languages easier.
rabiescow | 0 comments | 2 weeks ago
mijoharas | 2 comments | 2 weeks ago
The context gathering seems very interesting[1], and very vim-integrated, so I'm guessing there isn't anything very similar for Tabby. I skimmed the docs and saw some stuff about context for the Tabby chat feature[2], which I'm not super interested in using (even if adding docs to the context sounds nice), but nothing obvious for the autocompletion[3].
Does anyone have more insight or info to compare the two?
As a note, I quite like that the LLM context here "follows" what you're doing. It seems like a nice idea. Does anyone know if anyone else does something similar?
[1] https://github.com/ggerganov/llama.cpp/pull/9787#issue-25729... "global context onwards"
[2] https://tabby.tabbyml.com/docs/administration/context/
[3] https://tabby.tabbyml.com/docs/administration/code-completio...
mijoharas | 0 comments | 2 weeks ago
> During LLM inference, this context is utilized for code completion
Hmmm... I wonder what's better. As I'm coding I jump and search to relevant parts of the codebase to build up my own context for solving the problem, and I expect that's likely better than RAG. Llama.vim seems to follow this model, while tabby could theoretically get at things I'm not looking at/haven't looked at recently...
ghthor | 1 comment | 2 weeks ago
mijoharas | 0 comments | 2 weeks ago
dingnuts | 9 comments | 2 weeks ago
Occasionally I find a hosted LLM useful but I haven't found any output from the models I can run in Ollama on my gaming PC to be useful.
It's all plausible-looking but incorrect. I feel like I'm taking crazy pills when I read about others' experiences. Surely I am not alone?
remexre | 2 comments | 2 weeks ago
It seems very logical to me that there'd be orders of magnitude more training data for some domains than others, and that existing models' skill is not evenly distributed cross-domain.
dkga | 0 comments | 2 weeks ago
q0uaur | 3 comments | 2 weeks ago
My use case is GDScript for Godot games, and all the models I've tried so far use Godot 2 stuff that's just not around anymore; even if you tell them to use Godot 4, they give way too much wrong output to be useful.
I wish I could just point it at the latest Godot docs and have it give up-to-date answers, but seeing as that's still not a thing, I guess it's more complicated than I expect.
psytrx | 0 comments | 2 weeks ago
My web framework of choice provides these [1], but they can't be injected into the LLM context without a lot of fuss. It would be a game changer if more LLM tools implemented them.
doctoboggan | 0 comments | 2 weeks ago
mohsen1 | 0 comments | 2 weeks ago
fovc | 2 comments | 2 weeks ago
You're not alone :-) I asked a very similar question about a month ago: https://news.ycombinator.com/item?id=42552653 and have continued researching since.
My takeaway was that autocomplete, boilerplate, and one-off scripts are the main use cases. To use an analogy, I think the code assistants are more like an upgrade from a handsaw to power tools and less like hiring a carpenter. (Which is not what the hype engine will claim.)
For me, only the one-off script (write-only code) use-case is useful. I've had the best results on this with Claude.
Emacs abbrevs/snippets (+ choice of language) virtually eliminate the boilerplate problem, so I don't have a use for assistants there.
For autocomplete, I find that LSP completion engines provide 95% of the value for 1% of the latency. Physically typing the code is a small % of my time/energy, so the value is more about getting the right names, argument order, and other fiddly details I may not remember exactly. But I find that LSP-powered autocomplete and tooltips largely solve those challenges.
sdesol | 1 comment | 2 weeks ago
I 100% agree with the not hiring a carpenter part but we need a better way to describe the improvement over just a handsaw. If you have domain knowledge, it can become an incredible design aid/partner. Here is a real world example as to how it is changing things for me.
I have a TreeTable component which I built 100% with LLM and when I need to update it, I just follow the instructions in this chat:
http://beta.gitsense.com/?chat=dd997ccd-5b37-4591-9200-b975f...
Right now, I am thinking about adding folders to organize chats, and here is the chat with DeepSeek for that feature:
http://beta.gitsense.com/?chat=3a94ce40-86f2-4e68-b5d7-88d33...
I'm thoroughly impressed as it suggested data structures and more for me to think about. And here I am asking it to review what was discussed to make the information easier to understand.
http://beta.gitsense.com/?chat=8c6bf5db-49a7-4511-990c-5e6ad...
All of this cost me less than a penny. I'm still waiting for my Anthropic API limit to reset and I'm going to ask Sonnet for feedback as well, and I figure that will cost me 5 cents.
I fully understand the not hiring a carpenter part, but I think what LLMs bring to the table is SO MUCH more than an upgrade to a power tool. If you know what you need and can clearly articulate it well enough, there really is no limit to what you can build with proper instructions, provided the solution is in its training data and you have a good enough BS detector.
strogonoff | 1 comment | 2 weeks ago
In other words: you must already know how to do what you are asking the LLM to do.
In other words: it may make sense if typing speed is your bottleneck and you are dealing with repetitive tasks that have well been solved many times (i.e., you want an advanced autocomplete).
This basically makes it useless for me. Typing speed is not a bottleneck, I automate or abstract away repetition, and I seek novel tasks that have not yet been well solved—or I just reuse those existing solutions (maybe even contributing to respective OSS projects).
In the cases where something new was needed in areas that I don’t know well, it completely failed me. NB: I never actually used it myself; I only gave in to a suggestion by a friend (whom LLMs reportedly help) to use his LLM-wrangling skills in a thorny case.
sdesol | 1 comment | 2 weeks ago
Those that will benefit the most will be senior developers. They might not know the exact problem or language, but they should know enough to guide the LLM.
> In other words: it may make sense if typing speed is your bottleneck and you are dealing with repetitive tasks that have well been solved many times (i.e., you want an advanced autocomplete).
I definitely use an LLM as a typist and I love it. I've come to a point now where I mentally ask myself, "Will it take more time to do it myself or to explain it?" Another factor is cost, as you can rack up a bill pretty quickly with Claude Sonnet if you ask it to generate a lot of code.
But honestly, what I love about integrating LLMs into my workflow is that I'm better able to capture and summarize my thought process. I've also found LLMs can better articulate my thoughts most of the time. If you know how to prompt an LLM, it almost feels like you are working with a knowledgeable colleague.
> I never actually used it myself, I only gave into a suggestion by a friend (whom LLM reportedly helps) to use his LLM wrangling skills in a thorny case.
LLMs are definitely not for everyone, but I personally cannot see myself coding without LLMs now. Just asking for variable name suggestions is pretty useful. Or describing something vague and having it properly articulate my thoughts is amazing. I think we like to believe what we do is rather unique, but I think a lot of things that we need to do have already been done. Whether it is in the training data is another thing, though.
strogonoff | 0 comments | 2 weeks ago
I was in this exact situation. I was working in an unfamiliar area with a hardware SDK in C that I needed to rewrite for my runtime, or at least call its C functions from my runtime, or at least understand, by commenting it, how the poorly written (but working) example SDK invocation works in C. The LLMs failed to help with any of that: they produced code that was 1) incorrect (literally doing the opposite of what’s expected) and 2) full of obvious comments and missing implementations (like a “cleanup if needed” comment in the empty deinit function).
Later it turned out there is actually an SDK for my runtime, I just failed to find it at first, so the code the LLM could use or tell me about actually existed (just not very easy to find).
Those were two top LLMs as of December 2024. It left me unimpressed.
I don’t think I would be compelled to guide them, once I understood how the code works it is faster to just write it or read relevant reference.
My friend, who volunteered to waste those precious tokens to help with my project, does use chatbots a lot while coding, but he’s more of an intermediate than senior developer.
> Just asking for variable name suggestions is pretty useful.
I can’t see myself asking anyone, much less an LLM, for the name of a variable. I am known to ask about and/or look up, say, subject domain terminology that I then use when naming things, but to name things well you first need to have a full picture of what you are making. Our job is to have one…
barrell | 0 comments | 2 weeks ago
Quality of Life went up massively. LSPs and nvim-cmp have come a long way (although one of these days I’ll try blink.cmp)
sangnoir | 0 comments | 2 weeks ago
I've found incredible value in having LLMs help me write unit tests. The quality of the test code is far from perfect, but AI tooling - Claude Sonnet specifically - is good at coming up with reasonable unit test cases after I've written the code under test (sue me, TDD zealots). I probably have to fix 30% of the tests and expand the test cases, but I'd say it cuts the number of test code lines I author by more than 80%. This has decreased the friction so much that I've added Continuous Integration to small, years-old personal projects that had no tests before.
I've found lesser value with refactoring and adding code docs, but that's more of autocomplete++ using natural language rather than AST-derived code.
coder543 | 4 comments | 2 weeks ago
“One”? Wired up how? There is a huge difference between the best and worst. They aren’t fungible. Which one? How long ago? Did it even support FIM (fill in middle), or was it blindly guessing from the left side? Did the plugin even gather appropriate context from related files, or was it only looking at the current file?
If you try Copilot or Cursor today, you can experience what “the best” looks like, which gives you a benchmark to measure smaller, dumber models and plugins against. No, Copilot and Cursor are not available for emacs, as far as I know… but if you want to understand if a technology is useful, you don’t start with the worst version and judge from that. (Not saying emacs itself is the worst… just that without more context, my assumption is that whatever plugin you probably encountered was probably using a bottom tier model, and I doubt the plugin itself was helping that model do its best.)
There are some local code completion models that I think are perfectly fine, but I don’t know where you will draw the line on how good is good enough. If you can prove to yourself that the best models are good enough, then you can try out different local models and see if one of those works for you.
Lanedo | 0 comments | 2 weeks ago
I hacked up a slim alternative localpilot.js layer that uses llama-server instead of the copilot API, so copilot.el can be used with local LLMs, but I find the copilot.el overlays kinda buggy... It'd probably be better to instead write a llamapilot.el for local LLMs from scratch for emacs.
b5n | 0 comments | 2 weeks ago
yoyohello13 | 0 comments | 2 weeks ago
whimsicalism | 0 comments | 2 weeks ago
colonial | 0 comments | 2 weeks ago
Copilot, Ollama, and the others have all been strictly inferior to rust-analyzer. The suggested code is often just straight up invalid and takes just long enough to be annoying. Compare that to just typing '.'/'::' + a few characters to fuzzy-select what I'm looking for + enter.
ETA: Both did save me a few seconds here and there when obvious "repetition with a few tweaks each line" was involved, but to me that is not worth a monthly subscription or however much wall power my GPU was consuming.
codingdave | 0 comments | 2 weeks ago
So maybe it is just a difference in perspective? Even incorrect code and bad ideas can still be helpful. It is only useless if you expect them to hand you working code.
whimsicalism | 0 comments | 2 weeks ago
righthand | 0 comments | 2 weeks ago
tomr75 | 0 comments | 2 weeks ago
frankfrank13 | 0 comments | 2 weeks ago
binary132 | 0 comments | 2 weeks ago
mohsen1 | 0 comments | 2 weeks ago
And when you're really stuck you can use DeepSeek R1 for a deeper analysis in your terminal using `askds`
opk | 5 comments | 2 weeks ago
horsawlarway | 0 comments | 2 weeks ago
I can run 2b-14b models just fine on the CPU on my laptop (framework 13 with 32gb ram). They aren't super fast, and the 14b models have limited context length unless I run a quantized version, but they run.
If you just want generation and it doesn't need to be fast... drop the $200 for 128gb of system ram, and you can run the vast majority of the available models (up to ~70b quantized). Note - it won't be quick (expect 1-2 tokens/second, sometimes less).
If you want something faster in the "low end" range still - look at picking up a pair of Nvidia p40s (~$400) which will give you 16gb of ram and be faster for 2b to 7b models.
If you want to hit my level for "moderate", I use 2x3090 (I bought refurbed for ~$1600 a couple years ago) and they do quite a bit of work. Ex - I get ~15t/s generation for 70b 4 quant models, and 50-100t/s for 7b models. That's plenty usable for basically everything I want to run at home. They're faster than the m2 pro I was issued for work, and a good chunk cheaper (the m2 was in the 3k range).
That said - the m1/m2 macs are generally pretty zippy here, I was quite surprised at how well they perform.
Some folks claim to have success with the k80s, but I haven't tried and while 24g vram for under $100 seems nice (even if it's slow), the linux compatibility issues make me inclined to just go for the p40s right now.
I run some tasks on much older hardware (ex - willow inference runs on an old 4gb gtx 970 just fine)
So again - I'm not really sure we'd agree on moderate (I generally spend ~$1000 every 4-6 years to build a machine to play games, and the machine you're describing would match the specs for a machine I would have built 12+ years ago)
But you just need literal memory. Bumping to 32gb of system ram would unlock a lot of stuff for you (at low speeds) and costs $50. Bumping to 128gb only costs a couple hundred, and lets you run basically all of them (again - slowly).
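The reason "slow but it runs" is so predictable is that token generation is mostly memory-bandwidth bound. A rough rule of thumb, as a sketch rather than a benchmark:

    # Back-of-the-envelope only: per generated token, roughly the whole model
    # is streamed from memory once, so speed ~ bandwidth / model size.
    def approx_tokens_per_sec(model_size_gb, mem_bandwidth_gb_s):
        return mem_bandwidth_gb_s / model_size_gb

    print(approx_tokens_per_sec(40, 50))   # ~1.3 t/s: 70B 4-bit model in dual-channel DDR RAM
    print(approx_tokens_per_sec(40, 936))  # ~23 t/s upper bound at 3090-class bandwidth
                                           # (layer-splitting across two cards still reads
                                           # the whole model once per token)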
zamadatix | 1 comment | 2 weeks ago
If "moderate hardware" is your average office PC then it's unlikely to be very usable. Anyone with a gaming GPU from the last several years should be workable though.
horsawlarway | 0 comments | 2 weeks ago
It'd been a minute since I checked refurb prices and $250 for the rtx 3060 12gb is a good price.
Easier on the rest of the system than a 2x card setup, and is probably a drop in replacement.
bhelkey | 0 comments | 2 weeks ago
basilgohar | 0 comments | 2 weeks ago
Larger models work but slow down. I do have 64GB of RAM but I think 32 could work. 16GB is pushing it, but should be possible if you don't have anything else open.
Memory requirements depend on numerous factors. 2GB VRAM is not enough for most GenAI stuff today.
whimsicalism | 0 comments | 2 weeks ago
mrinterweb | 0 comments | 2 weeks ago
colordrops | 0 comments | 2 weeks ago
s-skl | 0 comments | 2 weeks ago
jerpint | 2 comments | 2 weeks ago
If I do, I load up cursor with vim bindings.
rapind | 1 comment | 2 weeks ago
This way I can choose to just ignore it for the most part, but if I see something interesting, I can refine it, prompt, cherry pick, etc. I really don't want autocomplete for anything.
survirtual | 1 comment | 2 weeks ago
I was going to have the helper generate suggestions in a separate window, which could be moved to any monitor or screen. It would make suggestions and autocompletes, and you can chat with it to generate commands etc on the fly.
Maybe I will pick it up again soon.
rapind | 0 comments | 2 weeks ago
renewiltord | 1 comment | 2 weeks ago
This will help for offline support (on planes and such).
qup | 2 comments | 2 weeks ago
renewiltord | 0 comments | 2 weeks ago
https://news.ycombinator.com/item?id=34769611
Which leads (used to lead?) here https://wiki.roshangeorge.dev/index.php/AI_Completion_In_The...
VMG | 0 comments | 2 weeks ago
- anything bash script related
morcus | 2 comments | 2 weeks ago
loudmax | 0 comments | 2 weeks ago
Obviously, newer GPUs will run faster than older GPUs, but you need more VRAM to be able to run larger models. A small LLM that fits into an RTX 4060's 8GB of VRAM will run faster there than it would on an older RTX 3090. But the 3090 has 24GB of VRAM, so it can run larger LLMs that the 4060 simply can't handle.
Llama.cpp can split your LLM onto multiple GPUs, or split part of it onto the CPU using system RAM, though that last option is much much slower. The more of the model you can fit into VRAM, the better.
The Apple M-series macbooks have unified memory, so the GPU has higher bandwidth access to system RAM than would be available over a PCIe card. They're not as powerful as Nvidia GPUs, but they're a reasonable option for running larger LLMs. It's also worth considering AMD and Intel GPUs, but most of the development in the ML space is happening on Nvidia's CUDA architecture, so bleeding edge stuff tends to be Nvidia first and other architectures later, if at all.
estreeper | 0 comments | 2 weeks ago
VRAM (GB) = 1.2 * number of parameters (in billions) * bits per parameter / 8
The 1.2 is just an estimation factor to account for the VRAM needed for things that aren't model parameters.
Because quantization is often nearly free in terms of output quality, you should usually look for quantized versions. For example, Llama 3.2 uses 16-bit parameters but has a 4-bit quantized version, and looking at the formula above you can see that will allow you to run a 4x larger model.
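A quick worked example of the formula above (illustrative numbers, using models mentioned elsewhere in this thread):

    # VRAM (GB) ~= 1.2 * params_in_billions * bits_per_param / 8
    def approx_vram_gb(params_billions, bits_per_param):
        return 1.2 * params_billions * bits_per_param / 8

    print(approx_vram_gb(3, 16))   # ~7.2 GB: Qwen2.5-Coder-3B at 16-bit
    print(approx_vram_gb(3, 4))    # ~1.8 GB: the same model 4-bit quantized
    print(approx_vram_gb(70, 4))   # ~42 GB: a 70B model at 4-bit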
Having enough VRAM will allow you to run a model, but performance is dependent on a lot of other factors. For a much deeper dive into how all of this works along with price/dollar recommendations (though from last year!), Tim Dettmers wrote this excellent article: https://timdettmers.com/2023/01/30/which-gpu-for-deep-learni...
Worth mentioning for the benefit of those who don't want to buy a GPU: there are also models which have been converted to run on CPU.
[0] https://blog.runpod.io/understanding-vram-and-how-much-your-...
cfiggers | 0 comments | 2 weeks ago
awwaiid | 1 comment | 2 weeks ago
amelius | 0 comments | 2 weeks ago
Not so great. Zero legal status. No blessing by Apple. Can be taken down whenever Apple execs decide to.
amelius | 1 comment | 2 weeks ago
larodi | 0 comments | 2 weeks ago
So it all depends on how it fits in the context, and it will stay that way for the foreseeable future, as training is far more expensive than RAG.