Bit-twiddling optimizations in Zed's Rope
https://zed.dev/blog/zed-decoded-rope-optimizations-part-1By misternugget at
rgrmrts | 1 comment | 10 hours ago
Unrelated to this specific post I’m such a fan of Zed. It’s the first feature complete text editor in recent memory that I’ve truly enjoyed using (i.e. it stays out of the way, is really fast, feels well engineered). I’m coming to Zed after years of Emacs which I still have love for but no longer feels like a competitive piece of software (it does not take full advantage of how good computers are today, e.g. gpu rendering or multicore). I really hope Zed stays a fast and lightweight text editor instead of becoming some bloated growth-at-all-cost VC ware (not that they’ve exhibited any signs of that happening). I’d also happily pay for Zed without a subscription based thing for access to LLM features (which I do not use).
seanw444 | 4 comments | 10 hours ago
Why does Emacs need that though? I hear people say this all the time and I don't get it. Multicore kind of works against the structure that Emacs touts as a feature. And GPU rendering? In many applications, I totally agree with these complaints. But it's a text editor.
I tried Zed myself, and it's good. But it doesn't dethrone Emacs (for me personally).
vinkelhake | 1 comment | 8 hours ago
Long time emacs user here (+20 years, yikes). I've used it on all kinds of computers during this time. Even a relatively modest computer from 2024 is an absolute beast compared to something from the year 2000.
With that said, there are text editing operations that can cause it to grind to a complete halt, especially working with large files (or very long lines). And it shouldn't, you know? I think emacs users sort of internalize which operations they should avoid. It's kind of ridiculous to have to do that in a text editor with the massive amounts of compute that our computers have today.
donio | 1 comment | 8 hours ago
Long line handling has greatly improved in emacs-29. Multi-megabyte lines are not a problem anymore.
canucker2016 | 0 comments | 7 hours ago
** Emacs is now capable of editing files with very long lines.
The display of long lines has been optimized, and Emacs should no
longer choke when a buffer on display contains long lines. The
variable 'long-line-threshold' controls whether and when these display
optimizations are in effect.
A companion variable 'large-hscroll-threshold' controls when another
set of display optimizations are in effect, which are aimed
specifically at speeding up display of long lines that are truncated
on display.
PittleyDunkin | 1 comment | 9 hours ago
I have consistent issues with emacs locking up when executing network requests. I'm sure there's a specific bug that could be hunted down and addressed, but this sort of thing shouldn't happen much in an editor that's multicore by default.
I'm not trying to dismiss emacs' reasoning, of course, but I can understand being disgruntled with it.
The actual rendering I've been quite please by, though!
rgrmrts | 2 comments | 9 hours ago
I use Emacs GUI (outside of the terminal) and comparing performance for rending to something like Zed or Sublime is definitely noticeable. It’s great that Emacs is so resource efficient but sometimes I wish it used more of my beefy computer(s).
Like I said I still love Emacs and it’s okay for it to make a different set of trade-offs. I honestly didn’t think I’d ever switch editors but here we are!
karthink | 0 comments | 8 hours ago
Installing packages does not need to block either, there is no architectural limitation here. The Elpaca package manager for Emacs provides async, parallel package updates. Loading packages into the Lisp image will block though, there's no way around that.
The other big source of input lag is garbage collection, and there are some ongoing efforts to use the MPS library in Emacs for a copying, concurrent GC. This is a big change and I don't know if this experiment will go anywhere, but Eli Zaretskii and co are trying.
seanw444 | 1 comment | 8 hours ago
As for LSP, other than the Nim langserver, I've been quite satisfied with Eglot's performance. I'm curious what your setup is like. To be fair, I'm running a highly customized Doom Emacs config, so it's possible I inherited some non-vanilla optimizations I'm not aware of.
PittleyDunkin | 0 comments | 5 hours ago
Parallelism trivially enables concurrency. The lack of parallelism magnifies the issues emacs has with concurrency.
DrBenCarson | 0 comments | 8 hours ago
Most new TTY projects use GPUs for that reason
kevin_thibedeau | 0 comments | 2 hours ago
benreesman | 0 comments | 2 hours ago
The are serious about local vs. remote vs. shared. They are serious about hardware acceleration because they care about users who type fast and edit big files. They care about real computer science on how to push the current hardware to serve the user, rather than treating the hardware as a crutch to give the user a slightly worse experience at a fraction of the development cost. They care about code highlighting the snippets on their blog like very similar to the default Zed theme.
These are cool, serious people. If they give me emacs key bindings with full paredit-everywhere, I might switch my daily driver.
And this is about using modern SIMD-style stuff, branch less stuff, Lemire stuff in a modern and relevant context. Other commenters have pointed out that you can do the mask or popcount better with intrinsically, and yeah, but they probably know that, and B. They got the massive win, the remaining factor of K on the pop count is coming.
Long Zed.
dzaima | 0 comments | 5 hours ago
eviks | 1 comment | 3 hours ago
You don't need information about the position of newlines in all the chunks located before the one your offset lands on
teo_zero | 1 comment | 36 minutes ago
eviks | 0 comments | 13 minutes ago
sapiogram | 1 comment | 7 hours ago
mattbaker | 0 comments | 3 hours ago
camel-cdr | 3 comments | 10 hours ago
kwillets | 0 comments | 6 hours ago
Bitstring rank/select is a well-known problem, and the BMI and non-BMI (Hacker's Delight) versions are available as a reference.
stouset | 0 comments | 4 hours ago
SkiFire13 | 1 comment | 9 hours ago
zamadatix | 0 comments | 5 hours ago
ramon156 | 0 comments | 7 hours ago
aa -> -> bb -> -> bb
It only takes up two spaces, after all
Am4TIfIsER0ppos | 0 comments | 7 hours ago
> filter: blur(1px);
Wonderful styling!
dmitrygr | 2 comments | 10 hours ago
> // Parallel bit count intermediates
> let a = v - ((v >> 1) & (u64::MAX / 3));
> let b = (a & (u64::MAX / 5)) + ((a >> 2) & (u64::MAX / 5));
> let c = (b + (b >> 4)) & (u64::MAX / 0x11);
> let d = (c + (c >> 8)) & (u64::MAX / 0x101);
That "parallel bit count" is almost certainly slower than using two POPCNT instructions on a modern cpu. Should just call __builtin_popcount() and let the compiler do it the most optimal way. Luckily, people do this sort of thing so often that many modern compilers will try (and often succeed) to detect you trying this insanity and convert it to a POPCOUNT (or a pair of POPCOUNTs as the case may be here)ot | 0 comments | 8 hours ago
On processors with BMI2 the whole algorithm reduces to a PDEP as mentioned in another comment, but if you don't have that this is pretty much the best you can do (unless you use lookup tables but those have pros and cons).
akoboldfrying | 4 comments | 10 hours ago
The above code is completely source- and binary-portable and reasonably fast -- certainly faster than naively looping through the bits, and within a small constant factor of a CPU POPCOUNT instruction.
aseipp | 0 comments | 9 hours ago
And the compiler is not required to lower it to a single instruction. It will if the target architecture is specified appropriately, but there's nothing that says it has to explode if it can't. In fact, by doing it this way, the compiler is actually more free to generate code in a way that's optimal for the architecture in all cases, because all the implementation details are hidden e.g. loads of large constants may be avoided if the compiler is allowed to choose the exact implementation, while using the portable version may tie its hands more depending on how it's feeling on that day. Here's __builtin_popcount working just fine while targeting a ~20yo architecture without native support for SSE4.2; it can generate this code knowing what the proper instructions and schedules are: https://godbolt.org/z/ra7n5T5f3
The moral here is that the primitives are there for you to use. Just use them and save yourself and your would-be code reviewer's time.
jandrewrogers | 0 comments | 9 hours ago
Also, pretty much every compiler for a very long time has supported __builtin_popcount or equivalent.
woadwarrior01 | 0 comments | 10 hours ago
Clang supports __builtin_popcount() too. And MSVC has __popcnt().
dmitrygr | 2 comments | 9 hours ago
SkiFire13 | 1 comment | 9 hours ago
vlovich123 | 0 comments | 5 hours ago
Findecanor | 1 comment | 7 hours ago
64-bit ARM's actually has a very peculiar encoding of immediates to arithmetic instructions. It supports only recurring bit patterns such as used by this algorithm. For example "add x2, x3, #3333333333333333" is encoded as one four-byte instruction.
stassats | 0 comments | 7 hours ago
It does have this: https://developer.arm.com/documentation/ddi0596/2021-09/SIMD...
And GCC happily uses it https://godbolt.org/z/dTW46f9Kf