FFmpeg School of Assembly Language
https://github.com/FFmpeg/asm-lessons/blob/main/lesson_01/index.md
By davikr
computerbuster | 4 comments | 5 months ago
As I'm seeing in the comments here, opinions on the usefulness of handwritten SIMD range from "totally unclear" to "mission critical". There's a lot here on the "totally unclear" side, but not as much on "mission critical", so I'll talk a bit about that.
FFmpeg is a pretty clear use case because of how often it is used, but I think it is easier to quantify the impact of handwriting SIMD with something like dav1d, the universal production AV1 video decoder.
dav1d is used pretty much everywhere, from major browsers to the Android operating system (superseding libgav1). A massive element of dav1d's success is its incredible speed, which is largely due to how much of the codebase is handwritten SIMD.
While I think it is a good thing that languages like Zig have built-in SIMD support, there are some use cases where it becomes necessary to do things by hand because even a potential performance delta is important to investigate. There are lines of code in dav1d that will be run trillions of times in a single day, and they need to be as fast as possible. The difference between handwritten & compiler-generated SIMD can be up to 50% in some cases, so it is important.
I happen to be somewhat involved in similar use cases, where things I write will run a lot of times. To make sure these skills stay alive, resources like the FFmpeg school of assembly language are pretty important, in my opinion.
cornstalks | 2 comments | 5 months ago
janwas | 4 comments | 5 months ago
rbultje | 1 comment | 5 months ago
And don't forget that any asm-optimized variant always has a C fallback for generic platforms that lack a hand-optimized variant. That C version is also used to verify the asm-optimized variant using checkasm. It might not be linked into your binary/library (the linker eliminates it because it's never used), but the code exists nonetheless.
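A minimal sketch of that verification idea (my illustration, not FFmpeg's actual checkasm harness): run the C reference and the optimized variant on the same random input and require identical output.

    /* checkasm-style verification sketch: the C fallback doubles as the
     * reference implementation for the optimized variant. */
    #include <stdint.h>
    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>

    /* C reference: the fallback that always exists. */
    static void add_u8_c(uint8_t *dst, const uint8_t *a, const uint8_t *b, size_t n)
    {
        for (size_t i = 0; i < n; i++)
            dst[i] = a[i] + b[i];
    }

    /* Stand-in for the hand-written asm variant (plain C here so the
     * sketch is self-contained). */
    static void add_u8_opt(uint8_t *dst, const uint8_t *a, const uint8_t *b, size_t n)
    {
        for (size_t i = 0; i < n; i++)
            dst[i] = a[i] + b[i];
    }

    int main(void)
    {
        uint8_t a[256], b[256], ref[256], opt[256];
        for (size_t i = 0; i < sizeof a; i++) {
            a[i] = rand() & 0xff;
            b[i] = rand() & 0xff;
        }
        add_u8_c(ref, a, b, sizeof a);
        add_u8_opt(opt, a, b, sizeof a);
        puts(memcmp(ref, opt, sizeof ref) == 0 ? "ok" : "MISMATCH");
        return 0;
    }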
janwas | 0 comments | 5 months ago
hrydgard | 1 comment | 5 months ago
janwas | 0 comments | 5 months ago
ajb | 0 comments | 5 months ago
weebull | 0 comments | 5 months ago
MortyWaves | 4 comments | 5 months ago
epr | 0 comments | 5 months ago
antoinealb | 0 comments | 5 months ago
wolf550e | 0 comments | 5 months ago
secondcoming | 0 comments | 5 months ago
janwas | 1 comment | 5 months ago
Example: our new matmul outperforms a well-known library for LLM inference, sometimes even when it uses AMX vs our AVX512BF16. Why? They seem to have some threading bottleneck, or maybe it's something else; hard to tell with a JIT involved.
This would not have happened if I had to write per-platform kernels. There are only so many hours in the day. Writing a single implementation using Highway enabled exploring more of the design space, including a new kernel type and an autotuner able to pick not only block sizes, but also parallelization strategies and their parameters.
Perhaps in a second step, one can then hand-tune some parts, but I sure hope a broader exploration precedes micro-optimizing register allocation and calling conventions.
rbultje | 1 comment | 5 months ago
It should be obvious that both are pursued independently whenever it makes sense. The idea that one should precede the other or is more important than the other is simply untrue.
janwas | 0 comments | 5 months ago
Are you really suggesting writing a variant of a kernel, tuning it to the max, then discovering a new and different way to do it, and then discarding the first implementation? That seems like a lot of wasted effort.
dundarious | 2 comments | 5 months ago
GCC and Clang support the vector_size attribute and overloaded arithmetic operators on those "vectorized" types, and a LOT more besides -- in fact, that's how intrinsics like _mm256_mul_ps are implemented: `#define _mm256_mul_ps(a,b) (__m256)((v8sf)(a) * (v8sf)(b))`. The utility of all of that is much, much greater than what's available in Zig.
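For anyone who hasn't seen the extension, it looks roughly like this (a sketch per the GCC vector-extensions documentation; exact codegen depends on target and flags):

    #include <stdio.h>

    /* 8 floats = 256 bits, matching __m256 / v8sf. */
    typedef float v8sf __attribute__((vector_size(32)));

    int main(void)
    {
        v8sf a = {1, 2, 3, 4, 5, 6, 7, 8};
        v8sf b = {8, 7, 6, 5, 4, 3, 2, 1};
        v8sf c = a * b;  /* element-wise; on AVX targets this can compile
                            to a single vmulps, like _mm256_mul_ps */
        for (int i = 0; i < 8; i++)
            printf("%g ", c[i]);  /* GCC/Clang allow [] on vector types */
        printf("\n");
        return 0;
    }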
anonymoushn | 0 comments | 5 months ago
MortyWaves | 1 comment | 5 months ago
neonsunset | 3 comments | 5 months ago
You can usually port existing SIMD algorithms from C/C++/Rust to C# with few changes while retaining the same performance; doing the same in Java is practically impossible.
I feel like C veterans often don't realize how unnecessarily ceremonious platform-specific SIMD code is given the progress in portable abstractions. Unless you need an exotic instruction that does not translate across architectures and/or common patterns nicely, there is little reason to have a bespoke platform-specific path.
kierank | 1 comment | 5 months ago
neonsunset | 0 comments | 5 months ago
Most code paths out there aren't like that, however, and compilers are not too bad at instruction selection nowadays (you'd be right to mention that they sometimes have odd regressions; I've definitely seen that being a problem in LLVM, GCC and RyuJIT).
anonymoushn | 1 comment | 5 months ago
ack_complete | 0 comments | 5 months ago
MortyWaves | 0 comments | 5 months ago
zbobet2012 | 0 comments | 5 months ago
buserror | 6 comments | 5 months ago
And stare at the generated code!
More often than not, the auto-vectorisation now generates a pretty excellent SIMD version of your function, and all you have to do is 'hint' the compiler -- for example, explicitly list alignment, or provide your own vector source/destination types. You can do a lot by 'styling' your C code while thinking about what the compiler might be able to do with it -- for example, using extra intermediary variables and really breaking down all the operations you want, etc.
Worst case, if the compiler REALLY isn't clever enough, this gives you a good base to adapt: you can take the generated assembly and tweak it, without having to actually write the boilerplate bits.
In most cases, the resulting C function will be vectorized as well as, or better than, the hand-coded one I'd write -- and in many other cases it's 'close enough' not to matter that much. The other good news is that the code will probably vectorize fine for WASM and NEON etc. without needing explicit versions.
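The kind of 'styling' described might look like this (my sketch, compiled with something like -O3 -march=native; restrict plus an alignment hint is often enough to get clean SIMD out of the vectorizer):

    #include <stddef.h>

    /* restrict promises no aliasing; the alignment hint permits aligned
     * vector loads/stores. */
    void scale_add(float *restrict dst, const float *restrict src,
                   float k, size_t n)
    {
        dst = __builtin_assume_aligned(dst, 32);
        src = __builtin_assume_aligned(src, 32);
        for (size_t i = 0; i < n; i++)
            dst[i] += k * src[i];  /* simple, dependency-free loop body */
    }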
kimixa | 0 comments | 5 months ago
But due to needing to support other compilers and platforms we actually ended up importing the generated asm from those source files in the actual build.
ack_complete | 1 comment | 5 months ago
https://gcc.godbolt.org/z/rjEqzf1hh
This is an unsigned byte saturating add. It is directly supported as a single instruction in both x86-64 and ARM64 as PADDUSB and UQADD.16B. But all compilers make a mess of it from a straightforward description, either failing to vectorize it or generating vectorized code that is much larger and slower than necessary.
This is with a basic, simple vectorization primitive. It's difficult to impossible to get compilers to use some of the more complex ones, like a rounded narrowing saturated right shift (UQRSHRN).
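For reference, the operation in question looks something like this (my reconstruction, not necessarily the exact code behind the godbolt link): the scalar description compilers fumble, next to the intrinsic that maps to PADDUSB, one instruction per 16 bytes.

    #include <stddef.h>
    #include <stdint.h>
    #include <emmintrin.h>  /* SSE2 */

    /* Scalar description: per-byte unsigned saturating add. */
    void satadd_c(uint8_t *dst, const uint8_t *a, const uint8_t *b, size_t n)
    {
        for (size_t i = 0; i < n; i++) {
            unsigned s = a[i] + b[i];
            dst[i] = s > 255 ? 255 : s;  /* saturate to 0xFF */
        }
    }

    /* What the hardware offers directly. */
    void satadd_sse2(uint8_t *dst, const uint8_t *a, const uint8_t *b, size_t n)
    {
        size_t i;
        for (i = 0; i + 16 <= n; i += 16) {
            __m128i va = _mm_loadu_si128((const __m128i *)(a + i));
            __m128i vb = _mm_loadu_si128((const __m128i *)(b + i));
            _mm_storeu_si128((__m128i *)(dst + i), _mm_adds_epu8(va, vb));
        }
        for (; i < n; i++) {  /* scalar tail */
            unsigned s = a[i] + b[i];
            dst[i] = s > 255 ? 255 : s;
        }
    }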
buserror | 0 comments | 5 months ago
But again, if you don't like the generated code, you can take it, tweak it, and use that; I've done it quite a few times.
holowoodman | 1 comment | 5 months ago
Narishma | 0 comments | 5 months ago
And keep redoing that for every new compiler or version of a compiler, or if you change compile options. Any of those things can prevent the auto-vectorization.
Narishma | 1 comment | 5 months ago
eddd-ddde | 0 comments | 5 months ago
anonymoushn | 0 comments | 5 months ago
kierank | 6 comments | 5 months ago
Ask me anything.
ilyagr | 2 comments | 5 months ago
If it's so heavy in assembly, the fact that ffmpeg works on my Mac seems like a miracle. Is it ported by hand?
rbultje | 0 comments | 5 months ago
Not ported, but rather re-implemented. So: yes.
A bit more detail: during build, on x86, the FFmpeg binary would include hand-written AVX2 (and SSSE3, and AVX512, etc.) implementations of CPU-intensive functions, and on Arm, the FFmpeg binary would include hand-written Neon implementations (and a bunch of extensions; e.g. dotprod) instead.
At runtime (when you start the FFmpeg binary), FFmpeg "asks" the CPU what instruction sets it supports. Each component (decoder, encoder, etc.) - when used - will then set function pointers (for CPU-intensive tasks) which are initialized to a C version, and these are updated to the Neon or AVX2 version depending on what's included in the build and supported by this specific device.
So in practice, all CPU-intensive tasks for components in use will run hand-written Neon code for you, and hand-written AVX2 for me. For people on obscure devices, it will run the regular C fallback.
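The dispatch pattern looks roughly like this (a hedged sketch; the function names are invented and FFmpeg's real init code differs):

    #include <stddef.h>
    #include <stdint.h>

    /* C fallback: always present, always correct. */
    static void sum_c(const int32_t *v, size_t n, int64_t *out)
    {
        int64_t s = 0;
        for (size_t i = 0; i < n; i++)
            s += v[i];
        *out = s;
    }

    /* Stand-in for the hand-written AVX2 version (plain C here so the
     * sketch is self-contained). */
    static void sum_avx2(const int32_t *v, size_t n, int64_t *out)
    {
        sum_c(v, n, out);
    }

    /* Function pointer starts at the C fallback... */
    static void (*sum_fn)(const int32_t *, size_t, int64_t *) = sum_c;

    void sum_init(void)
    {
    #if defined(__x86_64__)
        /* ...and is upgraded if the CPU reports support at runtime
         * (GCC/Clang builtin). */
        if (__builtin_cpu_supports("avx2"))
            sum_fn = sum_avx2;
    #endif
    }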
saagarjha | 0 comments | 5 months ago
cnt-dracula | 3 comments | 5 months ago
I have a question, as someone who can just about read assembly but still doesn't intuitively understand how to write it or decompose ideas in order to utilise it: do you have any suggestions on how to learn / improve this?
As in, at what point would someone realise this thing can be sped up by using assembly? If one found a function that would be really performant in assembly how do you go about writing it? Would you take the output from a compiler that's been converted to assembly or would you start from scratch? Does it even matter?
qingcharles | 1 comment | 5 months ago
For instance, I used to work on graphics renderers. You'd find the bit that was called the most (writing lines of pixels to the screen) and try to jiggle the order of the instructions to decrease the number of cycles used to move X bits from system RAM to graphics RAM.
When I was doing it, branching (usually checking an exit condition on a loop) was the biggest performance killer. The CPU couldn't queue up instructions past the check because it didn't know whether it was going to go true or false until it got there.
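The classic mitigation was unrolling, so the exit check runs once per several pixels instead of once per pixel. Modern branch predictors handle loop exits much better, but the reduced overhead still helps. A sketch (assuming n is a multiple of 4):

    #include <stddef.h>
    #include <stdint.h>

    /* One exit-condition check per pixel. */
    void copy_naive(uint32_t *dst, const uint32_t *src, size_t n)
    {
        for (size_t i = 0; i < n; i++)
            dst[i] = src[i];
    }

    /* Unrolled 4x: one check per four pixels. */
    void copy_unrolled(uint32_t *dst, const uint32_t *src, size_t n)
    {
        for (size_t i = 0; i < n; i += 4) {
            dst[i + 0] = src[i + 0];
            dst[i + 1] = src[i + 1];
            dst[i + 2] = src[i + 2];
            dst[i + 3] = src[i + 3];
        }
    }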
booi | 2 comments | 5 months ago
kaslai | 0 comments | 5 months ago
akoboldfrying | 0 comments | 5 months ago
Quite often, though, you don't have predictable branches, and then you'll pay half the misprediction cost each time on average. If you're really unlucky, you could hit inputs where the branch predictor gets it wrong more than 50% of the time.
epr | 0 comments | 5 months ago
When someone indicates to me that they want to learn programming, for example, I ask them how many programs they've written. The answer is usually zero, and in fact I've never even heard an answer greater than 10. No one will answer with a larger number, because that selects out the people who would even ask the question. If you write 1000 programs that solve real problems, you'll be at least okay. 10k and you'll be pretty damn good. 100k and you might be better than the guy who wrote the assembly manual.
For a fun answer, this is a $20 nand2tetris-esque game that holds your hand through creating multiple cpu architectures from scratch with verification (similarly to prolog/vhdl), plus your own assembly language. I admittedly always end up writing an assembler outside of the game that copies to my clipboard, but I'm pretty fussy about ux and prefer my normal tools.
otteromkram | 0 comments | 5 months ago
I don't know assembly, but my advice would be to take the rote route by rewriting stuff in assembly.
Just like anything else, there's no quick path to the finish line (unless you're exceptionally gifted), so putting in time is always the best action to take.
HALtheWise | 1 comment | 5 months ago
janwas | 0 comments | 5 months ago
It can be more complicated to write things which are vector-length dependent, such as sorting networks or transposes, but we have always found a way so far.
On the contrary, there are increasing numbers of ISAs, including LoongArch's two (LSX/LASX), AVX-512 (which is really, really good on Zen5), and three versions of Arm SVE. RISC-V V also has lots of variants and extensions. In such a world, I would not want to have to write per-platform implementations.
201984 | 0 comments | 5 months ago
qingcharles | 7 comments | 5 months ago
Can we not just write tests and have some LLM try 10,000 different algorithms and profile the results?
Or is an LLM unlikely to find the optimal solution even with 10,000 random seeds?
Just asking. Optimizing x86 by hand isn't the easiest, because to think it through you have to try to fit all the registers in your mind and work through the combinations. Also, you need to know how long each instruction combination will take, and some of these instructions have weird edge cases that run vastly longer or quicker, which is hard for a human to take into account.
Ecco | 0 comments | 5 months ago
I don't have an answer but I believe that a lot of effort has been put in making (very smart) compilers already, so if it's even possible I doubt it's easy.
I also believe there are some cases where it's simply not possible for a compiler to beat handwritten assembly: there is only so much info you can convey in a C program, and a developer who's aware of the whole program's behavior might be able to make extra assumptions (not written in the C code) and therefore beat a compiler. I'm sure people here would be able to come up with great practical examples of this.
magicalhippo | 1 comment | 5 months ago
Like, I should be able to give the compiler a hot loop and a week, and see what it can come up with.
One potential pitfall I can see is that there are a lot of non-local interactions in modern systems. We have large out-of-order buffers, many caching layers, complex branch predictors, an OS running other tasks at the same time, and a dozen other things.
What is optimal on paper might not be optimal in the real world.
dist-epoch | 0 comments | 5 months ago
There are optimization libraries which can find the optimum combination of parameters for an objective, like Optuna.
It would be enough to expose all the optimization knobs that LLVM has, and Optuna will find the optimum for a particular piece of code on a particular test payload.
danybittel | 1 comment | 5 months ago
janwas | 0 comments | 5 months ago
kierank | 0 comments | 5 months ago
saagarjha | 0 comments | 5 months ago
christiangenco | 0 comments | 5 months ago
Hi thank you for writing this!
Daniel_Van_Zant | 21 comments | 5 months ago
msaltz | 1 comment | 5 months ago
I actually quite like coding in assembly now (though I haven’t done much more than the tutorial, just made an array library that I could call from C). I think it’s so fun because at that level there’s very little magic left - you’re really saying exactly what should happen. What you see is mostly what you get. It also helped me understand linking a lot better and other things that I understood at a high level but still felt fuzzy on some details.
Am now interested to check out this ffmpeg tutorial bc it’s x86 and not ARM :)
Daniel_Van_Zant | 0 comments | 5 months ago
crq-yml | 0 comments | 5 months ago
It isn't a thing to be scared of - assembly is verbose, not complex. Everything you do in it needs load and store, load and store, millions of times. When you add some macros and build-time checks, or put it in the context of a Forth system(which wraps an interpreter around "run chunks of assembly", enabling interactive development and scripting) - it's not that far off from C, and it removes the magic of the compiler.
I'm an advocate for going retro with it as well; an 8-bit machine in an emulator keeps the working model small, in a well-documented zone, and adds constraints that make it valuable to think about doing more tasks in assembly, which so often is not the case once you are using a 32-bit or later architecture and you have a lot of resources to throw around. People who develop in assembly for work will have more specific preferences, but beginners mostly need an environment where the documentation and examples are good. Rosetta Code has some good assembly language examples that are worth using as a way to learn.
btown | 0 comments | 5 months ago
I remember a university course where we competed on who could have the most performant assembly program for a specific task; everyone tried various variants of loop unrolling to eke out the best performance and guide the processor away from bad branch predictions. I may or may not have hit Ballmer Peak the night before the due date and tried a setup that most others missed, and won the competition by a hair!
There’s also the incredible joy of seeing https://github.com/chrislgarry/Apollo-11 and quipping “this is a Unix system; I know this!” Knowing how to read the language of how we made it to the moon will never fade in wonder.
Short answer: yes!
brown | 2 comments | 5 months ago
renox | 0 comments | 5 months ago
For example, an HLL pointer is different from an assembly pointer(1). Sure, the HLL pointer will be lowered to an assembly-language pointer eventually, but it still has different semantics.
1: because you're relying on the compiler to use the registers efficiently, HLL pointers must be restricted; otherwise programs would be awfully slow as soon as you used one pointer.
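To illustrate the footnote's aliasing point with C's restrict qualifier (my example, not the commenter's):

    #include <stddef.h>

    /* Without restrict, a store to dst[i] might change *k, so the
     * compiler must reload *k from memory on every iteration. */
    void scale(float *dst, const float *k, size_t n)
    {
        for (size_t i = 0; i < n; i++)
            dst[i] *= *k;
    }

    /* With restrict, *k provably can't alias dst[], so it can live in
     * a register for the whole loop. */
    void scale_restrict(float *restrict dst, const float *restrict k, size_t n)
    {
        for (size_t i = 0; i < n; i++)
            dst[i] *= *k;
    }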
Daniel_Van_Zant | 0 comments | 5 months ago
daeken | 0 comments | 5 months ago
kevingadd | 0 comments | 5 months ago
Of course, most applications probably never need optimization to that degree, so it's still kind of a niche skill.
ghhrjfkt4k | 0 comments | 5 months ago
The library sqrt handles all kinds of edge-cases which prevent the compiler from autovectorizing it.
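For instance (my example): at plain -O2 the compiler must keep the libm call, because sqrtf can set errno on negative input, so the loop stays scalar; with -fno-math-errno (or -ffast-math) GCC/Clang can inline a hardware square root and vectorize the loop (e.g. sqrtps on x86).

    #include <math.h>
    #include <stddef.h>

    void root_all(float *dst, const float *src, size_t n)
    {
        for (size_t i = 0; i < n; i++)
            dst[i] = sqrtf(src[i]);  /* libm sqrtf may set errno */
    }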
sigbottle | 0 comments | 5 months ago
Compilers emit debug symbols, you can tune optimization levels, etc., so it's hopefully not too scary a mess once you objdump it. I've seen people both use their assembly knowledge at work and get rewarded handsomely for it.
bitwize | 0 comments | 5 months ago
jwr | 0 comments | 5 months ago
tengwar2 | 0 comments | 5 months ago
All of them apart from the screen memory thing were fun, but the only one which could be useful these days is the bit-twiddling. All the rest have been made obsolete by improved operating systems, so that the domain of useful assembler programs shrinks ever further. OTOH, debugging them is vastly easier than in the old days where all you got was random lines drawn across your screen as the system crashed, and you couldn't even single-step because you had to bank-switch out.
AnyTimeTraveler | 0 comments | 5 months ago
In my master's degree there was another course where one built their own computer PCB in Eagle, got it fabbed, and then had to make a game for the 8052 CPU on it. 8052 assembly is very fun! The processor has a few bytes of RAM where every bit is individually addressable and testable. I built the game Tetris on three attached persistence-of-vision LED matrices[1]. Unfortunately, the repository isn't very clean, but I used expressive variable names, so it should be readable. I did create my own calling convention for performance reasons and calculated how many CPU cycles were available for game logic between screen refreshes. Those were all very fun things to think about :)
Reading assembly now has me look up instruction names here and there, but mostly I can understand what's going on.
[0] https://github.com/AnyTimeTraveler/HardwareNaheProgrammierun... [1] https://github.com/AnyTimeTraveler/HardwarenaheSystementwick...
pjmlp | 0 comments | 5 months ago
Nowadays most of that can be done with intrinsics, which were already present in some 1960s system programming languages, predating UNIX by a decade.
Modern assembly is too complex; it is probably easier to target retrogaming, or virtual consoles, if the purpose is having fun.
anta40 | 0 comments | 5 months ago
When lockdown started in 2020, I thought working from home would give me more spare time, thus enrolled those classes on Udemy.
I'm a mobile app dev (Java/Kotlin), and assembly is practically irrelevant for daily use cases.
saagarjha | 0 comments | 5 months ago
nevi-me | 0 comments | 5 months ago
A few years ago I embarked on learning ARM assembly, I also got far, but I found it more laborious somehow. x64 is just too much for me to want to learn.
colanderman | 0 comments | 5 months ago
mobiledev2014 | 0 comments | 5 months ago
The most popular are the Zachtronics games and Tomorrow Corp games. They’re so so good!
gostsamo | 0 comments | 5 months ago
YZF | 0 comments | 5 months ago
dinkumthinkum | 0 comments | 5 months ago
jupp0r | 1 comment | 5 months ago
frontfor | 2 comments | 5 months ago
> To make multimedia processing fast. It’s very common to get a 10x or more speed improvement from writing assembly code, which is especially important when wanting to play videos in real time without stuttering.
TinkersW | 0 comments | 5 months ago
Personally I'd say the only good reason to use assembly over intrinsics is having control over calling convention, for example the windows CC is absolute trash and wastes many SIMD registers.
edward28 | 0 comments | 5 months ago
slicktux | 0 comments | 5 months ago
lukaslalinsky | 0 comments | 5 months ago
foresto | 3 comments | 5 months ago
I find that sentence confusing.
I assume that i.e is supposed to be i.e., but what is *(* supposed to mean? Shouldn't that be just an open parenthesis?
In what context would *sizeof(*src) be considered valid? As far as I know, sizeof never yields a pointer.
I get the impression that someone sprinkled random asterisks in that sentence, or maybe tried to mix asterisks-denoting-italics with C syntax.
kevingadd | 0 comments | 5 months ago
SavioMak | 0 comments | 5 months ago
sweeter | 1 comment | 5 months ago
foresto | 1 comment | 5 months ago
Nitpicking aside, the result of sizeof(*src) would be the size of the object at which the pointer points. The type of that result is size_t. That's what makes this code from the lesson I quoted invalid:
*sizeof(*src)
That first asterisk tries to dereference the result of sizeof as though it were a pointer, but it's a size_t: an unsigned integer type. Not a pointer.
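For contrast, a valid use of the same expression (my example, presumably the kind of thing the lesson meant): sizeof(*src) as an element size, with no stray dereference.

    #include <stdint.h>
    #include <string.h>

    void copy_elems(uint16_t *dst, const uint16_t *src, size_t n)
    {
        /* sizeof(*src) == sizeof(uint16_t) == 2; it scales a count,
         * it is never dereferenced. */
        memcpy(dst, src, n * sizeof(*src));
    }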
sweeter | 1 comment | 5 months ago
foresto | 1 comment | 5 months ago
sweeter | 0 comments | 5 months ago
wruza | 0 comments | 5 months ago
imglorp | 9 comments | 5 months ago
jsheard | 2 comments | 5 months ago
You might see a 10x difference if you compare meticulously optimized assembly to naive C in cases where vectorization is possible but the compiler fails to capitalize on that, which is often, because auto-vectorization still mostly sucks beyond trivial cases. It's not really a surprise that expert code runs circles around naive code though.
CyberDildonics | 1 comment | 5 months ago
I can get far more than 10x over naive C just by reordering memory accesses. With SIMD it can be 7x more, but that can be done with ISPC; it doesn't need to be done in asm.
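The kind of reordering meant here (an illustrative sketch): loop interchange that turns a strided, cache-hostile walk into a linear one, with identical arithmetic.

    #define N 1024

    /* Column-major walk of a row-major array: a stride of N floats
     * between consecutive accesses, so nearly every access misses. */
    void sum_cols(const float (*m)[N], float *out)
    {
        for (int j = 0; j < N; j++)
            for (int i = 0; i < N; i++)
                out[j] += m[i][j];
    }

    /* Same result with the loops swapped: memory is walked linearly,
     * which the prefetcher loves. out[] must be zeroed by the caller. */
    void sum_cols_fast(const float (*m)[N], float *out)
    {
        for (int i = 0; i < N; i++)
            for (int j = 0; j < N; j++)
                out[j] += m[i][j];
    }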
magicalhippo | 0 comments | 5 months ago
However, you can write better-than-naive C by compiling and watching the compiler output.
I stopped writing assembly back around y2k as I was fairly consistently getting beaten by the compiler when I wrote compiler-friendly high-level code. Memory organization is also something you can control fairly well on the high-level code side too.
Sure some niches remained, but for my projects the gains were very modest compared to invested time.
UltraSane | 3 comments | 5 months ago
Why?
Narishma | 0 comments | 5 months ago
oguz-ismail | 3 comments | 5 months ago
jsheard | 1 comment | 5 months ago
oguz-ismail | 1 comment | 5 months ago
Nah. I find well commented three column AT&T assembly with light use of C preprocessor macros easier and more enjoyable to read.
Inityx | 1 comment | 5 months ago
saagarjha | 0 comments | 5 months ago
t-3 | 0 comments | 5 months ago
xgkickt | 0 comments | 5 months ago
schainks | 2 comments | 5 months ago
TL;DR They want to squeeze every drop of performance out of the CPU when processing media, and maintaining a mixture of intrinsics code and assembly is not worth the trade off when doing 100% assembly offers better performance guarantees, readability, and ease of maintenance / onboarding of developers.
astrange | 1 comment | 5 months ago
There is just a little bit of intrinsics code in ffmpeg, which I wrote, that does memory copies.
https://github.com/FFmpeg/FFmpeg/blob/master/libavutil/x86/i...
It's like this because we didn't want to hide the memory accesses from the compiler, because that hurts optimization, as well as memory tools like ASan.
janwas | 0 comments | 5 months ago
Good point about asan and other instrumentation :) hm, I'd think that is very important for codecs in particular?
brigade | 0 comments | 5 months ago
Nowadays the impact of compilers' suboptimal register allocation and addressing calculations is almost unmeasurable, between having 16/32 registers available and CPUs that are 8-10 wide in the frontend but have only 3-4 vector units in the backend. But the added complexity of newer codecs has strained their use of the nasm/gas macro systems to the point of being far less readable or maintainable than intrinsics. Like, think of how unmaintainable complex C macros are, and double that.
And it's not uncommon to find asm in ffmpeg or related projects written suboptimally in a way a compiler wouldn't, either because the author didn't fully read/understand CPU performance manuals or because rewriting/twisting the existing macros to fix a small suboptimality is more work than it's worth.
(yes, I have written some asm for ffmpeg in the past)
1propionyl | 0 comments | 5 months ago
Put more simply: a C compiler can't infer from a plain C implementation that you're trying to do certain mathematics that could alternately be expressed more efficiently with SIMD intrinsics. It doesn't have access to your knowledge about the mathematics you're trying to do.
There are also target specific considerations. A compiler is, necessarily, a general purpose compiler. Problems like resource (e.g. register) allocation are NP-complete (equivalent to knapsack) and very few people want their compiler to spend hours upon hours searching for the absolute most optimal (if indeed you can even know that statically...) asmgen.
lukaslalinsky | 2 comments | 5 months ago
PaulDavisThe1st | 1 comment | 5 months ago
Not really a fair comment. They are entirely normal code in most senses. They differ in one important way: they are (frequently) perfect examples of where "single instruction, multiple data" completely makes sense. "Do this to every sample" is the order of the day, and that is a bit odd when compared with text processing or numerical computation.
But this is true of the majority of signal processing, not just codecs. As simple a thing as increasing the volume of an audio data stream means multiplying every sample by the same value - more or less the definition of SIMD.
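That volume example, sketched both ways (my code, using GCC/Clang vector extensions; assumes n is a multiple of 8):

    #include <stddef.h>
    #include <string.h>

    typedef float v8sf __attribute__((vector_size(32)));

    void gain_scalar(float *buf, float g, size_t n)
    {
        for (size_t i = 0; i < n; i++)
            buf[i] *= g;  /* the same multiply applied to every sample */
    }

    void gain_simd(float *buf, float g, size_t n)
    {
        v8sf vg = {g, g, g, g, g, g, g, g};
        for (size_t i = 0; i < n; i += 8) {
            v8sf v;
            memcpy(&v, buf + i, sizeof v);  /* unaligned-safe load */
            v *= vg;                        /* 8 samples per multiply */
            memcpy(buf + i, &v, sizeof v);
        }
    }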
astrange | 1 comment | 5 months ago
screcth | 1 comment | 5 months ago
Think encoding or decoding non-overlapping parts of a video.
astrange | 0 comments | 5 months ago
That means you have to stay inside the current block. But there are some tricks; for an IDCT, for example, there's a previous stage where you can rearrange the output memory elements for free, so you can shuffle things as you need to fit them into vectors.
bad_username | 0 comments | 5 months ago
Codecs are pretty normal code. You can get decent performance by just writing quality idiomatic C or C++, even without asm. (I implemented a commercial H.264 codec and worked on a bunch of audio codecs.)
variadix | 1 comment | 5 months ago
astrange | 1 comment | 5 months ago
janwas | 0 comments | 5 months ago
jki275 | 2 comments | 5 months ago
CyberDildonics | 0 comments | 5 months ago
1propionyl | 2 comments | 5 months ago
Compilers today are liable in most circumstances to know many more tricks than you do. Especially if you make use of hints (e.g. "this memory is almost always accessed sequentially", "this branch is almost never taken", etc) to guide it.
jki275 | 0 comments | 5 months ago
But I suspect there are cases where the super experts exist who can do things better.
astrange | 0 comments | 5 months ago
warble | 1 comment | 5 months ago
I work on bare metal embedded systems though, so maybe there's some nuance when working with bigger OS libs?
umanwizard | 1 comment | 5 months ago
warble | 0 comments | 5 months ago
bob1029 | 1 comment | 5 months ago
You can go the nuclear option with static compilation and turn on all the optimizations everywhere, but this kills inner-loop iteration speed. I believe there are aspects of some dynamic compiling runtimes that can make them superior to static compilation - even if we don't care how long the build takes.
astrange | 0 comments | 5 months ago
In ffmpeg's case you can just always do the correct thing.
epolanski | 3 comments | 5 months ago
Essentially, people think they are writing low-level code, but in reality that's not how CPUs interpret that code, so he explained how writing manual assembly pretty much always kills performance (at least on modern x86).
iforgotpassword | 0 comments | 5 months ago
If you know it really well, have already optimized everything on an algorithmic level and have code that can benefit from simd, 10x is real.
FarmerPotato | 0 comments | 5 months ago
I've used Intel's icc compiler and profiler tools in an iterative fashion. A compiler like Intel's might be made to profile cache misses, pipeline utilization, branches, stalls, and supposedly improve in the next compilation.
The assembly programmer has to consider those factors. Sure would be nice to have a computer check those things!
In the old days, we only worried about cycle counts, wait states, and number of instructions.
saagarjha | 0 comments | 5 months ago
ajross | 0 comments | 5 months ago
But when the human and compiler are not faced with the same problem...
Say, if your compiler doesn't support autovectorization and/or your C code isn't friendly to the idiom, then sure: a 10x difference in performance between a hand-optimized SIMD implementation and a naive scalar one fed to a C compiler is probably about right.
xuhu | 2 comments | 5 months ago
But then again no-one is adjusting timestamps manually in batch scripts, so a high-level script on top of filter_complex doesn't have much purpose.
chgs | 0 comments | 5 months ago
pdyc | 0 comments | 5 months ago
agumonkey | 0 comments | 5 months ago
fracus | 0 comments | 5 months ago
thayne | 9 comments | 5 months ago
One thing that confuses me is the opposition to inline asm. It seems like inline asm would be more efficient than having to make a function call to an asm function.
PaulDavisThe1st | 2 comments | 5 months ago
Ardour's own code doesn't do very much DSP (it's a policy choice), but one thing that our own code does do is metering: comparing the current sample value in a given audio data stream against the max (or min) of all previous samples within a given time window, to decide whether it is a new max (or min).
When someone stepped forward (hi Sampo!) to code this in hand-written SIMD assembler, we got a 30% reduction in CPU usage when using mid-sized buffers on moderate size sessions (say, 24 tracks or so).
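The kernel being described is essentially a running max of absolute sample values; a rough SSE sketch (my reconstruction, not Ardour's actual code):

    #include <stddef.h>
    #include <xmmintrin.h>  /* SSE */

    float peak_sse(const float *buf, size_t n, float current_peak)
    {
        __m128 vmax = _mm_set1_ps(current_peak);
        const __m128 sign = _mm_set1_ps(-0.0f);  /* just the sign bit */
        size_t i = 0;
        for (; i + 4 <= n; i += 4) {
            __m128 v = _mm_loadu_ps(buf + i);
            v = _mm_andnot_ps(sign, v);   /* fabsf of 4 samples at once */
            vmax = _mm_max_ps(vmax, v);
        }
        /* Reduce the 4 lanes to a single scalar max. */
        float lanes[4], peak;
        _mm_storeu_ps(lanes, vmax);
        peak = lanes[0];
        for (int j = 1; j < 4; j++)
            if (lanes[j] > peak) peak = lanes[j];
        /* Scalar tail for leftover samples. */
        for (; i < n; i++) {
            float a = buf[i] < 0 ? -buf[i] : buf[i];
            if (a > peak) peak = a;
        }
        return peak;
    }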
That's a worthy tradeoff, even though it means that we now have 5 different asm versions of about half-a-dozen functions. The good news is that they don't really need to be maintained. New SIMD architectures mean new implementations, not hacks to existing code.
However, I should note that it is always very important to compare what compilers are capable of, and to keep comparing that. In the decade or more after our asm metering code was first written, gcc improved to the point where simply using C(++) and some compiler flags produced code that was within an instruction or two of our hand-crafted version (and may be more correct in the face of all possible conditions).
So ... you can get dramatic performance benefits that are worth the effort, and the maintenance costs are low, but you should keep checking how your code compares with today's compilers' best optimization efforts.
thayne | 1 comment | 5 months ago
The performance wins may very well be worth it, but it is still good to be aware of the tradeoff involved.
saagarjha | 0 comments | 5 months ago
sweeter | 0 comments | 5 months ago
arkj | 0 comments | 5 months ago
adgjlsfhk1 | 4 comments | 5 months ago
sorenjan | 0 comments | 5 months ago
https://store.steampowered.com/hwsurvey/Steam-Hardware-Softw...
Someone | 1 comment | 5 months ago
So, chances are you’ll need to have more than one AVX2 assembly version of your code if you want to have the fastest code.
anonymoushn | 0 comments | 5 months ago
renhanxue | 0 comments | 5 months ago
withinboredom | 0 comments | 5 months ago
wffurr | 2 comments | 5 months ago
kccqzy | 0 comments | 5 months ago
femto | 0 comments | 5 months ago
astrange | 0 comments | 5 months ago
There is no such thing as a generic "SIMD API" it could use, because it uses all the specific hardware tools it can to be performant. Anyone who thinks this is possible is simply mistaken. You can tell because none of them have written ffmpeg.
(There are some things called "array languages" or "stream processing" or "autoscalarization" that work better than SIMD - an example is ispc. But they're not a great fit here, because ffmpeg isn't massively parallel. It's just parallel enough to work.)
hereonout2 | 0 comments | 5 months ago
They lay it out quite clearly I think, but things like libavcodec are probably one of the few types of project where the benefits of assembly outweigh the lack of portability.
I'm not sure rust or zig's support for SIMD would be the project's first complaint either. Likely more concerned with porting a 25 year old codebase to a new language first.
aidenn0 | 1 comment | 5 months ago
jsheard | 0 comments | 5 months ago
brigade | 1 comment | 5 months ago
wyldfire | 0 comments | 5 months ago
As far as C, C++ go - that's two out of three. So it's not as bad as it sounds to be "at most two".
anonymoushn | 0 comments | 5 months ago
The thing that is present in Zig and not yet stable in Rust does not include any dynamic shuffles, so these end up requiring intrinsics or asm for all sorts of things. It's a significant weakness compared to e.g. highway, eve, or simde.
eachro | 0 comments | 5 months ago
fulafel | 0 comments | 5 months ago
neallindsay | 1 comment | 5 months ago
jesserobbins | 0 comments | 5 months ago
krick | 0 comments | 5 months ago
Charon77 | 0 comments | 5 months ago
jancsika | 2 comments | 5 months ago
umanwizard | 2 comments | 5 months ago
aidenn0 | 0 comments | 5 months ago
astrange | 1 comment | 5 months ago
TinkersW | 0 comments | 5 months ago
For SIMD integer to scalar integer, it has to move into a separate register, so there is a short penalty (3 cycles IIRC).
kccqzy | 0 comments | 5 months ago
henning | 0 comments | 5 months ago
sylware | 1 comment | 5 months ago
They are abusing nasm macro-preprocessor up to obscene levels...
ryanianian | 1 comment | 5 months ago
sylware | 2 comments | 5 months ago
And the macro language is specific to nasm.
What to do: unroll the macros and/or use a little abstraction using a simple common macro preprocessor, aka not tied to the assembler.
And I am just doing exactly that: my x86_64 assembly code does assemble with fasm/nasm/gas with a little abstraction using a C preprocessor.
PhilipRoman | 0 comments | 5 months ago
pengaru | 1 comment | 5 months ago
sylware | 0 comments | 5 months ago
belter | 0 comments | 5 months ago
ej1 | 0 comments | 5 months ago
mkoubaa | 4 comments | 5 months ago
qingcharles | 0 comments | 5 months ago
graypegg | 0 comments | 5 months ago
Narishma | 0 comments | 5 months ago
ghhrjfkt4k | 0 comments | 5 months ago
toisanji | 2 comments | 5 months ago
mikestew | 0 comments | 5 months ago
saagarjha | 0 comments | 5 months ago
beebaween | 2 comments | 5 months ago
Especially curious given the advent of apple metal etc.
Does anyone have recommendations?
mvdtnz | 1 comment | 5 months ago
filleduchaos | 1 comment | 5 months ago
The pipeline/plugin based architecture is pretty neat even as an end user, I find it a lot more discoverable.
lukaslalinsky | 1 comment | 5 months ago
filleduchaos | 0 comments | 5 months ago
Because GStreamer is designed to be modular and plugin-based, all sorts of plugins exist for it (including libav plugins) but it is itself a media framework that is quite capable of replacing ffmpeg. Neither is "higher level" than the other.
adamnemecek | 0 comments | 5 months ago
imchaz | 1 comment | 5 months ago
thegrim33 | 2 comments | 5 months ago
The "documentation" is a collect of 15-20 year old source samples. The vast majority of them either won't compile anymore because the API has changed, or they use 2, 3, or 4 times deprecated functions that shouldn't be used anymore. The source examples also have almost no comments explaining anything. They have super dense, super complicated code, with no comments, but then there will be a line like "setRenderSpeed(3)" or whatever and it'll have a comment: "Sets render speed to 3", the absolute most useless comment ever. The source examples are also all written in 30 year old as C-Style of C code as you can get, incredibly horribly dense, with almost no organization, have to jump up and down all over the file to find the global variables being accessed, it's just gross and barely comprehensible.
They put a lot of effort into producing doxygen documentation for everything, but the doxygen documentation is nearly useless, it just lists the API with effectively zero documentation or explanation on the functions or parameters. There's so little explanation of how to do anything. On the website they have sections for each library, and for most libraries you get 2-3 sentences of explanation on what the library is for, and that's it. That's the extent the entire library is documented. They just drop an undocumented massive C API split across a dozen or so libraries on you and wish you luck.
The API has also gotten absolutely wrecked over the last 20 years, or however long it's been around, as it has evolved. Sometimes they straight up delete functions to deprecate them, sometimes they create a new version of a function as function2 and then as function3 and keep all of them around, and sometimes they replace a function with a completely differently named function and keep them both around. And there's absolutely nothing written anywhere about what the "right" way to do anything is, or which functions you should actually be using. So many times I went down rabbit holes reading some obscure 15 year old mailing list post, trying to find anyone who had successfully done something I was trying to do. And again, the obscure message board posts and documentation that do exist are almost all deprecated at this point and shouldn't be used.
Then there's the custom build system: if you need a custom build to enable or disable different features, you can't use any modern build system; it's all custom scripts that do weird things like dumping build output into a hardcoded path in your home directory. That makes it difficult to integrate with a modern build system.
It has so much momentum and so many users, but man, there has to be a massive opening for someone to replace ffmpeg with a modern programming language and a modern build system: built with GPU acceleration in mind from the beginning rather than tacked on top 20 years later, not written in 30-year-old C-style code, and actually documented.
Ono-Sendai | 0 comments | 5 months ago
imchaz | 1 comment | 5 months ago
imchaz | 0 comments | 5 months ago
netr0ute | 7 comments | 5 months ago
KeplerBoy | 0 comments | 5 months ago
wolf550e | 1 comment | 5 months ago
zozbot234 | 0 comments | 5 months ago
high_na_euv | 1 comment | 5 months ago
ksec | 0 comments | 5 months ago
You do not criticise The Rusted Holy Grail and the Riscy Silver Bullet.
do_not_redeem | 2 comments | 5 months ago
hagbard_c | 0 comments | 5 months ago
snvzz | 0 comments | 5 months ago
Only while fasting can a person think clearly. When thinking clearly, RISC-V is inevitably chosen as the ISA.
Fasting will also eventually make you hungry. Thus "RISC-V is coming to eat its lunch, FAST."