Vogtinator | 2 comments | 3 days ago
The latency shows after how many cycles the result of an instruction can be consumed by another, while the throughput shows how many such instructions can be pipelined per cycle, i.e. in parallel.
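A minimal C sketch of how the two numbers are usually measured (the function names and constants below are made up for illustration, not from any particular tool): a chain of dependent multiplies exposes the latency, while several independent chains expose the inverse throughput.

    #include <stdint.h>

    /* Latency: every multiply consumes the previous result, so the loop
       takes roughly n * latency cycles. Call with an odd x; odd * odd
       stays odd, so the chain never degenerates to 0. */
    uint64_t mul_latency_chain(uint64_t x, long n) {
        for (long i = 0; i < n; i++)
            x = x * x;                      /* serial dependency */
        return x;
    }

    /* Throughput: four independent chains can be in flight in the pipelined
       multiplier at once, so the 4*n multiplies finish in roughly
       4*n / throughput cycles rather than 4*n * latency cycles. */
    uint64_t mul_throughput_chains(uint64_t x, long n) {
        uint64_t a = x | 1, b = (x + 2) | 1, c = (x + 4) | 1, d = (x + 6) | 1;
        for (long i = 0; i < n; i++) {
            a = a * a;
            b = b * b;
            c = c * c;
            d = d * d;
        }
        return a ^ b ^ c ^ d;
    }

Time both with a large n and divide by the core clock: the first gives cycles per multiply ~ latency, the second ~ 1/throughput (up to loop overhead and however many chains the unit can actually keep in flight).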
wtallis | 1 comment | 3 days ago
BeeOnRope | 1 comment | 3 days ago
For example, for many years Intel chips had a multiplier unit on a single port, with a latency of 3 cycles, but an inverse throughput of 1 cycle, so effectively pipelined across 3 stages.
In any case, I think uops.info [1] has replaced Agner for up-to-date and detailed information on instruction execution.
---
Earw0rm | 1 comment | 3 days ago
BeeOnRope | 0 comments | 3 days ago
ajross | 3 comments | 3 days ago
The Fog tables try hard to show the former, not the latter. You measure dispatch parallelism with benchmarks, not microscopes.
Also IIRC there are still some non-pipelined units in Intel chips, like the division engine, which show latency numbers ~= to their execution time.
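You can see that property from software, too. A hedged C sketch (the divisor is a runtime variable on purpose, since a constant divisor would be strength-reduced to a multiply and no div would ever be issued): on a fully pipelined unit the two-chain version would take about as long as the single chain while doing twice the work, but on a non-pipelined divider it takes roughly twice as long, because the independent divisions still queue behind the one busy unit.

    #include <stdint.h>

    /* Serial chain of divisions: bounded by the divider's latency.
       Call with d > 1. */
    uint64_t div_latency_chain(uint64_t x, uint64_t d, long n) {
        for (long i = 0; i < n; i++)
            x = x / d + 12345;      /* serial dependency through x */
        return x;
    }

    /* Two independent chains: little or no speedup per division if the
       divider is not pipelined. */
    uint64_t div_two_chains(uint64_t a, uint64_t b, uint64_t d, long n) {
        for (long i = 0; i < n; i++) {
            a = a / d + 12345;
            b = b / d + 54321;
        }
        return a ^ b;
    }

Real dividers are often partially pipelined and have data-dependent latency, so expect something between the two extremes rather than an exact 1x or 2x.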
BeeOnRope | 1 comment | 3 days ago
Pipelining in stages like fetch and decode is mostly hidden in these small benchmarks, but becomes visible when there are branch mispredictions, other types of flushes, I$ misses and so on.
ajross | 2 comments | 3 days ago
I'm curious what you think the distinction is? Those statements are equivalent. The circuit implementing "an instruction" can't work in a single cycle, so you break it up and overlap sequentially issued instructions. Exactly what they do will be different for different hardware, sure; clearly we've moved beyond the classic four-stage Patterson pipeline. But that doesn't make it a different kind of pipelining!
BeeOnRope | 1 comment | 3 days ago
I.e., a lot of what you need to model for tight loops only depends on the execution latencies (as little as 1 cycle), and not on the full pipeline end-to-end latency (almost always more than 10 cycles on big OoO, maybe more than 20).
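A sketch of what such a model looks like in practice (all numbers below are invented for illustration): take the worse of the loop-carried dependency chain and the busiest per-port throughput demand; the pipeline's end-to-end depth never appears unless something flushes it.

    #include <stdio.h>

    int main(void) {
        /* Hypothetical per-iteration figures for a tight loop. */
        double chain_latency = 4.0;             /* longest loop-carried chain, cycles */
        double muls = 2.0,  mul_ports  = 2.0;   /* throughput-limited resources */
        double loads = 2.0, load_ports = 2.0;

        double tput = muls / mul_ports;
        if (loads / load_ports > tput)
            tput = loads / load_ports;

        /* Steady-state estimate: max(dependency chain, busiest resource). */
        double cycles_per_iter = chain_latency > tput ? chain_latency : tput;
        printf("~%.1f cycles per iteration\n", cycles_per_iter);
        return 0;
    }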
eigenform | 2 comments | 3 days ago
Those are different notions of pipelining with different motivations: one is motivated by "instruction-level parallelism," and the other is motivated by "achieving higher clock rates." If 64-bit multiplication were not pipelined, the minimum achievable clock period would be constrained by "how long it takes for bits to propagate through your multiplier."
tjoff | 0 comments | 3 days ago
Which are exactly the same thing? For exactly the same reasons?
Sure, you can focus your investigation on one or the other but that doesn't change what they are or somehow change the motivations for why it is being done.
And you can have a shorter clock period than your non-pipelined multiplier just fine. Just that other uses of that multiplier would stall in the meantime.
formerly_proven | 1 comment | 3 days ago
gpderetta | 0 comments | 3 days ago
gpderetta | 2 comments | 3 days ago
Being able to execute multiple instructions per cycle is more properly superscalar execution, right? In-order designs are also capable of doing it, and the separate execution units do not even need to run in lockstep (consider the original P5 U and V pipes).
wtallis | 1 comment | 3 days ago
(What's the best-performing in-order general purpose CPU core? POWER6 was notably in-order and ran at quite high clock speeds for the time. Intel's first-gen Atom cores were in-order and arrived around the same time as POWER6, but at half the clock speed. SPARC T3 ran at an even lower clock speed.)
gpderetta | 1 comment | 3 days ago
sillywalk | 0 comments | 3 days ago
ajross | 2 comments | 3 days ago
But as to OO: the whole idea of issuing sequential instructions in parallel means that the hardware needs to track dependencies between them so they can't race ahead of their inputs. And if you're going to do that anyway, allowing them to retire out of order is a big performance/transistor-count win as it allows the pipeline lengths to be different.
gpderetta | 1 comment | 3 days ago
But you can have, for example, a classic in-order RISC design that allows for parallel execution. OoO renaming is not necessary for dependency tracking (in fact even scalar in-order CPUs need dependency tracking to resolve RAW and other hazards); it is "only" needed for executing around stalled instructions (whereas an in-order design will stall the whole pipeline).
Again, the P5 (i.e. the original Pentium) was a very traditional in-order design, yet it could execute up to two instructions per cycle.
ajross | 2 comments | 3 days ago
No it isn't. I'm being very deliberate here in refusing pedantry. In practice, "multiple dispatch" means "OO" in the same way that "VLIW" means "parallel in-order dispatch". Yes, you can imagine hypothetical CPUs that blur the distinction, but they'd be so weird that they'd never be built. Discussing the jargon without context only confuses things.
> you can have, for example, a classic in-order RISC design that allows for parallel execution.
Only by inventing VLIW, though; otherwise there's no way to tell the CPU how to order what it does. Which is my point; the ideas are joined at the hip. Note that the Pentium had two defined pipes with specific rules about how the pairing was encoded in the instruction stream. It was, in practice, a VLIW architecture (just one with a variable length encoding and where most of the available instruction bundles only filled one slot)! Pedantry hurts in this world, it doesn't help.
wtallis | 0 comments | 3 days ago
This is ridiculous. There are no nop-filled slots in the instruction stream, and you can't even be sure which instructions will issue together unless you trace backwards far enough to find a sequence of instructions that can only be executed on port 0 and thus provide a known synchronization point. The P5 only has one small thing in common with VLIW, and there's already a well-accepted name for that feature, and it isn't VLIW.
gpderetta | 0 comments | 3 days ago
atq2119 | 1 comment | 3 days ago
You could have a VLIW ISA that is implemented by a processor that "unrolls" each instruction word and mostly executes the constituent instructions serially.
tliltocatl | 0 comments | 3 days ago
eigenform | 2 comments | 3 days ago
I don't think that's accurate. That latency exists because the execution unit is pipelined. If it were not pipelined, there would be no latency. The latency corresponds to the fact that "doing division" is distributed across multiple clock cycles.
BeeOnRope | 0 comments | 18 hours ago
eigenform | 0 comments | 3 days ago
If it were pipelined, you'd expect to be able to schedule DIV every cycle, but I don't think that's the case. Plus, 99% of the time the pipeline would just be doing nothing because normal programs aren't doing 18 DIV instructions in a row :^)
barbegal | 3 comments | 3 days ago
In most real code the high throughput of these sorts of operations means that something else is the limiting factor. And if multiplier throughput is limiting performance then you should be using SIMD or a GPU.
sroussey | 0 comments | 3 days ago
True. You can imagine how difficult it is for the hardware engineer designing and testing these things before production!
Joker_vD | 1 comment | 3 days ago
And it's quite sad because when you are faced with choosing between two ways to express something in the code, you can't predict how fast one or the other option will run. You need to actually run both, preferably in an environment close to prod, and under similar load, to get an accurate idea of which one is more performant.
And the worst thing is, you most likely can't extract any useful general principle out of it, because any small perturbation in the problem will result in code that is very similar yet has completely different latency/throughput characteristics.
The only saving grace is that modern computers are really incredibly fast, so layers upon layers of suboptimal code result in applications that mostly perform okay, with maybe some places where they perform egregiously slow.
mpweiher | 1 comment | 3 days ago
"The best way to predict the future is to invent it" -- Alan Kay
> You need to actually run both
Always! If you're not measuring, you're not doing performance optimization. And if you think CPUs are bad: try benchmarking I/O.
Operating System (n) -- Mechanism designed specifically to prevent any meaningful performance measurement (every performance engineer ever)
If it's measurable and repeatable, it's not meaningful. If it's meaningful, it's not measurable or repeatable.
Pretty much.
Or put another way: an actual stop watch is a very meaningful performance measurement tool.
Joker_vD | 0 comments | 3 days ago
Imagine having to design electronics the way we design performant programs. Will this opamp survive the load? Who knows, let's build and try these five alternatives of the circuit and see which one of them will not blow. Oh, this one survived but it distorts the input signal horribly ("yeah, this one is fast, but it has multithreading correctness issues and reintroduction of locks makes it again about as slow"), what a shame. Back to the drawing board.
alain94040 | 2 comments | 3 days ago
devit | 1 comment | 3 days ago
A pro would generally only run benchmarks if it's the only way to find out (or if it's easy), but isn't going to trust the results unless there's a good explanation for the effects, or unless they actually just want to compare two very specific configurations rather than coming up with a general finding.
LegionMammal978 | 0 comments | 3 days ago
pizlonator | 0 comments | 3 days ago
dhash | 1 comment | 3 days ago
For me, this really makes working with a modern microprocessor a science, as anyone who has written benchmarks knows -- it's difficult to reason about the complex behaviour and performance cliffs without testing.
Another excellent example of the weirdness has to be JVM anatomy quarks [1]
gpderetta | 2 comments | 3 days ago
titanomachy | 1 comment | 3 days ago
I'd love to see an updated version. The article talks about how out-of-order execution has a high power overhead and produces relatively small performance gains, but the M-series chips from Apple have deep OOO and low power consumption; I'm curious to learn what they did differently.
atq2119 | 0 comments | 3 days ago
Being 20 years later than the original version of that article.
Silicon process improvements since then have been wild.
aargh_aargh | 0 comments | 3 days ago
rasz | 1 comment | 3 days ago
gary_0 | 2 comments | 3 days ago
rasz | 0 comments | 3 days ago
On x86, integer instructions _never_ waited for a floating point opcode to retire. You can check this yourself by reading the FPU status register busy flag - if the FPU were blocking, you could never catch it in the BUSY state and FWAIT would be useless :-)
Abrash exploited the fact that the Pentium was the first x86 on which x87 FPU instructions could run pipelined, overlapping one another. All other x86 vendors' FPUs waited for the previous FPU instruction to retire.
https://www.agner.org/optimize/microarchitecture.pdf page 47
While floating point instructions in general cannot be paired, many can be pipelined, i.e. one instruction can begin before the previous instruction has finished. Example:
fadd st1,st0 ; Clock cycle 1-3
fadd st2,st0 ; Clock cycle 2-4
fadd st3,st0 ; Clock cycle 3-5
fadd st4,st0 ; Clock cycle 4-6
Obviously, two instructions cannot overlap if the second instruction needs the result of the first one. Since almost all floating point instructions involve the top of stack register, ST0, there are seemingly not very many possibilities for making an instruction independent of the result of previous instructions. The solution to this problem is register renaming. The FXCH instruction does not in reality swap the contents of two registers; it only swaps their names.
Paradoxically, the often-mentioned texture divide every 16 pixels overlaps just fine on all CPUs and doesn't explain the performance discrepancies. Intel FDIV latency is 19 cycles compared to Cyrix's 24, a mere 20% difference, yet you need almost double the MHz on a Cyrix to match the FPS numbers. The answer lies in the rest of Quake's heavily pipelined FPU code.
gpderetta | 1 comment | 3 days ago
ack_complete | 0 comments | 3 days ago
Additionally, the FXCH instructions required to optimally schedule FPU instructions on the Pentium hurt 486/K6/6x86 performance even more since they cost additional cycles. Hard for the 6x86 to keep up when it takes 7 cycles to execute an FADD+FXCH pair vs. 1 for the Pentium.
delusional | 5 comments | 3 days ago
I know that some ALUs have multiple ADD complexes, and I assume that would influence the answer, which is why I specified x86.
BeeOnRope | 0 comments | 3 days ago
In some cases, mixing operations with _different_ latencies on the same execution port will leave you with less throughput than you expect due to "writeback conflicts", i.e., two instructions finishing on the same cycle (e.g., a 2-cycle operation starting on cycle 0 and a 1-cycle operation starting on cycle 1 will both finish on cycle 2, and on some CPUs this will delay the result of one of the operations by 1 cycle due to the conflict).
bjourne | 0 comments | 3 days ago
barbegal | 0 comments | 3 days ago
In general though there is no penalty for interleaved operations.
Tuna-Fish | 0 comments | 3 days ago
There are some exceptions where "domain crossing", i.e. using a very different kind of operation, costs an extra clock cycle. Notably, in vector registers, using an FP operation on the result of an integer op (or vice versa): modern CPUs don't actually hold FP values in their IEEE 754 transfer format inside registers; instead the registers have hidden extra bits and store all values in normal form, allowing fast operations on denormals. The downside is that if you alternate between FP and INT operations, the CPU has to insert extra conversion ops between them. The typical cost is 1-3 cycles per domain crossing.
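A small sketch of the kind of mix that triggers such a crossing, using x86 SSE intrinsics (the casts are free - they only reinterpret the register - and whether the crossing costs anything, and how much, depends on the microarchitecture):

    #include <immintrin.h>

    /* The result of an integer-domain op feeds an FP-domain op and back,
       forcing the value across the two bypass networks. */
    __m128 mix_domains(__m128 v) {
        __m128i vi = _mm_castps_si128(v);               /* reinterpret bits */
        vi = _mm_add_epi32(vi, _mm_set1_epi32(1));      /* integer domain   */
        __m128 vf = _mm_castsi128_ps(vi);               /* reinterpret back */
        return _mm_add_ps(vf, _mm_set1_ps(1.0f));       /* FP domain        */
    }

In a loop that alternates like this, any per-crossing penalty lands directly on the dependency chain, so keeping long stretches of work inside one domain is the usual advice.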
mshockwave | 0 comments | 3 days ago
saagarjha | 0 comments | 3 days ago
codedokode | 5 comments | 3 days ago
nxobject | 0 comments | 3 days ago
There are good historical hardware architecture textbooks: Peter Kogge's "The Architecture of Pipelined Computers" ('81) mentions that the UNIVAC 1 pipelined IO (from which dedicated IO units sprung out), that the IBM 7094 similarly interleaved multiple-state memory access cycles between multiple memory units to increase throughput, and that the IBM STRETCH had a two-part fetch-execute without attempting to address hazards.
So pipelining came out of the ad-hoc application of the patterns of interleaving and of hiving off functionality to dedicated units. In general, a lot of hardware architecture ideas cropped up very early in big iron, with microcoding even co-existing with them, without all of the full set of ideas being worked out (e.g. forwarding, hazard resolution of all types, etc.)
There's a whole fun universe of vintage hardware architecture textbooks that go through the essential concepts via historical big iron, if you're willing to look beyond Hennessy and Patterson and Shen and Lipasti! I am thinking that the latter might have references to other historical textbooks on pipelining, though. "The Anatomy of a High-Performance Microprocessor", a pedagogical deep-dive into the design of the AMD K6 uarch (up to detailed algorithms and HDL!), will have good references to historical textbooks too, since those authors were trained before the crop of "modern" (microprocessor-era) textbooks.
paulsutter | 2 comments | 3 days ago
adrian_b | 1 comment | 3 days ago
Both ways had been used for many centuries for the manufacturing of complex things, as ways to organize the work of multiple workers.
taneq | 0 comments | 3 days ago
zelos | 2 comments | 3 days ago
NotCamelCase | 0 comments | 3 days ago
Joking aside, of course, the ingenuity lies in finding a fitting solution to the problem at hand, as well as knowing how to apply it.
sroussey | 1 comment | 3 days ago
codedokode | 0 comments | 3 days ago
PittleyDunkin | 1 comment | 3 days ago
adrian_b | 3 comments | 3 days ago
1957 in a transistorized computer is far too late for the origin of pipelining.
Pipelining had already been used in computers with vacuum tubes a few years before, and it had been used a decade earlier still in computers with electromechanical relays, i.e. the IBM SSEC, which had a 3-stage pipeline for the execution of its instructions (the IBM SSEC had a Harvard architecture, with distinct kinds of memories for program and for data).
Before 1959, pipelining was named "overlapped execution".
The first use of the word "pipeline" was as a metaphor in the description of the IBM Stretch computer, in 1959. The word "pipeline" began to be used as a verb, with forms like "pipelining" and "pipelined" around 1965. The first paper where I have seen such a verbal use was from the US Army.
Outside computing, pipelining as a method of accelerating iterative processes had been used for centuries in progressive assembly lines, e.g. for cars, and before that for ships or engines.
Pipelined execution and parallel execution are dual methods for accelerating an iterative process that transforms a stream of data. In the former the process is divided into subprocesses through which the stream of data passes in series, while in the latter the stream of data is divided into substreams that go in parallel through multiple instances of the process.
cogman10 | 2 comments | 3 days ago
[1] https://en.wikipedia.org/wiki/Portsmouth_Block_Mills
adrian_b | 0 comments | 3 days ago
bjourne | 0 comments | 3 days ago
Voultapher | 1 comment | 3 days ago
With parallel do you mean superscalar execution i.e. having two ALUs or do you mean having multiple cores?
adrian_b | 1 comment | 3 days ago
At the instruction level, the execution of instructions is an iterative process, so, as for any other iterative process, parallelism or pipelining or both may be used.
Most modern CPUs use both parallelism and pipelining in the execution of instructions. If you have e.g. a stream of multiply instructions, the CPU may have for example 2 multiply pipelines, where each multiply pipeline has 4 stages. The incoming multiply instruction stream is divided into 2 substreams, which are dispatched in parallel to the 2 multiply pipelines, so 2 multiply instructions are initiated in each clock cycle, which makes the example CPU superscalar. The multiply instructions are completed after 4 clock cycles, which is their latency, when they exit the pipeline. Thus, in the example CPU, 8 multiply instructions are simultaneously executed in each clock cycle, in various stages of the 2 parallel pipelines.
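To connect that to code, here is a sketch of how a loop has to be written to keep such a machine busy (the function names are mine, the 2-pipeline x 4-stage figures are just the example above, and the reassociation slightly changes FP rounding):

    /* One accumulator: the loop-carried multiply limits the loop to one
       result per `latency` (here 4) cycles. */
    double product_serial(const double *a, long n) {
        double p = 1.0;
        for (long i = 0; i < n; i++)
            p *= a[i];
        return p;
    }

    /* Eight accumulators: enough independent multiplies in flight to fill
       2 pipelines x 4 stages (assumes n is a multiple of 8). */
    double product_unrolled(const double *a, long n) {
        double p0 = 1.0, p1 = 1.0, p2 = 1.0, p3 = 1.0;
        double p4 = 1.0, p5 = 1.0, p6 = 1.0, p7 = 1.0;
        for (long i = 0; i < n; i += 8) {
            p0 *= a[i];     p1 *= a[i + 1];
            p2 *= a[i + 2]; p3 *= a[i + 3];
            p4 *= a[i + 4]; p5 *= a[i + 5];
            p6 *= a[i + 6]; p7 *= a[i + 7];
        }
        return ((p0 * p1) * (p2 * p3)) * ((p4 * p5) * (p6 * p7));
    }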
Whenever you have independent iterations, e.g. what looks like a "for" loop in the source program, where there are no dependencies between distinct executions of the loop body, the iterations can be executed in 4 different ways, sequentially, interleaved, pipelined or in parallel.
The last 3 ways can provide an acceleration in comparison with the sequential execution of the iteration. In the last 2 ways there is simultaneous execution of multiple iterations or multiple parts of an iteration.
For parallel execution of the iteration (like in OpenMP "parallel for" or like in NVIDIA CUDA), a thread must be created for each iteration execution and all threads are launched to be executed in parallel by multiple hardware execution units.
For pipelined execution of the iteration, the iteration body is partitioned in multiple consecutive blocks that use as input data the output data of the previous block (which may need adding additional storage variables, to separate output from input for each block), then a thread must be created for each such block that implements a part of the iteration, and then all such threads are launched to be executed in parallel by multiple hardware execution units.
These 2 ways of organizing simultaneous work, pipelined execution and parallel execution, are applicable to any kind of iterative process. Two of the most important such iterative processes are the execution of a stream of instructions and the implementation of an array operation, which performs some operation on all the elements of an array. In both cases one can use a combination of pipelining and parallelism to achieve maximum speed. For these 2 cases, sometimes the terms "instruction-level parallelism and pipelining" and "data-level parallelism and pipelining" are used. The second of these 2 terms is misleading, because it is not the data that are executed in parallel or pipelined, but the iterations that process the data.
For any case where pipelining may be used, it is important to recognize which is the iterative process that can be implemented in this way. For unrelated tasks a.k.a. processes a.k.a. threads, only 3 ways of execution are available: sequential, interleaved and in parallel. The 4th way of execution, pipelined, is available only for iterations, where the difference between iterations and unrelated tasks is that each iteration executes the same program (i.e. the loop body, when the iteration is written with the sequential loop syntax).
GregarianChild | 1 comment | 3 days ago
You are right, but this can be made more precise: pipelining is a specific form of parallelism. After all, the different stages of the pipeline are executing in parallel.
adrian_b | 1 comment | 2 days ago
I use "parallel" with its original meaning "one besides the other", i.e. for spatial parallelism.
You use "parallel" with the meaning "simultaneous in time", because only with this meaning can you call pipelining a form of parallelism.
You are not the only one who uses "parallel" for "simultaneous in time", but in my opinion this is a usage that must be discouraged, because it is not useful.
If you use "parallel" with your meaning, you must find new words to distinguish parallelism in space from parallelism in time. There are no such words in widespread use, so the best that you can do is to say "parallel in space" and "parallel in time", which is too cumbersome.
It is much more convenient to use "parallel" only with its original meaning, for parallelism in space, which in the case of parallel execution requires multiple equivalent execution units (unlike pipelined execution, which in most cases uses multiple non-equivalent execution units).
When "parallel" is restricted to parallelism in space, pipelining is not a form of parallelism. Both for pipelining and for parallelism there are multiple execution units that work simultaneously in time, but the stream of data passes in parallel through the parallel execution units and in series through the pipelined execution units.
With this meaning of "parallel", one can speak about "parallel execution" and "pipelined execution" without any ambiguity. It is extremely frequent to have the need to discuss about both "parallel execution" and "pipelined execution" in the same context or even in the same sentence, because these 2 techniques are normally combined in various ways.
When "parallel" is used for simultaneity in time it becomes hard to distinguish parallel in space execution from pipelined execution.
mafribe | 1 comment | 2 days ago
More generally, parallel in space is interesting because it is a necessary precondition for parallel in time.
adrian_b | 0 comments | 2 days ago
One should not say that the pipeline stages are parallel when the intended meaning is that they are separate or distinct, which is the correct precondition for their ability to work simultaneously in time.
"Parallel" says about two things that they are located side-by-side, with their front-to-back axes aligned. This is true for parallel execution units, where the executions of multiple operations are initiated simultaneously in all subunits, but it is false for pipelined execution units, where the executions of multiple operations are initiated sequentially, both in the first stage and in all following stages; there the initiation of an execution is done before the completion of the previous execution, leading to executions that are overlapped in time.
The difference between parallel execution and pipelined execution is the same as between parallel connections and series connections in any kind of networks, e.g. electrical circuits or networks describing fluid flow.
Therefore it is better if the terms used in computing remain consistent with the terms used in mathematics, physics and engineering, which have already been used for centuries before the creation of the computing terminology.
formerly_proven | 1 comment | 3 days ago
adrian_b | 0 comments | 2 days ago
The main inspiration for IBM SSEC has been Harvard Mark I, an earlier electromechanical computer built by IBM based on a design done mostly by Howard Aiken, but it would not have been impossible for some information about the Zuse computers to have reached IBM after WWII, contributing to the design of the SSEC.
AtlasBarfed | 0 comments | 3 days ago
filiphorvat | 2 comments | 3 days ago
gpderetta | 0 comments | 3 days ago
The optimal design depends a lot on the rest of the microarchitecture, the loads the core is being optimized for, the target frequency, the memory latency, etc.
bonzini | 1 comment | 3 days ago
thesz | 1 comment | 3 days ago
Usually FP adds take a cycle or two longer than FP multiplication.
dahart | 0 comments | 3 days ago
ryukoposting | 3 comments | 3 days ago
Imagine if the OP's benchmark had two of those multiplication chains on independent registers:
mul x1, x0, x0 // a
mul x2, x1, x1 // b
mul x3, x2, x2 // c
mul x4, x3, x3 // d
mul x6, x5, x5 // e
mul x7, x6, x6 // f
mul x8, x7, x7 // g
mul x9, x8, x8 // h
If you had two independent ALUs, my intuition is that you'd want to dispatch the instructions as AEBFCGDH, interleaving the two chains.
But, if the two ALUs could feed each other, perhaps you'd actually want to leave it in ABCDEFGH order? Hmm.
wtallis | 0 comments | 3 days ago
Once the frontend has determined what the dependencies are between instructions, you no longer have a linear sequence of instructions being dispatched to execution units but a DAG, and with multiple instructions starting on the same clock cycle the execution order would be more like (AE)(BF)(CG)(DH) without any ordering between instructions in the same parenthesized group.
The reorder buffer in the CPU will be large enough to handle the compiler emitting either ABCDEFGH or AEBFCGDH, but the interleaved ordering is less likely to run into limitations of the instruction decoder (if this chunk of code isn't already in a decoded µop cache).
ack_complete | 0 comments | 3 days ago
monocasa | 0 comments | 3 days ago
mshockwave | 1 comment | 3 days ago
gpderetta | 1 comment | 3 days ago
The ALUs in recent Apple CPUs can actually start a new division every other cycle (in addition to having an abnormally low latency), which is very impressive.
mshockwave | 0 comments | 2 days ago
That's indeed impressive.
I'll argue that we're definitely capable of making fully pipelined division units; it's just that it's usually not worth the PPA.
taneq | 3 comments | 3 days ago
monocasa | 0 comments | 3 days ago
BeeOnRope | 0 comments | 3 days ago
syncsynchalt | 1 comment | 3 days ago
olliej | 0 comments | 3 days ago
v0:
1. if (input not a number)
fallback to C++; else
2. return tagged 0; // Just making sure the numeric check was optimal
v1:
1. As above
2. If integer
convert to float
3. return tagged 0
v2:
1-2. as above
3. If negative
return tagged nan
4. Return tagged 0
v3:
1-3. as above
4. use the sqrt instruction
5. return tagged 0
v4.
1-4. as above
5. move <4> back to an integer register
6. return tagged 0
v5.
1-5. as above
6. tag the result of sqrt
7. return tagged 0
v6.
1-6. as above
7. Actually return/store the result of <6>
Alas I cannot recall whether at this point return values were going into the heap-allocated VM call stack, or whether the return was via rax, but that's not the bit that was eye opening to me.
I had a benchmark that was something like
for (var i = 0; i < large number; i++)
Math.sqrt(i)
Noting that while I was working on this there was no meaningful control flow analysis, inlining, etc., so this was an "effective" benchmark for perf work at the time - it would not be so today.
The performance remained "really good" (read: fake) until the `v6` version that actually stored/returned the result of the work. It was incredibly eye opening to see just how much code could be "executed" before the CPU actually ended up doing any work, and it significantly impacted my approach to dealing with codegen in the future.
My perspective at the time was "I know there's a significant marshaling cost to calling host code, and I know the hardware sqrt is _very_ fast", so a 5-10x perf improvement seemed "plausible" to me (because marshaling was legitimately very expensive) - I can't recall where in the 5-10x range the improvement actually landed - but once the final store/return was done it dropped to only 2x faster. Which was still a big win, but seeing just how much work the CPU could simply avoid doing while I was building out the code was a significant learning experience.