|
AMD K6 | Intel Pentium II | Cyrix 6x86MX | Common features | Closing words | Glossary |
The following is a comparative text meant to give people a feel for the differences in the various 6th generation x86 CPUs. For this little ditty, I've chosen the Intel P-II (aka Klamath, P6), the AMD K6 (aka NX686), and the Cyrix 6x86MX (aka M2). These are all MMX capable 6th generation x86 compatible CPUs; however, I am not going to discuss the MMX capabilities at all beyond saying that they all appear to have similar functionality. (MMX never really took off as the software enabling technology Intel claimed it to be, so it's not worth going into any depth on it.)
In what follows, I am assuming a high level of competence and knowledge on the part of the reader (basic 32 bit x86 assembly at least). For many of you, the discussion will be just slightly over your head. For those, I would recommend sitting through the 1 hour online lecture on the Post-RISC architecture by Charles Severance to get some more background on the state of modern processor technology. It is really an excellent lecture that is well worth the time.
The AMD K6 |
This seems remarkably simple considering the features that are claimed for the K6. The secret is that most of these stages do very complicated things. The light blue stages execute in an out of order fashion (and were colored by me, not AMD.)
The fetch stage is much like a typical Pentium instruction fetcher, and is able to present 16 cache aligned bytes of data per clock. Of course this means that some instructions that straddle 16 byte boundaries will suffer an extra clock penalty before reaching the decode stage, much like they do on a Pentium. (The K6 is a little clever in that if there are partial opcodes from which the predecoder can determine the instruction length, then the prefetching mechanism will fetch the new 16 byte buffer just in time to feed the remaining bytes to the issue stage.)
The decode stage attempts, in parallel, to decode 2 simple x86 instructions, 1 long x86 instruction, or 1 instruction fetched from ROM. If both of the first two fail (usually only on rare instructions), the decoder is stalled for a second clock, which is required to completely decode the instruction from the ROM. If the first fails but the second does not (the usual case for instructions involving memory or an override), then a single instruction or override is decoded. If the first succeeds (the usual case when not involving memory or overrides) then two simple instructions are decoded. The decoded "OpQuad" is then entered into the scheduler.
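To make the decoder classes concrete, here is a minimal sketch (my own classification guess, not AMD's official tables) of the three decode paths: a register-only pair, a memory-operand instruction, and shld (which, per the recommendations below, is vector decoded from the ROM):

    inc   eax            ; two "simple" register-only instructions --
    mov   ebx, ecx       ; both can be decoded together in one clock
    add   eax, [esi+8]   ; memory operand: handled alone by the long decoder
    shld  eax, ebx, 4    ; rare/complex: fetched from the ROM, costing an
                         ; extra decode clock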
Thus the K6's execution rate is limited to a maximum of two x86 instructions per clock. This decode stage decomposes the x86 instructions into RISC86 ops.
This last statement has been generally misunderstood in its importance (even by me!). Given that the P-II architecture can decode 3 instructions at once, it is tempting to conclude that the P-II can typically execute up to 50% faster than a K6. According to "Bob Instigator" (a technical marketroid from AMD) and "The Anatomy of a High-Performance Microprocessor: A Systems Perspective", this just isn't so. Besides the back-end limitations and scheduler problems that clog up the P-II, real world software traces analyzed at Advanced Micro Devices indicated that a 3-way decoder would have added almost no benefit while severely limiting the clock rate ramp of the K6 given its back end architecture.
That said, in real life the decode bandwidth limitation crops up every now and then as a limiting factor, but it is rarely egregious in comparison to ordinary execution limitations.
The issue stage accepts up to 4 RISC86 instructions from the scheduler. The scheduler is basically an OpQuad buffer that can hold up to 6 clocks of instructions (which is up to 12 dual issued x86 instructions.) The K6 issues instructions subject only to execution unit availability using an oldest unissued first algorithm at a maximum rate of 4 RISC86 instructions per clock (the X and Y ALU pipelines, the load unit, and the store unit.) The instructions are marked as issued, but not removed until retirement.
The operand fetch stage reads the issued instruction operands without any restriction other than register availability. This is in contrast with the P-II which can only read up to two retired register operands per clock (but is unrestricted in forwarding (unretired) register accesses.) The K6 uses some kind of internal "register MUX" which allows arbitrary accesses of internal and committed register space. If this stage "fails" because of a long data dependency, then according to expected availability of the operands the instruction is either held in this stage for an additional clock or unissued back into the scheduler, essentially moving the instruction backwards through the pipeline!
This is an ingenious design that allows the K6 to perform "late" data dependency determinations without over-complicating the scheduler's issue logic. This clever idea gives a very close approximation of a reservation station architecture's "greedy algorithm scheduling".
The execution stages perform in one or two pipelined stages (with the exception of the floating point unit which is not pipelined, or complex instructions which stall those units during execution.) In theory, all units can be executing at once.
Retirement happens as completed instructions are pushed out of the scheduler (exactly 6 clocks after they are entered.) If for some reason, the oldest OpQuad in the scheduler is not finished, scheduler advancement (which pushes out the oldest OpQuad and makes space for a newly decoded OpQuad) is halted until the OpQuad can be retired.
What we see here is the front end starting fairly tight (two instructions) and the back end ending somewhat wider (two integer execution units, one load, one store, and one FPU.) The reason for this seeming mismatch in execution bandwidth (as opposed to the Pentium, for example, which remains two-wide from top to bottom) is that it will be able to sustain varying execution loads as the dependency states change from clock to clock. This is at the very heart of what an out of order architecture is trying to accomplish; being wider at the back-end is a natural consequence of this kind of design.
Additional stalls are avoided by using a 16 entry times 16 byte branch target cache which allows first instruction decode to occur simultaneously with instruction address computation, rather than requiring (E)IP to be known and used to direct the next fetch (as is the case with the P-II.) This removes an (E)IP calculation dependency and instruction fetch bubble. (This is a huge advantage in certain algorithms such as computing a GCD; see my examples for the code.) The K6 allows up to 7 outstanding unresolved branches (which seems like more than enough since the scheduler only allows up to 6 issued clocks of pending instructions in the first place.)
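As a flavor of why this matters, here is a hedged sketch (not the original example referred to above) of a subtraction based GCD inner loop; nearly every instruction is a compare, subtract or branch, so whichever CPU can keep fetching through the taken branches without a bubble wins:

    gcd_loop:
        cmp   eax, ebx
        je    gcd_done      ; done when the two values are equal
        ja    a_bigger
        sub   ebx, eax      ; ebx was larger: reduce it
        jmp   gcd_loop
    a_bigger:
        sub   eax, ebx      ; eax was larger: reduce it
        jmp   gcd_loop
    gcd_done:               ; gcd of the original values is now in eax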
The K6 benefits additionally from the fact that it is only a 6 stage pipeline (as opposed to a 12 stage pipeline like the P-II) so even if a branch is incorrectly predicted it is only a 4 clock penalty as opposed to the P-II's 11-15 clock penalty.
One disadvantage pointed out to me by Andreas Kaiser is that misaligned branch targets still suffer an extra clock penalty and that attempts to align branch targets can lead to branch target cache tag bit aliasing. This is a good point, however it seems to me that you can help this along by hand aligning only your innermost loop branches.
Another disadvantage (also pointed out to me by Andreas Kaiser) is that such a prediction mechanism does not work for indirect jump predictions (because the verification tables only compare a binary jump decision value, not a whole address.) This is a bit of a bummer for virtual C++ function calls.
This all means that the average loop penalty is:
(95% * 0) + (5% * 4) = 0.2 clocks per loop
But because of the K6's limited decode bandwidth, branch instructions take up precious instruction decode bandwidth. There are no branch execution clocks in most situations; however, branching instructions end up taking a slot in which essentially no calculation is done. In that sense K6 branches have a typical penalty of about 0.5 clocks. To combat this, the K6 executes the LOOP instruction in a single clock; however, this instruction performs so badly on Intel CPUs that no compiler generates it.
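For concreteness, the two equivalent loop closings look like this (a sketch assuming ecx is the counter); the first is what compilers emit, the second is the form the K6 handles in a single clock:

        dec   ecx            ; compiler-generated form: two decode slots
        jnz   loop_top

        loop  loop_top       ; K6: one clock; slow on Intel, so compilers
                             ; never generate it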
More complex instructions such as FDIV, FSQRT and so on will stall more of the units on the P-II than on the K6. However since the P-II's scheduler is larger it will be able to execute more instructions in parallel with the stalled FPU instruction (21 in all, however the port 0 integer unit is unavailable for the duration of the stalled FPU instruction) while the K6 can execute up to 11 other x86 instructions at full speed before needing to wait for the stalled FPU instruction to complete.
In a test I wrote (admittedly rigged to favor Intel FPUs) the K6 performed at only about 55% of the P-II's speed. (Update: using the K6-2's new SIMD floating point features, the roles have reversed -- the P-II can only execute at about 70% of a K6-2's speed.)
An interesting note is that FPU instructions on the K6 will retire before they completely execute. This is possible because it is only required that they work out whether or not they will generate an exception, and the execution state is reset on a task switch by the OS's built-in FPU state saving mechanism.
The state of floating point has changed so drastically recently that it's hard to make a definitive comment on this without a plethora of caveats. Facts: (1) the pure x87 floating point unit in the K6 does not compare favorably with that of the P-II, (2) this does not always show through in real life software, which is often produced by poor compilers, (3) the future of floating point clearly lies with SIMD, where AMD has clearly established a leadership role, (4) Intel's advantage was primarily in software that was hand optimized by assembly coders -- but the roles have clearly reversed since the introduction of the K6-2.
Like the P-II, the K6's cache is divided into two fixed caches for separate code and data. I am not a big fan of split architectures (commonly referred to as the Harvard architecture) because they set an artificial limit on your working set sizes. As pointed out to me by the AMD folk, this keeps them from having to worry about data accesses kicking out their instruction cache lines. But I would expect this to be dealt with by associativity and don't believe that it is worth the trade off of smaller working set sizes.
Among the design benefits they do derive from a split architecture is that they can add pre-decode bits to just the instruction cache. On the K6, the predecode bits are used for determining instruction length boundaries. Their address tags (which appear to work out to 9 bits) point to a sector which contains two 32 byte long cache lines, which (I assume) are selected by standard associativity rules. Each cache line has a standard set of dirty bits to indicate accessibility state (obsolete, busy, loaded, etc.).
Although the K6's cache is non-blocking (allowing accesses to other lines even while a cache line miss is being processed), the K6's load/store unit architecture only allows in-order data access. So this feature cannot be taken advantage of in the K6. (Thanks to Andreas Kaiser for pointing this out to me.)
In addition, like the 6x86MX, the store unit of the K6 is actually buffered by a store queue. A neat feature of the store unit architecture is that it has two operand fetch stages -- the first for the address, and the second for the data, which happens one clock later. This allows stores of data that are being computed in the same clock as the store to occur without any apparent stall. That is so darn cool!
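A minimal sketch of the case being described: the store's address operand is fetched first, and the data operand is picked up a clock later, by which time the result has just been computed:

        add   eax, edx        ; eax is produced in this clock
        mov   [esi+8], eax    ; address (esi+8) fetched first; the data fetch
                              ; happens one clock later, so the freshly
                              ; computed eax arrives just in time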
But perhaps more fundamentally, as AMD have said themselves, bigger is better, and at twice the P-II's L1 cache size, I'll have to give the nod to AMD (though a bigger nod to the 6x86MX; see below.)
The K6 takes two (fully pipelined) clocks to fetch from its L1 cache from within its load execution unit. Like the original P55C, the 6x86MX spends extra load clocks (i.e., address generation) during earlier stages of its pipeline. On the other hand this compares favorably with the P-II which takes three (fully pipelined) clocks to fetch from the L1 cache. What this means is that when walking a (cached) linked list (a typical data structure manipulation), the 6x86MX is the fastest, followed by the K6, followed by the P-II.
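The linked list case looks like this (a sketch assuming the 'next' pointer is the first field of each node and the whole list is in the L1 cache); each load's address depends on the previous load, so the load-use latency sets the pace of the entire loop:

    walk:
        mov   eax, [eax]     ; eax = eax->next (this load feeds the next address)
        test  eax, eax
        jnz   walk           ; stop at the NULL terminator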
Update: AMD has released the K6-3 which, like the Celeron, adds a large on die L2 cache. The K6-3's L2 cache is 256K, which is larger than the Celeron's 128K. Unlike Intel, however, AMD has recommended that motherboards continue to include on board L2 caches, creating what AMD calls a "TriLevel cache" architecture (I recall that an earlier Alpha based system did exactly this same thing.) Benchmarks indicate that the K6-3 performs between 10% and 15% better than similarly clocked K6-2s! (Wow! I think I might have to get one of these.)
Anyhow, this design is very much in line with AMD's recommendation of using complicated load and execute instructions, which tend to be longer and would favor the K6 over the P-II. In fact, the K6 just seems better suited overall to the CISCy nature of the x86 ISA. For example, the K6 can issue 2 push reg instructions per clock, versus the P-II's 1 push reg per clock.
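For example, a typical CISCy call preamble like the following should go through the K6 at two pushes per clock while the P-II manages only one, and the last line is a sketch of the kind of load-and-execute instruction AMD recommends:

        push  esi            ; the K6 can issue both pushes in one clock
        push  edi
        add   eax, [esp+12]  ; load-and-execute: one x86 instruction doing
                             ; the work of a separate load plus an add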
According to AMD, the typical 32 bit decode bandwidth is about the same for both the K6 and the P-II, but 16 bit decode is about 20% faster for the K6. Unfortunately for AMD, if software developers and compiler writers heed the P-II optimization rules with the same vigor that they did with the Pentium, the typical decode bandwidth will change over time to favor the P-II.
The 6x86MX, by contrast, seems to just let its pipelines fill up, with work moving only in a forward direction, which makes them more susceptible to being backed up; it does, however, allow its X and Y pipes to swap contents at one stage.
Update: AMD's new "CXT Core" has enabled write combining.
As I have been contemplating the K6 design, it has really grown on me. Fundamentally, the big problem with x86 processors versus RISC chips is that they have too few registers and are inherently limited in instruction decode bandwidth due to natural instruction complexity. The K6 addresses both of these by maximizing performance of memory based in-cache data accesses to make up for the lack of registers, and by streamlining CISC instruction issue to be optimally broken down into RISC like sub-instructions.
It is unfortunate that compilers are favoring Intel style optimizations. Basically there are several instructions and instruction mixes that compilers avoid due to their poor performance on Intel CPUs, even though the K6 executes them just fine. As an x86 assembly nut, it is not hard to see why I favor the K6 design over the Intel design.
The reason I have come to this conclusion is that the architecture of the chip itself is much more straightforward than, say, the P-II's, and so there is less explanation necessary. So the volume of documentation is not the only factor in measuring its quality.
If companies were interested in writing a compiler that optimized for the K6 I'm sure they could do very well. In my own experiments, I've found that optimizing for the K6 is very easy.
Recommendations I know of: (1) Avoid vector decoded instructions, including carry flag reading instructions and shld/shrd instructions, (2) Use the LOOP instruction, (3) Align branch targets and code in general as much as possible, (4) Pre-load memory into registers early in your loops to work around the load latency issue (a sketch of this follows below).
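A minimal sketch of recommendation (4), assuming a simple summation loop: the load is hoisted to the top of the iteration so that independent work covers the K6's two clock load latency.

    sum_loop:
        mov   eax, [esi]     ; start the load as early as possible
        add   esi, 4         ; independent work overlaps the load latency
        add   ebx, eax       ; by now eax is ready; no load-use stall
        dec   ecx
        jnz   sum_loop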
Their marketing strategy of selling at a low price while adding features (the cheaper Super7 infrastructure, SIMD floating point, 256K on chip L2 cache combined with motherboard L2 cache) has paid off in an unheard of level of brand name recognition for a company other than Intel. Indeed, 3DNow! is a great counter to Intel Inside. If nothing else they helped create a real sub-$1000 PC market, and have dictated the price for retail x86 CPUs (Intel has been forced to drop even their own prices to unheard of lows.)
AMD has struggled more to meet the demand for new speeds as they come online (they seem predictably optimistic) but overall has been able to sell a boatload of K6's without being stepped on by Intel.
Previously, in this section I maintained a small chronicle of AMD's achievements as the K6 architecture grew, however we've gotten far beyond the question of "will the K6 survive?" (A question only idiots like Ashok Kumar still ask.) From the consumer's point of view, up until now (Aug 99) AMD has done a wonderful job. Eventually, they will need to retire the K6 core -- it has done its tour of duty. However, as long as Intel keeps the Celeron in the market, I'm sure AMD will keep the K6 in the market. AMD has a new core that they have just introduced into the market: the K7. This processor has many significant advantages over "6th generation architectures".
The real CPU WAR has only just begun ...
Update: The IEEE Computer Society has published a book called "The Anatomy of a High-Performance Microprocessor: A Systems Perspective" based on the AMD K6-2 microprocessor. It gives inner details of the K6-2 that I have never seen in any other documentation on microprocessors before. These details are a bit overwhelming for a mere software developer; however, for a hard core x86 hacker it's a treasure trove of information.
The Intel P-II |
Intel has enjoyed the status of "de facto standard" in the x86 world for some time. Their P6/P-II architecture, while not delivering the same performance boost as previous generational increments, solidifies their position. It is the fastest, but it is also the most expensive of the lot.
The P-II is a highly pipelined architecture with an out of order execution engine in the middle. The Intel Architecture Optimization Manual lists the following two diagrams:
The two sections shown are essentially concatenated, showing 10 stages of in-order processing (since retirement must also be in-order) with 3 stages of out of order execution (RS, the Ports, and ROB write back colored in light blue by me, not Intel.)
Intel's basic idea was to break down the problem of execution into as many units as possible and to peel away every possible stall that was incurred by their previous Pentium architecture as each instruction marches forward down their assembly line. In particular, Intel invests 5 pipelined clocks to go from the instruction cache to a set of ready to execute micro-ops. (RISC architectures have no need for these 5 stages, since their fixed width instructions are generally already specified to make this translation immediate. It is these 5 stages that truly separate the x86 from ordinary RISC architectures, and Intel has essentially solved it with a brute force approach which costs them dearly in chip area.)
Each of Intel's "Ports" is used as a feeding trough for micro-ops to various groupings of units as shown in the above diagram. So, Intel's 5 micro-op per clock execution bandwidth is a little misleading in the sense that two ports are required for any single store operation. It is thus more realistic to consider this equivalent to at most 4 K6 RISC86 ops issued per clock.
As a note of interest, Intel divides the execution and write back stages into two separate stages (the K6 does not, and there is really no compelling reason for the P6's method that I can see.)
Although it is not as well described, I believe that Intel's reservation station and reorder buffer combination serves substantially the same purpose as the K6's scheduler, and similarly the retire unit acts on instruction clusters in exactly the same groupings as they were issued (CPUs are not otherwise known to have sorting algorithms wired into them.) Thus the micro-op throughput is limited to 3 per clock (compared with 4 RISC86 ops for the K6.)
So when everything is working well, the P-II can take 3 simple x86 instructions and turn them into 3 micro-ops on every clock. But, as can be plainly seen in their comments, they have a bizarre problem: they can only read two physical input register operands per clock (rename registers are not constrained by this condition.) This means scheduling becomes very complicated. Registers to be read for multiple purposes will not cost very much, and data dependencies don't suffer from any more clocks than expected; however, the very typical trick of spreading calculations over several registers (used especially in loop unrolling) will upper bound the pipeline to two micro-ops per clock because of the physical register read bottleneck.
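A sketch of the bottleneck, following Agner Fog's description: assume eax, ebx and ecx hold long-lived values that retired many clocks ago, while esi, edi and ebp are still in flight (and so are forwarded rather than read from the permanent registers). The group then needs three permanent register file reads in a single clock, one more than the hardware allows:

        add   esi, eax     ; three micro-ops needing three reads of retired
        add   edi, ebx     ; registers (eax, ebx, ecx) in one decode group --
        add   ebp, ecx     ; the third read waits, capping throughput below 3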
In any event, the decoders (which can decode up to 6 micro-ops per clock) are clearly out-stripping the later pipeline stages, which are bottlenecked both by the 3 micro-op issue limit and the two physical register read operand limit. The front end easily outperforms the back end. This helps Intel deal with their branch bubble, by making sure the decode bandwidth can stay well ahead of the execution units.
Something that you cannot see in the pictures above is the fact that the FPU is actually divided into two partitioned units: one for addition and subtraction, and the other for all other operations. This is found in the Pentium Pro documentation, and given the above diagram and the fact that this is not mentioned anywhere in the P-II documentation, I assumed that the P-II was different from the PPro in this respect (Intel's misleading documentation is really unhelpful on this point.) After I made some claims about these differences on USENET, an Intel engineer (who must remain anonymous since he had a copyright statement insisting that I not copy anything he sent me -- and it made no mention of excluding his name) who claims to have worked on the PPro felt it his duty to point out that I was mistaken about this. In fact, he says, the PPro and P-II have an identical FPU architecture. So in fact the P-II and PPro really are the same core design with the exception of MMX, segment caching and probably some different glue logic for the local L2 caches.
This engineer also reiterated Intel's position on not revealing the inner workings of their CPU architectures, thus rendering it impossible for ordinary software engineers to know how to properly optimize for the P-II.
In order to execute past predicted branches in a sound manner, subsequent instructions must be executed "speculatively" under the proviso that any work done by them may have to be undone if the prediction turns out to be wrong. This is handled in part by renaming target write-back registers to shadow registers in a hidden set of extra registers. The K6 and 6x86MX have similar rename and speculation mechanisms, but with its longer latencies, it is a more important feature for the P-II. The trade off is that the process of undoing a mispredicted branch is slow (since the pipelines must completely flush), costing as much as 15 clocks (and no less than 11.) These clocks are non-overlappable with execution, of course, since the execution stream cannot be correctly known until the mispredict is completely processed. This huge penalty offsets the performance of the P-II, especially in code in which no P6/P-II optimization considerations have been made.
The P-II's predictor always deals with addresses (rather than boolean compare results as is done in the K6) and so is applicable to all forms of control transfer such as direct and indirect jumps and calls. This is critical to the P-II given that the latency between the ALUs and the instruction fetch is so large.
In the event of a conditional branch both addresses are computed in parallel. But this just aids in making the prediction address ready sooner; there is no appreciable performance gained from having the mispredicted address ready early given the huge penalty. The addresses are computed in an integer execution port (separate from the FPU) so branches are considered an ALU operation. The prefetch buffer is stalled for one clock until the target address is computed; however, since the decode bandwidth out-performs the execution bandwidth by a fair margin, this is not an issue for non-trivial loops.
This all means that the average loop penalty is:
(90% * 0) + (10% * 13) = 1.3 clocks per loop
This is obviously a lot higher than the K6 penalty. (The zero as the first penalty assumes that the loop is sufficiently large to hide the one clock branch bubble.)
For programmers this means one major thing: avoid mispredicted branches in your inner loops at all costs (make that 10% closer to 0%). Using tables or conditional move instructions is a common workaround; however, since the predictor is used even for indirect jumps, there are branching situations where you have no choice but to suffer branch prediction penalties.
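A minimal sketch of the conditional move approach (CMOVcc is only available on P6-class CPUs, so real code should test the CPUID CMOV feature bit first); the data dependent branch is replaced by an instruction that always executes and so never needs predicting:

        ; if (eax > ebx) ecx = edx;   -- branchless form:
        cmp    eax, ebx
        cmovg  ecx, edx      ; the condition is folded into the move, leaving
                             ; no branch to mispredict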
The Intel floating point design has traditionally beat the Cyrix and AMD CPUs on floating point performance and this still appears to hold true as tests with Quake and 3D Studio have confirmed. (The K6 is also beaten, but not by such a large margin -- and in the case of Quake II on a K6-2 the roles are reversed.)
The P-II's floating point unit is issued from the same port as one of the ALU units. This means that it cannot issue two integer operations and 1 floating point operation on every clock, and thus is likely to be constrained to an issue rate similar to the K6's. As Andreas Kaiser points out, this does not necessarily preclude later execution clocks (for slower FPU operations, for example) from executing in parallel across all three basic math units (though this same comment applies to the K6).
As I mentioned above, the P-II's floating point unit is actually two units, one is a fully pipelined add and subtract unit, and the other is a partially pipelined complex unit (including multiplies.) In theory this gives greater parallelism opportunities over the original Pentium but since the single port 0 cannot feed the units at a rate greater than 1 instruction per clock, the only value is design simplification. For most code, especially P5 optimized code, the extra multiply latency is likely to be the most telling factor.
Update: Intel has introduced the P-!!! which is nothing more than a 500Mhz+ P6 core with 3DNow!-like SIMD instructions. These instructions appear to be very similar in functionality and remarkably similar in performance to the 3DNow! instruction set. There are a lot of misconceptions about the performance of SSE versus 3DNow! The best analysis I've seen so far indicates that they are nearly identical, by virtue of the fact that Intel's "4-1-1" issue rate restriction holds back the mostly meaty 2 micro-op SSE instructions. Furthermore, each SSE instruction sends twice as many micro-ops to the SSE units as a 3DNow! instruction does, which totally nullifies the doubled output width. In any event, it's almost humorous to see Intel playing catch up to AMD like this. The clear winner: consumers.
The greater associativity helps programs that are written indifferently with respect to data locality, but has no effect on code mindful of data locality (i.e., keeping their working sets contiguous and no larger than the L1 cache size.)
The P-II also has an "on PCB L2 cache". What this means is they do not need to use the motherboard bus to access their L2 cache. As such the communications interface can (and does) run at a much higher frequency. In current P-II's it is 1/2 the CPU clock rate. This is an advantage over the K6, K6-2 and 6x86MX CPUs which access motherboard based L2 caches at only 66Mhz or 100Mhz. (However the K6-III's on die L2 cache runs at the CPU clock rate, which is thus twice as fast as the P-II's.)
As described by Agner Fog, the front end is in-order and must assign internal registers before the instruction can be entered into the reservation stations. If there is a partial register overlap with a live instruction ahead of it, then a disjoint register cannot be assigned until that instruction retires. This is a devastating performance stall when it occurs because new instructions cannot even be entered into the reservation stations until this stall is resolved. Intel lists this as having roughly a 7 clock cost.
Intel recommends using XOR reg,reg or SUB reg,reg, which will somehow mark the partial register writes as automatically zero extending. But obviously this can be inappropriate if you need other parts of the register to be non-zero. It is not clear to me whether or not this extends to memory address forwarding (it probably does.) I would recommend simply separating the partial register write from the dependent register read by as much distance as possible.
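Here is a sketch of the stall and of both workarounds, assuming a byte is being loaded and then consumed as a full 32 bit value:

        ; The stalling pattern: write AL, then read all of EAX
        mov   al, [esi]
        add   ebx, eax        ; roughly a 7 clock partial register stall

        ; Workaround: pre-clear with xor so the partial write zero extends
        xor   eax, eax
        mov   al, [esi]
        add   ebx, eax        ; no stall

        ; Or avoid partial registers entirely with movzx
        movzx eax, byte [esi]
        add   ebx, eax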
This is not a big issue so long as the execution units are kept busy with instructions leading up to the partial register stall, but that is a difficult criterion to code towards. One way to accomplish this would be to try to schedule the partial register stall as far away from the previous branch control transfer as possible (the decoders usually get well ahead of the ALUs after several clocks following a control transfer.)
Only under these circumstances can the P-II achieve its maximum rate of decoding 3 instructions per cycle.
Update: I recently tried to hand optimize some code, and found that it is actually not all that difficult to achieve the 3 instruction issue per clock, but that certainly no compiler I know of is up to the task. It turns out, though, that such activities are almost certainly a red herring since dependency bubbles will end up throttling your performance anyway. My recommendation is to parallelize your calculations as much as possible.
At the same clock rate, this is its biggest advantage over the current K6, whose L2 cache is tied to the chipset speed of 66Mhz.
But now, after being pressured into publishing information about MSRs, Intel has decided to go one step further and provide a tool to help present the MSR information in a Windows program. While this tool is very useful in and of itself, it would be infinitely superior if there were accompanying documentation that described the P-II's exact scheduling mechanism.
While they claim that hand scheduling has little or no effect on performance, experiments I and others have conducted have convinced me that this simply is not the case. In the few attempts I've made, using ideas I've recently been shown and studied myself, I can get between 5% and 30% improvement on very innocent looking loops via some very unintuitive modifications. The problem is that these ideas don't have any well described explanation -- yet.
With the P-II we find a nice dog and pony show, but again the documentation is inadequate to describe essential performance characteristics. They do steer you away from the big performance drains (branch misprediction and partial register stalls.) But in studying the P-II more closely, it is clear that there are lots of things going on under the hood that are not generally well understood. Here are some examples: (1) since the front end out-performs the back-end (in most cases) the "schedule on tie" situation is extremely important, but there is not a word about it anywhere in their documentation (Lee Powell puts this succinctly by saying that the P-II prefers 3X superscalar code to 2X superscalar code.) (2) The partial register stall appears, in some cases, to totally parallelize with other execution (the stall is less than 7 clocks), while not at all in others (a 7 clock stall in addition to ordinary clock expenditures.) (3) Salting execution streams with bogus memory read instructions can improve performance (I strongly suspect this has something to do with (1)).
So why doesn't Intel tell us these things so we can optimize for their CPU? The theory that they are just telling Microsoft or other compiler vendors under NDA doesn't fly since the kinds of details that are missing are well beyond the capabilities of any conventional compiler to take advantage of (I can still beat the best compilers by hand without even knowing the optimization rules, but instead just by guessing at them!) I can only imagine that they are only divulging these rules to certain companies that perform performance critical tasks that Intel has a keen interest in seeing done well running on their CPUs (soft DVD from Zoran for example; I'd be surprised if Intel didn't give them either better optimization documentation or actual code to improve their performance.)
Intel has their own compiler that they have periodically advertised on the net as a plug in replacement for the Windows NT based version of MSVC++, available for evaluation purposes (it's called Proton if I recall correctly). However, it is unclear to me how good it is, or whether anyone is using it (I don't use WinNT, so I did not pursue trying to get on the beta list). Update: I have been told that Microsoft and Inprise (Borland) have licensed Intel's compiler source and have been using it as their compiler base.
Recommendations I know of: (1) Avoid mispredicted branches (using conditional move can be helpful in removing unpredictable branches), (2) Avoid partial register stalls (via the xor/sub reg,reg work arounds, or the movzx/movsx instructions, or by simply avoiding smaller registers), (3) Remove artificial "schedule on tie" stalls by slowing down the front end x86 decoder (i.e., issuing a bogus bypassed load by rereading the same address as the last memory read can often help; for more information read the discussion on the sandpile.org discussion forum), (4) Align branch target instructions; they are not fed to the second prefetch stage fast enough to automatically align in time.
During 1998, the transition to super cheap PCs that consumers have been begging for for years finally took place. This is sometimes called the sub-$1000 PC market segment. Intel's P-II CPUs are simply too expensive (costing up to $800 alone) for manufacturers to build compelling sub-$1000 systems with them. As such, Intel has watched AMD and Cyrix pick up unprecedented market share.
Intel made a late foray into the sub-$1000 PC market. Their whole business model did not support such an idea. Intel's "value consumer line", the Celeron, started out as an L2-cacheless piece of garbage architecture (read: about the same speed as P55Cs at the same clock rates), then switched to an integrated L2 cache architecture (stealing the K6-3's thunder). Intel was never really able to shake the bad reputation that stuck to the Celeron, but perhaps that was their intent all along. It is now clear that Intel is basically dumping Celerons in an effort to wipe out AMD and Cyrix, while trying to maintain their hefty margins in their Pentium-II line. For the record, there is little performance difference between a Pentium-II and a Celeron, and the clock rates for the Celeron were being kept artificially slow so as not to eat into their Pentium line. This action alone has brought a resurgence in the "over clocking game" that some adventurous power users like to get into.
But Intel being Intel, they have managed to seriously dent what was for a while exclusively an AMD and Cyrix market. Since the "value consumer" market has been growing so strongly, however, AMD and Cyrix have nevertheless been able to increase their volumes even with Intel's encroachment.
The P-II architecture is getting long in the tooth, but Intel keeps insisting on pushing it (demonstrating an uncooled 650Mhz sample in early 1999.) Mum's the word on Intel's seventh generation x86 architecture (Willamette or Foster), probably because that architecture is not scheduled to be ready before late 2000. This old 6th generation part may prove to be easy pickings for Cyrix's Jalapeno and AMD's K7, both of which will be available in the second half of 1999.
The Cyrix 6x86MX |
The primary microarchitecture difference of the 6x86MX CPU versus the K6 and P-II CPUs is that it still does native x86 execution rather than translation to internal RISC ops.
By being able to swap the instructions between pipes, there is no concern about artificial stalls due to scheduling an instruction into the wrong pipeline. By introducing two address generation stages, they eliminate the all too common AGI stall that is seen on the Pentium. The 6x86MX relies entirely on up front dependency resolution via register renaming and data forwarding; it does not buffer instructions in any way. Thus its instruction issue performance becomes bottlenecked by dependencies.
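The AGI case being referred to is the Pentium pattern where a register is written and then used as an address in the very next clock; a minimal sketch:

        add   ebx, 4         ; ebx is written here...
        mov   eax, [ebx]     ; ...and used as an address immediately after.
                             ; Pentium: a 1 clock AGI stall; 6x86MX: the two
                             ; address generation stages absorb it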
The out of order nature of the execution units is not very well described in Cyrix's documentation beyond saying that slower instructions will make way for faster instructions. Hence it is not clear what the execution model really looks like.
They have a fixed scheme of 4 levels of speculation, which are simply increased for every new speculative instruction issued (this is somewhat lower than the P-II and K6, which can have 20 or 24 live instructions at any given time, and somewhat more outstanding branches.)
The 6x86MX architecture is more lock stepped than the K6, and as such their issue follows their latency timings more closely. Specifically their decode, issue and address generation stages are executed in lock step, with any stalls from resource contentions, complex decoding, etc., backing up the entire instruction fetch stages. However their design makes it clear that they do everything possible to reduce these resource contentions as early as possible. This is to be contrasted with the K6 design, which is not lock step at all, but due to its late resource contention resolution, may be in the situation of re-issuing instructions after potentially wasting an extra clock that it didn't need to in its operand fetch stage.
Like AMD, Cyrix designed their 6x86MX floating point around the weak code that x86 compilers generate on FPU code. But, personally, I think Cyrix has gone way too far in ignoring FPU performance. Compilers only need to get a tiny bit better for the difference between the Cyrix and Pentium II to be very noticeable on FPU code.
Thus the L1 cache becomes a sort of L2 cache for the 256 byte instruction line buffer, which allows the Cyrix design to confine the complication of predecode bits and so on to a much smaller cache structure, and use the unified L1 cache more efficiently as described above. Although I don't know the details, the prefetch units could try to see a cache miss coming and pre-load the instruction line cache in parallel with ordinary execution; this would compensate for the instruction cache's unusually small size, I would expect to the point of making it a moot point.
Although I have not read about the Cyrix in great detail, it would seem to me that this was motivated by the desire to perform well on multimedia algorithms. The reason is that multimedia tends to use memory in streams, instead of reusing data which conventional caching strategies are designed for. So if the Cyrix's cache line locking mechanism allows redirecting of certain memory loads then this will allow them to keep the rest of their L1 cache intact for use by tables or other temporary buffers. This would be a good strategy for their next generation MXi processor (an integrated graphics and x86 processor.)
It is my understanding that Cyrix commissioned Green Hills to create a compiler for them; however, I have not encountered or even heard of any target code produced by it. Perhaps the MediaGX drivers are built with it.
Update: I've been recently pointed at Cyrix's Appnotes page, in particular note 106 which describes optimization techniques for the 6x86 and 6x86MX. It does provide a lot of good suggestions which are in line with what I know about the Cyrix CPUs, but they do not explain everything about how the 6x86MX really works. In particular, I still don't know how their "out of order" mechanism works.
It is very much like Intel's documentation, which just tells software developers what to do without giving complete explanations as to how their CPU works. The difference is that it's much shorter and more to the point.
One thing that surprised me is that the 6x86MX appears to have several extended MMX instructions! So in fact, Cyrix had actually beaten AMD (who did so with the K6-2) to (nontrivially) extending the x86 instruction set; they just didn't make a song and dance about it at the time. I haven't studied them yet, but I suspect that when Cyrix releases their 3DNow! implementation they should be able to advertise the fact that they will be supplying more total extensions to the x86 instruction set, with all of them being MMX based.
Well, whatever it is, Cyrix learned an important lesson the hard way: clock rate is more important than architectural performance. Besides keeping Cyrix in the "PR" labelling game, their clock scalability could not keep up with either Intel or AMD. Cyrix did not simply give up, however. Faced with a quickly dying architecture, a shared market with IBM, as well as an unsuccessful first foray into integrated CPUs, Cyrix did the only thing they could do -- drop IBM, get foundry capacity from National Semiconductor and sell the 6x86MX for rock bottom prices into the sub-$1000 PC market. Indeed here they remained out of reach of both Intel and AMD, though they were not exactly making much money with this strategy.
Cyrix's acquisition by National Semiconductor has kept their future processor designs funded, but Cayenne (a 6x86MX derivative with a faster FPU and 3DNow! support) has yet to appear. Whatever window of opportunity existed for it has almost surely disappeared if it cannot follow the clock rate curve of Intel and AMD. But the real design that we are all waiting for is Jalapeno. National Semiconductor is making a very credible effort to ramp up its 0.18 micron process and may beat both Intel and AMD to it. This will launch Jalapeno at speeds in excess of 600Mhz with the "PR" shackles removed, which should allow Cyrix to become a real player again.
Update: National has buckled under the pressure of keeping the Cyrix division alive (unable to produce CPUs with high enough clock rates) and has sold it off to VIA. How this affects Cyrix's ability to try to reenter the market and release next generation products remains to be seen.
Common Features |
Within the scheduler the order of the instructions is maintained. When a micro-op is ready to retire it becomes marked as such. The retire unit then waits for micro-op blocks that correspond to x86 instructions to become entirely ready for retirement and removes them from the scheduler simultaneously. (In fact, the K6 retains blocks corresponding to all the RISC86 ops scheduled per clock so that one or two x86 instructions might retire per clock. The Intel documentation is not as clear about its retirement strategies.) As instructions are retired the non-speculative CS:EIP is updated.
The speculation aspect is the fact that the branch target of a branch prediction is simply fed to the prefetch immediately, before the branch is resolved. A "branch verify" instruction is then queued up in place of the branch instruction and if the verify instruction checks out then it is simply retired (with no outputs except possibly to MSRs) like any ordinary instruction, otherwise a branch misprediction exception occurs.
Whenever an exception occurs (including branch mispredicts, page fault, divide by zero, non-maskable interrupt, etc) the currently pending instructions have to be "undone" in a sense before the processor can handle the exception situation. One way to do this is to simply rename all the registers with the older values up until the last retirement which might be available in the current physical register file, then send the processor into a kind of "single step" mode.
According to Agner Fog, the P-II retains fixed architectural registers which are not renamable and are only updated upon retirement. This would provide a convenient "undo" state. This also jibes with the documentation, which indicates that the P-II can only read at most two architectural registers per clock. The K6 does not appear to be similarly stymied, even though it too has fixed architectural registers.
The out of orderedness is limited to execution and register write-back. The benefits of this are mostly in mixing multiple instruction micro-op types so that they can execute in parallel. It is also useful for mixing multi-clock, or otherwise high-latency instructions with low-clock instructions. In the absence of these opportunities there are few advantages over previous generation CPU technologies that aren't taken care of by compilers or hand scheduling.
Contrary to what has been written about these processors, however, hand tuning of code is not unnecessary. In particular, the Intel processors still handle carry flag based computation very well, even though compilers do not; the K6 has load latencies; all of these processors still have alignment issues; and the K6 and 6x86MX prefer the LOOP instruction, which compilers do not generate. XCHG is also still the fastest way to swap two integer registers on all these processors, but compilers continue to avoid that instruction. Many of the exceptions (partial register stalls, vector decoding, etc.) are also unknown to most modern compilers.
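For the record, the swap in question is just this, shown against the three-instruction sequence compilers typically emit (which also burns a scratch register):

        xchg  eax, ebx       ; single instruction register swap

        mov   ecx, eax       ; the usual compiler output: three moves
        mov   eax, ebx       ; and a clobbered scratch register (ecx)
        mov   ebx, ecx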
In the past, penalties for cache misses, instruction misalignment and other hidden side-effects were sort of ignored. This is because on older architectures they hurt you no matter what, with no opportunity for instruction overlap, so the rule of avoiding them as much as possible was more important than knowing the precise penalty. With these architectures it's important to know how much code can be parallelized with these cache misses. Issues such as PCI bus, chip set and memory performance will have to be more closely watched by programmers.
The K6's documentation was the clearest about its cache design, and indeed it does appear to have a lot of good features. Their predecode bits are used in a very logical manner (which appears to buy the same thing that the Cyrix's instruction buffer buys them) and they have stuck with the simple to implement 2-way set associativity. A per-cache line status is kept, allowing independent access to separate lines.
Final words |
While these architectures are impressive, I don't believe that programmers can take such a relaxed attitude. There are still simple rules of coding that you have to watch out for (partial register stalls, 32 bit coding, for example) and there are other hardware limitations (at most 4 levels of speculation, a 4 deep FPU FIFO etc.) that still will require care on the part of the programmer in search of the highest levels of performance. I also hope that the argument that what these processors are doing is too complicated for programmers to model dies down as these processors are better understood.
Some programmers may mistakenly believe that the K6 and 6x86MX processors will fade away due to market dominance by Intel. I really don't think this is the case, as my sources tell me that AMD and Cyrix are selling every CPU they make, as fast as they can make them. The demand is definitely there. 3Q97 PC purchases indicated unusually strong sales for PCs at $1000 or less (dominated by Compaq machines powered by the Cyrix CPU), making up about 40% of the market.
The astute reader may notice that there are numerous features that I did not discuss at all. While it's possible that it is an oversight, I have also intentionally left out discussion of features that are common to all these processors (data forwarding, register renaming, call-return prediction stacks, and out of order execution for example.) If you are pretty sure I am missing something that should be told, don't hesitate to send me feedback.
Update: Centaur, a subsidiary of IDT, has introduced a CPU called the WinChip C6. A brief reading of the documentation on their web site indicates that it's basically a single pipe 486 with a 64K split cache, dual MMX units, some 3D instruction extensions and generally more RISCified instructions. From a performance point of view their angle seems to be that the simplicity of the CPU will allow a quick ramp up in clock rate. Their chip has been introduced at 225 and 240 Mhz initially (available in Nov 97) with an intended ramp up to 266 and 300 Mhz by the second half of 1998. They are also targeting low power consumption and small die size, with an obvious eye towards the laptop market.
Unfortunately, even in the test chosen by them (WinStone, which is limited by memory, graphics and hard disk speed as much as by the CPU), they appear to score only marginally better than the now obsolete Pentium with MMX, and worse than all other CPUs at the same clock rate. These guys will have to pick their markets carefully and rely on good process technology to deliver the clock rates they are planning for.
Update: They have since announced the WinChip 2, which is superscalar and which they expect to have far superior performance. (They claim that they will be able to clock them between 400 and 600 Mhz.) We shall see; and we shall see if they explain their architecture to a greater depth.
Update: 05/29/98 RISE (a startup technology company) has announced that they too will introduce an x86 CPU; however, they are keeping a tight lid on their architecture.
Glossary of terms |
Links |