
A description of the microarchitecture of the Pentium Pro processor

  • Branch uops are tagged in the in-order pipeline with their fallthrough address and the destination that was predicted for them;
  • Removal of local prediction;
  • These uops are scheduled independently to maximize their concurrency, but must re-combine in the store buffer for the store to complete;
  • Early in the Pentium Pro processor project, we studied the importance of memory access reordering;
  • The 512 entry BTB uses an extension of Yeh's algorithm to provide greater than 90 percent prediction accuracy;
  • The Pentium Pro processor is implemented as three independent engines coupled with an instruction pool as shown in Figure 1 below.
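The branch-prediction point above mentions an extension of Yeh's algorithm. The source does not describe that extension, so what follows is only a minimal sketch of the generic two-level adaptive idea: a per-branch history register selects a 2-bit saturating counter. The class name, the 4-bit history length, and the dictionary-based tables are illustrative assumptions, not the Pentium Pro's actual BTB organization.

```python
class TwoLevelPredictor:
    """Generic two-level adaptive predictor (illustrative, not the P6 design)."""

    def __init__(self, history_bits=4):
        # Assumption: 4 bits of per-branch history; chosen only for the example.
        self.mask = (1 << history_bits) - 1
        self.history = {}   # branch address -> recent outcomes as a bit pattern
        self.counters = {}  # (branch address, pattern) -> 2-bit counter, 0..3

    def predict(self, addr):
        pattern = self.history.get(addr, 0)
        # Counters start at 2 ("weakly taken"); values 2 and 3 predict taken.
        return self.counters.get((addr, pattern), 2) >= 2

    def update(self, addr, taken):
        pattern = self.history.get(addr, 0)
        key = (addr, pattern)
        counter = self.counters.get(key, 2)
        # Saturating increment on taken, saturating decrement on not taken.
        counter = min(counter + 1, 3) if taken else max(counter - 1, 0)
        self.counters[key] = counter
        # Shift the newest outcome into the branch's history register.
        self.history[addr] = ((pattern << 1) | int(taken)) & self.mask


def accuracy(predictor, addr, outcomes):
    """Fraction of outcomes predicted correctly, training as it goes."""
    correct = 0
    for taken in outcomes:
        if predictor.predict(addr) == taken:
            correct += 1
        predictor.update(addr, taken)
    return correct / len(outcomes)
```

For a loop branch that is taken three times and then falls through, repeated many times, this sketch mispredicts only while the pattern is still being learned. A single 2-bit counter without history would mispredict the fall-through on every iteration.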

The Pentium Pro incorporated a new microarchitecture in a departure from the Pentium's design. It has a decoupled, 14-stage superpipelined architecture that uses an instruction pool. The Pentium Pro (P6) featured many advanced concepts not found in the Pentium, although it was not the first or only x86 processor to implement them (see the NexGen Nx586 or Cyrix 6x86).

The Pentium Pro pipeline had extra decode stages to dynamically translate IA-32 instructions into buffered micro-operation sequences which could then be analysed, reordered, and renamed in order to detect parallelizable operations that may be issued to more than one execution unit at once.

The Pentium Pro thus featured out-of-order execution, including speculative execution, via register renaming. There are three instruction decoders. The decoders are not equal in capability: one is a general decoder and the other two are simple decoders. This restricts the Pentium Pro's ability to decode multiple instructions simultaneously, limiting superscalar execution. The micro-ops are RISC-like; that is, they encode an operation, two sources, and a destination.

The general decoder can generate up to four micro-ops per cycle, whereas the simple decoders can generate one micro-op each per cycle. Thus, x86 instructions that operate on memory and need more than one micro-op can only be processed by the general decoder.

Likewise, the simple decoders are limited to instructions that can be translated into one micro-op.

Tuning the Pentium Pro Microarchitecture

Instructions that require more than four micro-ops are translated with the assistance of a sequencer, which generates the required micro-ops over multiple clock cycles. In each clock cycle, up to five micro-ops can be dispatched to five execution units.
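The decoder rules described above (one general decoder producing up to four micro-ops, two simple decoders producing one each, and a sequencer for anything longer) can be sketched as a small cycle counter. This is a simplified model, not a description of the real front end: the function name is hypothetical, instructions are assumed to decode strictly in program order with the general decoder taking the first instruction of each group, and the sequencer is assumed to emit four micro-ops per cycle with nothing decoding alongside it.

```python
import math


def decode_cycles(uop_counts):
    """Count decode cycles for a list of per-instruction micro-op counts."""
    cycles = 0
    i = 0
    n = len(uop_counts)
    while i < n:
        uops = uop_counts[i]
        if uops > 4:
            # Assumption: the sequencer emits up to four micro-ops per cycle
            # and no other instruction decodes during those cycles.
            cycles += math.ceil(uops / 4)
            i += 1
            continue
        # The general decoder takes the first instruction of the group.
        i += 1
        # Each simple decoder can take a following one-micro-op instruction.
        for _ in range(2):
            if i < n and uop_counts[i] == 1:
                i += 1
            else:
                break
        cycles += 1
    return cycles
```

Under these assumptions, `decode_cycles([2, 1, 1, 2, 1, 1])` needs two cycles, while `decode_cycles([2, 2, 2])` needs three, because every two-micro-op instruction must wait for the general decoder.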

The Pentium Pro has a total of six execution units: two integer units, one floating-point unit (FPU), a load unit, a store address unit, and a store data unit. One of the integer units shares the same ports as the FPU, and therefore the Pentium Pro can dispatch only one integer micro-op and one floating-point micro-op, or two integer micro-ops, per cycle, in addition to micro-ops for the other three execution units.

Of the two integer units, only one has the full complement of functions such as a barrel shifter, multiplier, and divider. The second integer unit, which shares paths with the FPU, does not have these facilities and is limited to simple operations such as add, subtract, and the calculation of branch target addresses.
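The per-cycle dispatch limits described above can be written as a small check. This is a sketch of the stated rule only: the function name and the micro-op kind labels are hypothetical, and the difference between the full and the simple integer unit is deliberately left out.

```python
from collections import Counter

# Hypothetical labels for the kinds of micro-op in this sketch.
KINDS = {"int", "fp", "load", "store_addr", "store_data"}


def can_dispatch_together(uops):
    """Return True if all the given micro-op kinds could dispatch in one cycle."""
    counts = Counter(uops)
    unknown = set(counts) - KINDS
    if unknown:
        raise ValueError(f"unknown micro-op kinds: {sorted(unknown)}")
    # At most one micro-op each for the load, store address and store data units.
    if any(counts[kind] > 1 for kind in ("load", "store_addr", "store_data")):
        return False
    # One integer unit shares its ports with the FPU, so a cycle carries
    # either two integer micro-ops, or one integer and one floating-point.
    if counts["fp"] > 1 or counts["int"] > 2:
        return False
    return counts["int"] + counts["fp"] <= 2
```

For example, two integer micro-ops plus a load, a store address, and a store data micro-op fill all five slots, whereas two integer micro-ops plus a floating-point micro-op do not fit in one cycle.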

The FPU executes floating-point operations.

  1. While effective, this is another expensive solution, especially considering the speed requirements of today's L2 cache SRAM components.
  2. The Pentium Pro processor takes an innovative approach. To avoid this memory latency problem, the Pentium Pro processor "looks ahead" into its instruction pool at subsequent instructions and does useful work rather than stalling.
  3. The first instruction in this example is a load of r1 that, at run time, causes a cache miss.
  4. Using the same process as a volume production processor practically assured that the Pentium Pro processor would be manufacturable, but it meant that Intel had to focus on an improved microarchitecture for all of the performance gains.
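The "look-ahead" behaviour in the list above can be illustrated with a toy dataflow model. Only the first instruction (a load of r1 that misses the cache) comes from the text; the other three instructions, the 10-cycle miss latency, the one-cycle latency for everything else, and the unlimited number of execution units are assumptions made for the sketch. Register renaming is assumed, so only true data dependencies delay an instruction.

```python
def completion_times(program, miss_latency=10):
    """Return the cycle at which each instruction's result becomes available.

    Each instruction is (destination, sources, misses_cache).
    """
    ready_at = {}  # register -> cycle at which its newest value is available
    finish = []
    for dest, sources, misses_cache in program:
        # An instruction can start once all of its source values are ready.
        start = max((ready_at.get(src, 0) for src in sources), default=0)
        latency = miss_latency if misses_cache else 1
        done = start + latency
        ready_at[dest] = done
        finish.append(done)
    return finish


# Hypothetical four-instruction sequence.
program = [
    ("r1", ["r0"], True),         # load r1 from memory; misses the cache
    ("r2", ["r1", "r2"], False),  # needs r1, so it waits for the load
    ("r5", ["r5"], False),        # independent of the load
    ("r6", ["r6", "r3"], False),  # independent of the load
]
```

In this model the third and fourth instructions finish long before the load returns, which is the useful work an in-order processor would have left waiting behind the miss.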

Addition and multiplication are pipelined and have a latency of three and five cycles, respectively. Division and square-root are not pipelined and are executed in separate units that share the FPU's ports. Division and square root have a latency of 18-36 and 29-69 cycles, respectively. The smallest number is for single precision 32-bit floating-point numbers and the largest for extended precision 80-bit numbers. Division and square root can operate simultaneously with adds and multiplies, preventing them from executing only when the result has to be stored in the ROB.

After the microprocessor was released, a bug was discovered in the floating-point unit, commonly called the "Pentium Pro and Pentium II FPU bug" and referred to by Intel as the "flag erratum". The bug occurs under some circumstances during floating-point-to-integer conversion when the floating-point number does not fit into the smaller integer format, causing the FPU to deviate from its documented behaviour.

The bug is considered to be minor and occurs under such special circumstances that very few, if any, software programs are affected. The Pentium Pro P6 microarchitecture was used in one form or another by Intel for more than a decade. The design's various traits would continue after that in the derivative core called "Banias" in the Pentium M and in Intel Core (Yonah), which itself would evolve into the Core microarchitecture (Core 2 processor) in 2006 and onward.

Weak performance on legacy code, together with the high cost of Pentium Pro systems, caused a rather lackluster reception among PC enthusiasts at the time.

Pentium Pro (P6) 6th generation x86 History

The performance issues on legacy code were later partially mitigated by Intel with the Pentium II. Methods to circumvent this included setting VESA drawing to system memory instead of video memory in games such as Quake, and later on utilities such as FASTVID emerged, which could double performance in certain games by enabling the write combining features of the CPU.

However, its lack of an MMX implementation reduced performance in multimedia applications that made use of those instructions. At the time, manufacturing technology did not feasibly allow a large L2 cache to be integrated into the processor core.

Intel instead placed the L2 die(s) separately in the package, which still allowed the cache to run at the same clock speed as the CPU core. Additionally, unlike most motherboard-based cache schemes that shared the main system bus with the CPU, the Pentium Pro's cache had its own back-side bus (called the dual independent bus by Intel).

  • A brute-force approach to this problem is, of course, increasing the size of the L2 cache to reduce the miss ratio;
  • A partially ordered unit responsible for connecting the three internal units to the real world;
  • Dynamic cache activation by quadrant selector from sleep states;
  • The P6 architecture lasted three generations from the Pentium Pro to Pentium III, being characterised by low power consumption, excellent integer performance, and relatively high instructions per cycle IPC;
  • A Tour of the Pentium® Pro Processor Microarchitecture. Introduction: One of the Pentium® Pro processor's primary goals was to significantly exceed the performance of the 100 MHz Pentium® processor while being manufactured on the same semiconductor process.

Because of this, the CPU could read main memory and cache concurrently, greatly reducing a traditional bottleneck. The cache was also "non-blocking", meaning that the processor could issue more than one cache request at a time (up to 4), reducing cache-miss penalties. These properties combined to produce an L2 cache that was much faster than the motherboard-based caches of older processors. In multiprocessor configurations, the Pentium Pro's integrated cache greatly improved performance in comparison to architectures which had each CPU sharing a central cache.

However, this far faster L2 cache did come with some complications.

  1. This approach allows the "execute" phase of the Pentium Pro processor to have much more visibility into the program's instruction stream so that better scheduling may take place.
  2. This allows instructions to be started in any order but always be completed in the original program order.
  3. We shall travel down the Pentium Pro processor pipeline to understand the role of each unit.
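The point in the list above, that instructions may start in any order but always complete in original program order, can be sketched as a simple in-order retirement rule. The function name and the example finish cycles are hypothetical, and the sketch ignores any limit on how many instructions can retire per cycle.

```python
def retire_cycles(finish_cycles):
    """Given when each instruction finishes executing (listed in program
    order), return the cycle at which each one can retire."""
    retired = []
    previous = 0
    for done in finish_cycles:
        # An instruction retires no earlier than its own completion and
        # no earlier than the instruction before it in program order.
        previous = max(previous, done)
        retired.append(previous)
    return retired
```

With hypothetical finish cycles of 10, 11, 1, and 1, the last two instructions execute early but retire at cycle 11, after the older instructions ahead of them.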

The Pentium Pro's "on-package cache" arrangement was unique. The processor and the cache were on separate dies in the same package and connected closely by a full-speed bus. The two or three dies had to be bonded together early in the production process, before testing was possible. This meant that a single, tiny flaw in either die made it necessary to discard the entire assembly, which was one of the reasons for the Pentium Pro's relatively low production yield and high cost.
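The yield argument above can be made concrete with a short calculation. This is a sketch under stated assumptions: defects in each die are taken to be independent, bonding itself is assumed to lose nothing, and the yield figures in the usage note are made up for illustration rather than taken from Intel data.

```python
def package_yield(die_yields):
    """Fraction of assembled packages that work when every die must be good.

    Assumes independent die defects and untested dies at bonding time.
    """
    result = 1.0
    for y in die_yields:
        if not 0.0 <= y <= 1.0:
            raise ValueError("each yield must be between 0 and 1")
        # Every die has to be good, so the probabilities multiply.
        result *= y
    return result
```

With a hypothetical CPU die yield of 0.8 and cache die yield of 0.9, `package_yield([0.8, 0.9])` gives about 0.72, lower than either die alone. Adding a second cache die at the same yield lowers it further.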

The chip was popular in symmetric multiprocessing configurations, with dual and quad SMP server and workstation setups being commonplace.