The IA-64 Computer Architecture

A 64-bit EPIC architecture developed by Intel and HP for the 21st century

The venerable RISC and x86 computer architectures of the 1990s enjoyed huge success in bringing the computer to everyone’s fingertips. But at the beginning of the 21st century, the 32-bit computer is aging and its capabilities of the 1990s have become the limitations of 2000. Developers are looking toward 64-bit computing and to new means of achieving performance. One of the first answers has arrived: Intel and HP’s IA-64.

Intel and HP worked together throughout the 1990s to create a new 64-bit architecture. The result of their research is not only a new design and a move from 32 bits to 64 bits, but also an entirely new type of architecture called Explicitly Parallel Instruction Computing (EPIC). The IA-64, implemented by its first processor, the Intel Itanium, is one of the first 64-bit architectures destined for the desktop market. 64-bit addressing removes, for the foreseeable future, the memory size limitations imposed by 32-bit architectures. The EPIC architecture that IA-64 uses also takes new approaches on all levels of processing, allowing new levels of performance and scalability. The IA-64’s three most important methods to achieve performance are “speculative loading, predication, and explicit parallelism” (EPIC 1).

Architecture Overview

The 64-bit basis of this architecture allows very large data handling capacities. Compared to the approx. 4 GB natural limit of 32-bit systems, IA-64 can handle up to 2^64 memory addresses. In this byte-addressable system, that corresponds to billions of gigabytes of main memory. Different instructions reference data of various sizes. Bytes are 8 bits long, “short” words are 16 bits, and words are 32 bits.
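The arithmetic behind the “billions of gigabytes” figure can be checked with a short sketch. The function name `addressable_gib` is made up for illustration, and a gigabyte is assumed here to mean 2^30 bytes.

```python
GIB = 2 ** 30  # bytes per gigabyte, assuming the binary definition


def addressable_gib(address_bits):
    """Gigabytes reachable with byte addressing and the given address width."""
    return (2 ** address_bits) // GIB


if __name__ == "__main__":
    print(addressable_gib(32))  # the 32-bit limit
    print(addressable_gib(64))  # the 64-bit limit, roughly 17 billion
```

With these assumptions a 32-bit address reaches 4 GB, while a 64-bit address reaches 2^34 GB.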

In order to take advantage of vast memory size and parallel processing, IA-64 defines very large register files. There are 128 64-bit general and 128 82-bit floating-point registers. There are also 64 1-bit predicate and 8 64-bit branch registers (intended for addresses). Five CPUID and 128 application registers round out the basic set. IA-64 supports both little endian and big endian byte ordering and a user mask register can specify which ordering to use. Floating-point values are stored in IEEE FPS formats of various lengths (single, double, double extended).

EPIC is an evolution of CISC and VLIW (very long instruction word) architectures. There is therefore a large variety of instructions available. Especially notable is a large set of powerful multimedia and floating-point operations that extend the MMX set of IA-32 generations. Instructions are explicitly categorized into five categories that help determine valid combinations within parallel instruction bundles. In IA-64, 41-bit instructions are packaged in groups of three (along with some other information) to form a 128-bit instruction bundle. This is a building block of the EPIC architecture, which aims for improved performance by giving the processor explicitly parallel instructions instead of having the processor guess at parallelism opportunities from a sequential stream of instructions.

EPIC’s primary goal is to allow parallelism, a massive opportunity for performance, to exist at all levels and states of program design and execution. In other architectures, compilers and assemblers take the parallel nature of code written in a high-level language and convert it to sequential code to be fed in a stream to the processor. The processor then spends considerable resources and hardware space attempting to find opportunities to execute code in parallel with methods such as delayed branching. EPIC aims to remove this bottleneck of sequential binary code by specifying explicitly to the processor what code can be executed in parallel (see Fig. 1). This moves the burden of finding parallelism to the compiler. A good compiler can find plenty of opportunities for parallel execution. The processor can then have a simpler design, and can take full advantage of parallelism by having large throughput, large register files, and many functional units.

IA-64 offers simply massive data-handling capabilities. As mentioned earlier, 64-bit memory addressing allows the potential for billions of gigabytes of memory. The first systems in this architecture support up to 64 GB of memory but that limit will continue to grow. Between main memory and the processor is an equally impressive three-level cache. Level 1 contains 16 KB of instructions plus 16 KB of data, with low latency and a very wide bus. Level 2, which is still on-die, holds 96 KB and is connected by a 256-bit bus. Level 3 is off-die, but is very large at two to four MB.

A large cache is very important for this architecture because of the huge instruction and data throughput. Data load speculation and cache hints increase its effectiveness. A fast 128-bit bus connects to high-speed memory. System designs for IA-64 commonly include DDR memory, 64-bit 66 MHz PCI bus and AGP 8x. Flexible page sizes round out data performance and scalability features.

Literally hundreds of registers, the divisions of which are discussed above, are available at any time for the user, ensuring that data stays in the processor for as long as possible. Instructions imply which group of registers to access by their types and operand ordering. Registers are capable of rotation, a performance feature discussed later.

There are usage conventions for each register set just as in other architectures. For example, the general, floating-point, and predicate registers each have a “default” value in register zero — the value 0 for general and floating-point and 1 (execute instruction) for predicate register zero. There are rules or suggestions for using various groups of the remaining registers as well.

Instructions are 41 bits wide; they usually contain a 7-bit opcode and multiple 7-bit operand fields, in order to specify one of 128 registers chosen from the register group implied by the instruction. A 6-bit predicate register field also exists in most instructions. Various instructions fill up extra bits with flags and other miscellaneous data.

Having evolved from CISC, IA-64 has a complex, powerful set of instructions. Each instruction is a member of one of five types that are used to organize properly constructed instruction bundles (covered later). A few special operations and groups are notable.

Comparison operations exist to set predicate register(s) according to the results of a logical or arithmetic comparison. To load an immediate value, one can use add-immediate specifying a source register of r0. A variety of multimedia, floating-point (FP), and shift instructions exist. Many are vectorized or SIMD (Single Instruction Multiple Data) instructions. There are branch instructions specialized for looping, and many more instructions of all varieties.

FP values can be represented in many ways. Among some special variations, available formats are IEEE FPS single, double, and double-extended (15-bit exponent, 64-bit mantissa) precision. All four IEEE rounding modes and IEEE-standard FP exceptions are supported. There are also many other higher- and lower-precision formats that are expansions of IEEE FPS.

FP multiplication and addition are both part of the single instruction called “fused multiply-add” (Cornea–Hasegan 4). This instruction takes the form a = (b * c) + d. The advantage is that it results in less error than separate multiply and add instructions. A single multiplication can happen by specifying d = f0 (the zero register). Addition is similarly possible by specifying c = f1, the “one” register. Subtraction happens by using a related multiply-subtract operation.
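The accuracy advantage comes from rounding once instead of twice. The sketch below emulates a single-rounding multiply-add in Python using exact rational arithmetic; the helper name `fma` is an illustrative stand-in, not the hardware instruction, and the emulation only handles finite values.

```python
from fractions import Fraction


def fma(b, c, d):
    """Emulate a = (b * c) + d with one rounding at the end.

    The product and sum are formed exactly as rationals, then converted
    to a float once. Finite inputs only (an assumption of this sketch).
    """
    return float(Fraction(b) * Fraction(c) + Fraction(d))


if __name__ == "__main__":
    b = 1.0 + 2.0 ** -30
    c = 1.0 - 2.0 ** -30
    d = -1.0
    # Separate multiply and add: the product is rounded before the add.
    print(b * c + d)
    # Fused form: the tiny low-order part of the product survives.
    print(fma(b, c, d))
    # Plain multiply and plain add expressed through the fused form.
    print(fma(3.0, 4.0, 0.0), fma(3.0, 1.0, 4.0))
```

In this example the exact product is 1 - 2^-60, which separate instructions round to 1.0 before the addition, so the small term is lost; the fused form keeps it.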

Division, modulus, and square root operations are software-implemented. The multiply-add instruction allows such routines to execute efficiently with low error. Calls to such operations typically happen through a Software Assistance (SWA) request, which is a sort of trap handler. SWA also implements integer division and remainder operations.
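As a rough illustration of how a software routine can build division out of multiply-add, the sketch below refines a reciprocal estimate with Newton-Raphson steps. This is not the routine Intel supplies: the linear seed merely stands in for a hardware reciprocal approximation, the names `fma`, `reciprocal`, and `divide` are hypothetical, and the result is close to, but not guaranteed to be, the correctly rounded quotient.

```python
import math
from fractions import Fraction


def fma(b, c, d):
    """Single-rounding multiply-add, emulated with exact rationals."""
    return float(Fraction(b) * Fraction(c) + Fraction(d))


def reciprocal(x, steps=4):
    """Approximate 1/x for finite, nonzero x using only multiply-add."""
    if x == 0.0 or math.isinf(x) or math.isnan(x):
        raise ValueError("sketch handles finite, nonzero inputs only")
    m, e = math.frexp(abs(x))          # abs(x) = m * 2**e with 0.5 <= m < 1
    y = 48.0 / 17.0 - (32.0 / 17.0) * m  # linear seed for 1/m (assumed)
    for _ in range(steps):
        err = fma(-m, y, 1.0)          # err = 1 - m*y
        y = fma(y, err, y)             # y = y + y*err, error roughly squares
    return math.copysign(math.ldexp(y, -e), x)


def divide(a, b):
    """Quotient as a multiply by the refined reciprocal."""
    return a * reciprocal(b)


if __name__ == "__main__":
    print(divide(10.0, 4.0))
    print(reciprocal(3.0))
```

Each step roughly squares the relative error of the estimate, so a seed accurate to a few bits reaches double precision in about four steps.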

Implementing EPIC with the IA-64 Architecture

Bundles are the basic unit of parallelism in IA-64 (Jarp 18). They explicitly show the processor how instructions can execute in parallel and what kind of interdependence (and independence) exists among instructions. A group of three 41-bit instructions plus a 5-bit “template” forms a 128-bit bundle (see Fig. 2). The processor loads and manipulates entire bundles, so the Instruction Pointer (equivalent to the MIPS RISC PC register) points to the address of the first byte of a bundle. Accordingly, branch operations specify a bundle as a branch target.

Templates specify what kinds of instructions can fill each slot of the bundle. They aid in ensuring combinations of instructions that can execute in parallel, without interdependency. Bit 0 (little endian) of the template is a stop bit, which specifies a boundary between parallel instructions and bundles. Bits 1–4 specify one of 12 different templates. Templates are combinations of the letters M, I, A, F, and B, which represent the five general instruction categories. A special template is included to allow 64-bit immediate values.
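The bundle layout described above (a 5-bit template in the low bits followed by three 41-bit slots) can be sketched as simple bit packing. The function names and the sample slot values are illustrative only; the slot contents here are arbitrary numbers, not real instruction encodings.

```python
SLOT_BITS = 41
TEMPLATE_BITS = 5
SLOT_MASK = (1 << SLOT_BITS) - 1
TEMPLATE_MASK = (1 << TEMPLATE_BITS) - 1


def pack_bundle(template, slots):
    """Combine a 5-bit template and three 41-bit slots into a 128-bit integer."""
    if not 0 <= template <= TEMPLATE_MASK:
        raise ValueError("template must fit in 5 bits")
    if len(slots) != 3:
        raise ValueError("a bundle holds exactly three instruction slots")
    bundle = template
    for i, slot in enumerate(slots):
        if not 0 <= slot <= SLOT_MASK:
            raise ValueError("each slot must fit in 41 bits")
        bundle |= slot << (TEMPLATE_BITS + i * SLOT_BITS)
    return bundle


def unpack_bundle(bundle):
    """Split a 128-bit integer back into (template, [slot0, slot1, slot2])."""
    template = bundle & TEMPLATE_MASK
    slots = [(bundle >> (TEMPLATE_BITS + i * SLOT_BITS)) & SLOT_MASK
             for i in range(3)]
    return template, slots


if __name__ == "__main__":
    bundle = pack_bundle(0x11, [1, 2, SLOT_MASK])
    template, slots = unpack_bundle(bundle)
    stop = template & 1  # bit 0 of the template, per the description above
    print(len(bundle.to_bytes(16, "little")), template, stop, slots)
```

The total is 5 + 3 x 41 = 128 bits, which is why a bundle occupies exactly 16 bytes.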

The unique feature that is key to IA-64’s performance is the parallel execution offered by an EPIC approach. Parallelism in the hardware matches parallelism in the software to help eliminate sequential bottlenecks. There are two to four copies of each functional unit (arithmetic/logic, branch, floating point, and other modules) in the Itanium processor family. There is also a ten-stage pipeline. The scheduler hardware can disperse instructions into up to nine slots, and the first Itanium processor was able to assign up to six EPIC instructions in a cycle, double the capabilities of 32-bit x86 processors. The actual number of operations per cycle, as opposed to instructions, can approach 20. This refers to complex instructions and floating-point operations. (Simon 5) Obviously, this is just the potential of the processor and it may not be able to reach this potential with much of the code it receives. Good compilers and smart assemblers are the primary means of optimizing parallel execution of code.

EPIC uses predication, which conditionally executes individual instructions based upon the value of special registers. Various architectures have used versions of predication and condition codes throughout the years but predication in EPIC is unusually widespread; predicates can be specified for almost any instruction. There are 64 one-bit predicate registers. When using a predicate register, a value of 1 means to execute the instruction. Predicate register zero has the value of one — “always execute”. This is the default predicate for instructions.

Predication is used mainly to replace “if-then” branching situations. First, a comparison instruction sets the value of a predicate or predicates based upon the result of a logical or Boolean comparison. This is equivalent to the condition part of a conditional branch. Then both sets of instructions (the “if” and, if applicable, the “else” part) appear. Predicates tell the processor which of the two sets of instructions to execute. This results in performance gains by changing the control dependency to a data dependency (Dulong 11). Conditional branches cause problems by not knowing soon enough which instruction to jump to, even when delayed branching and prediction are used. Predication eliminates pipeline bubbles and branch mispredictions because there is simply no longer any branch. Instruction bundling and parallel execution allow both sets of predicated instructions to be examined and executed in parallel. The processor can fetch and begin executing both conditions and then only “keep” the ones for which the predicates are true.
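The idea can be modeled with a toy interpreter. The sketch below computes the larger of two values the way an if-converted sequence might: one compare writes a predicate and its complement, and both “then” and “else” moves are issued, each guarded by its predicate. The register names, tuple format, and function name are invented for illustration.

```python
def predicated_max(a, b):
    """Model of 'if (a > b) r3 = a; else r3 = b;' with no branch."""
    pred = [False] * 64
    pred[0] = True                      # predicate register zero is always 1
    regs = {"r1": a, "r2": b, "r3": None}

    # Compare: sets p1 to the result and p2 to its complement.
    pred[1] = regs["r1"] > regs["r2"]
    pred[2] = not pred[1]

    # Both arms appear in the instruction stream: (predicate, dest, source).
    program = [
        (1, "r3", "r1"),                # (p1) r3 = r1, the "then" arm
        (2, "r3", "r2"),                # (p2) r3 = r2, the "else" arm
    ]
    for qp, dest, src in program:
        # The Python 'if' stands in for the hardware discarding the result
        # of an instruction whose predicate is 0.
        if pred[qp]:
            regs[dest] = regs[src]
    return regs["r3"]


if __name__ == "__main__":
    print(predicated_max(3, 7), predicated_max(9, 2))
```

Because the two moves depend only on the predicates and not on each other, they could sit in the same group of parallel instructions.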

Many architectures have the ability to pre-fetch data. IA-64 provides speculative loading (the form described here, hoisting a load above a branch, is what the IA-64 documentation calls control speculation), which is similar to pre-fetching but aims to be more efficient at preventing cache misses and at avoiding unnecessary loads. Programmers can issue a speculative load well before the data is necessary, even before branches that may make the data invalid or irrelevant. This introduces the possibility of load errors, because a load that would be valid after a particular branch may not be valid before that branch, especially if the branch is taken. Many architectures that allow pre-fetching simply ignore load errors, and others lose performance from suppressing the errors. IA-64 includes a “NaT” (Not a Thing) error bit that follows the loaded data through operations. At the point where the data is used, a check instruction ensures that the data loaded properly. Usually, any load errors that occurred from speculation are thrown out because predicates or branching render the data unused. However, the check instruction provides a means to find and recover from true errors, a feature that other architectures tend not to have. (IA-64 Architecture Overview)
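The deferred-error mechanism can be sketched as a value that carries an error flag with it. In the model below, a failed speculative load sets the flag (commonly written NaT, “Not a Thing”) instead of raising, arithmetic propagates the flag, and a check at the point of use runs recovery code only if the flag is set. The class and function names, and the dictionary used as memory, are assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass
class Reg:
    value: int = 0
    nat: bool = False   # deferred-error flag travelling with the value


def speculative_load(memory, addr):
    """Load early; on a bad address, record the error instead of faulting."""
    if addr in memory:
        return Reg(memory[addr])
    return Reg(0, nat=True)


def add(x, y):
    """Arithmetic passes the error flag along to its result."""
    return Reg(x.value + y.value, nat=x.nat or y.nat)


def check(reg, recover):
    """At the point of use, run recovery code only if the flag is set."""
    return recover() if reg.nat else reg


if __name__ == "__main__":
    memory = {16: 5}
    good = add(speculative_load(memory, 16), Reg(1))
    bad = add(speculative_load(memory, 99), Reg(1))
    print(check(good, lambda: Reg(-1)).value)
    print(check(bad, lambda: Reg(-1)).value)
```

If the speculated value ends up unused, the check is never reached and the deferred error costs nothing.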

An advanced feature of IA-64 is register rotation. To the software, register rotation appears as if the values within each register are shifted to an adjacent register. The upper 3/4 of each set of general, FP, and predicate registers can rotate. Rotating registers can also serve as a stack when calling and returning from procedures because of the user’s capability to define the degree of rotation. Rotation allows efficient looped operations over a stream of data. Each stage of the loop uses a different predicate register, starting from p16 (the first rotating predicate) and increasing incrementally. Similarly, data is placed strategically in rotating registers. At the bottom of the loop is an instruction to trigger register rotation and a branch to loop back. The result is “software pipelining” — different stages of the loop execute in parallel with each cycle, and multiple iterations of the loop can be executing simultaneously. (See Fig. 3)
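A small simulation may make the mechanism concrete. The sketch below models a two-stage loop (load a value, then double it) over rotating data and predicate registers. Logical register 0 of the predicate file plays the role of p16; the loop counter and epilog counter imitate the bookkeeping of a loop-closing branch. All names are hypothetical and the model ignores timing entirely.

```python
class RotatingRegs:
    """Register file whose logical names shift by one on each rotation."""

    def __init__(self, size):
        self.phys = [0] * size
        self.base = 0

    def _index(self, logical):
        return (logical + self.base) % len(self.phys)

    def read(self, logical):
        return self.phys[self._index(logical)]

    def write(self, logical, value):
        self.phys[self._index(logical)] = value

    def rotate(self):
        # After this, the value seen in logical register n appears in n + 1.
        self.base = (self.base - 1) % len(self.phys)


def pipelined_double(data):
    if not data:
        return []
    regs, preds = RotatingRegs(4), RotatingRegs(4)
    preds.write(0, 1)                 # enable stage 1 for the first pass
    loop_count, epilog_count = len(data) - 1, 2   # two stages to drain
    out, i = [], 0
    while True:
        if preds.read(0):             # stage 1: load the next element
            regs.write(0, data[i])
            i += 1
        if preds.read(1):             # stage 2: use the previous pass's load
            out.append(regs.read(1) * 2)
        if loop_count > 0:            # more iterations to start
            loop_count, next_pred = loop_count - 1, 1
        elif epilog_count > 1:        # no new iterations; drain the pipeline
            epilog_count, next_pred = epilog_count - 1, 0
        else:
            break
        regs.rotate()
        preds.rotate()
        preds.write(0, next_pred)
    return out


if __name__ == "__main__":
    print(pipelined_double([1, 2, 3]))
```

In each pass the load for one iteration and the computation for the previous iteration touch different registers, which is what allows them to execute together.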

With its performance potential unquestionable, the Itanium family offers many extra features to make it more attractive to the business world. Two of the most important features are backward compatibility and stability.

The Itanium line can run unmodified 32-bit x86 code (including operating systems) by switching to a compatibility mode. An on-chip decoder translates 32-bit instructions at runtime to EPIC bundles. A field in one of the status registers keeps track of whether the processor is operating in compatibility mode. Performance suffers in compatibility mode because of the decoding and scheduling overhead. (Simon 11) However, more importantly, this protects investments in 32-bit software to help provide a smooth transition to 64-bit computing.

For stability, full ECC protection exists on the buses, pipelines, and other connections within the Itanium processor and on the buses to L3 cache and other hardware. The Itanium family was designed for servers and workstations and has built-in multi-processor support. Two- and four-processor systems currently exist but the architecture can support massively parallel systems.

Programming and Optimizing for the IA-64

The combination of bundling, parallelism, and the complex EPIC instruction set makes quality assembly programming for IA-64 a challenging task. RISC architectures place a heavy load on the processor to execute code efficiently. In contrast, EPIC places most of the burden upon the compiler and assembler. Using bundles, templates, and explicit stop points (which separate blocks of code that can execute in parallel), the assembly programmer takes responsibility to organize code in a way that maximizes opportunities for parallelism while ensuring that intra-block data dependence does not exist. Templates help, but it is still a challenge.
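The task of placing stops can be illustrated with a greedy sketch. It assumes the rule is simply that no instruction in a parallel block may read or write a register written earlier in the same block; the real rules have more cases, and the instruction format and function name below are invented for illustration.

```python
def split_into_groups(instrs):
    """Place a stop before any instruction that depends on the current block.

    instrs is a list of (dest, sources) pairs. Returns a list of blocks,
    each a list of instruction indices that may execute in parallel.
    """
    groups = [[]]
    written = set()                   # registers written in the current block
    for idx, (dest, sources) in enumerate(instrs):
        depends = dest in written or any(s in written for s in sources)
        if depends:
            groups.append([])         # explicit stop: start a new block
            written = set()
        groups[-1].append(idx)
        written.add(dest)
    return groups


if __name__ == "__main__":
    program = [
        ("r1", ["r2", "r3"]),         # r1 = r2 + r3
        ("r4", ["r5"]),               # r4 = r5 (independent of the first)
        ("r6", ["r1", "r4"]),         # needs both earlier results
        ("r7", ["r6"]),               # needs the previous result
    ]
    print(split_into_groups(program))
```

A real assembler or compiler would also reorder instructions to lengthen each block and would choose templates to match the available functional units; this sketch only finds where stops are required in a fixed order.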

Recognizing this, Intel has an IA-64 assembler available. It provides, among other features, code coloring, wizards, and smart assembly to aid development. There are two assembly modes, explicit and automatic. In explicit mode, the programmer must define bundle boundaries, templates, and stops (Tal 4). The assembler then validates the code. This is good for high performance and expert programming. In automatic mode, the assembler analyzes more “traditional” non-bundled code and rearranges it into bundled code. The assembler also provides the ability to use symbolic identifiers that it maps efficiently to registers during assembly, taking advantage of the large register file. Compilers face a similar challenge of dealing with an unfamiliar architecture. Intel also offers a compiler that uses all the latest optimization technology and incorporates new algorithms designed for the IA-64 processor.

Optimizing for IA-64 primarily involves milking out every opportunity for parallelism possible. The parallel code must not only be valid, but also should appear in an order that strategically provides steady workloads to each functional unit of the processor. Good organization can fill the entire “width” of the processor.

A variety of situations optimize especially well for this architecture. Initialization blocks are by nature parallel because their instructions often work on many pieces of unrelated data. As another example, multi-way branches and branches with many ANDs or ORs need a variety of conditional statements. In these situations, multiple comparisons to set up predicates can usually happen in parallel, which is a very good opportunity for increased performance. In other architectures, such situations may involve multiple conditional branches (which in themselves are slow instructions) and “end if” branches that must all execute in sequence. IA-64 can eliminate branching even in these complex cases. Another example of optimization mentioned earlier is software-pipelined loops using rotating registers. Many more examples exist. Code that is inherently sequential, by contrast, is often difficult to optimize. (Jarp 85)

IA-64 in Reality

IA-64 was developed through the 1990s in a joint research venture by Intel and HP. The Itanium processor, the first IA-64 processor as well as Intel’s first 64-bit offering, entered the market in April 2001 after delays of several months. The Itanium family will continue to be only for servers and high-end workstations for several years. Second-generation Itanium 2 processors still cost over a thousand dollars, as of spring 2003. The Itanium line has nonetheless proven its massive performance capabilities, especially with applications such as in-memory databases, data-intensive research, and high-performance graphics. It is currently supported by a wide range of systems and server vendors and has Linux and Microsoft Windows operating systems ported. Future IA-64 processor generations will continue to improve in both performance and cost, and will gradually move toward the general consumer market.

AMD has released competitive chips. The Opteron is capable of 32-bit and 64-bit processing, and has an advantage in being much more affordable. Unlike earlier AMD processors, it is not a “clone” of its Intel peer and there are indications that AMD will remain on a separate path in 64-bit computing. AMD will also release the Athlon 64 in fall 2003. (Fordahl)

Intel perceives IA-64 as its biggest evolution in processing since the inception of the ubiquitous x86 line. Indeed, like the advances of the 386, the Itanium is a long step above the Pentium. Intel has proven its ability for innovation in this entirely new architecture that represents a forward-looking jump to the computer of the 21st century.

Figures

Figure 1a: Sequential Program Model. Figure 1b: Goal of EPIC.

Figure 1: Removing the Bottleneck of Sequential Code (Jacob 4–5)

Figure 2: Assembly Code Examples of Bundles, Templates, Stops, and Predicates (Tal 3)


Figure 3: Software Pipelining (Jarp 87)


Works Cited

  1. Cornea–Hasegan, Marius, and Bob Norin. “IA-64 Floating-Point Operations and the IEEE Standard for Binary Floating-Point Arithmetic.” Intel Technology Journal. Q4 1999: 15 pages. Viewed 30 April 2003. Available online http://developer.intel.com/technology/itj/q41999/articles/art_6.htm
  2. Dulong, Carole, et al. “An Overview of the Intel IA-64 Compiler.” Intel Technology Journal. Q4 1999: 15 pages. Viewed 27 April 2003. Available online http://developer.intel.com/technology/itj/q41999/articles/art_1.htm
  3. EPIC. 23 Dec. 2002. Viewed 28 April 2003. Available online http://searchcio.techtarget.com/sDefinition/0,,sid19_gci214560,00.html
  4. Fordahl, Matthew. “Fingers crossed, AMD prepares to launch chip.” The Miami Herald. 22 April 2003: 8C. Newsbank. Herrick Library 29 April 2003. http://www.newsbank.com/
  5. “IA-64 Architecture Overview.” IA-64 References. Feb. 1999. HP Systems & VLSI Technology Division. Viewed 1 May 2003. Available online http://cpus.hp.com/technical_references/ia64_overview_wp.shtml
  6. Jacob, Bruce. The IA-64 Architecture. 16 Sept. 1998. Department of Electrical Engineering, University of Maryland at College Park. Viewed 29 April 2003. Available online http://www.ece.umd.edu/~blj/talks/IA-64.2.pdf
  7. Jarp, Sverre. IA-64 Architecture: a Detailed Tutorial. 8 Nov 1999. CERN. Viewed 29 April 2003. Available online http://sverre.home.cern.ch/sverre/IA64_1.pdf
  8. Simon, Jon. Itanium Technology Guide. 4 Dec 2000. Viewed 29 April 2003. Available online http://www.sharkyextreme.com/hardware/guides/itanium/
  9. Tal, Ady, et al. “Assembly Language Programming Tools for the IA-64 Architecture.” Intel Technology Journal. Q4 1999: 11 pages. Viewed 27 April 2003. Available online http://developer.intel.com/technology/itj/q41999/articles/art_3.htm