CPU design

To a large extent, the design of a CPU, or central processing unit, is the design of its control unit. The modern (ie, 1965 to 1985) way to design control logic is to write a microprogram

CPU design was originally an ad-hoc process. Just getting a CPU to work was a substantial governmental and technical event.

Key design innovations include cache, virtual memory, instruction pipelining, superscalar, CISC, RISC, virtual machine, emulators, microprogram, and stack.

Table of contents

1 General Purpose CPU Design

1.1 1950s: Early Designs
1.2 1960s: The Computer Revolution and CISC
1.3 1970s: Large Scale Integration
1.4 Early 1980s: The Lessons of RISC
1.5 Mid 1980s to Today: Synthesis
1.6 1990 to Today: Looking Forward

2 Embedded Design

2.7 Other design issues

3 See also

Computers throughout the early 1950s were similar in that they all contained a central processor that was unique to that machine. Programs written for one machine would not run on another, and most often wouldn't run on other machines from the same company. Each design differed in the types of instructions they supported, and few machines could be considered "general purpose". There simply wasn't enough space to wire in a full set of instructions using the technology of the day (for instance the SAGE systems filled entire floors) so each machine targeted a certain solution.

By the end of the 1950s commercial builders had developed factory-constructed, truck-deliverable computers. The most widely installed computer was the IBM 650, which used drum memory onto which programs were loaded using either paper tape or punch cards. Some very high-end machines also included core memory which provided higher speeds. Hard disks were also starting to become popular.

Computers are automatic abaci. The type of number system affects the way they work. In the early 1950s most computers were built for specific numerical processing tasks, and many machines used decimal numbers as their basic number system – that is, the mathematical functions of the machines worked in base-10 instead of base-2 as is common today. These were not merely binary coded decimal. The machines actually had ten vacuum tubes per digit in each register. Some early Soviet computer designers implemented systems based on ternary logic; that is, a bit could have three states: +1, 0, or -1, corresponding to positive, no, or negative voltage.

An early project for the U.S. Air Force, BINAC attempted to make a lightweight, simple computer by using binary arithmetic. It deeply impressed the industry.

As late as 1970, major computer languages such as "C" were unable to standardize their numeric behavior because decimal computers had groups of users too large to alienate.

Even when designers used a binary system, they still had many odd ideas. Some used sign-magnitude arthmetic (-1 = 10001), rather than modern two's complement arithmetic (-1 = 11111). Most computers used six-bit character sets, because they adequately encoded Hollerith cards. It was a major revelation to designers of this period to realize that the data word should be a multiple of the character size. They began to design computers with 12, 24 and 36 bit data words.

In this era, Grosch's law dominated computer design: Computer capacity increased as the square of its cost.

1960s: The Computer Revolution and CISC

One major problem with early computers was that a program for one would not work on others. Computer companies found that their customers had little reason to remain loyal to a particular brand, as the next computer they purchased would be incompatible anyway. At that point price and performance were usually the only concerns.

In 1962, IBM bet the company on a new way to design computers. The plan was to make an entire family of computers that could all run the same software, but with different performances, and at different prices. As users' requirements grew they could move up to larger computers, and still keep all of their investment in programs, data and storage media.

In order to do this they designed a single reference computer called the System 360 (or S/360). The System 360 was a virtual computer, a reference instruction set and capabilities that all machines in the family would support. In order to provide different classes of machines, each computer in the family would use more or less hardware emulation, and more or less microprogram emulation, to create a machine capable of running the entire System 360 instruction set.

For instance a low-end machine could include a very simple processor for low cost. However this would require the use of a larger microcode emulator to provide the rest of the instruction set, which would slow it down. A high-end machine would use a much more complex processor that could directly process more of the System 360 design, thus running a much simpler and faster emulator.

IBM chose to make the reference instruction set quite complex, and very capable. This was a conscious choice. Even though the computer was complex, its "control store" containing the microprogram would stay relatively small, and could be made with very fast memory. Another important effect was that a single instruction could describe quite a complex sequence of operations. Thus the computers would generally have to fetch fewer instructions from the main memory, which could be made slower, smaller and less expensive for a given combination of speed and price.

An often-overlooked feature of the S/360 was that it was the first instruction set designed for data processing, rather than mathematical calculation. The instruction set was designed to manipulate not just simple integer numbers, but text, scientific floating-point (similar to the numbers used in a calculator), and the decimal arithmetic needed by accounting systems.

The S/360 system was the first computer to make major use of binary coded decimal.

Almost all following computers included these innovations in some form. This basic set of features is called a "complex instruction set computer," or CISC (pronounced "sisk").

In many CISCs, an instruction could access either registers or memory, usually in several different ways. This made the CISCs easier to program, because a programmer could remember just thirty to a hundred instructions, and a set of three to ten "addressing modes," rather than thousands of distinct instructions. This was called an "orthogonal instruction set." The PDP-11 and Motorola 68000 architecture are examples of nearly orthogonal instruction sets.

1970s: Large Scale Integration

In the 1960s, the apollo guidance computer and minuteman missile made the integrated circuit economical and practical.

Around 1971, the first calculator and clock chips began to show that very small computers might be possible. The first microprocessor was the 4004, designed in 1971 for a calculator company, and produced by Intel. The 4004 is the direct ancestor of the Intel 80386, even now maintaining some code compatibility. Just a few years later, the word size of the 4004 was doubled to form the 8008.

By the mid-1970s, the use of integrated circuits in computers was commonplace. The whole decade consists of unheavals caused by the shrinking price of transistors.

It became possible to put an entire CPU on a single printed circuit board. The result was that minicomputers, usually with 16-bit words, and 4k to 64K of memory, came to be commonplace.

CISCs were believed to be the most powerful types of computers, because their microcode was small and could be stored in very high-speed memory.

Custom CISCs were constructed using "bit slice" computer logic such as the AMD 2900 chips, with custom microcode. A bit slice component is a piece of an ALU, register file or microsequencer. Most bit-slice integrated circuits were 4-bits wide.

By the late 1970s, the PDP-11 was developed, arguably the most advanced small computer of its day. Almost immediately, 32-bit CISCs were introduced, VAX and PDP-10.

Also, to control a cruise missile, Intel developed a more-capable version of its 8008 microprocessor, the 8080.

IBM continued to make large, fast computers. However the definition of large and fast now meant more than a megabyte of RAM, clock speeds near a hundred megahertz, and tens of megabytes of disk drives.

IBM's System 370 was a version of the 360 tweaked to run virtual computing environments. The virtual computer was developed in order to reduce the possibility of an unrecoverable software failure.

The Burroughs B300 series reached its largest market share. It was a stack computer programmed in a dialect of Algol. It used 64-bit fixed-point arithmetic, rather than floating-point.

All these different developments competed madly for marketshare.

Early 1980s: The Lessons of RISC

In the early 1980s, researchers at UC Berkeley and IBM both discovered that most computer languages produced only a small subset of the instructions of a CISC. Much of the power of the CPU was simply being ignored in real-world use. They realized that by making the computer simpler, less orthogonal, they could make it faster and less expensive at the same time.

At the same time CPUs were growing faster in relation to the memory they addressed. Designers also experimented with using large sets of internal registers. The idea was to cache intermediate results in the registers under the control of the compiler. This also reduced the number of addressing modes and orthogonality.

The computer designs based on this theory were called Reduced Instruction Set Computers, or RISC. RISCs generally had larger numbers of registers, accessed by simpler instructions, with a few instructions specifically to load and store data to memory. The result was a very simple core CPU running at very high speed, supporting the exact sorts of operations the compilers were using anyway.

One downside to the RISC design has been that the programs that run on them tend to be larger. That's because compilers have to generate longer sequences of the simpler instructions to accomplish the same results. Since these instructions need to be loaded from memory anyway, the larger code size offsets some of the RISC design's fast memory handling.

Recently, engineers have found ways to compress the reduced instruction sets so they fit in even smaller memory systems than CISCs. Examples of such compression schemes include the ARM's "Thumb" instruction set. In applications that do not need to run older binary software, compressed RISCs are coming to dominate sales.

Another approach to RISCs was the "niladic" or "zero-address" instruction set. This approach realized that the majority of space in an instruction was to identify the operands of the instruction. These machines placed the operands on a push-down (last-in, first out) stack. The instruction set was supplemented with a few instructions to fetch and store memory. Most used simple caching to provide extremely fast RISC machines, with very compact code. Another benefit was that the interrupt latencies were extremely small, smaller than most CISC machines (a rare trait in RISC machines). The first zero-address computer was developed by Charles Moore, placed six 5-bit instructions in a 32-bit word, and was a precursor to VLIW design (see below: 1990 to Today).

Commercial variants were mostly characterized as "FORTH" machines, and probably failed because that language became unpopular. Also, the machines were developed by defense contractors at exactly the time that the cold war ended. Loss of funding may have broken up the development teams before the companies could perform adequate commercial marketing.

RISC chips now dominate the market for 32-bit embedded systems. Smaller RISC chips are even becoming common in the cost-sensitive 8-bit embedded-system market. The main market for RISC CPUs has been systems that require low power or small size.

Even some CISC processors (based on architectures that were created before RISC became dominant) translate instructions internally into a RISC-like instruction set. These CISC chips include newer x86 and VAX models.

These numbers may surprise many, because the "market" is perceived to be desktop computers. However desktop computers are only a tiny fraction of the computers now sold. Most people own more computers in their car and house than on their desks. With Intel designs dominating the vast majority of all desktop sales, RISC is found only in the Apple computer lines.

Mid 1980s to Today: Synthesis

In the mid-to-late 1980s, designers began using a technique known as instruction pipelining, in which the processor works on multiple instructions in different stages of completion. For example, the processor may be retrieving the operands for the next instruction while calculating the result of the current one. Modern CPUs may use over a dozen such stages.

A similar idea, introduced only a few years later, was to execute multiple instructions in parallel on separate arithmetic-logic units (ALUs). Instead of operating on only one instruction at a time, the CPU will look for several similar instructions that are not dependent on each other, and run them all at the same time. The results are then interleaved when they exit, making it look like a single CPU was running (say) twice as fast while still using only one bus.

This approach, referred to as scalar processor design, is limited by the degree of instruction level parallelism (ILP), the number of non-dependent instructions in the program code. Some programs are able to run very well on scalar processors, notably graphics. However more general problems require complex logic, and this almost always results in instructions whose results are based on other results -- thus making them unable to run in parallelized forms.

Branching is one major culprit. For eample, the program might add two numbers and branch to a different code segment if the number is bigger than a third number. In this case even if the branch operation is sent to the second ALU for processing, it still must wait for the results from the addition. It thus runs no faster than if there were only one ALU.

To get around this limit, so-called superscalar designs were developed. Additional logic in the CPU looks at the code as it is being sent into the CPU, and "forces" it to be parallel. In the branching case a number of solutions are applied, including loking at past examples of the branch to see which one is most common (called branch prediction), and simply running that case as if there was no branch at all. A similar concept is speculative execution, where both sides of a branch are run at the same time, and the results of one or the other are thrown out once the answer is known.

These advances, which were originally developed from research for RISC-style designs, allow modern CISC processors to execute twelve or more instructions per clock cycle, when traditional CISC designs could take twelve or more cycles to execute just one instruction.

The resulting microcode is complex and error-prone, mostly due to the dependencies between different instructions. Furthermore, the electronics to coordinate these ALUs require more transistors, increasing power consumption and heat. In this respect RISC is superior because the instructions have less interdependence and make superscalar implementations easier. However, as Intel has demonstrated, the concepts can be applied to a CISC design, given enough time and money.

Historical note: Most of these techniques (pipelining, branch prediction, etc.) were originally developed in the late 1950s by IBM on their Stretch mainframe computer.

1990 to Today: Looking Forward

The microcode that makes a superscalar processor is just-- computer code. In the early 1990s, a significant innovation was to realize that the coordination of a multiple-ALU computer could be moved into the compiler, the software that translates a programmer's instructions into machine-level instructions.

This type of computer is called a very long instruction word (VLIW) computer.

Parallelizing the code in the compiler has many practical advantages over doing so in the CPU.

Oddly, speed is not one of them. With enough transistors, the CPU could do everything at once. However all all those transistors make the chip larger, and therefore more expensive. The transistors also use power, which means that they generate heat that must be removed. The heat also makes the design less reliable.

Since compiling happens only once on the developer's machine, the control logic is "canned" in the final realization of the program. This means that it consumes no transistors, and no power, and therefore is free, and generates no heat.

The resulting CPU is simpler, and runs at least as fast as if the prediction were in the CPU.

There were several unsuccessful attempts to commercialize VLIW. The basic problem is that a VLIW computer does not scale to different price and performance points, as a microprogrammed computer can.

Also, VLIW computers maximize throughput, not latency, so they were not attractive to the engineers designing controllers and other computers embedded in machinery. The embedded systems markets had often pioneered other computer improvements by providing a large market that did not care about copatibility with older software.

In January 2000, a company called Transmeta took the interesting step of placing a compiler in the central processing unit, and making the compiler translate from a reference byte code (in their case, x86 instructions) to an internal VLIW instruction set. This approach combines the hardware simplicity, low power and speed of VLIW RISC with the compact main memory system and software reverse-compatibility provided by popular CISC.

Later this year (2002), Intel intends to release a chip based on what they call an Explicitly Parallel Instruction Computer (EPIC) design. This design supposedly provides the VLIW advantage of increased instruction throughput. However, it avoids some of the issues of scaling and complexity, by explicitly providing in each "bundle" of instructions information concerning their dependencies. This information is calculated by the compiler, as it would be in a VLIW design. The early versions will also be reverse-compatible with current x86 software by means of an on-chip emulation mode.

Also, we may soon see multi-threaded CPUs. Current designs work best when the computer is running only a single program, however nearly all modern operating systems allow the user to run multiple programs at the same time. For the CPU to change over and do work on another program requires an expensive context-switch. In contrast, a multi-threaded CPU could handle instructions from multiple programs at once.

To do this, such CPU's include several sets of registers. When a context switch occurs the contents of the "working registers" are simply copied into one of a set of registers for this purpose.

Such designs often include thousands of registers instead of hundreds as in a typical design. On the downside, registers tend to be somewhat expensive in chip space needed to implement them. This chip space might otherwise be used for some other purpose.

Another track of development is to combine reconfigurable logic with a general-purpose CPU. In this scheme, a special computer language compiles fast-running subroutines into a bit-mask to configure the logic. Slower, or less-critical parts of the program can be run by sharing their time on the CPU. This process has the capability to create devices such as software radios, by using digital signal processing to perform functions usually performed by analog electronics.

As the lines between hardware and software increasingly blur due to progress in design methodology and availability of chips such as FPGAs and cheaper production processes, even open-source hardware has begun to appear. Loosely-knit communities like OpenCores have recently announced completely open CPU architectures such as the OpenRISC which can be readily implemented on FPGAs or in custom produced chips, by anyone, without paying license fees.

IBM does have a large bed, VERY large. Of course, on the other hand, IBM may sort of NEED to get all of these customers on-board if they want to stay on top of the CPU development game. We've seen in recent years that developing new, high-end CPUs is a VERY expensive proposition and not one that can be maintained unless you a REALLY big shot. Right now there are only 5 companies in the world left designing top-end general purpose processors, and two of them (Fujitsu and Sun) are really struggling and talking about merging. Digital/Compaq/HP have all gotten out of the business, with Alpha essentially dead and PA-RISC being replaced by Intel's Itanium. SGI is out, Motorola hasn't been competitive for years, and everyone else either abandoned the market a long time ago or have slowly faded out of the picture.

Their are a lot more companies that are designing chips than fabbing them. Only three companies are actively designing and fabbing state of the art chips: Intel, AMD and IBM. AMD is moving its CPU manufacturing over to IBM soon so that will only leave IBM and Intel. Motorola does its own CPU design and manufacturing but seems to be amputating that division from the rest of the company. TI, TSMC and Toshiba are a few examples of a company doing manufacturing for another company's chip design.

IBM seems to be designing and manufacturing the CPU's for Nintendo, Sony and now Microsoft. IBM seems to have put a very good spin on PPC technology to be doing the CPU design.

Embedded Design

The majority of computer systems in use today are embedded in other machinery, such as telephones, clocks, appliances, vehicles, and infrastructure. These "embedded systems" usually have small requirements for memory, modest program sizes, and often simple but unusual input/output systems. For example, most embedded systems lack keyboards, screens, disks, printers, or other recognizable I/O devices of a personal computer. They may control electric motors, relays or voltages, and read switches, variable resistors or other electronic devices. Often, the only I/O device readable by a human is a single light-emitting diode, and severe cost or power constraints can even eliminate that.

In contrast to general-purpose computers, embedded systems often seek to minimize interrupt latency over instruction throughput.

For example, low-latency CPUs generally have relatively few registers in their central processing units. When an electronic device causes an interrupt, the intermediate results, the registers, have to be saved before the software responsible for handling the interrupt can run, and then must be put back after it is finished. If there are more registers, this saving and restoring process takes more time, increasing the latency.

Other design issues

Another common problem involves virtual memory. Historically, random-access memory has been thousands of times more expensive than rotating mechanical storage. For businesses, and many general computing tasks, it is a good compromise to never let the computer run out of memory, an event which would halt the program, and greatly inconvenience the user.

Instead of halting the program, many computer systems save less-frequently used blocks of memory to the rotating mechanical storage. In essence, the mechanical storage becomes main memory. However, mechanical storage is thousands of times slower than electronic memory.

Thus, almost all general-purpose computing systems use "virtual memory" and also have unpredictable interrupt latencies.

A few operating system contain a real-time scheduler. Such a scheduler keeps critical pieces of code and data in solid-state RAM and guarantees a minimum amount of CPU time and a maximum interrupt latency.

One interesting near-term possibility would be to eliminate the bus. Modern vertical laser diodes enable this change. In theory, an optical computer's components could directly connect through a holographic or phased open-air switching system. This would provide a large increase in effective speed and design flexibility, and a large reduction in cost. Since a computer's connectors are also its most likely failure point, a busless system might be more reliable, as well.

Another farther-term possibility is to use light instead of electricity for the digital logic itself. In theory, this could run about 30% faster and use less power, as well as permit a direct interface with quantum computational devices. The chief problem with this approach is that for the foreseeable future, electronic devices are faster, smaller (i.e. cheaper) and more reliable. An important theoretical problem is that electronic computational elements are already smaller than some wavelengths of light, and therefore even wave-guide based optical logic may be uneconomic compared to electronic logic. We can therefore expect the majority of development to focus on electronics, no matter how unfair it might seem. See also optical computing.

Yet another possibility is the "clockless CPU." Unlike conventional processors, this processor has no central clock to coordinate the progress of data through the pipeline; instead, this unit coordinates stages of the CPU using logic devices called "pipline controls" or "FIFO sequencers." Basically, the pipline controller clocks the next stage of logic when the existing stage is complete. In this way, a central clock is unnecessary. The advantage is that components can run at different speeds in the clockless CPU. In a clocked CPU, no component can run faster than the clock rate.