DCTP - Hitachi's 200 MHz SH-4

Hitachi 200 MHz SH-4

SEGA had to weigh a number of considerations when deciding on a CPU for the Dreamcast. The most important were performance-to-cost ratio, manufacturing volume, and heat dissipation.

Why SEGA went with the Hitachi SH-4:

very high performance at a low cost
    360 MIPS, 1.4 GigaFLOPS for around $30 in 1998
Hitachi has the manufacturing ability to provide millions of SH-4s
the SH-4 does not need a fan for heat dissipation (heat sink only)
    reducing cost
    simplifying system design
    smaller, less expensive, lower weight power supply needed
SEGA's software engineers are familiar with programming SH series chips
SEGA has good relations with Hitachi
SEGA had some input in the design of the SH-4
    floating point unit designed to excel at calculating matrix math arrays
able to run Windows CE

Details: Hitachi's SH-4

Hitachi 200 MHz SH-4	Details
Design	Hitachi
Family	SH series
Manufacturer	Hitachi
Clock Rate	200 MHz
MIPS Rating	360 MIPS
Floating Point Rating	1400 MFLOPS¹ (900 MFLOPS sustained with external memory)
Pipeline	5 stages
Superscalar	yes
Instruction Cache	8 KByte
Data Cache	16 KByte
Data Bus	64-bit
Bus Frequency	100 MHz
Bus Bandwidth	800 MB/sec
Power Dissipation	1.5 Watts Typical (@200MHz)
IC Process	0.25 µm, four-layer metal
Transistors	3.2 million
Die Size	42.25 mm² (6.5 mm x 6.5 mm)
Package	256-pin ball grid array (BGA)
Availability (samples)	January, 1998
Availability (production quantities)	3rd Quarter 1998
Price (10K)	4,000 yen (US$31.70)²

(¹) The SH-4 cannot sustain 1400 MFLOPS with external memory because its data cache cannot be reloaded fast enough to keep the floating point unit fed with data.
(²) Based on the yen to US dollar conversion rate for November 19th, 1997.

The Jewel in SEGA's Crown

There is little doubt that the SH-4 was the best CPU that Sega could have chosen for the Dreamcast. In 1997 the SH-4 was unbeatable for its combination of factors that were perfect for a console: low cost, an efficient integer instruction set (16-bit instruction size), and a very powerful vector unit that can sustain 10 million polygon transformations per second.

SH-4 (SH7750-type) Die Diagram

SH-4 Specifications

200 MHz

360 integer MIPS (Dhrystone 1.1 benchmark)

32-bit integer unit

2-way superscalar

5 stage pipeline

8 KByte instruction cache

16 KByte data cache

64-bit floating point unit

1.4 GFlops (0.9 GFlops sustained), 5-million polygon capability

64-bit external bus (256 pin package)

800 MBytes/second bus bandwidth with 100 MHz SDRAM

Glueless memory bus interface to SGRAM and SDRAM

Internal power of 1.8 V / 3.3 V (I/O)

1.5 W (typ.) heat dissipation (at 200 MHz)

0.25 µm, four-layer metal CMOS process

42.25 mm² die size (6.5 mm x 6.5 mm)

208-pin quad flat package (QFP) or 256-pin ball grid array (BGA) package

SH-4 (SH7750-type) Block Diagram

SH-4 Peripherals
As the block diagram shows, the SH-4 comes with a wealth of on-chip peripherals: an interrupt controller (INTC), three versatile timers (TMU), a real-time clock (RTC), two serial interface channels (SCI), a user break controller (UBC), and a programmable power management controller. The direct memory access controller (DMAC) has four channels and is excellent for moving blocks of memory around with almost no CPU intervention. This makes for efficient transfers of data from main memory to graphics memory, for example.

16-bit Instructions
The instructions on the SH-4 are fixed-length and 16 bits in size. This produces code that is up to 40 percent smaller than on RISC processors that use 32-bit fixed-length instructions. The MIPS architecture, for example, uses 32-bit fixed-length instructions, so a program that was 4 Megabytes in size on a MIPS CPU would be only about 2.4 Megabytes on the SH-4. A second advantage of 16-bit instructions is that they lower the bandwidth needed to fetch instructions onto the chip, leaving more bandwidth for moving data on and off the chip. A third advantage is that the on-chip instruction cache does not have to be as large as that of a comparable RISC processor with 32-bit fixed-length instructions, which allows a smaller chip that is cheaper to produce and generates less heat. Software written for the previous generations of the SH series, the SH-1, SH-2 and SH-3, will be able to run on the SH-4.
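The code-size argument above is simple arithmetic, and a small sketch may make it easier to check. The function names here are hypothetical, and the 40 percent figure is taken from the text as a best case rather than measured: a 40 percent reduction on 4 Megabytes works out to 2.4 Megabytes, in line with the figure quoted above. The second helper shows why a cache of a given size holds twice as many 16-bit instructions as 32-bit ones.

```python
# Rough code-size arithmetic for 16-bit vs 32-bit fixed-length instructions.
# The 40 percent reduction is the "up to" figure from the text, not a measurement.

def estimated_size_mb(size_32bit_mb, reduction=0.40):
    """Program size after applying the stated code-size reduction."""
    return size_32bit_mb * (1.0 - reduction)

def instructions_per_cache(cache_bytes, instruction_bits):
    """How many fixed-length instructions fit in an instruction cache."""
    return cache_bytes // (instruction_bits // 8)

sh4_size_mb = estimated_size_mb(4.0)                  # 4 MB program on a 32-bit RISC
slots_16bit = instructions_per_cache(8 * 1024, 16)    # 8 KByte cache, 16-bit instructions
slots_32bit = instructions_per_cache(8 * 1024, 32)    # same cache, 32-bit instructions

print(f"Estimated SH-4 program size: {sh4_size_mb:.1f} MB")
print(f"8 KByte cache holds {slots_16bit} 16-bit or {slots_32bit} 32-bit instructions")
```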

Superscalar
A superscalar processing unit is capable of executing two or more instructions at the same time. The instructions on the SH-4 can be classified into four groups: integer, simple integer/load/store, branch and floating point. Any two instructions can be processed in parallel as long as they are from different groups. An integer and a floating point instruction can be processed in parallel, for example, but two branch instructions cannot.
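The pairing rule described above can be written as a very small check. This is only a sketch of the rule as this article states it: the group names and the `can_dual_issue` function are hypothetical labels, and the real chip's issue rules are more detailed than "different groups may pair".

```python
# Simplified model of the dual-issue rule described in the text:
# two instructions may issue together only if they belong to different groups.
# Group names are illustrative labels, not Hitachi's terminology.

GROUPS = {"integer", "simple_integer_load_store", "branch", "floating_point"}

def can_dual_issue(first_group, second_group):
    """Return True if two instructions from these groups could issue in parallel."""
    for group in (first_group, second_group):
        if group not in GROUPS:
            raise ValueError(f"unknown instruction group: {group}")
    return first_group != second_group

print(can_dual_issue("integer", "floating_point"))  # different groups
print(can_dual_issue("branch", "branch"))           # same group
```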

External Data Path
The SH-4 is a very flexible architecture in that the width of the external data path can be 8, 16, 32 or 64 bits. This allows the SH-4 to boot from an economical 8-bit ROM, for example, and then use high-speed 64-bit SDRAM. The external bus unit provides a glueless interface to SDRAM, which helps keep down the cost of designing a console around this CPU. With the 64-bit data bus, 100 MHz SDRAM can transfer up to 800 MB/sec, which lends itself well to 2D games. The Dreamcast is a 2D/3D powerhouse thanks to the inclusion of the SH-4.
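The 800 MB/sec figure follows directly from the bus width and clock. The sketch below assumes one transfer per bus clock and uses a hypothetical helper name; it reports peak rates only and ignores refresh, page misses and other real-world overheads.

```python
# Peak external bus bandwidth = bytes per transfer x transfers per second.
# Assumes one transfer per bus clock; real sustained rates will be lower.

def peak_bandwidth_mb_per_s(bus_width_bits, clock_mhz):
    """Peak transfer rate in MB/sec for a bus of the given width and clock."""
    return (bus_width_bits // 8) * clock_mhz

for width in (8, 16, 32, 64):
    rate = peak_bandwidth_mb_per_s(width, 100)
    print(f"{width:2d}-bit bus at 100 MHz: {rate} MB/sec")
```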

Cache
The 24 KBytes of cache consist of an 8 KByte instruction cache and a 16 KByte data cache. This split cache design provides higher performance than a unified cache design. The data cache on the SH-4 supports both write-back (WB) and write-through (WT) modes of operation.

Memory Management Unit
The MMU on the SH-4 provides full Microsoft Windows CE compatibility. It supports page sizes of 1 KByte, 4 KByte, 64 KByte, and 1 MByte, which Windows CE can use to partition memory and to provide memory protection between the different processes executing on the SH-4. Memory protection is important so that different execution threads do not interfere with each other's memory spaces, which can cause either the operating system or an application to crash.
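To illustrate what a page size means in practice, the sketch below splits a 32-bit address into a page number and an offset within the page for each of the four sizes listed above. It is a generic illustration of paging, not a model of the SH-4's actual translation hardware; the function name and the sample address are made up.

```python
# Splitting a 32-bit address into (page number, offset) for each page size.
# Generic paging arithmetic only; the sample address is arbitrary.

PAGE_SIZES = {
    "1 KByte": 1 << 10,
    "4 KByte": 1 << 12,
    "64 KByte": 1 << 16,
    "1 MByte": 1 << 20,
}

def split_address(address, page_size):
    """Return (page_number, offset) for a power-of-two page size."""
    offset_bits = page_size.bit_length() - 1   # log2 of the page size
    return address >> offset_bits, address & (page_size - 1)

address = 0x12345678
for name, size in PAGE_SIZES.items():
    page, offset = split_address(address, size)
    print(f"{name:>8}: page 0x{page:X}, offset 0x{offset:X}")
```

Larger pages leave fewer page-number bits to translate, so one mapping covers more memory at the cost of coarser protection boundaries.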

Why are floating point values important?
Floating point values, as opposed to integer values, lend themselves to results that are much more accurate. A 32-bit unsigned integer can hold a whole number from 0 to 4,294,967,295, whereas a 32-bit floating point number splits its bits into a sign, an 8-bit exponent (256 possible scale factors) and a mantissa with 24 bits of precision, which resolves steps as small as about 0.000000059605 (2^-24) relative to the size of the value. Let's say you had a calculation to do like 200 divided by 3. In integer arithmetic the answer would be 66, but the answer in floating point would be 66.666667. By calculating the value on a CPU's integer unit you would lose 0.666667 of accuracy. Floating point accuracy is very important to graphics operations, and the Dreamcast benefits greatly from the power and accuracy of the SH-4's floating point unit.
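The 200 divided by 3 example can be tried directly. The sketch below assumes the 32-bit values follow the IEEE 754 single-precision layout and uses a hypothetical `to_float32` helper that rounds a Python float to that format through the standard `struct` module. Note that a real 32-bit float can only approximate 66.666667, so the last printed digit may differ slightly.

```python
import struct

# Compare integer and 32-bit floating point results for 200 / 3.
# to_float32 is a hypothetical helper: it rounds a Python float (64-bit)
# to the nearest IEEE 754 single-precision value.

def to_float32(value):
    return struct.unpack("<f", struct.pack("<f", value))[0]

integer_result = 200 // 3               # integer division discards the fraction
float_result = to_float32(200 / 3)      # single-precision approximation
lost_by_integer = float_result - integer_result

print("integer result:", integer_result)
print(f"float32 result: {float_result:.4f}")
print(f"fraction lost by integer math: {lost_by_integer:.4f}")
print("smallest relative step (2^-24):", 2.0 ** -24)
```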

Floating Point Unit Excels At 3D Calculations

The SH-4 architecture includes impressive 3D floating point hardware. Each of the four floating point multipliers (fmuls) can receive two 32-bit values and produce a multiplied result that is passed to a four-input floating point adder. This hardware reads two 128-bit vectors (two sets of four 32-bit values) out of the register files, multiplies the four 32-bit pairs at the same time, adds the four products together, and puts the 32-bit result back into the register file. This provides the equivalent of 288-bit data crunching (2 x 128 + 32 = 288).

A typical application for this processing power would be to perform the following inner product instruction, which involves seven operations (four multiplies and three adds):

f0*f4 + f1*f5 + f2*f6 + f3*f7 → f7

The SH-4 can execute this seven-operation instruction in three clock cycles. Yet, because the architecture is fully pipelined, it can issue one of these instructions every cycle.
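As a rough software picture of what that single instruction computes, the sketch below models eight floating point registers as a Python list and performs the four multiplies and three adds explicitly. The function name and the register-list representation are assumptions made for illustration; the hardware does the four multiplies in parallel rather than one after another.

```python
# Software model of the four-element inner product described above:
# f0*f4 + f1*f5 + f2*f6 + f3*f7, with the result written to f7.
# 'regs' stands in for registers f0..f7; this is an illustration only.

def inner_product_into_f7(regs):
    """Return a copy of regs with f7 replaced by the inner product."""
    products = [regs[i] * regs[i + 4] for i in range(4)]   # 4 multiplies
    total = (products[0] + products[1]) + (products[2] + products[3])  # 3 adds
    result = list(regs)
    result[7] = total
    return result

regs = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]   # f0..f3 and f4..f7
after = inner_product_into_f7(regs)
print("f7 after the operation:", after[7])
```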

The figure (above, right) shows a better example of what the SH-4's floating point hardware can accomplish. Here the back register file is loaded with 16 values and the hardware performs the following matrix operation in seven clock cycles:

         f0*b0 + f1*b1 + f2*b2 + f3*b3 → f0
         f0*b4 + f1*b5 + f2*b6 + f3*b7 → f1
         f0*b8 + f1*b9 + f2*b10 + f3*b11 → f2
         f0*b12 + f1*b13 + f2*b14 + f3*b15 → f3

The SH-4 is fully pipelined, and the RISC architecture can repeat these 16 fmuls and 12 fadds (28 operations) every four clock cycles, for an average of seven floating point operations per cycle. The superscalar CPU and double-precision fmov allow registers to be loaded from, and stored to, cache during these four cycles, so the operations are sustainable. At its 200-MHz clock speed, the SH-4 achieves 1.4 GFlops (7 operations per cycle x 200 MHz), sustained for as long as the operands can be supplied from the cache.
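The matrix operation and the throughput figure can both be checked with a short sketch. The `transform` function below is a hypothetical software equivalent of the four equations above: it reads all four inputs f0..f3 before writing any results, and indexes the 16 back-register values b0..b15 in the same order as the equations. The throughput arithmetic simply restates the numbers from the text.

```python
# Software equivalent of the 4x4 matrix transform described above.
# f holds f0..f3; b holds b0..b15 in the order used by the equations.
# All inputs are read before any result is written.

def transform(f, b):
    """Return the new [f0, f1, f2, f3] after the matrix operation."""
    result = []
    for row in range(4):
        r = b[4 * row: 4 * row + 4]
        # 4 multiplies and 3 adds per row
        result.append(f[0] * r[0] + f[1] * r[1] + f[2] * r[2] + f[3] * r[3])
    return result

identity = [1.0, 0.0, 0.0, 0.0,
            0.0, 1.0, 0.0, 0.0,
            0.0, 0.0, 1.0, 0.0,
            0.0, 0.0, 0.0, 1.0]
vertex = [2.0, 3.0, 4.0, 1.0]
print("transformed vertex:", transform(vertex, identity))

# Throughput arithmetic from the text.
multiplies, adds = 16, 12
ops_per_transform = multiplies + adds          # 28 operations
ops_per_cycle = ops_per_transform // 4         # one transform every 4 cycles
peak_flops = ops_per_cycle * 200_000_000       # 200 MHz clock
print(f"{ops_per_cycle} operations per cycle, {peak_flops / 1e9:.1f} GFlops peak")
```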

The high floating point power of the SH-4 lends itself to a console system capable of dynamic, complex polygonal environments, with characters in those environments made of a high number of polygons so that they look very detailed. Rendering with a graphics chip then adds effects such as texture mapping, texture filtering, lens flare, smoke, fog, transparencies, shading and dynamic lighting.

A 16/32/64/128 bit CPU?
The SH-4 is a multiple bit CPU. Here is a list of the different functions, their bit sizes and the benefit of each particular bit size.

Function	Bits	Benefit
Instructions	16 bits	40 percent smaller size than comparable 32-bit RISC
CPU Precision	32 bits	More accurate results compared to 16-bit values
Memory Addressing	32 bits	4 GB memory access, which is overkill for a console that will not have more than 32 MB of memory
External Data Bus	64 bits	800 MB/sec transfer rate with 100 MHz SDRAM
Floating Point Precision	64 bits	More accuracy in mantissa portion of number, which aids graphics operations
Floating Point Bus	128 bits	3.2 GB/sec transfer rate from the data cache for matrix data aligned together as four separate 32-bit values like this: [32-bit][32-bit][32-bit][32-bit]

CPUs are classified by their integer size, and the SH-4 supports 32-bit integer values, so the SH-4 is classified as a 32-bit processor. By the way, a 64-bit integer unit would be overkill for most console workloads; Microsoft's X-Box, for example, is also built around a 32-bit CPU.

Other Documents

Hitachi SH-4 Document - November/December 1997