Hitachi 200 MHz SH-4
SEGA had to weigh a number of considerations when choosing a CPU for the
Dreamcast. The most important were the performance-to-cost ratio,
manufacturing volume, and heat dissipation.
Why SEGA went with the Hitachi SH-4:
- very high performance at a low cost
  - 360 MIPS, 1.4 GigaFLOPS for around $30 in 1998
- Hitachi has the manufacturing ability to provide millions of SH-4s
- the SH-4 does not need a fan for heat dissipation (heat sink only)
  - reducing cost
  - simplifying system design
  - smaller, less expensive, lower weight power supply needed
- SEGA's software engineers are familiar with programming SH series chips
- SEGA has good relations with Hitachi
- SEGA had some input in the design of the SH-4
  - floating point unit designed to excel at calculating matrix math arrays
- able to run Windows CE
Details: Hitachi's SH-4
Hitachi 200 MHz SH-4 | Details
Design | Hitachi
Family | SH series
Manufacturer | Hitachi
Clock Rate | 200 MHz
MIPS Rating | 360 MIPS
Floating Point Rating | 1400 MFLOPS* (900 MFLOPS sustained with external memory)
Pipeline | 5 stages
Superscalar | yes
Instruction Cache | 8 KByte
Data Cache | 16 KByte
Data Bus | 64-bit
Bus Frequency | 100 MHz
Bus Bandwidth | 800 MB/sec
Power Dissipation | 1.5 Watts Typical (@200MHz)
IC Process | 0.25 µm, four-layer metal
Transistors | 3.2 million
Die Size | 42.25 mm² (6.5 mm x 6.5 mm)
Package | 256-pin ball grid array (BGA)
Availability (samples) | January, 1998
Availability (production quantities) | 3rd Quarter 1998
Price (10K) | 4,000 yen (US$31.70)**
(*) The SH-4 cannot sustain 1400 MFLOPS because its data cache cannot
be reloaded fast enough to keep the floating point unit fed with data.
(**) Based on the yen to US dollar conversion rate for November 19th, 1997.
The Jewel in SEGA's Crown
The SH-4 was arguably the best CPU that SEGA could have chosen for the
Dreamcast. In 1997 it was hard to beat for its combination of factors
suited to a console: low cost, a compact integer instruction set (16-bit
instructions), and a very powerful vector unit that can sustain 10 million
polygon transformations per second.
[Figure: SH-4 (SH7750-type) Die Diagram]
SH-4 Specifications
- 200 MHz
- 360 integer MIPS (Dhrystone 1.1 benchmark)
- 32-bit integer unit
- 2-way superscalar
- 5 stage pipeline
- 8 KByte instruction cache
- 16 KByte data cache
- 64-bit floating point unit
- 1.4 GFlops (0.9 GFlops sustained), 5-million polygon capability
- 64-bit external bus (256 pin package)
- 800 MBytes/second bus bandwidth with 100 MHz SDRAM
- Glueless bus memory interface to SGRAM and SDRAM
- Internal power of 1.8 V / 3.3 V (I/O)
- 1.5 W (typ.) heat dissipation (at 200 MHz)
- 0.25 µm, four-layer metal CMOS process
- 42.25 mm² die size (6.5 mm x 6.5 mm)
- 208-pin quad flat package (QFP) or 256-pin ball grid array (BGA) package
[Figure: SH-4 (SH7750-type) Block Diagram]
SH-4 Peripherals
As the block diagram shows, the SH-4 comes with a wealth of on-chip
peripherals: an interrupt controller (INTC), three versatile timers (TMU),
a real-time clock (RTC), two serial interface channels (SCI), a user break
controller (UBC), and a programmable power management controller. The
direct memory access controller (DMAC) has four channels and can move
blocks of memory around with almost no CPU intervention. This makes for
efficient transfers of data, for example from main memory to graphics
memory.
16-bit Instructions
The instructions on the SH-4 are fixed-length and 16 bits in size. This
produces code that is up to 40 percent smaller than on RISC processors
that use 32-bit fixed-length instructions. The MIPS architecture, for
example, uses 32-bit fixed-length instructions, so a program that was
4 Megabytes in size on a MIPS CPU would be only about 2.4 Megabytes on the
SH-4. A second advantage of 16-bit instructions is that fetching them
takes less bandwidth, which leaves more bandwidth for moving data on and
off the chip. A third advantage is that the on-chip instruction cache does
not have to be as large as that of a comparable RISC processor with 32-bit
fixed-length instructions, which allows a smaller chip that is cheaper to
produce and generates less heat. Software written for the previous
generations of the SH series, the SH-1, SH-2 and SH-3, will be able to run
on the SH-4.
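The size comparison above can be checked with a little arithmetic. The sketch below is illustrative only: the function name and the 20 percent instruction-count overhead are assumptions, chosen so that the result lines up with the "up to 40 percent" figure. They are not measured values.

```python
def code_size_bytes(instruction_count, bytes_per_instruction):
    """Size of a program made of fixed-length instructions."""
    return instruction_count * bytes_per_instruction

MB = 1024 * 1024

# A 4 MB program on a CPU with 32-bit (4-byte) instructions.
mips_instructions = 4 * MB // 4
mips_size = code_size_bytes(mips_instructions, 4)

# Assumption: the 16-bit encoding needs about 20% more instructions
# to do the same work (for example, because immediates are smaller).
ASSUMED_OVERHEAD = 1.2
sh4_instructions = int(mips_instructions * ASSUMED_OVERHEAD)
sh4_size = code_size_bytes(sh4_instructions, 2)

saving = 1 - sh4_size / mips_size
print(f"32-bit code: {mips_size / MB:.1f} MB, "
      f"16-bit code: {sh4_size / MB:.1f} MB, saving: {saving:.0%}")
```

With no overhead at all the saving would be 50 percent, so the 40 percent figure implies that 16-bit code needs somewhat more instructions for the same work.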
Superscalar
A superscalar processing unit is capable of executing two or more
instructions at the same time; the SH-4 is a 2-way design. The
instructions on the SH-4 can be classified into four groups: integer,
simple integer/load/store, branch and floating point. Any two instructions
can be processed in parallel as long as they are from different groups.
An integer and a floating point instruction can be processed in parallel,
for example, but two branch instructions cannot.
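The pairing rule described above can be written down as a small check. This is a sketch of the simplified rule as stated in the text only; the function and constant names are made up for illustration, and the real chip's pairing rules are likely more detailed.

```python
GROUPS = ("integer", "simple integer/load/store", "branch", "floating point")

def can_issue_together(group_a, group_b):
    """Simplified rule from the text: two instructions may execute in
    parallel only if they belong to different groups."""
    for group in (group_a, group_b):
        if group not in GROUPS:
            raise ValueError(f"unknown instruction group: {group}")
    return group_a != group_b

print(can_issue_together("integer", "floating point"))  # True
print(can_issue_together("branch", "branch"))            # False

# 360 MIPS at 200 MHz is 1.8 instructions per clock cycle, close to
# the ceiling of 2 per cycle for a 2-way superscalar design.
instructions_per_cycle = 360 / 200
print(instructions_per_cycle)
```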
External Data Path
The SH-4 has a flexible external data path whose width can be 8, 16, 32
or 64 bits. This allows the SH-4 to boot from an economical 8-bit ROM, for
example, and then use high speed 64-bit SDRAM. The external bus unit
provides a glueless interface to SDRAM, which helps keep down the cost of
designing a console around this CPU. With the 64-bit data bus, 100 MHz
SDRAM can transfer up to 800 MB/sec, which lends itself well to 2D games.
The Dreamcast is a 2D/3D powerhouse thanks to the inclusion of the SH-4.
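The 800 MB/sec figure follows directly from the bus width and clock. A minimal sketch, assuming one full-width transfer per bus clock and no wait states (the function name is hypothetical):

```python
def peak_bandwidth_mb_per_sec(bus_width_bits, bus_clock_mhz):
    """Peak transfer rate, assuming one transfer per bus clock."""
    bytes_per_transfer = bus_width_bits // 8
    return bytes_per_transfer * bus_clock_mhz

# The four external bus widths the SH-4 supports, all at 100 MHz.
for width in (8, 16, 32, 64):
    rate = peak_bandwidth_mb_per_sec(width, 100)
    print(f"{width:2d}-bit bus at 100 MHz: {rate} MB/sec")
```

Real transfers include setup and refresh overhead, so these are upper bounds rather than expected throughput.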
Cache
The 24 KByte of cache consists of an 8 KByte instruction cache and a
16 KByte data cache. This split cache design provides higher performance
than a unified cache design, because instruction fetches and data accesses
do not compete for the same cache. The data cache on the SH-4 can operate
in write-back (WB) or write-through (WT) mode.
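The difference between the two data cache modes can be illustrated with a toy model. This is a sketch under strong assumptions (every store hits the cache, capacity is unlimited, dirty lines are flushed once at the end, and the 32-byte line size is an assumed parameter); it is not a model of the actual SH-4 cache.

```python
def external_writes(store_addresses, mode, line_size=32):
    """Count writes that reach external memory for a sequence of stores.

    In WT mode the count is individual stores; in WB mode it is
    whole cache lines written out at the end.
    """
    if mode == "WT":
        # Write-through: every store is also sent to external memory.
        return len(store_addresses)
    if mode == "WB":
        # Write-back: a modified line is written out once, however
        # many times it was updated while in the cache.
        dirty_lines = {addr // line_size for addr in store_addresses}
        return len(dirty_lines)
    raise ValueError("mode must be 'WT' or 'WB'")

# 1000 stores that keep updating the same 64 bytes.
stores = [(i * 4) % 64 for i in range(1000)]
print("write-through:", external_writes(stores, "WT"))  # 1000
print("write-back:   ", external_writes(stores, "WB"))  # 2
```

Write-back saves bus traffic when data is updated repeatedly, while write-through keeps external memory up to date at all times, which can matter when another device reads that memory.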
Memory Management Unit
The MMU on the SH-4 provides full Microsoft Windows CE compatibility. It
supports page sizes of 1 KByte, 4 KByte, 64 KByte, and 1 MByte, which
Windows CE can use to partition memory and to provide memory protection
between the different processes executing on the SH-4. Memory protection
matters because it keeps one process from interfering with another's
memory space, which could otherwise crash an application or the operating
system.
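To see what the four page sizes mean in practice, the sketch below splits a 32-bit address into a page number and an offset for each size. The function name and the example address are arbitrary choices for illustration; the real MMU does this in hardware as part of address translation.

```python
PAGE_SIZES = {"1 KByte": 1 << 10, "4 KByte": 1 << 12,
              "64 KByte": 1 << 16, "1 MByte": 1 << 20}

def split_address(virtual_address, page_size):
    """Split an address into (page number, offset within the page)."""
    if page_size & (page_size - 1) != 0:
        raise ValueError("page size must be a power of two")
    return virtual_address // page_size, virtual_address % page_size

addr = 0x0C012345  # arbitrary example address
for name, size in PAGE_SIZES.items():
    page, offset = split_address(addr, size)
    print(f"{name:>8} pages: page {page:#x}, offset {offset:#x}")
```

Larger pages cover more memory per translation entry, while smaller pages allow finer-grained protection.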
Why are floating point values important?
Floating point values, as opposed to integer values, can represent
fractions, so they give much more accurate results for calculations that
do not come out to whole numbers. A 32-bit unsigned integer can hold whole
numbers from 0 to 4,294,967,295. A 32-bit floating point number carries
a 24-bit mantissa, which gives a precision of roughly one part in 16.7
million (2^-24, or about 0.000000059605) relative to the size of the
value. Let's say you had a calculation to do like 200 divided by 3. In
integer arithmetic the answer would be 66, but the answer in floating
point would be about 66.6667. By calculating the value on a CPU's integer
unit you would lose the fractional 0.6667 of the result. Floating point
accuracy is very important to graphics operations. The Dreamcast benefits
greatly from the SH-4 because of the power and accuracy of its floating
point unit.
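The division example can be reproduced directly. Python's own floats are 64-bit, so the sketch below rounds the result to 32 bits with the standard `struct` module to mimic a single-precision value; the helper name `to_float32` is made up for illustration.

```python
import struct

def to_float32(x):
    """Round a Python float (64-bit) to the nearest 32-bit float."""
    return struct.unpack("<f", struct.pack("<f", x))[0]

integer_result = 200 // 3           # integer division drops the fraction
float_result = to_float32(200 / 3)  # value as a 32-bit float would hold it

print(integer_result)               # 66
print(f"{float_result:.4f}")        # 66.6667
print(f"fraction lost by the integer result: "
      f"{float_result - integer_result:.4f}")
```

The 32-bit result is still only an approximation of 66.666..., but its error is in the sixth decimal place rather than the first.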
Floating Point Unit Excels At 3D Calculations
The SH-4 architecture includes impressive 3D floating point hardware. Each
of the four floating point multipliers (fmuls) can receive two 32-bit
values and produce a product that is passed to a four-input floating point
adder. This hardware reads two 128-bit vectors (two sets of four 32-bit
values) out of the register files, multiplies the four 32-bit pairs at the
same time, adds the four products together, and puts the 32-bit result
back into the register file. This provides the equivalent of 288-bit data
crunching (2 x 128 + 32 = 288).
A typical application for this processing power would be the following
transformation instruction, which involves seven operations (four
multiplies and three adds):
f0*f4 + f1*f5 + f2*f6 + f3*f7 → f7
The SH-4 can execute this seven-operation instruction in three clock
cycles. Yet, because the architecture is fully pipelined, it can issue
one of these instructions every cycle.
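In software terms, the instruction above is a four-element inner product. The following sketch shows the same arithmetic and where the seven operations come from; the function and variable names are illustrative and the input values are arbitrary.

```python
def inner_product4(a, b):
    """4-element inner product: four multiplies feeding three adds."""
    if len(a) != 4 or len(b) != 4:
        raise ValueError("both vectors must have four elements")
    products = [x * y for x, y in zip(a, b)]  # 4 multiplies
    # 3 adds combine the four products into one result.
    return products[0] + products[1] + products[2] + products[3]

f0_to_f3 = [1.0, 2.0, 3.0, 4.0]
f4_to_f7 = [0.5, 0.25, 2.0, 1.0]
f7 = inner_product4(f0_to_f3, f4_to_f7)
print(f7)  # 0.5 + 0.5 + 6.0 + 4.0 = 11.0
```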
The figure shows a fuller example of what the SH-4's floating point
hardware can accomplish. Here the back register file is loaded with 16
values and the hardware performs the following matrix operation in seven
clock cycles (all four rows use the original values of f0 to f3):
f0*b0 + f1*b1 + f2*b2 + f3*b3 → f0
f0*b4 + f1*b5 + f2*b6 + f3*b7 → f1
f0*b8 + f1*b9 + f2*b10 + f3*b11 → f2
f0*b12 + f1*b13 + f2*b14 + f3*b15 → f3
Because the SH-4 is fully pipelined, it can repeat these 16 fmuls and
12 fadds (28 operations) every four clock cycles, for an average of seven
floating point operations per cycle. The superscalar CPU and
double-precision fmov allow registers to be loaded from, and stored to,
the cache during these four cycles, so the rate is sustainable as long as
the data is in the cache. At its 200 MHz clock speed, this gives the SH-4
its 1.4 GFLOPS rating; as the details table notes, the sustained rate
drops to about 900 MFLOPS when data has to come from external memory.
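The matrix operation and the throughput figure can both be worked through in a few lines. This is a plain software sketch of the arithmetic, not a model of the hardware; the matrix values (a scale by 2 plus a translation) and the names are chosen only for illustration.

```python
def transform(matrix_rows, vector):
    """Multiply a 4x4 matrix (a list of 4 rows) by a 4-element vector.

    Every output element is computed from the original vector.
    """
    return [sum(m * v for m, v in zip(row, vector)) for row in matrix_rows]

# Example contents for b0..b15: scale by 2, then translate by (1, 2, 3).
b = [[2.0, 0.0, 0.0, 1.0],
     [0.0, 2.0, 0.0, 2.0],
     [0.0, 0.0, 2.0, 3.0],
     [0.0, 0.0, 0.0, 1.0]]
f = [1.0, 1.0, 1.0, 1.0]
print(transform(b, f))  # [3.0, 4.0, 5.0, 1.0]

# Peak rate implied by the text: 16 fmuls + 12 fadds every 4 cycles.
ops_per_transform = 16 + 12
flops_per_cycle = ops_per_transform / 4
peak_gflops = flops_per_cycle * 200e6 / 1e9
print(flops_per_cycle, peak_gflops)  # 7.0 1.4
```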
This floating point power lends itself to a console capable of dynamic,
complex polygonal environments, populated by characters built from a high
number of polygons that make them look very detailed. A graphics chip
then adds rendering effects such as texture mapping, texture filtering,
lens flare, smoke, fog, transparencies, shading and dynamic lighting.
A 16/32/64/128 bit CPU?
The SH-4 uses several different bit widths. Here is a list of the
different functions, their bit sizes and the benefit of each bit size.
Function | Bits | Benefit
Instructions | 16-bits | Up to 40 percent smaller size than comparable 32-bit RISC
CPU Precision | 32-bits | More accurate results compared to 16-bit values
Memory Addressing | 32-bits | 4 GB memory access, which is overkill for a console that will not have more than 32 MB of memory
External Data Bus | 64-bits | 800 MB/sec transfer rate with 100 MHz SDRAM
Floating Point Precision | 64-bits | More accuracy in mantissa portion of number, which aids graphics operations
Floating Point Bus | 128-bits | 3.2 GB/sec transfer rate from the data cache for matrix data aligned together as four separate 32-bit values like this: [32-bit][32-bit][32-bit][32-bit]
CPUs are classified by their integer size, and the SH-4 supports 32-bit
integer values, so the SH-4 is classified as a 32-bit processor. By the
way, a 64-bit integer processor would be overkill for most console
workloads. Microsoft's Xbox, for example, also uses a 32-bit CPU.
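Two of the figures in the table above can be derived from the bit widths. The sketch below assumes that the 128-bit floating point path moves one full-width value per 200 MHz CPU clock, which is an assumption that happens to reproduce the quoted 3.2 GB/sec; the function names are hypothetical.

```python
def address_space_bytes(address_bits):
    """Number of distinct byte addresses for a given address width."""
    return 2 ** address_bits

def bus_rate_bytes_per_sec(width_bits, clock_hz):
    """Peak rate, assuming one full-width transfer per clock."""
    return width_bits // 8 * clock_hz

GB = 2 ** 30
addressable_gb = address_space_bytes(32) / GB
fp_bus_gb_per_sec = bus_rate_bytes_per_sec(128, 200e6) / 1e9

print(addressable_gb)     # 4.0 (GB reachable with 32-bit addresses)
print(fp_bus_gb_per_sec)  # 3.2 (GB/sec over the 128-bit path)
```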
Other Documents
- Hitachi SH-4 Document - November/December 1997