Eliminating massive clock trees in SoC designs using GALS
( 01 Feb 2007 )
by Mohit Arora, Senior Design Engineer, Transportation and Standard Products Group, Freescale Semico
As high-speed I/O buses such as SATA II and PCI Express (2.5 to 80Gbps) are implemented in SoC designs, it is becoming increasingly difficult to meet timing constraints. Combining many separately developed IP functional blocks from different vendors, each with different specifications for both local and system clocks, results in massive clock tree structures. Clock skew is now comparable to the clock period and can no longer be treated as negligible.
Moreover, verifying the third-party IP blocks at the system level poses another set of integration issues.
This article proposes a new methodology for IP integration that uses globally-asynchronous, locally-synchronous (GALS) clock circuits, which significantly reduce the power consumed by the SoC by providing a smaller clock tree structure. Using GALS also makes it simpler to meet timing requirements and reduces the amount of system-level verification needed for third-party IP blocks.
TIMING PROBLEM
Conventional SoC designs consist of an interconnect bus with many IP blocks combined with a multiplicity of coupled timing domains. Since chip size and speed are continually increasing (per Moore’s Law), meeting the timing requirements of a system with hundreds of IP elements has become a major challenge. Conventional design methodologies have resulted in massive clock tree structures, which substantially increase the average power consumed by the SoC.
With clock requirements being the culprit, it would appear that a simple solution would be to eliminate the clock completely. Unfortunately, making a SoC design completely asynchronous (without a system clock) is in practice quite difficult today, one reason being the lack of mature CAD tools.
Moreover, aggressive time-to-market pressure means that the SoC industry has been forced to commit to delivering silicon that is right the first time, so designers must re-use existing IP blocks with minimal integration design effort. The result is that there has effectively been no change in the methodology of IP integration since the first SoC designs. As SoC complexity increases exponentially, full functional verification has become a major challenge. SoC integrators are often unable to spend the time to thoroughly understand the IP blocks purchased from third-party vendors and, to meet project time constraints, are interested only in integrating the blocks.
The initial design approach of SoC designers was to select the IP blocks needed to meet application requirements, place them on silicon and connect them with a standard on-chip bus. As was the case with multimillion-gate ASICs containing many connected IP blocks, today's SoC cannot be built around a single bus. Instead, complex hierarchies of buses are used, with sophisticated protocols and multiple bridges between them (Figure 1). Communication between any two IP blocks may pass through several buses, which makes timing requirements much harder to meet. Essentially, bus-based interconnects are being stretched to the point where they cannot be scaled further.
SoC designers face a basic paradox in today's environment: rather than enjoying significant time savings by using acquired IP blocks, they spend additional time in learning the function of the blocks in order to build the logic and test vectors for these blocks. Except for the vendors of processor cores, IP vendors typically provide little of the detailed documentation designers need. Consequently, designers find they have to acquire some level of application expertise or use consulting resources to understand the IP well enough to complete these tasks. This additional design and verification burden currently adds months to SoC design projects. Besides imposing a drain on resource-strapped projects, the additional logic inevitably degrades performance and increases chip area, while the additional test requirements further complicate final test stages.
CMOS feature size is decreasing and, according to Moore's Law, is expected to be on the order of 22nm by 2016, with clock frequencies reaching around 28.7GHz. It is clear that interconnect speed is not keeping up with increases in transistor speed. This means that in future circuits, wire delay will no longer be negligible, but will play a major role in determining the maximum frequency at which a circuit can operate. In line with these clocking trends, global clock skew becomes an increasing fraction of the clock period.
Examining all these issues makes it clear that a new interconnect strategy is required to bring design risks back under control: large high-speed integrated circuits will eventually need to be designed without global clocking. This requirement could be fulfilled by the GALS circuits that will be discussed.
Eliminating the clock completely is not likely to happen very soon, but a new GALS method has been developed in which communication between two synchronous blocks can be performed asynchronously without the need for a master clock.
As per the International Technology Roadmap for Semiconductors (ITRS, ex. SIA), 1999 edition: "With clock speed possibly exceeding GHz, and across-chip communication taking upwards of five to 20 clock cycles, an approach is needed to building a hierarchy of clock speeds with locally synchronous and globally asynchronous interconnects. Tools to handle asynchronous, multicycle interconnect as well as locally synchronous, high performance near neighbor communication are needed."
Figure 2 shows two cases: the first is a normal synchronous system with a master clock; the second is a GALS system in which two blocks talk to each other through a handshake interface. Though each block has its own local clock, the overall system works without a global clock.
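The article does not say which handshake protocol the blocks in Figure 2 use. As a rough illustration only, the sketch below assumes a four-phase (return-to-zero) request/acknowledge handshake, one common choice for asynchronous interfaces. The function name `four_phase_transfer` and the signal names `req` and `ack` are hypothetical, and the model is purely sequential: it records the order of signal transitions rather than simulating two concurrent clock domains.

```python
def four_phase_transfer(items):
    """Send items one at a time; return (received, trace).

    trace records (req, ack, data) after every signal transition.
    Assumption: four-phase handshake; the article does not specify one.
    """
    req, ack = 0, 0
    received, trace = [], []
    for data in items:
        req = 1                    # sender: data is stable, raise request
        trace.append((req, ack, data))
        received.append(data)      # receiver: latch data using its local clock
        ack = 1                    # receiver: raise acknowledge
        trace.append((req, ack, data))
        req = 0                    # sender: saw ack, withdraw request
        trace.append((req, ack, data))
        ack = 0                    # receiver: saw req fall, return to idle
        trace.append((req, ack, data))
    return received, trace


if __name__ == "__main__":
    received, trace = four_phase_transfer(["A", "B"])
    for state in trace:
        print(state)
```

Each transfer is paced by the two control signals rather than by a shared clock edge, which is the property that lets each block keep its own local clock.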
In SoC designs, the GALS architecture helps to solve the increasingly difficult problem of integrating multiple clock domains into a single chip. Current synchronous solutions involve a number of inefficient design tricks, such as using Gray code on a dual-ported SRAM to act as an interface between logic blocks with different clock frequencies. With an asynchronous SoC interconnect, for example an on-chip crossbar, independent synchronous blocks can be linked together using a simple clock domain converter on each of the crossbar ports. The clock domain converters work just like synchronous FIFOs, and can operate independently of each other at the rate of the connected synchronous block. A similar idea is shown in detail in the next section.
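To make the Gray code trick mentioned above concrete: pointers into a dual-ported memory are usually Gray-coded because successive values differ in exactly one bit, so a pointer sampled in another clock domain is either the old or the new value, never a mixture. The sketch below shows the standard reflected-binary conversion. The function names and the FIFO depth of 8 are assumptions made for illustration and do not come from the article.

```python
def binary_to_gray(n: int) -> int:
    """Return the reflected-binary Gray code of a non-negative integer."""
    return n ^ (n >> 1)


def gray_to_binary(g: int) -> int:
    """Invert binary_to_gray by XOR-ing together all right shifts of g."""
    n = 0
    while g:
        n ^= g
        g >>= 1
    return n


def bits_changed(a: int, b: int) -> int:
    """Count the bit positions in which a and b differ."""
    return bin(a ^ b).count("1")


if __name__ == "__main__":
    depth = 8  # hypothetical FIFO depth; must be a power of two
    for ptr in range(depth):
        nxt = (ptr + 1) % depth
        gray = binary_to_gray(ptr)
        changed = bits_changed(gray, binary_to_gray(nxt))
        print(ptr, format(gray, "03b"), "bits changed on increment:", changed)
```

The single-bit property also holds at the wrap-around from the last pointer value back to zero, provided the depth is a power of two.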
Figure 3 shows how the IP is connected to an interconnect bus using a GALS circuit. IP blocks are completely decoupled from the interconnect bus.
With this method, a synchronous layer interacts with the interconnect bus using the system clock and transmits the address, data and other information to the asynchronous layer. The asynchronous layer then encodes the data using a delay-insensitive encoding (for example, dual-rail or m-of-n encoding) and transmits it to the asynchronous layer on the IP side. The asynchronous bridge creates a boundary between the system clock and the IP clock, which effectively eliminates the need for large clock tree buffers for the system clock. A major result is lower power consumption.
The synchronous layer acts as a target for the interconnect bus and as an initiator for the asynchronous layer, making the asynchronous communication transparent. The particular synchronous layer implementation would be specific to the properties of the interconnect bus and the targeted application (such as low- or high-performance peripherals). The asynchronous layer converts the synchronous protocol signals (address, data, etc.) to a delay-insensitive encoding such as m-of-n or dual-rail to transmit the symbols across the line.
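As an illustration of why such an encoding is delay-insensitive, the sketch below models dual-rail encoding, which is the 1-of-2 case of an m-of-n code: each bit travels on two wires, and exactly one of them goes high when the bit is valid. The receiver therefore knows a word has fully arrived from the data wires alone, whatever the individual wire delays. All names here are hypothetical, and the random arrival order is only a stand-in for unequal wire delays; this is not the circuit described in the article.

```python
import random

NULL = (0, 0)  # spacer: neither rail high, so no data is present


def dual_rail_encode(bits):
    """Map each bit to (true_rail, false_rail): 1 -> (1, 0), 0 -> (0, 1)."""
    return [(1, 0) if b else (0, 1) for b in bits]


def is_complete(word):
    """A word is valid once every wire pair has exactly one rail high."""
    return all(t + f == 1 for t, f in word)


def dual_rail_decode(word):
    """Recover the bits; refuse to decode a word that is still arriving."""
    if not is_complete(word):
        raise ValueError("word is not complete")
    return [t for t, _ in word]


def arrive_in_random_order(word, seed=0):
    """Let wire pairs become valid one at a time in a shuffled order.

    Returns (pairs_arrived_at_completion, decoded_bits).
    """
    order = list(range(len(word)))
    random.Random(seed).shuffle(order)
    seen = [NULL] * len(word)
    for count, i in enumerate(order, start=1):
        seen[i] = word[i]
        if is_complete(seen):
            return count, dual_rail_decode(seen)
    return None


if __name__ == "__main__":
    word = dual_rail_encode([1, 0, 1, 1])
    print(word)
    print(arrive_in_random_order(word, seed=3))
```

Completion is detected only after the last pair arrives, regardless of order, which is why the wires do not need to be delay-matched.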
BENEFITS
With the GALS methodology, the IP is completely decoupled from the interconnect bus. This makes it possible to integrate asynchronous communication within an existing synchronous system. Because of the delay-insensitive encoding, the wires supporting the communication at the physical level do not have to be balanced. Unlike the "legacy" technique for IP integration, GALS does not require large clock tree buffers, because the IP is decoupled from the interconnect bus. This saves a considerable amount of power, which can be extremely important for handheld devices that operate on battery power.
The GALS approach also means that the interconnect bus can run at a much higher frequency, thus increasing overall system performance.
Last but not least, the GALS approach can simplify system-level verification of the IP block. With the IP block completely decoupled from the interconnect bus, verification can be performed at the asynchronous dividing point. In the case of third-party pre-verified IP, IP-level verification can be completely eliminated.
CHALLENGES, DRAWBACKS
Asynchronous circuits tend to implement registers using latches rather than flip-flops. In combination with the absence of a global clock, this makes it less straightforward to connect registers into scan paths. Another consequence of the distributed self-timed control (the lack of a global clock) is that it is more complicated to single-step the circuit through a sequence of well-defined states. This makes it less straightforward to steer the circuit into particular quiescent states, which is necessary for IDDQ testing, the technique used to test for the short and open faults that are typical in today's CMOS processes.
The extensive use of state-holding elements (such as the Muller C-element, a basic element in the asynchronous domain, just as a flip-flop is a basic element in any synchronous design), together with the self-timed behavior, also makes it difficult to test the feedback circuitry that implements the state-holding behavior. Delay-fault testing represents yet another challenge.
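For readers unfamiliar with the Muller C-element, a minimal behavioral sketch of the two-input version follows. Its output goes high when both inputs are high, goes low when both are low, and otherwise holds its previous value. The class and method names are hypothetical, and this models only the logical behavior, not timing or the feedback circuitry discussed above.

```python
class CElement:
    """Behavioral model of a two-input Muller C-element."""

    def __init__(self, initial=0):
        self.out = initial  # stored state, held when the inputs disagree

    def update(self, a, b):
        """Apply new input values and return the resulting output."""
        if a == b:
            self.out = a    # inputs agree: output follows them
        return self.out     # inputs disagree: output keeps its old value


if __name__ == "__main__":
    c = CElement()
    for a, b in [(1, 0), (1, 1), (0, 1), (0, 0)]:
        print(a, b, "->", c.update(a, b))
```

The hold state is what makes the element hard to test: the output for disagreeing inputs depends on history, so a test must first drive the element into a known state.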
One drawback of a GALS design is increased latency due to the additional asynchronous communication layer. The cost of using GALS is approximately 8,000 to 10,000 additional gates per IP block, depending on the bus width and the encoding used.