FPGAs were traditionally developed using a 4-input look-up table (LUT), where a LUT constructed from SRAM bits stores digital information (1 or 0). The stored bits, also known as configuration memory (CRAM), were stitched together using a set of multiplexers that select one bit to drive the output for any given function based on a 4-input mapping scheme. At the time, 4-input LUTs provided the best area-delay product. Altera’s Stratix core (130nm) was based on these 4-input logic elements (LEs), as shown in Figure 1.
As process geometries began shrinking to 90nm and eventually to 65nm, the benefits of higher performance and increased density became available, but at the cost of higher power consumption in the core. In addition, with the dramatic increase in FPGA density, the critical path delays in the routing fabric increased, as the routing wires did not scale as well as transistors. Hence, the traditional method of implementing logic in an FPGA on a 4-input LUT had to be fundamentally challenged through design innovation. A new core architecture that would efficiently pack more logic per logic element, thereby delivering higher performance at lower power and ultimately lowering the overall cost, had to be created.
BIGGER LOOK-UP TABLES
Technically, to create a k-input LUT (k-LUT), that is, a LUT that can implement any function of k inputs, 2^k SRAM bits and a 2^k:1 multiplexer are required to map the selected CRAM bit to the output. For example, as shown on the left side of Figure 2, a 4-input LUT is implemented using 16 CRAM bits and a 16:1 multiplexer built from 15 2:1 multiplexers.
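The bit and multiplexer counts above can be checked with a small behavioral model. The following Python sketch is illustrative only and is not Altera's implementation; the names lut_read and program_lut are hypothetical. It stores the 2^k CRAM bits in a list and selects one of them with a tree of 2:1 multiplexers, counting the multiplexers as it goes.

```python
def program_lut(func, k):
    """Build the 2**k-bit LUT mask (the CRAM contents) for a k-input function.

    Bit 'addr' of the mask holds func evaluated on the binary digits of addr,
    with input 0 as the least significant digit.
    """
    return [func(*[(addr >> i) & 1 for i in range(k)]) & 1
            for addr in range(2 ** k)]


def lut_read(cram, inputs):
    """Select one CRAM bit using a tree of 2:1 multiplexers.

    cram   -- list of 2**k configuration bits
    inputs -- list of k input bits; inputs[0] drives the first mux stage
    Returns (output_bit, number_of_2to1_multiplexers_in_the_tree).
    """
    k = len(inputs)
    if len(cram) != 2 ** k:
        raise ValueError("a k-LUT needs exactly 2**k CRAM bits")
    level = list(cram)
    mux_count = 0
    for sel in inputs:
        # Each stage pairs neighbouring candidates and keeps one per pair.
        level = [level[i + 1] if sel else level[i]
                 for i in range(0, len(level), 2)]
        mux_count += len(level)  # one 2:1 mux per surviving candidate
    return level[0], mux_count


# Example: a 4-input XOR stored in a 4-LUT.
mask = program_lut(lambda a, b, c, d: a ^ b ^ c ^ d, 4)
bit, muxes = lut_read(mask, [1, 0, 1, 1])
print(len(mask), bit, muxes)
```

For k = 4 the tree has 8 + 4 + 2 + 1 = 15 multiplexers, matching the 16:1 structure described for Figure 2.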
Larger LUTs can be built from smaller LUTs and one or more multiplexers (shown on the right side of Figure 2). For example, a 5-LUT can be built from two 4-LUTs and a multiplexer, while a 6-LUT can be built from two 5-LUTs and a 2:1 multiplexer (equivalently, four 4-LUTs and a 4:1 multiplexer). The problem with these architectures, however, is that logic elements built from smaller LUTs are inefficient and waste resources when implementing smaller functions that use fewer than k inputs. Other inefficiencies include the replication of routing to the smaller LUTs when building a larger LUT, and the extra delays between LUTs, which result in a non-optimized logic structure.
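The composition of a 5-LUT from two 4-LUTs and a multiplexer can be sketched as follows. This is a minimal behavioral model in Python, assuming the 32-bit mask is simply split into a lower and an upper half; the function names are hypothetical and do not come from any Altera tool.

```python
def lut4(mask16, a, b, c, d):
    """Plain 4-LUT: index a 16-entry mask with four input bits."""
    return mask16[a | (b << 1) | (c << 2) | (d << 3)]


def lut5_from_two_lut4(mask32, a, b, c, d, e):
    """5-LUT built from two 4-LUTs plus one 2:1 multiplexer.

    The lower 16 mask bits hold the function for e = 0 and the upper
    16 bits hold it for e = 1. Both 4-LUTs see the same a..d inputs,
    and e only drives the final multiplexer.
    """
    low, high = mask32[:16], mask32[16:]
    out_e0 = lut4(low, a, b, c, d)
    out_e1 = lut4(high, a, b, c, d)
    return out_e1 if e else out_e0  # the extra 2:1 multiplexer


# Example: "at least three of five inputs are 1" as a 32-bit mask.
majority5 = [1 if bin(addr).count("1") >= 3 else 0 for addr in range(32)]
print(lut5_from_two_lut4(majority5, 1, 1, 0, 1, 0))
```

Note that inputs a to d must be routed to both 4-LUTs, which is the routing replication mentioned above, and that the signal passes through a 4-LUT and then a multiplexer, which is the extra delay.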
DEVELOPING ALM
With the acquisition of Right Track CAD and the creation of the Toronto Technology Centre (TTC) in 2000, Altera brought together a senior team of FPGA architecture researchers from academia. The TTC, in conjunction with software and IC design engineers at the company’s San Jose site, created the Altera FPGA Modeling Toolkit (FMT), which allows complete "virtual prototyping" of different FPGA architecture ideas. The FMT provides an in-depth understanding of the effects of the different aspects of FPGA architecture. Figure 3 illustrates the tradeoff between cost and delay for various LUT sizes when using FMT.
Figure 3 shows that increasing the LUT size results in a significantly lower logic delay, but with a considerable cost increase. Research has consistently shown that, for k = 4 inputs, the 4-LUT configuration provides the best area-delay product with minimal wasted resources (inputs, CRAM, and multiplexers). A basic 6-LUT configuration can increase performance by 15 percent, but the tradeoff is die area, which can increase by 17 percent. It is imperative to analyze the logic block, its various resources, targeted software, and overall impact on cost.
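To see what the 15 percent and 17 percent figures imply for the area-delay product, a rough calculation can help. The sketch below is only an illustration and assumes that a 15 percent performance increase means delay scales as 1 / 1.15; the function name is hypothetical.

```python
def relative_area_delay(perf_gain, area_growth):
    """Area-delay product relative to a baseline normalized to 1.0.

    perf_gain   -- fractional speed-up (0.15 means 15 percent faster)
    area_growth -- fractional area increase (0.17 means 17 percent larger)
    Assumption: delay is inversely proportional to performance.
    """
    delay = 1.0 / (1.0 + perf_gain)
    area = 1.0 + area_growth
    return area * delay


ratio = relative_area_delay(0.15, 0.17)
print(f"{ratio:.3f}")  # prints 1.017
```

Under that assumption the basic 6-LUT ends up with an area-delay product slightly worse than the 4-LUT baseline: the speed gain does not pay for the extra area on its own.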
Simply increasing the LUT size to a 6-LUT from a 4-LUT or 5-LUT is highly inefficient. For Figure 4, a synthesis modeling tool was used to synthesize and pack a target logic block, with the HDL translated into various LUT sizes. This provided an excellent understanding of the effects on die area (cost).
The spread in Figure 4 shows that when a synthesis tool is used to optimally pack functions into 6-LUTs, the result is a distribution of smaller LUT sizes needed to implement the smaller logic functions. While the use of 6-LUTs can improve performance, only a relatively small number of the LUTs use all six inputs. Costly silicon real estate (CRAM bits and multiplexers) and logic and routing resources are wasted, resulting in increased costs.
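The waste can be quantified per function. A function of n inputs needs only 2^n distinct mask bits, so when it occupies a full 6-LUT the remaining CRAM bits carry no additional information. The following Python sketch computes that fraction; it is a simple counting argument, not data taken from Figure 4, and the function name is hypothetical.

```python
def cram_utilization(func_inputs, lut_inputs=6):
    """Fraction of a LUT's CRAM bits that hold distinct information
    when the LUT implements a function of func_inputs inputs."""
    if not 0 <= func_inputs <= lut_inputs:
        raise ValueError("function must fit in the LUT")
    return 2 ** func_inputs / 2 ** lut_inputs


for n in range(2, 7):
    print(f"{n}-input function in a 6-LUT: {cram_utilization(n):.4f}")
```

A 4-input function, for instance, needs 16 of the 64 bits, so three quarters of the mask is redundant.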
During an exhaustive iterative process with over 150,000 experiments – where the requirement was to reduce levels of logic for increased performance without suffering the inefficiencies – a new criterion for designing a larger LUT became apparent. To meet this goal, the larger LUT had to be divisible into smaller LUTs when required to reduce costs. This LUT would then be able to deliver the performance benefits with no wasted silicon, as the LUT would be divided into smaller LUTs wherever appropriate. In 2002, an adaptive LUT was optimized to share LUT masks between functions, which resulted in the final design of the adaptive logic module (as shown in Figure 5).
Figure 5 shows a representation of the ALM with 4-input/3-input LUTs and multiplexers. It illustrates how the LUT mask can be divided and shared between two different logic functions. The ALM consists of an 8-input LUT, two registers, two adders, and multiplexers, providing a highly efficient core that is adaptable to any logic design.
ADVANTAGES OF ALM
Altera’s patented LUT technology was designed into the adaptive logic module of the 90nm Stratix II FPGA, where the adaptability of the LUT is used to efficiently pack the most logic into the least area, while delivering the maximum performance in an FPGA core. An alternate view of the ALM is shown in Figure 6.
The ALM can implement a 6-LUT, select 7-input functions, or be fractured into smaller LUTs to implement two independent functions. The ALM also includes two registers, providing an optimal register-to-logic ratio (2:1) for register-rich designs, and two adders, each capable of performing 2-bit additions, or a single ternary adder for increased arithmetic capabilities.
Unique to the ALM is the patented LUT, which is capable of supporting a large number of modes and features an efficient logic implementation. The ALM can implement 2.5 LEs in a classic 4-LUT configuration. It can hold, on average, 1.6x more combinational logic than competitive basic 6-LUT architectures. The 1.6x factor further increases to 1.8x when considering the two registers per ALM in the densest packing mode. Figure 7 shows the different LUT configurations that a single ALM can support with minimal input sharing, and Table 1 describes each ALM configuration.
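A simplified way to reason about which pairs of functions can share one ALM is to count inputs and mask bits. The Python sketch below is a rough model under stated assumptions, not a reproduction of Table 1: it takes the 8 inputs mentioned in the text, assumes a 64-bit LUT mask (the 2^6 bits needed for one full 6-LUT), and ignores special modes such as the select 7-input functions. The names are hypothetical.

```python
ALM_INPUTS = 8       # from the text: the ALM is built around an 8-input LUT
ALM_MASK_BITS = 64   # assumption: mask sized for one full 6-LUT (2**6 bits)


def min_shared_inputs(k1, k2):
    """Fewest inputs two independent functions (k1 and k2 inputs) must share
    to fit in one ALM under this simplified model.

    Returns None when their two masks together exceed the mask budget.
    """
    if 2 ** k1 + 2 ** k2 > ALM_MASK_BITS:
        return None
    # Distinct inputs needed = k1 + k2 - shared, which must not exceed 8.
    return max(0, k1 + k2 - ALM_INPUTS)


for pair in [(4, 4), (5, 3), (5, 4), (5, 5), (6, 4)]:
    print(pair, min_shared_inputs(*pair))
```

In this model two 4-LUTs, or a 5-LUT with a 3-LUT, fit with no shared inputs, while larger pairs fit only if some inputs are common to both functions. A full 6-LUT uses the entire mask on its own.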
The TTC and San Jose software teams were involved in the design of the ALM, which enabled Quartus II software to automatically use all the features of the ALM. Leveraging the benefits of the ALM, the synthesis tool can alter the distribution of LUT sizes to produce the right mix of large and small LUTs, so that a design fits in the fewest ALMs for a given optimization goal. Figure 8 shows the distribution of LUTs generated when optimizing for speed, area, and a mix of both (balanced).
Depending on the optimization goal and the design, the synthesis tool generates a mix of LUTs. When maximizing for performance, the largest number of 6-LUTs is generated. When minimizing for area, a mix of smaller LUTs is generated for efficiently mapping functions with the fewest number of ALMs.
About the author
Amit Verma is the senior high-end technical analysis staff responsible for technical product analysis, FPGA architecture and technology solutions for Altera’s high-end FPGA product lines. He holds a BSEE from Rochester Institute of Technology in New York.
Click here for the illustrations:
Figure 1, Figure 2, Figure 3, Figure 4, Figure 5, Figure 6, Figure 7, Figure 8, Table 1