Stallion Architecture
As a contribution to research in high-performance computing, a family of CCMs has been created at Virginia Tech using a novel design approach. This approach, referred to as Wormhole Runtime Reconfiguration, offers fast computation along with partial runtime reconfiguration capability. Colt, the first of these CCMs, was conceptualized and designed by Ray Bittner [1]. It targeted signal processing applications and implemented the concept of stream-based data processing. Dr. Bittner also proposed Colt’s successor, Stallion, which has a much larger resource pool, additional functionality and an improved design. This thesis documents the task of prototyping Stallion. To give the reader background for the work done herein, this chapter discusses the architecture of the Stallion processor in limited detail.
Overview
The Stallion architecture consists of three interconnected units: the data ports, the Cross Bar and the meshes. The data ports are input/output units and are the only way to communicate with the chip. The meshes are the processing units of Stallion. The Cross Bar is an interconnection network between the data ports and the meshes. Stallion has six data ports, two meshes containing a total of sixty small processing elements called interconnected functional units (IFUs), and one crossbar with 22 inputs and 38 outputs.
Stallion is based on the stream concept [1]. A stream is defined as the concatenation of two sets of information: the programming information in the header, and the operands that follow the header. The programming header configures various components inside Stallion and creates a computational path that the operands in the stream will follow. The computational path determines what processing is performed on the operands. As the programming header travels through the chip, each unit it reaches is configured. The unit then passes the rest of the stream to other blocks inside the chip according to its own configuration. Thus, as the data path configuration progresses, the stream is stripped of its programming header, and the header no longer exists once the entire data path has been configured. It is important to note that the order and length of the programming information are not fixed. The stream, its programming header and the operand data can all be of arbitrary length.
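The way a stream loses its header as it configures the path can be sketched in a few lines of Python. This is only an illustrative model, not Stallion's actual programming format: the `Unit` class, the `run_stream` helper and the rule that each unit consumes exactly one header word are assumptions made for the sketch (the text above states that the order and length of programming information are not fixed).

```python
class Unit:
    """Hypothetical configurable block (data port, crossbar or IFU)."""

    def __init__(self, name):
        self.name = name
        self.config = None

    def accept(self, header, operands):
        # Assumption: one header word configures one unit.
        if not header:
            raise ValueError(f"no programming word left for {self.name}")
        self.config = header[0]
        # The consumed word is stripped; the rest of the stream moves on.
        return header[1:], operands


def run_stream(header, operands, path):
    """Send a stream through the units in `path`, configuring each in turn."""
    header, operands = list(header), list(operands)
    for unit in path:
        header, operands = unit.accept(header, operands)
    return header, operands


path = [Unit("port"), Unit("crossbar"), Unit("ifu")]
remaining_header, data = run_stream([0xA, 0xB, 0xC], [1, 2, 3], path)
```

After the call, `remaining_header` is empty and `data` still holds the three operands, mirroring the statement that the header no longer exists once the data path is fully configured.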
Data Port
There are six data ports in Stallion. Data ports are used to send and receive programming headers, operands and results. Each data port is 20 pins wide, with 16 bi-directional pins for data input/output, three bi-directional control pins (Program, Transmit and Receive) and one output pin named Write. The Program pin can be pulled low from inside or outside the chip. It indicates whether or not the word on the data pins is a program word. The Transmit pin indicates whether or not the word is valid. The signal on the Receive pin is interpreted differently for program words and data words. When a data port is sending out program words, it waits for the other party to pull the Receive pin low before sending more program words. When a data port is sending normal data, it expects the Receive pin to remain high. If the Receive pin is pulled low instead, data transmission stalls, because the party that pulled the line low is not ready to receive data. On the other hand, if the data port is receiving program information from the outside, it pulls the Receive line low to indicate its readiness. The last control pin on a data port, the Write pin, is an output and indicates whether the data port is configured as a read port or as a write port.
The data ports have three modes of operation: Raw, Synchronize and Loop. In raw mode, a data port accepts all data arriving at its pins. In synchronize mode, the data port uses a temporary buffer to store the current data word and signals the external circuitry that no more data will be accepted. This happens whenever the data port receives a signal from other synchronized ports that they are not ready to receive more data. This functionality preserves valid data and protects it from being overwritten by invalid data. The third mode, loop mode, is useful for performing computations in lock step. This is accomplished by synchronizing an output port with its input port. While operating in this mode, no new operands are accepted from outside until the current set of operands has been processed and is available at the output port.
The main structural components of each data port are a state machine that controls the overall operation, an address comparator that verifies whether the programming information is to be used for configuration, a buffer to hold data in synchronize mode when the data port has to wait for processing to restart, a register to hold configuration information, and tri-state logic for handling bi-directional communication.
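The pin budget and the two interpretations of the Receive pin on the sending side can be summarized in a small Python sketch. The constant names and the `may_send` function are hypothetical and serve only to restate the rules given above; they are not taken from the Stallion design.

```python
# Pin budget of one data port, as described in the text.
DATA_PINS = 16      # bi-directional data input/output
CONTROL_PINS = 3    # Program, Transmit, Receive (bi-directional)
WRITE_PINS = 1      # Write (output only)
TOTAL_PINS = DATA_PINS + CONTROL_PINS + WRITE_PINS


def may_send(sending_program_word: bool, receive_is_high: bool) -> bool:
    """Return True if a sending data port may put out its next word.

    Program words: the sender waits until the other party pulls Receive low.
    Data words: the sender expects Receive to stay high; low means stall.
    """
    if sending_program_word:
        return not receive_is_high
    return receive_is_high
```

For example, `may_send(False, False)` is `False`: a low Receive line during normal data transfer stalls the sender.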
Cross Bar
The Cross Bar forms the interconnect network between the data ports and the meshes in Stallion and is the primary means of creating deep pipelines. It has 22 inputs and 38 outputs and supports 16-bit wide data paths. Of the 22 inputs to the crossbar, six come from the data ports and eight come from each of the two meshes. Of the 38 outputs, six go to the data ports and sixteen go to each mesh. The crossbar provides full connectivity among the data ports and the components in the meshes.
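The input and output counts follow directly from the port and mesh numbers, as the short Python sketch below checks. The `Crossbar` class is a hypothetical behavioral model of a fully connected switch (any output may select any input); its method names and the dictionary-based selection table are assumptions for illustration, not the actual circuit.

```python
DATA_PORTS = 6
MESHES = 2
MESH_OUTPUTS_TO_CROSSBAR = 8    # per mesh
CROSSBAR_OUTPUTS_TO_MESH = 16   # per mesh

N_INPUTS = DATA_PORTS + MESHES * MESH_OUTPUTS_TO_CROSSBAR
N_OUTPUTS = DATA_PORTS + MESHES * CROSSBAR_OUTPUTS_TO_MESH


class Crossbar:
    """Hypothetical model: each output selects at most one input."""

    def __init__(self, n_inputs=N_INPUTS, n_outputs=N_OUTPUTS):
        self.n_inputs = n_inputs
        self.n_outputs = n_outputs
        self.select = {}  # output index -> input index

    def connect(self, output, source):
        if not (0 <= output < self.n_outputs and 0 <= source < self.n_inputs):
            raise IndexError("port index out of range")
        self.select[output] = source

    def route(self, input_words):
        # Data paths are 16 bits wide, so each word is masked to 16 bits.
        return {out: input_words[src] & 0xFFFF
                for out, src in self.select.items()}


xbar = Crossbar()
xbar.connect(output=37, source=0)   # last output driven by first input
xbar.connect(output=0, source=21)   # first output driven by last input
words = list(range(100, 100 + N_INPUTS))
routed = xbar.route(words)
```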
Mesh
Stallion has two separate computational meshes. Each mesh is organized as an 8 x 4 matrix consisting of 30 processing units called Interconnected Functional Units (IFUs) and two multipliers. The multipliers are placed in the top left and top right corners of each mesh. Inputs to the mesh arrive from the outputs of the Cross Bar, and mesh outputs are sent back to the Cross Bar. Within the mesh, local and skip buses are used to transmit data among the IFUs. Each IFU can send data to its four nearest neighbors using the local bus and to distant IFUs using the skip bus. The skip bus provides a convenient way to transfer data quickly between far-off IFUs.
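The cell counts of one mesh can be checked with a small Python sketch. Several details here are assumptions rather than facts from the text: the "8 x 4" matrix is taken to mean 8 rows by 4 columns, cells on the edge of the mesh are assumed simply to have fewer local neighbors, and the function names are hypothetical. The skip bus is not modeled.

```python
ROWS, COLS = 8, 4                       # assumed orientation of the matrix
MULTIPLIERS = {(0, 0), (0, COLS - 1)}   # top left and top right corners


def cell_kind(row, col):
    """Return "MUL" for a multiplier cell and "IFU" otherwise."""
    return "MUL" if (row, col) in MULTIPLIERS else "IFU"


def local_neighbors(row, col):
    """IFU cells directly above, below, left and right of (row, col)."""
    candidates = [(row - 1, col), (row + 1, col),
                  (row, col - 1), (row, col + 1)]
    return [(r, c) for r, c in candidates
            if 0 <= r < ROWS and 0 <= c < COLS
            and (r, c) not in MULTIPLIERS]


IFU_CELLS = [(r, c) for r in range(ROWS) for c in range(COLS)
             if cell_kind(r, c) == "IFU"]
```

With these definitions, `len(IFU_CELLS)` is 30 per mesh, which gives the sixty IFUs of the whole chip across the two meshes.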
Functional Unit
The Functional Unit (FU) is the basic data processing unit in Stallion. It has 16-bit left and right input registers, which receive their inputs from the interconnected functional unit. These operands can be sourced from any of the four local and skip bus values. As illustrated in the figure, the left operand passes through the barrel shifter and is sent to the arithmetic and logic unit (ALU). The right operand is sent to the ALU, the conditional unit and a delay block. The ALU performs various operations on the two operands. The conditional unit can make comparisons based on a conditional flag and choose one of its two inputs as its output. The delay blocks are used for pipeline synchronization and for aligning execution path lengths between two or more streams. The right operand can also be passed directly to the auxiliary output, with or without a delay.
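A rough behavioral sketch of part of this data path is given below in Python. It is not the actual FU design: the shifter is assumed to perform a logical left shift, the three ALU operations are placeholders for an instruction set the text does not list, the conditional unit is left out, and the names `DelayLine` and `fu_step` are hypothetical.

```python
from collections import deque

MASK16 = 0xFFFF  # operands are 16 bits wide


class DelayLine:
    """Delays a word by a fixed number of steps (0 means pass-through)."""

    def __init__(self, cycles):
        self.regs = deque([0] * cycles, maxlen=cycles) if cycles else None

    def step(self, word):
        if self.regs is None:
            return word
        oldest = self.regs[0]
        self.regs.append(word)  # maxlen drops the oldest entry
        return oldest


def fu_step(left, right, shift, op, delay):
    """One step: left goes through shifter and ALU, right also feeds aux."""
    shifted = (left << shift) & MASK16          # assumed logical left shift
    right &= MASK16
    if op == "add":
        result = (shifted + right) & MASK16
    elif op == "and":
        result = shifted & right
    elif op == "xor":
        result = shifted ^ right
    else:
        raise ValueError(f"unsupported placeholder operation: {op}")
    aux = delay.step(right)                     # delayed auxiliary output
    return result, aux


delay = DelayLine(cycles=1)
first = fu_step(1, 2, shift=3, op="add", delay=delay)
second = fu_step(0xFFFF, 1, shift=0, op="add", delay=delay)
```

The delay line shows how a stream on the auxiliary output can be held back so that two paths of different length line up again.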
Interconnected Functional Unit
The Interconnected Functional Unit (IFU) is the building block of the meshes. It consists of a Functional Unit surrounded by control and data buses that provide connectivity to the neighboring IFUs.
Multiplier
The Stallion processor contains four multipliers, located in the top left and top right corners of each mesh. Designed by Tsuang-Hen Yang [9], this pipelined multiplier was also used in Colt. It accepts two 16-bit inputs and produces a 32-bit output in two clock cycles. The inputs to the multipliers come from the crossbar and the outputs are sent to the two nearest IFUs. The multiplier design can be broken down into smaller units called multiplier cells, which are arranged in an array. In addition, the multiplier logic contains several registers and half adders.
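The two-cycle latency can be illustrated with a purely behavioral Python model. This sketch says nothing about the internal array of multiplier cells; the class name, the two pipeline registers and the choice of unsigned arithmetic are assumptions made for illustration.

```python
class PipelinedMultiplier:
    """Behavioral model: 16 x 16 -> 32 bits, result after two clock edges."""

    def __init__(self):
        self.stage1 = None  # product captured on the most recent edge
        self.stage2 = None  # product visible at the output

    def clock(self, a=None, b=None):
        """Advance one clock edge; return the output after that edge."""
        self.stage2 = self.stage1
        if a is None or b is None:
            self.stage1 = None  # no new operands this cycle
        else:
            # Assumption: operands are treated as unsigned 16-bit values.
            self.stage1 = ((a & 0xFFFF) * (b & 0xFFFF)) & 0xFFFFFFFF
        return self.stage2


mul = PipelinedMultiplier()
outputs = [mul.clock(3, 4), mul.clock(5, 6), mul.clock()]
```

Because the unit is pipelined, a new pair of operands can be presented on every clock edge while the previous product is still in flight, which is why the second call above already returns the first product.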
Design Methodology
The VLSI design flow has become standardized because of the complexity of the task and the high cost of even a single mistake. Most VLSI designs are the result of a strict regimen of established design practices that have been laid out after years of experience and research. Computer-aided design tools have been built to fit into these existing practices. The choice of a particular design methodology depends on the applications of the design and, frequently, on the nature of the design itself. The Stallion processor, being among the forerunners of new CCM architectures, adopts a full-custom physical design methodology. This approach is followed when the designers want the freedom to define every detail from the system level down to the transistor level. Often, the standard libraries are not suitable for the purpose. Stringent performance requirements also drive a full-custom design flow. This chapter focuses on the full-custom physical design flow, the practices followed for Stallion, the associated CAD tools, and how they fit into the Stallion full-custom physical design flow.
Full Custom Physical Design
The full custom design process is based on a “correct-by-construction” approach. This approach relies on the fact that the designer has finalized the details of the design on a transistor-by-transistor basis. Since all the details have been taken care of, the chip design is implicitly guaranteed to be correct as long as the net-list extracted from the mask data matches the schematics. In a typical design, all lower level modules are designed after the functional specifications and the system level design are complete. Before the physical design process starts, the design is frozen in the form of either schematics or some type of structural description in a high level HDL. The schematics are captured in a CAD tool that has a reliable interface to the physical design tools that will be used to create the layouts. At this point, the IC design process is mainly concerned with creating layouts in the chosen fabrication technology and making sure that there are no design rule errors and no net-list mismatches with respect to the schematics. Layout issues such as the power distribution scheme and parasitic capacitances are also addressed in this phase of physical design.
1. INTRODUCTION
1.1 METHODOLOGY
1.2 CONTRIBUTIONS
1.3 ORGANIZATION
2. BACKGROUND
2.1 PIPERENCH
2.2 RE-CONFIGURABLE COMMUNICATIONS PROCESSOR (RCP)
2.3 CONTEXT SWITCHING FPGA
2.4 JAZZ PROCESSOR
2.5 CHIMAERA CONFIGURABLE PROCESSOR
2.6 SUMMARY
3. STALLION ARCHITECTURE
3.1 OVERVIEW
3.2 DATA PORT
3.3 CROSS BAR
3.4 MESH
3.5 FUNCTIONAL UNIT
3.6 INTERCONNECTED FUNCTIONAL UNIT
3.7 MULTIPLIER
4. DESIGN METHODOLOGY
4.1 FULL CUSTOM PHYSICAL DESIGN
4.2 PHYSICAL DESIGN APPROACH FOR STALLION
4.3 CAD TOOLS
4.4 STALLION DESIGN AND CAD TOOLS
5. STALLION FLOOR PLAN AND LAYOUT
5.1 DESIGN HIERARCHY
5.2 STALLION LIBRARY CELL LAYOUT
5.3 CREATION OF HIGHER LEVEL CELLS
5.4 MULTIPLIER LAYOUT
5.5 LAYOUT OF FUNCTIONAL UNIT
5.6 IFU LAYOUT
5.7 MESH LAYOUT
5.8 CROSS BAR LAYOUT
5.9 DATA PORTS
5.10 I/O PADS
5.11 INSERTING GRAPHICS IN LAYOUT
5.12 STALLION LAYOUT
5.13 POWER DISTRIBUTION
5.14 CLOCK DISTRIBUTION
6. CONCLUSIONS
6.1 RESULTS
6.2 FUTURE WORK
VLSI Implementation of a Wormhole Run-time Reconfigurable Processor