
Proceedings of the

Third International Workshop on Systems, Architectures, Modeling, and Simulation

SAMOS 2003

July 21-23, 2003 Research and Training Institute of East Aegean Samos, Greece

Copyright

Copyright © 2003 by the SAMOS Initiative. Permission to make digital or hard copies of parts of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page.

SAMOS ISBN : 90-807957-1-2 SAMOS Order Number 2003-0722

Previous Publications A selection of papers presented at the SAMOS 2001 workshop has been published in Springer’s Lecture Notes in Computer Science, LNCS 2268, Embedded Processor Design Challenges. A selection of papers presented at the SAMOS 2002 workshop has been published in Marcel Dekker’s Domain-Specific Processors: Systems, Architectures, Modeling, and Simulation.

Copy ordering Additional copies of this proceedings may be ordered from: SAMOS Initiative, P.O. Box 9512 2300 RA Leiden, The Netherlands Phone +31 71 527 5778 Fax +31 71 527 6985 E-mail [email protected]

Printed in the Netherlands

Third International Workshop on Systems, Architectures, Modeling and Simulation

July 21-23, 2003 Island of Samos, Greece General Chair S. Vassiliadis, TUDelft, NL Program Committee Ed Deprettere Andy Pimentel Patrice Quinton Juergen Teich Shuvra Bhattacharyya Bernard Pottier

Leiden University, NL Amsterdam University, NL Rennes University, F Erlangen University, G University of Maryland, MD Université Brest, F

Local Arrangements Lidwina Tromp Nancy Pyrini

[email protected] [email protected]

Message from the Organizers The SAMOS Initiative is an international informal gathering of highly qualified researchers from academia and industry, both at Ph.D. and senior levels. For the third time in a row, the participants have been invited by us to share in our 3-day lively discussion on the lovely, quiet, and inspiring northern mountainside of the Mediterranean island of Samos. It has by now become well-nigh a tradition to have a few presentations during the mornings, and lots of after-lunch discussions and problem nut-cracking gatherings. The SAMOS workshop is unique in that not only are recent research results presented, but also not yet completely solved problems are released, and - particularly fruitful - problems that have led to splitting headaches enter the scientific arena. Nights may be short, indeed. Because participation is by invitation only, we can take the liberty to invite researchers that are known to us to pursue similar or related work and who are, therefore, confronted with alternatives right here on the stage. Such encounters quite often lead to lively disputes. When, in the year 2001, the first Samos workshop was held, the participants one day experienced an animated dispute between two professors on the beach that culminated in what has since been referred to as the Samos beach lectures. These lectures, too, are near to tradition. Indeed, they were held in 2002, too (be it that some outsiders seemed not to enjoy the free education as much as the scientists did). We hope that you will enjoy the SAMOS III Workshop in all its aspects, including the Beach Lectures, and the competition for the best paper award. Shuvra Bhattacharyya Ed Deprettere Stamatis Vassiliadis

Table of contents Reconfigurable Computing

The Molen Programming Paradigm ........................................................................................ 1 S. Vassiliadis, G. Gaydadjiev, K. Bertels, E. Moscu Panainte Loading ρµ code: Design Considerations................................................................................. 8 G. Kuzmanov, G.N. Gaydadjiev, S. Vassiliadis RAMPASS: a Reconfigurable and Advanced Multi-Processing Architecture for future Silicon Systems ......................................................................................................... 12 S. Chevobbe, N. Ventroux, F. Blanc, T. Collette Basic OS Support for Distributed Reconfigurable Hardware .............................................. 18 C. Haubelt, D. Koch, J. Teich A Cost-Efficient RISC Processor Platform for Real Time Audio Applications ................. 23 J.P. Wittenburg, U. Schreiber, U. Gries, M. Schneider, T. Niggemeier Customising Processors: Design-time and Run-time Opportunities ................................... 28 L. Wayne Intermediate level components for reconfigurable platforms.............................................. 34 E. Fabiani, C. Gouyen, B. Pottier Performance Estimation of Streaming Media Applications for Reconfigurable Platforms........................................................................................................ 42 C. Reuter, J. Martín, H.J. Stolberg, P. Pirsch Partitioning and CoDesign tools & methodology for Reconfigurable Computing: The EPICURE philosophy....................................................................................................... 46 M. Auguin, K. B. Chehida, J.Ph. Diguet, X. Fornari, A.-M. Fouilliart, C. Gamrat, G. Gogniat, Ph. Kajfasz, Y. Le Moullec An Industrial perspective: a pragmatic high end signal processing design environment at Thales.............................................................................................................. 52 E. Lenormand, G. Edelin

Architectures and Implementation

CoDeL: Automatically Synthesizing Network Interface Controllers.................................. 58 R. Sivakumar, V. V. Dimakopoulos, and N.J. Dimopoulos Performance and Power Evaluation of Clustered VLIW Processors with Wide Functional Units ....................................................................................................................... 64 M. Pericàs, E. Ayguadé, J. Zalamea, J. Llosa and M. Valero Embedded SIMD Vector Signal Processor Design................................................................ 71 G. P. Fettweis An Optimized Flow for Designing high-speed, large-scale CMOS ASIC SoCs ................ 77 U. Heinkel, C. Mayer, H. Sahm Fast Exploration of Parallel Architectures for Multi-User Detection Algorithms: A Case Study ............................................................................................................................ 83 D. Massicotte, P. Quinton, A.O. Dahmane, T. Risset Register-Based Permutation Networks for Stride Permutation .......................................... 84 J. Takala and T. Järvinen A Family of Accelerators for Matrix-Vector Arithmetic Based on High-Radix Multiplier Structures ............................................................................................................... 90 D. Guevorkian, P. Liuha, A. Launiainen, V. Lappalainen Lattice-Based Memory Allocation .......................................................................................... 96 A. Darte, R. Schreiber, G. Villard Metrics for digital signal processing architectures Characterization: Remanence and Scalability .................................................................... 102 P. Benoit, G. Sassatelli, L. Torres, D. Demigny, M. Robert, G. Cambon Virtual Architecture Mapping: A SystemC based Methodology for Architectural Exploration of Systems-on-Chip Designs ............................................................................. 108 T. Kogel, A. Wieferink, R. Leupers, G. Ascheid, H. Meyr, D. Bussaglia, M. Ariyamparambath

Compilers, System Modeling, and Simulation

Comparison of Data Dependence Analysis Tests ............................................................... 114 M. Viitanen, T. D. Hämäläinen An Executable Intermediate Representation for Retargetable Compilation and High- Level Code Optimization ............................................................................................ 120 R. Leupers, O. Wahlen, M. Hohenauer, T. Kogel, P. Marwedel MOUSE: A Shortcut From Matlab Source to SIMD DSP Assembly Code...................... 126 G. Cichon, G. Fettweis Hw/Sw Co-exploration at TLM Level for the Implementation of DSP Algorithms into Application Specific DSP’s using SystemC and LISA ....................................................... 131 J.P. Robelly and G. Fettweis SYNTEL: A Synchronous Co-design Environment for the Synthesis of Wireless Telecommunications Protocols ............................................................................................. 135 M. Bourdellès, Ph. Kajfasz Sandbridge Software Tools.................................................................................................... 142 J. Glossner, S. Dorward, S. Jinturkar, M. Moudgill, E. Hokenek, M. Schulte, S. Vassiliadis High-level Energy Estimation for ARM-Based SOCs ....................................................... 148 D. Crisu, S. D. Cotofana, S. Vassiliadis A Performance/cost Estimation Model for Large Scale Distributed Array Signal Processing System Specification ........................................................................................... 154 S. Alliot Deriving efficient control in Kahn Process Networks ........................................................ 160 S. Derrien, A. Turjan, C. Zissulescu, B. Kienhuis, E. Deprettere IDF Models for Trace Transformations: A Case Study in Computational Refinement ....................................................................... 167 C. Erbas, S. Polstra, A. Pimentel Mapping Specification-level primitives to IP-primitives: A Case study ........................... 173 V. D. Zivkovic, E. Deprettere, E. de Kock, P. van der Wolf


The Molen Programming Paradigm Stamatis Vassiliadis, Georgi Gaydadjiev, Koen Bertels, Elena Moscu Panainte Computer Engineering Laboratory, Electrical Engineering Department, EEMCS TU Delft E-mail : {stamatis, georgi, koen, elena}@ce.et.tudelft.nl http://ce.et.tudelft.nl

Abstract— In this paper we present the Molen programming paradigm, which is a sequential consistency paradigm for programming Custom Computing Machines (CCM). The programming paradigm allows for modularity and provides mechanisms for explicit parallel execution. Furthermore, it requires only a few instructions to be added to an architectural instruction set, while allowing an almost arbitrary number of op-codes per user to be used in a CCM. A number of programming examples and a discussion are provided in order to clarify the operation, sequence control and parallelism of the proposed programming paradigm.

This paper is organized as follows: in Section II the shortcomings of the existing reconfigurable programming paradigms are highlighted. Section III introduces the Molen programming paradigm and shows how the indicated shortcomings are addressed. Section IV discusses the sequence control of the programming paradigm. The discussion is supported by a wide range of examples. Section V concludes the discussion.

I. INTRODUCTION

Since the mid nineties, Reconfigurable Computing (RC) has become an increasingly popular computing paradigm, which refers to the ability of the software to transform the hardware platform underneath on a per-application basis. Computing systems built according to the Custom Computing Machines paradigm usually consist of a General Purpose Processor (GPP) and reconfigurable unit(s), possibly implemented in an FPGA technology. The software and architectural support needed for CCM remains problematic. Programming CCMs usually implies the introduction in the software design flow of detailed knowledge about the reconfigurable hardware. The compiler plays a significant role in the software design flow as it has to integrate most of this information. Computation-intensive operations are usually implemented on the reconfigurable hardware provided by different vendors and the challenge is to integrate them - whenever possible - in new or existing applications. Such integration is only possible when application developers as well as hardware providers adopt a common programming paradigm. In this paper we present such a programming paradigm. The main contributions of the paper are: • The presentation of a programming model for reconfigurable computing that allows modularity, general "function like" code execution and parallelism in a sequential consistency computational model, • The definition of a minimal ISA extension to support the programming paradigm. Such an extension allows the mapping of an arbitrary function on the reconfigurable hardware with no additional instruction requirements, • The introduction of a mechanism allowing multiple operations to be loaded and executed in parallel on the reconfigurable hardware, • Support for the application portability to multiple reconfigurable platforms.

II. PROBLEMS WITH EXISTING RC PROGRAMMING PARADIGMS

In the last decade, many different approaches have been proposed for programming FPGA / GPP combinations. The architecture of a computer, reconfigurable or not, is the minimal behavioral specification- behavioral so that the software can be written, minimal so that the widest possible range of excellence criteria can be chosen for implementations [1]. It is the conceptual structure and functional behavior as seen by the user (usually the programmer). The conceptual structure, e.g. instruction set, is determined mainly by the system functional requirements. In addition many other system requirements, e.g. power consumption, can have impact on the computer architecture. As in the case of traditional computer architecture, the Reconfigurable Computing (RC) paradigm strongly relies on the compilation process. The compiler plays an important role in code generation and needs to be extended with the knowledge about the RC extension architecture. In order to introduce the CCM to the instruction set architecture, the opcode space is extended. This is done by introducing new ”super instructions” to operate the CCM from the software. It is obvious that those new instructions will utilize the ”notused” opcode space of the targeted ISA. Four major shortcomings of existing RC programming approaches are indicated by the following: 1) Opcode space explosion: a common approach (e.g. [2], [3], [4]) is to introduce a new instruction for each portion of application mapped into the FPGA. The consequence is the limitation of the number of operations implemented into the FPGA, due to the limitation of the opcode space. More specifically stated, for a specific application domain intended to be implemented in the FPGA, the designer and compiler are restricted by the unused opcode space.
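As a rough, illustrative calculation (the figures below are ours, not the paper's): if the targeted ISA has only u unused primary opcodes, then no more than u kernels can ever be mapped into the FPGA in this way, regardless of how much reconfigurable area is available,

\[
\#\{\text{FPGA-mapped operations}\} \;\le\; u, \qquad \text{e.g. } u = 16 \;\Rightarrow\; \text{at most 16 custom instructions,}
\]

whereas the Molen extension described below needs only a handful of new opcodes in total.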



Fig. 1. The Molen machine organization

2) Limitation of the number of parameters: In a number of approaches, the operations mapped on an FPGA can only have a small number of input and output parameters ([5], [6]). For example, in the architecture presented in [5], due to the encoding limits, the fragments mapped into the FPGA have at most 4 inputs and 2 outputs; also, in Chimaera [6], the maximum number of input registers is 9 and it has one output register. 3) No support for parallel execution on the FPGA of sequential operations: an important and powerful feature of FPGAs can be the parallel execution of sequential operations when they have no data dependency. Many architectures (see for examples in [7]) do not take this issue into account and their mechanism for FPGA integration cannot be extended to support parallelism. 4) No modularity: each approach has a specific definition and implementation bound to a specific reconfigurable technology and design. Consequently, the applications cannot be (easily) ported to a new reconfigurable platform. Further, there are no mechanisms allowing reconfigurable implementations to be developed separately and ported transparently. This implies that a reconfigurable implementation developed by a vendor A cannot be included without substantial effort by the compiler developed for an FPGA implementation provided by a vendor B. Due to these limitations, the implementation of the same operation on the same type of FPGA is different for each specific approach. Moreover, some reconfigurable architectures may not include a particular operation in the reconfigurable hardware due to some specific restrictions. III. THE MOLEN PROGRAMMING PARADIGM In this section, first the Molen machine organization [8] is briefly introduced. Subsequently, the Molen programming paradigm is described and it is shown how it addresses the major shortcomings of the RC paradigms as indicated in Section II. To clarify the discussion, a variety of examples is provided. The main Molen components as depicted in Figure 1 are: the Core Processor - which is a GPP, and the Reconfigurable Unit (RU) - implemented in the FPGA. The Arbiter performs


Fig. 2. Molen interface

a partial decoding of the instructions fetched from the main memory and issues them to the corresponding execution unit (GPP or RU). The division into a hardware and a software part is directly mappable to the two units. The hardware-targeted pieces are executed by the RU, which is usually composed of a three-part, low-grain reconfigurable fabric, while the software (remaining) modules are executed on the GPP. The Custom Computing Unit (CCU) is the RU part that performs the hardware execution. It should be noted that a CCU can incorporate complex multiple hardware "functions". The conceptual idea of how a Molen application looks is presented in Figure 2. First, there are clear boundaries between the software execution and the RU, noted by input and output. One should think of predefined parameters (or pointers to parameters) to be passed to and back from the hardware CCU engine. Second, the CCU configuration file is to be loaded into the configuration memory in order to prepare the hardware (the CCU part in Figure 1). Since different CCU engines will have different content and length of their configuration files, a general approach is provided for loading an arbitrary configuration to the reconfigurable part of the RU. This is denoted by the SET phase (initiated by a SET instruction). The SET instruction requires a single parameter - the beginning address of the configuration microcode. When a SET instruction is detected, the Arbiter will read every sequential memory address until the termination condition is met, e.g. an end op microinstruction is detected. The ρµ code unit will then ensure that the data fetched from memory is redirected to the reconfiguration support unit, e.g. the FPGA configuration memory. For an implementation see [9]. After completion of the SET phase, the hardware is ready to be used in the context of the targeted CCU functionality. This is done in the EXECUTE phase, initiated by the EXECUTE instruction. This instruction also utilizes a single parameter, being the address of the execution microcode. The execution microcode performs the real CCU operations, that is, the hardware engine initialization, reading the input parameters, performing the targeted computation and writing the results to the output registers. The majority of the CCUs will additionally access the system memory while running (depicted with the horizontal arrow in Figure 2). The input/output and SET/EXECUTE parameters (and information about the memory segments used by the CCU) are made available to the compiler through a CCU description file. This description file and the binary images of the configuration and the execution microcode are


sufficient for the compiler / linker to create adequate Molen binaries. The above two generic phases, SET and EXECUTE, are emulated: an instruction is substituted by a sequence of micro-operations using an extension of microcode (also referred to as reconfigurable microcode). In addition, Exchange Registers (XR) are used for passing operation parameters to and from the reconfigurable hardware at the beginning and the end of the operation execution. The XRs can receive their data directly from the GPP register bank. Therefore, the corresponding move instructions have to be provided. The number of XRs is implementation dependent. The Molen programming paradigm is a sequential consistency paradigm for programming CCMs (reconfigurable processors) possibly including general purpose computational engine(s). The paradigm allows for parallel and concurrent hardware execution and it is intended (currently) for single program execution. It requires only a one-time architectural extension of a few instructions to provide a large user reconfigurable operation space. The complete list of eight required instructions is as follows: •

Six instructions are required for controlling the reconfigurable hardware, namely: – Two SET < address > instructions: at a particular location the hardware configuration logic is defined. This operation will actually perform the hardware configuration as stored in memory from the referred address location. The information about the configuration microcode length is embedded inside the microcode itself. This operation can be additionally split into two sub-phases (accounting for the two set instructions): ∗ partial SET (P-SET) to cover common and often used functions of an application or set of applications and to diminish time expensive reconfigurations, and ∗ complete SET (C-SET) to deal with the remaining blocks (not covered by the p-set sub-phase) in order to complete the CCU functionality by enabling it to perform the less frequent functions. – EXECUTE < address >: for controlling the executions of the operations on the reconfigurable hardware. The address sequence referred by this instruction contains the microcode to be executed on the CCU configured in the SET phase. The microcode sequence is terminated by an end op micro operation. – BREAK: this instruction may be needed in implementing explicit parallel execution between GPP and CCU if it is found to gain substantial performance with simultaneous execution. This operation is used for synchronization to indicate the parallel execution and setting boundaries. In addition BREAK can be used to express parallelism between two or more concurrent CCU units. – SET PREFETCH < address >: In implementations with a reconfigurable hardware of limited size, this


instruction can be used by the compiler to prefetch the SET microcode from main memory to a local (much faster) on-chip cache or the ρµ control store in order to minimize the reconfiguration time penalty. – EXECUTE PREFETCH < address >: the same reasoning as for the SET PREFETCH, holding for the EXECUTE microcode. • Two move instructions for passing values to and from the GPP register file and the reconfigurable hardware. More specifically: – MOVTX XRa ← Rb : (move to X-REG) used to move the content of general purpose register Rb to XRa . – MOVFX Ra ← XRb : (move from X-REG) used to move the content of exchange register XRb to GPP register Ra . Code fragments consisting of contiguous statements (as they are represented in high-level programming languages) can be isolated as generally implementable functions (that is, code with multiple identifiable input/output values). The parameters are passed via the Exchange Registers as introduced earlier. In order to maintain the correct program semantics, the code is annotated and CCU description files provide the compiler with implementation specific information such as the addresses where the SET and EXECUTE code are to be stored, the number of exchange registers, etc. The physical number of XRs, imposed by a specific implementation, is a limitation for the number of parameters that can be passed by value. This limitation can be avoided by applying passing by reference, a method that will be demonstrated later. It should be noted that this programming paradigm allows modularity, meaning that if the interfaces to the compiler are respected and if the instruction set extension (as described above) is supported, then: • custom computing hardware provided by multiple vendors can be incorporated by the compiler for the execution of the same application. • the application can be ported to multiple platforms with mere recompilation. • the designer is given the freedom to explore among several custom computing hardware designs implementing the same function and select the best one based on design constraints, e.g. speed or size. • the design process can be sped up by having different design teams work in parallel on the different parts of the system under development, e.g. various software and hardware pieces can be designed in parallel. Finally, it is noted that every user is provided with at least 2^(n−op) directly addressable functions, where n represents the instruction length and op the opcode length. The number of functions can easily be augmented to an arbitrary number by reserving an opcode for indirect opcode accessing. From the previous discussion, it is obvious that the programming paradigm and the architectural extensions resolve the aforementioned problems as follows: • There is only a one-time architectural extension of a few new instructions to include an arbitrary number of





configurations. • The programming paradigm allows for an arbitrary (only hardware real estate design restricted) number of I/O parameter values to be passed to/from the reconfigurable hardware. It is only restricted by the implemented hardware, as any given technology can (and will) allow only a limited amount of hardware. • Parallelism is allowed as long as the sequential memory consistency model can be guaranteed. • Assuming that the interfaces are observed, modularity is guaranteed because the paradigm allows freedom of operation implementation.
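To make the instruction-set extension concrete, the following sketch (ours, not taken from the paper) shows how a single reconfigurable operation would be invoked under the paradigm. The intrinsic-style spellings movtx/set/execute/movfx stand for the MOVTX, SET, EXECUTE and MOVFX instructions listed above; the XR names and the microcode address symbols are hypothetical stand-ins for values taken from a CCU description file.

    /* Hedged sketch: invoking one CCU operation (C-like pseudo-intrinsics).            */
    movtx(XR1, a);                    /* pass parameters by value into exchange registers */
    movtx(XR2, b);
    set(OP1_CONFIG_MICROCODE);        /* SET phase: load the CCU configuration microcode  */
    execute(OP1_EXEC_MICROCODE);      /* EXECUTE phase: run the execution microcode       */
    c = movfx(XR3);                   /* read the result back from an exchange register   */

When more parameters are needed than there are physical XRs, a pointer can be placed in an XR instead, which is the pass-by-reference scheme mentioned above.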

Fig. 3. Models of synchronization: (a) synchronization when GPP and FPGA work in parallel - a block such as EXECUTE op1 / GPP instruction / EXECUTE op2 / EXECUTE op3 / GPP instructions is closed by a BREAK; (b) synchronization when consecutive EXECUTE instructions are performed in parallel and the GPP is stalled - a block of consecutive EXECUTE instructions delimited by the next GPP instruction.

IV. SEQUENCE CONTROL There are basically three distinctive cases with respect to the Molen instructions introduced earlier - the minimal, the preferred and the complete case. In more detail they are as follows: • the minimal case: This is essentially the smallest set of Molen instructions needed to provide a working scenario that supports execution of an arbitrary application (provided there is no shortage of hardware resources). The four basic instructions needed are SET, EXECUTE, MOVTX and MOVFX. By implementing the first two instructions (SET/EXECUTE) any suitable CCU can be loaded and executed in the reconfigurable unit. The MOVTX and MOVFX instructions are needed to provide the input/output interface between the Software (GPP code) and the Hardware (CCU engine) parts of the application. It should be noted that this case does not introduce any restrictions on parallel hardware execution. There can be more than one CCU configured or running concurrently. The only difference is that in such cases the GPP will be stalled for the time needed for the "slowest" CCU to complete its operation. An example is presented in Figure 3(b), where the block of EXECUTE instructions which can be processed in parallel contains the first three consecutive EXECUTE instructions and is delimited by a GPP instruction. The situation when a block of EXECUTE instructions can be executed in parallel on the RU while the GPP is stalled will most likely be the case for reconfigured "complex" code and GPP code with numerous data dependencies. In addition, it should be noted that the above reasoning holds for segments consisting of SET instructions or a mixture of independent SET and EXECUTE instructions. • the preferred case: The minimal case provides the basic support but may suffer from time consuming reconfigurations that can become prohibitive for some real-time applications. In order to address this issue, the two additional SET sub-phases (P-SET) and (C-SET) are introduced to distinguish between very often and least often used CCU functions. More specifically, in the P-SET phase the CCU is partially configured to perform the common functions of an application, while the C-SET phase takes care only of the remaining (much smaller) set of less frequent functions. This allows the compiler to "hide" the


time consuming reconfiguration operations better. In addition, for the cases when the P-SET and C-SET improvements are not sufficient, the two prefetch instructions (SET PREFETCH and EXECUTE PREFETCH) are provided to allow additional freedom to the compiler instruction scheduler and further reduce the reconfiguration penalty. • the complete case: In addition, when it is found that there is substantial performance to be gained by parallel execution between GPP and RU, then the GPP and the EXECUTE instructions can be issued and executed in parallel. Since the GPP instructions (for the pertinent discussion see parallelism control discussed later) cannot be used for synchronization any longer, the additional BREAK instruction is required. The sequence of instructions performed in parallel is initiated by an EXECUTE instruction. The end of the parallel execution is marked by the BREAK instruction. It indicates where the parallel execution stops (see Figure 3(a)). The SET instructions are executed in parallel according to the same rules. It is clear that in the case of an organization utilizing a single GPP, the GPP instructions present in the parallel areas cannot be executed in parallel. This is the most general Molen case that allows the highest instruction level parallelism (the fastest execution). On the other hand, this approach is the most complicated in terms of covering issues such as asynchronous interrupt handling and memory management. It is the Molen implementor's responsibility to address and solve all of these properly.
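The complete case can be pictured with the same pseudo-intrinsic notation as before (again a sketch of ours, with hypothetical spellings for EXECUTE and BREAK):

    /* Hedged sketch: explicit parallel execution in the complete case.                  */
    execute(OP1_EXEC_MICROCODE);    /* issued to a first CCU                             */
    execute(OP2_EXEC_MICROCODE);    /* issued to a second CCU, runs concurrently         */
    gpp_work();                     /* GPP instructions may proceed in the meantime      */
    molen_break();                  /* BREAK: wait until the CCU operations issued above complete */
    r1 = movfx(XR4);                /* results are only read after the BREAK             */
    r2 = movfx(XR5);

The BREAK thus plays the role that the delimiting GPP instruction plays in the minimal case, but without forcing the GPP to stall while the CCUs run.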

Compilation: The compiler [10] currently relies on the Stanford SUIF2 (Stanford University Intermediate Format) and the Harvard Machine SUIF back-end framework. The x86 processor has been considered as the GPP in the evaluated Molen organization in [10]. An example of the code generated by the extended compiler for the Molen programming paradigm is presented in Figure 4. In the left column, the original C program is shown. The function implemented in reconfigurable hardware is annotated with a pragma directive named call fpga. It has incorporated the operation name: op1 as specified in the CCU description file. In the central part of the picture, the code generated by the original compiler for the C program is depicted. The pragma annotation is ignored and a standard function call is included. The right column of the picture presents the code generated by the compiler extended for the Molen programming paradigm; the function call is replaced with the appropriate instructions for sending parameters to the reconfigurable hardware in XRs, hardware reconfiguration, preparing the fix XR for the microcode of

Fig. 4. Example of code generation for the Molen programming paradigm: a C function annotated with #pragma call_fpga op1 (left), the code generated by the original compiler (center), and the code generated by the compiler extended for the Molen programming paradigm (right).

[The remaining pages of this paper, and the papers on pages 5-161, are not included in this copy of the proceedings. The excerpt below is from "Deriving efficient control in Kahn Process Networks" by S. Derrien, A. Turjan, C. Zissulescu, B. Kienhuis and E. Deprettere (pp. 160-166), resuming in the middle of its section on parameterized predicate controllers.]

A. Example As an example of how we derive a parametrized predicated controller, we look at a particular node of the QR algorithm. Figure 4 shows the loop-iterators k and j and a number of predicates to activate the proper IPD operation, i.e., read from a FIFO, and OPD operation, i.e., write to a FIFO. In each iteration of the loop-iterators, some predicates are evaluated to read the correct data from the correct FIFO. This is given by the READ part. Next, the function Func is executed in the EXECUTE part. Finally, data is written to the correct FIFO in the WRITE part. From the description given in Figure 4, the linear expressions are extracted from the loop-iterators as well as the predicates. After applying common expression elimination, a data dependence graph (DDG) is constructed as given in Figure 5. At the top of the DDG, 4 registers are given: two registers for the parameters T and N and two registers for the current loop-iterators k and j. The last two registers get updated with the new values of k and j after each evaluation of the DDG. Each time the DDG is evaluated, a particular switch pattern appears at the IPD and OPD outputs, driving the Read and Write units in the virtual processor as shown in Figure 1. The parameterized predicate controller implementation allows us


Fig. 4. Description of a KPN node process taken from the QR algorithm; each read (resp. write) operation is guarded by one or more expression predicates that are functions of the loop parameters and indices. The listing is organized in a READ part (guarded read(FIFOx) calls driven by the IPDs), an EXECUTE part (the node function), and a WRITE part (guarded write(out_i, FIFOx) calls driven by the OPDs), nested inside the k and j loops.
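Because the original listing of Figure 4 is not fully legible here, the following C-style sketch illustrates the shape of such a node. The loop bounds and the read-side guards are illustrative assumptions, while the write-side guards are the ones that survive in the figure fragment; read/write stand for the blocking FIFO primitives of the process network model.

    /* Illustrative sketch of a predicate-guarded KPN node (not the paper's exact Figure 4).  */
    extern int  read(int fifo);                    /* blocking FIFO primitives (assumed)       */
    extern void write(int value, int fifo);
    extern void Func(int a, int b, int *x, int *y, int *z);
    enum { FIFO1, FIFO2, FIFO3, FIFO4, FIFO5, FIFO6 };
    extern int T, N;                               /* loop parameters                          */

    void P2_main(void) {
      for (int k = 1; k <= T; k++) {               /* assumed loop bounds                      */
        for (int j = k; j <= N; j++) {
          int in_0 = 0, in_1 = 0, out_0, out_1, out_2;
          /* READ: IPD predicates select which FIFO delivers each argument (guards illustrative) */
          if (k - 1 == 0)  in_0 = read(FIFO1);
          if (k - 2 >= 0)  in_0 = read(FIFO2);
          if (j - 2 >= 0)  in_1 = read(FIFO3);
          if (j - 1 == 0)  in_1 = read(FIFO4);
          /* EXECUTE */
          Func(in_0, in_1, &out_0, &out_1, &out_2);
          /* WRITE: guards as in the surviving fragment of Figure 4 */
          if (-k + T - 1 >= 0) write(out_0, FIFO1);
          if (-j + N - 1 >= 0) write(out_1, FIFO5);
          if (k - T == 0)      write(out_2, FIFO6);
        }
      }
    }

Every guard is an affine expression in the iterators (k, j) and the parameters (T, N), which is exactly what the predicate controllers discussed in this section have to evaluate.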

Fig. 5. Data dependence graph associated to the loop described in Figure 4. Light gray nodes correspond to operations that depend on the most inner loop index, and therefore need to be evaluated at each new iteration; dark gray nodes correspond to operations that depend on the most outer loop index, and therefore only need to be re-evaluated when the inner loop upper bound is reached. At the top of the DDG sit the parameter registers and the current loop-iterator registers; the evaluated predicates drive the IPD/OPD outputs through multiplexers.
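As a concrete reading of Figures 4 and 5, the non-partitioned controller simply re-evaluates every affine predicate from the current iterator and parameter values on every firing; in hardware this is a block of combinational logic, but its behaviour can be sketched in C as below (the names and the exact predicate set are illustrative).

    /* Sketch: one combinational evaluation of the predicate controller (illustrative).   */
    typedef struct { int ipd1, ipd2, ipd3, ipd4, opd1, opd2, opd3; } switch_pattern_t;

    switch_pattern_t evaluate_predicates(int k, int j, int T, int N) {
      switch_pattern_t p;
      /* read-side (IPD) predicates - illustrative */
      p.ipd1 = (k - 1 == 0);
      p.ipd2 = (k - 2 >= 0);
      p.ipd3 = (j - 2 >= 0);
      p.ipd4 = (j - 1 == 0);
      /* write-side (OPD) predicates - from the Figure 4 fragment */
      p.opd1 = (-k + T - 1 >= 0);
      p.opd2 = (-j + N - 1 >= 0);
      p.opd3 = (k - T == 0);
      return p;   /* in hardware, all of these are evaluated in parallel every cycle */
    }

The two issues discussed next follow directly from this picture: the amount of logic grows with the number of predicates, and the critical path grows with the complexity of each affine expression.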

to handle large parameterized iteration domains. We observe two issues. The first issue relates to the complexity of the domain (size, shape) spanned by the loop-iterators and to the number of IPDs/OPDs involved in a node, since all predicates and expressions have to be evaluated at every cycle, therefore requiring more hardware resources as the domain gets more complex and involves more linear expressions that need to be evaluated. As a consequence, the controller implementation might use a lot of area. A second issue relates to the fact that we map all predicate evaluations as pure combinational functions. As a consequence, the controller speed (i.e., the frequency at which the controller can run) might not be scalable: computational complexity (i.e., the number of terms in the predicate expressions) usually increases with the number of dimensions of the iteration domain. The resulting controller critical path will therefore be very dependent on the number of dimensions of the domain. V. PARTITIONED PARAMETERIZED PREDICATE CONTROLLER The parameterized predicate controller is sensitive to the number of linear expressions that need to be evaluated in each iteration. It is however possible to drastically reduce the amount of computation by taking advantage of the fact that only expression predicates that


Fig. 6. Partitioned hardware loop controller architecture. The upper part is the parallel datapath that performs the combinational evaluation of the inner-loop-index-dependent predicates, feeding the READ/WRITE units through the IPD/OPD outputs; the lower part is the small sequential controller built around a program ROM, a data RAM, a PC and an ALU.

the sequential controller. The remaining sub-graphs are mapped on the sequential controller. We sequentially schedule the operations in the remaining sub-graphs by performing a topological sort to ensure that data dependence constraints are satisfied. Then we generate the global sequential schedule by concatenating all these local schedules in decreasing depth order. Values that need to be communicated to the parallel controller are mapped on communication ports with the parallel controller. A. Example


depend on the most inner loop index have to be evaluated at every cycle. Other linear expressions only need to be re-evaluated when one of their iterators changes value. This happens when an inner-loop iterator has reached its upper bound. The partitioned parameterized predicate controller tries to take advantage of this dependence on loop-iterators to get a simplified controller that operates fast. As fewer expressions have to be evaluated, simpler hardware can be used to realize the controller. In the partitioned parameterized predicate controller, the controller is split into two parts: a parallel datapath and a sequential controller. The parallel datapath is similar to the one presented in Section IV. Its purpose is to evaluate all expressions depending on values of the most inner loop-iterator, which is provided by the sequential machine. The sequential controller evaluates all the expressions that depend on the outer loop iterators and parameters (including the most inner loop upper and lower bounds) and forwards the results to the parallel datapath. In Figure 6, the parallel datapath is given in the top part and the sequential controller is given in the lower part. In the sequential controller, a small sequential program is stored in a program ROM. When running this program on a simple micro-engine, some data values are computed and stored in the data RAM. The values needed by the parallel datapath are forwarded. The parallel datapath evaluates at each cycle the parallel combinational logic, using the forwarded values. The only value that is re-computed is the inner loop-iterator. The partitioned parameterized predicate controller is generated in two steps. 1) DDG Partitioning: In the first step, we again apply common expression elimination and construct the data dependence graph as explained previously in Section IV. The main difference is that in this case we will partition this graph into a set of sub-graphs (one for each loop index). Each sub-graph contains all the nodes which have as input argument the corresponding loop index. In case a node has two distinct loop-iterators as arguments, we map the node to the sub-graph that is associated with the most inner loop-iterator. 2) Hardware mapping: In the next step, we partition the computation between the parallel and the sequential controller. This partitioning uses the sub-graphs obtained in the DDG partitioning. All the nodes of the sub-graph associated to the most inner loop index are mapped on the parallel controller. All arguments that depend on values calculated by the sequential controller are mapped on communication ports with

As an example of how we derive a partitioned parametrized predicated controller, we look again at the node of the QR algorithm given in Figure 4. The first step is to obtain the partitioned DDG. The DDG shown in Figure 5 is therefore decomposed into three sub-graphs, which are respectively associated to the parameters T and N and to the two loop indices k and j. In Figure 5, the color of a node indicates to which sub-graph it belongs. Next we need to map each sub-graph onto the parallel controller or the sequential controller. Since j is the inner loop iterator, we map its sub-graph onto the parallel controller. The two remaining sub-graphs are mapped onto the sequential controller. This means that the operations of the parameter sub-graph and of the k sub-graph are sequentialized into a program that runs on the sequential controller. This sequential program is given in Figure 7. B. Discussion Some problems might appear when the most inner loop domain becomes very small (only a few iterations). In such a case, the sequential machine might not be able to compute the partial results needed by the inner-loop parallel datapath fast enough, therefore slowing down the overall speed of the controller. However, in most applications, processes very seldom need to fire at every cycle, either because they are in a blocking-read or blocking-write state, or because the IP core cannot start the execution of a new operation every cycle. The impact of such hazards is therefore very unlikely to affect the overall network performance.

        // parameters
     1  n3 := N-1;
     2  write(n3, port1);
     3  n5 := T-1;
     4  write(n5, port2);
        // k outer loop
     5  n7 := (k-2);
     6  n16 := n7 >= 0;
     7  write(n16, port3);
     8  n8 := (k-1);
     9  n17 := n8 = 0;
    10  write(n17, port4);
    11  n11 := (k-T);
    12  n18 := n11 = 0;
    13  write(n18, port5);
    14  n6 := k+1;
    15  n12 := n11 >= 0;
    16  if n12 n20 := n6; else n20 := 1; end if;
    17  k_next := n20;
        // sync with datapath
    18  synchronize()
    19  jump 3

Fig. 7. Some possible schedule for the sequential controller
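Read together with Figure 7, the division of labour can be paraphrased in software terms as follows (a sketch of ours; the port values and k-dependent predicates mirror the listing above, while the inner-loop predicates and array names are illustrative):

    /* Sketch of the partitioned evaluation scheme.                                        */
    void sequential_step(int k, int T, int N, int port[6]) {  /* once per outer (k) iteration */
      port[1] = N - 1;            /* n3: forwarded inner-loop bound                        */
      port[2] = T - 1;            /* n5                                                    */
      port[3] = (k - 2) >= 0;     /* n16                                                   */
      port[4] = (k - 1) == 0;     /* n17                                                   */
      port[5] = (k - T) == 0;     /* n18                                                   */
    }
    void parallel_cycle(int j, const int port[6], int pred[8]) {  /* every cycle           */
      pred[0] = (j - 1) == 0;     /* only j-dependent predicates are re-evaluated          */
      pred[1] = (j - 2) >= 0;
      pred[2] = (j <= port[1]);   /* -j + N - 1 >= 0, against the forwarded N-1            */
      pred[3] = port[3];          /* k-dependent results arrive ready-made from the ports  */
      pred[4] = port[4];
      pred[5] = port[5];
    }

In hardware, the first function corresponds to the program ROM / ALU micro-engine of Figure 6 and the second to its combinational datapath.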

C. Experimental results To observe the benefits of our approach with respect to the ROM table implementation, we used the same applications as in Table I. For each of them we derived both a partitioned and a non-partitioned parameterized predicate controller, which was then mapped on a Xilinx Virtex-2 FPGA. The results, given both in terms of frequency and resource usage (in FPGA slices), are shown in Table II. From these results, we can make the following remarks:




TABLE II. Experimental results for the various processes' hardware controllers: frequency and area (FPGA slices) of the non-partitioned and the partitioned parameterized predicate controller, for several problem sizes of the QR factorization, Stereo-vision and Optical-flow applications. (The individual table entries are not legible in this copy.)

the variant domains of a process are derived using the DomainIntersection function provided by PolyLib [12] and the Fourier-Motzkin elimination [13] provided by the Omega library [14].

Fig. 8. Deriving the variant domains corresponding to the Rotate process (shown for T=6, N=7): for each function argument, the argument lists L1 (in0) to L5 (out2) collect the iteration domains of the active IPDs (IPD1-IPD4) and OPDs (OPD1-OPD3) over the (k, j) iteration space, and their intersections yield the variant domains V1-V9.






• The controller resource usage can vary a lot, depending both on the application and on the domain size (which influences the bitwidth of the operators in the datapath). It appears that the fixed area cost overhead caused by the sequential controller used in the partitioned approach is quite important, and makes this implementation strategy only viable for very large and complex domains. • In all cases, the partitioned controller is slower than its counterpart. This slow-down is mostly due to the sequential controller, which suggests that its realization deserves more attention.












• More generally, these two strategies provide very interesting results, even for very simple and small domains, and therefore suggest that the ROM table approach is not really appropriate in the context of Compaan. VI. RUN-TIME DISTANCE EVALUATION APPROACH A. Motivation If you look at the code in Figure 4, you can observe that in most of the cases an IPD/OPD selection will remain active for a certain number of consecutive cycles. By taking advantage of this regularity it is possible to further reduce the control computational cost of a process: instead of evaluating the set of predicates associated to an IPD/OPD at each cycle, we can compute for each IPD/OPD the number of cycles during which it will remain in the active state and load counters with these values (these values will later on be referred to as pattern distances). Then, at each firing of the process, these counters decrement themselves until they reach 0. They then have to be reloaded with a new value corresponding to the next control computations. The main problem with this approach is that we need one counter for each function argument. This is likely to be impractical in many cases since it is not rare in image processing algorithms to have functions with more than fifty arguments. Besides, scheduling the counter loading operations is also likely to be difficult. Therefore, we chose to merge the information of all the IPD/OPD active states into what we call variant domains, and then to compute the distances corresponding to the number of cycles/iterations during which the process remains in the same variant domain. B. Variant domains For each process firing, data is read/written from/to a configuration of active FIFO channels. Such a configuration, together with the process function, forms a process variant. The notion of variant was introduced in [11]. To each such FIFO channel configuration corresponds a configuration of IPDs/OPDs. The polytope domain over which this configuration remains unchanged is called a variant domain. Having the polytope description of the IPDs and OPDs,
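The counting scheme described in this motivation can be paraphrased as follows (a sketch of ours; the distance table, its indexing and the fire helper are illustrative - the paper derives the actual distances analytically from the variant domains):

    /* Sketch: firing control driven by pre-computed pattern distances.              */
    extern void fire(int variant);              /* one read/execute/write firing     */
    typedef struct { int variant; int distance; } segment_t;

    void run_process(const segment_t *schedule, int num_segments) {
      for (int s = 0; s < num_segments; s++) {
        int counter = schedule[s].distance;      /* cycles spent in this variant      */
        while (counter-- > 0) {
          fire(schedule[s].variant);
        }                                        /* counter reaches 0: reload the next distance */
      }
    }

Only one counter is active at a time, instead of one per IPD/OPD, which is exactly the point of merging the IPD/OPD states into variant domains.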









Consider a process function with m input/output arguments. For each function argument i there is a union L_i of disjoint polytope domains representing the active IPDs/OPDs that (depending on the iteration) deliver the data. We say that L_i is the argument list corresponding to function argument i. A variant domain V is therefore given by the intersection of the components of an arbitrary element taken from the Cartesian product of the argument lists:







V = D_1 ∩ D_2 ∩ ... ∩ D_m,   with (D_1, D_2, ..., D_m) ∈ L_1 × L_2 × ... × L_m   (2)

that contains at least one integer point (an empty variant domain corresponds to an IPD/OPD configuration that will never appear during the process execution). The emptiness of a variant domain is checked using Fourier-Motzkin elimination (3).
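In software terms, enumerating the variant domains amounts to walking the Cartesian product of the argument lists and keeping the non-empty intersections. The sketch below shows the idea with an abstract polyhedral-domain type; the domain operations stand in for the PolyLib/Omega calls mentioned earlier and are not their real APIs.

    /* Sketch: enumerating variant domains as non-empty intersections (illustrative API).   */
    #define MAX_ARGS     8
    #define MAX_DOMAINS  8
    typedef struct domain domain_t;                       /* abstract polytope domain        */
    extern domain_t *domain_intersect(const domain_t *a, const domain_t *b);
    extern int       domain_is_empty(const domain_t *d);  /* e.g. by Fourier-Motzkin elimination */

    /* arg_list[i][j]: j-th disjoint domain of argument list L_i; count[i]: domains in L_i; m <= MAX_ARGS. */
    void enumerate_variants(domain_t *arg_list[MAX_ARGS][MAX_DOMAINS], const int count[], int m,
                            void (*emit)(domain_t *variant)) {
      int idx[MAX_ARGS] = {0};
      for (;;) {
        domain_t *v = arg_list[0][idx[0]];
        for (int i = 1; i < m; i++)
          v = domain_intersect(v, arg_list[i][idx[i]]);   /* intersection of one tuple, cf. (2) */
        if (!domain_is_empty(v))
          emit(v);                                        /* keep only non-empty variant domains */
        int i = 0;                                        /* advance a mixed-radix counter over  */
        while (i < m && ++idx[i] == count[i]) idx[i++] = 0;  /* the product L_1 x ... x L_m      */
        if (i == m) break;
      }
    }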

Following the procedure explained before, the construction of the variant domains for the Vectorize process is graphically depicted in Figure 10. Hence, the process from Figure 4 contains 9 different variants: V1=(IPD2,IPD4,Vec,OPD1,OPD2,*) V2=(IPD1,IPD4,Vec,OPD1,OPD2,*) V3=(IPD1,IPD4,Vec,*,OPD2,OPD3) V4=(IPD2,IPD3,Vec,OPD1,OPD2,*) V5=(IPD1,IPD3,Vec,OPD1,OPD2,*)

V6=(IPD1,IPD3,Vec,*,OPD2,OPD3) V7=(IPD2,IPD3,Vec,OPD1,*,*) V8=(IPD1,IPD3,Vec,OPD1,*,*) V9=(IPD1,IPD3,Vec,*,*,OPD3)

The “*” symbol present in some of the variant formats means that the corresponding function argument, although it is produced, will not be distributed to any other process of the network. Each variant is active inside the corresponding variant domain, represented by a set of nested if-statements. Figure 10 shows the representation of the Vectorize process house-keeping code based on variant domains:


Fig. 10 (fragment). House-keeping code of the Vectorize process based on variant domains; each iteration of the k and j loops evaluates nested if-statements such as

        if ((j-1 == 0) && (k-T == 0))                    Execute(V3);
        if ((k-1 == 0) && (j-2 >= 0) && (-j+N-1 >= 0))   Execute(V4);

(the remainder of the listing is not legible in this copy).
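For completeness, the overall shape of such variant-based house-keeping code is sketched below (ours; the loop bounds are assumed, and the guards other than those of V3 and V4 are only indicated because they are not legible in the source):

    /* Sketch: variant-based process main loop (placeholder guards marked as such). */
    void P2_main_variants(void) {
      for (int k = 1; k <= T; k++) {
        for (int j = k; j <= N; j++) {
          if (/* V1 guard (not recovered) */ 0) Execute(V1);
          if (/* V2 guard (not recovered) */ 0) Execute(V2);
          if ((j-1 == 0) && (k-T == 0))                    Execute(V3);
          if ((k-1 == 0) && (j-2 >= 0) && (-j+N-1 >= 0))   Execute(V4);
          /* ... V5 to V9 likewise, one affine guard per variant domain ... */
        }
      }
    }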
