
MODELING LOAD ADDRESS BEHAVIOUR THROUGH RECURRENCES

L. Ramos, P. Ibáñez, V. Viñals and J.M. Llabería*
Dpto. Informática e Ingeniería de Sistemas, Univ. de Zaragoza, Zaragoza, Spain
*Dpt. Arquitectura de Computadors, Univ. Politècnica de Catalunya, Barcelona, Spain

ABSTRACT

Addresses of load instructions exhibit regular behaviour, which is captured by several models (locality, repetitive patterns, etc.) and exploited in processor and memory hierarchy design. Nevertheless, sparse and symbolic applications make intensive use of addressing patterns not entirely covered by current models. In this work we introduce a new recurrence among load pairs, called the "linear link", in order to identify more regularity in such applications. A linear link is a type of recurrence between the value read by a (producer) load and the address issued by a (consumer) load, which is detected by tracking dependences among loads on the fly. We consider a broad workload (Nas, Olden, Perfect, Spec95 and IAbench) and conclude that linear links together with stride recurrences can identify many address streams in symbolic and scientific applications traversing either dense or linked data structures, or compressed forms of sparse arrays. The combination of the two recurrences identifies more than 90% of the addresses in more than half the programs (24 out of 55), and more than 75% of the addresses in 90% of the programs (50 out of 55). Finally, we show several measures related to the use of linear links as address predictors for executing loads speculatively and for issuing data prefetches (prediction distance, ahead capacity, etc.).

I. INTRODUCTION

Usually, computer design searches for means of executing programs faster, with cheaper hardware, or in a more cost-effective way. Studying the dynamics of operation and addressing regularities is essential to achieve those goals in the design of the processor and memory hierarchy. Technological opportunities and constraints, together with a given application domain, frame this optimization process. So, the observation of high-level language patterns (data and operations) in scientific computation led to vector ISAs, and statistical study of (compiled) code execution, along with the goal of efficient pipelining, motivated the shift from CISC to RISC ISAs. Regularity mainly comes from code reuse (to traverse one or several data structures), from logical connection among near data (neighbouring data in the address space are often visited at near times) or from logical connection among close branches (the code path followed so far is a good predictor of the next path). A regularity model, by abstracting some essential aspects, is able to reconstruct a given event stream in a more compact form than the stream itself. Based on such models, many microarchitecture proposals track and exploit regularity on the fly. Examples are control flow prediction to reduce delay cycles when fetching instructions [19], dependence prediction between store/load pairs to guide out-of-order execution [5], or address prediction to prefetch data into memory levels closer to the processor [1][16]. In this work we study load address behaviour by means of recurrences, a class of regularity models.

In current superscalar microprocessors, loads can significantly increase execution time since they are often located at the beginning of a dependence chain which lies on a critical timing path. Therefore, great effort is devoted to reducing the two components of load latency: address computation time (from fetch to address issue) and data access time (memory service time). A multilevel cache hierarchy targets the second component, its ultimate goal being to serve all references at processor speed. To enhance cache hierarchy performance, ways to predict memory activity are becoming increasingly important. For example, several high-end microprocessors can carry out simple non-sequential prefetching, either transparently to the ISA [14] or by using compiler-inserted instructions [9]. Recently, as load hit latency has been increasing, the use of value or address regularity has been suggested to break the ILP barrier caused by data flow dependences. Value prediction directly feeds operands into the dependent instructions, which are speculatively executed [29]. On the other hand, address prediction allows loads to execute early in the pipeline, partially hiding the address computation time [23]; here we will term this concept address-speculative load execution, or simply load speculation. Regularity can be recognised in the address stream itself or in the operations performed to compute it [29]. Accordingly, we can consider two broad models: context-based models [16][3] and computational models [1]. Computational models allow accurate prediction of still-unknown values, and context models capture complex patterns, provided that their storage capacity is large enough to hold a relevant set of contexts. Our goal is to identify more regularity in the load address stream by means of simple computational recurrences, especially for symbolic or numeric data-sparse computations where the presence of stride patterns is moderate.
Thus, we extend the recurrences presented in [20] and [27] by introducing a new recurrence, suitable for single loads or for load pairs, that we call the linear link. A linear link relates the address issued by a consumer load (ai) to the value read from memory by its producer load (vi) in such a way that ai = αvi + β. A second contribution is the characterization of a broad workload combining numerical and symbolic applications. In that workload we separately measure linear link and stride regularity, showing their overlap as well as their joint recognizing ability. For the symbolic and data-sparse subset, we report that the linear link model adds a net contribution of around 30%, with the combined recognizer reaching 80%-90%.

This paper is organized as follows. Section II reviews the stride recurrence and introduces the value-address recurrence defining a linear link. Section III shows the way the linear link model deals with the parameters and the simple mechanism devised to detect dependence relationships. Section IV presents the benchmarks and experimental framework. Section V analyses how much regularity the stride and linear link models are able to identify. Section VI shows measures of the linear link potential to predict addresses, either for prefetching data or for address-speculative execution. In Section VII producer load patterns are analysed in depth, since the ability to advance consumer addresses depends on such patterns. Section VIII summarizes the findings.

II. COMPUTATIONAL RECURRENCES

A computational recurrence tries to mimic the address generation algorithm. First, we review the stride recurrence, which only considers the address stream of individual loads. Next, we introduce linear links, a type of value-address recurrence which takes load pairs into account, so that the address issued by one load can be characterized as a linear transformation of the value read by the other load. Much work has gone into designing and evaluating stride pattern recognizers intended to predict addresses, either for prefetching [1][15][22] or for load speculation [23]. Other works focus on patterns not covered by stride recurrences. Mehrotra and Harrison introduced in [20] a value-address recurrence to model a single load visiting either pointer-chained records or 4-byte elements chained by index. Recently, dependence analysis was introduced by Roth et al. to partially extend the previous recurrence to load pairs. They relate the value read by a (producer) load (vi) with the address register used by a later (consumer) load (ri) if both values match (ri = vi) [27].

A. Stride Recurrences

Stride recurrences model typical array traversals. Recursive codes can also show this pattern, since stack references from a given load can be equispaced across multiple recursive calls (stack allocation shapes a linear layout of equally-sized activation frames). Scalar variable referencing also issues address streams included in this pattern. The stride recurrence states that ai, the address issued by the i-th instance of a load, can be computed as ai = ai-1 + s, s being a constant stride defined as s = Si = Si-1 = ..., where Si = ai - ai-1. Scalar variable referencing arises when the stride is zero; this case is also termed the last address recurrence.

B. Value-Address Linear Recurrences

Many symbolic and scientific applications do significant parts of their processing with data logically arranged in lists, trees, graphs or several compressed forms of sparse arrays. Elements of those structures are explicitly linked together through pointers or indexes to form the overall structure. To access a target element it is often required to cross over the previous elements. So, reading data from one or more elements (producer loads) is needed in order to access the target element (consumer load). Therefore, in a given execution path, the address computation of a consumer load can depend on several memory values read by several producer loads. A particular case arises when the consumer load effectively depends on a single producer load, because the values read by the other producers are constant in that execution path or because other producers do not exist. Linear links are intended to model this typical case. Figure 1 shows two code examples illustrating this situation: a) the inner product of a matrix row with a vector, with the matrix stored in Compressed Sparse Row (CSR) format, and b) the traversal of a linked list of records.

a) Inner product of row i of matrix A with vector x; matrix stored in CSR format:

    do i = 1, n
      do k = jA(i), jA(i+1) - 1
        y(i) = y(i) + A(k)*x(jA(k))
      enddo
    enddo

b) Traversal of a linked list of records:

    for (e = first; e; e = e->next) {
      if (e->dat == ...) { ... }
    }

Figure 1: Two examples containing linear links. a) The producer load reads the index jA(k) and the consumer load reads x(jA(k)). b) The producer load (e = e->next) feeds two consumer loads (e->dat and e = e->next).

We can model this behaviour by introducing the concept of a linear link. We say a producer/consumer load pair forms a linear link if they are related by the following recurrence: ai = αvi + β, where vi is the value read by the producer load, ai is the address issued by the consumer load and (α, β) are the recurrence parameters, with β = ai-1 - αvi-1 = ai-2 - αvi-2 = ... In this work parameter α is selected from the fixed set {1, 2, 4, 8}. Table 1 summarizes this recurrence and the interesting particular cases we describe next.

Type               | Recurrence         | Parameters
-------------------|--------------------|--------------------------------------------
two loads          | ai = F(vi, α, β)   | ai, ai-1 ∈ consumer; vi, vi-1 ∈ producer
  index link       | ai = αvi + β       | β = ai-1 - αvi-1 = ...;  α = 1, 2, 4, 8
  pointer link     | ai = vi + β        | β = ai-1 - vi-1 = ...;   α = 1
single load        | ai = F(vi-1, α, β) | ai, vi-1, ai-1, vi-2 ∈ same load
  index self-link  | ai = αvi-1 + β     | β = ai-1 - αvi-2 = ...;  α = 1, 2, 4, 8
  pointer self-link| ai = vi-1 + β      | β = ai-1 - vi-2 = ...;   α = 1

Table 1: Value-address recurrences. ai is an address, vi a value read from memory.

Linear links can be classified as index links or pointer links. In the first case the producer load reads an index from memory, parameter β is the base address of an array and parameter α is the size in bytes of the indexed elements (1, 2, 4 or 8). In the second case the produced value is the base address of a record and parameter β is an offset within the record (α = 1). When pointer links appear, addresses are surely being used to link elements, whereas the existence of index links indicates traversals based on array indexes. On the other hand, a linear link having the same load as producer and consumer is termed a linear self-link. Using these terms, we can say that the prefetch mechanism devised by Mehrotra and Harrison [20] detects stride and self-link recurrences with α = 1, 4. Likewise, the prefetch mechanism of Roth et al. [27] detects pointer-link and pointer self-link recurrences (α = 1), but only if the value read from a producer is not modified before the consumer uses it.
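As a small, hypothetical illustration of an index link, the C sketch below checks that every indirect access x[jA[k]] satisfies the recurrence ai = αvi + β, with vi the index read by the producer, α the element size of x and β its base address. The array names and contents are invented for the example; this is not code from the paper.

```c
#include <stdint.h>

/* Illustrative only: verify the index-link recurrence
   a_i = alpha*v_i + beta for the indirect access x[jA[k]].
   alpha is the element size of x, beta its base address. */
static int x[10];

int index_link_holds(void)
{
    int jA[3] = {7, 2, 5};                  /* index array (producer data) */
    intptr_t beta  = (intptr_t)x;           /* base address of x           */
    intptr_t alpha = (intptr_t)sizeof x[0]; /* element size in bytes       */

    for (int k = 0; k < 3; k++) {
        intptr_t v = jA[k];                 /* value read by the producer  */
        intptr_t a = (intptr_t)&x[jA[k]];   /* address issued by consumer  */
        if (a != alpha * v + beta)
            return 0;                       /* recurrence violated         */
    }
    return 1;                               /* every access follows the link */
}
```

By C pointer arithmetic the identity holds for any element type, which is why a fixed α set of {1, 2, 4, 8} covers the common cases.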

III. PARAMETER UPDATING AND LINK DETECTION

Next we show how parameters are updated and links are detected. To store recurrence parameters we use tables indexed by load program counter. Since we are interested in characterization, we consider unlimited table sizes to prevent the results from being biased by resource constraints.

A. Stride Recurrences

For each load, a table entry keeps its last issued address (ai-1), the stride parameter (s) and a two-bit saturating counter Q that gives confidence to the s parameter (regularity firmness). Q is incremented if two consecutive stride evaluations match (Si = s); otherwise it is decremented. Hysteresis is used to update the stride: in a regularity burst, two mistakes in a row are required to update the stride parameter (see footnote 1) [11]. We consider the current address ai to belong to a stride stream whenever the confidence level is high and the parameter s matches the current stride Si, namely, ai ∈ S if Q ∈ {2, 3} and Si = s, where S is the set of all issued addresses belonging to stride streams. In later sections, to quantify S we will use the percentage of addresses in S over all issued addresses.
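The stride entry update just described can be sketched as follows. This is an illustrative model of the table entry (names are ours), assuming the confidence and hysteresis rules stated above, not the paper's implementation.

```c
/* Per-load stride entry: last address a_{i-1}, stride parameter s,
   and a two-bit saturating confidence counter Q with hysteresis. */
typedef struct {
    long last;   /* a_{i-1}: last issued address      */
    long s;      /* stride parameter                   */
    int  q;      /* two-bit saturating counter (0..3)  */
    int  warm;   /* set once a first address is seen   */
} StrideEntry;

/* Feed the next address a of this load; return 1 when a belongs to a
   stride stream, i.e. Q is in {2,3} and the current stride S_i == s. */
int stride_update(StrideEntry *e, long a)
{
    int hit = 0;
    if (e->warm) {
        long si = a - e->last;            /* current stride S_i */
        if (si == e->s) {
            hit = (e->q >= 2);            /* confident match    */
            if (e->q < 3) e->q++;
        } else {
            /* hysteresis: at full confidence (Q == 3) the first
               mismatch only lowers Q, so s survives one mistake */
            if (e->q != 3) e->s = si;
            if (e->q > 0) e->q--;
        }
    } else {
        e->warm = 1;                      /* first sample: no stride yet */
    }
    e->last = a;
    return hit;
}
```

Feeding the addresses 0, 8, 16, 24, 32, 40 makes the entry lock onto s = 8; the fifth and sixth addresses are classified as belonging to the stride stream.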

B. Linear Link Recurrences

A load pair can form a linear link if there is an instruction path between both loads in the flow-dependence graph. This path computes the consumer address from the producer data, and we will term it the address-computing chain (or simply the chain). During program execution we use a Dependence Table to detect the load ends of such address-computing chains (producer and consumer loads). As new links are detected they are entered into a Link Table. Later, each link re-execution updates parameters and checks for linearity, and once a given confidence level is reached we count consumer addresses as belonging to linear link streams.

The idea is to identify an address-computing chain by means of its producer load, tagging all destination registers with an identifier of that producer load. However, a consumer load can depend on several producers. Typically, in iterative program structures, a linear link will arise between the consumer and the most recent (youngest) producer. So, in order to select the right chain for each consumer, an algorithm based on load age can be used to prune the chain tree. Later in this section we suggest an improvement to this algorithm.

Dependence Table (DT). It has an entry for each logical register, with two fields: TAG and TS (time stamp). When an instruction executes, the DT fields corresponding to its destination register are updated in the following way: i) load instruction (potential producer): put its program counter into TAG and update TS to reflect the current time; ii) instruction delivering a constant, or other typified instructions: mark TAG as No-load-dependent (NLD, see footnote 2); and iii) arithmetic/logic instruction (rd = rs1 op rs2): TAG and TS are inherited from the DT entry pointed to by the source register with the lowest TS (most recent load); if both source entries are marked NLD, the destination entry is also marked NLD. This age selection implements the pruning of the address-computing chain tree. Figure 2 shows an example of dependence chain building.

Link Table (LT). For each link we store in the LT the last value read by the producer (vi), the four β parameters, one for each considered α (1, 2, 4, 8), and four two-bit counters Qα giving confidence to each β. Any time a consumer executes and issues its address ai, we check the current βi = ai - αvi for equality against the four stored β parameters. If there is a match for a given α, its corresponding counter Qα is incremented; otherwise Qα is decremented and β is updated with βi.

1. Stride updating rule: s = Si whenever there is no match (Si ≠ s), except if Q = 3.
2. For instance, the save instruction in SPARC delivers a new register window with 16 fresh registers, which are all marked as NLD.

Figure 2: Example of dependence chain building (code c = T[k] inside a loop; loads L$4 and L$5).

We count a consumer address as ai ∈ L if Qα ∈ {2, 3} and βi = β for some α, L being the set of all addresses belonging to linear link streams. Here hysteresis is not used to delay updating parameter β, since a mismatch evidences a change in the base address used and should be taken into account immediately.

Detection Link Improvement. Consider a consumer load depending on several producers located in a single execution path. In this situation a linear link can arise only if all producers but one read constants (see footnote 3). Our mechanism will properly detect such a link only if the producer which reads variable values is the last executing producer. In order to handle this situation we record two chains per register (two fields per DT entry). We first check linearity with the youngest chain and, if confidence is lost (Qα = 0 for all α), we enter a new link into the LT and switch the check to the other chain. This improvement significantly increases linear link detection in several programs: it adds between 1% and 8.5% of the addresses to the identified regularity in 6 programs out of 55. Consequently, we activate the two-chain propagation in all the measures that follow.
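A minimal sketch of the Link Table entry update, under the assumptions above (four α candidates, two-bit Qα counters, no hysteresis on β). The structure and names are ours, for illustration; the paper does not specify a concrete implementation.

```c
/* Link Table entry: one beta hypothesis and one two-bit confidence
   counter Q_alpha per candidate alpha in {1, 2, 4, 8}. */
static const long ALPHA[4] = {1, 2, 4, 8};

typedef struct {
    long beta[4];  /* beta = a_i - alpha*v_i, one per alpha */
    int  q[4];     /* two-bit saturating counters (0..3)    */
} LinkEntry;

/* Called when the consumer issues address a after its producer read
   value v. Returns 1 if a belongs to a confident linear link stream
   (Q_alpha in {2,3} and beta_i == beta for some alpha). */
int link_update(LinkEntry *e, long v, long a)
{
    int hit = 0;
    for (int k = 0; k < 4; k++) {
        long beta_i = a - ALPHA[k] * v;   /* current beta_i        */
        if (beta_i == e->beta[k]) {
            if (e->q[k] >= 2) hit = 1;    /* confident match       */
            if (e->q[k] < 3) e->q[k]++;
        } else {
            if (e->q[k] > 0) e->q[k]--;
            e->beta[k] = beta_i;          /* no hysteresis on beta */
        }
    }
    return hit;
}
```

For an index link with α = 4 and β = 1000 (consumer addresses 1012, 1020, 1028, 1036 for producer values 3, 5, 7, 9), the entry gains confidence over the first re-executions and classifies the fourth address as linear-link regular.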

IV. SELECTED WORKLOAD

Our experimental set relies on five benchmark suites: Olden [26], Spec95, Nas [2], Perfect [8] and IAbench.
• Olden programs mainly work with (integer) dynamic data structures chained by means of pointers; we use the serial version of the original, multicomputer-targeted workload.
• Spec95int is an assorted collection of integer programs.
• Spec95fp, Nas and Perfect are mostly dense (floating-point) numeric applications.
• The programs we have gathered in the Indirect-Access Benchmarks (IAbench) suite are intensive in computations involving index arrays as auxiliaries to access compressed data structures (numeric sparse computations).

3. For example, a loop using pointer variables to store base addresses can re-load those pointer variables in each iteration for safety reasons (the compiler was not able to exclude aliasing) but, finally, aliasing does not occur at run time.

IAbench (C and Fortran)         Olden (C)                       Spec95int (C)
prog  cs/#inst   %si  %ld       prog  cs/#inst  %si  %ld        prog  cs/#inst    %si  %ld
bil   0/526      100  24        bh    0/1432    100  27         com   0/4000        8  13
hop   0/842      100  33        em3   0/83       93  25         ijp   5000/4000    22  13
lps   0/1362     100  26        per   0/1199     36  11         m88   10000/4000   18  18
nis   0/684      100  38        pow   0/896     100  19         cc2   0/545       100  17
onm   0/2090     100  25        tre   0/166     100  15         go    10000/4000   46  19
smv   0/2733     100  19        bis   0/330      14  15         li2   0/4000        7  21
che   0/1535     100  16        hea   0/165      86  31         prl   500/4000     16  20
s13   0/589      100  35        mst   0/180     100  15         vor   4000/4000    11  19
spi   4500/4000   40  23        tsp   0/353      99  19         Aver               29  17
sqr   0/1031     100  22        vrn   0/241       3  26
Aver              94  26        Aver             73  20

Spec95fp (Fortran)              Nas (Fortran)                   Perfect (Fortran)
prog  cs/#inst   %si  %ld       prog  cs/#inst  %si  %ld        prog  cs/#inst   %si  %ld
app   4000/4000   13  24        abt   0/1052    100  44         lg    0/1165     100  17
aps   3000/4000   15  24        asp   0/843     100  33         lw    0/4000      22  31
fp2   3000/4000    2  23        buk   67/13     100  24         mt    0/645      100  26
hyd   1000/4000    9  24        cgm   0/512     100  37         na    0/3340     100  32
mgr   3000/4000  130  35        emb   0/4000     44  28         oc    0/4000      48  24
su2   6000/4000   20  27        fft   0/1587    100  24         sd    0/1379     100  23
swi   500/4000    11  32        Aver             91  31         sm    1800/4000   14  19
tom   5000/4000   18  27                                        sr    0/4000      50  40
tur   1000/4000    2   9                                        tf    0/1746     100  31
wav   1000/4000   12  26                                        ti    0/2458     100  38
Aver              23  25                                        ws    1500/4000   84  23
                                                                Aver              74  28

Table 2: Benchmarks used: (cs) = cold start length x10^6, (#inst) = number of simulated instructions x10^6, (%si) = percentage of cold start length plus simulated instructions over the total number of instructions, (%ld) = load percentage.

IAbench complements Olden in two ways: by adding sparse computations and by increasing the number of non-kernel applications. In Appendix A we give a brief overview of each IAbench program along with its input specification. We use Sun SPARC V7 as the target architecture, compiling with the highest optimization level (cc -O4, f77 -O4, gcc -O3) for SunOS Release 4.1.3_U1 and including the library code in the binaries (-Bstatic or -static option). Addresses, load values and register identifiers have been collected using the Shade tool [6]. Most benchmarks were run to completion but, for the largest ones, only 4000 million instructions were simulated (sometimes after an initialization period in which no statistics were gathered). See Table 2 for details.

V. LINEAR LINK AND STRIDE RECURRENCE CHARACTERIZATION

According to the models presented in Section III we have measured the regularity sets L (linear link) and S (stride), distinguishing in the L∩S set the streams that follow both recurrences at once. Linear-link-only regularity is termed L-S (see Figure 3.a). Overall regularity is named L∪S. We further split stride regularity into two disjoint sets named sc (scalar referencing, s = 0) and str (true-stride referencing, s ≠ 0), with S = sc ∪ str (see Figure 3.b). Moreover, we will show scalar and true-stride regularity within the L∩S set (sc∩L, str∩L) separately, as well as in the remainder of S (sc-L, str-L); see Figure 3.c.

Figure 3: Set representation of address regularity. a) L, S, L∩S and L-S; b) S = sc ∪ str; c) sc∩L, sc-L, str-L and str∩L.

To ease the study we split the suites into two relatively similar groups. Group A: IAbench, Olden and Spec95int, which share a noticeable use of dynamically linked or index-based data structures; and Group B: Spec95fp, Nas and Perfect, which mostly perform numeric computations on dense arrays. Table 3 shows average scalar regularity (split into sc-L and sc∩L), true-stride regularity (split into str-L and str∩L), linear-link-only regularity (L-S) and overall regularity (L∪S) for each suite. These averages represent the address percentage a given set accounts for, weighting all programs in a suite equally and thus assuming that each program represents a given behaviour, no matter how many instructions it executed. Overall regularity (L∪S) is over 80% in all suites, and in Group B it is slightly higher than in Group A. Scalar-only regularity (sc-L) is greater than 28% in all suites, and very similar in Groups A and B. The overlap between scalar and linear link (sc∩L) is smaller than 10.3% in Group A, whereas it is smaller than 2.4% in Group B. In summary, around one third or more of all references follow scalar regularity (sc). True-stride regularity (str-L) is very high (>40%) and similar in all suites of Group B. In contrast, Group A shows smaller and very scattered percentages which, anyhow, are not negligible. The overlap between true-stride and linear link (str∩L) is important when compared with stride-only (str-L) in Group A (even dominant in Olden), but in Group B such overlap is less important. Overall, around half the references follow true-stride in Group B, and roughly a third of that in Group A.

Group  Suite      sc-L  sc∩L  str-L  str∩L  L-S   L∪S
A      IAbench    28.8   3.9   24.0    5.9  28.5  91.1
A      Olden      31.5  10.3    4.6    4.8  29.5  80.7
A      Spec95int  34.5   6.1   11.7    1.2  27.8  81.4
B      Spec95fp   28.6   2.4   57.4    7.8   1.0  97.1
B      Nas        33.7   0.4   42.6    3.8  16.7  97.1
B      Perfect    31.4   2.3   50.2    2.4   1.7  88.0

Table 3: Average percentages of addresses following the recurrence defined by each regularity set.

Linear-link-only regularity (L-S) accounts for more than one fourth of the references in Group A, whereas it is negligible in Perfect and Spec95fp (and also in Nas if we remove buk and cgm programs). Figure 4 plots the contribution of each regularity set for all programs. Next we will analyse each group separately, looking at individual programs occasionally in order to explain noticeable behaviour.

A. Group B: Spec95fp, Nas and Perfect

As expected, scalar and stride referencing is dominant in all programs and hence S is very high (80-96%, see the first four bottom bars in Figure 4.b). The linear-link-only recurrence is significant in buk and cgm from Nas.
• buk sorts an integer array using an auxiliary index array, and cgm solves an unstructured sparse linear system with the matrix stored in a compressed format.

If we exclude these two programs from the Nas average, the regularity contributed by the linear-link-only recurrence in Group B is negligible.

B. Group A: IAbench, Olden and Spec95int

We divide the Group A programs into two subgroups according to the weight of linear-link-only regularity (Table 5).
• Subgroup 1 gathers the programs with (L-S) > 15% and (str-L) small relative to S. In this subgroup at least 50% of the addresses follow linear link recurrences, around 25% follow scalar-only recurrences and less than 5% of the addresses follow true-stride-only recurrences.
• Subgroup 2 shows a borderline behaviour between dense-numeric and linear-link-intensive applications. These programs are characterized by comparable linear-link-only and true-stride-only regularities (6-14% and 8-37%, respectively). Scalar-only regularity is dominant in all suites (30-44%). In contrast with the former subgroup, stride regularity (67-77%) dominates link regularity by far (12-26%).

Summarizing, from this subsection we can conclude that high degrees of address regularity can be picked up from a random sample of numeric and symbolic programs only when both linear link and stride recurrences work together in modelling such regularity.

Subgroup 1
Suite      sc-L  sc∩L  str-L  str∩L  L-S   L∪S    S     L
IAbench    26.4   5.3    4.5    3.3  49.1  88.6  39.5  57.7
Olden      23.6   7.9    1.0    2.7  52.2  87.3  35.1  62.8
Spec95int  28.8   9.4    4.1    0.4  39.2  82.0  42.8  49.0

Subgroup 2
Suite      sc-L  sc∩L  str-L  str∩L  L-S   L∪S    S     L
IAbench    30.4   2.9   37.0    7.6  14.8  92.7  77.9  25.3
Olden      39.3  12.7    8.3    6.8   6.8  74.0  67.2  26.3
Spec95int  44.0   0.6   24.4    2.5   8.9  80.5  71.6  12.0

Table 5: Average regularities in Subgroups 1 and 2.

C. Linear Link Recurrences: α parameter and link/self-link breakdown

Table 1 in Section II introduced several particular cases of linear links: pointer links (α = 1) and index links (α = 1, 2, 4, 8). Self-links were also introduced as the particular case of linear links where producer and consumer coincide. Figure 5 shows linear link regularity broken into seven classes, separating self-link (below the x-axis) from link (above the x-axis) and taking the α parameter value into account (α = 1 in dotted bars and α = 2, 4, 8 in striped bars). We show individual data for all Group A programs (IAbench, Olden and Spec95int) and averages for the six suites. Self-link with α = 8 was not found in any program, and α = 2 is a very infrequent case.

Linear link recurrences with α different from 1 are very common, mainly in numerical programs, to access elements of either 4 or 8 bytes by means of index arrays (in IAbench, Spec95fp and Nas they account for 56%-74% of all addresses in linear link streams). Note that the prefetch mechanism devised by Roth et al. [27] is only able to detect pointer-link and pointer self-link recurrences (α = 1).

Figure 5: Linear link breakdown: self-link ("sl" bars below the x-axis) and link ("l" bars above the x-axis); α = 1 (dotted bars) and α = 2, 4, 8 (striped bars). Individual data are shown for Group A only; the six suite averages are shown on the right.

Self-link contribution to regularity is significant in several benchmarks. For instance, in s13, hea and spi it represents between 24.2% and 27% of L addresses. Nevertheless, for the whole of Group A, the self-link contribution to regularity is rather small (5.6-16.7% of L addresses).

VI. APPLICATIONS

Several applications can benefit from linear link regularity. In this work we focus on two microarchitectural applications: data prefetching [1][15] and address-speculative load execution [23]. Both applications require the generation of addresses in advance; in our experiments, these are consumer load addresses. So, each time a load is executed, the suitable bookkeeping actions must be performed on a hardware implementation of the Dependence and Link Tables, as described in Section III. We have not carried out comprehensive measurements to assess the required size of such hardware tables, since that is beyond the scope of this paper. Nevertheless, we have simulated a 4K-entry, direct-mapped Link Table with eight (four) consumers per entry, and found that, on average, 95% (84%) of the linear link regularity is still recognized for all programs in the Subgroup 1 defined in Section V.

A. Prefetch

Data prefetching tries to overlap the load miss latency with the execution of the instructions prior to a given missing load. We show two measures related to the use of the linear link recurrence as a prefetch predictor: the miss ratio induced by the consumer loads (those receiving correct producer predictions), and the number of instructions which can be overlapped with correct prefetches. The m column in Table 6 shows the average data miss ratios for all suites in a 2-way set-associative cache of 16KB with a block size of 16B. The mS, mL and mL∪S columns are the miss ratios of the addresses included in the regularity sets S, L and L∪S, respectively. In Group A (IAbench, Olden and Spec95int) 57%, 88% and 40% of the total misses belong to addresses in the L set.

Group A     m     mS    mL    mL∪S
IAbench     8.9   3.2   5.1   8.1
Olden      11.4   2.9  10.0  10.6
Spec95int   3.0   0.6   1.2   1.6

Table 6: Average miss ratio percentages of all the addresses (m) and of the addresses included in the regularity sets S, L and L∪S (mS, mL, mL∪S), using a 16KB, 2-way set-associative cache with block size = 16B.

Regarding the amount of available prefetch overlap, the number of executed instructions between producer and consumer loads, which we term the link distance, is a good approximation. Figure 6 shows the link distance distribution for all programs in Group A and the averages for all suites. In Group A, link distances below 5 instructions are observed in 37.4-49.3% of the addresses belonging to linear links. Greater distances, above 64 instructions, only account for small link percentages (2.8-4.3%). Roth et al. reported noticeable performance improvements in the Olden suite when a restricted form of linear-link-based prefetching is used in an out-of-order 4-way processor [27], despite the small link distances we have measured.

Figure 6: Link distance distributions for consumer addresses belonging to the L set, in four distance intervals (1-4, 5-16, 17-64 and >64 instructions). Percentages are relative to the total number of issued addresses.

B. Address-speculative load execution

Address generation interlocks delay the issue of load instructions. We envisage an engine using linear links to compute the addresses of consumer loads in a predictive way. Thus, consumer loads can proceed without waiting for their operands, and the consumer-dependent instructions can execute speculatively. Observe that a small link distance implies that a producer and its consumer probably share the instruction window of an out-of-order processor concurrently. In this case, using the linear link recurrence to predict the consumer address can be seen as collapsing the address-computing chain. We term link computational latency, or link latency for short, the number of instructions belonging to an address-computing chain. A zero link latency means no value transformation between producer and consumer [footnote 4]. Figure 7 shows the link latency distributions for all programs in Group A and the averages for all suites. In the Olden suite, zero link latency is dominant. In contrast, in IAbench and Spec95int, link latencies greater than zero are frequent (67.6% and 51.3% of L addresses, respectively).

Footnote 4: If we assume that the cost of a prediction is one cycle, a minimum reduction of link latency cycles can result. This assumes making a prediction in the cycle following the delivery of the value read by a producer. To do so, a shift and an addition have to be done in a processor cycle. Functional units implementing those operations in a single cycle have been built, for instance, in POWER2 [30].

Figure 7: Link latency distributions for consumer addresses belonging to the L set. Five link latencies (0, 1, 2, 3, >=4) are considered. Percentages are relative to the total number of issued addresses.

Using the link latency definition, we can say that the regularity identified through the prefetch mechanism by Roth et al. [27] can be approximated by that of linear links with α=1 and zero link latency.
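The α=1, zero-link-latency case corresponds to plain pointer chasing, sketched below (our illustration; the struct and function names are hypothetical): the value returned by the producer load (`p->next`) is, with no intervening transformation, the address issued by the next consumer load instance.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical pointer-chasing example: a linear link with alpha = 1
 * and zero link latency, since no instruction transforms the loaded
 * value before it is used as the next load address. */
struct node { int key; struct node *next; };

int list_sum(const struct node *p)
{
    int s = 0;
    while (p != NULL) {
        s += p->key;     /* load off the (predictable) node address */
        p = p->next;     /* producer load: its value IS the next address */
    }
    return s;
}
```

This is exactly the pattern targeted by the Roth et al. prefetch mechanism [27]; linear links generalize it to non-unit scales and nonzero link latencies.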

VII. BEHAVIOUR OF PRODUCER LOADS

In both applications (prefetching and address-speculative execution), the advance ability can be insufficient if the amount of available work for overlapping (link distance) is limited. Therefore, predicting a consumer address K instances ahead may be needed. For a linear link to predict a consumer address, the value returned by the producer is needed, requiring a memory access. Therefore, the delay in predicting a consumer address K instances ahead depends in turn on the ability to predict the producer address.

Figure 8: Consumer addresses belonging to the L set split according to the recurrence followed by their producer loads (bars: Vsc, Vstr, Asc, Astr, Cycle, Cycle-C, Tail, Middle).

In this section we evaluate some important cases where the consumer address can be predicted without the above-mentioned limitations. Figure 8 splits the addresses belonging to the L set according to the advance ability. The meaning and the weight of each bar is explained by considering the following five cases:

1) If the value stream read by a producer follows a scalar or stride recurrence, the consumer address stream also follows the same kind of recurrence. These addresses exactly match the L∩S regularity set defined in Section V. They appear in the plot below the x-axis, labelled Vsc and Vstr according to their scalar or stride pattern. For the three Group A suites, Vsc and Vstr together amount on average to between 21% and 34% of the L addresses.

2) If the producer address stream follows either a scalar or stride recurrence, then, in order to advance producer addresses K instances ahead, a simple computation is required (a_{i+K} = a_i + K·stride). Therefore, to predict the consumer address K instances ahead as well, only one memory access is required. Figure 9.a shows an example with this behaviour (index-link load: s=s+c[j]).

[Figure 9: producer/consumer example code fragments, truncated in extraction.]
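Case 2 above can be sketched as follows (our illustration; the function name, K and the arrays are hypothetical): since the producer's address stream is a stride recurrence, its address K instances ahead is computed directly (a_{i+K} = a_i + K·stride), and a single memory access to that advanced producer instance yields the predicted consumer address.

```c
#include <assert.h>

/* Hypothetical K-ahead prediction for an index-link load such as
 * s += c[b[i]]: advance the producer's address by K strides, read
 * the producer once, and form the consumer address from its value. */
const int *predict_consumer_K_ahead(const int *b_cur, int stride_elems,
                                    int K, const int *c)
{
    /* a_{i+K} = a_i + K*stride: no memory access needed for this step */
    const int *b_ahead = b_cur + K * stride_elems;
    /* one memory access to the producer yields the consumer address */
    return &c[*b_ahead];
}
```

The key point is the cost model: advancing the producer is pure arithmetic, so predicting the consumer K instances ahead costs exactly one memory access regardless of K.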
