
CONTENTS

101 年台大電機 ..... 3
100 年台大電機 ..... 9
99 年台大電機 ..... 14
98 年台大電機 ..... 18
97 年台大電機 ..... 23
101 年台大資工 ..... 27
100 年台大資工 ..... 33
99 年台大資工 ..... 37
98 年台大資工 ..... 40
97 年台大資工 ..... 44
101 年台聯大電機 ..... 49
100 年台聯大電機 ..... 57
99 年台聯大電機 ..... 66
98 年台聯大電機 ..... 74
97 年清大電機 ..... 80
97 年清大通訊 ..... 86
101 年清大資工 ..... 88
100 年清大資工 ..... 91
99 年清大資工 ..... 97
98 年清大資工 ..... 103
97 年清大資工 ..... 106
101 年交大資聯 ..... 111
100 年交大資聯 ..... 121
99 年交大資聯 ..... 132
98 年交大資聯 ..... 139
97 年交大資聯 ..... 147
97 年交大電子所 ..... 159
97 年交大電控 ..... 166
101 年成大資工 ..... 170
100 年成大資工 ..... 172
99 年成大資工 ..... 175
98 年成大資工 ..... 179
97 年成大資工 ..... 181
101 年成大電機 ..... 184
100 年成大電機 ..... 189
99 年成大電機 ..... 194
98 年成大電機 ..... 198
97 年成大電機 ..... 202
101 年成大電通 ..... 206
100 年成大電通 ..... 209
99 年成大電通 ..... 211
98 年成大電通 ..... 213
97 年中央電機 ..... 215
101 年中央資工 ..... 220
100 年中央資工 ..... 224
99 年中央資工 ..... 227
98 年中央資工 ..... 232
97 年中央資工 ..... 237
101 年中正資工 ..... 243
100 年中正資工 ..... 247
99 年中正資工 ..... 251
98 年中正資工 ..... 253
97 年中正資工 ..... 257
101 年中正電機 ..... 259
101 年中山電機 ..... 265
97 年中山電機 ..... 268
101 年中山資工 ..... 272
98 年中山資工 ..... 280
97 年中山資工 ..... 288
101 年中興電機 ..... 298
100 年中興電機 ..... 303
101 年中興資工 ..... 309
97 年政大資科 ..... 312
101 年暨南資工 ..... 315
99 年暨南資工 ..... 318
99 年暨南資工 ..... 322
98 年暨南資工 ..... 325
97 年暨南資工 ..... 327
100 年台師大資工 ..... 331
99 年台師大資工 ..... 333
98 年台師大資工 ..... 337
97 年台師大資工 ..... 339
101 年彰師大電子 ..... 343
100 年彰師大電子 ..... 346
99 年彰師大電子 ..... 349
98 年彰師大電子 ..... 351
97 年彰師大電子 ..... 355
101 年彰師大資工 ..... 360
100 年彰師大資工 ..... 363
99 年彰師大資工 ..... 366
98 年彰師大資工 ..... 369
101 年台科大資工 ..... 372
101 年台科大電子 ..... 374
100 年台科大電子 ..... 376
99 年台科大電子 ..... 378
98 年台科大電子 ..... 381
97 年台科大電子 ..... 384
101 年東華資工 ..... 388
101 年海洋資工 ..... 390
100 年海洋資工 ..... 393
97 年海洋資工 ..... 398

101 年台大電機

1.

Which of the following statements are true?
(a) The input/output units are the devices that allow a computer system to communicate and interact with the outside world as well as store information.
(b) A disk stores information in units called sectors.
(c) An I/O controller handles the details of input/output and compensates for any speed differences between I/O devices and other parts of the computer.
(d) A register is a storage cell that holds the operands of an arithmetic operation and that, when the operation is complete, holds its result.
(e) The three components of the ALU - the registers, the interconnections between components, and the ALU circuitry - are together called the datapath.

Answer: (b), (c)
註(a):Input/output units are not intended to store information.
註(d):An accumulator is a storage cell that holds the operands of an arithmetic operation and that, when the operation is complete, holds its result.
註(e):The registers, the interconnections, and the ALU circuitry are components of the CPU, not components of the ALU.

2.

A processor has a clock rate of 500 MHz, and the following measurements have been made using a simulator.

Instruction class   CPI   Frequency
A                   2     40%
B                   3     25%
C                   3     25%
D                   3     10%

Later the compiler of the processor was improved to enhance the performance. The instruction improvements from this enhanced compiler have been estimated as follows.

Instruction class   Percentage of instructions executed vs. old compiler
A                   90%
B                   80%
C                   85%
D                   90%

Which of the following statements are true?
(a) Using the old compiler, the MIPS (million instructions per second) for the processor is 192.3.
(b) Using the old compiler, the CPI (clock cycles per instruction) of the processor is 2.6.
(c) Using the enhanced compiler, the CPI for the processor becomes 2.5826.
(d) Using the enhanced compiler, the processor becomes 1.008 times faster than using the old compiler.
(e) None of the above statements are true.

Answer: (a), (b), (c)
註(a):CPIold = 2×0.4 + 3×0.25 + 3×0.25 + 3×0.1 = 2.6, MIPS = 500 / 2.6 = 192.3
註(c):CPInew = (2×0.9×0.4 + 3×0.8×0.25 + 3×0.85×0.25 + 3×0.9×0.1) / (0.9×0.4 + 0.8×0.25 + 0.85×0.25 + 0.9×0.1) = 2.2275 / 0.8625 = 2.5826
註(d):Speedup = (2.6×IC) / (0.8625IC×2.5826) = 1.167, not 1.008, so (d) is false.
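The note's arithmetic can be reproduced with a short script (an illustrative sketch, not part of the original solution; the instruction mix and CPIs are those given in the question):

```python
# CPI, MIPS, and speedup for the old vs. enhanced compiler (500 MHz clock).
cpi = {"A": 2, "B": 3, "C": 3, "D": 3}
freq = {"A": 0.40, "B": 0.25, "C": 0.25, "D": 0.10}
keep = {"A": 0.90, "B": 0.80, "C": 0.85, "D": 0.90}  # fraction of old instruction count

cpi_old = sum(cpi[c] * freq[c] for c in cpi)                       # 2.6
mips_old = 500 / cpi_old                                           # ~192.3
ic_ratio = sum(freq[c] * keep[c] for c in cpi)                     # 0.8625 of old IC
cpi_new = sum(cpi[c] * freq[c] * keep[c] for c in cpi) / ic_ratio  # ~2.5826
speedup = cpi_old / (ic_ratio * cpi_new)                           # ~1.167, not 1.008
print(round(cpi_old, 2), round(mips_old, 1), round(cpi_new, 4), round(speedup, 3))
```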

3.

You are going to enhance a machine, and there are two possible improvements: either make multiply instructions run four times faster than before, or make memory access instructions run two times faster than before. You repeatedly run a program that takes 100 seconds to execute. Of this time, 20% is used for multiplication, 50% for memory access, and 30% for other tasks. Which of the following statements are true?
(a) The speedup will be 1.176 if you improve only multiplication.
(b) The speedup will be 1.333 if you improve only the memory access.
(c) The speedup will be 1.667 if both improvements are made.
(d) You then repeatedly run a second program that takes 50 seconds to execute. Of this time, 10% is used for multiplication, 15% for memory access, and 75% for other tasks; in this case, you will end up with the same speedup whether you improve only multiplication or only memory access.
(e) None of the above statements are true.

Answer: (a), (b), (c), (d)
註(a):Speedup = 1 / (0.2/4 + 0.8) = 1.176
註(b):Speedup = 1 / (0.5/2 + 0.5) = 1.333
註(c):Speedup = 1 / (0.2/4 + 0.5/2 + 0.3) = 1.667
註(d):Speedupmultiplication = 1 / (0.1/4 + 0.9) = 1.081, Speedupmemory_access = 1 / (0.15/2 + 0.85) = 1.081. The two speedups are equal (each improvement saves 7.5% of the time), so (d) is also true.
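Each 註 above is an application of Amdahl's law; the sketch below (illustrative, not part of the original solution) evaluates all four cases:

```python
# Amdahl's law: speedup = 1 / (improved fraction / factor + unaffected fraction)
def amdahl(parts):
    """parts: list of (time fraction affected, improvement factor)."""
    improved = sum(f / k for f, k in parts)
    unaffected = 1 - sum(f for f, _ in parts)
    return 1 / (improved + unaffected)

print(round(amdahl([(0.2, 4)]), 3))                    # (a) 1.176
print(round(amdahl([(0.5, 2)]), 3))                    # (b) 1.333
print(round(amdahl([(0.2, 4), (0.5, 2)]), 3))          # (c) 1.667
print(round(amdahl([(0.1, 4)]), 3), round(amdahl([(0.15, 2)]), 3))  # (d) 1.081 1.081
```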

4.

Consider the following four architecture styles: Accumulator, Memory-Memory, Load-Store, and Stack. Suppose that we are asked to write a sequence of assembly code that implements the C code a = b + c – d. Assume:
• The opcode is always 1 byte.
• All memory addresses are 2 bytes.
• All data operands are 2 bytes.
• All instructions are an integral number of bytes in length.
• The variables a, b, c, and d are initially in memory.
• For the Load-Store architecture, there are 16 general-purpose registers.
Which of the following statements are True?
(a) For the Accumulator architecture, the length of our sequence of assembly code will be at least 20 bytes.
(b) For the Memory-Memory architecture, the length of our sequence of assembly code will be at least 26 bytes.
(c) For the Load-Store architecture, the length of our sequence of assembly code will be at least 30 bytes.
(d) For the Stack architecture, the length of our sequence of assembly code will be at least 22 bytes.
(e) None of the above statements are true.

Answer: (e)
註:

ISA type      Code sequence           Instruction bytes
Accumulator   load b                  3 + 3 + 3 + 3 = 12
              add c
              sub d
              store a
Mem-Mem       add a, b, c             7 + 7 = 14
              sub a, a, d
Load-Store    load $t0, b             4 + 4 + 4 + 3 + 3 + 4 = 22
              load $t1, c
              load $t2, d
              add $t3, $t0, $t1
              sub $t3, $t3, $t2
              store $t3, a
Stack         push b                  3 + 3 + 1 + 3 + 1 + 3 = 14
              push c
              add
              push d
              sub
              pop a
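The byte counts in the table can be reproduced mechanically. The sketch below is an illustration, not part of the original solution; it assumes a 4-bit register field (16 registers need 4 bits) and rounds each instruction up to a whole number of bytes:

```python
# Instruction sizes: 1-byte opcode + 2 bytes per memory address
# + 4 bits per register field, rounded up to whole bytes.
import math

def size(n_mem_addrs, n_reg_fields):
    bits = 8 + 16 * n_mem_addrs + 4 * n_reg_fields
    return math.ceil(bits / 8)

accumulator = 4 * size(1, 0)                   # load, add, sub, store
mem_mem     = 2 * size(3, 0)                   # add a,b,c / sub a,a,d
load_store  = 4 * size(1, 1) + 2 * size(0, 3)  # 3 loads + store, add + sub
stack       = 4 * size(1, 0) + 2 * size(0, 0)  # push/pop x4, add/sub
print(accumulator, mem_mem, load_store, stack)  # 12 14 22 14
```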

5.

Which of the following statements are true?
(a) For a 16-bit addition, if we use a ripple adder and assume that the delay from ci to ci+1 of any full adder is 1 nsec, then the 16-bit addition takes 16 nsec.
(b) If we use a carry-lookahead adder, the 16-bit addition is divided into four 4-bit adders. In each 4-bit adder, there is a 4-bit carry-lookahead circuit. Assume that there is one gate delay for all gi and pi, two gate delays for each carry c4, c8, c12, and c16, and finally three gate delays for s15. Then 9 gate delays are required for generating c16 after X, Y, and c0 are applied as input.
(c) Assume that each gate delay requires 0.5 nsec. This 16-bit carry-lookahead adder requires 4.5 nsec for a single addition operation.
(d) A faster way to apply the carry-lookahead circuit is to add a second-level lookahead circuit. With a second-level lookahead circuit, 5 gate delays are required for generating c16 after X, Y, and c0 are applied as input.
(e) Assume that each gate delay requires 0.5 nsec. This 16-bit carry-lookahead adder requires 2.5 nsec for a single addition operation.

Answer: (a), (b), (d)
註(b):This 16-bit CLA is formed, like a ripple-carry adder, by serially connecting the four 4-bit CLAs. The delay of c16 is 3 + 2 + 2 + 2 = 9 gate delays.
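The delay figures in (a) through (e) can be tallied as follows (a sketch using the gate-delay assumptions stated in the question; the s15 line shows why the 4.5 nsec in (c) is an underestimate for a complete addition):

```python
# Gate-delay bookkeeping for the 16-bit adders (one gate delay = 0.5 ns).
GATE_NS = 0.5
ripple_ns = 16 * 1.0                 # (a): 1 ns per full-adder carry stage
c16_one_level = 1 + 2 + 2 + 2 + 2    # gi/pi, then c4, c8, c12, c16 in series -> 9
c16_two_level = 1 + 2 + 2            # gi/pi, block G/P, second-level c16 -> 5
s15_one_level = (1 + 2 + 2 + 2) + 3  # c12 ready after 7 delays, plus 3 for s15
print(ripple_ns, c16_one_level * GATE_NS, c16_two_level * GATE_NS)  # 16.0 4.5 2.5
print(s15_one_level * GATE_NS)  # 5.0: a full addition needs s15, not just c16
```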

6.

Assume that, in a MIPS processor, the times required to perform the operations of memory access, ALU operations, and register access are 150 picoseconds (ps), 80 ps, and 50 ps, respectively. The functions of instruction execution can be divided into 5 units: instruction fetch (IF), decode and register read (ID), ALU operation (EX), data memory access (MEM), and register write back (WB). There are five instruction classes, which are R-type (ALU instructions), load, store, conditional branch, and unconditional jump, and the execution of each instruction class needs to perform part or the whole of the above functional units. Assume the following instruction mix: 30% ALU instructions, 20% load, 20% store, 20% conditional branch, and 10% unconditional jump. Which of the following statements are true?
(a) Unconditional jump requires the least amount of time to execute because it only uses IF.
(b) The average execution time per instruction is 480 ps for the single-cycle, fixed clock cycle length implementation approach.
(c) The average execution time per instruction is 352 ps for the single-cycle but variable-length clock approach, i.e., the length of the clock cycle is only as long as the instruction needs.
(d) The average execution time per instruction is 585 ps for the multicycle (not pipelined) approach, where each functional unit requires one clock cycle and the clock cycle length is fixed.
(e) None of the above statements are true.

Answer: (b), (c), (d)
註(a):In the MIPS multicycle machine, jump and branch both require 3 clock cycles to execute.
註(b):
Instruction class   Instruction memory   Register read   ALU operation   Data memory   Register write   Total
R-type              150                  50              80                            50               330 ps
Load word           150                  50              80              150           50               480 ps
Store word          150                  50              80              150                            430 ps
Branch              150                  50              80                                             280 ps
Jump                150                                                                                 150 ps
The fixed clock cycle must accommodate the slowest instruction (load, 480 ps), so every instruction takes 480 ps.
註(c):Average execution time per instruction = 0.3×330 + 0.2×480 + 0.2×430 + 0.2×280 + 0.1×150 = 352 ps
註(d):CPI = 0.3×4 + 0.2×5 + 0.2×4 + 0.2×3 + 0.1×3 = 3.9
Average execution time per instruction = 3.9×150 ps = 585 ps
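The three averages can be recomputed with a sketch (illustrative only; the stage times and the multicycle cycle counts are those used in the notes above):

```python
# Single-cycle and multicycle timing for the five MIPS instruction classes.
stage_ps = {"IF": 150, "ID": 50, "EX": 80, "MEM": 150, "WB": 50}
uses = {  # functional units exercised in the single-cycle datapath
    "R-type": ["IF", "ID", "EX", "WB"],
    "load":   ["IF", "ID", "EX", "MEM", "WB"],
    "store":  ["IF", "ID", "EX", "MEM"],
    "branch": ["IF", "ID", "EX"],
    "jump":   ["IF"],
}
mix = {"R-type": 0.3, "load": 0.2, "store": 0.2, "branch": 0.2, "jump": 0.1}
# Multicycle machine: one clock per step; jump and branch finish in 3 cycles.
cycles = {"R-type": 4, "load": 5, "store": 4, "branch": 3, "jump": 3}

total = {c: sum(stage_ps[s] for s in u) for c, u in uses.items()}
fixed_clock = max(total.values())                                   # (b): 480 ps
variable = sum(mix[c] * total[c] for c in mix)                      # (c): 352 ps
multicycle = sum(mix[c] * cycles[c] for c in mix) * max(stage_ps.values())  # (d)
print(fixed_clock, round(variable), round(multicycle))              # 480 352 585
```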

7.

Assume that a pipelined processor executes instructions at the throughput of one instruction per clock cycle when no pipeline stalls occur. The pipelined processor executes the subsequent instructions of a jump or branch instruction as if the jump or branch would not change the control flow, until the program counter is overwritten by the jump or branch instruction. The pipelined processor carries out the following operations in the first 3 clock cycles of unconditional jump and conditional branch instruction execution.
• First cycle: instruction fetch.
• Second cycle: instruction decode, calculation of the target address, and writing to the program counter if the instruction is a jump.
• Third cycle: determination of the branch action and writing to the program counter if the instruction is a branch.
Now, an architect decides to break the instruction fetch into two pipeline cycles. Assume that this action reduces the clock cycle time from 10 ns to 8 ns.
• 5% of the instructions executed are unconditional jumps;
• 20% of the instructions executed are conditional branches;
• 60% of the conditional branches executed turn out to be taken, i.e., the control flow branches to the target address of the branch.
Which of the following statements are true?
(a) The new design is 1.1 times faster than the old design.
(b) The new design is 1.1 times slower than the old design.
(c) The old design is 1.1 times faster than the new design.
(d) The old design is 1.1 times slower than the new design.
(e) None of the above statements are true.

Answer: (a), (d)
註:
CPIactual for old pipeline = 1 + 0.2 × 0.6 × 2 + 0.05 × 1 = 1.29
Instruction time for old pipeline = 1.29 × 10 ns = 12.9 ns
CPIactual for new pipeline = 1 + 0.2 × 0.6 × 3 + 0.05 × 2 = 1.46
Instruction time for new pipeline = 1.46 × 8 ns = 11.68 ns
Speedup = 12.9 / 11.68 = 1.1
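The CPI bookkeeping can be verified with a small helper (a sketch; the penalties follow from when the PC is written: one wrong-path cycle for jumps and two for taken branches in the old design, one more of each after the extra fetch stage is added):

```python
# Control-hazard CPI and per-instruction time for the two pipeline designs.
def instr_time(cycle_ns, jump_penalty, taken_branch_penalty,
               jump_frac=0.05, branch_frac=0.20, taken=0.60):
    cpi = 1 + branch_frac * taken * taken_branch_penalty + jump_frac * jump_penalty
    return cpi * cycle_ns

old = instr_time(10, jump_penalty=1, taken_branch_penalty=2)  # PC written in cycle 2/3
new = instr_time(8, jump_penalty=2, taken_branch_penalty=3)   # one extra fetch cycle
print(round(old, 2), round(new, 2), round(old / new, 2))      # 12.9 11.68 1.1
```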

8.

Which of the following statements are true for the forwarding unit in a 5-stage pipelined processor?
(a) The forwarding unit is used to detect the instruction cache stalling.
(b) The forwarding unit is a sequential circuit which detects the true data dependency for the EXE pipeline stage and selects the forwarded results for the execution unit.
(c) The forwarding unit is a pipeline register which detects the true data dependency for the EXE pipeline stage and selects the forwarded results for the execution unit.
(d) The forwarding unit compares the source register number of the instructions in the MEM and WB stages with the destination register number of the instruction in the decode stage.
(e) None of the above statements are true.

Answer: (e)
註:The forwarding unit is a combinational circuit, not a sequential circuit or a pipeline register. It compares the destination register numbers of the instructions in the MEM and WB stages with the source register numbers of the instruction in the EX stage.

9.

Which of the following statements are true?
(a) A control hazard is the delay in determining the proper data to load in the MEM stage of a pipelined processor.
(b) A load-use data hazard occurs because the pipeline flushes the instructions behind.
(c) To flush instructions in the pipeline means to load the pipeline with the requested instructions using the predicted program counter.
(d) A branch prediction buffer is a buffer that the compiler uses to predict a branch.
(e) None of the above statements are true.

Answer: (e)

10.

Which of the following statements are true for an Intel x86-based PC?
(a) DMA mechanism can be applied to delegate responsibility from the CPU.
(b) USB 2.0 is a synchronous bus using a handshaking protocol.
(c) The CPU can fetch and translate IA-32 instructions.
(d) The CPU can reduce instruction latency with deep pipelining.
(e) None of the above statements are true.

Answer: (a), (c)
註(b):False (USB 2.0 is an asynchronous bus; handshaking is used with asynchronous buses)
註(d):False (pipelining improves throughput but cannot reduce a single instruction's latency)


100 年台大電機

1.

A computer with a direct-mapped cache has an instruction miss rate of 2% and a data miss rate of 1.7%. Suppose the computer has a clock cycle time of 20 ns, and the CPI without memory stalls is 2 cycles per instruction. Suppose that the miss penalty is 200 ns, and there is, on average, one memory reference for data every two instructions. How much time, on average, does the computer need to execute a program with 10000 instructions?
(a) 457 µs
(b) 459 µs
(c) 461 µs
(d) 463 µs
(e) None of the above are correct.

Answer: (a)
CPIeffective = 2 + 1 × 0.02 × (200 ns / 20 ns) + 0.5 × 0.017 × (200 ns / 20 ns) = 2.285
Execution time = 10000 × 2.285 × 20 ns = 457000 ns = 457 µs

2.

A program repeatedly performs a three-step process. It reads in a 4-KB block of data from disk, does some processing on the data, and then writes out the result as another 4-KB block elsewhere on the disk. Each block is contiguous and randomly located on a single track on the disk. The disk drive rotates at 7200 RPM, has an average seek time of 8 ms, and has a transfer rate of 20 MB/sec. The controller overhead is 2 ms. No other programs are using the disk or processor, and there is no overlapping of disk operation with processing. The processing step takes 20 million clock cycles, and the clock rate is 400 MHz. What is the overall throughput of the system in blocks processed per second?
(a) 11.69
(b) 12.69
(c) 13.69
(d) 14.69
(e) None of the above are correct.

Answer: (b)
(Seek time + Rotational delay + Data transfer time + Controller time) × 2 + Processing time
= (8 ms + 0.5 / (7200 / 60) sec + 4 KB / 20 MB per sec + 2 ms) × 2 + (20 × 10^6) / (400 × 10^6) sec
= (8 + 4.17 + 0.2 + 2) × 2 + 50 = 78.74 ms

Blocks processed/second = 1 / 78.74 ms ≈ 12.7, closest to (b)

3.

Assume that on a 2 GHz machine, it takes 1 clock cycle to execute an add, addi, or sub instruction and 2 clock cycles to execute a beq or j instruction. What is the MIPS achieved by the program below?

add $t0, $zero, $zero
loop: beq $a1, $zero, finish
add $t0, $t0, $a0
sub $a1, $a1, 1
j loop
finish: addi $t0, $t0, 100
add $v0, $t0, $zero

(a) 1342
(b) 1344
(c) 1346
(d) 1348
(e) None of the above are correct.

Answer: (e)
註:$a0 and $a1 are not given initial values, so the MIPS cannot be computed. If we assume $a0 = 3 and $a1 = 20, then the instruction count is 84 and the cycle count is 125, so
MIPS = (clock rate × instruction count) / (clock cycles × 10^6) = (2 × 10^9 × 84) / (125 × 10^6) = 1344

4.

Forwarding is a common technique to eliminate data hazards occurring among pipelined instructions. However, not all data hazards can be eliminated by forwarding. Suppose on a particular machine, if an instruction following a load instruction depends on the result of the load instruction, then a data hazard occurs and the pipeline is stalled for one cycle. Which of the following statements are not correct?
(a) In addition to data hazards, there can be control hazards in this situation because we cannot determine the proper data to load in the MEM pipeline stage.
(b) Assume the percentage of load instructions is 20% in a program, and half the time the instruction following a load needs the result of the load instruction. Then the performance degradation due to the data hazard is 1.1.
(c) Instruction scheduling can be used to eliminate this kind of data hazard.
(d) Instruction scheduling does not have any run-time overhead, as it is done at compile time by the compiler; however, it usually increases compilation time because the compiler needs to perform some extra work.
(e) All of the above are correct.

Answer: (a), (b), (e)
註(b):CPIeffective = 1 + 0.2 × 0.5 × 1 = 1.1
The performance degradation = (1.1 – 1)/1 = 0.1 = 10%, not 1.1.

5.

Suppose there are two possible strategies for improving the performance of a hypothetical machine. We can either make multiplication instructions run four times faster, or we can make memory-access instructions run two times faster. Suppose the benchmark program we use spends 20% of its execution time in multiplication instructions, 50% in memory access, and the remaining 30% in other instructions. Which of the following statements are not correct?
(a) If we improve multiplication instructions alone, then the speed-up is 1.176.
(b) If we improve memory-access instructions alone, then the speed-up is 1.333.
(c) If we improve both multiplication and memory-access instructions, then the speed-up is 1.667.
(d) The second strategy (making memory-access instructions two times faster) is always more effective as long as the original benchmark program spends 50% or more time in memory-access instructions.
(e) All of the above are correct.

Answer: (d), (e)
註(d):Suppose the original benchmark program spends 40% and 60% of the time in multiplication and memory-access instructions, respectively.

 1.428 0.4  0.6 4 1  1.428 The speedup to improve memory-access instructions = 0.6  0.4 2

The speedup to improve multiplication instructions =

11

Hence, making memory-access instructions two times faster as the original benchmark program spends 50% or more time is not always more effective. 6. Two compilers are being tested for a 2.5 GHz machine with three different classes of instructions: Class A, Class B, and Class C, which require one, two, and three cycles, respectively. Both compilers are used to produce code for a large piece of software. The first compiler's code uses 5 billion Class A instructions, 2 billion Class B instructions, and 1 billion Class C instructions. The second compiler's code uses 10 billion Class A instructions, 1 billion Class B instructions, and 1 billion Class C instructions. Which of the following statements are not correct? (a) Compiler 1's code is slower than compiler 2's in terms of MIPS. (b) Compiler 1's code is faster than compiler 2's in terms of execution time. (c) Compiler 1's code has a CPI of 1.5 cycles per instruction. (d) Compiler 2's code has a CPI of 1.25 cycles per instruction. (e) All of the above are correct. Answer: None Compiler 1 Execution Time

(5  1  2  2  1 3)  109 2.5  109

Compiler 2 = 4.8 s

(10 1  1 2  1 3) 109 2.5 109

=6s

MIPS

(5  2  1) 109 = 1667 4.8 106

(10  1  1) 109 = 2000 6 109

CPI

(5  1  2  2  1 3)  109  1.5 (5  2  1)  109

(10  1  1  2  1  3)  10 9  1.25 (10  1  1)  10 9
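The table's figures can be recomputed with a short sketch (illustrative only; instruction counts and per-class cycle counts are those given in the question):

```python
# Execution time, MIPS, and CPI for the two compilers on a 2.5 GHz machine.
CLOCK_HZ = 2.5e9
cycles_per_class = {"A": 1, "B": 2, "C": 3}

def stats(counts_billions):
    ic = sum(counts_billions.values()) * 1e9
    cycles = sum(cycles_per_class[c] * n * 1e9 for c, n in counts_billions.items())
    time = cycles / CLOCK_HZ
    return time, ic / (time * 1e6), cycles / ic  # exec time (s), MIPS, CPI

for name, counts in [("Compiler 1", {"A": 5, "B": 2, "C": 1}),
                     ("Compiler 2", {"A": 10, "B": 1, "C": 1})]:
    time, mips, cpi = stats(counts)
    print(name, round(time, 1), round(mips), round(cpi, 2))
# Compiler 1 4.8 1667 1.5
# Compiler 2 6.0 2000 1.25
```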

7. What is the value in $v0 after the execution of the program below?

add $a0, $zero, 3
add $a1, $zero, 20
add $t0, $zero, $zero
loop: beq $a1, $zero, finish
add $t0, $t0, $a0
sub $a1, $a1, 1
j loop
finish: addi $t0, $t0, 100
add $v0, $t0, $zero

(a) 157
(b) 160
(c) 163
(d) 166
(e) None of the above are correct.

Answer: (b)
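The answer can be confirmed by mirroring the loop in a high-level sketch (the MIPS instructions are kept as comments):

```python
# Simulate the register contents through the program above.
a0, a1, t0 = 3, 20, 0   # add $a0/$a1/$t0 initializations
while a1 != 0:          # loop: beq $a1, $zero, finish
    t0 += a0            # add $t0, $t0, $a0
    a1 -= 1             # sub $a1, $a1, 1
                        # j loop
t0 += 100               # finish: addi $t0, $t0, 100
v0 = t0                 # add $v0, $t0, $zero
print(v0)               # 160
```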

8.

Which of the following statements about caches are not correct?
(a) A one-way set associative cache performs the same as a direct-mapped cache.
(b) Split cache applies parallel caches to improve cache speed.
(c) Translation look-aside buffer (TLB) is a cache for the page table and can help accessing virtual memory faster.
(d) Write-through caches have better consistency between main memory and cache.
(e) All of the above are correct.

Answer: (b), (c), (e)
註(b):A split cache is applied to improve cache bandwidth, not speed.
註(c):The TLB helps accessing physical (main) memory faster.

9.

Which of the following statements about the IEEE single-precision floating-point format are not correct?
(a) There are 1 bit in sign, 8 bits in exponent, and 24 bits in mantissa.
(b) If sign = 0, exponent = 01111101 (binary), and significand = 01000000000000000000000 (binary), then the number represented is 0.3125.
(c) The largest positive number that can be represented is 2(1 – 2^-24) × 2^127.
(d) The smallest positive number that can be represented is 1 × 2^-126.
(e) All of the above are correct.

Answer: (a), (d), (e)
註(a):The mantissa (fraction) field is 23 bits; 24 bits of precision arise only when counting the implicit leading 1.
註(c):The largest positive number = (1 + 1 – 2^-23) × 2^127 = 2(1 – 2^-24) × 2^127, so (c) is correct.
註(d):With denormalized numbers, the smallest positive value is 2^-23 × 2^-126; 1 × 2^-126 is only the smallest positive normalized number.
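Statement (b) of the floating-point question can be checked by decoding the bit pattern directly; the sketch below (illustrative, using Python's struct module) also confirms the largest-value formula in 註(c):

```python
# Decode IEEE 754 binary32 patterns: sign | 8-bit exponent | 23-bit fraction.
import struct

def decode(sign, exponent, fraction):
    word = int(sign + exponent + fraction, 2)
    return struct.unpack(">f", word.to_bytes(4, "big"))[0]

# (b): exponent 125 -> 2^(125-127); fraction 0.01 -> 1.25 x 0.25 = 0.3125
value = decode("0", "01111101", "01" + "0" * 21)
largest = decode("0", "11111110", "1" * 23)
print(value)                               # 0.3125
print(largest == (2 - 2**-23) * 2.0**127)  # True
```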


99 年台大電機

Multiple Choices (There might be one or more correct choices in each question)

1.

Which of the following statements are true?
(A) MTTF (Mean Time To Failure) is a commonly used reliability metric for hard drives
(B) Rotational latency is the required time for the desired sector of a hard disk to rotate under the read/write head
(C) Using RAID 0 allows higher performance than a single disk
(D) Instead of storing a complete copy of the original data for each disk, RAID 3 only adds enough redundant information to restore the lost information on a failure
(E) None of the above

Answer: (A), (B), (C), (D)

2.

Which of the following statements are true?
(A) TCP/IP is a commonly used transport layer protocol
(B) HTTP is a session layer protocol
(C) The OSI reference model has 5 layers
(D) CSMA/CD is widely used in local area networks
(E) None of the above

Answer: (D)
註(A):TCP is a transport layer protocol, but IP is a network layer protocol
註(B):HTTP is an application layer protocol
註(C):The OSI reference model has 7 layers

3.

Which of the following statements are true?
(A) A synchronous bus uses a handshaking protocol for coordinating usage rather than a clock
(B) A backplane bus allows processors, memory, and I/O devices to coexist on a single bus
(C) Firewire and USB 2.0 use asynchronous buses
(D) By using a DMA controller, the bus can be released during a bus transaction while the requester is waiting for the data to be transmitted, which frees the bus for access by another requester
(E) None of the above

Answer: (B), (C)
註(D):By using a split transaction protocol, the bus can be released during a bus

transaction while the requester is waiting for the data to be transmitted, which frees the bus for access by another requester.
4. Which of the following statements are true? (A) Quantum computer can solve all NP-complete problems in polynomial time (B) Currently, there is no quantum computer hardware available to do any quantum computing task (C) Quantum computer could efficiently decipher security codes that are generated by classical integer factorization (D) Quantum computer could apply principles of quantum mechanics to achieve higher CPU clock rate (E) None of the above
Answer: (B), (C)
Note (A): The class of problems that can be efficiently solved by quantum computers is called BQP. BQP is suspected to be disjoint from NP-complete and a strict superset of P, but that is not known. There is a common misconception that quantum computers can solve NP-complete problems in polynomial time. That is not known to be true, and is generally suspected to be false.
5. Assume a processor has a clock rate of 500 MHz and an ideal CPI (no memory misses) of 1.0. What is the effective CPI if a program with a mix of 50% arithmetic and logic, 30% load/stores and 20% control instructions is run, if 10% of the data memory operations and 1% of the instructions have a miss penalty of 50 cycles. (A) 1 (B) 2 (C) 3 (D) 4 (E) None of the above
Answer: (C)
Note: Effective CPI = 1 + 1 × 0.01 × 50 + 0.3 × 0.1 × 50 = 3
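The effective-CPI arithmetic in problem 5 can be replayed in a couple of lines (a verification sketch, with variable names of my own choosing):

```python
# Base CPI 1.0; 1% of instruction fetches and 10% of data accesses miss
# (loads/stores are 30% of instructions); each miss costs 50 cycles.
base_cpi = 1.0
miss_penalty = 50
effective_cpi = base_cpi + 1.0 * 0.01 * miss_penalty + 0.30 * 0.10 * miss_penalty
assert abs(effective_cpi - 3.0) < 1e-9   # choice (C)
```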


6. Suppose we have a deeply pipelined processor, for which we implement a branch-target buffer for the conditional branches, which are 15% of the instructions. Assume that the mis-prediction penalty is always 3 cycles and the buffer miss penalty is always 6 cycles. Assume 90% hit rate in the buffer and 75% accuracy of the buffer prediction. Assume a base CPI without branch stalls of 1. What is the CPI? (A) 1 (B) 1.09 (C) 1.19 (D) 1.675 (E) None of the above
Answer: (C)
Note: Effective CPI = 1 + 0.15 × [0.9 × (1 - 0.75) × 3 + (1 - 0.9) × 6] = 1 + 0.15 × (0.675 + 0.6) ≈ 1.19
7. A memory system has 1M words. Each word is an 8-bit byte. The memory is organized into blocks of 8 words each. The cache has 256K words, organized into cache lines of 8 words each. The memory cache is organized into a 4-way set associative cache. (A) 13 bits are needed for the cache set address. (B) 4 bits are for the Tag (C) 21 bits are needed to address all words (D) 17 bits are needed to address all blocks (E) None of the above
Answer: (A), (B), (D)
Note (C): Only 20 bits are needed to address all words.
Note: address format:
   Tag (4 bits) | Index (13 bits) | Block offset (3 bits)
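The field widths in problem 7 follow mechanically from the cache geometry; a sketch (my own variable names) that reproduces the 4/13/3 split:

```python
from math import log2

word_bits = int(log2(2**20))              # 1M words -> 20-bit word address
offset_bits = int(log2(8))                # 8 words per block -> 3 bits
lines = (256 * 2**10) // 8                # cache words / words per line = 32K lines
sets = lines // 4                         # 4-way associative -> 8K sets
index_bits = int(log2(sets))              # 13 bits
tag_bits = word_bits - index_bits - offset_bits
block_addr_bits = word_bits - offset_bits # bits to address all memory blocks

assert (tag_bits, index_bits, offset_bits) == (4, 13, 3)
assert word_bits == 20 and block_addr_bits == 17   # so (C) is false, (D) is true
```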

8. A hard disk has a track seek time of 10 ms. The disk rotation speed is 9000 rpm. Each track on the disk has 600 sectors. Each sector holds 512 bytes of data. What is the average time it takes to read 1024 bytes of data? (A) 10.5 ms (B) 13.355 ms (C) 14.55 ms (D) 15.333 ms (E) None of the above
Answer: (B)
Note: Average time = seek time + average rotational latency + transfer time
= 10 ms + 0.5 × (60/9000) s + (1024 / (600 × 512)) × (60/9000) s
= 10 ms + 3.333 ms + 0.022 ms = 13.355 ms
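The three terms of the disk access time can be checked numerically (a sketch; names are mine):

```python
seek_ms = 10.0
rev_ms = 60.0 / 9000 * 1000                 # one revolution at 9000 rpm: 6.667 ms
rotational_ms = 0.5 * rev_ms                # average latency: half a revolution
transfer_ms = 1024 / (600 * 512) * rev_ms   # 1024 bytes out of a 600x512-byte track
total_ms = seek_ms + rotational_ms + transfer_ms
assert abs(total_ms - 13.355) < 0.01        # choice (B)
```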

9. Which of the following statements are true? (A) In a floating point number, the binary point is not fixed (B) Units in the last place (ulp) is the number of bits in error in the most significant bits of the significand between the actual number and the number that can be represented. (C) When a negative exponent of a floating point number becomes too large to fit in the exponent field, underflow occurs. (D) In floating-point addition, before addition of the significands, the decimal point of the number that has the smaller exponent should be aligned. (E) None of the above
Answer: (A), (C), (D)
Note (B): Units in the last place (ulp) is the number of bits in error in the least significant bits of the significand between the actual number and the number that can be represented.
10. Which of the following statements are true? (A) If the data requested by a load instruction has not yet become available when it is requested, the load-use data hazard occurs. (B) Data hazards may be resolved with the bypassing technique, which retrieves the hitting data element from internal buffers rather than waiting for it to arrive from external memory. (C) If the proper instruction cannot execute in the proper clock cycle because the instruction that was fetched is not the one that is needed, the data hazard occurs. (D) Delayed branch is used to solve control hazards. The delayed branch always executes the next sequential instruction, with the branch taking place after that one instruction delay (E) None of the above
Answer: (D)
Note (B): Data hazards may be resolved with the bypassing technique, which retrieves the missing data element from internal buffers rather than waiting for it to arrive from external memory.
Note (C): This describes a control hazard, not a data hazard.


98 年台大電機
1. Let's compile the following C code on a 64-bit machine.

   #include <stdio.h>
   #include <stdlib.h>
   int main( ) {
       int *p = (int*)malloc(sizeof(int));
       int *q = (int*)malloc(sizeof(int));
       printf("%x\n", &p);
       printf("%x\n", &q);
   }

Suppose the output of the line printf("%x\n", &p) is "bffff4f8". What will be the output of the following line printf("%x\n", &q)? (A) bffff4f0 (B) bffff4f4 (C) bffff4f7 (D) bffff4fc (E) bffff500
Answer: (E) bffff4f8 + 8 = bffff500
Note: A 64-bit computer architecture generally has integer and addressing registers that are 64 bits wide, so each pointer variable occupies 8 bytes on the stack.
2. The IEEE 754 standard defines the following floating number format:

   sign (1 bit) | exponent (8 bits) | fraction (23 bits)

What is the floating number for the binary representation "1100 0001 0101 0100 0000 0000 0000 0000"? (A) -0.38095 (B) -0.7619 (C) -6.625 (D) -13.25 (E) -26.5 (Note: the exponent bias for the single-precision floating number is 127).
Answer: (D) 1100 0001 0101 0100 0000 0000 0000 0000 → -1.10101₂ × 2^3 = -1101.01₂ = -13.25₁₀
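The decode can be confirmed by reinterpreting the exact bit pattern (a verification sketch):

```python
import struct

# 1100 0001 0101 0100 0000 0000 0000 0000 = sign 1, exponent 10000010 (130),
# fraction 10101000... -> -(1.65625) x 2^(130-127) = -13.25
bits = 0b1_10000010_10101000000000000000000
value = struct.unpack(">f", bits.to_bytes(4, "big"))[0]
assert value == -13.25   # choice (D)
```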


3. Compiling the same C program on two different computers, A and B, as follows, which of the following is correct?

   Computer | Instruction count | Clock rate | CPI
   A        | 10 billion        | 4 GHz      | 1.0
   B        | unknown           | 5 GHz      | 1.2

(A) The instruction count on B must be greater than 9.6 billion.
(B) The program will spend less CPU time on B.
(C) Computer B must have a higher MIPS rating.
(D) If the program runs faster (i.e. takes shorter CPU time) on A, the instruction count on B must be greater than 10 billion.
(E) Computer A must consume less power than B.
Answer: (C)
Note (C): MIPS_A = 4 × 10^9 / (1.0 × 10^6) = 4000 < MIPS_B = 5 × 10^9 / (1.2 × 10^6) ≈ 4166
Note (D): CPU_Time_A = (10 × 10^9 × 1.0) / (4 × 10^9) = 2.5 s < CPU_Time_B = (IC_B × 1.2) / (5 × 10^9) ⟹ IC_B > 10.42 billion
Note (E): Power consumption is proportional to the clock rate times the square of the supply voltage. Although B has the higher clock rate, the supply voltages are unknown, so we cannot tell whether A or B consumes more power.
(The following tables are for problems 4-7) Assume the operation times for the major functional units in a CPU are as follows:

   Functional unit     | Time
   Memory units        | 200 ps
   ALU and adders      | 100 ps
   Register read/write | 50 ps
   Others              | 0 ps

Assume the instructions are divided into 5 classes and their inclusions of the functional units during executions are:

   Instruction class | Functional units used by the instruction class during an instruction execution
   I                 | Instruction memory, register access, ALU, register access
   II                | Instruction memory, register access, ALU, memory access, register access
   III               | Instruction memory, register access, ALU, memory access
   IV                | Instruction memory, register access, ALU
   V                 | Instruction memory

4. If the above CPU is implemented in a single-cycle per instruction mechanism, what is the maximum clock rate for this CPU? (A) 1.67 MHz (B) 1.67 GHz (C) 200 MHz (D) 200 GHz (E) 5.0 GHz
Answer: (B) Clock cycle time = 200 + 50 + 100 + 200 + 50 = 600 ps; Clock rate = 1/600 ps = 1.67 GHz
5. If the above CPU is implemented in a multi-cycle per instruction mechanism, as shown in the second table, that is, instructions I, II, III, IV, and V take 4, 5, 4, 3, and 1 cycles, respectively, what is the maximum clock rate for this CPU? (A) 1.67 MHz (B) 1.67 GHz (C) 200 MHz (D) 200 GHz (E) 5.0 GHz
Answer: (E) Clock cycle time = 200 ps ⟹ Clock rate = 1/200 ps = 5.0 GHz
6. Suppose a program has the following instruction mix: I (25%), II (10%), III (45%), IV (15%), and V (5%), and suppose all the instructions are executed sequentially (i.e. non-pipelined), what is the ratio of the runtimes by the CPUs in Problems 4 over 5, assuming both are running on the maximum clock rates? (A) 0.33 (B) 0.79 (C) 1.27 (D) 1.32 (E) 3.0
Answer: (B)
Instruction time for the single-cycle machine = 600 ps
Instruction time for the multi-cycle machine = (4 × 0.25 + 5 × 0.1 + 4 × 0.45 + 3 × 0.15 + 1 × 0.05) × 200 ps = 760 ps
Ratio of the runtimes = 600 / 760 = 0.79
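Problems 4-6 can be cross-checked by deriving everything from the per-unit latencies (a sketch; the dictionaries below are my own encoding of the two tables):

```python
stage_ps = {"IM": 200, "Reg": 50, "ALU": 100, "MEM": 200}
units = {  # functional units used by each instruction class, in datapath order
    "I":   ["IM", "Reg", "ALU", "Reg"],
    "II":  ["IM", "Reg", "ALU", "MEM", "Reg"],
    "III": ["IM", "Reg", "ALU", "MEM"],
    "IV":  ["IM", "Reg", "ALU"],
    "V":   ["IM"],
}
# Single cycle: the clock must fit the longest instruction (class II).
single_cycle_ps = max(sum(stage_ps[u] for u in seq) for seq in units.values())
assert single_cycle_ps == 600            # -> 1.67 GHz

# Multi-cycle: 200 ps clock, one cycle per functional unit used.
mix = {"I": 0.25, "II": 0.10, "III": 0.45, "IV": 0.15, "V": 0.05}
cycle_ps = max(stage_ps.values())        # 200 ps -> 5 GHz
multi_ps = sum(mix[c] * len(units[c]) * cycle_ps for c in mix)
assert abs(multi_ps - 760) < 1e-9
assert abs(single_cycle_ps / multi_ps - 0.79) < 0.005   # problem 6, choice (B)
```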


7. Assume the average CPIs for the above instructions under a 5-stage pipeline design are: 1.5, 1, 1, 1.25, and 2, respectively, what is the speed-up of this CPU over the single-cycle implementation in Problem 4, assuming both are running on the maximum clock rates? (A) 1.21 (B) 2.0 (C) 2.47 (D) 3.37 (E) 5.0
Answer: (C)
Instruction time for the pipeline = (1.5 × 0.25 + 1 × 0.1 + 1 × 0.45 + 1.25 × 0.15 + 2 × 0.05) × 200 ps = 242.5 ps
Speedup = 600 / 242.5 = 2.47
(The following information about the cache of a CPU is for Problems 8-9)
* The hit rates for instruction and data caches are 95% and 90%.
* The CPI for this processor is 2 when cache hits.
* The penalty for all cache misses is 50 cycles.
* The memory load and store instructions account for 40% of the total instructions.
8. What is the CPI when considering the cache misses? (A) 2.5 (B) 4.5 (C) 6.5 (D) 8.5 (E) 10.5
Answer: (C)
Note: CPI = 2 + 1 × (1 - 0.95) × 50 + 0.4 × (1 - 0.90) × 50 = 6.5


9. If the CPU clock rate is quadrupled (i.e. 4X), while the memory speed remains unchanged, what will be the relative performance ratio to the CPU in Problem 8 when considering cache misses? (Please note that since the CPU architecture and instruction set are unchanged, the CPI for a cache hit is unchanged). (A) 1.3 (B) 1.75 (C) 2.6 (D) 4.0 (E) 6.5
Answer: (A)
CPI_4x = 2 + 1 × (1 - 0.95) × 200 + 0.4 × (1 - 0.90) × 200 = 20
Performance ratio = (6.5 × T) / (20 × 0.25T) = 1.3
10. Which of the following is NOT correct for pipeline hazards? (A) Generally speaking, a pipeline hazard happens when the next instruction cannot execute in the following clock cycle. (B) An example of structural hazards can be the condition when two instructions are accessing the same memory in the same cycle. (C) Data hazards arise from the dependence of one instruction on an earlier one that is still in the pipeline. (D) Control hazards arise from the need to make a decision based on the results of one instruction (e.g. branch instruction) while others are executing. (E) "Forwarding" is a common technique to relieve the control hazards.
Answer: (E)
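The arithmetic for problems 8 and 9 can be replayed directly; note how the unchanged memory latency becomes 200 CPU cycles once the clock is quadrupled (a verification sketch):

```python
# Problem 8: 5% instruction misses, 10% data misses over 40% loads/stores,
# 50-cycle penalty.
cpi8 = 2 + 1.0 * 0.05 * 50 + 0.4 * 0.10 * 50
assert abs(cpi8 - 6.5) < 1e-9

# Problem 9: the same miss time now costs 4 x 50 = 200 CPU cycles.
cpi9 = 2 + 1.0 * 0.05 * 200 + 0.4 * 0.10 * 200
assert abs(cpi9 - 20) < 1e-9

# Relative performance = old time / new time = (6.5 * T) / (20 * T/4)
ratio = cpi8 / (cpi9 * 0.25)
assert abs(ratio - 1.3) < 1e-9    # choice (A)
```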


97 年台大電機
1. Symbol table matches names of labels to the addresses of the memory words that instructions occupy. Assemblers keep track of labels used in branches and data transfer instructions in the symbol table. (A) True (B) False
Answer: (A)
2. The Java programming language uses a software interpreter called the Java Virtual Machine (JVM) to translate the interpreted code segments into native code of the computer at runtime. (A) True (B) False
Answer: (B)
Note: The JVM can execute the interpreted code (Java bytecode) directly; it is the Just-In-Time compiler that translates interpreted code segments into native code at runtime.
3. The accuracy in floating point is normally measured with ULP (units in the last place). ULP (units in the last place) is defined as the number of bits in error in the LSBs of the significand between the actual number and the number that can be represented. (A) True (B) False
Answer: (A)
4. In low power CPU design, the usage of low clock rates allows a processor to sleep and conserve power since CMOS technology does not consume power when it is idle. (A) True (B) False
Answer: (A)
5. If a branch is taken, the branch target address specified in a branch becomes the new PC. For example, the branch target is given by the sum of the offset field of the instruction and the address of the instruction following the branch. (A) True (B) False
Answer: (A)

6. Micro-instruction is an advanced pipelining technique that enables the processor to execute more than one instruction per clock cycle. (A) True (B) False
Answer: (B)
Note: A micro-instruction is a representation of control using low-level instructions, each of which asserts a set of control signals. (The technique described is multiple issue, i.e. superscalar execution.)
7. Virtually addressed cache is a cache that keeps track of recently used address mappings to avoid an access to the page table. (A) True (B) False
Answer: (B)
Note: The TLB is the cache that keeps track of recently used address mappings to avoid an access to the page table.
8. Non-blocking cache is used to hide the cache miss latency by using out-of-order processors. Hit under miss allows additional cache hits during a miss. Miss under miss allows multiple outstanding cache misses. (A) True (B) False
Answer: (A)
9. The alternative of dedicated I/O instructions in the processor is to use memory-mapped I/O. The memory-mapped I/O assigns a memory address for a command to an I/O device and records the device number. (A) True (B) False
Answer: (B)
Note: Memory-mapped I/O is an I/O scheme in which portions of the address space are assigned to I/O devices, and reads and writes to those addresses are interpreted as commands to the I/O devices.
10. DMA (direct memory access) mechanism uses the DMA controller as bus master to directly read or write memory without using the processor. As a result, the DMA mechanism can reduce the occupancy of processor cycles. (A) True (B) False
Answer: (A)

11. RAID 1 has the highest check disk overhead among RAID levels 1, 3, and 5. On the other hand, RAID 3 and 5 have the same throughput for large writes. (A) True (B) False
Answer: (A)
12. Split transaction protocol is used for bus design. The protocol releases the bus during a bus transaction while the receiver is waiting for the data to be transmitted, which frees the bus for access by another receiver. (A) True (B) False
Answer: (B)
Note: The protocol releases the bus during a bus transaction while the requester (not the receiver) is waiting for the data to be transmitted, which frees the bus for access by another requester.
13. What is the functionality of the following design? What is the benefit of the design?
[Figure omitted: an array of adders arranged in rows, forming a combinational multiplier]
Answer:
(a) This design is an array multiplier built from adders (a fast multiplier).
(b) 1. It is faster than a sequential multiplier. 2. It is easy to apply optimizations to gain further improvements. 3. It is easy to pipeline.


101 年台大資工 1.

[Overview] Please explain the following terms in English. For each term, please spend no more than 100 English words. (a) basic block (b) non-uniform memory access architecture (c) GPGPU (d) cloud computing (e) virtual machine monitor

Answer: (a) Basic block: A sequence of instructions without branches (except possibly at the end) and without branch targets or branch labels (except possibly at the beginning). (b) Non-uniform memory access architecture: A type of single address space multiprocessor in which some memory accesses are faster than others depending on which processor asks for which word. (c) GPGPU: Using a GPU for general-purpose computation via a traditional graphics API and graphics pipeline. (d) Cloud computing: Cloud computing refers to the delivery of computing and storage capacity as a service to a heterogeneous community of end-recipients. (e) Virtual machine monitor: The software that supports VMs is called Virtual machine monitor. It presents a software interface to guest software and must isolate the state of guests from each other and protect itself from guest software. 2.

[CPU Performance] Calculate the performance of a processor taking into account stalls due to data cache and instruction cache misses. The data cache has a 92% hit rate and a 2-cycle hit latency. Assume that latency to memory and the cache miss penalty together is 100 cycles. The instruction cache has a hit rate of 90% with a miss penalty of 50 cycles. Assume the load never stalls a dependent instruction and assume the processor must wait for stores to finish when they miss the cache. Finally, assume that instruction cache misses and data cache misses never occur at the same time. (a) Calculate the average memory access latency for the data cache. (b) Assume the base CPI using a perfect memory system is 1.0. Calculate the additional CPI of the pipeline due to the Instruction cache stalls. (c) Same as (b), but calculate the additional CPI due to the data cache stalls. (d) Assume 30% of instructions are loads and stores. Calculate the overall CPI for the machine.

Answer:
(a) AMAT = 2 + 0.08 × 100 = 10 cycles
(b) The additional CPI due to instruction cache stalls = 1 × 0.1 × 50 = 5
(c) Since 30% of instructions are loads and stores, the additional CPI due to data cache stalls = 0.3 × 0.08 × 100 = 2.4
(d) Overall CPI = 1 + 0.3 × 0.08 × 100 + 1 × 0.1 × 50 = 8.4
3.

[Large Systems] The figure below shows a programmer's view of storage hierarchy of a typical datacenter. A server consists of a number of processor sockets, each with a multicore CPU and its internal cache hierarchy, local shared and coherent DRAM, and a number of directly attached disk drives. The DRAM and disk resources within the rack are accessible through the first-level rack switches, and all resources in all racks are accessible via the cluster-level switch.

(a) How would you write a program to utilize the resources in the entire system?
(b) For a server, the latency required to access a data item in the datacenter depends on the location of the data. Assume the latency for a network packet to pass through one switch is 100 μs (1 μs = 10^-6 s). Complete the chart below to characterize the relationship between the latency and the data location. [chart omitted in the original]
(c) Assume each group of servers is connected through a 1-Gbps link to a rack-level switch that has an additional eight 1-Gbps ports used for connecting the rack to the cluster-level switch. Complete the figure below to quantify the bandwidth of the datacenter system. [figure omitted; axis label: Bandwidth (Mbyte/s)]
(d) Energy consumption is important to the operation cost of a datacenter. The figure below shows the power usage of the main subsystems for a recent Google server as the compute load varies from idle to full activity levels. Dynamic Voltage and Frequency Scaling (DVFS) is one way to save energy use on a server. How much saving can DVFS achieve? In addition to DVFS, how would you further reduce the energy cost?
Answer:

(a) From a programmer's standpoint, hardware clusters have a deep and complex memory/storage hierarchy, heterogeneous components, and scarcity of some resources. The size of problems being solved obviously always complicates matters further. Some types of operations or subsets of problems are common enough in large-scale services that it pays off to build targeted programming frameworks that can drastically simplify the development of new products.
(b) and (c) [completed chart and figure omitted in the original]
(d) Voltage and frequency scaling may refer to:
• Dynamic voltage scaling, a power management technique in computer architecture, where the voltage used in a component is increased or decreased, depending upon circumstances
• Dynamic frequency scaling, a technique in computer architecture whereby the frequency of a microprocessor can be automatically adjusted "on the fly," either to conserve power or to reduce the amount of heat generated by the chip
Voltage and frequency scaling are often used together to save power in mobile devices including cell phones. When used in this way it is commonly known as DVFS, or Dynamic Voltage and Frequency Scaling. In addition to DVFS, the PMU (power management unit) can be used to further reduce the energy cost. The PMU controls almost every power-consuming function in a computer. It is constantly running diagnostics on the various power-related operations and checking them against the current Energy-Saver settings, allowing the PMU to actively manage power consumption for optimum user performance.

4.

[Multicore Systems-on-Chips] Please read the following story from NVIDIA's whitepaper before answering the questions. In February 2011, NVIDIA introduced and demonstrated its Project Kal-El mobile processor, the world's first quad-core mobile processor. Project Kal-El will enable new mobile applications, new experiences, more robust multitasking, higher-quality gaming, and faster web browsing. In addition, Project Kal-El will further extend battery life by operating its CPU cores at lower frequency, yet still be able to complete more work than dual-core or single-core processors. NVIDIA's Project Kal-El processor implements a so-called Variable Symmetric Multiprocessing (vSMP) technology, which includes a fifth CPU core (the "Companion" core) built using a special low power silicon process that executes tasks at low frequency for active standby mode, music playback, and even video playback, as shown in the figure below. The four main "quad" cores are built using a standard silicon process to reach higher frequencies, while consuming lower power than dual-core solutions for many tasks. All five CPU cores are identical ARM Cortex A9 CPUs, and are individually enabled and disabled (via aggressive power gating) based on the work load. The "Companion" core is OS transparent, unlike current Asynchronous SMP architectures, meaning the OS and applications are not aware of this core, but automatically take advantage of it. This strategy saves significant software efforts and new coding requirements. Power consumption of a silicon device is equal to the sum of leakage power and dynamic power. Leakage power is mainly determined by the silicon process technology, and dynamic power is determined by silicon process technology and by operating voltages and frequencies. [figure omitted]
The main cores are built on a fast silicon process, so they consume high leakage power under idle or active standby conditions, but are capable of running at higher frequency ranges without requiring significant increases in operating voltage. The Companion core, on a low power process technology, has low leakage power but has slower switching times at normal voltage levels, and to enable it to switch faster (for high frequency operations) it needs higher than normal voltage ranges.
(a) Draw the power-frequency curves of the Kal-El processor in the following 5 settings:
(1) the Companion core is ON, and the main cores are OFF;
(2) the Companion core is OFF, and one of the main cores is ON;
(3) the Companion core is OFF, and 2 of the main cores are ON;
(4) the Companion core is OFF, and 3 of the main cores are ON;
(5) the Companion core is OFF, and 4 of the main cores are ON;


[Blank axes omitted: power consumption vs. frequency]
(b) With the power-frequency curves, how would you maximize the power-performance ratio for the system when there is only ONE single-thread application running on the system, but the workload increases?
(c) How would you maximize the power-performance ratio for the system when the number of tasks running on the system increases from 1 to 8?
(d) Draw the ideal power-performance curve for the system when your strategy in (c) works perfectly. Please briefly explain your curve.

[Blank axes omitted: power consumption vs. performance (throughput)]
(e) The Kal-El processor is a system-on-chip (SoC) which also includes intellectual properties (IPs) other than CPU cores. Why are these IPs included in the SoC? Why not have more or bigger general-purpose cores, similar to x86-based desktop or server processors?
Answer:
(a)


[Figure: power-frequency curves labeled, top to bottom at high frequency, (1), (5), (4), (3), (2). The Companion-core curve (1) has the lowest power at low frequencies but climbs steeply at high frequencies; among the main-core settings, power rises as more cores are enabled, from (2) up to (5).]

(b) Excess power consumption is inefficient; it will consume unnecessary energy and generate waste heat, which can in turn decrease reliability. If only one single-thread application is running on the system but the workload increases, turning on the Companion core and then one of the main cores gives the better power-performance ratio.
(c) Turn on all the main cores as well as the Companion core.
(d) [figure omitted in the original]

(e) An IP core consumes less power and provides less functionality than a general-purpose CPU core. However, multiple specialized IP cores can deliver more performance than a single powerful CPU such as an x86 while still consuming less power. That is, a set of IP cores achieves a higher power-performance ratio than a single large general-purpose core.


100 年台大資工 1.

You are given a bus-based multiprocessor system which implements a snooping-based cache coherency protocol. There are two processors P1, P2 in the system. Each processor has its own L1 cache. The memory system specification is given below:
• L1 Cache: physically addressed, 64KB, direct-mapped, 32B block size, write-back cache
• Virtual Memory: 8KB page, 1 GB virtual address space
• TLB: fully associative, 128 entries
• 64 MB of addressable physical memory.
x1, x2, and x3 represent one-word variables. The physical addresses (represented in hex) of x1, x2 and x3 are 0x00AF0804, 0x00AF0808, and 0x00AE0804, respectively. Answer the following questions considering the possible events: TLB-hit/miss, page-table hit/miss, L1-hit/miss, write-back, invalidate.
(a) What are the sizes of the tag and data arrays of the TLB?
(b) Please write down the correct order of the events listed above for detecting a page fault. Note that some events may not occur.
(c) Assume TLB hits for all the memory requests. x1 and x2 are in the L1 caches of P1 and P2. Show the bus activities for each memory request.

   Time | P1       | P2      | Bus activity
   1    | Write x1 |         |
   2    |          | Read x2 |
   3    | Write x1 |         |
   4    |          | Read x3 |
   5    | Read x2  |         |

Answer:
(a) The size of a tag = 30 - 13 = 17 bits
The size of a physical page number = 26 - 13 = 13 bits
The size of the tag array = 17 × 128 = 2176 bits
The size of the data (physical page number) array = 13 × 128 = 1664 bits
(b) TLB miss, page table miss
(c) Variables x1 and x2 are contained in the same block, since the block size is 32B.

Write-invalidate snooping protocol:

   Time | P1       | P2      | Bus activity
   1    | Write x1 |         | Yes
   2    |          | Read x2 | Yes
   3    | Write x1 |         | Yes
   4    |          | Read x3 | Yes
   5    | Read x2  |         | No

Note (c): If the snooping protocol is changed to write-update, the answer becomes:

   Time | P1       | P2      | Bus activity
   1    | Write x1 |         | Yes
   2    |          | Read x2 | No
   3    | Write x1 |         | Yes
   4    |          | Read x3 | Yes
   5    | Read x2  |         | No
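Both bus-activity columns can be reproduced with a toy snooping model (a sketch, not a full MESI implementation; the one-block-per-cache caches start with the x1/x2 block valid in both, as the problem states, and all names below are mine):

```python
BLOCK = {"x1": "B1", "x2": "B1", "x3": "B2"}  # x1 and x2 share a 32B block

def simulate(trace, protocol):
    """Return per-access bus activity for a 2-processor, 1-block-per-cache model."""
    cache = {p: {"block": "B1", "valid": True} for p in ("P1", "P2")}
    activity = []
    for proc, op, var in trace:
        blk, me = BLOCK[var], cache[proc]
        bus = False
        if not (me["valid"] and me["block"] == blk):   # miss: fetch over the bus
            me["block"], me["valid"], bus = blk, True, True
        if op == "W":
            if protocol == "invalidate":               # write-back + write-invalidate:
                for q, c in cache.items():             # other copies must be killed
                    if q != proc and c["valid"] and c["block"] == blk:
                        c["valid"], bus = False, True
            else:                                      # write-through + write-update:
                bus = True                             # every write goes on the bus
        activity.append("Yes" if bus else "No")
    return activity

trace = [("P1", "W", "x1"), ("P2", "R", "x2"), ("P1", "W", "x1"),
         ("P2", "R", "x3"), ("P1", "R", "x2")]
assert simulate(trace, "invalidate") == ["Yes", "Yes", "Yes", "Yes", "No"]
assert simulate(trace, "update")     == ["Yes", "No", "Yes", "Yes", "No"]
```

The only difference between the two runs is time 2: under write-update, P2's copy of the x1/x2 block was refreshed rather than invalidated at time 1, so the read hits locally.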

2. Multiple choice. Which of the following statements about pipelining are correct?
(a) Two instructions with data dependency will cause data hazards.
(b) Assume a 5-stage pipeline of the MIPS architecture. All the data hazards in the following code sequence can be avoided by data forwarding.

   add $4, $2, $3 // $4 is the destination register
   add $5, $4, $2
   lw $6, 8($5)
   add $8, $6, $4

(c) The best speedup from pipelining is equal to the number of pipeline stages.
(d) Pipelining helps both instruction throughput and latency.
(e) Once an exception occurs, all instructions in the pipeline need to be flushed.
(f) Control hazards can be resolved with dynamic branch prediction.
Answer: (c), (f)
3. The following figure shows a typical I/O system. The storage contains HDDs with a read/write bandwidth of 128 MB/sec and an average seek and rotational latency of 2 ms. The bandwidth of the front side bus, memory channels, and

PCIe are denoted in the figure. Additional system configurations are listed below:
• Each core sustains 3 billion instructions per second
• The user program uses 200,000 instructions per I/O operation
• The operating system uses 100,000 instructions per I/O operation
Note that all the numbers in the question are represented in base 10 (i.e., 1K = 1000). Please answer the following questions assuming a workload of 64 KB random reads.
[Figure: the CPU (Core 1-4) connects over the front side bus (bandwidth = 10 GB/sec) to the North Bridge; the North Bridge connects via memory channels (bandwidth = 5 GB/sec) to the Memory DIMMs, and over PCIe (bandwidth = 2 GB/sec) to the Disk Controller and HDD storage]

(a) What is the maximum sustained I/O rate that 4 cores can deliver? (b) What is the maximum sustained I/O rate of 8 HDDs can deliver ignoring disk conflicts and assuming the disk controller is not the bottleneck? (c) What is the maximum sustained I/O rate that the described I/O system can deliver with 8 HDDs? Please explain your answer. (d) How many HDDs can the described I/O system accommodate? Answer

(a) (3 × 10^9 × 4) / (200,000 + 100,000) = 40,000 I/Os per second
(b) Each disk delivers 1 / (2 ms + 64 KB / (128 MB/sec)) = 1 / 2.5 ms = 400 I/Os per second, so 8 disks deliver 400 × 8 = 3200 I/Os per second
(c) I/O rate for the front side bus = 10 GB / 64 KB = 156,250
I/O rate for the channel bus = 5 GB / 64 KB = 78,125
I/O rate for the PCIe bus = 2 GB / 64 KB = 31,250
The 8 HDDs are the bottleneck; hence, the maximum system I/O rate is 3200
(d) The I/O system can accommodate ⌊31,250 / 400⌋ = 78 HDDs
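The four bottleneck computations above can be checked end to end (a sketch; base-10 units as the problem specifies, names mine):

```python
CPU_RATE = 3e9                            # instructions per second per core
INSTR_PER_IO = 200_000 + 100_000          # user program + OS
cores_io = 4 * CPU_RATE / INSTR_PER_IO    # (a)
assert cores_io == 40_000

io_size = 64_000                          # 64 KB, with 1K = 1000
disk_service_s = 2e-3 + io_size / 128e6   # seek+rotation plus transfer
disk_io = 8 / disk_service_s              # (b): 8 disks
assert abs(disk_io - 3200) < 1e-6

pcie_io = 2e9 / io_size                   # (c): tightest bus limit is PCIe
assert pcie_io == 31250
assert min(cores_io, disk_io, pcie_io) == disk_io   # the 8 HDDs are the bottleneck
assert int(pcie_io // (disk_io / 8)) == 78          # (d): disks PCIe can feed
```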

4. You are considering adding a register-memory addressing mode to the MIPS instruction set. With this new addressing mode, there would be a 5% increase in cycle time and the CPI of the ALU instruction type (including both the register-register and register-memory types) becomes 2.
(a) What is the average CPI of the old implementation?
(b) What type of instructions could be reduced?
(c) If half of the instructions that you answer in question (b) can be reduced with the new addressing mode, what is the average CPI of the new implementation? Will you decide to add the new addressing mode? Please explain your answer.

   OP     | Freq | CPI
   ALU    | 40%  | 1
   Load   | 20%  | 3
   Store  | 20%  | 3
   Branch | 20%  | 2

Answer:
(a) CPI_old = 1 × 0.4 + 3 × 0.2 + 3 × 0.2 + 2 × 0.2 = 2
(b) Load
(c) Suppose the instruction count is IC and the cycle time for the old implementation is T; then the execution time for the old implementation is IC × 2 × T.
CPI_new = (2 × 0.4 + 3 × 0.1 + 3 × 0.2 + 2 × 0.2) / 0.9 = 2.33
The execution time for the new implementation is 0.9 × IC × 2.33 × 1.05T ≈ IC × 2.2 × T.
So adding the new addressing mode is not a good choice.
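The trade-off in (c) can be replayed numerically; the 2.33 above is the rounded value of 2.1/0.9, and the exact new execution time comes out to 2.205 × IC × T (a verification sketch):

```python
old = {"ALU": (0.40, 1), "Load": (0.20, 3), "Store": (0.20, 3), "Branch": (0.20, 2)}
cpi_old = sum(f * c for f, c in old.values())
assert abs(cpi_old - 2.0) < 1e-9

# New mode folds half the loads into ALU ops: instruction count drops 10%,
# ALU CPI rises to 2, cycle time grows 5%.
ic_new = 0.90                                    # relative instruction count
cpi_new = (0.40 * 2 + 0.10 * 3 + 0.20 * 3 + 0.20 * 2) / ic_new
time_old = 1.0 * cpi_old * 1.00                  # IC x CPI x cycle time = 2.0
time_new = ic_new * cpi_new * 1.05
assert abs(time_new - 2.205) < 1e-9
assert time_new > time_old                       # the new mode is a net loss
```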


99 年台大資工
1. Consider a computer architecture with an instruction set which is composed of N different instruction classes. The execution time of a class-i instruction is 4 + i ns, where 1 ≤ i ≤ N.
(a) If the computer is implemented with a single-cycle design, what is the minimum clock cycle time? Why?
(b) Consider a multicycle control implementation with a clock cycle of 1 ns. Ignore the overhead associated with multicycle control. What is the performance advantage (speedup) of multicycle control relative to single-cycle control, assuming that the various instruction classes are used with the same frequency?
(c) With the performance benefit function derived in part (b), would a speedup factor of 5 be possible? What would be the requirement to achieve that speedup?
(d) Repeat part (b) for the case when the relative frequency of class-i instructions is proportional to i.
(e) Repeat part (b) for the case when the clock cycle time is 2 ns and the class-i instruction requires (4 + i)/2 clock cycles. Assume N = 5 in this case.
Answer:
(a) The minimum clock cycle time is (4 + N) ns, since in a single-cycle design the clock cycle time is determined by the instruction with the longest execution time.
(b) Instruction time for the single-cycle machine = (4 + N) ns
Instruction time for the multicycle machine = 1 ns × [Σ_{i=1..N} (4 + i)] / N = (4.5 + 0.5N) ns
Speedup = (4 + N) / (4.5 + 0.5N)
(c) (4 + N) / (4.5 + 0.5N) = 5 ⟹ N = -12.33. A speedup of 5 is impossible unless a pipelining technique is used.
(d) Instruction time for the multicycle machine = 1 ns × [Σ_{i=1..N} i(4 + i)] / [N(N + 1)/2] = (4 + (2N + 1)/3) ns
Speedup = (4 + N) / (4 + (2N + 1)/3) = (12 + 3N) / (13 + 2N)
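The closed-form speedups in (b) and (d) can be cross-checked against direct summation using exact rational arithmetic (a verification sketch; function names are mine):

```python
from fractions import Fraction as F

def speedup_uniform(N):
    """Part (b): all classes equally frequent, 1 ns multicycle clock."""
    multi = F(sum(4 + i for i in range(1, N + 1)), N)   # average cycles x 1 ns
    return F(4 + N) / multi

def speedup_weighted(N):
    """Part (d): class-i frequency proportional to i."""
    multi = F(sum(i * (4 + i) for i in range(1, N + 1)),
              sum(range(1, N + 1)))
    return F(4 + N) / multi

for N in (3, 5, 10):
    assert speedup_uniform(N) == F(4 + N) / (F(9, 2) + F(N, 2))   # (4+N)/(4.5+0.5N)
    assert speedup_weighted(N) == F(12 + 3 * N, 13 + 2 * N)
```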

(e) Instruction time for the single-cycle machine = 4 + 5 = 9 ns
Instruction time for the multicycle machine = 2 ns × [Σ_{i=1..5} (4 + i)/2] / 5 = 2 ns × (17.5 / 5) = 7 ns
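Part (e)'s 7 ns and the resulting speedup check out numerically (a sketch):

```python
# N = 5, 2 ns clock, class-i takes (4 + i)/2 clock cycles
single_ns = 4 + 5
multi_ns = 2 * sum((4 + i) / 2 for i in range(1, 6)) / 5   # 2 * 17.5 / 5
assert abs(multi_ns - 7.0) < 1e-9
assert abs(single_ns / multi_ns - 9 / 7) < 1e-9            # speedup = 1.2857
```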

    Speedup = 9/7 = 1.2857

2. The CPU in a computer system uses two levels of caches, L1 and L2. Level L1
is accessed in one clock cycle and supplies the data in case of an L1 hit. For
an L1 miss, occurring 4% of the time, L2 is consulted. An L2 hit incurs a
penalty of 10 clock cycles, while an L2 miss implies a 100-cycle penalty.
(a) Assuming a pipelined implementation with a CPI (cycles per instruction) of
    1 when there is no cache miss, what is the effective CPI if L2's local miss
    rate is 25%?
(b) If we were to model the two-level cache system as a single cache, what miss
    rate and miss penalty should be used?
(c) Changing the mapping scheme of L2 from direct-mapped to two-way set
    associative can improve its local miss rate to 20%, while its hit penalty
    is increased to 12 clock cycles due to the more complex access scheme. What
    is the effective CPI after this change? Is this change a good idea?
Answer:
(a) The effective CPI = 1 + 0.04 × 10 + 0.04 × 0.25 × 100 = 2.4
(b) Miss penalty = 100 cycles; miss rate = 0.04 × 0.25 = 0.01
(c) New local miss rate = 0.25 × (1 − 0.2) = 0.2
    The effective CPI after the change = 1 + 0.04 × 12 + 0.04 × 0.2 × 100 = 2.28
    Since 2.28 < 2.4, the change is a good idea.

3. Consider a bus-based shared-memory system consisting of three processors.
The shared memory is divided into four blocks: a, b, c, d. Each processor has a
cache that can fit only one block at any given time. Each block can be in one
of two states: valid (V) or invalid (I). Assume that the caches are initially
flushed (empty) and that the contents of the memory are as follows:

    Memory block |  a |  b |  c |  d
    Contents     | 10 | 20 | 40 | 80

Consider the following sequence of events given in order:
(1) P1: read(a)
(2) P2: read(a)
(3) P3: read(a)
(4) P1: a = a + 20
(5) P1: read(c)
(6) P2: read(a)
(7) P3: a = 15
(8) P1: c = c + 10
(a) There are two fundamental cache coherence policies: write-invalidate and
write-update. Write-invalidate maintains consistency by reading from local
caches until a write occurs. When any processor updates the value of X through
a write, it invalidates all other copies by setting the state of the block to
invalid (I). Suppose write-back and write-invalidate protocols are used in the
system. What would be the contents of the caches and memory and the states of
the cache blocks after the above sequence?
(b) On the other hand, write-update maintains consistency by immediately
updating all copies in all caches. When a block is updated, the state of the
block is set to valid (V). Suppose write-through and write-update protocols are
used in the system. What would be the contents of the caches and memory and the
states of the cache blocks after the above sequence?
Answer:
(a) Write-back with write-invalidate (each cache entry shows state and value):

    Event              | Cache 1 | Cache 2 | Cache 3 | Memory (a, b, c, d)
    Initial            | I  -    | I  -    | I  -    | 10, 20, 40, 80
    (1) P1: read(a)    | V  10   | I  -    | I  -    | 10, 20, 40, 80
    (2) P2: read(a)    | V  10   | V  10   | I  -    | 10, 20, 40, 80
    (3) P3: read(a)    | V  10   | V  10   | V  10   | 10, 20, 40, 80
    (4) P1: a = a + 20 | V  30   | I  -    | I  -    | 10, 20, 40, 80
    (5) P1: read(c)    | V  40   | I  -    | I  -    | 30, 20, 40, 80
    (6) P2: read(a)    | V  40   | V  30   | I  -    | 30, 20, 40, 80
    (7) P3: a = 15     | V  40   | I  -    | V  15   | 30, 20, 40, 80
    (8) P1: c = c + 10 | V  50   | I  -    | V  15   | 30, 20, 40, 80

(b) Write-through with write-update:

    Event              | Cache 1 | Cache 2 | Cache 3 | Memory (a, b, c, d)
    Initial            | I  -    | I  -    | I  -    | 10, 20, 40, 80
    (1) P1: read(a)    | V  10   | I  -    | I  -    | 10, 20, 40, 80
    (2) P2: read(a)    | V  10   | V  10   | I  -    | 10, 20, 40, 80
    (3) P3: read(a)    | V  10   | V  10   | V  10   | 10, 20, 40, 80
    (4) P1: a = a + 20 | V  30   | V  30   | V  30   | 30, 20, 40, 80
    (5) P1: read(c)    | V  40   | V  30   | V  30   | 30, 20, 40, 80
    (6) P2: read(a)    | V  40   | V  30   | V  30   | 30, 20, 40, 80
    (7) P3: a = 15     | V  40   | V  15   | V  15   | 15, 20, 40, 80
    (8) P1: c = c + 10 | V  50   | V  15   | V  15   | 15, 20, 50, 80
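The two protocol traces above can be replayed mechanically. The following is a
minimal sketch (function and variable names are my own; each cache holds one
block, modeled as a (block, value, dirty) tuple) that reproduces the final rows
of both tables:

```python
# A sketch of question 3: replay the event sequence under write-back +
# write-invalidate and under write-through + write-update.

def simulate(policy):
    mem = {"a": 10, "b": 20, "c": 40, "d": 80}
    caches = [None, None, None]          # one block per processor cache

    def read(p, blk):
        c = caches[p]
        if c and c[0] == blk:
            return                        # hit
        if policy == "invalidate" and c and c[2]:
            mem[c[0]] = c[1]              # write back the dirty victim
        caches[p] = (blk, mem[blk], False)

    def write(p, blk, val):
        read(p, blk)                      # allocate on a write miss
        if policy == "invalidate":
            caches[p] = (blk, val, True)
            for q in range(3):            # invalidate all other copies
                if q != p and caches[q] and caches[q][0] == blk:
                    caches[q] = None
        else:                             # write-through + write-update
            mem[blk] = val
            for q in range(3):            # update every copy, including p's
                if caches[q] and caches[q][0] == blk:
                    caches[q] = (blk, val, False)

    read(0, "a"); read(1, "a"); read(2, "a")
    write(0, "a", 30)                     # (4) P1: a = a + 20
    read(0, "c")                          # (5)
    read(1, "a")                          # (6)
    write(2, "a", 15)                     # (7) P3: a = 15
    write(0, "c", 50)                     # (8) P1: c = c + 10
    return caches, mem
```

Running `simulate("invalidate")` ends with cache 1 holding c = 50 (dirty),
cache 2 empty, cache 3 holding a = 15, and memory a = 30, c = 40, matching the
last row of table (a); `simulate("update")` matches the last row of table (b).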

98 年台大資工

1. Based on their applications, computers can be divided into three categories:
desktop, server, and embedded. Let us focus on embedded applications here, and
answer the following questions:
(a) How would you define embedded applications?
(b) List three examples and explain why they are embedded applications.
(c) Which of the following techniques can be used to reduce the footprint
    (i.e., the memory usage) of an embedded application?
    A. code compression  B. loop unrolling  C. dynamically loaded library
    D. software pipelining
(d) For the following program, which techniques could be used to improve the
    performance?

        for (int i = 0; i < 128; ++i) {
            a[i] = b[i] + c[i];
        }

    A. SIMD (single-instruction multiple-data) operations  B. loop unrolling
    C. memory access prefetch  D. larger memory capacity
(e) What are digital signal processors? How would an embedded application take
    advantage of a digital signal processor?
(f) Multi-core embedded systems have become popular. Are they mostly
    homogeneous or heterogeneous? Why?
Answer:
(a) An application that permanently resides in an industrial or consumer device
    to perform one or a few dedicated functions, in contrast with a
    general-purpose computer that can run all kinds of applications.
(b) Embedded systems span all aspects of modern life, and there are many
    examples of their use. Consumer electronics include PDAs, MP3 players,
    mobile phones, video-game consoles, digital cameras, DVD players, and GPS
    receivers. Many household appliances, such as microwave ovens, washing
    machines, and dishwashers, include embedded systems to provide flexibility,
    efficiency, and features.
(c) A, C
(d) A, C
(e) A digital signal processor (DSP) is a specialized microprocessor designed
    specifically for digital signal processing, generally in real-time
    computing. DSPs can be embedded for applications such as audio signal
    processing, audio compression, digital image processing, video compression,
    and speech processing.

(f) Multi-core embedded systems are usually heterogeneous, since a
    heterogeneous design can match each workload to the core that achieves the
    best efficiency according to some objective function. Heterogeneous
    multi-core architectures can provide significant throughput advantages over
    equivalent-area homogeneous multi-core architectures.

2. Memory cycle times have been much longer than processor cycle times, which
is also known as the processor-memory speed gap. Please answer the following
questions:
(a) Which of the following techniques can be used to overcome the
    processor-memory speed gap?
    A. cache memory  B. replacing DRAM with SRAM  C. flash memory
    D. moving data from ROM to RAM
(b) When you increase the processor-memory bus from 32-bit to 64-bit, which of
    the following are true?
    A. latency for loading a 32-bit word from main memory remains unchanged
    B. more memory modules are needed
    C. the bus bandwidth is doubled
    D. the memory address space increases
(c) For a virtual memory system, which of the following might happen during a
    memory access?
    A. Hit in TLB, Hit in Page Table, Miss in Cache
    B. Hit in TLB, Miss in Page Table, Miss in Cache
    C. Miss in TLB, Miss in Page Table, Hit in Cache
    D. Miss in TLB, Hit in Page Table, Hit in Cache
(d) Suppose we have a quad-core computer, where processor cores may communicate
    via the shared memory. What are the potential advantages and disadvantages
    if an on-chip shared cache is used?
(e) Suppose we have an 8-core shared-memory computer built with two quad-core
    processor chips, and every two cores share one cache. For one processor
    core to load the word from location X after another core has written a word
    to location X, what would be the latency of the load instruction? Please
    list all possibilities.
Answer:
(a) A, B
(b) A, C
(c) A, D
(d) Advantages: if a sufficient amount of data is shared by multiple
    processors, then a shared cache can increase throughput.
    Disadvantages: if little data is shared by the processors, then access to
    memory from each processor may be slower because of contention for the
    shared cache.
(e) If the two cores share the same cache, the latency is the L1 cache latency.
    If the two cores do not share the same cache but are on the same chip, the
    latency is the L1 + L2 cache latency. If the two cores are not on the same
    chip, the latency is the L1 + L2 cache latency + memory latency.

    Chip 1: (Core1, Core2 -> L1 cache) (Core3, Core4 -> L1 cache) -> L2 cache
    Chip 2: (Core1, Core2 -> L1 cache) (Core3, Core4 -> L1 cache) -> L2 cache
    Both L2 caches connect to the memory over the bus.

3. Network bandwidth and storage capacity are growing even faster than Moore's
Law predicts. Today, network-attached storage (NAS) is popularly used to share
files between multiple computers via the network.
(a) For a computer to acquire a file from a NAS server in a local area network
    (LAN) environment, which of the following might affect the time needed to
    acquire the file?
    A. the network bandwidth
    B. the physical distance between the computer and the NAS server
    C. the speed of the processor on the NAS server
    D. the speed of the disk on the NAS server
(b) Calculate the minimum time to back up 100 GB of data to the NAS server over
    an 802.11g wireless network. The peak network bandwidth is 54 Mbit per
    second.
(c) Would it be better if you copied the data onto a USB hard disk first, moved
    the USB disk to connect to the NAS server, and then copied the data to the
    NAS server? The peak bandwidth of the USB 2.0 interface is 480 Mbps.
(d) Compared to hard disks, what are the advantages and disadvantages of flash
    memory?
Answer:
(a) A, D
(b) (100 × 10^9 × 8) / (54 × 10^6) ≈ 14815 seconds
(c) 2 × (100 × 10^9 × 8) / (480 × 10^6) ≈ 3333 seconds (one copy onto the USB
    disk and one copy from it). Using the USB disk is better.
(d) Advantages: faster (100 to 1000 times), smaller, lower power, and more
    robust. Disadvantages: more expensive, and flash bits wear out after
    thousands of write accesses.

4. Consider two different processors, M1 and M2, of the same RISC instruction
set. There are five classes of instructions (A, B, C, D, and E). When the cache
hit rate is 100%, the CPI for each instruction class on M1 and M2 is listed in
the following table:

    Class | M1 | M2
    A     |  8 |  4
    B     |  6 |  6
    C     | 10 |  2
    D     |  8 |  2
    E     |  8 |  8

Suppose M1 has a clock rate of 200 MHz. An engineer analyzed the workload and
found the instruction mix to be A: 20%, B: 10%, C: 10%, D: 40%, E: 20%.
(a) Based on this information, what would be the minimum clock rate required by
    M2 to execute the workload as fast as M1, assuming a 0% cache miss rate?
Another engineer further analyzed the workload and found the following: 30% of
the instructions access memory for data, the D-cache miss rate was 10%, the
I-cache miss rate was 5%, and the memory access time is 50 nanoseconds.
(b) Considering this new information, what is the average CPI for M1?
(c) What would be the minimum clock rate required by M2 to execute the workload
    as fast as M1?
Answer:
(a) CPI for M1 = 0.2 × 8 + 0.1 × 6 + 0.1 × 10 + 0.4 × 8 + 0.2 × 8 = 8
    CPI for M2 = 0.2 × 4 + 0.1 × 6 + 0.1 × 2 + 0.4 × 2 + 0.2 × 8 = 4
    The instruction time for M1 = 8 / 200 MHz = 40 ns.
    Suppose the clock rate for M2 is x; then 4 / x = 40 ns, so x = 100 MHz.
    The minimum clock rate for M2 is 100 MHz.
(b) Miss penalty = 50 ns × 200 MHz = 10 clocks
    Average CPI for M1 = 8 + 1 × 0.05 × 10 + 0.3 × 0.1 × 10 = 8.8
(c) Suppose the clock rate for M2 is x GHz.
    Miss penalty = 50 ns × x GHz = 50x clocks
    Average CPI for M2 = 4 + 1 × 0.05 × 50x + 0.3 × 0.1 × 50x = 4 + 4x
    8.8 / 200 MHz = (4 + 4x) / x, so x = 0.1 GHz = 100 MHz.
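The arithmetic in question 4 can be checked with a short script (a sketch; the
variable names are my own, and the mix and CPI values are the ones from the
table above):

```python
# Check the CPI and clock-rate arithmetic of question 4.
mix = {"A": 0.20, "B": 0.10, "C": 0.10, "D": 0.40, "E": 0.20}
cpi_m1 = {"A": 8, "B": 6, "C": 10, "D": 8, "E": 8}
cpi_m2 = {"A": 4, "B": 6, "C": 2, "D": 2, "E": 8}

avg_m1 = sum(mix[c] * cpi_m1[c] for c in mix)    # 8.0
avg_m2 = sum(mix[c] * cpi_m2[c] for c in mix)    # 4.0

# (a) M2 matches M1 when avg_m2 / f2 equals avg_m1 / 200 MHz.
f2_mhz = avg_m2 * 200 / avg_m1                   # 100 MHz

# (b) A 50 ns memory access at 200 MHz costs 10 cycles; every instruction
# fetch can miss in the I-cache, and 30% of instructions can miss in the
# D-cache.
penalty = 50e-9 * 200e6                          # 10 cycles
avg_m1_mem = avg_m1 + 1 * 0.05 * penalty + 0.3 * 0.10 * penalty   # 8.8
```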

97 年台大資工

1. I/O and Storage
Redundant Arrays of Inexpensive Disks, as named by the inventors and commonly
referred to as RAID, is a technology that supports the integrated use of two or
more hard drives in various configurations for the purposes of achieving
greater performance, reliability through redundancy, and larger disk volume
sizes through aggregation. RAID 0 is a striped set without parity. RAID 1 is a
mirrored set without parity. The two different schemes are illustrated below.
A1 stores the first block of file A, A2 stores the second block of file A, etc.

    RAID 0 (striping)        RAID 1 (mirroring)
    Disk 0   Disk 1          Disk 0   Disk 1
      A1       A2              A1       A1
      A3       A4              A2       A2
      A5       A6              A3       A3
      A7       A8              A4       A4

Let the size of each disk be 320 GB.
(a) What is the total capacity of a RAID 0 set with 2 disks?
(b) What is the total capacity of a RAID 1 set with 2 disks?
(c) For write operations, which RAID set has the higher throughput?
(d) For read operations, can the RAID 1 set deliver the same throughput as the
    RAID 0 set?
(e) Which RAID set has the higher availability?
For the following questions, suppose that you have 4 disks and you are
considering the following RAID configurations:
    RAID 0: a 4-disk striped set
    RAID 0 + 1: 2 striped sets in a mirrored set
    RAID 1 + 0: 2 mirrored sets in a striped set
(f) Draw a picture to illustrate how a file is stored on the RAID 0 + 1 set.
(g) Which configuration has the highest write throughput?
(h) When one of the disks has failed, which configuration has the highest write
    throughput?
Answer:
(a) 320 GB × 2 = 640 GB
(b) 320 GB
(c) RAID 0
(d) Yes (one write or two reads are possible per mirrored pair)
(e) RAID 1
(f) Configuration of RAID 0 + 1:

                    RAID 1 (mirrored set)
        RAID 0 (stripe)        RAID 0 (stripe)
        Disk 0   Disk 1        Disk 2   Disk 3
          A1       A2            A1       A2
          A3       A4            A3       A4
          A5       A6            A5       A6
          A7       A8            A7       A8
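The block-to-disk mappings in the figures above can be sketched as small
placement functions (an illustration; the disk numbering and zero-based block
numbering are my own conventions, so block 0 corresponds to A1):

```python
# Block placement for the 2-disk RAID 0/RAID 1 sets and the 4-disk RAID 0+1
# set shown in the figures.

def raid0_disk(block):
    # RAID 0 striping across two disks: blocks alternate between the disks.
    return block % 2

def raid1_disks(block):
    # RAID 1 mirroring: every block lives on both disks.
    return (0, 1)

def raid01_disks(block):
    # RAID 0+1: stripe across disks 0-1; disks 2-3 mirror the whole stripe.
    stripe_disk = block % 2
    return (stripe_disk, stripe_disk + 2)
```

For example, block 0 (A1) lands on disk 0 of the first stripe set and disk 2 of
the mirror, exactly as drawn in the RAID 0 + 1 figure.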

(g) RAID 0
(h) RAID 1 + 0

2. Multiprocessor Systems
Multiprocessor systems can be found in everyday use. Many new personal
computers (PCs) are equipped with multi-core processor chips. Because the
processor cores on such a system typically share everything on the system bus
and the I/O bus, this kind of system is classified as a Symmetric
Multiprocessor (SMP).
(a) The multi-core design is popular because most existing applications running
    on a PC would benefit from the additional processor cores. True or False?
(b) Having multiple cores share the Level-2 cache on the same chip is always a
    good idea because most of the data in the cache are shared by multiple
    cores. True or False?
(c) The shared-memory programming model is most natural for developing a
    parallel application on an SMP system. For that, which of the following can
    be used? (You can have multiple choices.)
    Message-Passing Interface (MPI), OpenMP, POSIX Threads, XML, HTML, Java

People sometimes pay too little attention to I/O and the operating system (OS)
when they develop parallel applications for SMP systems. For example, an
application has been fully parallelized to run on an SMP system, but during an
8-processor run, 50% of the time a processor has to stall because it is waiting
to access the disk. There is only one disk in the system. Assume that the OS
issues one disk access at a time.
(d) Suppose that the application runs for 100 seconds when only one processor
    is used. How many seconds does the application require on the 8-processor
    SMP system? (Note that the 100 seconds is wall-clock time, measured from
    the beginning to the end of the application.)
(e) For the 8-processor run, suppose that the disk is transferring data during
    80% of the execution time. What is the minimum execution time even if the
    system has an unlimited number of processors?
(f) In order to improve the application performance, the disk is replaced with
    a 5-disk RAID 0 system. What is the minimum execution time required for
    this application with the new RAID disk?
Answer:
(a) False. (The performance gained from a multicore processor depends on the
    problem being solved and the algorithms used, as well as their
    implementation in software.)
(b) False. (Only if a sufficient amount of data is shared by the processors can
    a shared cache increase throughput.)
(c) OpenMP, POSIX Threads
(d) Suppose x is the I/O time: (100 − x) / (0.5 × 8) = x, so x = 20.
    Execution time = 20 + 20 = 40 (seconds)
(e) 40 × 0.8 = 32 (seconds)
(f) 80/8 + 20/5 = 14 (seconds)

3. True or False
(a) VLIW is a software-based approach to exploit ILP.
(b) Pipelining improves both instruction latency and throughput.
(c) The delayed branch mechanism has lost popularity compared to the more
    expensive but more flexible dynamic branch prediction as processors go to
    both longer pipelines and wider issue widths.
(d) Two instructions with data dependences will cause a data hazard in
    pipelined execution.
(e) A one-bit branch predictor could sometimes achieve higher branch prediction
    accuracy than a two-bit branch predictor.
Answer:
(a) True (b) False (c) True (d) False (e) True
Note on (e): if the branch instruction comes from a high-level-language loop,
the 2-bit predictor has the higher accuracy; if it comes from an if-then-else,
either predictor may be more accurate.

4. For the following memory system parameters:
  • Cache: virtually-addressed but physically-tagged, 8 KB, direct-mapped,
    32 B block size
  • TLB: fully associative, 32 entries
  • 32-bit physical address
  • Virtual memory: 8 KB page, 64-bit virtual address
(a) How many bits are needed in each of the tag, index, and block offset fields
    of the cache?
(b) How many bits are needed in each tag field of the TLB?
(c) Why do we want to use a virtually-addressed but physically-tagged cache?
Answer:
(a) The number of blocks in the cache = 8 KB / 32 B = 256 = 2^8
    The number of bits in the index field = 8
    The number of bits in the tag field = 32 − 8 − 5 = 19
    The number of bits in the block offset field = 5 − 2 = 3

        Tag | Index | Block offset | Byte offset
         19 |   8   |      3       |      2

(b) The number of bits in the tag field of the TLB = 64 − 13 = 51
(c) When the cache is accessed with a virtual address and pages are shared
    between programs, there is the possibility of aliasing. A
    virtually-addressed but physically-tagged cache avoids aliasing.

5. Multiple Choices
An embedded processor has both 32- and 16-bit instructions; the 16-bit
instructions are a subset of the most commonly used 32-bit instructions.
Compared to code written in 32-bit instructions, code written in 16-bit
instructions could potentially
(a) increase or decrease instruction count?
(b) increase or decrease code density?
(c) increase or decrease I-cache hit rate?
Answer:
Relative to 32-bit code, 16-bit code increases the instruction count, increases
the code density (the same program fits in fewer bytes), and increases the
I-cache hit rate.
Note: computers with high code density also often have complex instructions.

6. List one advantage of variable-length encoding over fixed-length encoding.
Answer:
Variable-length encoding generally has higher memory utilization (better
compression) than fixed-length encoding.
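The field-width arithmetic in question 4 can be reproduced directly (a sketch;
variable names are my own, sizes are in bytes, and the 2-bit byte offset for
32-bit words follows the answer above):

```python
import math

# Question 4: cache and TLB field widths.
cache_size, block_size = 8 * 1024, 32
phys_addr_bits, virt_addr_bits, page_size = 32, 64, 8 * 1024

index_bits = int(math.log2(cache_size // block_size))   # 256 blocks -> 8 bits
block_offset_bits = int(math.log2(block_size // 4))     # 8 words -> 3 bits
byte_offset_bits = 2                                    # 4 bytes per word
tag_bits = phys_addr_bits - index_bits - block_offset_bits - byte_offset_bits

page_offset_bits = int(math.log2(page_size))            # 8 KB page -> 13 bits
tlb_tag_bits = virt_addr_bits - page_offset_bits        # virtual page number
```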

101 年台聯大電機

1. Consider two different implementations, M1 and M2, of the same instruction
set. There are three classes of instructions (A, B, and C) in the instruction
set. M1 has a clock rate of 500 MHz, and M2 has a clock rate of 250 MHz. The
average number of cycles per instruction class on M1 and M2 is given in the
following table:

    Class | CPI on M1 | CPI on M2 | C1 usage | C2 usage | Third-party usage
    A     |     4     |     2     |   30%    |   30%    |       50%
    B     |     6     |     4     |   50%    |   20%    |       30%
    C     |     8     |     3     |   20%    |   50%    |       20%

The table also contains a summary of how three different compilers use the
instruction set. C1 is a compiler produced by the makers of M1, C2 is the
compiler produced by the makers of M2, and the other compiler is a third-party
product. Assume that each compiler uses the same number of instructions for a
given program but that the instruction mix is as described in the table.
(1) Assume that peak performance is defined as the fastest rate at which a
    machine can execute an instruction sequence chosen to maximize that rate.
    What are the peak performances of M1 and M2, expressed as instructions per
    second?
(2) Using the third-party compiler on both M1 and M2, what is the CPI for each
    machine? What is the execution time for each machine if the number of
    instructions for the given program is 1,000,000 in this case? What clock
    rate should M2 have to achieve the same performance as M1 in this case?
(3) Using C1 on both M1 and M2, how much faster can the makers of M1 claim that
    M1 is compared with M2?
(4) Using C2 on both M1 and M2, how much faster can the makers of M2 claim that
    M2 is compared with M1?
(5) If you purchase M1, which compiler would you use? (You must give a reason
    to support your answer; simply writing down the compiler without
    justification gets no score.)
(6) If you purchase M2, which compiler would you use? (You must give a reason
    to support your answer; simply writing down the compiler without
    justification gets no score.)
Answer
(1) Peak performance for M1 = 500M / 4 = 125M instructions/sec
    Peak performance for M2 = 250M / 2 = 125M instructions/sec
(2) CPI for M1 = 0.5 × 4 + 0.3 × 6 + 0.2 × 8 = 5.4
    CPI for M2 = 0.5 × 2 + 0.3 × 4 + 0.2 × 3 = 2.8
    Execution time for M1 = (1M × 5.4) / 500M = 10.8 ms
    Execution time for M2 = (1M × 2.8) / 250M = 11.2 ms
    Suppose the clock rate for M2 should be x; then
    x = (1M × 2.8) / 10.8 ms = 259.26 MHz.
(3) CPI for M1 = 0.3 × 4 + 0.5 × 6 + 0.2 × 8 = 5.8
    CPI for M2 = 0.3 × 2 + 0.5 × 4 + 0.2 × 3 = 3.2
    Execution time for M1 = (1M × 5.8) / 500M = 11.6 ms
    Execution time for M2 = (1M × 3.2) / 250M = 12.8 ms
    M1 is 12.8 / 11.6 = 1.10 times faster than M2.
(4) CPI for M1 = 0.3 × 4 + 0.2 × 6 + 0.5 × 8 = 6.4
    CPI for M2 = 0.3 × 2 + 0.2 × 4 + 0.5 × 3 = 2.9
    Execution time for M1 = (1M × 6.4) / 500M = 12.8 ms
    Execution time for M2 = (1M × 2.9) / 250M = 11.6 ms
    M2 is 12.8 / 11.6 = 1.10 times faster than M1.
(5) The execution time for M1 under C1 is 11.6 ms, under C2 is 12.8 ms, and
    under the third-party compiler is 10.8 ms. So the third-party compiler
    should be used.
(6) The execution time for M2 under C1 is 12.8 ms, under C2 is 11.6 ms, and
    under the third-party compiler is 11.2 ms. So the third-party compiler
    should be used.

2. An array of integers is stored in memory beginning at the address given in
register $s0. The array length is provided in register $s1. The MIPS code for
finding the maximum value in the array is shown below.

          lw   $t0, 0($s0)        # $t0 = array[0]
          add  $t1, $zero, $zero
    L1:   addi $t1, $t1, 1
          beq  (a)
          sll  $t2, $t1, 2        # $t2 = $t1 * 4
          add  (b)
          lw   $t3, 0($t2)
          (c)  $t4, $t0, $t3
          beq  $t4, $zero, L1
          add  $t0, $t3, $zero
          j    L1
    Exit:

(1) Write down the missing parts in (a), (b), (c) to complete the code.
(2) Suppose we place the code starting at location 0x00000A20 in memory. What
    is the MIPS machine code for the instruction j L1 in hexadecimal
    representation? Note that the value of the 6-bit jump opcode is 2.
(3) What is the value of the 16-bit address field of the instruction
    beq $t4, $zero, L1 as a decimal number? (beq opcode = 4, $t4 = 12)
Answer
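The CPI and execution-time claims in question 1 can be cross-checked with a
short script (a sketch; the dictionary names are my own, and the mixes and CPI
values come from the table above):

```python
# Question 1: average CPI and execution time for a 1,000,000-instruction
# program on each machine/compiler pair.
cpi = {"M1": (4, 6, 8), "M2": (2, 4, 3)}        # instruction classes A, B, C
clock_hz = {"M1": 500e6, "M2": 250e6}
mix = {"C1": (0.3, 0.5, 0.2), "C2": (0.3, 0.2, 0.5), "3rd": (0.5, 0.3, 0.2)}

def exec_time(machine, compiler, n_instr=1_000_000):
    avg_cpi = sum(f * c for f, c in zip(mix[compiler], cpi[machine]))
    return n_instr * avg_cpi / clock_hz[machine]
```

With these numbers, `exec_time("M1", "3rd")` gives 10.8 ms and
`exec_time("M2", "3rd")` gives 11.2 ms, matching answer (2).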

(1) (a) $t1, $s1, Exit    (b) $t2, $t2, $s0    (c) slt
(2) 000010 00000000000000001010001010 (binary) = 0x0800028A
(3) -7

3. Consider two MIPS instructions: xor (exclusive OR) and nor (not OR). The
table shown below defines these operations on a bit-by-bit basis.

    A | B | A xor B | A nor B
    0 | 0 |    0    |    1
    0 | 1 |    1    |    0
    1 | 0 |    1    |    0
    1 | 1 |    0    |    0

For each of the following new instructions, use only xor and nor to provide the
minimal instruction sequence that accomplishes the same thing.
(1) The instruction not takes the one's complement of the source register and
    places it in the destination register:
        not  $t0, $t1     # $t0 = not($t1)
(2) The instruction swap exchanges two registers. After the instruction is
    executed, the destination register has the original value of the source
    register, and the source register has the original value of the destination
    register:
        swap $t0, $t1     # $t0 <-> $t1
Answer
(1) nor  $t0, $t1, $zero
(2) xor  $t0, $t0, $t1
    xor  $t1, $t0, $t1
    xor  $t0, $t0, $t1
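The two answer sequences can be traced on ordinary integers. The following is a
minimal sketch (function names are my own; results are masked to 32 bits to
mimic MIPS registers):

```python
# The nor-based not and the three-xor swap from the answer above.
MASK = 0xFFFFFFFF

def mips_not(t1):
    return ~(t1 | 0) & MASK        # nor $t0, $t1, $zero

def xor_swap(t0, t1):
    t0 ^= t1                       # xor $t0, $t0, $t1
    t1 ^= t0                       # xor $t1, $t0, $t1  ->  original $t0
    t0 ^= t1                       # xor $t0, $t0, $t1  ->  original $t1
    return t0, t1
```

For example, `xor_swap(0x1234, 0xABCD)` returns `(0xABCD, 0x1234)`, showing the
registers exchanged without any temporary register.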

4. Consider the following four code sequences:

    (1) ADDI R1, R1, #4        (2) ADD  R3, R1, R2
        LW   R2, 7(R1)             LW   R2, 7(R1)

    (3) LW   R2, 7(R1)         (4) BEZ  R1, L1
        LW   R2, 200(R7)           SW   R1, 7(R1)

For each code sequence, identify each type of dependence that exists and
describe what data flow or control structure causes that dependence.
Answer
(1) RAW (read after write): it arises when the next instruction tries to read a
    source before the previous instruction writes to it, so the next
    instruction incorrectly gets the old value.
(2) WAR (write after read): it arises when the next instruction writes to a
    destination before the previous instruction reads it; in this case the
    previous instruction incorrectly gets the new value.
(3) WAW (write after write): it is the situation in which the next instruction
    writes to a destination before a previous instruction writes to it, leaving
    the changes done in the wrong order.
(4) Control dependence: an instruction is control dependent on a preceding
    instruction if the outcome of the latter determines whether the former
    should be executed or not.

5. There is a deeply pipelined processor with the following pipeline stages:

    IF1  IF2  ID1  ID2  EXE1  EXE2  EXE3  EXE4  MEM1  MEM2  WB

The branch target address is calculated in the ID1 stage (3rd stage), and the
branch is resolved in the EXE2 stage (6th stage). Describe the branch penalties
(in cycles) for each strategy for handling the control hazard in the following
table. Assume that the compiler can always find instructions to fill the branch
delay slots.

                       | Unconditional | Conditional branch
                       |    branch     |  taken  | untaken
    Predict taken      |               |         |
    Predict not-taken  |               |         |
    Delayed branch     |               |         |

Answer

                       | Unconditional | Conditional branch
                       |    branch     |  taken  | untaken
    Predict taken      |       0       |    0    |    5
    Predict not-taken  |       3       |    5    |    0
    Delayed branch     |       0       |    0    |    0

6. Consider the following pipelined MIPS datapath with forwarding and stall
control. The signals Src1 and Src2 are 2-bit control signals, while all the
other control lines are 1-bit signals. Note that $1 in the code shown below
denotes the register with index 1. Assume the pipeline is initially empty and
that all register writes occur at the end of the clock cycle.

    sll $2, $1, 2
    add $2, $2, $3
    lw  $4, 0($2)
    add $5, $4, $1
    sub $6, $4, $1

[Figure: five-stage pipelined datapath with IF/ID, ID/EX, EX/MEM, and MEM/WB
pipeline registers; a hazard detection unit drives PCWrite, IF/IDWrite, and the
Stall MUX, and a forwarding unit drives the Src1 and Src2 MUXes at the ALU
inputs. RegDst selects the destination register (rt or rd), and MtoR selects
between the ALU result and the data-memory output in the WB stage.]

(1) Assume I-Mem, MUX, ALU, register file, and D-Mem have latencies of 50 ps,
    10 ps, 30 ps, 20 ps, and 50 ps, respectively. Except for the register file,
    the latency of all other registers (including the PC) is 15 ps. Each
    control unit (Control, Hazard detection, Forwarding) has a latency of
    20 ps. What is the minimum clock cycle time of this processor? Please
    explain your answer.
(2) Assume the given code is executed on this processor. Please draw the simple
    (traditional) multiple-clock-cycle pipeline execution diagram of the first
    8 clock cycles from the beginning, including all necessary stalls or nops.
(3) In (2), what are the values of the MUX selection signals (Stall, Src1,
    Src2, RegDst, MtoR) in the 5th cycle from the beginning?
(4) In (2), what are the values of the MUX selection signals (Stall, Src1,
    Src2) in the 6th cycle from the beginning?

(5) If we want to implement forwarding capability for a store operation
    immediately after a load instruction, what circuit should be added? Please
    redraw the datapath, including the added parts, on your answer sheet.
Answer
(1) The longest delay is in the EX stage: 20 ps + 10 ps + 30 ps = 60 ps
    (forwarding unit, then MUX, then ALU). The minimum clock cycle time
    = 60 ps + 15 ps = 75 ps.
(2)
                      c1   c2   c3   c4   c5   c6   c7   c8
    sll $2, $1, 2     IF   ID   EX   MEM  WB
    add $2, $2, $3         IF   ID   EX   MEM  WB
    lw  $4, 0($2)               IF   ID   EX   MEM  WB
    add $5, $4, $1                   IF   ID   ID   EX   MEM
    sub $6, $4, $1                        IF   IF   ID   EX

(3) In the 5th cycle, the load-use hazard between lw and add $5 is detected in
    ID while lw is in EX (its base register $2 is forwarded from EX/MEM, and
    its second operand is the sign-extended immediate):
        Stall = 1, Src1 = 10, Src2 = 11, RegDst = 0, MtoR = 1
    Note: it is assumed that when Src2 = 11, the output of the sign-extension
    unit passes through the MUX to the second (lower) ALU input.
(4) In the 6th cycle, the EX stage holds a bubble:
        Stall = 0, Src1 = xx, Src2 = xx (don't care)
(5) We introduce a MUX, controlled by a signal WDSrc, that selects the data
    applied to the write-data input of the data memory either from the register
    file or from the data memory output forwarded through the MEM/WB pipeline
    register.
    [Figure: the datapath of the question with the added WDSrc MUX in front of
    the data memory's write-data input.]

7. Consider two chip multiprocessors (CMPs), P1 and P2, each with dual cores.
The CMPs contain L1 and L2 caches on a single chip. P1 has private L2 caches,
and P2 has a shared L2 cache. When the two CMPs are evaluated with benchmark A,
the resulting L1 and L2 miss rates and hit latencies are as follows:

    CMP | L1 miss rate | L2 miss rate
    P1  |      5%      |      2%
    P2  |      5%      |     0.2%

    L1 hit latency: 0.6 ns
    L2 hit latency: 6 ns (private cache), 8 ns (shared cache)
    Main memory access time: 160 ns

When the two CMPs are evaluated with benchmark B, the L1 and L2 miss rates and
hit latencies are as follows:

    CMP | L1 miss rate | L2 miss rate
    P1  |      5%      |     0.1%
    P2  |      5%      |     0.04%

    L1 hit latency: 0.6 ns
    L2 hit latency: 6 ns (private cache), 20 ns (shared cache)
    Main memory access time: 160 ns

(1) What is the average memory access time (AMAT) of P1 for benchmark A?
(2) Which processor has the better AMAT for benchmark A, and why?
(3) Which L2 cache design has better performance for benchmark A? Please use
    data to support the answer.
(4) Which benchmark has better performance for the shared L2 cache design?
    Please use data to support the answer.
(5) Which processor is better for multithreaded workloads?
(6) Assume that the shared cache latency increases linearly with the CMP size
    and that the shared cache miss rate changes in inverse proportion to the
    CMP size. Which processor has better performance on benchmark A if the
    number of cores doubles for both CMPs, P1 and P2?
(7) With the doubled number of cores, can we determine how much more off-chip
    memory bandwidth is needed to maintain the same level of per-core
    performance for P1 and P2?
(8) For benchmark A, assume the base CPI = 1 and a nonblocking cache is used to
    increase the number of concurrent misses (i.e., outstanding misses in
    parallel) from 1 to 2. How much performance improvement can be achieved for
    P1 and P2?
Answer
(1) AMAT(P1) = 0.6 ns + 0.05 × 6 ns + 0.02 × 160 ns = 4.1 ns
(2) AMAT(P2) = 0.6 ns + 0.05 × 8 ns + 0.002 × 160 ns = 1.32 ns
    So P2 has a better AMAT than P1 for benchmark A.
(3) Since P1 and P2 have the same L1 cache and AMAT(P1) is greater than
    AMAT(P2) for benchmark A, the shared L2 cache performs better than the
    private L2 cache for benchmark A.
(4) AMAT(P2) for benchmark A = 0.6 ns + 0.05 × 8 ns + 0.002 × 160 ns = 1.32 ns
    AMAT(P2) for benchmark B = 0.6 ns + 0.05 × 20 ns + 0.0004 × 160 ns
    = 1.664 ns
    For the shared L2 cache, benchmark A has better performance.
(5) A shared cache is preferred as the L2 design because the shared L2 design
    maintains a single copy of shared data in the shared cache, resulting in a
    larger effective cache size. The shared L2 design can also tolerate
    imbalances across the working sets of the different threads running on each
    processor core.
(6) AMAT(P1) = 0.6 ns + 0.05 × 6 ns + 0.02 × 160 ns = 4.1 ns
    AMAT(P2) = 0.6 ns + 0.05 × 16 ns + 0.001 × 160 ns = 1.56 ns
    For benchmark A, P2 still has the better AMAT when the number of cores is
    doubled.
(7) For P1, no extra off-chip memory bandwidth is needed, since its AMAT is not
    affected by doubling the number of cores. Suppose we get 1 byte per memory
    access; then for P2 we need (1 / 1.32 ns) − (1 / 1.56 ns) = 116.55 MB/s
    more off-chip memory bandwidth to maintain the same level of per-core
    performance.
(8) Suppose the L1 cache access time dominates the processor clock cycle time,
    so the clock cycle time is 0.6 ns, and the nonblocking cache hides miss
    penalty: raising the number of concurrent misses from 1 to 2 halves the
    effective miss penalty.
    For P1, the original CPI = 1 + 0.05 × 10 + 0.02 × 266 = 6.82
    For P1 with the nonblocking cache, CPI = 1 + 0.05 × 5 + 0.02 × 133 = 3.91
    P1 achieves a 6.82 / 3.91 = 1.74× improvement.
    For P2, the original CPI = 1 + 0.05 × 10 + 0.002 × 266 = 2.03
    For P2 with the nonblocking cache, CPI = 1 + 0.05 × 5 + 0.002 × 133 = 1.516
    P2 achieves a 2.03 / 1.516 = 1.34× improvement.
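The AMAT expression the answers rely on can be captured in one helper (a
sketch; the function name is my own, and, as in the answers above, every L1
miss pays the L2 hit latency while the global L2 miss rate additionally pays
the main memory access time):

```python
# AMAT in nanoseconds for the two-level hierarchy of question 7.
def amat_ns(l1_hit, l1_miss_rate, l2_hit, l2_global_miss_rate, mem):
    return l1_hit + l1_miss_rate * l2_hit + l2_global_miss_rate * mem

p1_a = amat_ns(0.6, 0.05, 6, 0.02, 160)      # P1, benchmark A -> 4.1 ns
p2_a = amat_ns(0.6, 0.05, 8, 0.002, 160)     # P2, benchmark A -> 1.32 ns
p2_b = amat_ns(0.6, 0.05, 20, 0.0004, 160)   # P2, benchmark B -> 1.664 ns
```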

100 年台聯大電機

1.
(1) Some multicore designers claim their chips can attain a linear speedup with
    the number of processors. Describe the kind of problem that fits such a
    claim.
(2) True or false: RAID systems rely on redundancy to achieve high reliability.
(3) True or false: in the simultaneous multithreading case, both thread-level
    parallelism and data parallelism are exploited for parallel execution.
(4) Given the following instruction sequences of three threads (the vertical
    axis is time, the horizontal axis is the issue slot), how many clock cycles
    will fine-grained multithreading and simultaneous multithreading use,
    respectively? The annotation [] denotes an issue slot. Assume a maximum of
    four issue slots per cycle.

    Thread 1: []  [][]  [][][]  [][]  (stall)  []  [][][][]  [][][]  []
    Thread 2: [][][]  [][][][]  [][]  (stall)  []  [][][]  (stall)  [][]  (stall)  [][][]
    Thread 3: [][]  [][][]  (stall)  (stall)  [][][][]  []  []  [][]  []

Answer
(1) The problem set that fits such a claim should possess the following
    features: it is 100% parallelizable, and there is no data dependency
    between the processes (threads).
(2) False. (Redundancy buys availability, not reliability.)
(3) False. (Simultaneous multithreading exploits thread-level and
    instruction-level parallelism.)
(4) Fine-grained multithreading: 22 clock cycles.
    Simultaneous multithreading: ceil(49 / 4) = 13 clock cycles.
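The cycle counts in (4) can be derived from per-row issue-slot counts read off
the thread listings above (a sketch; a 0 marks a stall row, and the SMT figure
is the idealized lower bound of total slots divided by the 4-wide issue width):

```python
import math

# Issue slots per row for each thread; 0 marks a stall cycle.
t1 = [1, 2, 3, 2, 0, 1, 4, 3, 1]
t2 = [3, 4, 2, 0, 1, 3, 0, 2, 0, 3]
t3 = [2, 3, 0, 0, 4, 1, 1, 2, 1]

# Fine-grained multithreading issues from one thread per cycle, switching
# every cycle, so each non-stall row of each thread costs one cycle while
# stall rows are hidden by the other threads.
fine_grained = sum(1 for t in (t1, t2, t3) for slots in t if slots > 0)

# Simultaneous multithreading lets slots from all threads share a cycle,
# bounded only by the issue width of 4.
smt = math.ceil(sum(map(sum, (t1, t2, t3))) / 4)
```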


2.
(1) Describe Amdahl's law, and use Amdahl's law to explain why a multicore
    processor cannot obtain an improvement factor equal to the number of cores.
(2) Suppose a program segment consists of a purely sequential part that takes
    20 cycles to execute, and an iterated loop that takes 200 cycles per
    iteration. Assume the loop iterations are independent and cannot be further
    parallelized. If the loop is executed 200 times, what is the maximum
    speedup possible using an infinite number of processors (compared to a
    single processor)?
Answer
(1) Amdahl's law gives the theoretical maximum speedup of a system when only
    part of it can be improved. Since only some portions of a program can run
    in parallel, the speedup obtained from a multicore processor is limited by
    the time needed for the sequential portion of the program to execute, so
    the improvement factor is smaller than the number of cores.
(2) Execution time before improvement = 200 × 200 + 20 = 40020 cycles
    Execution time after improvement = 200 + 20 = 220 cycles
    Speedup = 40020 / 220 = 181.91
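The part-(2) arithmetic can be stated in a few lines (a sketch; the variable
names are my own):

```python
# Question 2 part (2): sequential part of 20 cycles plus 200 independent
# 200-cycle iterations.
serial_cycles = 20 + 200 * 200     # one processor runs everything in turn
parallel_cycles = 20 + 200         # infinitely many processors run all
                                   # iterations at once
speedup = serial_cycles / parallel_cycles
```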


3. Assume variable key is associated with register $s1, variable h is associated with register $s2 and the base address of the array A is in $s3. Therefore, the following C program segment can be converted to the MIPS assembly code as shown below. for (h = 0; h < 10; h++) if (A[h] == key) break; add $s2, $zero, $zero L1: slti $t0, $s2, 10 (a) $t0, $zero, Exit (b) $t0, $s2, 2 add $t0, $t0, $s3 lw $t1, (c) beq $t1, $s1, Exit addi $s2, $s2, 1 j L1 Exit: (1) Please fill in the blanks (a), (b), (c) to complete this assembly code. (2) Assume this code is put at the memory address starting from 0x00000600, what is the machine code of the instruction beq $t1, $s1, Exit in hexadecimal form? (beq  opcode = 4, $t1  9, $s1  17) Answer (1) (a) beq

(b) sll

(c) 0($t0)

(2)
Field:   OP      Rs     Rt     16-bit address
Binary:  000100  01001  10001  0000 0000 0000 0010
Hexadecimal form: 0x11310002
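The encoding can be verified by packing the I-type fields; the addresses below follow the layout in the question (code starting at 0x600, so this beq sits at 0x618 and Exit: at 0x624):

```python
# beq $t1, $s1, Exit: op = 4, rs = $t1 = 9, rt = $s1 = 17.
# The branch offset is a signed word offset relative to PC + 4:
op, rs, rt = 4, 9, 17
offset = (0x624 - (0x618 + 4)) // 4        # = 2
word = (op << 26) | (rs << 21) | (rt << 16) | (offset & 0xFFFF)
print(hex(word))    # 0x11310002
```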

4. Assume we are designing a 16-bit MIPS CPU with 16-bit instruction words. Please adjust the instruction formats according to the following specifications. (1) Assume this CPU has 6 arithmetic/logic instructions and 10 other instructions. In addition, assume there are 16 registers in this CPU; each is 16-bit long. If the 3-operand format (R-type format) is still adopted, what

are the numbers of bits of the fields (a), (b), (c) in this 16-bit instruction format? Please explain your reasons.
op | rs | rt | rd | shamt | funct
(a)  (b)                    (c)
(2) In the original 32-bit MIPS CPU, there is an instruction addi that can add a 16-bit immediate number to the content of a register. Using the specifications in (1), how can the addi instruction be implemented in this 16-bit MIPS CPU if the bit width of immediate numbers should be kept as 16? Please propose one possible solution with brief explanations.
(3) Assume the IEEE-754 floating-point representation is also adjusted to 16 bits for this 16-bit MIPS CPU. If the floating-point numbers are required to represent values within ±10^18, what are the numbers of bits of the fields (d), (e), (f) in this 16-bit floating-point format? Please explain your reasons. (Assume log10 2 = 0.3.)
sign | exponent | mantissa
(d)    (e)        (f)
(4) In the IEEE-754 floating-point representation, the exponent field employs excess-N coding. What should N be in your 16-bit floating-point format if it follows a mechanism similar to the IEEE 32-bit standard?
Answer
(1) (a) op: 4 bits, (b) each register field: 4 bits, (c) funct: 3 bits.
(2) Suppose that in the 16-bit MIPS CPU the length of the constant field in the I-type instruction is 8 bits. The 32-bit addi instruction (for example, addi $s1, $s0, 16-bit-constant) can then be implemented with the following 16-bit instructions:
lui $t0, upper-half(16-bit-constant)
ori $t0, $t0, lower-half(16-bit-constant)
add $s1, $s0, $t0
(3) 10^18 ≈ 2^(18/0.3) = 2^60 = 2 × 2^59, so the format must support exponents up to +59. With 7 exponent bits the maximum positive exponent is 63 > 59; with 6 bits it is only 31 < 59. Hence (d) sign = 1 bit, (e) exponent = 7 bits, (f) mantissa = 16 − 1 − 7 = 8 bits.
(4) N = 2^(7−1) − 1 = 63.
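The exponent-width argument in (3) can be checked with exact integer arithmetic instead of the log10 2 ≈ 0.3 approximation:

```python
target = 10 ** 18

# 10^18 lies between 2^59 and 2^60, so a normalized 1.f x 2^59 value can
# reach it and the format must support an exponent of +59.
print(2 ** 59 < target < 2 ** 60)     # True

# With a k-bit exponent and IEEE-style bias 2^(k-1) - 1 (all-ones reserved),
# the largest usable exponent is 2^(k-1) - 1.
for k in (6, 7):
    print(k, 2 ** (k - 1) - 1)        # 6 31 (too small), 7 63 (enough)
```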

5. Pipelining affects the clock cycle time of the processor. Following the three assumptions listed below, let's compare clock cycle times and execution times with single-cycle, multi-cycle, and pipelined organizations of this processor. (A multi-cycle organization means each instruction takes multiple cycles but one instruction finishes before another is fetched. In this organization, an instruction only goes through the stages it actually needs.)
(i) Individual stages (IF, ID, EX, MEM, WB) of the datapath of a simplified implementation (covering only lw, sw, beq and ALU instructions) have the following latencies:
IF: 200 ps, ID: 150 ps, EX: 120 ps, MEM: 190 ps, WB: 140 ps
(ii) The instructions executed by the processor are broken down as follows:
ALU: 30%, beq: 25%, lw: 30%, sw: 15%
(iii) There are no stalls or hazards.
(1) What is the clock cycle time in a pipelined and in a non-pipelined single-cycle processor?
(2) If the execution time of the pipelined organization is denoted as T, what are the execution times of the single-cycle and multi-cycle organizations in terms of T?
Answer
(1) Pipelined: 200 ps (the slowest stage). Non-pipelined single-cycle: 200 + 150 + 120 + 190 + 140 = 800 ps.
(2) CPI for the multi-cycle machine = 0.3 × 4 + 0.25 × 3 + 0.3 × 5 + 0.15 × 4 = 4.05

                   Single-cycle   Multi-cycle   Pipeline
Cycle time         800 ps         200 ps        200 ps
CPI                1              4.05          1
Instruction time   800 ps         810 ps        200 ps
Execution time     4T             4.05T         T
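The numbers in the table follow mechanically from the stage latencies and the instruction mix; a short sketch:

```python
stage = {'IF': 200, 'ID': 150, 'EX': 120, 'MEM': 190, 'WB': 140}   # ps

single_cycle = sum(stage.values())   # one long cycle covers every stage
pipelined = max(stage.values())      # the slowest stage sets the clock

# Stages each class actually uses in the multi-cycle design:
# ALU: IF,ID,EX,WB = 4; beq: IF,ID,EX = 3; lw: all 5; sw: IF,ID,EX,MEM = 4
cycles = {'ALU': 4, 'beq': 3, 'lw': 5, 'sw': 4}
mix = {'ALU': 0.30, 'beq': 0.25, 'lw': 0.30, 'sw': 0.15}
cpi = sum(mix[c] * cycles[c] for c in mix)

print(single_cycle, pipelined, round(cpi, 2))   # 800 200 4.05
```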


6. Assume we want to execute the following addi instruction in the single-cycle datapath: addi $19, $29, 16. The single-cycle datapath diagram below shows the execution of this instruction.
[Figure: single-cycle datapath. The register-file read ports receive 29 and 19, the sign-extend unit receives 16, and the ten datapath values to be determined are labeled ?a through ?j. Control signals shown include RegDst, RegWrite, ALUSrc, ALUOp, MemWrite, MemRead, MemtoReg and PCSrc.]

(1) Please set the control signals in the following table so that addi can be executed properly. Use 'x' for 'don't care' if necessary. RegDst

ALUSrc

MemToReg

PCSrc

(2) Several of the datapath values are already shown in the diagram. You are to provide values for the ten remaining signals, which are labeled with a question mark (?) followed by a character (from a to j). Please write their decimal values in the following table. Assume register $29 and PC initially contain the value of 145 and 32, respectively. If a value cannot be determined, mark it as 'X'. a b c d e f g h i j

Answer
(1) RegDst = 0, ALUSrc = 1, MemToReg = 0, PCSrc = 0
(2)
a    b   c     d    e   f    g     h    i     j
19   X   145   16   X   16   161   64   100   36

7. Given a multiple-issue processor with 10 pipeline stages and an issue width of 4. Assume that the processor always executes the maximum number of instructions per cycle if there are no control hazards. For control dependences, the processor uses branch prediction and continues fetching from the predicted path. If a branch has been mispredicted, the instructions fetched after the mispredicted branch are discarded when the branch outcome is resolved in stage 8. In the next cycle, the processor starts fetching from the correct path. Assume 20% of all executed instructions are branches with 90% prediction accuracy.
(1) How many register read ports should the processor have to avoid any resource hazards due to register reads?
(2) If there are no branch mispredictions and no data dependences, what is the expected performance improvement over a 1-issue processor with the classical 5-stage pipeline? Assume that the clock cycle time decreases in proportion to the number of pipeline stages.
(3) With branch mispredictions, how many instructions are expected to be executed between mispredictions?
(4) How many branch instructions are expected to be in progress (i.e., fetched but not yet committed) at any given time?
(5) What percentage of all cycles is entirely spent on fetching wrong-path instructions?
(6) If we want to limit the stalls due to mispredicted branches to zero (i.e., no stalls), what should the required branch prediction accuracy be?
Answer
(1) 4 instructions issued per cycle × 2 source registers each = 8 read ports.
(2) Speedup = (10 × 4) / (5 × 1) = 8.
(3) Instructions between branch mispredictions = 1 / (0.2 × 0.1) = 50.
(4) Up to 10 × 4 = 40 instructions can be in flight; branches resolve in stage 8, so the branches still in the first 7 stages are fetched but not yet resolved: 40 × 0.2 × (7/10) = 5.6 branch instructions.
(5) Instructions between branch mispredictions = 1 / (0.2 × 0.1) = 50, so there are 50 / 4 = 12.5 useful cycles between mispredictions. A misprediction wastes the 7 cycles fetched before resolution plus, on average, (3 + 2 + 1 + 0)/4 = 1.5 issue slots (= 1.5/4 cycles) in the cycle that issued the branch, about 7.4 stall cycles in all. Of these, 7 whole cycles are entirely spent fetching wrong-path instructions, so the fraction is 7 / (12.5 + 7.4) = 0.352.
(6) The prediction accuracy would have to be 100% to avoid any stall due to mispredicted branches.
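The misprediction bookkeeping in (3) and (5) can be recomputed in a few lines (a sketch following the same assumptions as the answer above):

```python
branch_frac, mispredict = 0.20, 0.10
issue_width, resolve_stage = 4, 8

insts_between = 1 / (branch_frac * mispredict)          # ~50 instructions
cycles_between = insts_between / issue_width            # ~12.5 useful cycles

# 7 whole cycles of wrong-path fetch, plus the partially wasted issue cycle
flush = (resolve_stage - 1) + (3 + 2 + 1 + 0) / 4 / issue_width   # 7.375 ~ 7.4
wasted_frac = 7 / (cycles_between + flush)   # cycles *entirely* wasted
print(round(insts_between), round(cycles_between, 1),
      round(flush, 3), round(wasted_frac, 3))   # 50 12.5 7.375 0.352
```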

8. The following table summarizes the memory hierarchy parameters and their effects on performance. Copy this table to your exam paper and fill in each missing cell with the most proper statement from below.

Design Change                            Potential advantages   Possible disadvantages
Increase cache size
Increase block size
Increase associativity
More sophisticated replacement policy

(a) Decrease capacity misses     (g) Increase capacity misses
(b) Decrease conflict misses     (h) Increase conflict misses
(c) Decrease compulsory misses   (i) Increase compulsory misses
(d) Decrease access time         (j) Increase access time
(e) Decrease decision time       (k) Increase decision time
(f) Decrease miss penalty        (l) Increase miss penalty

Answer
Design Change                            Potential advantages   Possible disadvantages
Increase cache size                      (a)                    (j)
Increase block size                      (c)                    (l)
Increase associativity                   (b)                    (j)
More sophisticated replacement policy    (b)                    (k)

9. Consider the virtual memory system of Processor XYZ described in the following table.

Characteristic      Processor XYZ
Virtual address     48 bits
Physical address    40 bits
Page size           4 KB
TLB organization    1 L1 TLB for instructions and 1 L1 TLB for data; both L1 TLBs are fully associative with 64 entries.
                    1 L2 TLB for instructions and 1 L2 TLB for data; both L2 TLBs are 4-way set associative with 256 entries.

Characteristics of the first-level cache (L1) and the second-level cache (L2) in Processor XYZ are summarized in the following table. Note that the cache size only includes the actual storage of instructions/data.

Characteristic            Processor XYZ
L1 cache organization     Split instruction and data caches
L1 block size             64 bytes
L1 cache associativity    2-way set associative
L1 cache size             16 KB each for instructions/data
L2 cache organization     Unified (instruction and data)
L2 block size             64 bytes
L2 cache associativity    4-way set associative
L2 cache size             256 KB

(1) What is the number of tag bits in the virtual address fed to the L1 data TLB?
(2) What is the number of tag bits in the virtual address fed to the L2 data TLB?
(3) What is the number of index bits in the physical address fed to the L1 data cache?
(4) What is the number of tag bits in the physical address fed to the L2 cache?
Answer
(1) Since the L1 TLB is fully associative, the whole virtual page number is the tag: 48 − 12 = 36 bits.
(2) The L2 TLB has 256 / 4 = 64 sets, so 6 index bits; tag size = 48 − 12 − 6 = 30 bits.
(3) Number of blocks in the L1 data cache = 16 KB / 64 B = 256; number of sets = 256 / 2 = 128; number of index bits = 7.
(4) Number of blocks in the L2 cache = 256 KB / 64 B = 4K; number of sets = 4K / 4 = 1K; number of tag bits = 40 − 10 − 6 = 24.
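The four quantities follow from set counts: a set-associative structure is indexed by log2(entries/ways) bits, so the 4-way, 256-entry L2 TLB uses 6 index bits and the tag works out to 48 − 12 − 6 = 30 bits. A quick recomputation:

```python
from math import log2

VA, PA, page_offset = 48, 40, 12          # 4 KB pages -> 12 offset bits

# (1) L1 data TLB: fully associative, so the whole VPN is the tag
l1_tlb_tag = VA - page_offset                          # 36

# (2) L2 data TLB: 256 entries, 4-way -> 64 sets -> 6 index bits
l2_tlb_tag = VA - page_offset - int(log2(256 // 4))    # 30

# (3) L1 data cache: 16 KB / 64 B = 256 blocks, 2-way -> 128 sets
l1_index = int(log2(16 * 1024 // 64 // 2))             # 7

# (4) L2 cache: 256 KB / 64 B = 4096 blocks, 4-way -> 1024 sets, 6 offset bits
l2_tag = PA - int(log2(256 * 1024 // 64 // 4)) - 6     # 24

print(l1_tlb_tag, l2_tlb_tag, l1_index, l2_tag)        # 36 30 7 24
```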


99 年台聯大電機
1. Assume three 32-bit variables x, y, and z, are stored in memory with addresses A, B, and C, respectively. (1) For a "load-store" instruction set architecture processor, write down the assembly code to compute x + y − z and then store the result back to C. (2) Suppose that all instruction operation codes are 6 bits, memory addresses are 24 bits (byte-addressable), and register addresses are 5 bits. What are the code sizes (in terms of bits) and the total memory accesses (in terms of bytes) for the code in (1)? Note that the total memory accesses include both instructions and data moved to or from memory. And, the memory accesses must be aligned.
Answer:
                          (2) Code size (bits)   Memory accesses (bytes)
(1) lw  $s1, A            6 + 5 + 24 = 35        8 + 4 = 12
    lw  $s2, B            6 + 5 + 24 = 35        8 + 4 = 12
    lw  $s3, C            6 + 5 + 24 = 35        8 + 4 = 12
    add $t1, $s1, $s2     6 + 5 + 5 + 5 = 21     4
    sub $t2, $t1, $s3     6 + 5 + 5 + 5 = 21     4
    sw  $t2, C            6 + 5 + 24 = 35        8 + 4 = 12
    Total                 182 bits               56 bytes

Note: to stay aligned to word addresses, each 35-bit instruction occupies ⌈35/32⌉ = 2 words = 8 bytes. (99 台聯大電機)

2. Consider the following binary data shown in the table.
case a.  1010 1101 0001 0000 0000 0000 0000 0010 (binary)
case b.  1111 1111 1111 1111 1111 1111 1111 1111 (binary)

(1) Write the MIPS code that creates the 32-bit constants listed above and stores that value to register $t1. Please use separate codes for the two cases. (2) If the current value of the PC is 0x00000600, can you use a single branch instruction to get to the PC address as shown in the table above? If the current value of the PC is 0x00400600, can you use a single branch instruction to get to the PC address as shown in the table above? Please give your answers for the two cases in the table. Answer: 66

(1) case a:  lui $t1, -21232        # -21232 = 0xAD10
             ori $t1, $t1, 2
    case b:  addi $t1, $zero, -1
(2) case a, PC = 0x00000600: No, since 0xAD100002 − 0x00000600 > 2^17 bytes.
    case a, PC = 0x00400600: No, since 0xAD100002 − 0x00400600 > 2^17 bytes.
    case b, PC = 0x00000600: No, since 0xFFFFFFFF − 0x00000600 > 2^17 bytes.
    case b, PC = 0x00400600: No, since 0xFFFFFFFF − 0x00400600 > 2^17 bytes.
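The case-a constant construction and the branch-range claims can be checked numerically (a branch offset is a signed 16-bit word offset, so a single beq/bne reaches only about 2^17 bytes from PC + 4):

```python
# case a: lui writes the upper 16 bits; 0xAD10 read as a signed 16-bit
# immediate is -21232, which is exactly what the listed code uses.
upper = 0xAD10
assert upper - (1 << 16) == -21232
value = (upper << 16) | 0x0002
print(hex(value))                        # 0xad100002

# Neither target is within 2**17 bytes of either starting PC:
for pc in (0x00000600, 0x00400600):
    print(value - (pc + 4) < 2 ** 17)    # False  False
```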

3. Given a finite word-length size, overflow occurs when a result is too large to be represented accurately; underflow occurs when a number is too small to be represented correctly. The table below shows pairs of decimal numbers.
          A     B
case a.   200   103
case b.   247   237
(1) Assume A and B are signed 8-bit decimal integers stored in two's-complement format. Calculate A + B and A − B using saturating arithmetic for the two cases. The result should be written in decimal. Show your work. Is there overflow, underflow, or neither?
(2) Assume A and B are unsigned 8-bit integers. Calculate A + B and A − B using saturating arithmetic for the two cases. The result should be written in decimal. Show your work. Is there overflow, underflow, or neither?

Answer:
                  (1) signed                       (2) unsigned
case a  A + B:    −56 + 103 = 47 (neither)         200 + 103 → 255 (overflow)
        A − B:    −56 − 103 = −128 (underflow)     200 − 103 = 97 (neither)
case b  A + B:    −9 + (−19) = −28 (neither)       247 + 237 → 255 (overflow)
        A − B:    −9 − (−19) = 10 (neither)        247 − 237 = 10 (neither)

註:浮點數才有可能 underflow,整數只會 overflow 不管正數太大或負數太大。 不過題目既然定義數值太小無法表示就是 underflow 因此第(1)小題 case a: A – B 就要填 underflow  Saturation means that when a calculation overflows, the result is set to largest positive number or most negative number, rather than a modulo calculation as in two‘s complement arithmetic.


4. An instruction set is composed of L different instruction classes, denoted by C(k), k = 1, 2, ..., L. For every instruction in class C(k), the execution time is given by (2 + k) ns. Consider a multiple-cycle implementation with a clock cycle of 1 ns, so class C(k) instructions can be executed in 2 + k clock cycles, ignoring any overhead associated with control.
(1) Assume that all instruction classes are used with the same frequency. What is the performance advantage of multiple-cycle control relative to single-cycle control? Show the speedup factor.
(2) If the memory access latency is varied from 1 ns to 2 ns, this variation surely affects the multiple-cycle implementation. Should you increase the clock cycle or should you change the actions performed within each cycle? Justify your solution.
Answer:
(1) Instruction time for the single-cycle machine = (2 + L) ns (every instruction takes as long as the slowest class).
Instruction time for the multi-cycle machine = 1 ns × (1/L) Σ_{k=1..L} (2 + k) = (2.5 + 0.5L) ns.
Speedup = (2 + L) / (2.5 + 0.5L).
(2) If we stretch the clock cycle to 2 ns, the multi-cycle instruction time becomes 2 ns × (1/L) Σ_{k=1..L} (2 + k) = (5 + L) ns.
If we instead keep the 1 ns clock and add one extra cycle per instruction for the slower memory access (instruction fetch), the instruction time becomes 1 ns × (1/L) Σ_{k=1..L} (3 + k) = (3.5 + 0.5L) ns.
Changing the actions within each cycle is better when (5 + L) − (3.5 + 0.5L) > 0, i.e., L > −3. Since L (the number of instruction classes) is always greater than 0, changing the actions within each cycle is always better than increasing the clock cycle.
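The closed form (2.5 + 0.5L) ns for the multi-cycle average, and the resulting speedup over single-cycle, can be spot-checked for a few values of L:

```python
# Multi-cycle average: (1/L) * sum_{k=1..L} (2 + k) ns = (2.5 + 0.5L) ns,
# so Speedup = (2 + L) / (2.5 + 0.5L) as derived above.
for L in (1, 4, 10):
    avg = sum(2 + k for k in range(1, L + 1)) / L
    assert avg == 2.5 + 0.5 * L          # closed form checks out exactly
    print(L, avg, round((2 + L) / avg, 3))
# 1 3.0 1.0
# 4 4.5 1.333
# 10 7.5 1.6
```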


5. Consider the following multiple-cycle implementation with control lines shown. The signals ALUOp and ALUSrcB are 2-bit control signals, while all the other control lines are 1-bit signals. Consider the following section of MIPS code executing on the multiple-cycle processor. Instructions take from three to five execution steps. Note that $16 denotes the register with index 16. The code starts with clock cycle 1. Assume that all register writes occur at the end of the clock cycle. Because we increment PC by 4 in clock cycle 1, PC is set to the value 1004 at the end of clock cycle 1. . 1000 addi $8, $16, 4 1004 bne $8, $16, L1 1008 add $9, $17, $16 L1: 1012 sw $16, 4($19) 1016 lw $16, 100($18) 1020 sw $17, 104($18)

(1) What are the values of RegDst, MemtoReg, ALUSrcA and ALUSrcB at the third step of addi $8, $16, 4? For ALUSrcB, you can show its value with a decimal number. (2) What are the values of PC at the end of clock cycles 4, 5, 6, and 7, respectively? (3) Determine the average clock cycles per instruction (CPI) of the above MIPS code. (4) An engineer proposes to remove the memory data register (MDR) from the above multiple-cycle implementation. Her idea is to connect MemData (output port of the memory) directly to the multiplexor controlled by MemtoReg. With this modification, she claims that the lw instruction will take one less step. Does her modification effectively reduce the execution

time? Justify your answer. (5) Can the above multiple-cycle processor support the following register transfer language (RTL) in a step? Justify your answer. if (A = = B) PC  PC + (sign-extend (IR[15-0]) $t3 then $t1 = 1 otherwise $t1 = 0 Please find the shortest sequence of MIPS instructions to perform the same operation. (2) The ARM processor supports the following instruction. add, r3, r2, r1, LSL #3 ;r3 = r2 + (r1 217

Yes

No


2.

For the code sequence below, assume the two bne instructions are predicted as taken, but actually they are not taken. How many cycles does it take to complete the execution of the code on each of the two CPUs below? For each CPU, please also identify which cycle(s) will become bubbles (i.e., stall cycles) when executing the code. (Note: the first instruction, lw $t0, $s1, $s2, starts from cycle 1.)
1. lw   $t0, $s1, $s2
2. add  $s0, $t0, $t1
3. lw   $s1, 0($t0)
4. bne  $s0, $s1, 10
5. add  $s2, $s0, $s1
6. add  $s3, $s1, $s0
7. bne  $s2, $s3, 10
8. add  $s3, $s2, $s3
(a) A pipelined CPU for which the forwarding mechanism has been designed and the branch outcome is determined at the MEM stage.
(b) A pipelined CPU for which the forwarding mechanism has been designed and some hardware has been added to determine the branch outcome at the ID stage.
Express your answer as the following table:
     Total number of cycles     Stall cycles
(a)                             cycle 2, ...
(b)                             cycle 2, ...

Answer
     Total number of cycles                   Stall cycles
(a)  5 − 1 + 8 + 1 + 1 + 3 + 3 = 20           c4, c7
(b)  5 − 1 + 8 + 1 + 2 + 1 + 1 + 1 = 18       c4, c7, c8, c12

3.

For RAID0, RAID1 (Mirroring), RAID3 (Bit-interleaved Parity), RAID4 (Block-Interleaved Parity), and RAID5 (Distributed Block-Interleaved Parity), please answer questions below. (a) Which one(s) cannot have ―small writes‖ to occur in parallel? (b) Which one(s) cannot have ―small reads‖ to occur in parallel? (c) Which one has the best ―small reads‖ latency? (d) Which one is least tolerant to disk failure?

Answer
(a) RAID1, RAID3
(b) RAID3
(c) RAID0
(d) RAID0


4.

Assume the system workload consists of repeated reads of 64KB blocks at random disk locations. For the following system setup, (a) what is the IOPS (I/Os per second) for the CPU, memory bus, and disk drive, respectively? (b) What (CPU, memory bus, or disk) is the bottleneck in the system?
- The user program uses 2 million cycles to process each I/O operation.
- The operating system requires 1 million cycles of overhead for each I/O operation.
- The clock rate is 3 GHz.
- The maximum sustained transfer rate of the memory bus is 640 MB/sec.
- One disk drive which rotates at 7200 RPM has an average seek time of 8 ms and a transfer rate of 64 MB/sec. The disk controller overhead is 2 ms.
Express your answer as the following table:
(a)

(b) IOPS

CPU

The bottleneck is

Memory bus
Disk

Answer
(a)            IOPS
CPU            3 GHz / 3M cycles = 1000
Memory bus     640 MB / 64 KB = 10000
Disk           1 / (8 ms + 0.5 × (1/120) s + 64 KB / (64 MB/s) + 2 ms) ≈ 66
(b) The bottleneck is the disk.

5.
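The disk figure is the reciprocal of the summed latency components; a sketch using decimal units, as the printed answer does:

```python
seek = 8e-3                      # average seek time
rotate = 0.5 / (7200 / 60)       # half a rotation at 7200 RPM = 120 rev/s
transfer = 64e3 / 64e6           # 64 KB at 64 MB/s
overhead = 2e-3                  # controller overhead

disk_iops = 1 / (seek + rotate + transfer + overhead)
cpu_iops = 3e9 / (2e6 + 1e6)     # 3 GHz over 3M cycles per I/O
bus_iops = 640e6 / 64e3          # 640 MB/s over 64 KB per I/O

print(round(cpu_iops), round(bus_iops), round(disk_iops))   # 1000 10000 66
```

The disk can sustain the fewest I/Os per second, which is why it is the bottleneck.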

A processor and two options for improving its hardware and compiler design are described as follows:  The base machine, Mbase: Mbase has a clock rate of 200 MHz and the following measures:



Instruction class

CPI

Frequency

A

1

50%

B

2

25%

C

4

25%

The machine with improved hardware, Mhw: Mhw has a clock rate of 500 MHz and the following measures: Instruction class

CPI

Frequency

A

2

50%

B

4

25%

C

6

25%

116



The combination of the improved compiler and the base machine, Mcomp: The instruction improvements from this enhanced compiler are as follows: Instruction class

Percentage of instructions executed vs. Mbase

A

80%

B

80%

C

40%

Calculate the CPI (clock cycles per instruction) for each machine and the speedups of Mhw and Mcomp with respect to Mbase. Express your answer as the following table: Machine

CPI (cycles/instruction)

Speedup

Mbase

1

Mhw Mcomp Answer Machine

CPI (cycles/instruction)

Speedup

Mbase

2

1

Mhw

3.5

1.43

Mcomp

2.57

1.11

Note: CPI(Mbase) = 0.5 × 1 + 0.25 × 2 + 0.25 × 4 = 2; CPI(Mhw) = 0.5 × 2 + 0.25 × 4 + 0.25 × 6 = 3.5. With the improved compiler the instruction count shrinks to 0.5 × 0.8 + 0.25 × 0.8 + 0.25 × 0.4 = 0.7 of the original, and CPI(Mcomp) = (0.5 × 0.8 × 1 + 0.25 × 0.8 × 2 + 0.25 × 0.4 × 4) / 0.7 = 2.57.

Speedup(Mhw) = ExeTime(Mbase) / ExeTime(Mhw) = [(IC × 2) / (200 × 10^6)] / [(IC × 3.5) / (500 × 10^6)] = 1.43
Speedup(Mcomp) = ExeTime(Mbase) / ExeTime(Mcomp) = [(IC × 2) / (200 × 10^6)] / [(0.7 × IC × 2.57) / (200 × 10^6)] = 1.11

6.
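The CPI rows for Mbase and Mhw, and the Mhw speedup, can be recomputed from the tables (a sketch; clock rates in Hz, instruction count cancels out):

```python
freq = {'A': 0.50, 'B': 0.25, 'C': 0.25}
cpi_class_base = {'A': 1, 'B': 2, 'C': 4}
cpi_class_hw = {'A': 2, 'B': 4, 'C': 6}

cpi_base = sum(freq[c] * cpi_class_base[c] for c in freq)   # 2.0
cpi_hw = sum(freq[c] * cpi_class_hw[c] for c in freq)       # 3.5

# Same instruction count; Mbase runs at 200 MHz, Mhw at 500 MHz
speedup_hw = (cpi_base / 200e6) / (cpi_hw / 500e6)
print(cpi_base, cpi_hw, round(speedup_hw, 2))               # 2.0 3.5 1.43
```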

Given the block diagrams of a 32-bit ALU, the typical 1-bit ALU, and the 1-bit ALU for the most significant bit (MSB), determine the values of control and CarryIn signals to perform each of the following operations: NOR: Result = (A OR B)‘ SUB: Result = A – B SLT: If A < B, Result = 1; else Result = 0 Express your answer as the following table: Instruction

Instruction   Ainvert   Binvert   Operation   CarryIn
NOR
SUB
SLT

[Figures: the 1-bit typical ALU and the 1-bit ALU for the MSB. Inputs a and b pass through Ainvert/Binvert multiplexers into an AND gate (Operation = 0), an OR gate (Operation = 1), and an adder with CarryIn (Operation = 2); Operation = 3 selects the Less input. The MSB cell also produces the Set output and overflow detection.]

Answer
Instruction   Ainvert   Binvert   Operation   CarryIn
NOR           1         1         00          X
SUB           0         1         10          1
SLT           0         1         11          1

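The control settings can be exercised with a small behavioral model of the 1-bit cell rippled into a 32-bit ALU. This is a sketch, not the full hardware: the carry chain is only modeled for the adder path and the SLT Set wiring is omitted. It confirms that Ainvert = Binvert = 1 with Operation = 00 yields NOR (De Morgan: a' AND b' = (a OR b)'), and that Binvert = 1 with Operation = 10 and CarryIn = 1 yields subtraction.

```python
def one_bit_alu(a, b, ainvert, binvert, op, carry_in, less=0):
    a ^= ainvert
    b ^= binvert
    if op == 0b00:                       # AND of (possibly inverted) inputs
        return a & b, 0
    if op == 0b01:                       # OR
        return a | b, 0
    s = a + b + carry_in                 # adder, used by op 10 and op 11
    result = s & 1 if op == 0b10 else less
    return result, s >> 1                # (result, carry out)

def alu32(x, y, ainvert, binvert, op, carry_in):
    carry, out = carry_in, 0
    for i in range(32):                  # ripple the carry through 32 cells
        r, carry = one_bit_alu((x >> i) & 1, (y >> i) & 1,
                               ainvert, binvert, op, carry)
        out |= r << i
    return out

x, y = 0x0000F0F0, 0x00FF00FF
# NOR: Ainvert = 1, Binvert = 1, Operation = 00
assert alu32(x, y, 1, 1, 0b00, 0) == ~(x | y) & 0xFFFFFFFF
# SUB: Binvert = 1, Operation = 10, CarryIn = 1 (two's-complement x - y)
assert alu32(x, y, 0, 1, 0b10, 1) == (x - y) & 0xFFFFFFFF
print("NOR and SUB settings check out")
```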

7.

Consider a computer with a data cache which has 16KB data and 16-Byte blocks. Assume that the main memory has 20-bit address and is byte addressable. (a) Determine the numbers of bits for Tag and Index fields in the address format if the cache is direct-mapped, 4-way set-associative, and fully associative. Express your answer as Column (a) in the following table. (b) Determine the size of a block (including valid bit, dirty bit, tag, and data) in bits if the cache is direct mapped and the size of a set (including valid bit, dirty bit, tag, and data) in bits if the cache is 4-way set-associative and fully associative. Then, calculate the total size in Kbits for each of the cache configurations. Express your answer as Column (b) in the following table. (a)

(b)

Tag

Index

Size of a Block/Set*

Total cache size

(# of bits)

(# of bits)

(# of bits)

(# of Kbits)

Direct mapped 4-way set-associative Fully associative *: Block for direct-mapped and Set for 4-way and fully associate. (c) Give a series of byte references in hexadecimal as shown in the following table. For the 4-way set-associative cache, determine the tag and index of each reference in hexadecimal and label each reference as a hit or a miss. Assume that the cache is empty initially and LRU replacement is adopted. Express your answer as the following table. Byte address (in hexadecimal)

Tag (in hexadecimal)

Index (in hexadecimal)

C64E5 C64C5 C74EA C64E8 C64CF F64E6 F74EA F84E6 C74CF C64E6 C64C9 C74E8 Answer


                        Tag (bits)   Index (bits)   Size of a Block/Set (bits)   Total cache size (Kbits)
Direct mapped           6            10             136                          136
4-way set-associative   8            8              552                          138
Fully associative       16           0              149504                       146

Byte address   Tag   Index   Cache Hit or Miss
C64E5          C6    4E      Miss
C64C5          C6    4C      Miss
C74EA          C7    4E      Miss
C64E8          C6    4E      Hit
C64CF          C6    4C      Hit
F64E6          F6    4E      Miss
F74EA          F7    4E      Miss
F84E6          F8    4E      Miss
C74CF          C7    4C      Miss
C64E6          C6    4E      Hit
C64C9          C6    4C      Hit
C74E8          C7    4E      Miss
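The Hit/Miss column can be verified with a short LRU simulation of the 4-way cache (20-bit addresses split as 4-bit block offset, 8-bit index, 8-bit tag):

```python
refs = [0xC64E5, 0xC64C5, 0xC74EA, 0xC64E8, 0xC64CF, 0xF64E6,
        0xF74EA, 0xF84E6, 0xC74CF, 0xC64E6, 0xC64C9, 0xC74E8]

sets = {}            # index -> tags ordered least- to most-recently used
results = []
for addr in refs:
    tag, index = addr >> 12, (addr >> 4) & 0xFF
    ways = sets.setdefault(index, [])
    if tag in ways:
        results.append('Hit')
        ways.remove(tag)          # refresh: move to most-recent position
    else:
        results.append('Miss')
        if len(ways) == 4:
            ways.pop(0)           # evict the least recently used way
    ways.append(tag)

print(results)  # matches the Hit/Miss column above
```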


100 年交大資聯 Multiple choices - Choose all right statements. 1. Consider the power and performance issues. (a) The power consumption of a computer is not close to zero even when its CPU loading approaches zero. (b) Intel has announced that the trend of designing CPU is towards increasing the number of CPU cores instead of the operating frequency since semiconductor technology can not shrink device size any more. (c) Millions Instructions Per Second (MIPS) is not a good metric to measure computer performance since it only counts the number of executed instructions and does not consider instruction complexity. (d) The computer with less CPI must run faster than another computer with larger CPI since the former computer can complete one instruction with less number of clock cycles. Answer: (a), (c) 註(a):Although dynamic power is the primary source of power dissipation in CMOS, static power dissipation occurs because of leakage current that flows even when a transistor is off. Multiple choices - Choose all right statements. 2. Assume we have three significant decimal digits, and the accuracy of a floating number is to the second decimal place. Round the value of 2.85  104 + 5.16  106 to the nearest decimal number with three significant decimal digits, first with guard and round digits (V_GR), and then without them (V_NoGR). (a) The value of V_GR is 5.19  106. (b) The value of V_NoGR is 5.17  106. (c) The ulp of V_GR is 0.15. (d) Sticky bit is another scheme to improve accuracy by setting it to the value of performing Boolean XOR operation on all bits to the right of the round bit. Answer: (a) 註(b):V_NoGR is 5.18  106 註(c):ulp = 2 註(d):Sticky bit is set whenever there are nonzero bits to the right of the round bit Multiple choices - Choose all right statements. 3. Consider the cache design. (a) Increasing block size can lower compulsory miss as well as miss penalty. 
(b) The reason why increasing cache size can lower capacity miss is the available block number in the cache increases. (c) Increasing block size within reasonable range can raise hit rate.

(d) If memory stores an integer array in a row-major manner, it is more efficient to traverse the array by putting the row variable in the outer loop and the column variable in the inner loop.
Answer: (b), (c), (d)

Multiple choices - Choose all right statements.
4. Floating-point format
(a) Suppose there were a 16-bit IEEE 754 floating-point format with 5 exponent bits. The likely range of numbers it could represent would be ±1.0000 0000 00(two) × 2^−14 to ±1.1111 1111 11(two) × 2^15, plus ±0, ±∞, and NaN.
(b) For the 32-bit IEEE 754 floating-point standard, the smallest positive single-precision normalized number is 1.0000 0000 0000 0000 0000 000(two) × 2^−125.
(c) For the 32-bit IEEE 754 floating-point standard, the smallest single-precision denormalized number is 0.0000 0000 0000 0000 0000 001(two) × 2^−126.
(d) The IEEE-754 floating-point standard adopts a bias scheme to increase the maximum number in the exponent part.
Answer: (a), (c)
Note (c): the answer should be −0.1111...1(two) × 2^−126 (the most negative denormalized number), but this question also appeared in 2010 and the answer announced by NCTU was 0.0000...1(two) × 2^−126, so (c) should be chosen here.

Multiple choices - Choose all right statements.
5. Which of the following statements are correct?
(a) More powerful instructions mean higher performance.
(b) Since commercial binary compatibility is very important, successful instruction sets don't change.
(c) Writing in assembly language may not obtain higher performance.
(d) The sequential word addresses in machines with byte addressing differ by one.
Answer: (c)

122

Multiple choices - Choose all right statements.
6. Assume that the variables a and b are assigned to registers $s1 and $s2, respectively, and the base addresses of arrays A and B are in registers $s3 and $s4, respectively. Given the MIPS assembly code sequence below, which of the following comments on the instructions are correct?
(a) Line 1: sll $t0, $s1, 2   # $t0 = a × 4
(b) Line 3: lw $t1, 8($t2)    # $t1 = A[a + 8]
(c) Line 5: add $t2, $t1, $s4 # $t2 = address of B[A[a + 2]]
(d) Line 6: lw $s2, 4($t2)    # b = B[A[a + 2] + 4]
1  sll $t0, $s1, 2
2  add $t2, $s3, $t0
3  lw  $t1, 8($t2)
4  sll $t1, $t1, 2
5  add $t2, $t1, $s4
6  lw  $s2, 4($t2)

Answer: (a), (c)
Note (b): the comment should be # $t1 = A[a + 2].
Note (d): the comment should be # b = B[A[a + 2] + 1].

Multiple choices - Choose all right statements.
7. Given a C program and its corresponding MIPS assembly code fragment. The MIPS code has errors. Which of the following statements are true for the correction of the MIPS code?
(a) The saving and restoring of $s0 in the prolog and epilog of function f, respectively, may be deleted.
(b) $ra should be saved and restored in the prolog and epilog of function f, respectively.
(c) After the first call to function func, $v0 should be moved to $a1 and $s0 should be moved to $a0.
(d) The index of the stack should be adjusted in the epilog of function f.
int f(int a, int b, int c) { return func(func(a, b), c); }
1  f: addi $sp, $sp, -4
2     sw   $s0, 0($sp)
3     move $s0, $a2
4     jal  func
5     move $a0, $v0
6     move $a1, $s1
7     jal  func
8     lw   $s0, 0($sp)
9     jr   $ra
Answer: (b), (d)
Note: the corrected code is as follows.
f:  addi $sp, $sp, -8
    sw   $ra, 4($sp)
    sw   $s0, 0($sp)
    move $s0, $a2
    jal  func
    move $a0, $v0
    move $a1, $s0
    jal  func
    lw   $ra, 4($sp)
    lw   $s0, 0($sp)
    addi $sp, $sp, 8
    jr   $ra

Multiple choices - Choose all right statements.
8. Which of the following techniques are associated with hardware-based only approaches to exploiting ILP?
(a) Register renaming. (b) Dynamic scheduling. (c) VLIW. (d) Branch prediction.
Answer: (b)

Multiple choices - Choose all right statements.
9. Compare the accuracies of different branch predictors. Assume that the 1-bit predictor starts off in the "predict taken" state, and the 2-bit predictor is a counter-based predictor with the state transitions shown below and starts off in State 3. If the pattern of branch outcomes is (T, T, NT, T, NT, T), which of the following statements are correct?
[Figure: 2-bit saturating-counter state diagram. States 3 and 2 predict taken; States 1 and 0 predict not taken. A taken branch moves the counter toward State 3; a not-taken branch moves it toward State 0.]

(a) The accuracy of always-taken predictor is higher than that of always-not-taken predictor. (b) The accuracy of 1-bit predictor is higher than that of always-not-taken predictor. (c) The accuracy of 2-bit predictor is higher than that of always-taken predictor. (d) The accuracy of 2-bit predictor is higher than that of 1-bit predictor.

Answer: (a), (d)

Outcome                 T    T    NT   T    NT   T
1-bit state (before)    1    1    1    0    1    0
1-bit correct?          C    C    I    I    I    I     accuracy 2/6 = 0.333
2-bit state (before)    3    3    3    2    3    2
2-bit correct?          C    C    I    C    I    C     accuracy 4/6 = 0.667
Always-taken: 4/6 = 0.667; always-not-taken: 2/6 = 0.333.
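The trace can be double-checked with a few lines simulating the predictors (a sketch; the 2-bit counter saturates at 0 and 3):

```python
outcomes = ['T', 'T', 'NT', 'T', 'NT', 'T']

def run(state, predict, update):
    correct = 0
    for o in outcomes:
        correct += predict(state) == o
        state = update(state, o)
    return correct

# 1-bit: state 1 = predict taken, 0 = predict not taken; start in state 1
one_bit = run(1, lambda s: 'T' if s else 'NT',
                 lambda s, o: 1 if o == 'T' else 0)

# 2-bit saturating counter: states 2 and 3 predict taken; start in state 3
two_bit = run(3, lambda s: 'T' if s >= 2 else 'NT',
                 lambda s, o: min(s + 1, 3) if o == 'T' else max(s - 1, 0))

always_taken = outcomes.count('T')
print(one_bit, two_bit, always_taken)    # 2 4 4  (correct out of 6)
```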

Multiple choices - Choose all right statements. 10. Which of the following statements about RAID are true? (a) RAID 0 has no redundancy and no performance improvement. (b) RAID 4 has better system performance than RAID 3 as many read operations happen simultaneously. (c) When many writing operations are performed to a RAID 4 disk, the performance bottleneck will be the parity disk because all parity updates must be sequentially done. (d) The parity disk at RAID 5 can be used to recover the lost data if more than one data disk fails. Answer: (b), (c) 11. Which of the following statements about DMA are true? (a) In terms of lowest impact on processor utilization from a single I/O device, the order is DMA, interrupt driven, and polling. (b) Before DMA transfer. DMA controller must notify processor by supplying identity of the device, operation, memory address, number of bytes to transfer, etc. (c) During DMA transfer, DMA controller is the bus master which directs each Read/Write between devices and memory. (d) On the completion of DMA transfer, interrupt mechanism is used to notify processor. Answer: (a), (c), (d)


12. The table below shows the number of instructions of two applications compiled by two compilers for a program on two different machines. Machines A and B have a clock rate of 4 GHz and 8 GHz respectively. Two machines have the same three instruction types and the required number of cycles for each instruction type is: Machine A: FP 4, Int 2 and L/S 2, and Machine B: FP 6, Int 1 and L/S 1. (must show computation to get full credit)

Instruction number   FP       Int       L/S
Ap1 by Compiler 1    4.0E+9   1.2E+10   8.0E+9
Ap2 by Compiler 1    2.0E+9   1.2E+9    4.0E+10
Ap1 by Compiler 2    4.0E+9   8.0E+9    8.0E+9
Ap2 by Compiler 2    2.0E+9   1.2E+10   2.0E+10

(a) If the workload is to run both applications once a week, please list workload time using compiler 1 on machine A.
(b) Consider two applications compiled by compiler 1. If application 1 must run four times as often as application 2 in a week, please list the workload runtime of machine A.
(c) Consider the application 2 compiled by compiler 2. Please list the average CPI of machine A.
(d) Consider the problem in (b). If we refine floating point and integer operations of Machine A to perform five and two times faster than before, what's the workload time of new Machine A?
Answer
(a) Workload time = (4×10^9 × 4 + 1.2×10^10 × 2 + 8×10^9 × 2) / (4×10^9) + (2×10^9 × 4 + 1.2×10^9 × 2 + 4×10^10 × 2) / (4×10^9) = 14 + 22.6 = 36.6 (sec)
(b) Workload time = 4 × (4×10^9 × 4 + 1.2×10^10 × 2 + 8×10^9 × 2) / (4×10^9) + (2×10^9 × 4 + 1.2×10^9 × 2 + 4×10^10 × 2) / (4×10^9) = 56 + 22.6 = 78.6 (sec)
(c) Average CPI = (2×10^9 × 4 + 1.2×10^10 × 2 + 2×10^10 × 2) / (2×10^9 + 1.2×10^10 + 2×10^10) = 2.12
(d) FP now takes 4/5 = 0.8 cycles and Int takes 2/2 = 1 cycle:
Workload time = 4 × (4×10^9 × 0.8 + 1.2×10^10 × 1 + 8×10^9 × 2) / (4×10^9) + (2×10^9 × 0.8 + 1.2×10^9 × 1 + 4×10^10 × 2) / (4×10^9) = 31.2 + 20.7 = 51.9 (sec)

13. Consider three different cache configurations below:
* Cache 1: direct-mapped with one-word blocks.
* Cache 2: direct-mapped with four-word blocks.
* Cache 3: two-way set associative with two-word blocks and LRU replacement.
Assuming that each cache has a total data size of 16 32-bit words and all of them are initially empty. 20-bit word address is used.
(a) What's the size of tag field in one block used in each cache?
(b) How many total bits are required for each cache, including tag and valid fields?
(c) Given a series of word-address references: 42, 43, 56, 57, 58, 59, 34, 35, 42, and 43. For caches 2 and 3, please label each reference in the list as a hit or a miss, and show the final cache contents (the word addresses in each block or set) after all references.
(d) Consider virtual memory system and cache. Please explain alias problem, and the advantage of physically tagged virtually indexed cache.
Answer
(a)(b)
                  Cache 1               Cache 2                Cache 3
(a) Tag size      20 - 4 = 16 bits      20 - 2 - 2 = 16 bits   20 - 2 - 1 = 17 bits
(b) Total bits    (1 + 16 + 32) × 16    (1 + 16 + 128) × 4     (1 + 17 + 64) × 2 × 4
                  = 784 bits            = 580 bits             = 656 bits

(c) Cache 2:

Word address  Block address  Tag  Index  Hit/Miss
42            10             2    2      Miss
43            10             2    2      Hit
56            14             3    2      Miss
57            14             3    2      Hit
58            14             3    2      Hit
59            14             3    2      Hit
34            8              2    0      Miss
35            8              2    0      Hit
42            10             2    2      Miss
43            10             2    2      Hit

Final contents of cache 2:

Block index  Content
0            32, 33, 34, 35
1            -
2            40, 41, 42, 43
3            -

Cache 3:

Word address  Block address  Tag  Index  Hit/Miss
42            21             5    1      Miss
43            21             5    1      Hit
56            28             7    0      Miss
57            28             7    0      Hit
58            29             7    1      Miss
59            29             7    1      Hit
34            17             4    1      Miss
35            17             4    1      Hit
42            21             5    1      Miss
43            21             5    1      Hit

Final contents of cache 3:

Set  Block 0  Block 1
0    56, 57   -
1    34, 35   42, 43
2    -        -
3    -        -

(d) The alias problem arises when the same object has two names, that is, two virtual addresses map to the same physical page. This ambiguity creates a problem because a word on such a page may be cached in two different locations, each corresponding to a different virtual address. One program could then write the data while another program would not know that the data had changed. Using a virtually indexed, physically tagged cache obtains the high performance of a virtually indexed access together with the architectural simplicity of a physically addressed cache.

14. Consider two code sequences for detecting the overflow of addition on two numbers in registers $t1 and $t2. One addition is signed addition while the other is unsigned addition.
(a) The code sequence in bottom left figure is to detect signed addition overflow. Explain how it can detect signed addition overflow.
(b) The code sequence in bottom right figure is to detect unsigned addition overflow. Explain how it can detect unsigned addition overflow.
(You do not need to describe the meaning of each instruction. The main design concept inside the codes is enough)

Signed addition:
        addu $t0, $t1, $t2
        xor  $t3, $t1, $t2
        slt  $t3, $t3, $zero
        bne  $t3, $zero, no_overflow
        xor  $t3, $t0, $t1
        slt  $t3, $t3, $zero
        bne  $t3, $zero, Overflow

Unsigned addition:
        addu $t0, $t1, $t2
        nor  $t3, $t1, $zero
        sltu $t3, $t3, $t2
        bne  $t3, $zero, Overflow

Answer
(a) Adding two numbers of opposite signs can never overflow; overflow occurs only when adding two positive numbers yields a negative result, or adding two negative numbers yields a positive result. The code first checks whether the operands have different signs (in which case no overflow is possible), and otherwise compares the sign of the sum against the sign of an operand.
(b) $t1 + $t2 > 2^32 - 1  =>  overflow
    $t2 > 2^32 - 1 - $t1  =>  overflow
    $t2 > NOT($t1)        =>  overflow   (nor $t3, $t1, $zero computes NOT($t1) = 2^32 - 1 - $t1)
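The two detection ideas can be emulated in 32-bit arithmetic. This is a sketch: Python integers masked to 32 bits stand in for the MIPS registers, and the function names are illustrative.

```python
# Emulate the signed and unsigned overflow checks from question 14.

MASK = 0xFFFFFFFF

def signed_overflow(t1, t2):
    """Left sequence: overflow iff operands share a sign and the
    sum's sign differs from that sign."""
    t0 = (t1 + t2) & MASK                  # addu: wraps around, never traps
    if (t1 ^ t2) & 0x80000000:             # different signs: no overflow
        return False
    return bool((t0 ^ t1) & 0x80000000)    # sum sign differs -> overflow

def unsigned_overflow(t1, t2):
    """Right sequence: overflow iff NOT(t1) < t2, i.e. t1 + t2 > 2**32 - 1."""
    not_t1 = (~t1) & MASK                  # nor $t3, $t1, $zero
    return not_t1 < t2                     # sltu

print(signed_overflow(0x7FFFFFFF, 1))      # True: two positives give a negative
print(signed_overflow(0x7FFFFFFF, MASK))   # False: opposite signs (max + -1)
print(unsigned_overflow(0xFFFFFFFF, 1))    # True: wraps past 2**32 - 1
print(unsigned_overflow(2, 3))             # False
```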

15. Give the datapath for a single-cycle implementation of a computer and the definition and formats of its instructions as follows:

add $rd, $rs, $rt     # $rd = $rs + $rt                              R-format
lw  $rt, addr($rs)    # $rt = Memory[$rs + sign-extended addr]       I-format
sw  $rt, addr($rs)    # Memory[$rs + sign-extended addr] = $rt       I-format
beq $rs, $rt, addr    # if ($rs = $rt) go to PC + 4 + 4 × addr       I-format

Assume that control signals ALUOp = 00 for performing addition of the ALU unit, ALUOp = 01 for subtraction, and ALUOp = 10 while depending on the funct field of the instruction.
(a) Assume that logic blocks needed to implement the datapath have the following latencies: (Delays for other components are ignored.)

I-Mem   Add     Mux    ALU     Regs    D-Mem   Sign extend   Shift-left-2
400ps   100ps   40ps   120ps   200ps   350ps   20ps          10ps

Compute the required delay time for each instruction and determine the minimum cycle time of the computer.
(b) Specify the values of the control signals for instructions add, lw, and beq. Express your answer as the following table: (Denote a don't-care control signal as "×".)

instr \ signal:  ALUSrc | ALUOp (2 bits) | MemRead | MemWrite | MemtoReg | RegDst | RegWrite | Branch

Answer
(a) The minimum cycle time of the computer = 1310 ps

       I-Mem  Regs  Mux  ALU  D-Mem  Mux  Regs  Delay Time (ps)
add    400    200   40   120  -      40   200   1000
lw     400    200   40   120  350    -    200   1310
sw     400    200   -    120  350    -    -     1070
beq    400    200   40   120  -      40   -     800

(b)
instr  ALUSrc  ALUOp (2 bits)  MemRead  MemWrite  MemtoReg  RegDst  RegWrite  Branch
add    0       10              0        0         0         1       1         0
lw     1       00              1        0         1         0       1         0
beq    0       01              0        0         ×         ×       0         1
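The delay table can be rechecked by summing component latencies along each instruction's path. This is a sketch: the exact set of blocks (in particular which muxes are counted) is taken from the answer table above.

```python
# Recompute per-instruction critical-path delays from component latencies (ps).

LAT = {"I-Mem": 400, "Add": 100, "Mux": 40, "ALU": 120,
       "Regs": 200, "D-Mem": 350, "Sign extend": 20, "Shift-left-2": 10}

# Blocks traversed per instruction, following the answer table.
paths = {
    "add": ["I-Mem", "Regs", "Mux", "ALU", "Mux", "Regs"],
    "lw":  ["I-Mem", "Regs", "Mux", "ALU", "D-Mem", "Regs"],
    "sw":  ["I-Mem", "Regs", "ALU", "D-Mem"],
    "beq": ["I-Mem", "Regs", "Mux", "ALU", "Mux"],
}

delays = {instr: sum(LAT[b] for b in path) for instr, path in paths.items()}
print(delays)                 # {'add': 1000, 'lw': 1310, 'sw': 1070, 'beq': 800}
print(max(delays.values()))   # minimum cycle time: 1310 ps
```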

16. Give a sequence of instructions and the graphical representation of the typical 5-stage pipelined datapath of the MIPS architecture. Assume that when a register is read and written in the same clock cycle, the write is in the 1st half of the clock cycle and the read is in the 2nd half. Express your answer as the table given below.

IF    ID    EX    MEM   WB
IM    Reg   ALU   DM    Reg

1 lw  $t2, 4($t0)
2 lw  $t3, 4($t1)
3 or  $t3, $t2, $t3
4 add $t4, $t2, $t3
5 lw  $t5, 8($t1)
6 sw  $t4, 16($t5)
7 sub $t6, $t4, $t3

(a) Specify the data hazards in the instruction sequence by writing the instruction pair and identifying the register that causes the hazard in the 5-stage pipelined datapath. Copy the following table in the answer sheet and write your answer in column (a) of the table.
(b) Assume that there is a full forwarding and stall mechanism in the pipeline. Determine whether each of the data hazards detected is resolvable by forwarding or not, and calculate the number of cycles required to complete the instruction sequence in the pipeline. Copy the following table in the answer sheet and write your answer in column (b) of the table.


Answer
(a)(b)

Data hazard                        Resolvable by
Instruction pair   Register       forwarding (Yes/No)
I1 ~ I3            $t2            Yes
I2 ~ I3            $t3            No
I3 ~ I4            $t3            Yes
I4 ~ I6            $t4            Yes
I5 ~ I6            $t5            No

Execution time (cycles): (5 - 1) + 7 + 2 = 13
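The execution time can be reproduced by counting load-use stalls: with full forwarding, only a load followed immediately by a dependent instruction (I2~I3 on $t3 and I5~I6 on $t5) costs one bubble each. This is a sketch that models each instruction as (is_load, destination registers, source registers).

```python
# Count cycles = (stages - 1) + instructions + load-use stalls.

def pipeline_cycles(instrs, stages=5):
    """instrs: list of (is_load, dests, sources) tuples."""
    stalls = 0
    for prev, cur in zip(instrs, instrs[1:]):
        is_load, dests, _ = prev
        if is_load and dests & cur[2]:     # load-use hazard: one bubble
            stalls += 1
    return (stages - 1) + len(instrs) + stalls

seq = [
    (True,  {"$t2"}, {"$t0"}),            # lw  $t2, 4($t0)
    (True,  {"$t3"}, {"$t1"}),            # lw  $t3, 4($t1)
    (False, {"$t3"}, {"$t2", "$t3"}),     # or  $t3, $t2, $t3
    (False, {"$t4"}, {"$t2", "$t3"}),     # add $t4, $t2, $t3
    (True,  {"$t5"}, {"$t1"}),            # lw  $t5, 8($t1)
    (False, set(),   {"$t4", "$t5"}),     # sw  $t4, 16($t5)
    (False, {"$t6"}, {"$t4", "$t3"}),     # sub $t6, $t4, $t3
]
print(pipeline_cycles(seq))               # 13
```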

99 年交大資聯
Section I: Multiple choices - please circle all correct answers
(1) Pipelining is a technique that can effectively
(a) reduce the latency of an operation (b) increase the throughput of instruction execution (c) increase the clock rate (d) reduce the CPI
Answer: (b), (c)
Note (d): CPI for a single-cycle machine (without pipeline) = 1; effective CPI for a pipeline ≥ 1.
(2) Consider a machine, whose floating-point (FP) and multimedia (MMX) instructions have two times throughput than integer (INT) instructions. Suppose we enhance the machine by making FP and MMX instructions 4 and 2 times faster, respectively, with no improvements to the INT instructions. Which of the following statements are correct?
(a) The enhancement comes from an increased clock rate.
(b) The enhancement comes from increased cache sizes.
(c) Two benchmarks both consist of only INT, FP, and MMX instructions. The mix of instructions of the two benchmarks are 1:3:6 and 2:4:4, respectively (INT:FP:MMX). The first benchmark is more ideal to show off the enhanced machine.
(d) It is impossible to speedup the first benchmark in (c) six times on the enhanced machine.
Answer: (c), (d)
Note (c): Taking an INT instruction as 1 time unit, FP and MMX originally take 1/2 each. On the enhanced machine, time for benchmark 1 = 1×1 + 3×(1/8) + 6×(1/4) = 23/8, and time for benchmark 2 = 2×1 + 4×(1/8) + 4×(1/4) = 28/8. Benchmark 1 is sped up from 5.5 to 23/8 (about 1.91×), benchmark 2 from 6 to 28/8 (about 1.71×), so benchmark 1 shows off the enhancement better.
Note (d): Maximum possible speedup = (1×1 + 3×(1/2) + 6×(1/2)) / (1×1 + 3×0 + 6×0) = 5.5 < 6.
(3) Which of the following must be saved during a context switch so that a process may resume?
(a) All general purpose registers (b) PC (Program Counter) (c) All Pipelined registers (d) The page table register
Answer: (a), (b), (d)
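The speedup numbers in the notes for question (2) can be reproduced with a few lines. This is a sketch, assuming relative time per instruction of INT = 1 and FP = MMX = 1/2 before the enhancement.

```python
# Relative execution time of a benchmark mix (INT, FP, MMX counts).

def time(mix, fp_speedup=1, mmx_speedup=1):
    i, f, m = mix
    return i * 1 + f * 0.5 / fp_speedup + m * 0.5 / mmx_speedup

for mix in [(1, 3, 6), (2, 4, 4)]:
    before, after = time(mix), time(mix, 4, 2)
    print(mix, before, after, round(before / after, 2))
# (1, 3, 6): 5.5 -> 2.875, speedup 1.91  (benchmark 1 benefits more)
# (2, 4, 4): 6.0 -> 3.5,   speedup 1.71
# Even if FP and MMX took zero time, benchmark 1 would be capped at 5.5x < 6.
print(time((1, 3, 6)) / 1.0)            # 5.5
```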


(4) In a single-cycle datapath design of the MIPS architecture, which of the following descriptions are correct?
(a) The data flow of R-type instructions does not go through the data memory.
(b) The data flow of SW goes through all components in a clock cycle.
(c) The data flow of LW goes through all components at most once in a clock cycle.
(d) The data flow of J-type instructions goes through all components.
Answer: (a), (b)
(5) The JR (Jump Register) instruction in the MIPS architecture is typically used for which of the following C constructs?
(a) A For loop (b) A Switch statement (c) A nested If-then-else statement (d) A function
Answer: (b), (d)
(6) In the typical five-stage pipelined datapath of the MIPS architecture, which of the following statements are correct?
(a) Code sequence (lw $t1, 0($t0); lw $t2, 4($t0); add $t3, $t1, $t2; sw $t3, 12($t0); lw $t4, 8($t0); add $t5, $t1, $t4; sw $t5, 16($t0)) can avoid stalls by rescheduling the code sequence.
(b) Code sequence (add $t1, $t0, $t0; addi $t2, $t1, #5; addi $t4, $t2, #5) can avoid stalls by data forwarding.
(c) Code sequence (sw $t1, addr; beq $t1, $0, target) will stall.
(d) Code sequence (lw $t0, 0($t0); add $t1, $t0, $t0) will stall.
Answer: (a), (b), (d)
Note: according to the answers published by NCTU, choice (a) assumes a forwarding unit is present.
(7) I is an integer variable. Which of the following expressions give the correct result of (I / 8)?
(a) (I + 7) >> 3
(b) (I > 0 ? I >> 3 : (I + 7) >> 3)
(c) I >> 3
(d) (I + ((I >> 31) & 7)) >> 3
Answer: (b), (d)


(8) In a pipelined processor, control hazard may be reduced by
(a) Predicated execution (b) Branch prediction (c) Moving branch decision up to an earlier pipe stage (e.g. to the ID stage) (d) Data forwarding
Answer: (a), (b), (c)
(9) Consider the IEEE 754 single precision floating-point (FP) format. Which of the following statements are correct?
(a) The FP number representation is (-1)^sign × (significand) × 2^(exponent - bias)
(b) The smallest positive normalized number is: 1.0 × 2^-125.
(c) The smallest denormalized number is: 1.0 × 2^-149.
(d) Converting an integer variable to a single precision FP number will lose precision.
Answer: (a), (c), (d)
Note (a): The FP number representation is (-1)^sign × (1.fraction) × 2^(exponent - bias), i.e., (-1)^sign × (significand) × 2^(exponent - bias).
(10) Which of the following values have identical representation no matter whether the machine is big-endian or little-endian?
(a) Integer 0 (b) IEEE single precision +0.0 (c) Integer -1 (d) IEEE single precision NaN
Answer: (a), (b), (c)
(11) A 64-bit word in memory stores 64 bits of 0's. What could this word be?
(a) A NULL pointer (b) A part of a C character string (c) Two MIPS NOP instructions (d) A double precision FP value +0.0
Answer: (a), (c), (d)
(12) Which of the following can reduce Cold (or Compulsory) misses?
(a) Larger cache line size (b) Sequential cache prefetching (c) Higher associativity (d) LRU replacement algorithm
Answer: (a), (b)

(13) Consider the performance of SPECweb99 listed in the following table. Which of the following statements are correct?
(a) Clock rate determines the web service performance.
(b) The system with 2 1GHz processors, 7 disks, and 3 network connections is likely better than the system with 2 1.4GHz processors, 2 disks, and 1 network connection for this benchmark.
(c) Throughput is a more suitable metric than response time in measuring the performance of SPECweb99.
(d) Since this is a web service, increasing the number of networks always improves the response time.

Processor  # of disk drives  # of CPU  # of networks  Clock rate (GHz)  Result
P3         3                 2         1              1.4               1810
P3         8                 2         4              1.13              3435
P3 xeon    5                 4         4              0.7               4200
P4 xeon    10                2         4              2.2               4615

Answer: (b), (c)
(14) Which of the following statements are correct?
(a) To improve the availability of an I/O device, we have to increase its Mean Time to Failure (MTTF) and/or decrease its Mean Time to Repair (MTTR).
(b) For a hard disk where seek time dominates the average read time, placing the data of a block on the same cylinder can improve the average read time.
(c) DMA is a better way for I/O devices to communicate with the processor than either polling or interrupt-driven I/O, because DMA works more efficiently with data caches.
(d) RAID technology gives you disk subsystems with higher performance and greater dependability.
Answer: (a), (b), (d)
Note (a): Availability = MTTF / (MTTF + MTTR). So, increasing MTTF and/or decreasing MTTR will improve availability.
Note (c): DMA (direct memory access) is for memory access, not for cache access.
(15) Which of the following statements about caches are true?
(a) TLB is a cache for the page table
(b) SSD may be used as a cache for disk files
(c) Disks can be used as a cache for web pages across the internet
(d) eDRAM can be used for on-chip caches
Answer: (a), (b), (c), (d)
Note (b): A solid-state drive (SSD) is a data storage device that uses solid-state memory

to store persistent data. An SSD emulates a hard disk drive interface, thus easily replacing it in most applications. Recently, NAND based flash memory has become the standard for most SSDs.
Note (d): eDRAM stands for "embedded DRAM", a capacitor-based dynamic random access memory usually integrated on the same die or in the same package as the main ASIC or processor, as opposed to external DRAM modules and transistor-based SRAM typically used for caches.
(16) Which of the following on-chip cache hierarchies are often implemented in modern processors?
(a) Split L1 data and instruction caches (b) Combined L2 caches (c) Split L2 data and instruction caches (d) A write-through L2 cache
Answer: (a), (b)
(17) Which of the following implementation techniques may exploit more ILP?
(a) Superscalar implementations (b) Control speculations (c) Deep pipelining (d) Out-of-order instruction issuing
Answer: (a), (b), (c)
Note (d): the technique is out-of-order instruction execution.
Section II:
1. The following acronyms are commonly used in the computer architecture literature. Please briefly explain their meanings: AMAT, VLIW, RISC, SIMD, GPGPU.
Answer:
AMAT (average memory access time) is the average time to access memory considering both hits and misses and the frequency of different accesses.
VLIW (very long instruction word): A style of instruction set architecture that launches many operations that are defined to be independent in a single wide instruction, typically with many separate opcode fields.
RISC (reduced instruction set computer) is a computer which is based on the design strategy that simplified instructions can provide higher performance if this simplicity enables much faster execution of each instruction.
SIMD (Single Instruction stream, Multiple Data streams): A multiprocessor. The same instruction is applied to many data streams, as in a vector processor or array processor.
GPGPU (General-purpose GPU): Using a GPU for general-purpose computation via a traditional graphics API and graphics pipeline.
2. What are the four design principles advocated by Patterson and Hennessy in their textbooks? For each principle, give one example from the MIPS architecture that follows that principle.
Answer:

Design principle                        Example
Simplicity favors regularity            Keeping all instructions a single size
Smaller is faster                       MIPS has 32 registers rather than many more
Make the common case fast               PC-relative addressing for conditional branches
Good design demands good compromises    The compromise between providing for larger addresses and constants in instructions and keeping all instructions the same length

3. Consider a computer with the following data cache and physical memory:
 1GB of physical memory.
 A physically indexed data cache with 64KB data and 256-Byte blocks (lines).
 A 32-bit virtual address space and the page size is 64KB.
 The system uses a single level page table.
Assume TLB and data cache are initially empty. The partial contents of the page table are shown below: (VPN: virtual page number, PPN: physical page number)

VPN     Valid  PPN      |  VPN     Valid  PPN
0x0111  0      -        |  0x8a01  0      -
0x103a  1      0x2d45   |  0x9f22  1      0x324d
0x22ea  1      0x2ae8   |  0xa34d  1      0x378e
0x3d2d  0      -        |  0xbb3a  0      -
0x4eae  1      0x350a   |  0xcdf5  1      0x3d1a
0x54f8  1      0x2292   |  0xcf4a  1      0x0933
0x6044  1      0x20f3   |  0xe301  1      0x2e62
0x7e1a  0      -        |  0xef27  0      -

(a) The CPU includes a fully associative TLB with 2 entries and the LRU replacement policy. The CPU data cache is direct-mapped. Data reads from the following virtual addresses are performed (in the order listed): 0x103a e221, 0x4eae 58a3, 0x103a e251, 0x4eae 5a09, 0xcf4a 5add, 0x4eae 5aaa, 0x103a e292. Please complete the Table given below:

Virtual Address  Physical Address  Data Cache Hit/Miss  TLB Hit/Miss
0x103a e221
0x4eae 58a3
0x103a e251
0x4eae 5a09
0xcf4a 5add
0x4eae 5aaa
0x103a e292

(b) What are the 32-bit address formats to access the cache and the cache sizes (in terms of Kbits, including valid, dirty, tag, and data bits) for various caches? Please fill in the following table with your answers.

                   Address format                                            Cache size
Cache Type         Tag (bits)  Index (bits)  Block offset  Byte offset      No. of Sets  Total size (Kbits)
Direct-mapped                                6             2
4-way associative                            6             2
Fully associative                            6             2

Answer:
(a)
Virtual Address  Physical Address  Data Cache Hit/Miss  TLB Hit/Miss
0x103a e221      0x2d45 e221       Miss                 Miss
0x4eae 58a3      0x350a 58a3       Miss                 Miss
0x103a e251      0x2d45 e251       Hit                  Hit
0x4eae 5a09      0x350a 5a09       Miss                 Hit
0xcf4a 5add      0x0933 5add       Miss                 Miss
0x4eae 5aaa      0x350a 5aaa       Miss                 Hit
0x103a e292      0x2d45 e292       Hit                  Miss

(b)
Cache Type         Tag (bits)  Index (bits)  Block offset  Byte offset  No. of Sets  Total size (Kbits)
Direct-mapped      16          8             6             2            256          516.5
4-way associative  18          6             6             2            64           517
Fully associative  24          0             6             2            1            518.5
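The address-translation and TLB columns of part (a) can be replayed directly: 64KB pages give a 16-bit page offset, so VPN = VA >> 16, and the TLB is fully associative with two entries and LRU replacement. This is a sketch; the page-table dictionary only holds the three mappings the trace touches.

```python
# Replay the reference stream: VA -> PA translation with a 2-entry LRU TLB.

PAGE_TABLE = {0x103a: 0x2d45, 0x4eae: 0x350a, 0xcf4a: 0x0933}

tlb = []                                  # most recently used at the end
for va in [0x103ae221, 0x4eae58a3, 0x103ae251, 0x4eae5a09,
           0xcf4a5add, 0x4eae5aaa, 0x103ae292]:
    vpn, offset = va >> 16, va & 0xFFFF
    hit = vpn in tlb
    if hit:
        tlb.remove(vpn)                   # refresh LRU position
    elif len(tlb) == 2:
        tlb.pop(0)                        # evict least recently used entry
    tlb.append(vpn)
    pa = (PAGE_TABLE[vpn] << 16) | offset
    print(f"VA 0x{va:08x} -> PA 0x{pa:08x}  TLB {'hit' if hit else 'miss'}")
```

The printed hit/miss pattern (miss, miss, hit, hit, miss, hit, miss) matches the TLB column of the answer table.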

98 年交大資聯
1. Assume that $s1 and $s2 are used for the input and both initially contain the integers, a and b, respectively. Assume that $v0 is used for the output.

80004000h           add  $t0, $zero, $zero
          loop:     beq  $s2, $zero, finish
                    add  $t0, $t0, $s1
                    sub  $s2, $s2, 1
                    j    loop
          finish:   addi $t0, $t0, 100
                    add  $v0, $t0, $zero

(1) Please describe in one mathematic formula what it computes (in terms of a, b).
(2) If we assume the above code starting at location 8000 4000h in memory, and the register numbers of $s1 and $zero are 17 and 0, respectively. What are the MIPS machine codes for the beq (OP code is 4) and j (OP code is 2) instructions above? Please answer both in hexadecimal format.
Answer:
(1) The procedure computes a × b + 100

(2)
Instr.  Binary format                               Hexadecimal format
beq     000100 10010 00000 0000000000000011         12400003h
j       000010 00000000000001000000000001           08001001h
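The two encodings can be checked by assembling the fields by hand. This is a sketch (the helper functions are illustrative): beq's rs is $s2 = register 18, its offset counts instructions from the one after the branch (finish is 3 ahead), and j's 26-bit field is the word address of loop (0x80004004 >> 2) truncated to 26 bits.

```python
# Assemble the beq and j instructions field by field.

def encode_beq(rs, rt, offset, opcode=4):
    """I-format: opcode | rs | rt | 16-bit signed offset (in words)."""
    return (opcode << 26) | (rs << 21) | (rt << 16) | (offset & 0xFFFF)

def encode_j(target_addr, opcode=2):
    """J-format: opcode | low 26 bits of the word address of the target."""
    return (opcode << 26) | ((target_addr >> 2) & 0x3FFFFFF)

print(hex(encode_beq(18, 0, 3)))        # 0x12400003
print(hex(encode_j(0x80004004)))        # 0x8001001
```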

2. Suppose there was a 16-bit IEEE 754 floating-point format with 5 exponent bits.
(1) What would be the bias in the exponent?
(2) Show the binary representation of the number 20.5 (decimal) using this 16-bit IEEE 754 floating-point format.
(3) Which one would be the range of the numbers this format could represent?
(a) 1.0000 0000 00 × 2^0 to 1.1111 1111 11 × 2^31, 0
(b) ±1.0000 0000 0 × 2^-14 to ±1.1111 1111 1 × 2^15, ±0, ±∞, NaN
(c) ±1.0000 0000 00 × 2^-14 to ±1.1111 1111 1 × 2^15, ±0, ±∞, NaN
(d) ±1.0000 0000 00 × 2^-15 to ±1.1111 1111 1 × 2^14, ±0, ±∞, NaN
Answer:
(1) bias = 2^(5-1) - 1 = 15
(2) 20.5 (decimal) = 10100.1 (binary) = 1.01001 × 2^4

S  E      F
0  10011  0100100000

(3) None

3. Given a series of cache references as word-addresses: 16, 17, 18, 19, 20, 21, 22, 23, 48, 49, 17, 48, 49, 17, 5, 6, and 7. How many of the references will cause a compulsory miss and how many of them will cause a conflict miss? Please give your answer for each cache separately.
(1) A direct-mapped cache with 4-word blocks and a total size of 16 words.
(2) A 2-way set associative cache with 2-word blocks and a total size of 16 words (using LRU replacement).

Answer:
(1)
Word address  Block address  Tag  Index  Hit/Miss  3C
16            4              1    0      Miss      compulsory
17            4              1    0      Hit
18            4              1    0      Hit
19            4              1    0      Hit
20            5              1    1      Miss      compulsory
21            5              1    1      Hit
22            5              1    1      Hit
23            5              1    1      Hit
48            12             3    0      Miss      compulsory
49            12             3    0      Hit
17            4              1    0      Miss      conflict
48            12             3    0      Miss      conflict
49            12             3    0      Hit
17            4              1    0      Miss      conflict
5             1              0    1      Miss      compulsory
6             1              0    1      Hit
7             1              0    1      Hit
Compulsory: 4    Conflict: 3

(2)
Word address  Block address  Tag  Index  Hit/Miss  3C
16            8              2    0      Miss      compulsory
17            8              2    0      Hit
18            9              2    1      Miss      compulsory
19            9              2    1      Hit
20            10             2    2      Miss      compulsory
21            10             2    2      Hit
22            11             2    3      Miss      compulsory
23            11             2    3      Hit
48            24             6    0      Miss      compulsory
49            24             6    0      Hit
17            8              2    0      Hit
48            24             6    0      Hit
49            24             6    0      Hit
17            8              2    0      Hit
5             2              0    2      Miss      compulsory
6             3              0    3      Miss      compulsory
7             3              0    3      Hit
Compulsory: 7    Conflict: 0
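Both trace tables can be reproduced with a small simulator. This is a sketch: a miss on a never-before-seen block is counted as compulsory, and any other miss in this exercise is counted as a conflict miss (the trace never fills the whole cache, so no miss here is a capacity miss).

```python
# Classify hits, compulsory misses and conflict misses for a set-associative
# cache with LRU replacement (ways=1 gives a direct-mapped cache).

def classify(addrs, block_words, sets, ways):
    cache = [[] for _ in range(sets)]          # per-set LRU lists
    seen, counts = set(), {"compulsory": 0, "conflict": 0, "hit": 0}
    for a in addrs:
        block = a // block_words
        s = cache[block % sets]
        if block in s:
            s.remove(block); s.append(block)   # refresh LRU order
            counts["hit"] += 1
            continue
        counts["compulsory" if block not in seen else "conflict"] += 1
        seen.add(block)
        if len(s) == ways:
            s.pop(0)                           # evict LRU block
        s.append(block)
    return counts

trace = [16, 17, 18, 19, 20, 21, 22, 23, 48, 49, 17, 48, 49, 17, 5, 6, 7]
print(classify(trace, 4, 4, 1))   # direct-mapped: 4 compulsory, 3 conflict
print(classify(trace, 2, 4, 2))   # 2-way LRU:     7 compulsory, 0 conflict
```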

4. Consider a virtual memory system with the following properties:
 12-bit virtual byte address
 256-byte pages
 65536 bytes of physical memory
The system uses a single level page table. The contents of the page table are partially shown below (where VPN: Virtual page number and PPN: Physical page number).

VPN  Valid  PPN    |  VPN  Valid  PPN
0    0      -      |  6    1      0x6
1    1      0xfd   |  7    0      -
2    1      0x48   |  8    0      -
3    0      -      |  9    1      0xfe
4    1      0x55   |  A    1      0xf2
5    1      0x32   |  B    0      -

(1) What is the total size of the page table for each process on this machine, assuming that the valid, protection, dirty, and use bits take a total of 4 bits and that all the virtual pages are in use?
(2) Please convert the following virtual addresses into physical addresses: 0xae2, 0x258.
Answer:
(1) There are 2^12/256 = 16 entries in the page table. The length of the physical page number = log2(65536) - 8 = 8 bits. Each entry therefore needs 4 + 8 = 12 bits, rounded up to one 32-bit word, so the size of the page table = 16 × 1 word = 16 words = 64 bytes.
(2)
Virtual address  Physical address
0xae2            0xf2e2
0x258            0x4858

5. Virtually addressed cache is accessed with a virtual address, while physically addressed cache is accessed with a physical address.
(1) Which kind of cache takes the TLB out of normal cache access path, reducing cache latency?
(2) Which kind of cache has aliasing problem?
(3) Assume physically addressed cache is used, is it possible to have TLB miss, but page table and cache hit for a data access?
(4) Assume virtually addressed cache is used, is it possible to have TLB miss, but page table and cache hit for a data access? Please explain your answer.
Answer:
(1) Virtually addressed cache
(2) Virtually addressed cache
(3) Yes. TLB miss, but the entry is found in the page table; after retry, the data is found in the cache.
(4) Yes (when the data is in both the cache and memory and the mapping is in the page table but not in the TLB).
6. About RAID levels 1, 3, 4, and 5, please answer each of following statements as True or False.
(1) RAID systems rely on redundancy to achieve high availability and throughput.
(2) RAID 1 (mirroring) has the highest redundancy overhead.
(3) For small writes, RAID 3 has the worst throughput.
(4) For large writes, RAID 3, 4, and 5 have the same throughput.
Answer:
(1) False (RAID systems rely on data striping to achieve high throughput)
(2) True
(3) True
(4) True

7. In Patterson and Hennessy's book there is a figure of processor design for multi-cycle implementation. You can draw the figure for your convenience when you answer questions. In the implementation there are many function blocks such as (1) memory, (2) MUX after instruction register, (3) memory data register, (4) register file, (5) MUX after memory data register, (6) A register, (7) ALU control, (8) MUX after PC, (9) sign extend, (10) MUX after B register, (11) B register, (12) shifter for ALU, (13) shifter for Jump, (14) MUX after A register, (15) ALU, (16) instruction register, (17) ALU out, (18) PC, and (19) MUX after ALU. In the implementation there are many control lines such as (20) IorD, (21) ALUSrcA, (22) 2-bit ALUOp, (23) PCWriteCond, (24) 2-bit ALUSrcB, (25) MemRead, (26) IRWrite, (27) PCWrite, (28) 2-bit PCSource, (29) MemWrite, (30) MemtoReg, (31) RegDst, and (32) RegWrite. There are many actions in the implementation such as (33) A ← Reg[IR[25:21]], (34) ALUOut ← A + sign-extend(IR[15:0]), (35) Memory[ALUOut] ← B, (36) ALUOut ← PC + (sign-extend(IR[15:0]) << 2), (37) B ← Reg[IR[20:16]], (38) if (A == B) PC ← ALUOut, (39) IR ← Memory[PC], (40) MDR ← Memory[ALUOut], (41) Reg[IR[20:16]] ← ALUOut, (42) ALUOut ← A op B, (43) PC ← PC + 4, (44) Reg[IR[15:11]] ← ALUOut, (45) PC ← {PC[31:28], (IR[25:0], 2'b00)}, and (46) Reg[IR[20:16]] ← MDR. In following questions your answers should be in the form like a=11 or b=21A.
(1) For a load instruction, it will go through (18) and other function blocks in sequence. Please reference above numbers and use ONLY numbers to complete unknowns of the sequence: 18→8→1→16→(a)→(b)→(c)→14→(d)→15→17→(e)→(f)→(g)→(h)→4→(i)→(j)
(2) For R-type instructions, they will go through (39) and other actions in sequence. Please reference above numbers and use ONLY numbers to complete unknowns of the sequence: 39→(a)→(33 & 37 in parallel)→(b)→(c).
(3) Assume a branch-on-equal instruction is taken. It will use (23D) control line and other control lines in sequence. The symbol 'D' means the line is deasserted, and if a line is asserted the 'A' also must be indicated. You do not need to append 'D' or 'A' for 2-bit control lines. Please reference above numbers and use ONLY numbers and possible 'D' or 'A' to complete unknowns of the sequences: 23D→(a)→(b)→(21D & 22 in parallel)→20→28→(c)→((d) & 22 in parallel)→20→((e) & 22 in parallel)→(f)→28→(g).
Answer: (1) (a) 2

(b) 4  (c) 6  (d) 10  (e) 8  (f) 1  (g) 3  (h) 5  (i) 9  (j) 7
(2) (a) 43  (b) 42  (c) 44
(3) (a) 27A  (b) 24  (c) 27A  (d) 21D  (e) 21A  (f) 24  (g) 23A

(The original solution reproduces the MIPS multi-cycle datapath figure here for reference only; it is not part of the answer.)

8. Forwarding unit and data hazard detection unit are important parts of pipeline design. In a pipeline we use IF/ID.RegisterRs, IF/ID.RegisterRt, IF/ID.RegisterRd, ID/EX.RegisterRs, ID/EX.RegisterRt, ID/EX.RegisterRd, EX/MEM.RegisterRs, EX/MEM.RegisterRt, EX/MEM.RegisterRd, MEM/WB.RegisterRs, MEM/WB.RegisterRt, and MEM/WB.RegisterRd to specify different source registers and destination registers in different stages. Like ID/EX.MemRead, some control lines are also prefixed by IF/ID, ID/EX, EX/MEM, or MEM/WB, though others are not prefixed. Let IF/IDWrite, ID/EXWrite, EX/MEMWrite, and MEM/WBWrite determine whether pipeline registers get inputs or not.
(1) Draw the design of forwarding unit by using part of above signals and two groups of MUX selection lines (ForwardA and ForwardB).
(2) Draw the design of data hazard detection unit by using part of above signals.
Answer
(1)

[Figure: the forwarding unit compares ID/EX.RegisterRs and ID/EX.RegisterRt against EX/MEM.RegisterRd (qualified by EX/MEM.RegisterWrite) and against MEM/WB.RegisterRd (qualified by MEM/WB.RegisterWrite); the comparator outputs drive the two groups of MUX selection lines ForwardA (FA1, FA0) and ForwardB (FB1, FB0).]
(2)
[Figure: the hazard detection unit compares ID/EX.RegisterRt against IF/ID.RegisterRs and IF/ID.RegisterRt; when ID/EX.MemRead is asserted and either comparison matches, it deasserts PCWrite and IF/IDWrite and asserts the control-flush signal.]
Note: The intent of this problem is not entirely clear; it may also be asking for the internal logic design of the forwarding unit and the hazard detection unit.

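The comparisons these units perform can be written out as predicates. This is a sketch following the standard Patterson and Hennessy formulation; the function names and dictionary keys are illustrative, and only ForwardA is shown (ForwardB is symmetric with ID/EX.RegisterRt).

```python
# Forwarding and load-use hazard conditions as executable predicates.

def forward_a(id_ex_rs, ex_mem, mem_wb):
    """ForwardA mux selection for the ALU's upper input:
    0b10 = forward from EX/MEM, 0b01 = from MEM/WB, 0b00 = register file."""
    if ex_mem["RegWrite"] and ex_mem["Rd"] != 0 and ex_mem["Rd"] == id_ex_rs:
        return 0b10                       # EX hazard takes priority
    if (mem_wb["RegWrite"] and mem_wb["Rd"] != 0
            and mem_wb["Rd"] == id_ex_rs):
        return 0b01                       # MEM hazard
    return 0b00

def load_use_stall(id_ex_memread, id_ex_rt, if_id_rs, if_id_rt):
    """Hazard detection: stall (freeze PC and IF/ID, insert a bubble) when
    the instruction in EX is a load whose target is used by the next one."""
    return id_ex_memread and id_ex_rt in (if_id_rs, if_id_rt)

# add $1,... immediately followed by a reader of $1: forward from EX/MEM.
print(forward_a(1, {"RegWrite": True, "Rd": 1},
                   {"RegWrite": True, "Rd": 2}))    # 2 (0b10)
# lw $2,... immediately followed by a reader of $2: must stall.
print(load_use_stall(True, 2, 2, 3))                # True
```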

97 年交大資聯 1.

Select one right choice. Each right choice gets 1%. The penalty is 0.3% for each wrong choice.
(1) In a data center, what is the appropriate definition of good performance? It (a) gets a program done first (b) completes the most jobs during a certain period (c) has the fastest response time.
Answer: b
(2) SPEC is an organization about (a) benchmark (b) embedded computers (c) IC technology.
Answer: a
(3) A program runs in 10 seconds on computer X, which has a 5 GHz clock. You are trying by increasing the clock rate to build a computer Y that will run the program in 6 seconds. However, the increase will cause the computer Y to require 1.2 times as many clock cycles as computer X. What clock rate should you design? (a) 10 GHz (b) 9 GHz (c) 8 GHz (d) 7 GHz (e) 6 GHz.
Answer: a
Note: clock rate = (10 × 5G × 1.2)/6 = 10 GHz
(4) Generally, the register file has (a) two read ports and one write port (b) one read port and two write ports (c) two read ports and two write ports (d) one read port and one write port.
Answer: a
(5) The action MDR ← Memory[ALUOut] is for (a) load (b) store (c) jump (d) branch.
Answer: a
(6) The action PC ← {PC[31:28], (IR[25:0], 2'b00)} is for (a) load (b) store (c) jump (d) branch.
Answer: c
(7) Which action is for R-type instruction? (a) ALUOut ← A op B (b) ALUOut ← A op PC (c) ALUOut ← PC op B (d) ALUOut ← A op MDR.
Answer: a

(8) Two actions A ← Reg[IR[25:21]] and B ← Reg[IR[20:16]] should execute (a) before IR ← Memory[PC] (b) exclusively (c) sequentially (d) in the same step.
Answer: d
(9) Assume the instruction mix: 25% load, 10% store, 11% branches, 2% jump, and 52% ALU. Then the multicycle CPI is (a) > 5 (b) = 5 (c) < 5 & > 4.5 (d) = 4.5 (e) < 4.5.
Answer: e
Note: CPI = 0.25 × 5 + 0.1 × 4 + 0.11 × 3 + 0.02 × 3 + 0.52 × 4 = 4.12
(10) Talking about CPI, what is correct? (a) multicycle is better than pipeline (b) pipeline is better than multicycle (c) these two are similar (d) it is not easy to justify.
Answer: b
(11) What is the purpose of finite state machine in the multicycle implementation? (a) to speedup the execution (b) to improve the CPI (c) to secure the control (d) to solve the problem of hazards.
Answer: c
(12) Value of control signals is NOT dependent on (a) executed instruction (b) performed step (c) size of register file.
Answer: c
(13) In the finite state machine assume that S0: instruction fetch, S1: instruction decode, S2: memory address computation, S3&S4 (in sequence): for load instruction, S5: for store instruction, S6&S7 (in sequence): for R-type instruction, S8: for branch instruction, and S9: for jump instruction. Which one includes RegWrite? (a) S9 (b) S8 (c) S7 (d) S6 (e) S5.
Answer: c
(14) Same assumption as the above. Which one includes MemWrite? (a) S9 (b) S8 (c) S7 (d) S6 (e) S5.
Answer: e
(15) Same assumption as the above. What is NOT the possible number of bits to encode those states? (a) 3 (b) 5 (c) 7 (d) 9.
Answer: a

(16) Same assumption as the above. The abbreviation of Next State is NS. Which one is NOT correct? (a) NS(S0) = S1 (b) NS(S2 * (Op[5-0] = LW)) = S3 (c) NS(S4 * (Op[5-0] = SW)) = S5 (d) NS(S6) = S7 (e) NS(S1 * (Op[5-0] = Jump)) = S9.
Answer: c
Note: NS(S4) should be S0.
(17) Assume that 7 opcode bits, 5 state bits, and 20 datapath control signals. The size of a ROM implementation for the control unit is (a) 2^7 × 20 (b) 2^12 × 20 (c) 2^7 × 25 (d) 2^12 × 25 bits.
Answer: d
Note: 2^(7 + 5) × (20 + 5) = 2^12 × 25
(18) Same assumption as the above. By separating the control unit into two parts, one is for control signals and the other is for next state, the ROM size would be (a) 20.6 (b) 30.6 (c) 40.6 (d) 50.6 Kbits.
Answer: a
Note: 2^5 × 20 + 2^12 × 5 ≈ 20.6 Kbits
(19) What is NOT correct about PLA (Programmable Logic Array)? (a) It can be used for microprogramming, (b) It is a programmable device used to implement combinational logic circuits (c) It allows for a large number of logic functions to be synthesized in the sum of products canonical forms, (d) One application of a PLA is to implement the control over a datapath.
Answer: c
(20) What is NOT correct about microprogrammed control? (a) A control unit with its binary control values stored as words in memory is called microprogrammed control, (b) Each word in the control memory contains a microinstruction that specifies one or more microoperations for the system, (c) A sequence of microinstructions constitutes a microprogram, (d) The microprogram is usually stored in RAM.
Answer: d
Note: Microinstructions are usually placed in a ROM or PLA.
(21) What is NOT a basic concept of pipeline? (a) partition instruction execution into balanced stages (b) overlap execution of consecutive instructions (c) share hardware.
Answer: c
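The ROM sizing in the notes for (17) and (18) can be recomputed directly: the ROM is addressed by opcode plus state bits, and splitting the control unit lets the control-signal ROM depend on the state alone while the next-state ROM still needs both.

```python
# ROM sizes for the multicycle control unit (all sizes in bits).

opcode_bits, state_bits = 7, 5
control_signals, next_state_bits = 20, 5

# (17) Single ROM: every (opcode, state) address stores 20 + 5 output bits.
single = 2 ** (opcode_bits + state_bits) * (control_signals + next_state_bits)
print(single)                            # 4096 x 25 = 102400 bits

# (18) Split design: control signals depend only on the current state.
split = (2 ** state_bits * control_signals
         + 2 ** (opcode_bits + state_bits) * next_state_bits)
print(split, round(split / 1024, 1))     # 21120 bits, about 20.6 Kbits
```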

(22) Pipelining enhances performance by (a) shortening instruction execution time (b) increasing instruction throughput (c) increasing the CPI. Answer: b
(23) What makes pipelining hard? (a) few registers (b) no branch instructions (c) one cache for instructions and one cache for data (d) no shortage of hardware resources (e) an instruction does not depend on a previous instruction. Answer: none
(24) A hardware resource conflict is a (a) structural hazard (b) branch hazard (c) data hazard (d) control hazard. Answer: a
(25) A data hazard can be solved by the compiler through (a) data forwarding (b) dynamic scheduling (c) reordering the instruction sequence. Answer: c
(26) Which one, if there is no forwarding, will cause a pipeline stall? (a) /add $s0, $t0, $t1 / sub $s0, $t1, $t0/ (b) /add $t0, $t0, $t1 / sub $s0, $t1, $t0/ (c) /add $s1, $s0, $t1 / sub $s0, $t1, $t0/ (d) /add $s0, $t0, $t0 / sub $t0, $t1, $t1/. Answer: b
(27) The reason the instruction sequence /lw $s1, 100($0) / lw $s1, 200($0) / lw $s1, 300($0)/ causes a pipeline stall is (a) hardware resource conflict (b) control hazard (c) data dependence (d) data resource conflict (e) shortage of load units (f) there is no reason for the pipeline to stall. Answer: f
(28) Which stall can NOT be solved by data forwarding? (a) /sub $t3, $t0, $t1 / add $s3, $t1, $t3/ (b) /add $s3, $t1, $t3 / or $s0, $s3, $s3/ (c) /lw $s1, 100($0) / sub $t3, $t0, $s1/. Answer: c


(29) What is NOT correct about pipeline registers? (a) same size (b) they separate the pipeline stages (c) there is none at the end of the write-back stage (d) different contents. Answer: a
(30) A code sequence is /add $4, $1, $3 / or $12, $5, $4 / and $13, $4, $6/. The data hazard of /or/ can be checked by (a) EX/MEM.RegisterRd = ID/EX.RegisterRs (b) EX/MEM.RegisterRd = ID/EX.RegisterRt (c) MEM/WB.RegisterRd = ID/EX.RegisterRs (d) MEM/WB.RegisterRd = ID/EX.RegisterRt. Answer: b
(31) The same code sequence as the above. The data hazard of /and/ can be checked by (a) EX/MEM.RegisterRd = ID/EX.RegisterRs (b) EX/MEM.RegisterRd = ID/EX.RegisterRt (c) MEM/WB.RegisterRd = ID/EX.RegisterRs (d) MEM/WB.RegisterRd = ID/EX.RegisterRt. Answer: c
(32) Assume the PC of /beq $1, $2, 7/ is 10. What is the new PC if the branch is taken? (a) 7 (b) 18 (c) 38 (d) 42 (e) 17. Answer: d
Note: new PC = (10 + 4) + 7 × 4 = 42
(33) What is NOT correct about dynamic branch prediction? (a) The branch prediction buffer is in the CPU. (b) The branch prediction buffer is indexed by the lower portion of the address of the branch instruction. (c) The compiler plays the major role. Answer: c
(34) In the 2-bit branch prediction scheme, what is the result after the sequence of branch executions: taken, taken, not-taken, taken, and not-taken? (a) in the predict-taken state (b) in the predict-not-taken state (c) the information is not sufficient. Answer: c
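Two of these can be checked mechanically: the branch-target arithmetic in (32), and the 2-bit saturating counter of (34), where running the same outcome sequence from two different initial states ends in different states — which is why (c) is the answer. A sketch (the counter encoding 0–1 = predict not-taken, 2–3 = predict taken is an assumption of ours):

```python
# (32): MIPS branch target = (PC of the branch + 4) + (offset << 2).
pc, offset = 10, 7
target = (pc + 4) + (offset << 2)
print(target)  # 42

# (34): 2-bit saturating counter; states 0,1 predict not-taken, 2,3 taken.
def run(state, outcomes):
    for taken in outcomes:
        state = min(state + 1, 3) if taken else max(state - 1, 0)
    return state

seq = [True, True, False, True, False]
print(run(0, seq), run(3, seq))  # 1 2 -- different predictions, hence (c)
```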


(35) A branch delay slot is an instruction slot that gets executed without the effects of a preceding branch instruction. One approach to scheduling an instruction into the slot is 'from before.' Changing /sub $1, $2, $3/.../if $4=0 then goto label-below/ to /if $4=0 then goto label-below / sub $1, $2, $3/ is an example. What is NOT correct about this approach? (a) It is not the best choice. (b) The /sub/ is in the slot after the change. (c) The compiler plays the major role. (d) It is not a dynamic branch prediction. Answer: a
(36) We cannot change /add $3, $4, $5/.../if $3≠1 then goto label-below/ to /if $3≠1 then goto label-below / add $3, $4, $5/. What is the reason? (a) Not-equal checking is more complicated. (b) The /add/ instruction is more complicated. (c) data dependence (d) control hazard. Answer: c
(37) What is the structure that caches the destination PC or destination instruction for a branch? (a) branch target buffer (b) correlating predictor (c) tournament predictor. Answer: a
(38) What is correct about multiple instruction issue? (a) it is similar to multicycle (b) it is not related to instruction-level parallelism (c) it is designed for the multiply instruction (d) the CPI can be less than one. Answer: d
(39) The purpose of register renaming is to (a) remove antidependences (b) increase readability (c) construct name dependences (d) increase the usage of registers. Answer: a
(40) What is correct about loop unrolling? (a) not a compiler technique (b) a technique for instruction-level parallelism (c) not a technique for multiple instruction issue (d) a branch prediction technique. Answer: b


2. In a memory system, there is one TLB, one physically addressed cache, and one physical main memory. Assume all memory addresses are translated to physical addresses before the cache is accessed. Which of the following events are impossible in this memory system?

     Cache  TLB   Page Table
(a)  Hit    Miss  Miss
(b)  Miss   Hit   Hit
(c)  Miss   Miss  Hit
(d)  Miss   Hit   Miss

Answer: (a), (d)

3. Consider three processors with different cache configurations:
• Cache 1: Direct-mapped with one-word blocks
• Cache 2: Direct-mapped with four-word blocks
• Cache 3: Two-way set associative with four-word blocks
The following miss rates have been measured:
• Cache 1: Instruction miss rate is 4%; data miss rate is 6%
• Cache 2: Instruction miss rate is 2%; data miss rate is 4%
• Cache 3: Instruction miss rate is 2%; data miss rate is 3%
Assume the same program with 10,000 instructions is executed on these processors, and one-half of the instructions contain a data reference. Assume that the cache miss penalty is 6 + Block size in words. Assume the CPI of this workload was measured to be 2.0 for the processor with cache 1.
(a) Determine the number of cycles spent on cache misses for each of these processors, respectively.
(b) Determine the CPI for the processors with Cache 2 and Cache 3, respectively.
(c) Assume the cycle times are 420 ps for the first and second processors, and 310 ps for the third processor. Determine which processor is fastest and which is the slowest.
Answer:
(a)
Cache  Miss penalty  Total miss cycles                              Stall cycles per instruction
1      6 + 1 = 7     10000 × (0.04 × 7 + 0.5 × 0.06 × 7) = 4900     0.49
2      6 + 4 = 10    10000 × (0.02 × 10 + 0.5 × 0.04 × 10) = 4000   0.40
3      6 + 4 = 10    10000 × (0.02 × 10 + 0.5 × 0.03 × 10) = 3500   0.35
(b) CPIbase = CPI − CPImisses = 2 − 0.49 = 1.51
CPI with cache 2 = 1.51 + 0.4 = 1.91
CPI with cache 3 = 1.51 + 0.35 = 1.86
(c) Instruction execution time for C1 = 2 × 420 ps = 840 ps
Instruction execution time for C2 = 1.91 × 420 ps = 802 ps
Instruction execution time for C3 = 1.86 × 310 ps = 577 ps
Therefore C3 is fastest and C1 is slowest.
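The miss-cycle figures in Problem 3 can be reproduced in a few lines (the dictionary layout is ours):

```python
# Problem 3: 10,000 instructions, half carry a data reference.
insts, data_frac = 10_000, 0.5
caches = {  # name: (instruction miss rate, data miss rate, miss penalty)
    1: (0.04, 0.06, 7),
    2: (0.02, 0.04, 10),
    3: (0.02, 0.03, 10),
}
for name, (im, dm, pen) in caches.items():
    miss_cycles = insts * (im * pen + data_frac * dm * pen)
    # stall cycles per instruction = miss_cycles / insts
    print(name, miss_cycles, miss_cycles / insts)
```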

4. Consider a two-way set associative cache which has 12 blocks, a four-word block size, and a 32-bit address.
(a) How many bits are required for this cache, including tag and valid fields?
(b) Given the series of address references as word addresses: 13, 39, 15, 38, 60, 15, 63, 36. Please label each reference as a hit or a miss for this cache.
Answer:
(a) The number of sets = 12/2 = 6; the bits in the index field = log2 6 = 2
The bits in the tag field = 32 − 4 − 2 = 26
The total bits for the cache = 6 × (1 + 26 + 4 × 32) × 2 = 1860 bits
(b)
Word address  Block address  Tag  Index  Hit/Miss
13            3              0    3      Miss
39            9              1    3      Miss
15            3              0    3      Hit
38            9              1    3      Hit
60            15             2    3      Miss
15            3              0    3      Miss
63            15             2    3      Hit
36            9              1    3      Miss
Note (a): the problem statement itself is flawed — neither a 2-bit nor a 3-bit index gives a consistent answer. With a 2-bit index, two of the sets cannot be addressed and some space is wasted; with a 3-bit index, the tag would be too short and the system could not work.

5. Given a MIPS program with corresponding addresses below

Address16   MIPS assembly instructions
fc80 0000   L1: add $s0, $s1, $s0
∙∙∙         ∙∙∙
fc8a 0000   bne $s0, $s1, L1
            lw $s0, 12($s1)
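The hit/miss trace in Problem 4(b) above can be replayed with a small LRU simulation (the function name and structure are ours):

```python
# Two-way set-associative cache, six sets, four-word blocks, LRU replacement,
# fed with word addresses as in Problem 4(b).
def simulate(word_addrs, sets=6, ways=2, block_words=4):
    cache = {s: [] for s in range(sets)}  # per set: tags, LRU-first order
    results = []
    for addr in word_addrs:
        block = addr // block_words
        s, tag = block % sets, block // sets
        if tag in cache[s]:
            results.append("Hit")
            cache[s].remove(tag)          # will be re-appended as MRU
        else:
            results.append("Miss")
            if len(cache[s]) == ways:
                cache[s].pop(0)           # evict the least recently used
        cache[s].append(tag)              # most recently used at the end
    return results

print(simulate([13, 39, 15, 38, 60, 15, 63, 36]))
# ['Miss', 'Miss', 'Hit', 'Hit', 'Miss', 'Miss', 'Hit', 'Miss']
```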

(a) Is it true that the range of addresses for beq in MIPS is up to about 2^15 bytes before the branch to about 2^15 bytes after?
(b) Is it true that the range of addresses for j in MIPS is anywhere within a block of 2^26 byte addresses, where the PC supplies the upper 6 bits?
(c) What problem does the above MIPS program have?
(d) Please rewrite the code using a minimal sequence so that if $s0 is equal to $s1, the next instruction to be executed will be the instruction at label L1.
(e) Given that the register numbers of $s0 and $s1 are 16 and 17 (decimal), respectively, and that the opcodes of the instructions beq, bne, slt, j, jal, lw are 4, 5, 0, 2, 3, and 35, respectively, please convert your assembly code to MIPS machine instruction formats (using binary formats).

    6 bits  5 bits  5 bits  5 bits  5 bits  6 bits
R   op      rs      rt      rd      shamt   funct
I   op      rs      rt      16-bit address
J   op      26-bit address

Address16   Machine instructions
fc80 0000   000000 10001 10000 10000 00000 000000
∙∙∙         ∙∙∙
fc8a 0000   (2%)
fc8a 0004   (2%)
fc8a 0008   (2%)
∙∙∙         ∙∙∙

Answer:
(a) No; the range should be up to about 2^15 words before the branch to about 2^15 words after.
(b) No; it should be within a block of 2^28 byte addresses, where the PC supplies the upper 4 bits.
(c) The target address is beyond the range the bne instruction can reach.
(d)
Address16   MIPS assembly instructions
fc80 0000   L1: add $s0, $s1, $s0
∙∙∙         ∙∙∙
fc8a 0000   bne $s0, $s1, L2
fc8a 0004   j L1
fc8a 0008   L2: lw $s0, 12($s1)
(e)
Address16   6 bits   5 bits  5 bits  5 bits  5 bits  6 bits
fc80 0000   000000   10001   10000   10000   00000   100000
fc8a 0000   000101   10000   10001   0000000000000001
fc8a 0004   000010   11001000000000000000000000
fc8a 0008   100011   10001   10000   0000000000001100
Note: following the statement of (d), the instruction bne $s0, $s1, L1 originally at address fc8a 0000 has been changed to beq $s0, $s1, L1 (and, because L1 is out of range, it is implemented with the bne/j pair above).
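The R-format word in (e) can be cross-checked with a tiny encoder (the helper name is ours):

```python
# add $s0, $s1, $s0: op=0, rs=$s1=17, rt=$s0=16, rd=$s0=16, shamt=0, funct=32.
def r_format(op, rs, rt, rd, shamt, funct):
    return f"{op:06b}{rs:05b}{rt:05b}{rd:05b}{shamt:05b}{funct:06b}"

word = r_format(0, 17, 16, 16, 0, 32)
print(word)  # 00000010001100001000000000100000
```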


6. In a computer system, bits have no inherent meaning. Given the bit pattern: 1010 1101 0001 0000 0000 0000 0000 0010. What does it represent, assuming that it is
(a) A two's complement integer
(b) An unsigned integer
(c) A single precision floating-point number
Answer:
(a) −1391460350
(b) 2903506946
(c) −8.18545 × 10^−12

7. Given the 4-bit restoring division block diagram below, please divide 7 (decimal) by 2 (decimal) and show the value of each register for each of the steps.

[Figure: 4-bit restoring division hardware — a 4-bit quotient register and 8-bit divisor and remainder registers.]

Iteration            Quotient  Divisor    Remainder
0 (initial values)   0000      0010 0000  0000 0111
1
2
3
4
5

Answer:

Iteration            Quotient  Divisor    Remainder
0 (initial values)   0000      0010 0000  0000 0111
1                    0000      0001 0000  0000 0111
2                    0000      0000 1000  0000 0111
3                    0000      0000 0100  0000 0111
4                    0001      0000 0010  0000 0011
5                    0011      0000 0001  0000 0001
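The table above follows the classic restoring algorithm: subtract the (right-shifting) divisor, restore on a negative result, and shift a quotient bit in each iteration. A sketch (the function name and register widths mirror the problem; they are our choices):

```python
# 4-bit restoring division: the divisor starts in the upper half of an
# 8-bit register and shifts right once per step; bits+1 iterations.
def restoring_divide(dividend, divisor, bits=4):
    remainder, div, quotient = dividend, divisor << bits, 0
    for _ in range(bits + 1):
        remainder -= div
        if remainder < 0:
            remainder += div              # restore; quotient bit is 0
            quotient = quotient << 1
        else:
            quotient = (quotient << 1) | 1
        div >>= 1
    return quotient & ((1 << bits) - 1), remainder

print(restoring_divide(7, 2))  # (3, 1): quotient 0011, remainder 0001
```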

8. Consider three RAID disk systems that are meant to store 10 terabytes of data (not counting redundancy). System A uses RAID 0, system B uses RAID 1, and system C uses RAID 5 with four disks in a "protection group".
(a) How many terabytes of storage are needed for systems A, B, and C, respectively?
(b) Determine which system is most reliable and which is least reliable.
Answer:
(a) System A: 10 terabytes
System B: 20 terabytes
System C: 10 + (10/4) = 12.5 terabytes
(b) System B (RAID 1) is most reliable
System A (RAID 0) is least reliable

9. There are two primary methods for increasing the potential amount of instruction-level parallelism. The first approach is increasing the depth of the pipeline to overlap more instructions. The second approach is to replicate the internal components of the computer so that it can launch multiple instructions in every pipeline stage. Assume that the first approach is responsible for t1 seconds and the second approach is responsible for t2 seconds, within the total execution time of t seconds. We also recognize that the improvement has costs. Assume that the first approach needs cost C1 to get a 1.1-times performance increase, and it again needs cost C1 to get another 1.1-times increase on top of the improved one, i.e., a (1.1)^2-times increase over the original. Assume the performance increase is restricted to this discrete function. The second approach follows the same rule with a discrete 1.2-times performance increase at a discrete cost C2. Under the total cost limitation CL, you are asked to discuss how to formulate the problem of getting the maximum performance increase by improving both the first approach and the second approach. Suppose that the first approach is improved n1 times and the second approach n2 times.
(a) Calculate the costs to improve the first approach and the second approach.
(b) Calculate the time for which each of the two improved approaches is responsible.
(c) Formulate the problem you are going to solve.
Answer:
(a) C1 × n1 + C2 × n2
(b) time for the first approach: t1 / 1.1^n1; time for the second approach: t2 / 1.2^n2
(c) Minimize t1/1.1^n1 + t2/1.2^n2 + (t − t1 − t2), subject to C1 × n1 + C2 × n2 ≤ CL
Note: "times" in the problem can be read either as a count of applications or as a multiplicative factor; the answer above takes the "count" reading. Under the "factor" reading the answer becomes:
(a) C1 × log1.1(n1) + C2 × log1.2(n2)
(b) t1/n1 for the first approach and t2/n2 for the second approach
(c) Minimize t1/n1 + t2/n2 + (t − t1 − t2), subject to C1 × log1.1(n1) + C2 × log1.2(n2) ≤ CL


97 年交大電子所 (NCTU Institute of Electronics, 2008)

1. You found that the critical path that sets the clock cycle of the multicycle implementation is the memory access for loads and stores (not instruction fetch). This makes your design run at 1 GHz instead of 2 GHz. You propose to split the memory access for loads and stores into three cycles, so that your design can run at 2 GHz. Use the following instruction mix: load: 25%; store: 10%; R-type: 50%; branch/jump: 15%.
(a) Determine how much faster the 2 GHz machine with three-cycle memory access is compared with the 1 GHz machine with one-cycle memory access. Assume the cycle counts for the 1 GHz machine are load (5 cycles), store (4 cycles), R-type (4 cycles), and branch/jump (3 cycles).
(b) Would you consider further splitting the memory access into four cycles to raise the clock rate to 3 GHz? Why?
Answer:
Instruction                    Frequency  1 GHz (CPI)  2 GHz (CPI)  3 GHz (CPI)
Loads                          25%        5            7            8
Stores                         10%        4            6            7
R-type                         50%        4            4            4
Branch/jump                    15%        3            3            3
Effective CPI                             4.1          4.8          5.15
Average instruction time (ns)             4.1          2.4          1.72
(a) The 3-cycle memory implementation is 4.1/2.4 ≈ 1.71 times faster than the 1-cycle memory implementation.
(b) Yes. By splitting the memory access into 4 cycles, the average instruction time is further reduced to 1.72 ns, so it is worth splitting the memory access into 4 cycles.

2. For a 16-bit adder, if we divide it into 4-bit groups:
(a) Please plot the adder using carry-skip adders. Please explain why this adder can speed up the addition process even if the carry is not skipped.
(b) Please explain why modified Booth encoding can speed up the multiplication process.
Answer:


(a) [Figure: 16-bit carry-skip adder — operands a15…a0 and b15…b0 enter four 4-bit ripple groups (Group 3 … Group 0); each group generates a group-propagate signal P3 … P0 that drives AND–OR skip logic between groups, from carry-in c0 to carry-out c16; a heavy line marks the critical path.]

Pi = (a4i+3 ⊕ b4i+3)(a4i+2 ⊕ b4i+2)(a4i+1 ⊕ b4i+1)(a4i ⊕ b4i), where i = 0 to 3.

The heavy line shown in the diagram above is the critical path of the 16-bit carry-skip adder. Since two stages (Group 2 and Group 1) are skipped, the adder speeds up the addition process.
(b) Modified Booth encoding accomplishes faster multiplication by generating a smaller number of partial products to be summed for the final result.

3. Consider the following news release: "The company will unveil the industry's first 5 GHz version of the chip, which offers a 25% performance boost over the company's former speed champ, which runs at 4 GHz. The new chip can be plugged into system boards for the older original chip (which ran at 1 GHz) to provide a 70% performance boost."
(a) Comment on the definition (or definitions) of performance that you believe the company used. Do you think the news release is misleading? Show your reason.
(b) If the 70% boost is correct, show the fraction that is sped up by the new processor.
Answer:
(a) Clock rate is likely what the company used to assess performance, since (5 GHz − 4 GHz)/4 GHz = 0.25 = 25%. The release is plausible, since the new chip is 5 times faster than the original chip and might provide a 70% performance boost.
(b) Speedup = 1 / ((1 − f) + f/5) = 1.7 → f = 0.514



4. Consider the following MIPS code fragment:
L.D R1, 45(R2)
DADD R7, R1, R5
DSUB R8, R1, R6
OR R9, R5, R1
BNEZ R7, target
DADD R10, R8, R5
XOR R2, R3, R4
(a) Identify each dependence by type; list the two instructions involved; identify which instruction is dependent; and, if there is one, name the storage location involved.
(b) For the classical 5-stage MIPS processor, assume a register file that writes in the first half of the clock cycle and reads in the second half, plus forwarding. Which of the dependences that you found in (a) become hazards and which do not? Why?
Answer:
(a) Data dependences (in each pair, instruction (j) is the dependent one, and the register shown is the storage location involved):
Lines  Instruction (i)    Instruction (j)     Location
1 & 2  L.D R1, 45(R2)     DADD R7, R1, R5     R1
1 & 3  L.D R1, 45(R2)     DSUB R8, R1, R6     R1
1 & 4  L.D R1, 45(R2)     OR R9, R5, R1       R1
2 & 5  DADD R7, R1, R5    BNEZ R7, target     R7
3 & 6  DSUB R8, R1, R6    DADD R10, R8, R5    R8

(b) Only the dependences between instructions 1 & 2 and 1 & 3 become data hazards. The other data dependences are resolved by the half-write/half-read register file and the forwarding strategy. Hence instruction 2 may need 2 stalls, and consequently no further stall is needed for instruction 3.
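The RAW pairs in (a) can be found mechanically by checking, for each instruction, whether any earlier instruction's destination is one of its sources. A sketch over the same fragment (the tuple encoding is ours):

```python
# (mnemonic, destination register or None, source registers)
code = [
    ("L.D",  "R1",  ["R2"]),          # base register is a source
    ("DADD", "R7",  ["R1", "R5"]),
    ("DSUB", "R8",  ["R1", "R6"]),
    ("OR",   "R9",  ["R5", "R1"]),
    ("BNEZ", None,  ["R7"]),
    ("DADD", "R10", ["R8", "R5"]),
    ("XOR",  "R2",  ["R3", "R4"]),
]
raw = [(i + 1, j + 1)                  # 1-based line numbers, as in the table
       for j, (_, _, srcs) in enumerate(code)
       for i, (_, dest, _) in enumerate(code[:j])
       if dest in srcs]
print(raw)  # [(1, 2), (1, 3), (1, 4), (2, 5), (3, 6)]
```

Note that XOR writing R2 after L.D reads it is a WAR (antidependence), which this RAW scan deliberately ignores.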


5. (a) Why can pipelining increase performance? Explain in terms of IC, CPI, and cycle time.
(b) As the number of pipeline stages increases, the gap between the ideal speedup and the actual speedup grows as well. Give two reasons.
(c) Is it possible to eliminate the pipeline stalls caused by data dependence just by code reordering? Give a brief explanation if not, or give an example if yes.
(d) Is the forwarding technique capable of eliminating all pipeline stalls caused by data dependence? Why or why not?
Answer:
(a) Execution time = IC × CPI × cycle time. Pipelining reduces the average CPI by overlapping the execution of instructions. By dividing the datapath into stages, pipelining allows a higher clock rate and thus a shorter cycle time. Hence performance is increased.
(b) 1. Balancing the stage times becomes harder. 2. The penalties for resolving hazards increase, raising the average CPI.
(c) Yes. For example, both add instructions in the following code have a data hazard because of their respective dependence on the immediately preceding lw instruction.
lw $t2, 4($t0)
add $t3, $t1, $t2
sub $t6, $t6, $t7
lw $t4, 8($t0)
add $t5, $t1, $t4
Reordering the code as follows eliminates all the data hazards:
lw $t2, 4($t0)
lw $t4, 8($t0)
sub $t6, $t6, $t7
add $t3, $t1, $t2
add $t5, $t1, $t4
(d) No. When an instruction tries to read a register right after a load instruction that writes the same register, forwarding cannot eliminate the stall. For example, suppose we execute the following code on the MIPS 5-stage pipeline:
lw $s0, 20($t1)
sub $t2, $s0, $t3
Clearly, there is a load-use data hazard between the two instructions.

The following shows the multiple-clock-cycle pipeline diagram:

      c1   c2   c3   c4    c5    c6
lw    IF   ID   EX   MEM   WB
sub        IF   ID   EX    MEM   WB

During c4 the lw instruction is accessing memory, but in the same clock cycle the sub instruction would be using the wrong operand ($s0) for its computation. Even with a forwarding path from the memory-access stage output to the execution stage input, we still have to stall the pipeline for 1 clock cycle to avoid this data hazard.

6. (a) Explain the three types of cache misses: compulsory, capacity, and conflict misses.
(b) What is the use of the valid bit and the dirty bit within a cache, respectively?
(c) How many total bits are required for a direct-mapped cache with 16 KB of data and 4-word blocks, assuming a 32-bit address?
(d) Given a direct-mapped cache and a fully associative cache with the same cache size and same block size, which cache requires a larger tag memory? Give your reason.
(e) Why does the miss rate go down first and then rise as the block size increases?
(f) Is it possible that a cache read miss leads to a memory write? Give your reasons.
(g) Suppose we have a processor with a base CPI of 1.0, assuming all references hit in the primary cache, and a clock rate of 5 GHz. Assume a main memory access time of 100 ns, including all the miss handling. Suppose the miss rate per instruction at the primary cache is 2%. How much faster will the processor be if we add a secondary cache that has a 5 ns access time for either a hit or a miss and is large enough to reduce the miss rate to main memory to 0.5%?
Answer:
(a) Compulsory miss: a cache miss caused by the first access to a block that has never been in the cache.
Capacity miss: a cache miss that occurs because the cache, even with full associativity, cannot contain all the blocks needed to satisfy the request.
Conflict miss: a cache miss that occurs in a set-associative or direct-mapped cache when multiple blocks compete for the same set.
(b) Valid bit: indicates that the block in the cache contains valid data.
Dirty bit: indicates whether the victim block in the cache must be written back.
(c) 16 KB = 4K words = 2^12 words; a block size of 4 words → 2^10 blocks in the cache.

The total cache size = 2^10 × (128 + (32 − 10 − 4) + 1) = 2^10 × 147 = 147 Kbits.
(d) The fully associative cache requires a larger tag memory. Using (c) as an example:
the tag memory for the direct-mapped cache = (32 − 10 − 4) × 2^10 = 18 Kbits;
the tag memory for the fully associative cache = (32 − 4) × 2^10 = 28 Kbits.
Hence the fully associative cache requires a larger tag memory.
(e) Large blocks exploit spatial locality to lower miss rates. The miss rate may eventually go up if the block size becomes a significant fraction of the cache size, because the number of blocks that can be held in the cache becomes small, and there will be a great deal of competition for those blocks. As a result, a block will be bumped out of the cache before many of its words are accessed; that is, spatial locality among the words in a block decreases.
(f) Yes. If the set containing the missing data block is full and the victim block to be replaced has been updated by the processor, then the dirty block must be written back to memory before the new data block is brought into the cache.
(g) The miss penalty to main memory is 100/0.2 = 500 clock cycles.
For the processor with one level of caching, total CPI = 1.0 + 500 × 2% = 11.0.
The miss penalty to the second-level cache is 5/0.2 = 25 clock cycles.
For the two-level cache, total CPI = 1.0 + 2% × 25 + 0.5% × 500 = 4.0.
Thus the processor with the secondary cache is faster by 11.0/4.0 ≈ 2.8.

7. (a) Consider a virtual memory subsystem: 40-bit virtual byte address, 16 KB pages, 36-bit physical byte address. What is the total size of the page table for each process on this processor, assuming that the valid, protection, dirty, and use bits take a total of 4 bits and that all the virtual pages are in use? (Assume that disk addresses are not stored in the page table.)
(b) Describe how a virtual address is translated into a physical address when a TLB is incorporated.
Answer:
(a) Each page is 16 KB → a 14-bit page offset.
The virtual page number is 40 − 14 = 26 bits, so there are 2^26 entries in the page table.
Each entry requires 36 − 14 = 22 bits to store the physical page number and an additional 4 bits for the valid, protection, dirty, and use bits. We round the 26 bits up to a full word per entry, giving a total size of 2^26 × 32 bits, or 256 MB.
(b) The virtual address is first broken into a virtual page number and a page offset.

The TLB, which contains virtual-to-physical address translations, is indexed by the virtual page number. If the mapping exists, the physical page number from the TLB forms the upper portion of the physical address, while the page offset forms the lower portion. If the mapping does not exist, the missing translation is retrieved from the page table.
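The sizing chain in 7(a) — offset bits, entry count, entry width, rounding to a word — can be checked directly:

```python
# 40-bit virtual address, 16 KB pages, 36-bit physical address,
# 4 status bits per entry, entries rounded up to 32 bits.
page_offset_bits = 14                     # 16 KB = 2^14 bytes
entries = 2 ** (40 - page_offset_bits)    # number of virtual pages
entry_bits = (36 - page_offset_bits) + 4  # PPN + valid/protection/dirty/use
rounded = 32                              # 26 bits rounded up to a word

table_bytes = entries * rounded // 8
print(entries, entry_bits, table_bytes // 2**20)  # 67108864 26 256 (MB)
```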


97 年交大電控 (NCTU Electrical and Control Engineering, 2008)

1. Suppose that two processors P1 and P2 have the same instruction set, which consists of 4 classes of instructions (A, B, C and D). The CPI (clock cycles per instruction) of each class is given in Table 1 below. The clock rates of P1 and P2 are 800 MHz and 1 GHz respectively. Two compilers C1 and C2 are used to generate assembly code from the same source program, and the instruction counts for each compiler are also shown in Table 1. Now answer the following questions.

Table 1
Instruction class  CPI for P1  CPI for P2  Instruction count of C1 (billions)  Instruction count of C2 (billions)
A                  1           2           5                                   6
B                  2           2           2                                   2
C                  3           4           1                                   2
D                  4           3           2                                   2

(a) If the codes generated by C1 and C2 are run on P1, what is the execution time for each code?
(b) What is the MIPS (million instructions per second) for each code in part (a)?
(c) If the codes generated by C1 and C2 are run on P2, what is the execution time for each code?
(d) What is the MIPS for each code in part (c)?
Answer:
(a) Execution time for C1 = (5 × 1 + 2 × 2 + 1 × 3 + 2 × 4) × 10^9 / (800 × 10^6) = 25 sec.
Execution time for C2 = (6 × 1 + 2 × 2 + 2 × 3 + 2 × 4) × 10^9 / (800 × 10^6) = 30 sec.
(b) MIPS for C1 = (5 + 2 + 1 + 2) × 10^9 / (25 × 10^6) = 400
MIPS for C2 = (6 + 2 + 2 + 2) × 10^9 / (30 × 10^6) = 400
(c) Execution time for C1 = (5 × 2 + 2 × 2 + 1 × 4 + 2 × 3) × 10^9 / 10^9 = 24 sec.
Execution time for C2 = (6 × 2 + 2 × 2 + 2 × 4 + 2 × 3) × 10^9 / 10^9 = 30 sec.
(d) MIPS for C1 = (5 + 2 + 1 + 2) × 10^9 / (24 × 10^6) = 416.67
MIPS for C2 = (6 + 2 + 2 + 2) × 10^9 / (30 × 10^6) = 400
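The pattern behind every line of Problem 1's answer is cycles = Σ(IC_class × CPI_class) and time = cycles / clock rate; MIPS then follows from the total instruction count. A sketch (the helper name is ours):

```python
# Instruction counts are in billions; clock rate in Hz.
def exec_time(ic_billion, cpi, hz):
    cycles = sum(i * c for i, c in zip(ic_billion, cpi)) * 1e9
    return cycles / hz

cpi_p1, cpi_p2 = [1, 2, 3, 4], [2, 2, 4, 3]
ic_c1, ic_c2 = [5, 2, 1, 2], [6, 2, 2, 2]

t = exec_time(ic_c1, cpi_p1, 800e6)       # C1 on P1
mips = sum(ic_c1) * 1e9 / (t * 1e6)
print(t, mips)  # 25.0 400.0
```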


2. Use the instructions and registers in Table 2 to compile the following C code to assembly code. You are restricted to the instructions in Table 2, but need not use all of them. Suppose that each element in arrays x and y occupies 4 bytes and the memory is byte-addressable. Let register $s0 be the variable k, and let registers $t0 and $t1 hold the start addresses of arrays x and y respectively.
/* compile this for-loop to assembly language */
for (k = 0; k < 10; k++)
    y[k] = x[k] + 2*k;

Table 2
Instruction          Example             Meaning                        Comments
add                  add $s1, $s2, $s3   $s1 = $s2 + $s3                addition
add immediate        addi $s1, $s2, 100  $s1 = $s2 + 100                add constant
branch on not equal  bne $s1, $s2, L     if ($s1 != $s2) go to label L  not-equal test and branch
load word            lw $s1, 100($s2)    $s1 = Memory[$s2 + 100]        word from memory to register
shift left logical   sll $s1, $s2, 10    $s1 = $s2 << 10                shift left by constant

∙∙∙ 127 exceeds the maximum positive integer that an 8-bit 2's complement representation can hold.

3. Processor control
(a) Please describe clearly and compare microprogramming and hardwired design for controlling a processor in which one instruction is executed in several clock cycles.
(b) Given that a processor supports four types of instructions — Arithmetic, Memory Load, Memory Store and Branch — and that the following truth table specifies the inputs and outputs of the control mechanism, what control design strategy will you use and why? Please show the details of your design.

Control      Input                    Output
Signal Name  In5 In4 In3 In2 In1 In0  RegDst RegWrite MemRead MemWrite Branch ALUOP1 ALUOp2
Arithmetic   0   0   0   0   0   0    1      1        0       0        0      1      0
Mem. Load    1   0   0   0   1   1    0      1        1       0        0      0      0
Mem. Store   1   0   1   0   1   1    X      0        0       1        0      0      0
Branch       0   0   0   1   0   0    X      0        0       0        1      0      1

Answer:


(a)
Microprogrammed control — a method of specifying control that uses microcode rather than a finite state representation.
Advantages: 1. it simplifies the design of the control unit; 2. it is flexible and the design is easy to modify; 3. it is suited to more complex instruction sets.
Disadvantages: 1. slower execution; 2. higher hardware cost (control memory and decoding logic).
Hardwired control — an implementation of finite state machine control, typically using programmable logic arrays (PLAs) or collections of PLAs and random logic.
Advantages: fast execution.
Disadvantages: 1. the hardware design is more complex and hard to modify; 2. it is suited only to simpler instruction sets.

R-format

lw

beq

sw

Outputs RegDst RegWrite MemRead MemWrite Branch ALUOp1 ALUOp0

The structure, called a programmable logic array (PLA), uses an array of AND gates followed by an array of OR gates. The inputs to the AND gates are the function inputs and their inverses. The inputs to the OR gates are the outputs of the AND gates). The output of the OR gates is the function outputs. PLA can simplify the design process and is one of the common ways to implement a control function.

229

4. For the 32 bit Floating point system given below, the exponent bias is 127, i.e., if the true exponent is 5, the biased exponent would be 132. Find the sign bit, exponent, and significand representations of floating point number 118.75. sign of significand

8 bits

23 bits

biased exponent

significand

Answer: 118.7510 = 1110110.112 = 1.110110112  26 S 0

exponent 10000101

significand 11011011000000000000000

5. Write a complete VHDL or Verilog code for the circuit shown below:

A

X

B C

Y

D Answer: entity Comb is port (A, B, C, D: in std_logic; X, Y: out std_logic); end Comb; architecture behavioral of Comb is begin X = 0) sum = sum + A[j]; else sum = sum - A[j]; j = j + 1; } Assume that variable j is assigned to R1, variable k is assigned to R2, and variable sum is assigned to R3; the start address of integer array A is already stored in R4.

Answer L0:

L1: L2:

slt beq sll add lw slt bne add

R5, R1, R2 R5, R0, END R5, R1, 2 R5, R4, R5 R6, 0(R5) R5, R6, R0 R5, R0, L1 R3, R3, R6

j sub addi j

L2 R3, R3, R6 R1, R1, 1 L0

END:

3. Explain the following terms in detail. (a) RAW (Read After Write) dependency and WAR (Write After Read) dependency. (b) Write Back cache and Write Through cache. (c) Virtual Memory, Page table, and TLB (Translation Look-aside Buffer). (d) RISC (Reduced Instruction Set Computer) and CISC (Complex Instruction Set Computer). Answer (a) RAW dependency: it arises when the next instruction tries to read a source before the previous instruction writes to it. So, the next instruction gets the old value incorrectly 344

WAR dependency: it arises when the next instruction writes to a destination before the previous instruction reads it. In this case, the previous instruction gets a new value incorrectly (b) Write back cache: a scheme that handles writes by updating values only to the block in the cache, then writing the modified block to the lower level of the hierarchy when the block is replaced. Write through cache: a scheme in which writes always update both the cache and the memory, ensuring that data is always consistent between the two. (c) Virtual Memory: a memory system that uses main memory as a ―cache‖ for secondary storage Page table: a page table is the table which contains the virtual to physical address translation in virtual memory system TLB: a cache that keeps track of recently used address mappings to avoid an access to the page table (d) RISC: it stands for reduced instruction set computer and is the generic name given to processors that use a small number of simple instructions, to try to do less work with each instruction but execute them much faster CISC: it stands for complex instruction set computer and is the name given to processors that use a large number of complicated instructions, to try to do more work with each one.

345

100 年彰師大電子 1. Suppose you are trying to improve the performance of processor X, which spends 25% of its CPU time executing floating point (FP) operations and 30% of its CPU time executing the load/store operations. The first improvement method is to make the FP operations run 10 times faster, and the second improvement method is to make the load/store operations run 7 times faster. (a) Which method (the first or the second) can get higher speed-up? (b) Suppose the floating point operations can be improved by a factor of infinity (∞), what is the speed-up? Answer (a) Speedup1 =

1 0.25  0.75 10

 1.29 ; Speedup2 =

1 0.3  0.7 7

 1.346

The second method can get higher speed-up. (b) Speedup =

1 0.25  0.75 

 1.33

2. Translate the following C code segment into MIPS assembly code while ( j < 10) {if (A[j] >= B[j]) {sum = sum + A[j];} else {sum = sum + B[j];} j = j + i;} Assume variable j is assigned to R1 and variable sum is assigned to R2; the start address of integer array A is already stored in R3; the start address of integer array B is already stored in R4. Answer L0:

L1: L2:

slti beq sll add add lw lw slt bne add j add addi j

R5, R1, 10 R5, R0, END R5, R1, 2 R6, R5, R3 R7, R5, R4 R6, 0(R6) R7, 0(R7) R5, R6, R7 R0, R5, L1 R2, R2, R6 L2 R2, R2, R7 R1, R1, 1 L0 346

END:

3. The following Figure shows the block diagram of processor MIPS R2000 (the figure is from the book Computer Organization and Design by Patterson and Hennessy). If the processor is executing the instruction LW R11, 56(R7), what are the values for the following signals? (a) Instruction[25-21]; (b) Instruction[20-16]; (c) Instruction[15-0]; (d) ALUSrc; (e) MemtoReg.

[Figure: single-cycle MIPS datapath with PC and instruction memory, register file, sign extend and shift-left-2 units, ALU with ALU control, data memory, and multiplexers driven by the control signals RegDst, Branch, MemRead, MemtoReg, ALUOp, MemWrite, ALUSrc, RegWrite, and PCSrc]

Answer
(a) Instruction[25-21] = 7
(b) Instruction[20-16] = 11
(c) Instruction[15-0] = 56
(d) ALUSrc = 1
(e) MemtoReg = 1

4. Plot the block diagram for a 4-way set-associative cache and explain how the address, which is composed of tag, index, and offset, sent from the processor is used to access the cache.
Answer


[Figure: 4-way set-associative cache; the address is split into a 22-bit tag, an 8-bit index, and a block offset. The index selects one of 256 sets, each holding four V/Tag/Data entries; four comparators check the tag of each way, and a 4-to-1 MUX selects the matching Data, producing the Hit signal]

The index is used to select the set; the tag is then compared with the tags of the blocks in the selected set to choose the block.

5. Explain how virtual memory is implemented in a computer system. You must explain (a) page table; (b) address translation; (c) the advantages of virtual memory.
Answer
In virtual memory, the address is broken into a virtual page number and a page offset. The physical page number constitutes the upper portion of the physical address, while the page offset, which is unchanged, constitutes the lower portion. Each program has its own page table, which maps the virtual address space of the program to main memory. The page table is indexed with the virtual page number to find the corresponding physical page number. The advantages of virtual memory are that it allows efficient and safe sharing of memory among multiple programs and removes the programming burden of a small, limited amount of main memory; it also allows a single user program to exceed the size of physical memory.
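The translation step described above can be sketched in a few lines. The page size and the page-table contents below are made-up illustrations, not values from the exam:

```python
PAGE_SIZE = 4096  # assumed 4 KiB pages (not specified in the exam)

# hypothetical page table: virtual page number -> physical page number
page_table = {0: 7, 1: 3, 4: 12}

def translate(vaddr):
    # split the virtual address into virtual page number and page offset
    vpn, offset = divmod(vaddr, PAGE_SIZE)
    if vpn not in page_table:
        # a miss here is a page fault: the OS must bring the page in
        raise LookupError("page fault")
    # the physical page number replaces the virtual page number;
    # the page offset passes through unchanged
    return page_table[vpn] * PAGE_SIZE + offset

print(translate(4096 + 10))  # virtual page 1 maps to physical page 3
```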


99 年彰師大電子
1. Translate the following C code segment into MIPS assembly code:
while (j < 10) {
    if (j % 2 == 0) sum_e = sum_e + A[j];
    else sum_o = sum_o + A[j];
    j = j + 1;
}
Assume variable j is already assigned to R1, variable sum_e is already assigned to R2, variable sum_o is already assigned to R3, and the start address of integer array A is already stored in R4.
Answer:
L0:   slti R5, R1, 10
      beq  R0, R5, END
      sll  R5, R1, 2
      add  R5, R5, R4
      lw   R6, 0(R5)
      andi R7, R1, 1
      bne  R7, R0, L1
      add  R2, R2, R6
      j    L2
L1:   add  R3, R3, R6
L2:   addi R1, R1, 1
      j    L0

END:

2. Consider two different implementations of the same instruction set architecture. There are four classes of instructions, A, B, C, and D. The clock rate and CPI (cycles per instruction) of each implementation are given in the following table. Given a program with 10^8 instructions divided into classes as follows: 20% class A, 30% class B, 40% class C, and 10% class D. (a) Find the average CPI for P1; (b) Find the average CPI for P2; (c) Find the execution time for P1; (d) Find the execution time for P2.

      clock rate   CPI class A   CPI class B   CPI class C   CPI class D
P1    2.6 GHz      1             2             1             4
P2    1.5 GHz      2             3             1             2

Answer:
(a) Average CPI for P1 = 1 × 0.2 + 2 × 0.3 + 1 × 0.4 + 4 × 0.1 = 1.6
(b) Average CPI for P2 = 2 × 0.2 + 3 × 0.3 + 1 × 0.4 + 2 × 0.1 = 1.9

(c) Execution time for P1 = (10^8 × 1.6) / 2.6 G = 0.0615 (sec.)
(d) Execution time for P2 = (10^8 × 1.9) / 1.5 G = 0.1267 (sec.)

3. Explain the following terms in detail: (a) Write After Read hazard (b) Set associative cache (c) Static instruction scheduling (d) Register renaming.
Answer:
(a) A Write After Read hazard arises when the next instruction writes to a destination before the previous instruction reads it. In this case, the previous instruction incorrectly gets the new value.
(b) A cache that has a fixed number of locations (at least two) where each block can be placed.
(c) Reordering of instructions by the compiler to avoid pipeline stalls.
(d) The renaming of registers by the compiler or hardware to remove antidependences.

4. The micro-architecture of MIPS R2000 is shown in the following figure (this figure comes from the book "Computer Organization and Design" by D. A. Patterson and J. L. Hennessy). Assume MIPS R2000 is to execute the instruction 'LW R3, 100(R1)'; assign appropriate values to the following control signals: (a) ALUSrc (b) RegWrite (c) MemWrite (d) MemRead (e) MemtoReg

[Figure: single-cycle MIPS R2000 datapath with control signals]

Answer:
(a) ALUSrc = 1
(b) RegWrite = 1
(c) MemWrite = 0
(d) MemRead = 1
(e) MemtoReg = 1

98 年彰師大電子
1. (a) The MIPS processor has an instruction JAL (Jump and Link). Describe the conditions in which you need to use the JAL instruction when writing a MIPS assembly program. (b) Describe the detailed hardware operations for the MIPS processor to execute a JAL instruction.
Answer:
(a) The JAL instruction is used when a procedure call is required in a MIPS assembly language program.
(b) It jumps to an address and simultaneously saves the address of the following instruction in a register ($ra in MIPS).

2. Describe at least three methods to reduce the branch penalty of a pipelined processor.
Answer:
(1) Delayed branch
(2) Branch prediction
(3) Moving the branch decision from the MEM stage to the ID stage

3. (a) Give two examples of false data dependence between instructions. (b) How to remove false data dependences between instructions?
Answer:
(a) A false data dependence exists when instructions reuse a register name although no actual data flows between them; the compiler cannot always see that no true dependence exists. Anti-dependence (WAR) and output dependence (WAW) are the two kinds of false data dependence.
Example 1 (anti-dependence):
add $s1, $s2, $s3
sub $s2, $t0, $t1

Example 2 (output dependence):
add $s1, $s2, $s3
sub $s1, $t0, $t1

(b) Anti- and output dependences arise because of the limited number of registers, so we can use the register renaming technique to remove these false data dependences.
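Register renaming as mentioned in (b) can be sketched as follows; the physical-register names p1, p2, ... are hypothetical:

```python
def rename(program):
    # each architectural destination gets a fresh physical register on
    # every write, removing WAR/WAW name dependences but preserving RAW
    mapping, counter, out = {}, 0, []
    for dst, srcs in program:
        srcs = [mapping.get(s, s) for s in srcs]  # read current names
        counter += 1
        mapping[dst] = f"p{counter}"              # fresh destination name
        out.append((mapping[dst], srcs))
    return out

# Example 2 above: add $s1,$s2,$s3 ; sub $s1,$t0,$t1 (output dependence)
print(rename([("$s1", ["$s2", "$s3"]), ("$s1", ["$t0", "$t1"])]))
```

After renaming, the two writes target different physical registers, so the output dependence on $s1 disappears.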


4. Write the corresponding MIPS R2000 assembly codes for the following C code segment. Assume variable i is assigned to register R1, the start addresses of integer arrays A and B are already stored in registers R11 and R12, and variable sum is assigned to register R13. An integer is 4 bytes long.
// int A[10], B[10], sum;
while (i >= 0) {
    if (A[i] >= B[i]) sum = sum + A[i];
    else sum = sum + B[i];
    i = i - 1;
}
Answer:
L0:   slt  R2, R1, R0
      bne  R0, R2, END
      sll  R3, R1, 2
      add  R4, R3, R11
      add  R5, R3, R12
      lw   R6, 0(R4)
      lw   R7, 0(R5)
      slt  R8, R6, R7
      bne  R0, R8, L1
      add  R13, R13, R6
      j    L2
L1:   add  R13, R13, R7
L2:   addi R1, R1, -1
      j    L0
END:

5. A 3-bit ALU has two 3-bit inputs A and B, one 3-bit output C, and a 2-bit control input S. When S = 00, C = (A bitwise-AND B); when S = 01, C = (A bitwise-OR B); when S = 10, C = (A add B); when S = 11, C = (A subtract B). Use the basic logic gates (AND, OR, NOT, NAND, NOR, XOR), 1-bit full adders, and 1-bit 4-to-1 multiplexers to design the 3-bit ALU.
Answer:

352

1-bit ALU

Binvert

3-bit ALU

CarryIn

a

S0 S1

Op

0

c

1

b

0 1

a0

2

+

b0

3

CarryOut

a1

S

Binvert

CarryIn

Op1

Op0

Operation

00

0



0

0

And

01

0

0

1

Or

10 11

0 1

 0 1

1 1

0 1

Add Sub

op1

Bnegate

b1

a2 b2

op0

CarryIn ALU0 CarryOut

c0

CarryIn ALU1 CarryOut

c1

CarryIn ALU2 CarryOut

c2

Carryout

Bnegate
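The behavior of this design can be checked with a software sketch that mirrors the hardware structure (cascaded 1-bit slices with a rippled carry; function names are ours). Subtraction is done, as in the circuit, by inverting b and injecting a carry-in of 1:

```python
def full_adder(a, b, cin):
    # 1-bit full adder: returns (sum, carry-out)
    s = a ^ b ^ cin
    cout = (a & b) | (a & cin) | (b & cin)
    return s, cout

def alu1(a, b, binvert, cin, op):
    # one 1-bit ALU slice: Binvert mux, AND, OR, full adder, result mux
    b = b ^ binvert                      # conditionally invert b
    s, cout = full_adder(a, b, cin)
    return [a & b, a | b, s][op], cout   # op: 0=AND, 1=OR, 2=adder

def alu3(a, b, s):
    # s: 00=AND, 01=OR, 10=ADD, 11=SUB (Binvert = CarryIn0 = 1 only for SUB)
    binvert = 1 if s == 0b11 else 0
    op = 2 if s >= 0b10 else s           # ADD and SUB both use the adder
    c, carry = 0, binvert
    for i in range(3):                   # ripple through ALU0..ALU2
        bit, carry = alu1((a >> i) & 1, (b >> i) & 1, binvert, carry, op)
        c |= bit << i
    return c                             # carry-out is dropped, as in 3 bits

print(alu3(0b101, 0b011, 0b11))  # 5 - 3 = 2
```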

6. Assume that you have already finished the logic design of a single-cycle processor (i.e., the processor can complete the execution of an instruction within a single clock cycle). How do you modify the logic design of a single-cycle processor into that of a five-stage pipelined processor? Give your answer in detail.
Answer:
(1) Separate the datapath into five pieces, with each piece corresponding to a stage of instruction execution, by cutting the datapath behind each main functional unit (the main functional units are: instruction memory, register file, ALU, and data memory).
(2) To retain the values an individual instruction needs in each stage, place pipeline registers at the cutting lines between stages.
(3) Pass the destination register number along the pipeline from the ID stage until it reaches the MEM/WB pipeline register, so that the write-back stage writes the correct register.


7. The instruction fetch unit (IFU) in a RISC processor takes charge of fetching instructions from memory to the instruction decoder. Plot the block diagram of the IFU to show its internal architecture, and illustrate how the IFU deals with jump and branch instructions.
Answer:
The following diagram shows the instruction fetch unit. When a jump instruction is executing, the control signal jump is set to 1, and the jump target address goes through the multiplexer (on the right) and updates the PC at the next clock. When a branch instruction is executing, the control signal branch is set to 1 and jump is set to 0. If the branch condition is true, the branch target goes through the two multiplexers and updates the PC at the next clock.

[Figure: instruction fetch unit; the PC feeds the instruction memory address, and an adder computes PC + 4. A first multiplexer (controlled by Branch) selects between PC + 4 and the branch target, and a second multiplexer (controlled by jump) selects between that result and the jump target before updating the PC]


97 年彰師大電子
1. Interpret the following three types of dependency, and give an example for each type of dependency. (a) Read After Write dependency (b) Write After Read dependency (c) Write After Write dependency
Answer:
(a) Read After Write (RAW), or true dependency: an operand is modified and read soon after. Because the first instruction may not have finished writing to the operand, the second instruction may use incorrect data. For example:
i1: r2 ← r1 + r3
i2: r4 ← r2 + r3
In a pipeline, when we fetch the operand (r2) for the 2nd instruction, the result from the 1st instruction will not yet have been saved.
(b) Write After Read (WAR), or anti-dependency: an operand is read, and soon after the same operand is written. Because the write may finish before the read, the read instruction may incorrectly get the newly written value. For example:
i1: r1 ← 12(r3)
i2: r3 ← r4 − r5
In a pipeline, when we fetch the operand (r3) for the 1st instruction, the write from the 2nd instruction may already have finished.
(c) Write After Write (WAW), or output dependency: two instructions that write to the same operand are performed. The first one issued may finish second, and therefore leave the operand with an incorrect data value. For example:
i1: r2 ← r1 + r3
i2: r2 ← r4 − r7
In a pipeline, when we write the result (r2) for the 1st instruction, the write from the 2nd instruction may already have finished.

2. (a) Plot the hardware diagram for the data forwarding scheme in pipelined processors. (b) Can the data forwarding scheme remove the Load-Use data hazard? Give your reason.


Answer:
(a) [Figure: forwarding paths from the EX/MEM and MEM/WB pipeline registers back to the ALU inputs, selected by forwarding multiplexers]
(b) The forwarding scheme cannot remove the Load-Use data hazard completely. Suppose that we have the following code to execute in the MIPS 5-stage pipeline processor:
lw  $s0, 20($t1)
sub $t2, $s0, $t3
Clearly, there is a load-use data hazard between the two instructions. The following shows the multiple-clock-cycle pipeline diagram.

      c1   c2   c3   c4    c5    c6
lw    IF   ID   EX   MEM   WB
sub        IF   ID   EX    MEM   WB

During c4 the lw instruction is accessing memory, but in the same clock the sub instruction is using the wrong operand ($s0) for computation. Even if we have a forwarding path from the memory access stage output to the execution stage input, we still have to stall the pipeline for 1 clock to avoid such a data hazard.

3. A computer system has a two-way set-associative data cache. This data cache has 1024 blocks in total and each block contains 16 bytes. Plot the hardware diagram for this data cache. You must show how the address from the processor is used to select one data item out of the cache.
Answer:
Suppose that the memory address issued from the CPU is 32 bits.

[Figure: two-way set-associative cache; the 32-bit address is split into a 19-bit tag (bits 31-13), a 9-bit index (bits 12-4), and a 4-bit block offset (bits 3-0). The index selects one of 512 sets, each with two V/Tag/Data (128-bit) entries; two comparators check the tags, and a 2-to-1 MUX selects the matching Data, producing Hit and Data]
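The address decomposition for this cache follows directly from its parameters and can be sketched as below (the function name is ours):

```python
# 1024 blocks / 2 ways = 512 sets -> 9 index bits;
# 16-byte blocks -> 4 offset bits; tag = 32 - 9 - 4 = 19 bits
def split_address(addr):
    offset = addr & 0xF             # bits 3-0: byte within the block
    index = (addr >> 4) & 0x1FF     # bits 12-4: which set to probe
    tag = addr >> 13                # bits 31-13: compared against both ways
    return tag, index, offset

tag, index, offset = split_address(0x12345678)
```

A hit occurs when either way of the selected set is valid and its stored tag equals the address tag.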

4. Write the corresponding MIPS R2000 assembly codes for the following C code segment. (Assume variable i is assigned to register R1, and the start addresses of arrays A, B, and C are already stored in registers R11, R12, R13. An integer is 4 bytes long.)
// int A[10], B[10], C[10];
while (i >= 0) {
    if (B[i] >= C[i]) A[i] = B[i] + A[i];
    else A[i] = C[i] + A[i];
    i = i - 1;
}
Answer:


Loop:  slt  $t0, $R1, $zero      # if i < 0 then
       bne  $t0, $zero, Exit     # exit the while loop
       sll  $R2, $R1, 2          # $R2 = 4 * i
       add  $t1, $R2, $R11       # $t1 contains the address of A[i]
       lw   $t2, 0($t1)          # $t2 contains the data of A[i]
       add  $t3, $R2, $R12       # $t3 contains the address of B[i]
       lw   $t4, 0($t3)          # $t4 contains the data of B[i]
       add  $t5, $R2, $R13       # $t5 contains the address of C[i]
       lw   $t6, 0($t5)          # $t6 contains the data of C[i]
       slt  $t0, $t4, $t6        # if B[i] < C[i]
       bne  $t0, $zero, Else     # goto Else
       add  $t7, $t4, $t2        # A[i] = B[i] + A[i]
       sw   $t7, 0($t1)
       j    Out
Else:  add  $t8, $t6, $t2        # A[i] = C[i] + A[i]
       sw   $t8, 0($t1)
Out:   addi $R1, $R1, -1         # i = i - 1
       j    Loop

Exit:

5. Explain how a processor handles interrupts and exceptions. Describe an exception handling mechanism for the processor so that the processor can jump to the corresponding exception handling routine and can resume the execution of the interrupted program after the exception handling routine is executed.
Answer:
The basic action that the machine must perform when an exception occurs is to save the address of the offending instruction in the exception program counter (EPC) and then transfer control to the operating system at some specified address. The operating system can then take the appropriate action. After performing whatever action is required because of the exception, the operating system may continue the program's execution, using the EPC to determine where to restart the execution of the program. For the operating system to handle the exception, it must know the reason for the exception. One method is to include a status register (called the Cause register), which holds a field that indicates the reason for the exception. Finally, we need to be able to write the exception address, which is the operating system entry point for exception handling, into the PC; in the MIPS architecture, this address is 0x80000180.

6. Describe the Booth's Algorithm for multiplication. Use the Booth's Algorithm to compute 10110110 × 11110110 step by step.
Answer:
Booth's algorithm examines the current bit of the multiplier together with the bit previously shifted out: on a (1, 0) pair it subtracts the multiplicand from the upper half of the product register, on a (0, 1) pair it adds the multiplicand, and on (0, 0) or (1, 1) it does nothing; after each step the product register is shifted right arithmetically.
10110110 × 11110110 = 0000001011100100 (−74 × −10 = 740)

Iteration  Step                  Multiplicand  Product
0          initial values        10110110      00000000 11110110 0
1          00: no operation      10110110      00000000 11110110 0
           shift right product   10110110      00000000 01111011 0
2          10: − Multiplicand    10110110      01001010 01111011 0
           shift right product   10110110      00100101 00111101 1
3          11: no operation      10110110      00100101 00111101 1
           shift right product   10110110      00010010 10011110 1
4          01: + Multiplicand    10110110      11001000 10011110 1
           shift right product   10110110      11100100 01001111 0
5          10: − Multiplicand    10110110      00101110 01001111 0
           shift right product   10110110      00010111 00100111 1
6          11: no operation      10110110      00010111 00100111 1
           shift right product   10110110      00001011 10010011 1
7          11: no operation      10110110      00001011 10010011 1
           shift right product   10110110      00000101 11001001 1
8          11: no operation      10110110      00000101 11001001 1
           shift right product   10110110      00000010 11100100 1
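The iteration table can be cross-checked with a small sketch of the algorithm on 8-bit two's-complement operands (names are ours):

```python
def booth_multiply(mcand, mplier, bits=8):
    # Booth's algorithm on `bits`-bit two's-complement operands
    mask = (1 << bits) - 1
    acc = 0                  # upper half of the product register
    q = mplier & mask        # lower half holds the multiplier
    q_1 = 0                  # extra bit to the right of the product
    for _ in range(bits):
        pair = (q & 1, q_1)
        if pair == (1, 0):   # 10: subtract the multiplicand
            acc = (acc - (mcand & mask)) & mask
        elif pair == (0, 1): # 01: add the multiplicand
            acc = (acc + (mcand & mask)) & mask
        # arithmetic shift right of (acc, q, q_1) as one long register
        q_1 = q & 1
        q = ((q >> 1) | ((acc & 1) << (bits - 1))) & mask
        sign = acc >> (bits - 1)
        acc = ((acc >> 1) | (sign << (bits - 1))) & mask
    result = (acc << bits) | q
    if result >> (2 * bits - 1):          # interpret as signed
        result -= 1 << (2 * bits)
    return result

print(booth_multiply(-74, -10))  # 740, matching the table above
```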

7. Interpret the big-endian scheme and the little-endian scheme. If a 32-bit data, 0xd75afb68, is stored in memory from address 100 to address 103, what is the memory content at address 101 for each scheme?
Answer:
Big endian: the most significant byte of any multibyte data field is stored at the lowest memory address.
Little endian: the least significant byte of any multibyte data field is stored at the lowest memory address.

Address   Big endian   Little endian
100       d7           68
101       5a           fb
102       fb           5a
103       68           d7

So the content at address 101 is 0x5a under the big-endian scheme and 0xfb under the little-endian scheme.
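Python's struct module can confirm the two byte orders; byte index 1 of the packed value corresponds to address 101:

```python
import struct

value = 0xD75AFB68
big = struct.pack(">I", value)     # big endian: MSB at the lowest address
little = struct.pack("<I", value)  # little endian: LSB at the lowest address

print(hex(big[1]), hex(little[1]))  # contents at address 101: 0x5a 0xfb
```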


101 年彰師大資工
1. In evaluating the performance of a computer, two performance metrics, response time and throughput, are usually used. Explain the meanings of these two terms.

Answer
Response time (execution time): the total time required for the computer to complete a task.
Throughput (bandwidth): the number of tasks completed per unit of time.

2. Amdahl's law is a rule stating that the performance enhancement possible with a given improvement is limited by the amount that the improved feature is used. Give a computer-related example to explain this rule.

Answer
Suppose a program runs in 100 seconds on a computer, with the multiply operation responsible for 80 seconds of this time. If we make multiplication 5 times faster, the execution time after the improvement can be computed by Amdahl's law as:
Execution time after improvement = (80 / 5) + (100 − 80) = 36 seconds.

3. Many programs have more variables than computers have registers. How does the compiler conquer this problem?

Answer
The compiler tries to keep the most frequently used variables in registers and places the rest in memory, using loads and stores to move variables between registers and memory.
Note: The process of putting less commonly used variables into memory is called spilling registers.

4. In the execution of a procedure, the program must first put the parameters in a place where the procedure can access them, and (after performing the desired task) return control to the point of origin. How can these two steps be achieved in computer hardware for a MIPS machine?

Answer
Registers are the fastest place to hold data in a computer, so the MIPS hardware provides:
$a0-$a3: four argument registers in which to pass parameters
$ra: one return address register to return to the point of origin


5. A designer of a floating-point representation must find a compromise between the size of the fraction and the size of the exponent. The tradeoff is between precision and range. What does that mean?

Answer
Increasing the size of the fraction enhances the precision of the represented numbers, while increasing the size of the exponent increases the range of numbers that can be represented.

6. Shown in the following figure is a portion of the datapath for MIPS hardware. What is the purpose of using an adder here, and why is the value to be added to the PC 4?

[Figure: the PC feeds the instruction memory's read address; an adder adds 4 to the PC]

Answer
The adder increments the PC to obtain the address of the next instruction. A MIPS instruction is 4 bytes long and occupies 4 memory addresses, so the value added to the PC is 4.

7. When an exception occurs, the address of the offending instruction will be saved in the exception program counter (EPC), and control will be transferred to the OS. After performing some required actions, what will then be done by the OS?

Answer
The operating system can then take the appropriate action, which may involve providing some service to the user program, taking some predefined action in response to the exception, or stopping the execution of the program and reporting an error.

8. In a pipelined design with N stages, what is its theoretical speedup and how is its clock rate determined?
Answer
The theoretical speedup is N. The execution time of the longest stage determines the pipeline clock cycle time: if the execution time of the longest stage is T, then the clock rate is 1 / T.

9. When the two instructions are executed in the pipelined design in the following figure, what kind of hazard will occur? Why?

[Figure: pipelined datapath fragment with sub $2, $1, $3 between the ID/EX and EX/MEM registers at the ALU, followed by add $12, $2, $5]

Answer
A data hazard will occur. As shown in the multiple-clock-cycle pipeline diagram below, without intervention the sub instruction does not write its result to register $2 until the 5th clock cycle, but the add instruction fetches the (wrong) data from register $2 in the 3rd clock cycle.

                   c1   c2   c3   c4    c5    c6
sub $2, $1, $3     IF   ID   EX   MEM   WB
add $12, $2, $5         IF   ID   EX    MEM   WB
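The hazard in this example can be detected mechanically by comparing register fields; a minimal sketch (not the full forwarding-unit equations):

```python
def raw_hazard(producer_dst, consumer_srcs):
    # RAW hazard: a later instruction reads a register
    # that an earlier, still-unfinished instruction writes
    return producer_dst in consumer_srcs

# sub $2, $1, $3 writes $2; add $12, $2, $5 reads $2 and $5
print(raw_hazard("$2", ["$2", "$5"]))  # True
```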

10. Suppose on a store instruction, the data is written only into the cache (without changing main memory), then the memory would have a different value from that in the cache. Two ways to solve this are write-through and write-back. Please explain their meanings. Answer Write-through: writes always update both the cache and the next lower level of the memory hierarchy, ensuring that data is always consistent between the two. Write-back: writes update values only to the block in cache, and then write the modified block to the lower level of the hierarchy when the block is replaced.


100 年彰師大資工
1. Determine if each of the following statements is true (T) or false (F).
(a) The number of variables in many programs is larger than that of registers in computers. The compiler tries to keep the most frequently used variables in memory and places the rest in registers.
(b) Pseudo-instructions are variations of assembly language instructions. They need not be implemented in hardware.
(c) In the MIPS instructions, the R-type and I-type instructions can be distinguished by the 6-bit op field of an instruction.
(d) Pipelining does not reduce the latency of a single task, but it improves the throughput of an entire workload.
(e) In the memory hierarchy design, if the hit rate is high enough, the memory hierarchy has an effective access time close to that of the slowest level and a size equal to that of the smallest level.
Answer
(a) False (the most frequently used variables are placed in registers)
(b) True
(c) False (R-type instructions are distinguished by the function field)
(d) True
(e) False (the effective access time is close to that of the highest, fastest level, and the size is equal to that of the lowest, largest level)

2. Explain the following terms. (a) PC-relative addressing (b) Structural pipeline hazard (c) Branch prediction (d) Fully associative cache (e) Page table
Answer
(a) PC-relative addressing: an addressing regime in which the address is the sum of the program counter (PC) and a constant in the instruction.
(b) Structural pipeline hazard: when a planned instruction cannot execute in the proper clock cycle because the hardware does not support the combination of instructions that are set to execute.
(c) Branch prediction: a method of resolving a branch hazard that assumes a given outcome for the branch and proceeds from that assumption rather than waiting to ascertain the actual outcome.
(d) Fully associative cache: a cache structure in which a block can be placed in any location in the cache.

(e) Page table: the table containing the virtual-to-physical address translations in a virtual memory system. The table, which is stored in memory, is typically indexed by the virtual page number; each entry in the table contains the physical page number for that virtual page if the page is currently in memory.

3. Answer the following questions.
(a) Suppose a program runs in 100 seconds on a computer, with 80 seconds of multiply operations in this time. How much do you have to improve the speed of multiplication if you want the program to run four times faster?
(b) For an array stored in memory, write a MIPS instruction to load the value of A[3] into register $t1. Assume byte addressing is used here and the base address of A is stored in register $s1.
(c) When the jal (jump-and-link) instruction is executed, it will jump to an address and simultaneously save the address of the following instruction in $ra. What is the purpose for this address saving?
(d) Why does the IEEE 754 floating-point standard use a bias of 127 for single precision and 1023 for double precision?
(e) Compared with the single-cycle design, what is the advantage of a multicycle design?
Answer
(a) Speedup = 4 = 100 / (80/x + 20), so x = 16. Multiplication should be improved by a factor of 16.
(b) lw $t1, 12($s1)
(c) The address in $ra is called the return address; it allows a procedure to return to the proper point in the caller.
(d) Most floating-point operations must first compare exponents to decide which number's exponent to align to. With a biased representation, the exponent fields can be compared directly as unsigned values to determine which exponent is larger, without considering their signs, which speeds up the comparison.
(e) The ability to allow instructions to take different numbers of clock cycles and the ability to share functional units within the execution of a single instruction.


4. Shown in the following Figure is the datapath for execution of an R-type instruction add $t1, $t2, $t3. Describe the four steps in executing this instruction.

[Figure: single-cycle MIPS datapath highlighting the portions used by the R-type instruction add $t1, $t2, $t3]

Answer
(1) The instruction is fetched, and the PC is incremented.
(2) Two registers, $t2 and $t3, are read from the register file; also, the main control unit computes the setting of the control lines during this step.
(3) The ALU operates on the data read from the register file, using the function code (bits 5:0, the funct field of the instruction) to generate the ALU function.
(4) The result from the ALU is written into the register file, using bits 15:11 of the instruction to select the destination register ($t1).


99 年彰師大資工
1. Determine if each of the following statements is true (T) or false (F).
a. In the IEEE 754 floating-point representation, the precision of represented numbers is determined by the size of the fraction.
b. In computer arithmetic, overflow occurs when adding two positive numbers and the sum is negative.
c. In the memory hierarchy design, address mapping is a process by which a physical address is mapped to a virtual address used for accessing memory.
d. When choosing a block to replace in a direct-mapped cache, all blocks in the cache can be candidates for replacement.
e. If computer X runs a program in 10 seconds and computer Y runs the same program in 15 seconds, we can say that X is 1.5 times faster than Y.
Answer:
a. True
b. True
c. False. A virtual address is mapped to a physical address used for accessing memory.
d. False. Only one block can be the candidate for replacement.
e. True

2. Answer the following questions.
a. When a program is executed on a computer, how is the CPU time calculated from its instruction count, CPI, and clock cycle time?
b. In the MIPS addressing modes, what does immediate addressing mean?
c. When a page fault occurs and all the pages in the main memory are in use, the operating system usually uses the LRU replacement scheme to choose a page to replace. What does this scheme mean?
d. When handling a write request to a memory with cache, what is the write-through scheme?
e. In the design of a MIPS processor, how are parameters passed and result values returned during a procedure call?
Answer:
a. CPU time = instruction count × CPI × clock cycle time
b. Immediate addressing: the operand is a constant within the instruction itself.
c. LRU replacement scheme: the block replaced is the one that has been unused for the longest time.
d. Write-through scheme: writes always update both the cache and the next lower level of the memory hierarchy.
e. The caller puts the parameter values in $a0-$a3 and the callee places the results in $v0-$v1.
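The CPU-time relation in answer 2a can be checked numerically with a tiny sketch (the example numbers are illustrative):

```python
def cpu_time(instruction_count, cpi, clock_cycle_time):
    # CPU time = instruction count x CPI x clock cycle time
    return instruction_count * cpi * clock_cycle_time

# e.g. 10**8 instructions, average CPI 1.6, 2.6 GHz clock
print(round(cpu_time(10**8, 1.6, 1 / 2.6e9), 4))  # 0.0615 seconds
```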

3. Answer the following questions about data hazard.
a. Data hazard is a type of pipeline hazard. One way to solve it is called forwarding; what does that mean?
b. However, the forwarding method cannot be applied in solving the load-use data hazard. Why?
c. One solution to the load-use data hazard is called pipeline stall; what does that mean?
d. Reorder the code segment in Fig. 1 to avoid pipeline stalls when solving the load-use data hazard.

lw  $t1, 0($t0)
lw  $t2, 4($t0)
add $t3, $t1, $t2
sw  $t3, 12($t0)
lw  $t4, 8($t0)
add $t5, $t1, $t4
sw  $t5, 16($t0)
Fig. 1

Answer:
a. Forwarding is a method of resolving a data hazard by retrieving the missing data element from internal buffers rather than waiting for it to arrive from programmer-visible registers or memory.
b. Suppose that we have the following code to execute in the MIPS 5-stage pipeline processor:
lw  $s0, 20($t1)
sub $t2, $s0, $t3
Clearly, there is a load-use data hazard between the two instructions. The following shows the multiple-clock-cycle pipeline diagram.

      c1   c2   c3   c4    c5    c6
lw    IF   ID   EX   MEM   WB
sub        IF   ID   EX    MEM   WB

During c4 the lw instruction is accessing memory, but in the same clock the sub instruction is using the wrong operand ($s0) for computation. Even if we have a forwarding path from the memory access stage output to the execution stage input, we still have to stall the pipeline for 1 clock to avoid such a data hazard.
c. Pipeline stall: a stall initiated in order to resolve a hazard.
d. lw  $t1, 0($t0)
   lw  $t2, 4($t0)
   lw  $t4, 8($t0)
   add $t3, $t1, $t2
   add $t5, $t1, $t4
   sw  $t3, 12($t0)
   sw  $t5, 16($t0)

4. Shown in Fig. 2 is a single-cycle implementation of the MIPS processor; please explain in detail how a branch-on-equal instruction, e.g., beq $t1, $t2, offset, is executed in the datapath.


Fig. 2
Answer:
1. An instruction is fetched from the instruction memory, and the PC is incremented.
2. Two registers, $t1 and $t2, are read from the register file.
3. The ALU performs a subtract on the data values read from the register file.
4. The value of PC + 4 is added to the sign-extended, lower 16 bits of the instruction (offset) shifted left by two; the result is the branch target address.
5. The Zero result from the ALU is used to decide which adder result to store into the PC.


98 年彰師大資工
1. Determine if each of the following statements is true (T) or false (F).
(a) One factor that determines the performance of a machine is the instruction count, which is determined by the implementation of the processor.
(b) By using the addi instruction along with a negative number, a subtract immediate operation can be performed.
(c) For the multi-cycle implementation of the MIPS processor, a functional unit can be used more than once per instruction, as long as it is used in different clock cycles.
(d) For the pipelined implementation of the MIPS processor, the performance is improved by decreasing the execution time of an individual instruction.
(e) The design of a memory hierarchy is to present the user with as much memory as is available in the cheapest technology.
Answer:
(a) False. (The algorithm, programming language, compiler, and instruction set architecture determine the instruction count, not the processor implementation.)
(b) True.
(c) True.
(d) False. (Pipelining improves instruction throughput rather than individual instruction execution time.)
(e) False. (The design of a memory hierarchy is to present the user with as much memory as is available in the cheapest technology, while providing access at the speed offered by the fastest memory.)

2. For the load instruction lw $t1, offset($t2), explain its five steps of execution.
Answer:
1. An instruction is fetched from the instruction memory, and the PC is incremented.
2. A register ($t2) value is read from the register file.
3. The ALU computes the sum of the value read from the register file and the sign-extended, lower 16 bits of the instruction (offset).
4. The sum from the ALU is used as the address for the data memory.
5. The data from the memory unit is written into the register file; the register destination is given by bits 20:16 of the instruction ($t1).


3. Answer the following questions.
(a) When the jal (jump-and-link) instruction is executed, what will be done?
(b) We know that pseudoinstructions need not be implemented in hardware; how can they be executed by hardware?
(c) In the memory hierarchy design, what is a direct-mapped cache?
(d) Data hazard is a type of pipeline hazard; what is a data hazard?
(e) When handling a write request to a memory with cache, what is the write-back scheme?
Answer:
(a) It jumps to an address and simultaneously saves the address of the following instruction in a register ($ra in MIPS).
(b) The assembler translates the pseudoinstructions into instructions that the machine can accept.
(c) In a direct-mapped cache, each memory location is mapped to exactly one location in the cache.
(d) An occurrence in which a planned instruction cannot execute in the proper clock cycle because data that is needed to execute the instruction is not yet available.
(e) A write-back scheme updates values only to the block in the cache, then writes the modified block to the lower level of the hierarchy when the block is replaced.


4. Shown in Fig. 1 is a single-cycle implementation of the MIPS processor; answer the following questions.
(a) What does "single-cycle" mean for this MIPS implementation?
(b) What is the purpose of sending the value of the PC register to an adder?
(c) What is the purpose of using multiple levels of decoding to generate the actual ALU control bits?
(d) What is the purpose of performing "Shift left 2" on the output bits of "Sign extend"?
(e) What is the purpose of sending the Zero signal of the ALU to an AND gate?

[Fig. 1: the single-cycle MIPS datapath — PC, instruction memory, register file, sign extend, shift left 2, ALU with a separate ALU control unit, data memory, multiplexers, and the main control unit generating RegDst, Branch, MemRead, MemtoReg, ALUOp, MemWrite, ALUSrc, RegWrite, and PCSrc.]

Answer:
(a) In a single-cycle implementation, an instruction is executed in one clock cycle.
(b) To increment the PC to obtain the address of the next sequential instruction (4 bytes later).
(c) Using multiple levels of control can reduce the size of the main control unit and may also potentially increase its speed.
(d) To increase the effective range of the offset field of the branch instruction by a factor of 4.
(e) The Zero output of the ALU, which is used for equality comparison, is ANDed with the Branch control signal to determine whether a branch instruction is taken or not.


101 年台科大資工 1.

You are going to enhance a machine, and there are two possible implementations: either make multiply instructions run 4 times faster than before, or make memory access instructions run 2 times faster than before. You repeatedly run a program that takes 100 seconds to execute. Of this time, 20% is used for multiplication, 50% for memory access instructions, and 30% for other tasks. (1) What will the speedup be if you improve only multiplication? (2) What will the speedup be if you improve only memory access? (3) What will the speedup be if both implementations are made?

Answer
(1) Speedup = 1 / (0.2/4 + 0.8) = 1.18
(2) Speedup = 1 / (0.5/2 + 0.5) = 1.33
(3) Speedup = 1 / (0.2/4 + 0.5/2 + 0.3) = 1.67

2.

Assuming a cache of 4096 blocks, a four-word block size (1 word = 4 bytes), and a 32-bit address, find the total number of sets and the total number of tag bits for (1) a direct-mapped cache (2) a 4-way set-associative cache (3) a fully associative cache.

Answer

                     (1) direct-mapped    (2) 4-way set-associative   (3) fully associative
Number of sets       4096                 1024                        1
Number of tag bits   16 × 4K = 64 Kbits   18 × 4K = 72 Kbits          28 × 4K = 112 Kbits

3.
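The set and tag-bit counts can be reproduced with a short sketch (function name is mine):

```python
import math

def cache_tag_bits(blocks, words_per_block, assoc, addr_bits=32):
    """Total tag storage for a cache: tag width times number of blocks.
    assoc = 1 is direct-mapped; assoc = blocks is fully associative."""
    offset_bits = int(math.log2(words_per_block * 4))  # byte offset within a block
    sets = blocks // assoc
    index_bits = int(math.log2(sets))
    tag_bits = addr_bits - index_bits - offset_bits
    return sets, tag_bits * blocks                     # (sets, total tag bits)

print(cache_tag_bits(4096, 4, 1))     # (4096, 65536)  = 64 Kbits
print(cache_tag_bits(4096, 4, 4))     # (1024, 73728)  = 72 Kbits
print(cache_tag_bits(4096, 4, 4096))  # (1, 114688)    = 112 Kbits
```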

Suppose we have a processor with a base CPI of 1.0, assuming all references hit in the primary cache, and a clock rate of 1 GHz. Assume a main memory access time of 100 ns, including all the miss handling. Suppose the miss rate per instruction at the primary cache is 5%.
(1) Compute the effective CPI with one level of caching.
(2) How much faster will the machine be if we add a secondary (L2) cache that has a 20 ns access time and is large enough to reduce the miss rate to main memory to 1%?

Answer
(1) CPI for the one-level cache = 1 + 0.05 × 100 = 6
(2) CPI for the two-level cache = 1 + 0.05 × 20 + 0.01 × 100 = 3
The two-level cache machine is 6 / 3 = 2 times faster than the one-level cache machine.
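The effective-CPI arithmetic generalizes to any number of cache levels; a small sketch (function name is mine):

```python
def effective_cpi(base_cpi, stalls):
    """Base CPI plus memory-stall cycles per instruction; stalls is a
    list of (miss rate per instruction, miss penalty in cycles)."""
    return base_cpi + sum(rate * penalty for rate, penalty in stalls)

# One level: every L1 miss pays the 100-cycle main-memory penalty.
one_level = effective_cpi(1.0, [(0.05, 100)])
# Two levels: every L1 miss pays 20 cycles to reach L2; the 1% that
# also miss in L2 pay the 100-cycle main-memory penalty on top.
two_level = effective_cpi(1.0, [(0.05, 20), (0.01, 100)])
print(one_level, two_level, one_level / two_level)  # 6.0 3.0 2.0
```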


101 年台科大電子 1.

If computer A runs a program in 15 seconds and computer B runs the same program in 25 seconds, how much faster is A than B?

Answer: Computer A is 25 / 15 = 1.67 times faster than computer B.

2.

A compiler designer is trying to decide between two code sequences for a particular computer. The hardware designers have supplied the following facts:

CPI for each instruction class:
       A   B   C
CPI    1   3   4

For a particular high-level language statement, the compiler writer is considering two code sequences that require the following instruction counts:

                 Instruction counts for each instruction class
Code sequence    A   B   C
1                1   2   2
2                3   2   1

(1) What is CPI? (2) Which code sequence executes the most instructions? (3) Which will be faster? (4) What is the CPI for each sequence?

Answer
(1) CPI is the average number of clock cycles per instruction for a program or a program fragment.
(2) Code sequence 2 executes the most instructions: 3 + 2 + 1 = 6.
(3) CPU clock cycles for code sequence 1 = 1 × 1 + 2 × 3 + 2 × 4 = 15
CPU clock cycles for code sequence 2 = 3 × 1 + 2 × 3 + 1 × 4 = 13
So code sequence 2 is faster than code sequence 1.
(4) CPI for code sequence 1 = (1 × 1 + 2 × 3 + 2 × 4) / (1 + 2 + 2) = 3
CPI for code sequence 2 = (3 × 1 + 2 × 3 + 1 × 4) / (3 + 2 + 1) = 2.17

3.
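The cycle counts and per-sequence CPIs can be reproduced with a short sketch (variable names are mine):

```python
# Cycles per instruction class, and the two candidate instruction mixes.
CPI = {"A": 1, "B": 3, "C": 4}
seq1 = {"A": 1, "B": 2, "C": 2}
seq2 = {"A": 3, "B": 2, "C": 1}

def cycles(counts):
    """Total clock cycles for a code sequence."""
    return sum(n * CPI[cls] for cls, n in counts.items())

def avg_cpi(counts):
    """Average CPI = total cycles / total instruction count."""
    return cycles(counts) / sum(counts.values())

print(cycles(seq1), cycles(seq2))              # 15 13
print(avg_cpi(seq1), round(avg_cpi(seq2), 2))  # 3.0 2.17
```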

Please explain the following terms: (1) What are structural hazards? (2) What are data hazards? (3) What are control hazards?

Answer

(1) When a planned instruction cannot execute in the proper clock cycle because the hardware does not support the combination of instructions that are set to execute. (2) When a planned instruction cannot execute in the proper clock cycle because data that is needed to execute the instruction is not yet available. (3) When a planned instruction cannot execute in the proper pipeline clock cycle because the instruction that was fetched is not the one that is needed; that is, the flow of instruction addresses is not what the pipeline expected. 4.

Assume a processor with a 1 ns clock cycle time, a miss penalty of 16 clock cycles, a miss rate of 0.07 misses per instruction, and a cache access time (including hit detection) of 1 clock cycle. Note that the read and write miss penalties are the same, and ignore other write stalls. What is the average memory access time per instruction?

Answer: AMAT per instruction = (1 + 0.07 × 16) × 1ns = 2.12ns
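The AMAT formula can be sketched directly (function name is mine):

```python
def amat(hit_time, miss_rate, miss_penalty, cycle_time_ns=1.0):
    """Average memory access time: hit time plus the miss-rate-weighted
    penalty, all in clock cycles, scaled to nanoseconds."""
    return (hit_time + miss_rate * miss_penalty) * cycle_time_ns

print(round(amat(1, 0.07, 16), 2))  # 2.12
```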

5.

Please explain the following terms: (1) What is a direct-mapped cache? (2) What is a fully-associative cache? (3) What is a set-associative cache?

Answer
(1) Direct-mapped cache: a cache structure in which each memory location is mapped to exactly one location in the cache.
(2) Fully-associative cache: a cache structure in which a block can be placed in any location in the cache.
(3) Set-associative cache: a cache that has a fixed number of locations (at least two) where each block can be placed.

6.

Explain the relationship of virtual memory, TLB, and caches in detail.

Answer The TLB contains a subset of the virtual-to-physical page mappings that are in the page table. On every reference, we look up the virtual page number in the TLB. Under the best of circumstances, a virtual address is translated by the TLB and sent to the cache where the appropriate data is found, retrieved, and sent back to the processor.


100 年台科大電子 1.

Consider a MIPS processor with a five-stage pipeline and an internal forwarding mechanism as described below:
Stage 1: IF   Instruction fetch
Stage 2: ID   Instruction decode and register read
Stage 3: EXE  Execution or address calculation
Stage 4: MEM  Data memory access
Stage 5: WB   Write back
How many clock cycles does it need for the following code? Please also identify all the hazards with a figure.

lw  $2, 20($1)
and $4, $2, $5
or  $8, $2, $6
add $9, $4, $2
slt $1, $6, $7

Answer
Clock cycles = (5 – 1) + 5 + 1 = 10 (the extra cycle is the load-use stall between lw and and)

                  1   2   3   4   5   6   7   8   9   10
lw  $2, 20($1)    IF  ID  EX  ME  WB
and $4, $2, $5        IF  ID  **  EX  ME  WB
or  $8, $2, $6            IF  **  ID  EX  ME  WB
add $9, $4, $2                    IF  ID  EX  ME  WB
slt $1, $6, $7                        IF  ID  EX  ME  WB

There are data hazards between the instruction pairs (1, 2), (1, 3), and (2, 4).

2. Please explain the following terms:
(a) Parallel processing program
(b) Process-level parallelism
(c) Instruction-level parallelism
(d) Cache coherence

Answer
(a) Parallel processing program: a single program that runs on multiple processors simultaneously.
(b) Process-level parallelism: utilizing multiple processors by running independent programs simultaneously.
(c) Instruction-level parallelism: the parallelism among instructions.
(d) Cache coherence: consistency in the value of data between the versions in the caches of several processors.

3. In a computer memory hierarchy, a memory reference can encounter three different types of misses: a TLB miss, a page fault, and a cache miss. For each of the combinations (a)–(e), answer Yes or No: is it possible?

     TLB    Page Table   Cache
(a)  Hit    Miss         Hit
(b)  Hit    Hit          Miss
(c)  Hit    Miss         Miss
(d)  Miss   Hit          Miss
(e)  Miss   Miss         Hit

Answer
(a) No   (b) Yes   (c) No   (d) Yes   (e) No
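The Yes/No pattern follows from two invariants: a TLB hit implies the translation is in the page table, and a cache hit implies the page is resident in memory. A small sketch of that rule (names are mine):

```python
def possible(tlb, page_table, cache):
    """A TLB entry is a cached page-table entry, so a TLB hit with a
    page-table miss is impossible; likewise, data can only be in the
    cache if its page is resident, so a cache hit rules out a page fault."""
    if tlb == "hit" and page_table == "miss":
        return False
    if cache == "hit" and page_table == "miss":
        return False
    return True

cases = [("hit", "miss", "hit"), ("hit", "hit", "miss"),
         ("hit", "miss", "miss"), ("miss", "hit", "miss"),
         ("miss", "miss", "hit")]
print([possible(*c) for c in cases])  # [False, True, False, True, False]
```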

4. Please answer True or False to each of the following statements.
(a) More powerful instructions mean higher performance.
(b) Computers at low utilization imply low power consumption.
(c) To benefit from a multiprocessor, an application must be concurrent.
(d) GPUs rely on graphics DRAM chips to reduce memory latency and thereby increase performance on graphics applications.
(e) There is no way to reduce compulsory misses.

Answer
(a) False
(b) False (even at low utilization, a computer consumes a large fraction of its peak power)
(c) False
(d) False (graphics DRAM increases bandwidth; it does not reduce latency)
(e) False (increasing the block size reduces compulsory misses)


99 年台科大電子

1. Please describe the difference between CISC and RISC machines, and give an example for each of them.

Answer:
(1) Instruction sets can be divided by design goal into reduced instruction set computers (RISC) and complex instruction set computers (CISC). A RISC instruction set takes CPU execution efficiency as its main consideration, so it provides only simple basic instructions in order to achieve faster instruction execution. A CISC instruction set contains not only basic instructions but also more powerful complex instructions; its main consideration is to provide a better programming environment and reduce the programmer's burden.
(2) MIPS is a RISC machine, while Intel IA-32 is a CISC machine.

2. In the pipeline structure, there are situations called hazards where the next instruction cannot execute in the following clock cycle. Based on their causes, hazards can be divided into three types: structural, data and control hazards. Please explain the cause of each type in detail.

Answer:
(1) Structural hazards: the hardware resources are insufficient, so multiple instructions that need the same resource in the same clock cycle cannot all execute.
(2) Data hazards: an instruction in the pipeline needs a result that an earlier instruction (still in the pipeline) has not yet produced (a data dependency); that is, the data the instruction needs is not yet available.
(3) Control hazards: before a branch instruction has decided whether to branch, subsequent instructions have already entered the pipeline; if the branch is taken, the instruction execution order is wrong. Control hazards are caused by branch instructions and other instructions that change the PC, so they are also called branch hazards.

3. Suppose we have a 32-bit computer with an instruction set that supports immediate instructions as shown below:

Opcode   Source Register   Destination Register   Immediate
6 bits   5 bits            5 bits                 16 bits

(a) How many registers at most does this computer have?
(b) How many operations at most can this computer have?
(c) What is the range of the number in the Immediate field in 2's complement format?

Answer:

(a) 2^5 = 32 registers
(b) 2^6 = 64 operations
(c) -2^15 ~ +2^15 – 1 = -32768 ~ +32767

4. Suppose the execution times for 3 programs on 3 machines are given below:

           Program X   Program Y   Program Z
Machine A  50          30          10
Machine B  10          30          20
Machine C  20          15          10

(a) Using Program X, Y and Z as benchmarks, please calculate the unweighted arithmetic means of the execution times for the 3 machines.
(b) Using Program X, Y and Z as benchmarks, please calculate the geometric means of the execution times for the 3 machines, with Machine A as the reference machine (as in the SPEC benchmarks).
(c) Which machine has the overall highest performance? Please justify your answer in detail.

Answer:
           Program X           Program Y           Program Z
           ExTime  SPEC ratio  ExTime  SPEC ratio  ExTime  SPEC ratio
Machine A  50      1           30      1           10      1
Machine B  10      5           30      1           20      0.5
Machine C  20      2.5         15      2           10      1

(a) AM of execution times: Machine A = 30, Machine B = 20, Machine C = 15
(b) GM of SPEC ratios: Machine A = 1, Machine B = 1.36, Machine C = 1.71
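The arithmetic and geometric means can be reproduced with a short sketch (variable names are mine):

```python
import math

times = {"A": [50, 30, 10], "B": [10, 30, 20], "C": [20, 15, 10]}
ref = times["A"]  # Machine A is the SPEC reference machine

for machine, t in times.items():
    am = sum(t) / len(t)                      # arithmetic mean of execution times
    ratios = [r / x for r, x in zip(ref, t)]  # SPEC ratio = reference time / time
    gm = math.prod(ratios) ** (1 / len(ratios))
    print(machine, am, round(gm, 2))
# A 30.0 1.0
# B 20.0 1.36
# C 15.0 1.71
```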

(c) Machine C has the highest performance, since its arithmetic mean of execution times is the smallest (and its geometric mean of SPEC ratios is the largest) among the three machines.

5. Suppose we have a direct-mapped cache design with its 32-bit address divided as follows:

Tag (bits 31-10)   Index (bits 9-4)   Offset (bits 3-0)

(a) What is the cache line size (in words)?
(b) How many entries does the cache have?
(c) What is the ratio between the total bits required for such a cache implementation and the data storage bits?
Starting from power on, suppose the following byte-addressed cache references are recorded: 0, 4, 16, 132, 232, 160, 1024, 30, 140, 3100, 180, 2180.
(d) List the final state of the cache, with each valid entry represented as a record of <index, tag, data>.

Answer:
(a) The cache line size = 2^2 = 4 words.
(b) The number of entries = 2^6 = 64.
(c) The ratio is (1 + 22 + 128) / 128 = 1.179 (valid bit + tag bits + data bits per entry, over the data bits).
(d)
Byte address   Block address   Tag   Index   Hit/Miss
0              0               0     0       Miss
4              0               0     0       Hit
16             1               0     1       Miss
132            8               0     8       Miss
232            14              0     14      Miss
160            10              0     10      Miss
1024           64              1     0       Miss
30             1               0     1       Hit
140            8               0     8       Hit
3100           193             3     1       Miss
180            11              0     11      Miss
2180           136             2     8       Miss

Final cache state:
< 000000₂, 0001₂, mem[1024] >
< 000001₂, 0011₂, mem[3100] >
< 001000₂, 0010₂, mem[2180] >
< 001010₂, 0000₂, mem[160] >
< 001011₂, 0000₂, mem[180] >
< 001110₂, 0000₂, mem[232] >
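The hit/miss column can be reproduced by simulating the direct-mapped cache (a sketch; names are mine):

```python
def simulate(addresses, num_lines=64, line_bytes=16):
    """Direct-mapped cache trace: each line holds the tag of the block
    currently resident at that index (no tag means the line is invalid)."""
    cache = {}
    results = []
    for a in addresses:
        block = a // line_bytes
        index, tag = block % num_lines, block // num_lines
        hit = cache.get(index) == tag
        results.append("Hit" if hit else "Miss")
        cache[index] = tag  # on a miss, the new block replaces the old one
    return results, cache

refs = [0, 4, 16, 132, 232, 160, 1024, 30, 140, 3100, 180, 2180]
results, final = simulate(refs)
print(results)
# ['Miss', 'Hit', 'Miss', 'Miss', 'Miss', 'Miss', 'Miss', 'Hit', 'Hit', 'Miss', 'Miss', 'Miss']
```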

98 年台科大電子

1. "Multi-core" has become a popular technology for new generation processors. However, the performance gained by using a multi-core processor does not scale with the number of cores on the chip. Why?

Answer:
The performance gained by using a multi-core processor depends on the problem being solved and the algorithms used, as well as on their implementation in software (Amdahl's law). The software must be designed to take advantage of the available parallelism; if it hasn't been, there will not be any speedup at all. However, the processor will multitask better, since it can run two programs at once, one on each core.

2. Please explain the following terms:
• Simultaneous multithreading
• Thread-level parallelism
• Branch target buffer (BTB)
• Precise interrupt
• Branch prediction

Answer:
Simultaneous multithreading: a variation on hardware multithreading that uses the resources of a multiple-issue, dynamically scheduled (superscalar) processor to exploit both program ILP and thread-level parallelism (TLP).
Thread-level parallelism: parallelism from multiple thread contexts in the CPU.
Branch target buffer: a structure that caches the destination PC or destination instruction for a branch. It is usually organized as a cache with tags, making it more costly than a simple prediction buffer.
Precise interrupt: an interrupt or exception that is always associated with the correct instruction in pipelined computers.
Branch prediction: prediction of branch outcomes at runtime.

3. What does TLB mean? Which stage(s) should it be used in for a five-stage MIPS architecture? Why?

Answer:
The TLB (translation-lookaside buffer) is a cache that keeps track of recently used address mappings to avoid an access to the page table. The diagram below shows the stages (IF and EX) where the TLB is placed in the MIPS R3000, since the TLB must be accessed before memory is referenced.


[Diagram: one TLB lookup translates the instruction address in the IF stage, and another translates the data address in the EX stage, before the respective memory accesses.]

4. Is it possible to have structural, data, and branch hazards in single-cycle, multi-cycle, and pipelined implementations, respectively? Why?

Answer:
It is impossible to have hazards in single-cycle and multi-cycle machines, since instructions are executed sequentially; an instruction is not executed before its immediately preceding instruction has completed. In a pipelined implementation, hazards can occur since the execution of instructions is overlapped.

5. Please use Verilog or VHDL to design a 4-bit 3-input multiplexer.
The 4-bit inputs: A, B, C; the 2-bit selection input: sel; the 4-bit output: y_out;
y_out = A when sel = 00, y_out = B when sel = 01, y_out = C when sel = 10, y_out = 0 when sel = 11.

Answer:

library ieee;
use ieee.std_logic_1164.all;

entity MUX is
  port (A, B, C: in std_logic_vector(3 downto 0);
        sel: in std_logic_vector(1 downto 0);
        y_out: out std_logic_vector(3 downto 0));
end MUX;

architecture archmux of MUX is
begin
  with sel select
    y_out <= A when "00",
             B when "01",
             C when "10",
             "0000" when others;
end archmux;

10. What is the IPC performance limit for a dual-issue superscalar? Why?

Answer:
A dual-issue superscalar has a best-case IPC of 2, since at most two instructions are completed per cycle.


11. If you run the following C code on a big-endian CPU, what will be the output in the console?

char *cp;
long data = 0x12345678;
cp = (char*)&data;
printf("%x", *(cp+2));

Answer: 56 (on a big-endian machine the bytes of 0x12345678 are stored in order 12 34 56 78, so cp+2 points at the byte 0x56)

12. Design a circuit which can convert an 8-bit signed integer to a 16-bit signed integer. The input is A7, A6, ..., A0 (A7 is the MSB). The output is B15, B14, ..., B0 (B15 is the MSB).

Answer: Sign extension — the sign bit A7 drives all of the upper output bits, and the low byte passes straight through:
B15 = B14 = B13 = B12 = B11 = B10 = B9 = B8 = A7
B7 = A7, B6 = A6, B5 = A5, B4 = A4, B3 = A3, B2 = A2, B1 = A1, B0 = A0
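The byte-order reasoning in question 11 can be checked in Python (a sketch; '>i' forces big-endian packing of a 4-byte integer, matching the 4-byte long assumed in the question):

```python
import struct

data = struct.pack(">i", 0x12345678)  # big-endian byte order
print([hex(b) for b in data])  # ['0x12', '0x34', '0x56', '0x78']
print("%x" % data[2])          # 56 — the byte *(cp+2) reads
```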


101 年東華資工 1.

(1) Describe the advantages and disadvantages of pipelines. (2) Describe three types of pipeline hazards and the solutions for each type.

Answer
(1) Advantages: pipelining overlaps the execution of multiple instructions (increasing instruction throughput) to improve performance.
Disadvantages: higher hardware (area) overhead and a more complex implementation.
(2)
Structural hazard — the hardware cannot support the combination of instructions executing in the same clock cycle (limited resources). Solution: add enough hardware (e.g., use two memories, one for instructions and one for data).
Data hazard — an instruction attempts to use an item before it is ready (data dependency). Software solutions: (a) have the compiler insert enough no-operation (nop) instructions; (b) reorder the instructions so the hazard disappears. Hardware solutions: (a) use forwarding to feed the result to the later instruction that needs it; (b) for a load-use hazard, which forwarding alone cannot resolve, stall one clock cycle first and then forward.
Control hazard — an instruction attempts to make a decision before the branch condition is evaluated (branch instructions). Software solutions: (a) have the compiler insert enough nop instructions; (b) use the delayed branch. Hardware solutions: (a) move the branch decision earlier in the pipeline to reduce the number of instructions that must be flushed when a control hazard occurs; (b) use branch prediction (static or dynamic) — when the prediction is correct the pipeline runs at full speed and performance is not degraded by branches; only when the prediction is wrong must the wrongly fetched instructions be flushed.


2. Assume the miss rate of an instruction cache is 2% and the miss rate of the data cache is 4%. If a processor has a CPI of 2 without any memory stalls and the miss penalty is 100 cycles for all misses, determine how much faster a processor would run with a perfect cache that never missed. Assume the frequency of all loads and stores is 36%. Answer CPI considering miss penalty = 2 + 0.02 × 100 + 0.04 × 0.36 × 100 = 5.44 The processor with a perfect cache is 5.44 / 2 = 2.72 times faster

3. A hard disk has a seek time of 4 ms, a 100 MB/sec transfer rate, and 0.2 ms controller overhead. What is the average time to read or write a 512-byte sector for the disk rotating at 15,000 RPM? Assume that the disk is idle so that there is no waiting time.

Answer
Disk access time = 4 ms + 0.5 / (15000/60) s + 0.5 KB / (100 MB/s) + 0.2 ms
                 = 4 ms + 2 ms + 0.005 ms + 0.2 ms = 6.205 ms
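The disk-access arithmetic can be reproduced with a short sketch (function name is mine; the exam approximates the 512-byte sector as 0.5 KB, which agrees to three decimal places):

```python
def disk_access_ms(seek_ms, rpm, sector_bytes, transfer_mb_s, ctrl_ms):
    """Average access time = seek + half a rotation + transfer + controller."""
    rotation_ms = 0.5 / (rpm / 60) * 1000
    transfer_ms = sector_bytes / (transfer_mb_s * 1e6) * 1000
    return seek_ms + rotation_ms + transfer_ms + ctrl_ms

print(round(disk_access_ms(4, 15000, 512, 100, 0.2), 3))  # 6.205
```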

4. (1) Explain the following items in a memory system. (a) Locality of reference. (b) Spatial locality. (c) Temporal locality.
(2) Briefly describe three techniques used for performing I/O data transfer.

Answer
(1)
(a) Locality of reference: at any point in time, a program accesses only a relatively small portion of its address space.
(b) Spatial locality: if an item is referenced, items whose addresses are close by will tend to be referenced soon.
(c) Temporal locality: if an item is referenced, it will tend to be referenced again soon.
(2)
Polling: the processor periodically checks the status of an I/O device to determine whether the device needs service.
Interrupt: I/O devices raise interrupts to indicate to the processor that they need attention.
DMA: the DMA approach gives a device controller the ability to transfer data directly to or from memory without involving the processor.


101 年海洋資工

1. (1) With x = 1111 1111 1111 1111 1011 0011 0101 0011₂ and y = 0000 0000 0000 0000 0000 0010 1101 0111₂ representing two's complement signed integers, perform, showing all work: (a) x – y (b) x / y
(2) With x = 0100 0110 1101 1000 0000 0000 0000 0000₂ and y = 1011 1110 1110 0000 0000 0000 0000 0000₂ representing single-precision IEEE 754 floating-point numbers, perform, showing all work: (a) x + y (b) x × y

Answer
(1) (a) 1111 1111 1111 1111 1011 0000 0111 1100₂ (= –20356)
    (b) 1111 1111 1111 1111 1111 1111 1110 0101₂ (= –19629 / 727 = –27)
(2) (a) 0 10001101 10101111111111100100000 (= 27647.5625)
    (b) 1 10001100 01111010000000000000000 (= –12096)

2.

Identify all of the data dependencies in the following code. Which dependencies are data hazards that will be resolved via forwarding? Which dependencies are data hazards that will cause a stall?

add $3, $4, $2
sub $5, $3, $1
lw  $6, 200($3)
add $7, $3, $6

Answer
Data dependencies:                  (1, 2), (1, 3), (1, 4), (3, 4)
Data hazards:                       (1, 2), (1, 3), (3, 4)
Resolved via forwarding:            (1, 2), (1, 3)
Cause a stall (load-use hazard):    (3, 4)


3.

Consider a virtual memory system with the following properties:
• 40-bit virtual byte address
• 16 KB pages
• 36-bit physical byte address

What is the total size of the page table for each process on this processor, assuming that the valid, protection, dirty, and use bits take a total of 4 bits and that all the virtual pages are in use? (Assume that disk addresses are stored in the page table.)

Answer
Each page is 16 KB, so the page offset takes 14 bits. The virtual page number is 40 – 14 = 26 bits, so the page table has 2^26 entries.
Each entry requires 36 – 14 = 22 bits to store the physical page number, plus an additional 4 bits for the valid, protection, dirty, and use bits. Rounding the 26 bits up to a full 32-bit word per entry gives a total size of 2^26 × 32 bits = 256 MB.
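The page-table sizing can be reproduced with a small sketch (function name is mine):

```python
def page_table_bytes(virt_bits, phys_bits, page_bytes, flag_bits=4, entry_bits=32):
    """Page-table size: one entry per virtual page, each rounded up to a word."""
    offset_bits = page_bytes.bit_length() - 1        # log2 of the page size
    entries = 2 ** (virt_bits - offset_bits)         # virtual page numbers
    ppn_bits = phys_bits - offset_bits               # physical page number width
    assert ppn_bits + flag_bits <= entry_bits        # 22 + 4 fits in a 32-bit word
    return entries * entry_bits // 8

print(page_table_bytes(40, 36, 16 * 1024) // 2**20)  # 256 (MB)
```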

4.

Do we need combinational logic, sequential logic, or a combination of the two to implement each of the following: (1) multiplexer (2) comparator (3) incrementer/decrementer (4) register (5) multiplier with shifters and adders (6) memory (7) ALU (the ones in single-cycle and multiple-cycle datapath) (8) carry look-ahead adder (9) latch (10) general finite state machine (FSM)

Answer
Combinational logic only:            (1), (2), (3), (7), (8)
Sequential logic only:               (4), (6), (9)
Mixed sequential and combinational:  (5), (10)


5.

Design a 4 bit adder/subtractor using the logical gate, ROM, PLA, PAL, VHDL, or Verilog HDL implementation. (choose one among the six approaches)

Answer
The following design uses logic gates to implement a 4-bit adder/subtractor: S = A + B when M = 0, and S = A – B when M = 1.

[Figure: a ripple-carry chain of four full adders taking inputs a3..a0 and b3..b0 and producing s3..s0 with carry-out c4. Each bᵢ input is XORed with the mode bit M before entering its adder, and M also drives the carry-in of the least significant adder, so M = 1 adds the two's complement of B.]
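A behavioral sketch of the same adder/subtractor (a hypothetical bit-level model of the gate design, not the circuit itself):

```python
def add_sub_4bit(a, b, m):
    """4-bit adder/subtractor: each b bit is XORed with the mode bit M,
    and M is the initial carry-in, so M=1 computes a + ~b + 1 = a - b."""
    carry = m
    s = 0
    for i in range(4):
        ai = (a >> i) & 1
        bi = ((b >> i) & 1) ^ m      # XOR with the mode bit
        total = ai + bi + carry      # one full adder
        s |= (total & 1) << i
        carry = total >> 1
    return s, carry                  # 4-bit sum and carry-out c4

print(add_sub_4bit(0b0110, 0b0011, 0))  # (9, 0): 6 + 3
print(add_sub_4bit(0b0110, 0b0011, 1))  # (3, 1): 6 - 3 (carry-out 1 = no borrow)
```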

100 年海洋資工 1.

Modify the below PLA for the controller of a multi-cycle CPU with exception handling, shown in the following state diagram.

[Figure: a PLA with inputs Op5–Op0 and state bits S3–S0, and outputs PCWrite, PCWriteCond, IorD, MemRead, MemWrite, IRWrite, MemtoReg, PCSource1–0, ALUOp1–0, ALUSrcB1–0, ALUSrcA, RegWrite, RegDst, and the next-state bits NS3–NS0 (sequence control).]

Answer
Note: the answer shows only the parts that need to be added or modified; the rest remains the same as in the original PLA.

[Figure: the modified PLA adds "Other" and "Overflow" input terms alongside Op5–Op0 and S3–S0 to reach the exception-handling states; the control-signal outputs (omitted) and the next-state bits NS3–NS0 are extended accordingly.]

2. Given the bit pattern 1010 1101 0001 0000 0000 0000 0000 0010, what does it represent, assuming that it is
(a) A two's complement integer?
(b) An unsigned integer?
(c) A single-precision floating-point number? (IEEE 754: S 1 bit, Exponent 8 bits, Significand 23 bits)

Answer
(a) -2^31 + 2^29 + 2^27 + 2^26 + 2^24 + 2^20 + 2^1
(b) 2^31 + 2^29 + 2^27 + 2^26 + 2^24 + 2^20 + 2^1
(c) -1.00100000000000000000010₂ × 2^-37

3. Refer to the figure below; show each step and the final value of each register in calculating 01011101 ÷ 0101.

[Figure: the first version of the division hardware — a 16-bit Divisor register shifted right each step, a 16-bit Remainder register compared (≥) against it and updated by a 16-bit ALU, and an 8-bit Quotient register shifted left each step.]

Answer

Iteration            Quotient   Divisor             Remainder
0 (initial values)   00000000   00000101 00000000   00000000 01011101
1                    00000000   00000010 10000000   00000000 01011101
2                    00000000   00000001 01000000   00000000 01011101
3                    00000000   00000000 10100000   00000000 01011101
4                    00000000   00000000 01010000   00000000 01011101
5                    00000001   00000000 00101000   00000000 00001101
6                    00000010   00000000 00010100   00000000 00001101
7                    00000100   00000000 00001010   00000000 00001101
8                    00001001   00000000 00000101   00000000 00000011
9                    00010010   00000000 00000010   00000000 00000011
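The iteration table can be reproduced with a sketch of the same shift-and-subtract algorithm (function name is mine):

```python
def divide(dividend, divisor, n=8):
    """First version of the division algorithm: the divisor starts in
    the upper half of a 2n-bit register and shifts right each step."""
    remainder = dividend
    div = divisor << n
    quotient = 0
    for _ in range(n + 1):       # n + 1 iterations for an n-bit quotient
        quotient <<= 1
        if remainder >= div:     # try to subtract the shifted divisor
            remainder -= div
            quotient |= 1
        div >>= 1
    return quotient, remainder

print(divide(0b01011101, 0b0101))  # (18, 3): 93 = 5 * 18 + 3
```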

4. Multicycle Datapath:

lw  $t2, 0($t3)
lw  $t3, 4($t3)
beq $t2, $t3, Label (assume not equal)
add $t5, $t2, $t3
sw  $t5, 8($t3)

What is going on during the 6th and 12th cycles of execution?

Answer
In the multicycle implementation, each lw takes 5 cycles and beq takes 3, so the first lw occupies cycles 1–5, the second lw cycles 6–10, and beq cycles 11–13.
(1) Cycle 6: fetch the instruction "lw $t3, 4($t3)" from memory.
(2) Cycle 12: decode the instruction "beq $t2, $t3, Label" and read registers $t2 and $t3 from the register file.


97 年海洋資工

1. A hard disk has a rotating speed of 7200 rpm, 3 ms seek time, 1 ms controller overhead, 40 MB/s transfer rate and 10 ms queuing delay. What is the disk access time for a 1 KB sector?

Answer:
3 ms + 0.5 / (7200/60) s + 1 KB / (40 MB/s) + 1 ms + 10 ms
= 3 ms + 4.167 ms + 0.025 ms + 1 ms + 10 ms = 18.192 ms

2. A multi-cycle CPU has 3 implementations. The first one is a 5-cycle IF-ID-EX-MEM-WB design running at 4.8 GHz, where load takes 5 cycles, store/R-type 4 cycles and branch/jump 3 cycles. The second one is a 6-cycle design running at 5.6 GHz, with MEM replaced by MEM1 & MEM2. The third is a 7-cycle design running at 6.4 GHz, with IF further replaced by IF1 & IF2. Assume we have an instruction mix: load 26%, store 10%, R-type 49%, branch/jump 15%. Do you think it is worthwhile to go for the 6-cycle design over the 5-cycle design? How about the 7-cycle design, is it worthwhile? Please give your rationales.

Answer:
The average CPI for implementation 1 is: 5 × 0.26 + 4 × 0.1 + 4 × 0.49 + 3 × 0.15 = 4.11
The execution time per instruction in implementation 1 = 4.11 / 4.8 GHz = 0.86 ns
The average CPI for implementation 2 is: 6 × 0.26 + 5 × 0.1 + 4 × 0.49 + 3 × 0.15 = 4.47
The execution time per instruction in implementation 2 = 4.47 / 5.6 GHz = 0.80 ns
The average CPI for implementation 3 is: 7 × 0.26 + 6 × 0.1 + 5 × 0.49 + 4 × 0.15 = 5.47
The execution time per instruction in implementation 3 = 5.47 / 6.4 GHz = 0.85 ns
It is worthwhile to go for the 6-cycle design over the 5-cycle design, but it is not worthwhile to go for the 7-cycle design over the 6-cycle design.
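The CPI and per-instruction time figures can be reproduced with a short sketch (variable names are mine):

```python
ghz = {1: 4.8, 2: 5.6, 3: 6.4}
cpi_by_class = {  # cycles per class: (load, store, R-type, branch/jump)
    1: (5, 4, 4, 3),
    2: (6, 5, 4, 3),
    3: (7, 6, 5, 4),
}
mix = (0.26, 0.10, 0.49, 0.15)  # instruction mix in the same order

for impl, cycles in cpi_by_class.items():
    cpi = sum(c * f for c, f in zip(cycles, mix))
    ns = cpi / ghz[impl]  # average time per instruction in ns
    print(impl, round(cpi, 2), round(ns, 3))
# 1 4.11 0.856
# 2 4.47 0.798
# 3 5.47 0.855
```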


3. Modify the below PLA for the controller of a multi-cycle CPU with exception handling, shown in the following state diagram.

[Figure: a PLA with inputs Op5–Op0 and state bits S3–S0, and outputs PCWrite, PCWriteCond, IorD, MemRead, MemWrite, IRWrite, MemtoReg, PCSource1–0, ALUOp1–0, ALUSrcB1–0, ALUSrcA, RegWrite, RegDst, and the next-state bits NS3–NS0 (sequence control).]

Answer:
Note: the answer shows only the parts that need to be added or modified; the rest remains the same as in the original PLA.

[Figure: the modified PLA adds "Other" and "Overflow" input terms alongside Op5–Op0 and S3–S0 to reach the exception-handling states; the control-signal outputs (omitted) and the next-state bits NS3–NS0 are extended accordingly.]

