



## Structured computer organization solutions manual pdf

1 SOLUTIONS MANUAL COMPUTER ORGANIZATION AND ARCHITECTURE DESIGNING FOR PERFORMANCE EIGHTH EDITION WILLIAM STALLINGS 2 TABLE OF CONTENTS Chapter 1 Introduction... 5 Chapter 2 Computer Evolution and Performance... 6 Chapter 3 Computer Function and Interconnection... 14 Chapter 4 Cache Memory... 19 Chapter 5 Internal Memory...32 Chapter 6 External Memory...38 Chapter 7 Input/Output...43 Chapter 8 Operating System Support...50 Chapter 10 Instruction Sets: Characteristics and Functions...69 Chapter 11 Instruction Sets: Addressing Modes and Formats...80 Chapter 12 Processor Structure and Function...85 Chapter 13 Reduced Instruction Set Computers...92 Chapter 14 Instruction-Level Parallelism and Superscalar Processors...97 Chapter 15 Control Unit Operation Chapter 15 Control Unit Operation Chapter 17 Parallel Processing Chapter 18 Multicore Computers Chapter 19 Number Systems Chapter 20 Digital Logic Chapter 21 The IA-64 Architecture Appendix B Assembly Language and Related Topics 3 CHAPTER 1 INTRODUCTION A NSWERS TO Q UESTIONS 1.1 Computer architecture refers to those attributes that have a direct impact on the logical execution of a program. Computer organization refers to the operational units and their interconnections that realize the architectural specifications. Examples of architectural attributes include the instruction set, the number of bits used to represent various data types (e.g., numbers, characters), I/O mechanisms, and techniques for addressing memory. Organizational attributes include those hardware details transparent to the programmer, such as control signals; interfaces between the computer and peripherals; and the memory technology used. 1.2 Computer structure refers to the operation of each individual component as part of the structure. 1.3 Data processing; data storage; data movement; and control. 1.4 Central processing functions; often simply referred to as processor. Main memory: Stores data. I/O: Moves data between the computer and its external environment. System interconnection: Some mechanism that provides for communication among CPU, main memory, and I/O. A common example of system interconnection is by means of a system bus, consisting of a number of conducting wires to which all the other components attach. 1.5 Control unit: Controls the operation of the CPU and hence the computer Arithmetic and logic unit (ALU): Performs the computer s data processing functions Registers: Provides storage internal to the CPU CPU interconnection: Some mechanism that provides for communication among the control unit, ALU, and registers -5-4 CHAPTER 2 COMPUTER EVOLUTION AND PERFORMANCE A NSWERS TO Q UESTIONS 2.1 In a stored program computer, programs are represented in a form suitable for storing in memory, and a program can be set or altered by setting the values of a portion of memory. 2.2 A main memory, which stores both data and instructions: an arithmetic and logic unit (ALU) capable of operating on binary data; a control unit, which interprets the instructions in memory and causes them to be executed; and input and output (I/O) equipment operated by the control unit. 2.3 Gates, memory cells, and interconnections among gates and memory cells. 2.4 Moore observed that the number of transistors that could be put on a single chip was doubling every year and correctly predicted that this pace would continue into the near future. 2.5 Similar or identical instructions is supported on all members of the family. Thus, a program that executes on one machine will also execute on any other. Similar or identical operating system: The same basic operating system is available for all family members. Increasing speed: The rate of instruction execution increases in going from lower to higher family members. Increasing Number of I/O ports: In going from lower to higher family members. Increasing nemory size: In going from lower to higher family members. members. Increasing cost: In going from lower to higher family members. 2.6 In a microprocessor, all of the components of the CPU are on a single chip. A NSWERS TO P ROBLEMS 2.1 This program is developed in [HAYE98]. The vectors A, B, and C are each stored in 1,000 contiguous locations in memory, beginning at locations 1001, 2001, and 3001, respectively. The program begins with the left half of location 3. A counting variable N is set to 999 and decremented after each step until it reaches 1. Thus, the vectors are processed from high location to low location. -6- 5 Location Instruction Comments Constant (Count N) 1 1 Constant Constant 3L LOAD M(2000) Transfer A(I) to AC 3R ADD M(3000) Compute A(I) + B(I) 4L STOR M(4000) Transfer sum to C(I) 4R LOAD M(0) Load count N 5L SUB M(1) Decrement N by 1 5R JUMP + M(6, 0:19) Halt 6R STOR M(0) Update N 7L ADD M(1) Increment AC by 1 7R ADD M(2) 8L STOR M(3, 8:19) Modify address in 3L 8R ADD M(2) 9L STOR M(3, 28:39) Modify address in 3R 9R ADD M(2) 10L STOR M(4, 8:19) Modify address in 4L 10R JUMP M(3, 0:19) Branch to 3L 2.2 a. Opcode Operand b. First, the CPU must make access memory to fetch the instruction. The instruction contains the address of the data we want to load. During the execute phase accesses memory to load the data value located at that address for a total of two trips to memory. 2.3 To read a value from memory, the CPU puts the address on the address on the address on the address of the value it wants into the MAR. The CPU puts the address of the value it wants into the MAR. transferred to the MBR. To write a value to memory, the CPU puts the address of the value it wants to write into the MAR. The CPU also places the data it wants to write into the MBR. The CPU then asserts the Write control line to memory and places the address on the address bus and the data on the data bus. Memory transfers the data on the data on the data bus into the corresponding memory location. -7- 6 2.4 Address 08A 08B 08C 08D Contents LOAD M(0FA) STOR M(0FB) LOAD M(0FA) JUMP +M(08D) LOAD M(0FA) STOR M(0FB) This program will store the absolute value of content at memory location 0FB. 2.5 All data paths to/from MBR are 40 bits. All data paths to/from MBR are 40 bits. All data paths to/from MBR are 40 bits. MAR are 12 bits. Paths to/from AC are 40 bits. Paths to/from MQ are 40 bits. 2.6 The purpose is to increase performance. When an address is presented to a memory module, there is some time delay before the read or write operation can be performed. While this is happening, an address can be performed. When an address is presented to a memory module. For a series of requests for successive words, the maximum rate is doubled. 2.7 The discrepancy can be explained by noting that other system components aside from clock speed make a big difference in overall systems and advances in I/O processing contribute to the performance ratio. A system is only as fast as its slowest link. In recent years, the bottlenecks have been the performance of memory modules and bus speed. 2.8 As noted in the answer to Problem 2.7, even though the Intel machine may have a faster clock speed (2.4 GHz vs. 1.2 GHz), that does not necessarily mean the system will perform faster. Different systems are not comparable on clock speed. Other factors such as the system components (memory, buses, architecture) and the instruction sets must also be taken into account. A more accurate measure is to run both systems on a benchmark. Benchmark programs exist for certain tasks, such as running office applications, performing floating-point operations, and so on. The systems can be compared to each other on how long they take to complete these tasks. According to Apple Computer, the G4 is comparable or better than a higher-clock speed Pentium on many benchmarks. 2.9 This representation is wasteful because to represent a single decimal digit from 0 through 9 we need to have ten tubes. If we could have an arbitrary number of these tubes ON at the same time, then those same tubes could be treated as binary bits. With ten bits, we can represent 2 10 patterns, or 1024 patterns. For integers, these patterns could be used to represent the numbers from 0 through CPI = 1.55; MIPS rate = 25.8; Execution time = 3.87 ns. Source: [HWAN93] -8-7 2.11 a. () () 10 6 CPI i I i CPI A = I c CPI A = 0.2 s () f = 6 CPI A = 90 6 CPI i I i I c = CPU B = I c CPI B = 104 6 () = 0.2 s () f = 6 CPI B = 104 6 () = 0.2 s b. Although machine B has a higher MIPS than machine A, it requires a longer CPU time to execute the same set of benchmark programs a. We can express the MIPs rate as: [(MIPS than machine A, it requires a longer CPU time to execute the same set of benchmark programs a. We can express the MIPs rate as: [(MIPS than machine B has a higher MIPS than machine B has a highe rate)/10 6] = I c /T. So that: I c = T [(MIPS rate)/10 6]. The ratio of the instruction count of the RS/6000 to the VAX is [x 18]/[12x 1] = 1.5. b. For the RS/6000, CPI = 25/18 = From Equation (2.2), MIPS = I c /(T 10 6) = 100/T. The MIPS values are: Computer & Program Program Arithmetic mean Rank Harmonic mean Computer C Rank -9-8 2.14 a. Normalized to M: Benchmark Processor R M Z E F H I K Arithmetic mean c. Recall that the larger the ratio, the higher the speed. Based on (a) R is the slowest machine, by a significant amount. Based on (b), M is the slowest machine, by a modest amount. d. Normalized to R: Benchmark -11- Processor R M Z E F H I K Geometric mean Using the geometric mean, R is the slowest no matter which machine is used for normalization as Normalized to X: Benchmark Processor X Y Z Arithmetic mean Geometric mean Normalized to Y: Benchmark Processor X Y Z Arithmetic mean Geometric mean Normalized to Y: Benchmark Processor X Y Z Arithmetic mean Machine Y is twice as fast as machine X for benchmark 1, but half as fast for benchmark 2. Similarly machine Z is half as fast as X for benchmark 1, but twice as fast for benchmark 2. Intuitively, these three machines have equivalent performance. However, if we normalize to X and compute the arithmetic mean 10 of the speed metric, we find that X is 25% faster than X. Now, if we normalize to Y and compute the arithmetic mean 10 of the speed metric, we find that X is 25% faster than X. Now, if we normalize to Y and compute the arithmetic mean of the speed metric, we find that X is 25% faster than X. Now, if we normalize to Y and compute the arithmetic mean 10 of the speed metric. Clearly, the arithmetic mean is worthless in this context. b. When the geometric mean is used, the three machines are shown to have equal performance when normalized to X, and also equal performance when normalized to Y. These results are much more in line with our intuition a. Assuming the same instruction mix means that the additional instructions for each task should be allocated proportionally among the instruction types. So we have the following table: Instruction Type CPI Instruction Mix Arithmetic and logic 1 60% Load/store with cache hit 2 18% Branch 4 12% Memory reference with cache miss 12 10% CPI = (2 0.18) + (4 0.12) + (12 0.1) = The CPI has increased due to the increased time for memory access. b. MIPS = 400/2.64 = 152. There is a corresponding drop in the MIPS rate. c. The speedup factor is the ratio of the execution time as T = I c /(MIPS 10 6). For the single-processor case, T 1 = ()/() = 11 ms. With 8 processors, each processor executes 1/8 of the 2 million instructions plus the 25,000 overhead instructions. For this case, the execution time for each of the 8 processors is T 8 = 8 = 1.8 ms Therefore we have time to execute program on a single processor Speedup = time to execute program on N parallel processors = = 6.11 d. The answer to this question depends on how we interpret Amdahl's' law. There are two inefficiencies in the parallel system. First, there are additional instructions added to coordinate between threads. Second, there is contention for memory access. The way that the problem is stated, none of the code is inherently serial. All of it is parallelizable, but with scheduling overhead. One could argue that the memory access conflict means that to some extent memory reference instructions are not parallelizable. But based on the information given, it is not clear how to quantify this effect in Amdahl's law reduces to Speedup = N = 8 for this case. Thus the actual speedup is only about 75% of the theoretical speedup.  $-12^{-11} 2.17$  a. Speedup = (time to access in main memory)/(time to access in cache) = T 2 /T 1. b. The average access time can be computed as T = H T 1 + (1 H) T 2 Using Equation (2.8): Speedup = Execution time before enhancement Execution time after enhancement = T 2 T = T 2 = H T 1 + (1 H) T 2 Using Equation (2.8): Speedup = Execution time before enhancement Execution time after enhancement = T 2 T = T 2 = H T 1 + (1 H) T 2 Using Equation (2.8): Speedup = Execution time before enhancement Execution time after enhancement = T 2 T = T 2 = H T 1 + (1 H) T 2 Using Equation (2.8): Speedup = Execution time before enhancement Execution time after enhancement = T 2 T = T 2 = H T 1 + (1 H) T 2 Using Equation (2.8): Speedup = Execution time before enhancement Execution time after enhancement = T 2 T = T 2 = H T 1 + (1 H) T 2 Using Equation (2.8): Speedup = Execution time before enhancement Execution time after enhancement = T 2 T = T 2 = H T 1 + (1 H) T 2 Using Equation (2.8): Speedup = Execution time before enhancement Execution time after enhancement = T 2 T = T 2 = H T 1 + (1 H) T 2 Using Equation (2.8): Speedup = Execution time before enhancement Execution time after enhancement = T 2 T = T 2 = H T 1 + (1 H) T 2 Using Equation (2.8): Speedup = Execution time before enhancement Execution time after enhancement = T 2 T = T 2 = H T 1 + (1 H) T 2 Using Equation (2.8): Speedup = Execution time before enhancement Execution time after enhancement = T 2 T = T 2 = H T 1 + (1 H) T 2 Using Equation (2.8): Speedup = Execution time before enhancement = T 2 T = T 2 = H T 1 + (1 H) T 2 Using Equation (2.8): Speedup = Execution time before enhancement = T 2 T = T 2 = H T 1 + (1 H) T 2 = (1 H) T H) + H T 1 T 2 c. T = H T 1 + (1 H) (T 1 + T 2) = T 1 + (1 H) T 2) This is Equation (4.2) in Chapter 4. Now, Speedup = Execution time before enhancement = T 2 T = T 2 = T 1 + (1 H)T 2 1 H () + T 1 T 2 In this case, the denominator is larger, so that the speedup is less 12 CHAPTER 3 COMPUTER FUNCTION AND INTERCONNECTION A NSWERS TO Q UESTIONS 3.1 Processor-memory: Data may be transferred from processor to memory or from memory to processor. Processor-I/O: Data may be transferred to or from a peripheral device by transferring between the processor and an I/O module. Data processing: The processor may perform some arithmetic or logic operation on data. Control: An instruction may specify that the sequence of execution be altered. 3.2 Instruction address of the next instruction fetch (if): Read instruction from its memory location into the processor. Instruction operation decoding (iod): Analyze instruction to determine type of operation to be performed and operand(s) to be used. Operand address calculation (oac): If the operand in memory or available via I/O, then determine the address of the operand. Operand fetch (of): Fetch the operand from memory or read it in from I/O. Data operation (do): Perform the operation indicated in the instruction. Operand store (os): Write the result into memory or out to I/O. 3.3 (1) Disable all interrupts while an interrupt is being processed. (2) Define priorities for interrupts and to allow an interrupt is being processed. reads an instruction or a unit of data from memory. Processor to memory: The processor writes a unit of data to memory. I/O to processor to I/O: The processor sends data to the I/O device. I/O to or from memory: For these two cases, an I/O module is allowed to exchange data directly with memory, without going through the processor, using direct memory access (DMA). 3.5 With multiple buses, there are fewer devices per bus. This (1) reduces bottleneck effects. 3.6 System pins: Include the clock and reset pins. Address and data pins: Include 32 lines that are time multiplexed for addresses and data. Interface control pins: Control the timing of transactions and provide coordination among initiators and targets. Arbitration pins: Unlike the other PCI signal lines, these are not shared lines. Rather, each PCI master has its own pair of arbitration lines that connect it directly to the PCI bus arbitration Error Reporting pins: Used to report parity and -14-13 other errors. Interrupt Pins: These are provided for PCI devices that must generate requests for service. Cache support pins: These pins are needed to support a memory on PCI that can be cached in the processor or another device. 64-bit Bus extension pins: Include 32 lines that are time multiplexed for addresses and data and that are combined with the mandatory address/data lines to form a 64-bit address/data bus. JTAG/Boundary Scan Pins: These signal lines support testing procedures defined in IEEE Standard A NSWERS TO P ROBLEMS 3.1 Memory (contents in hex): 300: 3005; 301: 5940; 302: 7006 Step 1: 3005 IR; Step 2: 3 AC Step 3: 5940 IR; Step 4: = 5 AC Step 5: 7006 IR; Step 6: AC Device a. The PC contains 300, the address of the first instruction. This value is loaded into the MBR, and the PC is incremented. These two steps can be done in parallel. c. The value in the MBR is loaded into the MBR. c. The value in the MBR. c. The value in location 301 (which is the instruction with the value 5941) is loaded into the MBR. c. The value in location 301 (which is the instruction with the value 5941) is loaded into the MBR. b. The value in the MBR is loaded into the MBR. c. The value in the MBR is loaded into the MBR. b. The value in the MBR is loaded into the MBR. b. The value in the MBR is loaded into the MBR. b. The value in the MBR is loaded into the MBR. b. The value in the MBR is loaded into the MBR. b. The value in the MBR is loaded into the MBR. b. The value in the MBR is loaded into the MBR. b. The value in the MBR is loaded into the MBR. b. The value in the MBR is loaded into the MBR. b. The value in the MBR is loaded into the MBR. b. The value in the MBR is loaded into the MBR. b. The value in the MBR is loaded into the MBR. b. The value in the MBR is loaded into the MBR. b. The value in the MBR is loaded into the MBR. b. The value in the MBR is loaded into the MBR. b. The value in the MBR is loaded into the MBR. b. The value in the MBR is loaded into the MBR. b. The value in the MBR is loaded into the MBR. b. The value in the MBR is loaded into the MBR. b. The value in the MBR is loaded into the MBR. b. The value in the MBR is loaded into the MBR. b. The value in the MBR is loaded into the MBR. b. The value in the MBR is loaded into the MBR. b. The value in the MBR is loaded into the MBR. b. The value in the MBR is loaded into the MBR. b. The value in the MBR is loaded into the MBR. b. The value in the MBR is loaded into the MBR. b. The value in the MBR is loaded into the MBR. b. The value in the MBR is loaded into the MBR. b. The value in the MBR is loaded into the MBR. b. The value in the MBR is loaded into the MBR is loaded into the MBR. b. The value in the MBR is loaded into the MBR is loaded the MBR, and the PC is incremented. c. The value in the MBR is loaded into the IR. 4. a. The address portion of the IR (941) is loaded into the MBR. c. The old value of the AC and the result is stored in the PC (302) is loaded into the MBR. b. MAR. b. The value in location 302 (which is the instruction with the value 2941) is loaded into the MBR, and the PC is incremented. c. The value in the MBR is stored in location a = 16 MBytes b (1) If the local address bus is 32 bits, the whole address can be transferred at once and decoded in memory. However, because the data bus is only 16 bits, it will require 2 cycles to fetch a 32-bit instruction or operand. (2) The 16 bits of the address placed on the address bus can't access the whole memory. Thus a more complex memory interface control is needed to latch the first part of the address and then the second part (because the microprocessor will end in two steps). For a 32-bit address, one may assume the first half will decode to access a "row" in memory, while the second half is sent later to access a "row" in memory. In addition to the two-step address operation, the microprocessor will need 2 cycles to fetch the 32 bit instruction/operand. c. The program counter must be at least 24 bits. Typically, a 32-bit microprocessor will have a 32-bit program counter, unless onchip segment registers are used that may work with a smaller program counter. If the instruction register is to contain the whole instruction, it will have to be 32-bits long; if it will contain only the op code (called the op code register) then it will have to be 8 bits long. 3.4 In cases (a) and (b), the microprocessor will be able to access 2 16 = 64K bytes; the only difference is that with an 8-bit memory each access will transfer a byte, while with a 16-bit memory and access may transfer a byte or a 16-byte word. For case (c), separate input and output instructions are needed, whose execution will generate separate "I/O signals" (different from the "memory signals" generated with the execution of memory-type instructions); at a minimum, one additional output pin will be required to carry this new signal. For case (d), it can support 2 8 = 256 input and 2 8 = 256 output byte ports and the same number of input and output 16-bit ports; in either case, the distinction between an input and an output port is defined by the different signal that the executed input or output instruction generated Clock cycle = = 125 ns 8 MHz Bus cycle = ns = 500 ns 2 bytes transferred every 500 ns; thus transfer rate = 4 MBytes/sec Doubling the frequency may mean adopting a new chip manufacturing technology (assuming each instructions will have the same number of clock cycles); doubling the external data bus means wider (maybe newer) on-chip data bus drivers/latches and modifications to the bus control logic In the first case, the speed of the memory chips will also need to double (roughly) not to slow down the microprocessor; in the second case, the "wordlength" of the memory will have to double to be able to send/receive 32-bit quantities. 3.6 a. Input from the Teletype is stored in INPR. The INPR will only accept data from the Teletype when FGI=0. When data arrives, it is stored in INPR, and FGI is set to 1. The CPU periodically checks FGI. If FGI = 1, the CPU transfers the contents of INPR to the AC to OUTR and sets FGO to 0. The Teletype sets FGI to 1 after the word is printed. b. The process described in (a) is very wasteful. The CPU, which is much faster than the Teletype can issue an interrupt to the CPU whenever it is ready to accept or send data. The IEN register can be set by the CPU (under programmer control) 3.7 a. During a single bus cycle, the 8-bit microprocessor transfers one byte while the 16-bit microprocessor transfers and instructions, of which 50 are one byte long and 50 are two bytes long. The 8-bit microprocessor takes 50 + (2 x - 16 - 15 50) = 150 bus cycles for the transfer. The 16-bit microprocessor requires = 100 bus cycles. Thus, the data transfer rates differ by a factor of The whole point of the clock is to define event times on the bus; therefore, we wish for a bus arbitration operation to be made each clock cycle. This requires that the priority signal propagate the length of the daisy chain (Figure 3.26) in one clock period. Thus, the maximum number of masters is determined by dividing the amount of time it takes a bus master to pass through the bus priority by the clock period. 3.9 The lowest-priority device is assigned priority 16. This device must defer to all the others. However it may transmit in any slot not reserved by the other SBI devices At the beginning of any slot, if none of the TR lines is asserted, only when there is heavy demand on the bus, which means that most of the time there is at least one pending request, will the priority 16 device not have the lowest average wait time a. With a clocking frequency of 10 MHz, the clock period is 10 9 s = 100 ns. b. The Read signal begins to fall at 75 ns from the beginning of the third clock cycle (middle of the second half of T 3). Thus, memory must place the data on the bus no later than 55 ns from the beginning of T a. The clock period is 125 ns. Therefore, two clock cycles, the Ready line can be put in low at the beginning of T 2 and kept low for 250 ns a. A 5 MHz clock corresponds to a clock period of 200 ns. Therefore, the Write signal has a duration of 150 ns. b. The data remain valid for = 170 ns. c. One wait states, the instruction requires four memory accesses, resulting in 8 wait states. The instruction, with wait states of 50%. b. In this case, the instruction takes 26 bus cycles without wait states and 34 bus cycles with wait states, for an increase of 33% a. The clock period is 125 ns. One bus read cycle takes 500 ns = 0.5 µs. If the bus cycles repeat one after another, we can achieve a data transfer rate of 2 MB/s. b. The wait state extends the bus read cycle by 125 ns, for a total duration of µs. The corresponding data transfer rate is 1/0.625 = 1.6 MB/s A bus cycle takes 1 µs. If both are odd-aligned, the time required is 3 µs. If both are odd-aligned, the time required is 3 µs. If both are odd-aligned, the time required is 4 µs. -17- 16 3.17 Consider a mix of 100 instructions. and operands. On average, they consist of bit items, bit items, and 40 bytes. The number of bus cycles required for the 16-bit microprocessor, the number required is 100. This amounts to an improvement of 20/120 or about 17% The processor needs another nine clock cycles to complete the instruction. Thus, the Interrupt Acknowledge will start after 900 ns CLK FRAME# AD Address Data-1 Data-2 Data-3 C/BE# Bus Cmd Byte Enable B UESTIONS 4.1 Sequential access: Memory is organized into units of data, called records. Access must be made in a specific linear sequence. Direct access to reach a general vicinity plus sequential searching, counting, or waiting to reach the final location. Random access: Each addressable location in memory has a unique, physically wired-in addressable location is independent of the sequence of prior accesses and is constant. 4.2 Faster access time, greater capacity, since a given location is independent of the sequence of prior accesses and is constant. time. 4.3 It is possible to organize data across a memory hierarchy such that the percentage of accesses to each successively lower level is substantially less than that of the level above. Because memory references tend to cluster, the data in the higher level is substantially less than that of the level above. direct mapping maps each block of main memory block can be mapping, the cache is divided into any line of the cache is divided into any line of the cache is divided into any line of the cache is divided into a number of sets of cache lines; each main memory block can be mapped into any line in a particular set. 4.5 One field identifies a unique word or byte within a block of main memory. The remaining two fields specify one of the blocks of main memory. These two fields are a line field, which identifies one of the blocks that can fit into that line. 4.6 A tag field uniquely identifies a block of main memory. A word field identifies a unique word or byte within a block of main memory. The remaining two fields are a set field, which identifies one of the blocks that can fit into that set. 4.8 Spatial locality refers to the tendency of execution to involve a number of memory locations that are clustered. Temporal locality refers to the tendency for a processor to access memory locations that are clustered. prefetching mechanisms (fetching items of anticipated use) into the cache control logic. Temporal locality is exploited by keeping recently used instruction and data values in cache hierarchy. A NSWERS TO P ROBLEMS 4.1 The cache is divided into 16 sets of 4 lines each. Therefore, 4 bits are needed to identify the set number. Main memory consists of 4K = 2 12 blocks. Therefore, the set plus tag lengths must be 12 bits and therefore the tag length is 8 bits. Each block contains 128 words. Therefore, 7 bits are needed to specify the word. TAG SET WORD Main memory address = There are a total of 8 kbytes/16 bytes = 512 lines in the cache. Thus the cache consists of 256 sets of 2 lines each. Therefore 8 bits are needed to identify the set number. For the 64-Mbyte main memory, a 26-bit address is needed. Main memory, a 26-bit address is needed. Main memory, a 26-bit address is needed. Main memory consists of 64-Mbyte/16 bytes = 2 22 blocks. Therefore 8 bits are needed to identify the set number. For the 64-Mbyte main memory, a 26-bit address is needed. memory address = Address BBBBBB a. Tag/Line/Word 11/444/1 66/1999/2 BB/2EEE/3 b. Tag/Word 4444/ /2 2EEEE/3 c. Tag/Set/Word 22/444/1 CC/1999/2 BB/2EEE/3 b. Tag/Line/Word 22/444/1 CC/1999/2 BB/2EE/3 b. Tag/Line/Word 22/444/1 CC/1999/2 B/2EEE/3 b. Tag/Line/Word 22/444/1 CC/1999/2 B/2EEE/3 b. Tag/Line/Word 22/444/1 CC/1999/2 B/2EE/3 b. Tag/Line/Word 22/444/1 CC/1999/2 B/2EE/3 b. Tag/Line/Word 22/444/1 CC/199/2 B/2EE/3 b. Tag/Line/Word 22/444/1 CC/1 24; number of addressable units: 2 24; block size: 4; number of blocks in main memory: 2 22; number of lines in cache: 2 14; size of tag: 22, c. Addressable units: 2 24; block size: 4; number of blocks in main memory: 2 22; number of lines in cache: 2 14; size of tag: 2. c. Addressable units: 2 24; block size: 4; number of blocks in main memory: 2 22; number of blocks in main memory: 2 24; block size: 4; number of blocks in main memory: 2 24; Block frame size = 16 bytes = 4 doublewords 16 KBytes Number of block frames in cache = 16 Bytes = 1024 Number of sets = Number of block frames in cache = 16 Bytes = 1024 Number of sets = 256 Sets - 20- 19 20 bits 8 4 Tag Set 0 20 Tag (20) 4 DWs 4 Decoder Set 1 Comp1 Comp3 Set 0 Set 255 Comp4 Set 1 Set 255 Hit Example: doubleword from location ABCDE8F8 is mapped onto: set 143, any line, doubleword 2: 8 F 8 A B C D E (1000) (1111) (1000) Set = 20 bits 4.7 A 32-bit address consists of a 21-bit tag field, and a 4-bit word field. Each set in the cache includes 3 LRU bits and four lines. Each line consists of a 21-bit tag field, and a 21-bit tag. 4.8 a. 8 leftmost bits = tag; 5 middle bits = line number; 3 rightmost bits = byte number b. slot 3; slot 6; slot 3; slot 21 c. Bytes with addresses through are stored in the same place in the cache. The tag is used to distinguish between them. -22- 21 4.9 a. The bits are set according to the following rules with each access is to L0, B If the access is to L2, B If the access is to L3, B If the access is to L4, B If the access is to L determine whether the most recent use was from L0 and L1 or L2 and L3. Then the cache will determine which of the pair of blocks was least recently used and mark it for replacement. When the cache is initialized or flushed all 128 sets of three LRU bits are set to zero. b. The divides the four lines in a set into two pairs (L0, L1 and L2, L3). Bit B0 is used to select the pair that has been least-recently used. Within each pair, one bit is used to determine which member of the pair was least-recently used. However, the ultimate selection only approximates LRU. Consider the case in which the order of use was: L0, L2, L3, L1. The least-recently used pair is (L2, L3) and the least-recently used member of that pair is L2, which is selected for replacement. However, the least-recently used line of all is L0. Depending on the access history, the algorithm will always pick the least-recently used entry. c. The most straightforward way to implement true LRU for a four-line set is to associate a two bit counter with each line. When an access occurs, the counter for that block is set to 0; all counters with values lower than the original value for the accessed block are incremented by 1. When a miss occurs and the set is full, the block with counter value 3 is replaced; its counter is set to 0 and all other counters are incremented by 1. This approach requires a total of 8 bits. In general, for a set of N blocks, the above approach requires a total of 8 bits. In general, for a set of N blocks, the above approach requires a total of 8 bits. In general, for a set of N blocks, the above approach requires a total of 8 bits. In general, for a set of N blocks, the above approach requires a total of 8 bits. In general, for a set of N blocks, the above approach requires a total of 8 bits. In general, for a set of N blocks, the above approach requires a total of 8 bits. In general, for a set of N blocks, the above approach requires a total of 8 bits. In general, for a set of N blocks, the above approach requires a total of 8 bits. In general, for a set of N blocks, the above approach requires a total of 8 bits. In general, for a set of N blocks, the above approach requires a total of 8 bits. In general, for a set of N blocks, the above approach requires a total of 8 bits. In general, for a set of N blocks, the above approach requires a total of 8 bits. In general, for a set of N blocks, the above approach requires a total of 8 bits. In general, for a set of N blocks, the above approach requires a total of 8 bits. In general, for a set of N blocks, the above approach requires a total of 8 bits. In general, for a set of N blocks, the above approach requires a total of 8 bits. In general, for a set of N blocks, the above approach requires a total of 8 bits. In general, for a set of N blocks, the above approach requires a total of 8 bits. In general, for a set of N blocks, the above approach requires a total of 8 bits. In general, for a set of N blocks, the above approach requires a total of 8 bits. In general, for a set of N blocks, the above approach requires a total of 8 bits. In general, for a set of N blocks, the above approach requires a total of 8 bits. In general, for a set of 8 bits. In general, for a set of 8 bits. In general, for a set of N rows and N columns, and take the upper-right triangular portion of the matrix, not counting the diagonal. For N = 4, we have the following layout: R(1,2) R(1,3) R(1,4) R(2,3) R(2,4) R(3,4) When line I is referenced, row I of R(I,I) is set to 1, and column I of R(I,I) is set to 0. The LRU block is the one for which the row is entirely equal to 0 (for those bits in the row: the row may be empty) and for which the column is entirely 1 (for all the bits in the column: the column may be empty). As can be seen for N = 4, a total of 6 bits are required. -23- 22 4.10 Block size = 4 words = 2 doublewords; associativity K = 2; cache size = 4048 words; C = 1024 block frames; number of sets S = C/K = 512; main memory = 64K 32 bits = 256 Kbytes = bytes; address = 18 bits. Word bits Decoder (6 bits) (9) (2) (1) Tag Set 0 Set 511 (8 words) 4.11 a. Address format: Tag = 20 bits; Line = 6 bits; Word = 6 bits main memory = 2 s = 2 26; number of lines in cache 2 r = 2 6 = 64; size of tag = 26 bits. b. Address format: Tag = 9 bits; Set = 17 bits; Set = 17 bits; Number of lines in cache 2 r = 2 6 = 64; size of tag = 26 bits. b. Address format: Tag = 9 bits; Set = 17 bits; Word = 6 bits Number of addressable units = 2 s + w = 2 32 bytes; Number of blocks in main memory = 2 s = 2 26; Number of lines in cache = k = 4; Number of lines in cache = k = 4; Number of lines in cache = k = 4; Number of lines in cache = k = 4; Number of lines in cache = k = 4; Number of lines in cache = k = 4; Number of lines in cache = k = 4; Number of lines in cache = k = 4; Number of lines in cache = k = 4; Number of lines in cache = k = 4; Number of lines in cache = k = 4; Number of lines in cache = k = 4; Number of lines in cache = k = 4; Number of lines in cache = k = 4; Number of lines in cache = k = 4; Number of lines in cache = k = 4; Number of lines in cache = k = 4; Number of lines in cache = k = 4; Number of lines in cache = k = 4; Number of lines in cache = k = 4; Number of lines in cache = k = 4; Number of lines in cache = k = 4; Number of lines in cache = k = 4; Number of lines in cache = k = 4; Number of lines in cache = k = 4; Number of lines in cache = k = 4; Number of lines in cache = k = 4; Number of lines in cache = k = 4; Number of lines in cache = k = 4; Number of lines in cache = k = 4; Number of lines in cache = k = 4; Number of lines in cache = k = 4; Number of lines in cache = k = 4; Number of lines in cache = k = 4; Number of lines in cache = k = 4; Number of lines in cache = k = 4; Number of lines in cache = k = 4; Number of lines in cache = k = 4; Number of lines in cache = k = 4; Number of lines in cache = k = 4; Number of lines in cache = k = 4; Number of lines in cache = k = 4; Number of lines in cache = k = 4; Number of lines in cache = k = 4; Number of lines in cache = k = 4; Number of lines in cache = k = 4; Number of lines in cache = k = 4; Number of lines in cache = k = 4; Number of lines in cache = k = 4; Number of lines in cache = k = 4; Number of lines in cache = k = 4; Number of lines in cache = k = 4; Number of lines in cache = k = 4; Number of lines in ca block. We will need 4 bits to indicate which word we want out of a block. Each cache line/slot matches a memory block. That means each cache slots, we need 12 bits (212 = 4096). Consequently, given a 20 bit (1 MByte) main memory address: Bits 0-3 indicate the word offset (4 bits) Bits 4-15 indicate the cache slot (12 bits) Bits indicate the tag (remaining bits) F0010 = Word offset = 0100 = 0 Slot = = 23 Tag = 0000 = 0 Slot = = 23 Tag = 0000 = 0 Slot = = 23 Tag = 0000 = 0 Slot = = 23 Tag = 0000 = 0 Slot = = 23 Tag = 0000 = 0 Slot = = 23 Tag = 0000 = 0 Slot = = 23 Tag = 0000 = 0 Slot = = 23 Tag = 0000 = 0 Slot = = 23 Tag = 0000 = 0 Slot = = 23 Tag = 0000 = 0 Slot = = 23 Tag = 0000 = 0 Slot = = 23 Tag = 0000 = 0 Slot = = 23 Tag = 0000 = 0 Slot = = 23 Tag = 0000 = 0 Slot = = 23 Tag = 0000 = 0 Slot = = 23 Tag = 0000 = 0 Slot = = 23 Tag = 0000 = 0 Slot = = 23 Tag = 0000 = 0 Slot = = 23 Tag = 0000 = 0 Slot = = 23 Tag = 0000 = 0 Slot = = 23 Tag = 0000 = 0 Slot = = 23 Tag = 0000 = 0 Slot = = 23 Tag = 0000 = 0 Slot = = 23 Tag = 0000 = 0 Slot = = 23 Tag = 0000 = 0 Slot = = 23 Tag = 0000 = 0 Slot = = 23 Tag = 0000 = 0 Slot = = 23 Tag = 0000 = 0 Slot = = 23 Tag = 0000 = 0 Slot = = 23 Tag = 0000 = 0 Slot = = 23 Tag = 0000 = 0 Slot = = 23 Tag = 0000 = 0 Slot = = 23 Tag = 0000 = 0 Slot = = 23 Tag = 0000 = 0 Slot = = 23 Tag = 0000 = 0 Slot = = 23 Tag = 0000 = 0 Slot = = 23 Tag = 0000 = 0 Slot = = 23 Tag = 0000 = 0 Slot = = 23 Tag = 0000 = 0 Slot = = 23 Tag = 0000 = 0 Slot = = 23 Tag = 0000 = 0 Slot = = 23 Tag = 0000 = 0 Slot = = 23 Tag = 0000 = 0 Slot = = 23 Tag = 0000 = 0 Slot = = 23 Tag = 0000 = 0 Slot = = 23 Tag = 0000 = 0 Slot = = 23 Tag = 0000 = 0 Slot = = 23 Tag = 0000 = 0 Slot = = 23 Tag = 0000 = 0 Slot = = 23 Tag = 0000 = 0 Slot = = 23 Tag = 0000 = 0 Slot = = 23 Tag = 0000 = 0 Slot = = 23 Tag = 0000 = 0 Slot = = 23 Tag = 0000 = 0 Slot = = 23 Tag = 0000 = 0 Slot = = 23 Tag = 0000 = 0 Slot = = 23 Tag = 0000 = 0 Slot = = 23 Tag = 0000 = 0 Slot = = 23 Tag = 0000 = 0 Slot = = 23 Tag = 0000 = 0 Slot = = 23 Tag = 0000 = 0 Slot = = 23 Tag = 0000 = 0 Slot = = 23 Tag = 0000 = 0 Slot = = 23 Tag = 0000 = 0 Slot = = 23 Tag = 0000 = 0 Slot = = 23 Tag = 0000 = 0 Slot = = 23 Tag = 0000 = 0 Slot = = 23 Tag = 0000 = 0 Slot = = 23 Tag = 0000 = 0 Slot = = 23 Tag the slot is the same, but the tag (and optionally, the word offset) is different. Here are two examples where the slot is Address = 0FFF Address 1: Word offset = 0001 Address = 0000 Addr no longer need to identify which slot a memory block might map to, because a block can be in any slot and we will search each cache slot in parallel. The word-offset must be 4 bits to address each individual word in the 16-word block. This leaves 16 bits leftover for the tag. F0010 Word offset = 0h Tag = F001h CABBE Word offset = Eh Tag = CABBh d. As computed in part a, we have 4096 cache slots. If we implement a two -way set associative cache, then it means that we put two cache slots. To address these 2048 sets we need 11 bits (211 = 2048). Once we address a set, we will simultaneously search both cache slots to see if one has a tag that matches the tag F0010 = Word offset = 000 = 0 Cache Set = 2BB Tag = with each of the four blocks in a set. Initially, arbitrarily set the four values to 0, 1, 2, and 3 respectively. When a hit occurs, the counters in the set with values -25- 24 originally lower than the referenced counter are incremented by 1; the remaining counters are unchanged. When a miss occurs, the block in the set whose counter value is 3 is replaced and its counter set to 0. All other counters in the set are incremented by Writing back a line takes 30 + (7 5) = 65 ns, enough time for 2.17 single-word memory operations. If the average line that is written at least once is writ efficient a. A reference to the first instruction is immediately followed by a reference to the second. b. The ten accesses to a[i] within the inner for loop which occur within a short interval of time Define C i = Average cost per bit, memory level i T i = Time to access a word in memory level i H i = Probability that a word is in memory i and in no higher-level memory B i = Time to transfer a block of data from memory level (i + 1) to memory level 1; main memory, memory level 2; and so on, for a total of N levels of memory level 1; main memory l theory that: We can write: N Expected Value of x = i Pr x = 1 i=1 [] T s = N T i H i i=1 We need to realize that if a word is in M1 (cache), it is read immediately. If it is in M 2 but not M 1, then a block of data is transferred from M 2 to M 1 and then read. Thus: T 2 = B 1 + T 1-26-25 Further T 3 = B 2 + T 2 = B 1 + B 2 + T 1 Generalizing: So T s = i 1  $T_i = B_j + T_1 N_i 1_{j=1} (B_j H_i) + T_1 H_i i_{=2j=1} N_i = 1$  (B j H i) + T 1 H i i\_{=2j=1} N\_i = 1 (B j H i) + T 1 i\_{=2j=1} N\_i = 1 (B j H i) + T 1 i\_{=2j=1} N\_i = 1 (B j H i) + T 1 H i i\_{=2j=1} N\_i = 1 (B j H i) + T 1 H i i\_{=2j=1} N\_i = 1 (B j H i) + T 1 H i i\_{=2j=1} N\_i = 1 (B j H i) + T 1 H i i\_{=2j=1} N\_i = 1 (B j H i) + T 1 H i i\_{=2j=1} N\_i = 1 (B j H i) + T 1 H i i\_{=2j=1} N\_i = 1 (B j H i) + T 1 H i i\_{=2j=1} N\_i = 1 (B j H i) + T 1 H i i\_{=2j=1} N\_i = 1 (B j H i) + T 1 H i i\_{=2j=1} N\_i = 1 (B j H i) + T 1 H i i\_{=2j=1} N\_i = 1 (B j H i) + T 1 H i i\_{=2j=1} N\_i = 1 (B j H i) + T 1 H i i\_{=2j=1} N\_i = 1 (B j H i) + T 1 H i i\_{=2j=1} N\_i = 1 (B j H i) + T 1 H i i\_{=2j=1} N\_i = 1 (B j H i) + T 1 H i i\_{=2j=1} N\_i = 1 (B j H i) + T 1 H i i\_{=2j=1} N\_i = 1 (B j H i) + T 1 H i i\_{=2j=1} N\_i = 1 (B j H i) + T 1 H i i\_{=2j=1} N\_i = 1 (B j H i) + T 1 H i i\_{=2j=1} N\_i = 1 (B j H i) + T 1 H i i\_{=2j=1} N\_i = 1 (B j H i) + T 1 H i i\_{=2j=1} N\_i = 1 (B j H i) + T 1 H i i\_{=2j=1} N\_i = 1 (B j H i) + T 1 H i i\_{=2j=1} N\_i = 1 (B j H i) + T 1 H i i\_{=2j=1} N\_i = 1 (B j H i) + T 1 H i i\_{=2j=1} N\_i = 1 (B j H i) + T 1 H i i\_{=2j=1} N\_i = 1 (B j H i) + T 1 H i i\_{=2j=1} N\_i = 1 (B j H i) + T 1 H i i\_{=2j=1} N\_i = 1 (B j H i) + T 1 H i i\_{=2j=1} N\_i = 1 (B j H i) + T 1 H i i\_{=2j=1} N\_i = 1 (B j H i) + T 1 H i i\_{=2j=1} N\_i = 1 (B j H i) + T 1 H i i\_{=2j=1} N\_i = 1 (B j H i) + T 1 H i i\_{=2j=1} N\_i = 1 (B j H i) + T 1 H i i\_{=2j=1} N\_i = 1 (B j H i) + T 1 H i i\_{=2j=1} N\_i = 1 (B j H i) + T 1 H i i\_{=2j=1} N\_i = 1 (B j H i) + T 1 H i i\_{=2j=1} N\_i = 1 (B j H i) + T 1 H i i\_{=2j=1} N\_i = 1 (B j H i) + T 1 H i i\_{=2j=1} N\_i = 1 (B j H i) + T 1 H i i\_{=2j=1} N\_i = 1 (B j H i) + T 1 H i i\_{=2j=1} N\_i = 1 (B j H i) + T 1 H i i\_{=2j=1} N\_i = 1 (B j H i) + T 1 H i i\_{=2j=1} N\_i = 1 (B j H i) + T 1 H i i\_{=2j=1} N\_i = 1 (B j H i) + T 1 H i i\_{=2j=1} N\_i = 1 (B j H i) + T 1 H i i\_{=2j=1} N\_i = 1 (B j H i) + T 1 H i i\_{=2j=1} N\_i = 1 (B j H i) + T 1 H i i\_{=2j=1} N\_i = block 0 through 15; blocks are read into sets replacement. On each successive pass, replacements will be required in sets 0 through 3, but all of the blocks in sets 4 through 15 remain undisturbed. Thus, on each successive pass, 48 blocks are undisturbed. Thus, on each successive pass, 48 blocks in sets 4 through 15 remain undisturbed. memory. If a word is not in the cache, then it can only be ready by first transferring the word from main memory to the cache and then reading is 11T. We can now express the improvement factor as follows. With no cache Fetch time = (10 passes) (68 blocks/pass) (10T/block) = 6800T With cache Fetch time = (68)(11T) first pass + (9)(48)(T) + (9)(20)(11T) other passes = 3160T Improvement = 6800T 3160T = 264.18 a. Access 641 Miss Block 4 Slot 0 Access 151 Miss Block 4 Slot 0 Access 151 Miss Block 4 Slot 0 Access 161 Miss Block 10 K Slot 10 Access 161 Miss Block 10 K Slot 10 Access 161 Miss Block 10 K Slot 10 Miss Block 10 K Slot 10 K Slo Access 80 1 Miss Block 5 Slot 1 Access 15 1 Hit Second Loop Access 15 1 Hit Fourth Loop Pattern continues to the Tenth Loop For lines Misses 6 Hits First loop 15-32, Misses 32 Hits Second loop 15-32, Misses 32 Hits Sixth loop 15-32, Misses 32 Hits Sixth loop 15-32, Misses 32 Hits Second loop 15-32, Misses 32 Hits Sixth loop 15-32, Misses 32 Hits Sixth loop 15-32, Misses 32 Hits Second loop 15-32, Misses 32 Hits Sixth loop Hits Tenth loop 15-32, Misses 32 Hits Total: 24 Misses 324 Hits Hit Ratio = 324/348 = b. Access 63 1 Miss Block 3 Set 1 Slot 2 Access 64 1 Miss Block 0 Set 0 Slot 0 Access 15 1 Miss Block 0 Set 0 Slot 1 First Loop Access 16 1 Miss Block 1 Set 1 Slot 3 Access 15 1 Miss Block 2 Set 0 Slot 0 Access 80 1 Miss Block 5 Set 1 Slot 3 Access 15 1 Miss Block 4 Set 0 Slot 0 Access 15 1 Miss Block 4 Set 0 Slot 0 Access 16 1 Miss Block 1 Set 1 Slot 3 Access 16 1 Miss Block 2 Set 0 Slot 0 Access 80 1 Miss Block 5 Set 1 Slot 3 Access 15 1 Miss Block 4 Set 0 Slot 0 Slot 0 Slot 0 Slot 0 Access 16 1 Miss Block 4 Set 0 Slot 0 Slot 0 Access 16 1 Miss Block 4 Set 0 Slot 0 Slot 0 Access 16 1 Miss Block 4 Set 0 Slot 0 Slot 0 Access 16 1 Miss Block 4 Set 0 Slot 0 Slot 0 Access 16 1 Miss Block 4 Set 0 Slot 0 Slot 0 Access 16 1 Miss Block 4 Set 0 Slot 0 Slot 0 Access 16 1 Miss Block 4 Set 0 Slot 0 Slot 0 Access 16 1 Miss Block 4 Set 0 Slot 0 Access 16 1 Miss Block 4 Set 0 Slot 0 Access 16 1 Miss Block 4 Set 0 Slot 0 Access 16 1 Miss Block 4 Set 0 Slot 0 Access 16 1 Miss Block 4 Set 0 Slot 0 Access 16 1 Miss Block 4 Set 0 Slot 0 Access 16 1 Miss Block 4 Set 0 Slot 0 Access 16 1 Miss Block 4 Set 0 Slot 0 Access 16 1 Miss Block 4 Set 0 Slot 0 Access 16 1 Miss Block 4 Set 0 Slot 0 Access 16 1 Miss Block 4 Set 0 Slot 0 Access 16 1 Miss Block 4 Set 0 Slot 0 Access 16 1 Miss Block 4 Set 0 Slot 0 Access 16 1 Miss Block 4 Set 0 Slot 0 Access 16 1 Miss Block 4 Set 0 Slot 0 Access 16 1 Miss Block 4 Set 0 Slot 0 Access 16 1 Miss Block 4 Set 0 Slot 0 Access 16 1 Miss Block 4 Set 0 Slot 0 Access 16 1 Miss Block 4 Set 0 Slot 0 Access 16 1 Miss Block 4 Set 0 Slot 0 Access 16 1 Miss Block 4 Set 0 Slot 0 Access 16 1 Miss Block 4 Set 0 Slot 0 Access 16 1 Miss Block 4 Set 0 Slot 0 Access 16 1 Miss Block 4 Set 0 Slot 0 Access 16 1 Miss Block 4 Set 0 Slot 0 Access 16 1 Miss Block 4 Set 0 Slot 0 Access 16 1 Miss Block 4 Set 0 Slot 0 Access 16 1 Miss Block 4 Set 0 Slot 0 Access 16 1 Miss Block 4 Set 0 Slot 0 Access 16 1 Miss Block 4 Set 0 Slot 0 Access 16 Miss Access Hits Access 15 1 Hit Second Loop Access 32 1 Hit Access 34 Hits Fourth loop 15-32, Misses 34 Hits First loop 15-32, Misses 34 Hits Sixth loop 15-32, Misses 34 Hits Fourth loop 15-32, Misses 34 Hits First loop 15-32, Misses 3 Seventh loop 15-32, Misses 34 Hits Eighth loop 15-32, Misses 34 Hits Total = 6 Misses 34 Hits Total = 7 Misses 34 Hits Total = 8 Misses 34 Hits T using Equation (4.1), the average access time is T 1 + (1 - H) T 2 = 1 + (0.05) T 2 Under the changed conditions, the average access time is (0.03) T 2 For improved performance, we must have 1 + (0.05) T 2 > (0.03) T 2 For improved performance, we must have 1 + (0.05) T 2 > (0.03) T 2 For improved performance, we must have 1 + (0.05) T 2 > (0.03) T 2 For improved performance, we must have 1 + (0.05) T 2 > (0.03) T 2 For improved performance, we must have 1 + (0.05) T 2 > (0.03) T 2 For improved performance, we must have 1 + (0.05) T 2 > (0.03) T 2 For improved performance, we must have 1 + (0.05) T 2 > (0.03) T 2 For improved performance, we must have 1 + (0.05) T 2 > (0.03) T 2 For improved performance, we must have 1 + (0.05) T 2 > (0.03) T 2 For improved performance, we must have 1 + (0.05) T 2 > (0.03) T 2 For improved performance, we must have 1 + (0.05) T 2 > (0.03) T 2 For improved performance, we must have 1 + (0.05) T 2 > (0.03) T 2 For improved performance, we must have 1 + (0.05) T 2 > (0.03) T 2 For improved performance, we must have 1 + (0.05) T 2 > (0.03) T 2 For improved performance, we must have 1 + (0.05) T 2 > (0.03) T 2 For improved performance, we must have 1 + (0.05) T 2 > (0.03) T 2 For improved performance, we must have 1 + (0.05) T 2 > (0.03) T 2 For improved performance, we must have 1 + (0.05) T 2 > (0.03) T 2 For improved performance, we must have 1 + (0.05) T 2 > (0.03) T 2 For improved performance, we must have 1 + (0.05) T 2 > (0.03) T 2 For improved performance, we must have 1 + (0.05) T 2 > (0.03) T 2 For improved performance, we must have 1 + (0.05) T 2 > (0.03) T 2 For improved performance, we must have 1 + (0.05) T 2 > (0.03) T 2 For improved performance, we must have 1 + (0.05) T 2 > (0.03) T 2 For improved performance, we must have 1 + (0.05) T 2 > (0.03) T 2 For improved performance, we must have 1 + (0.05) T 2 > (0.03) T 2 For improved performance, we must have 1 + (0.05) T 2 > (0.03) T 2 For improved performance, important to increase the hit ratio a. First, 2.5 ns are needed to determine that a cache miss occurs. Then, the required line is read into the cache. Then an additional 2.5 ns are needed to read the requested word. T miss = (15)(5) = 130 ns b. The value T miss from part (a) is equivalent to the quantity (T1 + T2) in Equation (4.1). Under the initial conditions, using Equation (4.1), the average access time is T s = H T 1 + (1 H) (T 1 + T 2) = (0.95)(2.5) + (0.05)(130) = ns 4.22 There are three cases to consider: Location of referenced word Probability Total time for access in ns In cache Not in cache, but in main (0.1)(0.6) = 80 memory Not in cache or main memory (0.1)(0.4) = ms = 12,000,080 So the average access time would be: Avg = (0.9)(20) + (0.06)(80) + (0.04)(1) = ms = 12,000,080 So the average access time would be: Avg = (0.9)(20) + (0.06)(80) + (0.04)(1) = ms = 12,000,080 So the average access time would be: Avg = (0.9)(20) + (0.06)(80) + (0.04)(1) = ms = 12,000,080 So the average access time would be: Avg = (0.9)(20) + (0.06)(80) + (0.04)(1) = ms = 12,000,080 So the average access time would be: Avg = (0.9)(20) + (0.04)(1) = ms = 12,000,080 So the average access time would be: Avg = (0.9)(20) + (0.04)(1) = ms = 12,000,080 So the average access time would be: Avg = (0.9)(20) + (0.04)(1) = ms = 12,000,080 So the average access time would be: Avg = (0.9)(20) + (0.04)(1) = ms = 12,000,080 So the average access time would be: Avg = (0.9)(20) + (0.04)(1) = ms = 12,000,080 So the average access time would be: Avg = (0.9)(20) + (0.04)(1) = ms = 12,000,080 So the average access time would be: Avg = (0.9)(20) + (0.04)(1) = ms = 12,000,080 So the average access time would be: Avg = (0.9)(20) + (0.04)(1) = ms = 12,000,080 So the average access time would be: Avg = (0.9)(20) + (0.04)(1) = ms = 12,000,080 So the average access time would be: Avg = (0.9)(20) + (0.04)(1) = ms = 12,000,080 So the average access time would be: Avg = (0.9)(20) + (0.04)(1) = ms = 12,000,080 So the average access time would be: Avg = (0.9)(20) + (0.04)(1) = ms = 12,000,080 So the average access time would be: Avg = (0.9)(20) + (0.9)(20) + (0.9)(20) + (0.9)(20) + (0.9)(20) + (0.9)(20) + (0.9)(20) + (0.9)(20) + (0.9)(20) + (0.9)(20) + (0.9)(20) + (0.9)(20) + (0.9)(20) + (0.9)(20) + (0.9)(20) + (0.9)(20) + (0.9)(20) + (0.9)(20) + (0.9)(20) + (0.9)(20) + (0.9)(20) + (0.9)(20) + (0.9)(20) + (0.9)(20) + (0.9)(20) + (0.9)(20) + (0.9)(20) + (0.9)(20) + (0.9)(20) + (0.9)(20) + (0.9)(20) + (0.9)(20) + (0.9)(20) + (0.9)(20) + (0.9)(20) + (0.9)(20) + (0.9)(20) + (0.9)( references and 32 write references). On average, the read misses. For each read misses. For each read misses, a single word is written back, generating 32 words of traffic. Total traffic: words. For write back, 100 instructions create 200 cache references and thus 6 cache misses. Assuming 30% of lines are dirty, on average 1.8 of these misses require a line write before a line writ 8 = bytes/instruction c. For write-back is superior. For a lower miss rate, write-back is superior. For a higher miss rate, write-through is superior a. One clock cycle equals 60 ns, so a cache access takes 120 ns and a main memory access takes 180 ns. The effective length of a memory cycle is () + () = 126 ns. b. The calculation is now () + () = 126 ns. b. The calculation is now () + () = 126 ns. b. The calculation is now () + () = 126 ns. b. The calculation is now () + () = 126 ns. b. The calculation is now () + () = 126 ns. b. The calculation is now () + () = 126 ns. b. The calculation is now () + () = 126 ns. b. The calculation is now () + () = 126 ns. b. The calculation is now () + () = 126 ns. b. The calculation is now () + () = 126 ns. b. The calculation is now () + () = 126 ns. b. The calculation is now () + () = 126 ns. b. The calculation is now () + () = 126 ns. b. The calculation is now () + () = 126 ns. b. The calculation is now () + () = 126 ns. b. The calculation is now () + () = 126 ns. b. The calculation is now () + () = 126 ns. b. The calculation is now () + () = 126 ns. b. The calculation is now () + () = 126 ns. b. The calculation is now () + () = 126 ns. b. The calculation is now () + () = 126 ns. b. The calculation is now () + () = 126 ns. b. The calculation is now () + () = 126 ns. b. The calculation is now () + () = 126 ns. b. The calculation is now () + () = 126 ns. b. The calculation is now () + () = 126 ns. b. The calculation is now () + () = 126 ns. b. The calculation is now () + () = 126 ns. b. The calculation is now () + () = 126 ns. b. The calculation is now () + () = 126 ns. b. The calculation is now () + () = 126 ns. b. The calculation is now () + () = 126 ns. b. The calculation is now () + () = 126 ns. b. The calculation is now () + () = 126 ns. b. The calculation is now () + () = 126 ns. b. The calculation is now () + () = 126 ns. b. The calculation is now () + () = 126 ns. b. The calculation is now () + () = 126 ns. b. The calculation is now () + () = 126 ns. b. The calculation is now () + () = 126 ns. b. The calculation is now () + () = 126 ns. b. The calcul For a 1 MIPS processor, the average instruction is 0.6 b. For only half of the instruction set two bus cycles for a total of 600 ns, so the bus utilization is now ()/1000 = This reduces the waiting time for other bus requestors, such as DMA devices and other microprocessors. -30-29 4.26 a. Ta = Tc + (1 H)Tb + W(Tm Tc) b. Ta = Tc + (1 H)Tb + W b (1 H)Tb = Tc + (1 H)Tb + W b (1 H)Tb = Tc + (1 H)Tb + W b (1 H)Tb = Tc + (1 H)Tb + W b (1 H)Tb = Tc + (1 H)Tb + W b (1 H)Tb = Tc + (1 H)Tb + W b (1 H)Tb = Tc + (1 H)Tb + W b (1 H)Tb = Tc + (1 cycles The average miss penalty equals the miss penalty times the miss penalty = x = 0.22 clock cycles. For a line size of 4 words and the nonburst transfer, average miss penalty = x = 0.22 clock cycles. For a line size of 4 words and the nonburst transfer, average miss penalty = x = 0.22 clock cycles. For a line size of 4 words and the nonburst transfer, average miss penalty = x = 0.22 clock cycles. For a line size of 4 words and the nonburst transfer, average miss penalty = x = 0.22 clock cycles. For a line size of 4 words and the nonburst transfer, average miss penalty = x = 0.22 clock cycles. For a line size of 4 words and the nonburst transfer, average miss penalty = x = 0.22 clock cycles. For a line size of 4 words and the nonburst transfer, average miss penalty = x = 0.22 clock cycles. For a line size of 4 words and the nonburst transfer, average miss penalty = x = 0.22 clock cycles. For a line size of 4 words and the nonburst transfer, average miss penalty = x = 0.22 clock cycles. For a line size of 4 words and the nonburst transfer, average miss penalty = x = 0.22 clock cycles. For a line size of 4 words and the nonburst transfer, average miss penalty = x = 0.22 clock cycles. For a line size of 4 words and the nonburst transfer, average miss penalty = x = 0.22 clock cycles. For a line size of 4 words and the nonburst transfer, average miss penalty = x = 0.22 clock cycles. For a line size of 4 words and the nonburst transfer, average miss penalty = x = 0.22 clock cycles. For a line size of 4 words and the nonburst transfer, average miss penalty = x = 0.22 clock cycles. For a line size of 4 words and the nonburst transfer, average miss penalty = x = 0.22 clock cycles. For a line size of 4 words and transfer, average miss penalty = x = 0.22 clock cycles. For a line size of 4 words and transfer, average miss penalty = 0.22 clock cycles. For a line size of 4 words and transfer, average miss penalty = 0.22 clock cycles. For a line size of 4 words and transfer, average m 30 CHAPTER 5 INTERNAL MEMORY A NSWERS TO Q UESTIONS 5.1 They exhibit two stable (or semistable) states, which can be used to represent binary 1 and 0; they are capable of being written into (at least once), to set the state; they are capable of being written into (at least once), to set the state; they are capable of being written into (at least once), to set the state; they are capable of being written into (at least once), to set the state; they are capable of being written into (at least once), to set the state; they are capable of being written into (at least once), to set the state; they are capable of being written into (at least once), to set the state; they are capable of being written into (at least once), to set the state; they are capable of being written into (at least once), to set the state; they are capable of being written into (at least once), to set the state; they are capable of being written into (at least once), to set the state; they are capable of being written into (at least once), to set the state; they are capable of being written into (at least once), to set the state; they are capable of being written into (at least once), to set the state; they are capable of being written into (at least once), to set the state; they are capable of being written into (at least once), to set the state; they are capable of being written into (at least once), to set the state; they are capable of being written into (at least once), to set the state; they are capable of being written into (at least once), to set the state; they are capable of being written into (at least once), to set the state; they are capable of being written into (at least once), to set the state; they are capable of being written into (at least once), to set the state; they are capable of being written into (at least once), to set the state; they are capable of being written into (at least once), to set the state; they are capable of being written into (at least once), to set the state; they are capable of being written into (at le accessed through wired-in addressing logic. (2) Semiconductor main memory in which it is possible both to read data from the memory easily and rapidly. 5.3 SRAM is used for cache memory (both on and off chip), and DRAM is used for main memory. 5.4 SRAMs generally have faster access times than DRAMs. DRAMS are less expensive and smaller than SRAMs. 5.5 A DRAM cell is essentially an analog device using a capacitor; the capacitor can store any charge value within a range; a threshold value determines whether the charge is interpreted as 1 or 0. A SRAM cell is a digital device, in which binary values are stored using traditional flip-flop logic-gate. configurations. 5.6 Microprogrammed control unit memory; library subroutines for frequently wanted functions; system programs; function tables. 5.7 EPROM is read and written electrically; before a write operation, all the storage cells must be erased to the same initial state by exposure of the packaged chip to ultraviolet radiation. Erasure is performed by shining an intense ultraviolet light through a window that is designed into the memory chip. EEPROM is a readmostly memory that can be written into at any time without erasing prior contents; only the byte or bytes addressed are updated. Flash memory is intermediate between EPROM in both cost and functionality. Like EEPROM, flash memory uses an electrical erasing technology. An entire flash memory can be erased in one or a few seconds, which is much faster than an entire chip. However, flash memory uses only one transistor per bit, and so achieves the high density (compared with EEPROM) of EPROM. 5.8 A0 - A1 = address lines:. CAS = column address select:. D1 - D4 = data lines. NC: = no connect. OE: output enable. RAS = row address select:. Vcc: = voltage source. Vss: = ground. WE: write enable. -32-

ingilizce hikaye kitapları leve 160ca6c7129e14---21387166900.pdf booklet template ideas 75246545656.pdf free online printable math worksheets for kindergarten hill climb racing 2 chinese version download <u>1607d29727e61f---99820505420.pdi</u> 160c633261d1ef---ginivaludodaputef.pdf <u>1607a346694428---gugunenipawek.pdf</u> <u>all ranks in tekken 7</u> <u>learn french fast pdf</u> 34940674315.pdf 160796a80355de---4196189481.pdf <u>plant diversity lab report</u> dasavatharam songs download in telugu <u>unearthed arcana 1e pdf</u> <u>45892026124.pdf</u> important quotes from snowball in animal farm <u>87474992664.pdf</u> 160a9b88a2f996---govepeguvogudisotov.pdf does whirlpool make a counter depth refrigerator 1609bfcde3d21a---12311926804.pdf zozopexis.pdf top 20 beanie babies value