## **Computer Architecture** Appendix B - Review of Memory Hierarchy

### **Glossary of terms**

cache, virtual memory, memory stall cycles, direct mapped, valid bit, block address, write through, instruction cache, average memory access time, cache hit, page, miss penalty

fully associative, dirty bit, block offset, write back, data cache, hit time, cache miss, page fault, miss rate, n-way set associative, least recently used, tag field

write allocate, unified cache, misses per instruction, block, locality, address trace, set, random replacement, index field, no-write allocate, write buffer, write stall

# Memory Hierarchy

| Level                     | 1                                          | 2                    | 3                | 4                         |
|---------------------------|--------------------------------------------|----------------------|------------------|---------------------------|
| Name                      | Registers                                  | Cache                | Main memory      | Disk storage              |
| Typical size              | <4 KiB                                     | 32 KiB to 8 MiB      | <1 TB            | >1 TB                     |
| Implementation technology | Custom memory with<br>multiple ports, CMOS | On-chip CMOS<br>SRAM | CMOS DRAM        | Magnetic disk<br>or FLASH |
| Access time (ns)          | 0.1-0.2                                    | 0.5-10               | 30-150           | 5,000,000                 |
| Bandwidth (MiB/sec)       | 1,000,000-10,000,000                       | 20,000-50,000        | 10,000-30,000    | 100-1000                  |
| Managed by                | Compiler                                   | Hardware             | Operating system | Operating<br>system       |
| Backed by                 | Cache                                      | Main memory          | Disk or FLASH    | Other disks<br>and DVD    |

Figure B.1 The typical levels in the hierarchy slow down and get larger as we move away from the processor for a large workstation or small server. Embedded computers might have no disk storage and much smaller memories.



(A)

Memory hierarchy for a personal mobile device



(B)

Memory hierarchy for a laptop or a desktop



Figure 2.1 The levels in a typical memory hierarchy in a personal mobile device (PMD), such as a cell phone or tablet (A), in a laptop or desktop computer (B), and in a server (C). As we move farther away from the processor, the





Figure 2.2 Starting with 1980 performance as a baseline, the gap in performance, measured as the difference in the time between processor memory requests (for a single processor or core) and the latency of a DRAM access, is plotted over time.

### **DRAM Organization**



Single Memory Cell

Fig. 1: Single memory cell and array. Source: Lam Research

#### Row Address Strobe (RAS)

#### Memory Cell Array

| Production year | Chip size | DRAM type | RAS time (ns) | CAS time (ns) | Total, best case (ns) | Total, precharge needed (ns) |
|-----------------|-----------|-----------|---------------|---------------|-----------------------|------------------------------|
| 2000            | 256M bit  | DDR1      | 21            | 21               | 42         | 63         |
| 2002            | 512M bit  | DDR1      | 15            | 15               | 30         | 45         |
| 2004            | 1G bit    | DDR2      | 15            | 15               | 30         | 45         |
| 2006            | 2G bit    | DDR2      | 10            | 10               | 20         | 30         |
| 2010            | 4G bit    | DDR3      | 13            | 13               | 26         | 39         |
| 2016            | 8G bit    | DDR4      | 13            | 13               | 26         | 39         |

Figure 2.4 Capacity and access times for DDR SDRAMs by year of production. Access time is for a random memory

| Standard | I/O clock rate | M transfers/s | DRAM name | MiB/s/DIMM | DIMM name |
|----------|----------------|---------------|-----------|------------|-----------|
| DDR1     | 133            | 266           | DDR266    | 2128       | PC2100    |
| DDR1     | 150            | 300           | DDR300    | 2400       | PC2400    |
| DDR1     | 200            | 400           | DDR400    | 3200       | PC3200    |
| DDR2     | 266            | 533           | DDR2-533  | 4264       | PC4300    |
| DDR2     | 333            | 667           | DDR2-667  | 5336       | PC5300    |
| DDR2     | 400            | 800           | DDR2-800  | 6400       | PC6400    |
| DDR3     | 533            | 1066          | DDR3-1066 | 8528       | PC8500    |
| DDR3     | 666            | 1333          | DDR3-1333 | 10,664     | PC10700   |
| DDR3     | 800            | 1600          | DDR3-1600 | 12,800     | PC12800   |
| DDR4     | 1333           | 2666          | DDR4-2666 | 21,300     | PC21300   |

Figure 2.5 Clock rates, bandwidth, and names of DDR DRAMS and DIMMs in 2016. Note the numerical relationship
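The numerical relationship between the columns of Figure 2.5 can be checked with a short sketch. It assumes (this is not stated in the table itself) that DDR performs two transfers per I/O clock cycle and that a DIMM has an 8-byte data path; `ddr_rates` is a name invented for this illustration. Some table entries are rounded (for example, 266 MHz is listed with 533 M transfers/s), so only the unrounded rows reproduce digit for digit.

```python
# Assumption: a DIMM has a 64-bit (8-byte) data path and DDR transfers
# data on both edges of the I/O clock.
BYTES_PER_TRANSFER = 8


def ddr_rates(io_clock_mhz):
    """Return (M transfers/s, bandwidth per DIMM) for an I/O clock in MHz.

    The bandwidth is in the same units as the MiB/s/DIMM column of the table.
    """
    transfers = 2 * io_clock_mhz              # two transfers per clock cycle
    bandwidth = transfers * BYTES_PER_TRANSFER
    return transfers, bandwidth


if __name__ == "__main__":
    for clock in (133, 150, 200, 400, 800):
        mt, bw = ddr_rates(clock)
        print(f"{clock} MHz -> {mt} M transfers/s -> {bw} per DIMM")
```

The DIMM name follows the bandwidth figure (PC3200 for 3200), while the DRAM name follows the transfer rate (DDR400 for 400).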

# **Principle of Locality**

- Programs access a small proportion of their address space at any time
- Temporal locality
  - Items accessed recently are likely to be accessed again soon
  - e.g., instructions in a loop, induction variables
- Spatial locality
  - Items near those accessed recently are likely to be accessed soon
  - e.g., sequential instruction access, array data
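Spatial locality can be made concrete with a small sketch that counts how often a traversal moves to a different memory block. The array shape, element size, and block size below are hypothetical values chosen for illustration, and the model is deliberately crude: it remembers only the single most recently touched block rather than simulating a real cache.

```python
def blocks_fetched(addresses, block_size):
    """Count how often consecutive accesses land in a different block."""
    fetches = 0
    current = None
    for addr in addresses:
        block = addr // block_size
        if block != current:      # left the block we were just using
            fetches += 1
            current = block
    return fetches


# Hypothetical 64 x 64 array of 8-byte elements stored row by row,
# with hypothetical 64-byte blocks (8 elements per block).
ROWS, COLS, ELEM, BLOCK = 64, 64, 8, 64

row_major = [(r * COLS + c) * ELEM for r in range(ROWS) for c in range(COLS)]
col_major = [(r * COLS + c) * ELEM for c in range(COLS) for r in range(ROWS)]

if __name__ == "__main__":
    print("row-major traversal:   ", blocks_fetched(row_major, BLOCK))
    print("column-major traversal:", blocks_fetched(col_major, BLOCK))
```

Walking the array in storage order reuses each block for eight consecutive accesses, while the column-order walk changes block on every access.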






### **Four Memory Hierarchy Questions**

- Q1: Where can a block be placed in the upper level? (block placement)
- Q2: How is a block found if it is in the upper level? (block identification)
- Q3: Which block should be replaced on a miss? (block replacement)
- Q4: What happens on a write? (write strategy)

#### **Block Placement**

- **Direct Mapped** - block has only one place where it can appear: (Block Addr) **MOD** (Number of blocks in cache)
- **Fully Associative** - block can be placed anywhere in the cache
- **Set Associative** - block can be placed anywhere within one set: (Block Addr) **MOD** (Number of sets in cache)



Figure B.2 This example cache has eight block frames and memory has 32 blocks.
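The three placement rules can be sketched as two small functions. The eight-frame cache matches the figure caption; the choice of block address 12 and the function names are illustrative assumptions, and frames of a set are assumed to be numbered contiguously.

```python
def direct_mapped_frame(block_addr, num_frames):
    """Direct mapped: the block has exactly one candidate frame."""
    return block_addr % num_frames


def set_associative_frames(block_addr, num_frames, ways):
    """Set associative: return (set number, candidate frames in that set).

    ways == 1 degenerates to direct mapped; ways == num_frames is
    fully associative (a single set holding every frame).
    """
    num_sets = num_frames // ways
    set_number = block_addr % num_sets
    first = set_number * ways
    return set_number, list(range(first, first + ways))


if __name__ == "__main__":
    block = 12   # illustrative block address
    print("direct mapped    ->", direct_mapped_frame(block, 8))
    print("2-way set assoc. ->", set_associative_frames(block, 8, 2))
    print("fully associative->", set_associative_frames(block, 8, 8))
```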

### **Block Identification**

| Block address |       | Block offset |
|---------------|-------|--------------|
| Tag           | Index |              |

Figure B.3 The three portions of an address in a set associative or direct-mapped cache. The tag is used to check all the blocks in the set, and the index is used to select the set. The block offset is the address of the desired data within the block. Fully associative caches have no index field.

- Block Offset - position of byte within block
- Direct Mapped - location = Block Addr MOD No. of blocks
- Set Associative - set number = Block Addr MOD No. of sets; the index selects the set, and the tag is compared against every block in that set
- Fully Associative - no index field
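As a minimal sketch of the address split, the functions below derive the field widths from the cache geometry and then slice an address into tag, index, and block offset. They assume power-of-two sizes; the function names and the sample address are invented for illustration. The geometry used in the demo (64 KiB, two-way, 64-byte blocks) is the data cache described in Figure B.5.

```python
def cache_geometry(cache_bytes, block_bytes, ways):
    """Return (number of sets, index bits, offset bits).

    Assumes all three parameters are powers of two.
    """
    num_sets = cache_bytes // (block_bytes * ways)
    offset_bits = block_bytes.bit_length() - 1    # log2(block size)
    index_bits = num_sets.bit_length() - 1        # log2(number of sets)
    return num_sets, index_bits, offset_bits


def split_address(addr, index_bits, offset_bits):
    """Split an address into (tag, index, block offset)."""
    offset = addr & ((1 << offset_bits) - 1)
    index = (addr >> offset_bits) & ((1 << index_bits) - 1)
    tag = addr >> (offset_bits + index_bits)
    return tag, index, offset


if __name__ == "__main__":
    sets, index_bits, offset_bits = cache_geometry(64 * 1024, 64, 2)
    print(sets, "sets,", index_bits, "index bits,", offset_bits, "offset bits")
    print(split_address(0x12345, index_bits, offset_bits))
```

For a fully associative cache the number of sets is 1, so `index_bits` is 0 and the whole block address becomes the tag.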

## **Block Replacement**

When a block miss occurs, a block must be replaced

- Direct Mapped no choice, can only go in one place
- N-way Associative must choose a block within the set
- Fully Associative any block can be replaced

**Replacement Strategies:** 

- Random
- Least Recently Used (LRU)
- Pseudo LRU
- First in, First out (FIFO)
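To illustrate LRU, here is a minimal sketch of a single cache set that tracks recency with an ordered dictionary. The class name `LRUSet` is an assumption made for this example; real hardware usually approximates LRU with a few status bits per set (the pseudo-LRU entry above) rather than keeping an exact ordering.

```python
from collections import OrderedDict


class LRUSet:
    """One cache set with exact LRU replacement; keys are block tags."""

    def __init__(self, ways):
        self.ways = ways
        self.blocks = OrderedDict()   # least recently used entry comes first

    def access(self, tag):
        """Return True on a hit, False on a miss (the block is then filled)."""
        if tag in self.blocks:
            self.blocks.move_to_end(tag)       # now the most recently used
            return True
        if len(self.blocks) >= self.ways:
            self.blocks.popitem(last=False)    # evict the least recently used
        self.blocks[tag] = True
        return False


if __name__ == "__main__":
    two_way = LRUSet(ways=2)
    for tag in ["A", "B", "A", "C", "B"]:
        print(tag, "hit" if two_way.access(tag) else "miss")
```

Dropping the `move_to_end` call on a hit would turn the same structure into FIFO replacement, since eviction order would then depend only on fill order.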

#### Performance data

| Size    | Two-way LRU | Two-way Random | Two-way FIFO | Four-way LRU | Four-way Random | Four-way FIFO | Eight-way LRU | Eight-way Random | Eight-way FIFO |
|---------|-------------|----------------|--------------|--------------|-----------------|---------------|---------------|------------------|----------------|
| 16 KiB  | 114.1   | 117.3  | 115.5 | 111.7    | 115.1         | 113.3 | 109.0     | 111.8  | 110.4 |
| 64 KiB  | 103.4   | 104.3  | 103.9 | 102.4    | 102.3         | 103.1 | 99.7      | 100.5  | 100.3 |
| 256 KiB | 92.2    | 92.1   | 92.5  | 92.1     | 92.1          | 92.5  | 92.1      | 92.1   | 92.5  |

**Figure B.4** Data cache misses per 1000 instructions comparing least recently used, random, and first in, first out replacement for several sizes and associativities. There is little difference between LRU and random for the largest size cache, with LRU outperforming the others for smaller caches. FIFO generally outperforms random in the smaller cache sizes. These data were collected for a block size of 64 bytes for the Alpha architecture using 10 SPEC2000 benchmarks. Five are from SPECint2000 (gap, gcc, gzip, mcf, and perl) and five are from SPECfp2000 (applu, art, equake, lucas, and swim). We will use this computer and these benchmarks in most figures in this appendix.

### Write Strategy

- Write through - main memory is always consistent with cache, but takes more time
- Write back - main memory is only updated upon replacement
  - Dirty bit - set when cache block is written to; if not set, block doesn't need to be written back
- Write Allocate - block is allocated upon a write miss
- Write No-Allocate - block is written directly to main memory, cache not affected

**Example** Assume a fully associative write-back cache with many cache entries that starts empty. Following is a sequence of five memory operations (the address is in square brackets):

```
Write Mem[100];
Write Mem[100];
Read Mem[200];
Write Mem[200];
Write Mem[100].
```

What are the number of hits and misses when using no-write allocate versus write allocate?

**Answer** For no-write allocate, the address 100 is not in the cache, and there is no allocation on write, so the first two writes will result in misses. Address 200 is also not in the cache, so the read is also a miss. The subsequent write to address 200 is a hit. The last write to 100 is still a miss. The result for no-write allocate is four misses and one hit.

For write allocate, the first accesses to 100 and 200 are misses, and the rest are hits because 100 and 200 are both found in the cache. Thus, the result for write allocate is two misses and three hits.
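The hit and miss counts in this example can be reproduced with a short simulation. It is a sketch under the example's own assumptions (fully associative, enough entries that nothing is ever evicted); the function name and the tuple encoding of operations are invented here.

```python
def count_hits(ops, write_allocate):
    """Return (hits, misses) for a sequence of ("read" | "write", address).

    The cache is modelled as an unbounded set of addresses, matching the
    example's "fully associative with many entries" assumption.
    """
    cached = set()
    hits = misses = 0
    for kind, addr in ops:
        if addr in cached:
            hits += 1
            continue
        misses += 1
        # Reads always allocate; writes allocate only under write allocate.
        if kind == "read" or write_allocate:
            cached.add(addr)
    return hits, misses


OPS = [("write", 100), ("write", 100), ("read", 200),
       ("write", 200), ("write", 100)]

if __name__ == "__main__":
    print("no-write allocate:", count_hits(OPS, write_allocate=False))
    print("write allocate:   ", count_hits(OPS, write_allocate=True))
```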



Figure B.5 The organization of the data cache in the Opteron microprocessor. The 64 KiB cache is two-way set associative with 64-byte blocks. The 9-bit index selects among 512 sets. The four steps of a read hit, shown as circled numbers in order of occurrence, label this organization. Three bits of the block offset join the index to supply the


### **Cache Performance Equations**

CPU execution time = (CPU clock cycles + Memory stall cycles) × Clock cycle time

$$
\begin{aligned}
\text{Memory stall cycles} &= \text{Number of misses} \times \text{Miss penalty} \\
&= IC \times \frac{\text{Misses}}{\text{Instruction}} \times \text{Miss penalty} \\
&= IC \times \frac{\text{Memory accesses}}{\text{Instruction}} \times \text{Miss rate} \times \text{Miss penalty}
\end{aligned}
$$

$$
\begin{aligned}
\text{Memory stall clock cycles} ={}& IC \times \text{Reads per instruction} \times \text{Read miss rate} \times \text{Read miss penalty} \\
&+ IC \times \text{Writes per instruction} \times \text{Write miss rate} \times \text{Write miss penalty}
\end{aligned}
$$

We usually simplify the complete formula by combining the reads and writes and finding the average miss rates and miss penalty for reads and writes:

$$\text{Memory stall clock cycles} = IC \times \frac{\text{Memory accesses}}{\text{Instruction}} \times \text{Miss rate} \times \text{Miss penalty}$$


**Example** Assume we have a computer where the cycles per instruction (CPI) is 1.0 when all memory accesses hit in the cache. The only data accesses are loads and stores, and these total 50% of the instructions. If the miss penalty is 50 clock cycles and the miss rate is 1%, how much faster would the computer be if all instructions were cache hits?

**Answer** First compute the performance for the computer that always hits:

$$
\begin{aligned}
\text{CPU execution time} &= (\text{CPU clock cycles} + \text{Memory stall cycles}) \times \text{Clock cycle} \\
&= (IC \times CPI + 0) \times \text{Clock cycle} \\
&= IC \times 1.0 \times \text{Clock cycle}
\end{aligned}
$$

Now for the computer with the real cache, first we compute memory stall cycles:

$$
\begin{aligned}
\text{Memory stall cycles} &= IC \times \frac{\text{Memory accesses}}{\text{Instruction}} \times \text{Miss rate} \times \text{Miss penalty} \\
&= IC \times (1 + 0.5) \times 0.01 \times 50 \\
&= IC \times 0.75
\end{aligned}
$$

where the middle term (1+0.5) represents one instruction access and 0.5 data accesses per instruction. The total performance is thus

$$
\begin{aligned}
\text{CPU execution time}_{\text{cache}} &= (IC \times 1.0 + IC \times 0.75) \times \text{Clock cycle} \\
&= 1.75 \times IC \times \text{Clock cycle}
\end{aligned}
$$

The performance ratio is the inverse of the execution times:

$$\frac{\text{CPU execution time}_{\text{cache}}}{\text{CPU execution time}} = \frac{1.75 \times IC \times \text{Clock cycle}}{1.0 \times IC \times \text{Clock cycle}} = 1.75$$

The computer with no cache misses is 1.75 times faster.
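The arithmetic of this example can be replayed in a few lines. This is only a sketch of the calculation above; the function and variable names are illustrative, and the instruction count and clock cycle time are omitted because they cancel in the ratio.

```python
def memory_stall_cpi(accesses_per_instr, miss_rate, miss_penalty):
    """Memory stall cycles per instruction."""
    return accesses_per_instr * miss_rate * miss_penalty


BASE_CPI = 1.0                      # CPI when every access hits
ACCESSES_PER_INSTR = 1 + 0.5        # one instruction fetch + 0.5 data accesses
MISS_RATE = 0.01
MISS_PENALTY = 50                   # clock cycles

stall = memory_stall_cpi(ACCESSES_PER_INSTR, MISS_RATE, MISS_PENALTY)
speedup = (BASE_CPI + stall) / BASE_CPI   # execution time ratio, real vs. ideal

if __name__ == "__main__":
    print(f"stall cycles per instruction: {stall:.2f}")
    print(f"always-hit machine is {speedup:.2f}x faster")
```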

Some designers prefer measuring miss rate as misses per instruction rather than misses per memory reference. These two are related:

$$\frac{\text{Misses}}{\text{Instruction}} = \frac{\text{Miss rate} \times \text{Memory accesses}}{\text{Instruction count}} = \text{Miss rate} \times \frac{\text{Memory accesses}}{\text{Instruction}}$$

**Example** To show equivalency between the two miss rate equations, let's redo the preceding calculation in terms of misses per instruction, assuming 30 misses per 1000 instructions and a miss penalty of 25 clock cycles (equivalent to a 2% miss rate at 1.5 memory accesses per instruction). What is memory stall time in terms of instruction count?

**Answer** Recomputing the memory stall cycles:

$$
\begin{aligned}
\text{Memory stall cycles} &= \text{Number of misses} \times \text{Miss penalty} \\
&= IC \times \frac{\text{Misses}}{\text{Instruction}} \times \text{Miss penalty} \\
&= IC/1000 \times \frac{\text{Misses}}{\text{Instruction} \times 1000} \times \text{Miss penalty} \\
&= IC/1000 \times 30 \times 25 \\
&= IC/1000 \times 750 \\
&= IC \times 0.75
\end{aligned}
$$

We get the same answer as in the preceding example (0.75 stall cycles per instruction), showing equivalence of the two equations.
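The conversion between the two ways of quoting a miss rate is a one-line calculation each way. The sketch below uses invented function names and checks that both parameter sets appearing in these examples (1% with a 50-cycle penalty, and 30 misses per 1000 instructions with a 25-cycle penalty) lead to the same stall figure.

```python
def misses_per_1000_instr(miss_rate, accesses_per_instr):
    """Convert misses per memory reference to misses per 1000 instructions."""
    return 1000 * miss_rate * accesses_per_instr


def stall_cycles_per_instr(misses_per_1000, miss_penalty):
    """Memory stall cycles per instruction from misses per 1000 instructions."""
    return misses_per_1000 / 1000 * miss_penalty


if __name__ == "__main__":
    # 1% miss rate, 1.5 accesses per instruction, 50-cycle penalty
    m1 = misses_per_1000_instr(0.01, 1.5)
    print(m1, stall_cycles_per_instr(m1, 50))
    # 30 misses per 1000 instructions, 25-cycle penalty
    print(30, stall_cycles_per_instr(30, 25))
```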

#### **Cache Performance Equations**

$$2^{\text{index}} = \frac{\text{Cache size}}{\text{Block size} \times \text{Set associativity}}$$

$$\text{CPU execution time} = (\text{CPU clock cycles} + \text{Memory stall cycles}) \times \text{Clock cycle time}$$

$$\text{Memory stall cycles} = \text{Number of misses} \times \text{Miss penalty}$$

$$\text{Memory stall cycles} = IC \times \frac{\text{Misses}}{\text{Instruction}} \times \text{Miss penalty}$$

$$\frac{\text{Misses}}{\text{Instruction}} = \text{Miss rate} \times \frac{\text{Memory accesses}}{\text{Instruction}}$$

$$\text{Average memory access time} = \text{Hit time} + \text{Miss rate} \times \text{Miss penalty}$$

$$\text{CPU execution time} = IC \times \left( \text{CPI}_{\text{execution}} + \frac{\text{Memory stall clock cycles}}{\text{Instruction}} \right) \times \text{Clock cycle time}$$

$$\text{CPU execution time} = IC \times \left( \text{CPI}_{\text{execution}} + \frac{\text{Misses}}{\text{Instruction}} \times \text{Miss penalty} \right) \times \text{Clock cycle time}$$

$$\text{CPU execution time} = IC \times \left( \text{CPI}_{\text{execution}} + \text{Miss rate} \times \frac{\text{Memory accesses}}{\text{Instruction}} \times \text{Miss penalty} \right) \times \text{Clock cycle time}$$

$$\frac{\text{Memory stall cycles}}{\text{Instruction}} = \frac{\text{Misses}}{\text{Instruction}} \times (\text{Total miss latency} - \text{Overlapped miss latency})$$

$$\text{Average memory access time} = \text{Hit time}_{\text{L1}} + \text{Miss rate}_{\text{L1}} \times (\text{Hit time}_{\text{L2}} + \text{Miss rate}_{\text{L2}} \times \text{Miss penalty}_{\text{L2}})$$

$$\frac{\text{Memory stall cycles}}{\text{Instruction}} = \frac{\text{Misses}_{\text{L1}}}{\text{Instruction}} \times \text{Hit time}_{\text{L2}} + \frac{\text{Misses}_{\text{L2}}}{\text{Instruction}} \times \text{Miss penalty}_{\text{L2}}$$

Figure B.7 Summary of performance equations in this appendix. The first equation calculates the cache index size, and the rest help evaluate performance. The final two equations deal with multilevel caches, which are explained early in the next section. They are included here to help make the figure a useful reference.

### **Basic Cache Optimizations**

Average memory access time = Hit time + Miss rate  $\times$  Miss penalty

Hence, we organize six cache optimizations into three categories:

- Reducing the miss rate: larger block size, larger cache size, and higher associativity
- Reducing the miss penalty: multilevel caches and giving reads priority over writes
- Reducing the time to hit in the cache: avoiding address translation when indexing the cache

#### **Cache Miss Categories - the three C's**

- Compulsory also called cold start misses misses due to filling the cache
  - V (valid) bit set if block is being used, not set if empty
- Capacity cache cannot contain all blocks needed for execution of a program
- Conflict also caused collision misses a block evicts another block that is subsequently used again and must be retrieved, which may then evict the second block.

| Cache size (KiB) | Degree associative | Total miss rate | Compulsory | Compulsory (% of total) | Capacity | Capacity (% of total) | Conflict | Conflict (% of total) |
|------------------|--------------------|-----------------|------------|-------------------------|----------|-----------------------|----------|-----------------------|
| 4                | 1-way                 | 0.098              | 0.0001                                                                     | 0.1% | 0.070    | 72%  | 0.027    | 28% |
| 4                | 2-way                 | 0.076              | 0.0001                                                                     | 0.1% | 0.070    | 93%  | 0.005    | 7%  |
| 4                | 4-way                 | 0.071              | 0.0001                                                                     | 0.1% | 0.070    | 99%  | 0.001    | 1%  |
| 4                | 8-way                 | 0.071              | 0.0001                                                                     | 0.1% | 0.070    | 100% | 0.000    | 0%  |
| 8                | 1-way                 | 0.068              | 0.0001                                                                     | 0.1% | 0.044    | 65%  | 0.024    | 35% |
| 8                | 2-way                 | 0.049              | 0.0001                                                                     | 0.1% | 0.044    | 90%  | 0.005    | 10% |
| 8                | 4-way                 | 0.044              | 0.0001                                                                     | 0.1% | 0.044    | 99%  | 0.000    | 1%  |
| 8                | 8-way                 | 0.044              | 0.0001                                                                     | 0.1% | 0.044    | 100% | 0.000    | 0%  |
| 16               | 1-way                 | 0.049              | 0.0001                                                                     | 0.1% | 0.040    | 82%  | 0.009    | 17% |
| 16               | 2-way                 | 0.041              | 0.0001                                                                     | 0.2% | 0.040    | 98%  | 0.001    | 2%  |
| 16               | 4-way                 | 0.041              | 0.0001                                                                     | 0.2% | 0.040    | 99%  | 0.000    | 0%  |
| 16               | 8-way                 | 0.041              | 0.0001                                                                     | 0.2% | 0.040    | 100% | 0.000    | 0%  |
| 32               | 1-way                 | 0.042              | 0.0001                                                                     | 0.2% | 0.037    | 89%  | 0.005    | 11% |
| 32               | 2-way                 | 0.038              | 0.0001                                                                     | 0.2% | 0.037    | 99%  | 0.000    | 0%  |
| 32               | 4-way                 | 0.037              | 0.0001                                                                     | 0.2% | 0.037    | 100% | 0.000    | 0%  |
| 32               | 8-way                 | 0.037              | 0.0001                                                                     | 0.2% | 0.037    | 100% | 0.000    | 0%  |
| 64               | 1-way                 | 0.037              | 0.0001                                                                     | 0.2% | 0.028    | 77%  | 0.008    | 23% |
| 64               | 2-way                 | 0.031              | 0.0001                                                                     | 0.2% | 0.028    | 91%  | 0.003    | 9%  |
| 64               | 4-way                 | 0.030              | 0.0001                                                                     | 0.2% | 0.028    | 95%  | 0.001    | 4%  |
| 64               | 8-way                 | 0.029              | 0.0001                                                                     | 0.2% | 0.028    | 97%  | 0.001    | 2%  |
| 128              | 1-way                 | 0.021              | 0.0001                                                                     | 0.3% | 0.019    | 91%  | 0.002    | 8%  |
| 128              | 2-way                 | 0.019              | 0.0001                                                                     | 0.3% | 0.019    | 100% | 0.000    | 0%  |
| 128              | 4-way                 | 0.019              | 0.0001                                                                     | 0.3% | 0.019    | 100% | 0.000    | 0%  |
| 128              | 8-way                 | 0.019              | 0.0001                                                                     | 0.3% | 0.019    | 100% | 0.000    | 0%  |
| 256              | 1-way                 | 0.013              | 0.0001                                                                     | 0.5% | 0.012    | 94%  | 0.001    | 6%  |
| 256              | 2-way                 | 0.012              | 0.0001                                                                     | 0.5% | 0.012    | 99%  | 0.000    | 0%  |
| 256              | 4-way                 | 0.012              | 0.0001                                                                     | 0.5% | 0.012    | 99%  | 0.000    | 0%  |
| 256              | 8-way                 | 0.012              | 0.0001                                                                     | 0.5% | 0.012    | 99%  | 0.000    | 0%  |
| 512              | 1-way                 | 0.008              | 0.0001                                                                     | 0.8% | 0.005    | 66%  | 0.003    | 33% |
| 512              | 2-way                 | 0.007              | 0.0001                                                                     | 0.9% | 0.005    | 71%  | 0.002    | 28% |
| 512              | 4-way                 | 0.006              | 0.0001                                                                     | 1.1% | 0.005    | 91%  | 0.000    | 8%  |
| 512              | 8-way                 | 0.006              | 0.0001                                                                     | 1.1% | 0.005    | 95%  | 0.000    | 4%  |

Figure B.8 Total miss rate for each size cache and percentage of each according to the three C's. Compulsory



Figure B.9 Total miss rate (top) and distribution of miss rate (bottom) for each size cache according to the three C's for the data in Figure B.8. The top diagram shows the actual data cache miss rates, while the bottom diagram shows the percentage in each category. (Space allows the graphs to show one more cache size than can fit in Figure B.8.)

#### Larger Block Size - to Reduce Miss Rate



Figure B.10 Miss rate versus block size for five different-sized caches. Note that miss


| Block size (bytes) | 4K cache | 16K cache | 64K cache | 256K cache |
|--------------------|----------|-----------|-----------|------------|
| 16                 | 8.57%    | 3.94%     | 2.04%     | 1.09%      |
| 32                 | 7.24%    | 2.87%     | 1.35%     | 0.70%      |
| 64                 | 7.00%    | 2.64%     | 1.06%     | 0.51%      |
| 128                | 7.78%    | 2.77%     | 1.02%     | 0.49%      |
| 256                | 9.51%    | 3.29%     | 1.15%     | 0.49%      |

Figure B.11 Actual miss rate versus block size for the five different-sized caches in



#### Larger Caches - to Reduce Miss Rate

- **Advantage** - reduces miss rate
- **Disadvantage** - possibly longer hit time, higher cost and power

#### Higher Associativity - to Reduce Miss Rate

- **Advantage** - reduces miss rate by reducing conflict misses
- **Disadvantage** - reduces the total number of sets in the cache; may increase average memory access time

| Cache size (KiB) | 1-way | 2-way    | 4-way    | 8-way    |
|------------------|-------|----------|----------|----------|
| 4                | 3.44  | 3.25     | 3.22     | **3.28** |
| 8                | 2.69  | 2.58     | 2.55     | **2.62** |
| 16               | 2.23  | **2.40** | **2.46** | **2.53** |
| 32               | 2.06  | **2.30** | **2.37** | **2.45** |
| 64               | 1.92  | **2.14** | **2.18** | **2.25** |
| 128              | 1.52  | **1.84** | **1.92** | **2.00** |
| 256              | 1.32  | **1.66** | **1.74** | **1.82** |
| 512              | 1.20  | **1.55** | **1.59** | **1.66** |

Figure B.13 Average memory access time using miss rates in Figure B.8 for parameters in the example. *Boldface* type means that this time is higher than the number to the left, that is, higher associativity *increases* average memory access time.

#### **Multilevel Caches - Reduce Miss Penalty**

Average memory access time = Hit time<sub>L1</sub> + Miss rate<sub>L1</sub> × Miss penalty<sub>L1</sub>

and

Miss penalty<sub>L1</sub> = Hit time<sub>L2</sub> + Miss rate<sub>L2</sub> × Miss penalty<sub>L2</sub>

so

Average memory access time = Hit time<sub>L1</sub> + Miss rate<sub>L1</sub> × (Hit time<sub>L2</sub> + Miss rate<sub>L2</sub> × Miss penalty<sub>L2</sub>)

- Local miss rate: the number of misses in a cache divided by the total number of memory accesses to this cache. As you would expect, for the first-level cache it is equal to Miss rate<sub>L1</sub>, and for the second-level cache it is Miss rate<sub>L2</sub>.
- Global miss rate: the number of misses in the cache divided by the total number of memory accesses generated by the processor. Using the terms above, the global miss rate for the first-level cache is still just Miss rate<sub>L1</sub>, but for the second-level cache it is Miss rate<sub>L1</sub> × Miss rate<sub>L2</sub>.
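A short sketch shows how the two-level formula and the local and global miss rates fit together. Every number in the demo is hypothetical, chosen only to make the arithmetic easy to follow, and the function names are invented for this illustration.

```python
def amat_two_level(hit_l1, miss_rate_l1, hit_l2, local_miss_rate_l2, penalty_l2):
    """Average memory access time (in clock cycles) for an L1 + L2 hierarchy."""
    miss_penalty_l1 = hit_l2 + local_miss_rate_l2 * penalty_l2
    return hit_l1 + miss_rate_l1 * miss_penalty_l1


def global_miss_rate_l2(miss_rate_l1, local_miss_rate_l2):
    """Fraction of all processor accesses that miss in both L1 and L2."""
    return miss_rate_l1 * local_miss_rate_l2


if __name__ == "__main__":
    # Hypothetical parameters: 1-cycle L1 hit, 4% L1 miss rate,
    # 10-cycle L2 hit, 50% local L2 miss rate, 200-cycle L2 miss penalty.
    print("AMAT:", amat_two_level(1, 0.04, 10, 0.5, 200), "cycles")
    print("global L2 miss rate:", global_miss_rate_l2(0.04, 0.5))
```

With these numbers the local L2 miss rate looks alarming at 50%, yet only 2% of processor accesses go all the way to memory, which is why the global figure is usually the more informative one for L2.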

#### **Prioritize Read Misses over Writes - Reduce Miss Penalty**

- Reads are more common than writes
- Requires careful design of write buffers



| Technique                                             | Hit<br>time | Miss<br>penalty | Miss<br>rate | Hardware<br>complexity | Comment                                                                    |
|-------------------------------------------------------|-------------|-----------------|--------------|------------------------|----------------------------------------------------------------------------|
| Larger block size                                     |             | -               | +            | 0                      | Trivial; Pentium 4 L2 uses 128 bytes                                       |
| Larger cache size                                     | -           |                 | +            | 1                      | Widely used, especially for L2<br>caches                                   |
| Higher associativity                                  | -           |                 | +            | 1                      | Widely used                                                                |
| Multilevel caches                                     |             | +               |              | 2                      | Costly hardware; harder if L1 block size $\neq$ L2 block size; widely used |
| Read priority over writes                             |             | +               |              | 1                      | Widely used                                                                |
| Avoiding address translation during<br>cache indexing | +           |                 |              | 1                      | Widely used                                                                |

Figure B.18 Summary of basic cache optimizations showing impact on cache performance and complexity for

