| Size: 425 Comment:  | Size: 9553 Comment:  | 
| Deletions are marked like this. | Additions are marked like this. | 
| Line 1: | Line 1: | 
| Back to ComputerTerms See Also VirtualMemory | |
| Line 3: | Line 7: | 
| '''Direct Mapped Cache''' | == Direct Mapped Cache == | 
| Line 13: | Line 17: | 
| https://www.scotnpatti.com/images/directmappedcache.jpg '''Tags''' contain the address information required to identify if the information in a cache location coresponds to the data needed. These tags contain the upper bits of the memory address. '''Byte offset''' contains the number of bits required to specify bytes in a block. Since this example uses 4 byte ( or word) sized blocks the ''byte offset'' is '''2 bits''' '''Valid bit''' (or validity bit): When the computer is first started the tags contain junk. Since we don't want to recognize this junk, we start with all the valid bits set to 0 for invalid. Even as time goes on we may still have invalid information and so we have to check this bit. https://www.scotnpatti.com/images/directmappedcache2.jpg To determine the size of the cache above: {{{ cache size = (number of locations=1024 = 2^10) X ( Block size=32 + Validity bit=1 + (32 - TagSize=10 - Block bits=2)) = 2^10 X (32 + 1 + 32 - 10 - 2) = 1024 X (53) = 54,272 bits or 6,784 bytes }}} How many bits are required for a direct-mapped cache with 64 KB of data and one-word blocks, assuming a 32-bit address. {{{ 64/4 = 16 KWords = 2^16 Cache Size = 2^16 X (32 + (32-16-2) + 1) = 2^16 X 49 = 802816 bits = 100,352 bytes = 98 KB. }}} === Cache Misses === A ''data cache miss'' requires us to "freeze" the processor on the instruction that caused the miss. We then retrieve the data from memory, place it in cache and restart the instruction - this time we are guaranteed a hit. An ''instruction cache miss'' requires us to do the following 1. Send the original PC value (Current PC - 4) to memory. 1. Instruct main memory to performa read and wait for the memory to complete it's access. 1. Write the cache entry, putting the data from maemory in the data portion of the entry, writing the upper bitsof the address (from the ALU) into the tag field, and turning the valid bit on. 1. Restart the instruction execution at the first step, which will refetch the instruction, this time finding it in the cache. === Points to Ponder === 1. A 32-bit processor has 32-bit registers --> The smallest unit of data to be loaded anywhere is 32-bits = 4 bytes = 1 ''word''. A ''word'' becomes the smallest unit of the memory hierarchy. 1. Block size is a '''conversion factor''': [[latex2($$C \frac{Bytes}{Block}$$)]] == Locality == The schemes of using one word blocks does help with '''temporal locality''' but it does not help with '''spatial locality'''. In order to take advantage of spatial locality we must use multi-word blocks. This will also increase the effeciency of the cache because more bits will be used for data instead of overhead (tag, valid bit) == Direct Mapped cache with multi-word blocks == Since we are now talking about multi-word blocks, we must have a way to identify not just the word, but the block! [[latex2($$Block~Address = \left\lfloor \frac{Byte~Address}{Bytes~Per~Block}\right\rfloor$$)]] So here we have four different divisions of the address: '''Tag''' is now the ''Block Address'', we still use it to check and make sure that the block in the cache is the block that we want. '''Index''' identifies the word within the cache, that is Block Address MOD Number of ''Cache Blocks'' (remember Cache Blocks are multi-word now). How does Blocks in general relate to Cache Blocks? I surmise that they are the same! '''Block offset''' Identifies the word within the block to the multiplexor. This the size in bits = n where 2^n = words per block. '''Byte offset''' Will probably always be 2 since we are dealing with 32-bit machines with 32-bit words. https://www.scotnpatti.com/images/directmappedcache3.jpg === Cache Miss === We can handle read misses exactly as before, however write misses require a read from memory. P588 computer organization and design. Basically you can't just change the tag and go on like before. On a write through buffer (or any other type of buffer), you must read in the correct block other wise you will have two parts of different blocks in memory with the latest tag - now that's a problem! Increasing block size will not always result in better miss rates. You could end up with only one block in your cache... and your '''miss penalty''' will increase as the transfer time may also increase. This is cited as the major draw back. There are ways around this by having the memory subsystem return the '''critical word first'''. SEE ALSO MemoryDesign == Fully Associative Cache == This is the other extreme from direct mapped caches where a block can be placed in '''any location''' in the cache. In fully associative caches '''all entries in the cache must be searched'''. This causes the cache to be quite slow, but extremely flexible in how a block is replaced. == Set Associative Cache == In set associative caches there are a set number of locations (at least two) where each block can be placed. '''A cache with n locations for a block is called n-way set-associative cache.''' The replacement policies are discussed below, but it suffices here to say that most n-way set-associative cahces replace blocks in the set that is the '''least recently used (LRU)''' policy. '''KEY: ''n-way stands for the size of the set NOT the number of sets'' ''' In a set-associative cache, the set containing the block is given by {{{ (Block number) modulo (Number of sets in the cache) or (Block number) modulo (Cache Size / n) }}} ''Here again all the blocks in a set must be searched.'' Below is an example of the three types of mappings discussed so far: https://www.scotnpatti.com/images/cache3mappingtypes.jpg '''KEY: As you increase set associativity the miss rate goes down, however the hit time goes up.''' Also size of the cache and n are not independent when determining performance of the cache - think about why that is true === Locating a Block === The address breakdown is similar to the multiblock format above. || Tag ||||Index ||||Block Offset|| * Tag is essentially the left over bits at the top that we must match for a hit. '''T = 32 - x - y''' * Index is the value used to select the set: bits '''x = log2(number of sets)''' * The block offset here includes the byte offset which we just calculate as '''y = log2(bytes per block)''' === Block Replacement Schemes === Direct Mapped caches are trival because the can only be placed in one location '''LRU''' stand for Least Recently Used. For 2-way set associative caches we can use a single bit to identify the least recently used slot. For higher associativity you have a bigger problem. === Multi Level Caching === Suppose we have a 500 MHz processor with a base CPI of 1.0, when all references hit in the primary cache. Assume a main memory access time of 200 ns, including all the miss handling. Suppose the miss rate per instruction at the primary cache is 5%. How much faster will the machine be if we add a secondary cache that has a 20-ns access time for either a hit or a miss and is large enough to reduce the miss rate to main memory to 2%. Miss penalty to main memory is [[latex2($$\frac{200~ns}{\frac{2~ns}{Clock~Cycle}} = 100~Clock~Cycles$$)]] Miss penalty to [[latex2($$L_2 = \frac{20~ns}{\frac{2~ns}{Clock Cycle}} = 10~Clock~Cycles$$)]] {{{ CPI(L1 only) = 1 + 5% * 100 = 6 CPI(L1 & L2) = 1 + 5% * 10 + 2% * 100 = 1+1+2 = 4 }}} KEY: Having two caches allows us to focus L1 cache on reducing hit time, and allowing L2 cache to reduce long memory access time or Temporal and Spatial locality respectively (although these are not totally independent). We also need to make sure that we differentiate between global and local miss rates. The 2% given above for L2 cache is the global miss rate. The local miss rate is actually very bad. Of the 5% of the global accesses that L2 sees, it misses on 2% of them. Thus the local miss rate is 2%/5% = 40%! = The Big Picture = Whiel caches, TLBs, and Cirtual meory may intitially look very different, they rely on the same two principles of locality and can be understood by looking at how they deal with four questions: ||Question1:|||| Where can a block be placed?|| ||Answer: ||||One place ('''Direct Mapped'''), a few places (set associative), or any place (fully associative).|| ||Question2:|||| Hos is a block found?|| ||Answer: ||||There are four methods: '''Indexing''' (as in a direct mapped cache). '''limited search''' (as in a set-associative cache), '''Full Search''' (as in a fully associative cache), and a '''seperate lookup table''' (as in a page table).|| ||Question3: ||||What block is replaced on a miss?|| ||Answer: ||||Typically, either the LRU (least recently used) or a random block.|| ||Question4:|||| How are writes handled?|| ||Answer:|||| Each level in the hierarchy can use either a WriteThrough or a WriteBack scheme|| Back to ComputerTerms | 
Back to ComputerTerms
See Also VirtualMemory
Cache
Direct Mapped Cache
Cache's are directed mapped if each memory location is mapped to exactly one location in the cache. An example would be:
location = (Block Address) MOD (Number of cache blocks in the cache)
In this way we can map a memory location to a cache location.
Example: Suppose we have a cache with 8 slots 2^3. Then a word at 45 would be found at slot 5 in the cache.
https://www.scotnpatti.com/images/directmappedcache.jpg
Tags contain the address information required to identify if the information in a cache location coresponds to the data needed. These tags contain the upper bits of the memory address.
Byte offset contains the number of bits required to specify bytes in a block. Since this example uses 4 byte ( or word) sized blocks the byte offset is 2 bits
Valid bit (or validity bit): When the computer is first started the tags contain junk. Since we don't want to recognize this junk, we start with all the valid bits set to 0 for invalid. Even as time goes on we may still have invalid information and so we have to check this bit.
https://www.scotnpatti.com/images/directmappedcache2.jpg
To determine the size of the cache above:
cache size = (number of locations=1024 = 2^10) X 
             ( Block size=32 + Validity bit=1 + (32 - TagSize=10 - Block bits=2)) 
           = 2^10 X (32 + 1 + 32 - 10 - 2) 
           = 1024 X (53)
           = 54,272 bits or 6,784 bytesHow many bits are required for a direct-mapped cache with 64 KB of data and one-word blocks, assuming a 32-bit address.
64/4 = 16 KWords = 2^16
Cache Size = 2^16 X (32 + (32-16-2) + 1) 
           = 2^16 X 49 = 802816 bits = 100,352 bytes = 98 KB.
Cache Misses
A data cache miss requires us to "freeze" the processor on the instruction that caused the miss. We then retrieve the data from memory, place it in cache and restart the instruction - this time we are guaranteed a hit.
An instruction cache miss requires us to do the following
- Send the original PC value (Current PC - 4) to memory.
- Instruct main memory to performa read and wait for the memory to complete it's access.
- Write the cache entry, putting the data from maemory in the data portion of the entry, writing the upper bitsof the address (from the ALU) into the tag field, and turning the valid bit on.
- Restart the instruction execution at the first step, which will refetch the instruction, this time finding it in the cache.
Points to Ponder
- A 32-bit processor has 32-bit registers --> The smallest unit of data to be loaded anywhere is 32-bits = 4 bytes = 1 word. A word becomes the smallest unit of the memory hierarchy. 
- Block size is a conversion factor: 
latex2($$C \frac{Bytes}{Block}$$)
Locality
The schemes of using one word blocks does help with temporal locality but it does not help with spatial locality. In order to take advantage of spatial locality we must use multi-word blocks. This will also increase the effeciency of the cache because more bits will be used for data instead of overhead (tag, valid bit)
Direct Mapped cache with multi-word blocks
Since we are now talking about multi-word blocks, we must have a way to identify not just the word, but the block!
latex2($$Block~Address = \left\lfloor \frac{Byte~Address}{Bytes~Per~Block}\right\rfloor$$)
So here we have four different divisions of the address:
Tag is now the Block Address, we still use it to check and make sure that the block in the cache is the block that we want.
Index identifies the word within the cache, that is Block Address MOD Number of Cache Blocks (remember Cache Blocks are multi-word now). How does Blocks in general relate to Cache Blocks? I surmise that they are the same!
Block offset Identifies the word within the block to the multiplexor. This the size in bits = n where 2^n = words per block.
Byte offset Will probably always be 2 since we are dealing with 32-bit machines with 32-bit words.
https://www.scotnpatti.com/images/directmappedcache3.jpg
Cache Miss
We can handle read misses exactly as before, however write misses require a read from memory. P588 computer organization and design. Basically you can't just change the tag and go on like before. On a write through buffer (or any other type of buffer), you must read in the correct block other wise you will have two parts of different blocks in memory with the latest tag - now that's a problem!
Increasing block size will not always result in better miss rates. You could end up with only one block in your cache... and your miss penalty will increase as the transfer time may also increase. This is cited as the major draw back. There are ways around this by having the memory subsystem return the critical word first.
SEE ALSO MemoryDesign
Fully Associative Cache
This is the other extreme from direct mapped caches where a block can be placed in any location in the cache. In fully associative caches all entries in the cache must be searched. This causes the cache to be quite slow, but extremely flexible in how a block is replaced.
Set Associative Cache
In set associative caches there are a set number of locations (at least two) where each block can be placed. A cache with n locations for a block is called n-way set-associative cache. The replacement policies are discussed below, but it suffices here to say that most n-way set-associative cahces replace blocks in the set that is the least recently used (LRU) policy.
KEY: n-way stands for the size of the set NOT the number of sets
In a set-associative cache, the set containing the block is given by
(Block number) modulo (Number of sets in the cache) or (Block number) modulo (Cache Size / n)
Here again all the blocks in a set must be searched.
Below is an example of the three types of mappings discussed so far:
https://www.scotnpatti.com/images/cache3mappingtypes.jpg
KEY: As you increase set associativity the miss rate goes down, however the hit time goes up. Also size of the cache and n are not independent when determining performance of the cache - think about why that is true
Locating a Block
The address breakdown is similar to the multiblock format above.
| Tag | Index | Block Offset | ||
- Tag is essentially the left over bits at the top that we must match for a hit. T = 32 - x - y 
- Index is the value used to select the set: bits x = log2(number of sets) 
- The block offset here includes the byte offset which we just calculate as y = log2(bytes per block) 
Block Replacement Schemes
Direct Mapped caches are trival because the can only be placed in one location
LRU stand for Least Recently Used. For 2-way set associative caches we can use a single bit to identify the least recently used slot. For higher associativity you have a bigger problem.
Multi Level Caching
Suppose we have a 500 MHz processor with a base CPI of 1.0, when all references hit in the primary cache. Assume a main memory access time of 200 ns, including all the miss handling. Suppose the miss rate per instruction at the primary cache is 5%. How much faster will the machine be if we add a secondary cache that has a 20-ns access time for either a hit or a miss and is large enough to reduce the miss rate to main memory to 2%.
- Miss penalty to main memory is latex2($$\frac{200~ns}{\frac{2~ns}{Clock~Cycle}} = 100~Clock~Cycles$$) - Miss penalty to latex2($$L_2 = \frac{20~ns}{\frac{2~ns}{Clock Cycle}} = 10~Clock~Cycles$$) 
CPI(L1 only) = 1 + 5% * 100 = 6 CPI(L1 & L2) = 1 + 5% * 10 + 2% * 100 = 1+1+2 = 4
KEY: Having two caches allows us to focus L1 cache on reducing hit time, and allowing L2 cache to reduce long memory access time or Temporal and Spatial locality respectively (although these are not totally independent).
We also need to make sure that we differentiate between global and local miss rates. The 2% given above for L2 cache is the global miss rate. The local miss rate is actually very bad. Of the 5% of the global accesses that L2 sees, it misses on 2% of them. Thus the local miss rate is 2%/5% = 40%!
= The Big Picture =
Whiel caches, TLBs, and Cirtual meory may intitially look very different, they rely on the same two principles of locality and can be understood by looking at how they deal with four questions:
| Question1: | Where can a block be placed? | |
| Answer: | One place (Direct Mapped), a few places (set associative), or any place (fully associative). | |
| Question2: | Hos is a block found? | |
| Answer: | There are four methods: Indexing (as in a direct mapped cache). limited search (as in a set-associative cache), Full Search (as in a fully associative cache), and a seperate lookup table (as in a page table). | |
| Question3: | What block is replaced on a miss? | |
| Answer: | Typically, either the LRU (least recently used) or a random block. | |
| Question4: | How are writes handled? | |
| Answer: | Each level in the hierarchy can use either a WriteThrough or a WriteBack scheme | |
Back to ComputerTerms
