Why would computers transfer data and instructions from main memory to cache in blocks when this means a single cache miss may overwrite a lot [closed]

StackOverflow https://stackoverflow.com/questions/22446371

Question

According to the comments in the accepted answer of this question, instructions and data of a program are loaded from memory into cache in chunks. How does this affect cache mapping? For example, if there is one cache miss, a whole chunk of x bytes needs to be brought in, so it could end up overwriting other data/instructions in the cache that may still be needed. Overall, isn't this a worse design than working with just one instruction or piece of data at a time instead of chunks?

Is it necessary for multiple instructions to fit in one cache line because a cache line cannot be partially full, or, if it helps performance, can a single instruction occupy a cache line by itself?


Solution

There are valid reasons why cache blocks are relatively large.

The fraction of storage taken by tags decreases as the block size increases. Each block must have a tag indicating the address and the validity of the cache block, plus other metadata (the data for implementing the replacement policy is spread across the blocks of a given set [index]). A 4 KiB, 2-way cache on a machine with a 24-bit physical address space would, with 4-byte blocks, require 1024 14-bit tags (13 address bits plus a valid bit; supporting LRU replacement would add a bit per two entries). With no parity or ECC, this would be over 40% overhead. With 8-byte blocks, the overhead is reduced to 512 14-bit tags (c. 20%). With 16-byte blocks, the overhead is halved again.
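To make the arithmetic concrete, here is a minimal C sketch (not part of the original answer) that computes the tag storage for this hypothetical 4 KiB, 2-way cache with a 24-bit physical address space at several block sizes, assuming one address tag plus a valid bit per block and ignoring replacement-policy, parity, and ECC bits:

```c
#include <stdio.h>

/* Tag-overhead estimate for a hypothetical 4 KiB, 2-way cache on a
 * machine with a 24-bit physical address space. Each block's tag is
 * assumed to hold the address tag bits plus one valid bit; replacement
 * state, parity, and ECC are ignored. */

static unsigned ilog2(unsigned x) {
    unsigned n = 0;
    while (x > 1) { x >>= 1; n++; }
    return n;
}

int main(void) {
    const unsigned cache_bytes = 4 * 1024;  /* 4 KiB of data storage   */
    const unsigned ways        = 2;         /* 2-way set associative   */
    const unsigned addr_bits   = 24;        /* physical address width  */

    for (unsigned block_bytes = 4; block_bytes <= 64; block_bytes *= 2) {
        unsigned blocks       = cache_bytes / block_bytes;
        unsigned sets         = blocks / ways;
        unsigned tag_bits     = addr_bits - ilog2(sets) - ilog2(block_bytes);
        unsigned bits_per_tag = tag_bits + 1;               /* + valid bit */
        unsigned tag_storage  = blocks * bits_per_tag;      /* in bits     */
        double   overhead     = 100.0 * tag_storage / (cache_bytes * 8.0);

        printf("%2u-byte blocks: %4u tags x %2u bits = %6u bits (%4.1f%% of data)\n",
               block_bytes, blocks, bits_per_tag, tag_storage, overhead);
    }
    return 0;
}
```

Running it reproduces the figures above: roughly 44% overhead at 4-byte blocks, 22% at 8 bytes, and 11% at 16 bytes.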

Using larger blocks also reduces the number of requests needed to transfer a given amount of memory. Each request has some overhead for communication and handling; in designs that share address and data bandwidth, the address sent with each request makes the bandwidth overhead similar to that for tagging. There is also overhead in buffering requests. With snooping cache coherence (setting aside older shared-bus systems, snoop filtering, and similar optimizations), every miss must send a probe request to every other coherent cache, so fewer misses mean fewer probes.
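As a rough illustration of the amortization (again not from the original answer), the sketch below assumes a fixed 4 bytes of address/command traffic per request, an arbitrary figure chosen only for illustration rather than a parameter of any real bus protocol, and shows how that fixed cost shrinks relative to the data moved as the block size grows:

```c
#include <stdio.h>

int main(void) {
    /* Per-request overhead: bytes of address/command traffic on a shared
     * address/data path. The 4-byte figure is an illustrative assumption. */
    const double request_overhead_bytes = 4.0;
    const double data_bytes = 1024.0 * 1024.0;   /* move 1 MiB in total */

    for (unsigned block = 4; block <= 64; block *= 2) {
        double requests = data_bytes / block;
        double overhead = 100.0 * requests * request_overhead_bytes / data_bytes;
        printf("%2u-byte blocks: %8.0f requests, %5.1f%% request overhead\n",
               block, requests, overhead);
    }
    return 0;
}
```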

Using larger blocks also provides a simplistic prefetching mechanism. When the next level of the memory hierarchy has several times the latency, prefetching becomes relatively attractive even if it sometimes evicts more useful data. Even for non-stack data, there is typically enough spatial locality that the probability of a nearby location being accessed soon is higher than the probability of reuse for the block being evicted, which (in a 2-way cache with 512 sets and LRU replacement) may not have been used for more than 500 accesses. In addition, software can exploit this behavior, for example by grouping data that is used together, to get "free" prefetching.
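The effect of block-granularity fetches on code with spatial locality can be observed directly. The sketch below (a hedged illustration, not from the original answer) touches the same number of 4-byte array elements once sequentially and once with a stride of one assumed 64-byte cache line; on typical hardware the strided loop misses on nearly every access while the sequential loop gets the rest of each fetched line "for free", so the sequential loop usually runs several times faster. Line sizes and timings vary by machine:

```c
#define _POSIX_C_SOURCE 200809L
#include <stdio.h>
#include <stdlib.h>
#include <time.h>

#define N         (64UL * 1024 * 1024)  /* 64 Mi ints: far larger than cache */
#define LINE_INTS 16                    /* assumed 64-byte line / 4-byte int */

static double seconds(void) {
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return ts.tv_sec + ts.tv_nsec * 1e-9;
}

int main(void) {
    int *a = malloc(N * sizeof *a);
    if (!a) return 1;
    for (size_t i = 0; i < N; i++) a[i] = (int)i;

    /* Sequential: every element of each fetched cache line is used,
     * so only about 1 access in LINE_INTS misses. */
    double t0 = seconds();
    long long seq = 0;
    for (size_t i = 0; i < N / LINE_INTS; i++)
        seq += a[i];
    double t1 = seconds();

    /* Strided: the same number of accesses, but only one element per
     * line is used, so nearly every access misses. */
    long long str = 0;
    for (size_t i = 0; i < N; i += LINE_INTS)
        str += a[i];
    double t2 = seconds();

    printf("sequential: %.3f s   strided: %.3f s   (sums %lld, %lld)\n",
           t1 - t0, t2 - t1, seq, str);
    free(a);
    return 0;
}
```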

Larger cache blocks are also a better fit for DRAM characteristics. Since the days of fast page mode DRAM, accessing multiple locations in the same DRAM "page" has been significantly more bandwidth-efficient than more scattered accesses. DDR3 requires bursts of at least four sequential locations (so a 64-bit wide interface must fetch 32 bytes), and even that relies on burst chop, which costs DRAM chip bandwidth. By using larger cache blocks, all accesses become sequential bursts. (One could narrow the memory interface instead, but a 16-bit DDR3 interface would require two DRAM cycles just to transfer 64 bits. Alternatively, a prefetch buffer could hold a longer burst, and some memory controllers do prefetch nearby blocks to exploit DRAM page locality, but that adds some storage overhead and complexity.)

(Another, minor benefit may be facilitating way prediction and way memoization, which can be used to reduce latency or energy use. Even a simplistic predictor that associates the last hit way with the register used to generate the base address would have some success with unit-stride accesses to arrays and with data structures whose members are accessed with significant temporal locality.)
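The following is a toy software model of that idea, with made-up cache and predictor parameters; real way predictors are hardware structures inside the cache, so this only illustrates why a "same way as last time, per base register" guess works well for unit-stride accesses:

```c
#include <stdio.h>

/* Toy software model of way prediction: remember, per base register,
 * the way that hit last time and try it first on the next access made
 * through that register. All parameters are invented for illustration. */

#define SETS     64            /* small 2-way cache: 64 sets            */
#define WAYS     2
#define BLOCK    64            /* 64-byte blocks                        */
#define NUM_REGS 16            /* base registers tracked by predictor   */

static unsigned tags[SETS][WAYS];
static int      valid[SETS][WAYS];
static int      lru[SETS];                 /* least recently used way    */
static int      predicted_way[NUM_REGS];   /* last hit way per register  */

static long accesses, hits, predictions_right;

/* Access "addr", generated through base register "reg". */
static void access_cache(unsigned addr, int reg) {
    unsigned set = (addr / BLOCK) % SETS;
    unsigned tag = (addr / BLOCK) / SETS;
    int hit_way = -1;

    for (int w = 0; w < WAYS; w++)
        if (valid[set][w] && tags[set][w] == tag) { hit_way = w; break; }

    accesses++;
    if (hit_way >= 0) {
        hits++;
        if (predicted_way[reg] == hit_way)
            predictions_right++;   /* probing only the predicted way would
                                      have been enough for this hit */
        lru[set] = 1 - hit_way;
    } else {
        hit_way = lru[set];        /* miss: fill the LRU way */
        tags[set][hit_way]  = tag;
        valid[set][hit_way] = 1;
        lru[set] = 1 - hit_way;
    }
    predicted_way[reg] = hit_way;  /* remember the way for this register */
}

int main(void) {
    /* Unit-stride walk over an array through base register 3: sixteen
     * consecutive 4-byte elements share one 64-byte block, so the
     * "same way as last time" guess is right on almost every hit. */
    for (unsigned i = 0; i < 100000; i++)
        access_cache(0x10000 + i * 4, 3);

    printf("%ld accesses, %ld hits, prediction correct on %ld of the hits\n",
           accesses, hits, predictions_right);
    return 0;
}
```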

While larger cache blocks do waste some bandwidth and storage on data that is fetched but never used, and they introduce false sharing issues in multithreaded code, the benefits are significant.
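For readers unfamiliar with false sharing: when two threads repeatedly write different variables that happen to share one cache block, the block bounces between cores even though no data is logically shared. The sketch below (an illustration assuming 64-byte lines and a POSIX threads environment, not part of the original answer) typically runs noticeably faster in the padded case; build it with something like cc -O2 -pthread:

```c
#define _POSIX_C_SOURCE 200809L
#include <pthread.h>
#include <stdio.h>
#include <time.h>

#define ITERS 100000000UL

/* Two counters in the same (assumed 64-byte) cache block: writes from
 * different threads force the block to bounce between cores. */
struct same_block { volatile unsigned long a, b; };

/* The same counters padded onto separate cache blocks. */
struct separate { volatile unsigned long a; char pad[64]; volatile unsigned long b; };

static _Alignas(64) struct same_block s;
static _Alignas(64) struct separate   p;

static void *inc_s_a(void *x) { (void)x; for (unsigned long i = 0; i < ITERS; i++) s.a++; return NULL; }
static void *inc_s_b(void *x) { (void)x; for (unsigned long i = 0; i < ITERS; i++) s.b++; return NULL; }
static void *inc_p_a(void *x) { (void)x; for (unsigned long i = 0; i < ITERS; i++) p.a++; return NULL; }
static void *inc_p_b(void *x) { (void)x; for (unsigned long i = 0; i < ITERS; i++) p.b++; return NULL; }

/* Run two increment loops on two threads and time them together. */
static double run_pair(void *(*f1)(void *), void *(*f2)(void *)) {
    struct timespec t0, t1;
    pthread_t th1, th2;
    clock_gettime(CLOCK_MONOTONIC, &t0);
    pthread_create(&th1, NULL, f1, NULL);
    pthread_create(&th2, NULL, f2, NULL);
    pthread_join(th1, NULL);
    pthread_join(th2, NULL);
    clock_gettime(CLOCK_MONOTONIC, &t1);
    return (t1.tv_sec - t0.tv_sec) + (t1.tv_nsec - t0.tv_nsec) * 1e-9;
}

int main(void) {
    printf("counters in one cache block:  %.2f s\n", run_pair(inc_s_a, inc_s_b));
    printf("counters in separate blocks:  %.2f s\n", run_pair(inc_p_a, inc_p_b));
    return 0;
}
```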

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow