How does random access memory work? Why is it constant-time random-access?

Question 1

Because software likes O(1) "working" memory and thus the hardware is designed to behave that way

The basic point is that the address space of a program is thought as abstractly having O(1) access performance, i.e. whatever memory location you want to read, it should take some constant time (which anyway isn't related to the distance between it and the last memory access). So, being arrays nothing more than contiguous chunks of address space, they should inherit this property (accessing an element of an array is just a matter of adding the index to the start address of the array, and then dereferencing the obtained pointer).

This property comes from the fact that, in general, the address space of a program have some correspondence with the physical RAM of the PC, which, as the name (random access memory) partially implies, should have by itself the property that, whatever location in the RAM you want to access, you get to it in constant time (as opposed, for example, to a tape drive, where the seek time depends from the actual length of tape you have to move to get there).

Now, for "regular" RAM this property is (at least AFAIK) true - when the processor/motherboard/memory controller asks to a RAM chip to get some data, it does so in constant time; the details aren't really relevant for software development, and the internals of memory chips changed many times in the past and will change again in the future. If you are interested in an overview of the details of current RAMs, you can have a look here about DRAMs.

The general concept is that RAM chips don't contain a tape that must be moved, or a disk arm that must be positioned; when you ask to them a byte at some location, the work (mostly changing the settings of some hardware muxes, that connect the output to the cells where the byte status is stored) is the same for any location you could be asking for; thus, you get O(1) performance

There is some overhead behind this (the logical address have to be mapped to physical address by the MMU, the various motherboard pieces have to talk with each other to tell the RAM to fetch the data and bring it back to the processor, ...), but the hardware is designed to do so in more or less constant time.

So:

arrays map over address space, which is mapped over RAM, which has O(1) random access; being all maps (more or less) O(1), arrays keep the O(1) random access performance of RAM.

The point that does matter to software developers, instead, is that, although we see a flat address space and it normally maps over RAM, on modern machines it's false that accessing any element has the same cost. In facts, accessing elements that are in the same zone can be way cheaper than jumping around the address space, due to the fact that the processor has several onboard caches (=smaller but faster on-chip memories) that keep recently used data and memory that is in the same neighborhood; thus, if you have good data locality, continuous operations in memory won't keep hitting the ram (which have much longer latency than caches), and in the end your code will run way faster.

Also, under memory pressure, operating systems that provide virtual memory can decide to move rarely used pages of you address space to disk, and fetch them on demand if they are accessed (in response to a page fault); such operation is very costly, and, again, strongly deviates from the idea that accessing any virtual memory address is the same.

Question 2

The calculation to get from the start of the array to any given element takes only two operations, a multiplication (times sizeof(element)) and addition. Both of those operations are constant time. Often with today's processors it can be done in essentially no time at all, as the processor is optimized for this kind of access.

Question 3

When I say array[4] = 12 in a program, I'm really just storing the bit representation of the memory address into a register. This physical register in the hardware will turn on the corresponding electrical signals according to the bit representation I fed it. Those electrical signals will then somehow magically ( hopefully someone can explain the magic ) access the right memory address in physical/main memory.

I am not quite sure what you are asking but I dont see any answers related to what is really going on in the magic of the hardware. Hopefully I understood enough to go through this long winded explanation (which is still very high level).

array[4] = 12;

So from comments it sounds like it is understood that you have to get the base address of array, and then multiply by the size of an array element (or shift if that optimization is possible) to get the address (from your programs perspective) of the memory location. Right of the bat we have a problem. Are these items already in registers or do we have to go get them? The base address for array may or may not be in a register depending on code that surrounds this line of code, in particular code that precedes it. That address might be on the stack or in some other location depending on where you declared it and how. And that may or may not matter as to how long it takes. An optimizing compiler may (often) go so far as to pre-compute the address of array[4] and place that somewhere so it can go into a register and the multiply never happens at runtime, so it is absolutely not true that the computation of array[4] for a random access is a fixed amount of time compared to other random accesses. Depending on the processor, some immediate patterns are one instruction others take more that also has a factor on whether this address is read from .text or stack or etc, etc...To not chicken and egg that problem to death, assume we have the address of array[4] computed.

This is a write operation, from the programmers perspective. Starting with a simple processor, no cache, no write buffer, no mmu, etc. Eventually the simple processor will put the address on the edge of the processor core, with a write strobe and data, each processors bus is different than other processor families, but it is roughly the same the address and data can come out in the same cycle or in separate cycles. The command type (read, write) can happen at the same time or different. but the command comes out. The edge of the processor core is connected to a memory controller that decodes that address. The result is a destination, is this a peripheral if so which one and on what bus, is this memory, if so on what memory bus and so on. Assume ram, assume this simple processor has sram not dram. Sram is more expensive and faster in an apples to apples comparison. The sram has an address and write/read strobes and other controls. Eventually you will have the transaction type, read/write, the address and the data. The sram however its geometry is will route and store the individual bits in their individual pairs/groups of transistors.

A write cycle can be fire and forget. All the information that is needed to complete the transaction, this is a write, this is the address, this is the data, is known right then and there. The memory controller can if it chooses tell the processor that the write transaction is complete, even if the data is nowhere near the memory. That address/data pair will take its time getting to the memory and the processor can keep operating. Some systems though the design is such that the processors write transaction waits until a signal comes back to indicate that the write has made it all the way to the ram. In a fire and forget type setup, that address/data will be queued up somewhere, and work its way to the ram. The queue cant be infinitely deep otherwise it would be the ram itself, so it is finite, and it is possible and likely that many writes in a row can fill that queue faster than the other end can write to ram. At that point the current and or next write has to wait for the queue to indicate there is room for one more. So in situations like this, how fast your write happens, whether your simple processor is I/O bound or not has to do with prior transactions which may or may not be write instructions that preceded this instruction in question.

Now add some complexity. ECC or whatever name you want to call it (EDAC, is another one). The way an ECC memory works is the writes are all a fixed size, even if your implementation is four 8 bit wide memory parts giving you 32 bits of data per write, you have to have a fixed with that the ECC covers and you have to write the data bits plus the ecc bits all at the same time (have to compute the ecc over the full width). So if this was an 8 bit write for example into a 32 bit ECC protected memory then that write cycle requires a read cycle. Read the 32 bits (check the ecc on that read) modify the new 8 bits in that 32 bit pattern, compute the new ecc pattern, write the 32 bits plus ecc bits. Naturally that read portion of the write cycle can end up with an ecc error, which just makes life even more fun. Single bit errors can be corrected usually (what good is an ECC/EDAC if it cant), multi-bit errors not. How the hardware is designed to handle these faults affects what happens next, the read fault may just trickle back to the processor faulting the write transaction, or it may go back as an interrupt, etc. But here is another place where one random access is not the same as another, depending on the memory being accessed, and the size of the access a read-modify-write definitely takes longer than a simple write.

Dram can also fall into this fixed width category, even without ECC. Actually all memory falls into this category at some point. The memory array is optimized on the silicon for a certain height and width in units of bits. You cannot violate that memory it can only be read and written in units of that width at that level. The silicon libraries will include many geometries of ram, and the designers will chose those geometries for their parts, and the parts will have fixed limits and often you can use multiple parts to get some integer multiple width of that size, and sometimes the design will allow you to write to only one of those parts if only some of the bits are changing, or some designs will force all parts to light up. Notice how the next ddr family of modules that you plug into your home computer or laptop, the first wave is many parts on both sides of the board. Then as that technology gets older and more boring, it may change to fewer parts on both sides of the board, eventually becoming fewer parts on one side of the board before that technology is obsolete and we are already into the next.

This fixed width category also carries with it alignment penalties. Unfortunately most folks learn on x86 machines, which dont restrict you to aligned accesses like many other platforms. There is a definite performance penalty on x86 or others for unaligned accesses, if allowed. It is usually when folks go to a mips or usually an arm on some battery powered device is when they first learn as programmers about aligned accesses. And sadly find them to be painful rather than a blessing (due to the simplicity both in programming and for the hardware benefits that come from it). In a nutshell if your memory is say 32 bits wide and can only be accessed, read or write, 32 bits at a time that means it is limited to aligned accesses only. A memory bus on a 32 bit wide memory usually does not have the lower address bits a[1:0] because there is no use for them. those lower bits from a programmers perspective are zeros. if though our write was 32 bits against one of these 32 bit memories and the address was 0x1002. Then somebody along the line has to read the memory at address 0x1000 and take two of our bytes and modify that 32 bit value, then write it back. Then take the 32 bits at address 0x1004 and modify two bytes and write it back. four bus cycles for a single write. If we were writing 32 bits to address 0x1008 though it would be a simple 32 bit write, no reads.

sram vs dram. dram is painfully slow, but super cheap. a half to a quarter the number of transistors per bit. (4 for sram for example 1 for dram). Sram remembers the bit so long as the power is on. Dram has to be refreshed like a rechargable battery. Even if the power stays on a single bit will only be remembered for a very short period of time. So some hardware along the way (ddr controller, etc) has to regularly perform bus cycles telling that ram to remember a certain chunk of the memory. Those cycles steal time from your processor wanting to access that memory. dram is very slow, it may say 2133Mhz (2.133ghz) on the box. But it is really more like 133Mhz ram, right 0.133Ghz. The first cheat is ddr. Normally things in the digital world happen once per clock cycle. The clock goes to an asserted state then goes to a deasserted state (ones and zeros) one cycle is one clock. DDR means that it can do something on both the high half cycle and on the low half cycle. so that 2133Ghz memory really uses a 1066mhz clock. Then pipeline like parallelisms happen, you can shove in commands, in bursts, at that high rate, but eventually that ram has to actually get accessed. Overall dram is non-determinstic and very slow. Sram on the other hand, no refreshes required it remembers so long as the power is on. Can be several times faster (133mhz * N), and so on. It can be deterministic.

The next hurdle, cache. Cache is good and bad. Cache is generally made from sram. Hopefully you have an understanding of a cache. If the processor or someone upstream has marked the transaction as non-cacheable then it goes through uncached to the memory bus on the other side. If cacheable then the a portion of the address is looked up in a table and will result in a hit or miss. this being a write, depending on the cache and/or transaction settings, if it is a miss it may pass through to the other side. If there is a hit then the data will be written into the cache memory, depending on the cache type it may also pass through to the other side or that data may sit in the cache waiting for some other chunk of data to evict it and then it gets written to the other side. caches definitely make reads and sometimes make writes non-deterministic. Sequential accesses have the most benefit as your eviction rate is lower, the first access in a cache line is slow relative to the others, then the rest are fast. which is where we get this term of random access anyway. Random accesses go against the schemes that are designed to make sequential accesses faster.

Sometimes the far side of your cache has a write buffer. A relatively small queue/pipe/buffer/fifo that holds some number of write transactions. Another fire and forget deal, with those benefits.

Multiple layers of caches. l1, l2, l3...L1 is usually the fastest either by its technology or proximity, and usually the smallest, and it goes up from there speed and size and some of that has to do with cost of the memory. We are doing a write, but when you do a cache enabled read understand that if l1 has a miss it goes to l2 which if it has a miss goes to l3 which if it has a miss goes to main memory, then l3, l2 and l1 all will store a copy. So a miss on all 3 is of course the most painful and is slower than if you had no cache at all, but sequential reads will give you the cached items which are now in l1 and super fast, for the cache to be useful sequential reads over the cache line should take less time overall than reading that much memory directly from the slow dram. A system doesnt have to have 3 layers of caches, it can vary. Likewise some systems can separate instruction fetches from data reads and can have separate caches which dont evict each other, and some the caches are not separate and instruction fetches can evict data from data reads.

caches help with alignment issues. But of course there is an even more severe penalty for an unaligned access across cache lines. Caches tend to operate using chunks of memory called cache lines. These are often some integer multiple in size of the memory on the other side. a 32 bit memory for example the cache line might be 128 bits or 256 bits for example. So if and when the cache line is in the cache, then a read-modify-write due to an unaligned write is against faster memory, still more painful than aligned but not as painful. If it were an unaligned read and the address was such that part of that data is on one side of a cache line boundary and the other on the other then two cache lines have to be read. A 16 bit read for example can cost you many bytes read against the slowest memory, obviously several times slower than if you had no caches at all. Depending on how the caches and memory system in general is designed, if you do a write across a cache line boundary it may be similarly painful, or perhaps not as much it might have the fraction write to the cache, and the other fraction go out on the far side as a smaller sized write.

Next layer of complexity is the mmu. Allowing the processor and programmer the illusion of flat memory spaces and/or control over what is cached or not, and/or memory protection, and/or the illusion that all programs are running in the same address space (so your toolchain can always compile/link for address 0x8000 for example). The mmu takes a portion of the virtual address on the processor core side. looks that up in a table, or series of tables, those lookups are often in system address space so each one of those lookups may be one or more of everything stated above as each are a memory cycle on the system memory. Those lookups can result in ecc faults even though you are trying to do a write. Eventually after one or two or three or more reads, the mmu has determined what the address is on the other side of the mmu is, and the properties (cacheable or not, etc) and that is passed on to the next thing (l1, etc) and all of the above applies. Some mmus have a bit of a cache in them of some number of prior transactions, remember because programs are sequential, the tricks used to boost the illusion of memory performance are based on sequential accesses, not random accesses. So some number of lookups might be stored in the mmu so it doesnt have to go out to main memory right away...

So in a modern computer with mmus, caches, dram, sequential reads in particular, but also writes are likely to be faster than random access. The difference can be dramatic. The first transaction in a sequential read or write is at that moment a random access as it has not been seen ever or for a while. Once the sequence continues though the optimizations fall in order and the next few/some are noticeably faster. The size and alignment of your transaction plays an important role in performance as well. While there are so many non-deterministic things going on, as a programmer with this knowledge you modify your programs to run much faster, or if unlucky or on purpose can modify your programs to run much slower. Sequential is going to be, in general faster on one of these systems. random access is going to be very non-deterministic. array[4]=12; followed by array[37]=12; Those two high level operations could take dramatically different amounts of time, both in the computation of the write address and the actual writes themselves. But for example discarded_variable=array[3]; array[3]=11; array[4]=12; Can quite often execute significantly faster than array[3]=11; array[4]=12;

Question 4

Arrays in C and C++ have random access because they are stored in RAM - Random Access Memory in a finite, predictable order. As a result, a simple linear operation is required to determine the location of a given record (a[i] = a + sizeof(a[0]) * i). This calculation has constant time. From the CPU's perspective, no "seek" or "rewind" operation is required, it simply tells memory "load the value at address X".

However: On a modern CPU the idea that it takes constant time to fetch data is no-longer true. It takes constant amortized time, depending on whether a given piece of data is in cache or not.

Still - the general principle is that the time to fetch a given set of 4 or 8 bytes from RAM is the same regardless of the address. E.g. if, from a clean slate, you access RAM[0] and RAM[4294967292] the CPU will get the response within the same number of cycles.

#include <iostream>
#include <cstring>
#include <chrono>

// 8Kb of space.
char smallSpace[8 * 1024];

// 64Mb of space (larger than cache)
char bigSpace[64 * 1024 * 1024];

void populateSpaces()
{
    memset(smallSpace, 0, sizeof(smallSpace));
    memset(bigSpace, 0, sizeof(bigSpace));
    std::cout << "Populated spaces" << std::endl;
}

unsigned int doWork(char* ptr, size_t size)
{
    unsigned int total = 0;
    const char* end = ptr + size;
    while (ptr < end) {
        total += *(ptr++);
    }
    return total;
}

using namespace std;
using namespace chrono;

void doTiming(const char* label, char* ptr, size_t size)
{
    cout << label << ": ";
    const high_resolution_clock::time_point start = high_resolution_clock::now();
    auto result = doWork(ptr, size);
    const high_resolution_clock::time_point stop = high_resolution_clock::now();
    auto delta = duration_cast<nanoseconds>(stop - start).count();
    cout << "took " << delta << "ns (result is " << result << ")" << endl;
}

int main()
{
    cout << "Timer resultion is " << 
        duration_cast<nanoseconds>(high_resolution_clock::duration(1)).count()
        << "ns" << endl;

    populateSpaces();

    doTiming("first small", smallSpace, sizeof(smallSpace));
    doTiming("second small", smallSpace, sizeof(smallSpace));
    doTiming("third small", smallSpace, sizeof(smallSpace));
    doTiming("bigSpace", bigSpace, sizeof(bigSpace));
    doTiming("bigSpace redo", bigSpace, sizeof(bigSpace));
    doTiming("smallSpace again", smallSpace, sizeof(smallSpace));
    doTiming("smallSpace once more", smallSpace, sizeof(smallSpace));
    doTiming("smallSpace last", smallSpace, sizeof(smallSpace));
}

Live demo: http://ideone.com/9zOW5q

Output (from ideone, which may not be ideal)

Success  time: 0.33 memory: 68864 signal:0
Timer resultion is 1ns
Populated spaces
doWork/small: took 8384ns (result is 8192)
doWork/small: took 7702ns (result is 8192)
doWork/small: took 7686ns (result is 8192)
doWork/big: took 64921206ns (result is 67108864)
doWork/big: took 65120677ns (result is 67108864)
doWork/small: took 8237ns (result is 8192)
doWork/small: took 7678ns (result is 8192)
doWork/small: took 7677ns (result is 8192)
Populated spaces
strideWork/small: took 10112ns (result is 16384)
strideWork/small: took 9570ns (result is 16384)
strideWork/small: took 9559ns (result is 16384)
strideWork/big: took 65512138ns (result is 134217728)
strideWork/big: took 65005505ns (result is 134217728)

What we are seeing here are the effects of cache on memory access performance. The first time we hit smallSpace it takes ~8100ns to access all 8kb of small space. But when we call it again immediately after, twice, it takes ~600ns less at ~7400ns.

Now we go away and do bigspace, which is bigger than current CPU cache, so we know we've blown away the L1 and L2 caches.

Coming back to small, which we're sure is not cached now, we again see ~8100ns for the first time and ~7400 for the second two.

We blow the cache out and now we introduce a different behavior. We use a strided loop version. This amplifies the "cache miss" effect and significantly bumps the timing, although "small space" fits into L2 cache so we still see a reduction between pass 1 and the following 2 passes.