YES, in many contexts, there’s an advantage for sequential access patterns, but that advantage can range from negligible to profound depending on which other assumptions you are making and what you mean by the question. The dominant explanation(s) for why sequential memory access can be “faster” come from understanding CPU architecture, not the low-level DRAM circuitry, although you could also take the perspective that CPU architecture is a byproduct of optimizing for the performance characteristics of the DRAM circuitry.
Another post has already explained that the DRAM circuitry has significant optimizations for sequential access, and while that information is literally true, it’s misleading. I’d strongly caution that your code doesn’t directly interact with the DDR protocol; it’s executing on a CPU that interacts with the DDR protocol on your behalf, and the CPU does such a good job of hiding the low-level details that intuitions derived from understanding DRAM topology (banks, rows, columns, channels, etc) very rarely translate into useful intuitions for performance optimization. This is true even for professional software performance engineers who intimately understand the entire software/hardware stack and are willing to spend months to eke out a 10% win. I think about memory bandwidth every day (bandwidth is my system’s dominant cost, not CPU cycles), but I still haven’t found an opportunity to exploit my knowledge of how memory bandwidth comes from multiple parallel queues rather than a single queue for the entire system. If you are curious about low-level DRAM details, I’d also suggest this YouTube video, just don’t let anything you learn influence your programming choices. It doesn’t work how you think it works.
As a programmer, it’s far more critical to understand the processor architecture than the low-level DRAM architecture, and while you could spend several semesters learning CPU architecture (worth it!), there’s also a simple mental model that will make you head and shoulders more effective as a software performance engineer. I use it constantly, and only rarely use my more precise understanding of the details.
The most useful mental model:
The most important thing to understand is that modern CPUs always read from DRAM in units of 64 bytes, which is the size of one “cache line” in the CPU’s internal L1/L2/L3 SRAM caches. Maybe some future chip will move to wider 128- or 256-byte reads, but every chip I’m aware of uses 64-byte cache lines and it’s been this way for a long time; 64 bytes is an architectural “sweet spot”, driven by the width of the DRAM bus and how many sequential bytes DRAM wants to read to be efficient.
The mental model I’m proposing is to simply count the number of 64-byte DRAM reads (aka “L3 line fills”) that your code will trigger, something which can often, but not always, be accurately estimated from first principles. Your edge comes from the cases that can be estimated. The complicated boundary cases require profiling tools, although you can usually estimate lower and upper bounds on the answer using the simple method. It works as follows…
If you write code that makes millions of reads from a small array (eg, a 1 KiB array), then because that array fits in L1 cache, those aren’t DRAM reads and I don’t count them when I’m trying to count memory accesses. For cached reads like that, your code can approach two reads per clock cycle on most of the chips I can think of. Local variables, small lookup tables, generally anything small and frequently accessed isn’t going to DRAM. This stuff costs 0 in the mental model I’m proposing.
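To make “costs 0” concrete, here’s a minimal sketch (the table size and iteration count are just made-up illustration):

```cpp
#include <cstdint>
#include <cstddef>

// A 1 KiB lookup table: 256 x 4-byte entries. After the first few touches it
// lives entirely in L1 cache, so in the mental model its reads count as 0 DRAM reads.
static uint32_t table[256];

uint64_t hammer_small_table(size_t iters) {
    uint64_t sum = 0;
    for (size_t i = 0; i < iters; ++i) {
        sum += table[i & 255];  // always an L1 hit after warm-up: count it as 0
    }
    return sum;
}
```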
Even arrays up to a few MiB might fit in the L2 or L3 cache, as long as they are accessed frequently enough and there’s not too much competition from the other data being accessed. You usually don’t need to care that there are multiple levels of cache, with L1 being smaller/faster and L2/L3 being larger/slower. From a memory-bandwidth perspective, L3 cache is still “free” (it doesn’t consume precious DRAM bandwidth) and you’d need a very careful microbenchmark to detect that there’s a limit on the total throughput from L3 cache; in practice, it’s effectively infinite for most algorithms. It’s mainly for code limited by memory latency (serialized reads) that the distinction between L1 and L3 starts to really matter.
If you write code that makes millions of random reads from an array that’s big (several hundred MiB or bigger), then except for a few lucky cache hits, each of your reads will cause the CPU to read a 64-byte cache line from DRAM. You could be reading 1-byte values, 4-byte values, 8-byte values, doesn’t matter; I count all of those as +1 read, and each one is going to pull 64 bytes from the DRAM. Count the number of DRAM reads, not the bytes you are using.
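Here’s what that counting exercise looks like in code (the array and index-list sizes are arbitrary numbers for illustration):

```cpp
#include <cstdint>
#include <cstddef>
#include <vector>

// `big` is assumed to be hundreds of MiB of 4-byte values: far too large for any
// cache level. Each random index lands on a different 64-byte cache line (with
// high probability), so the model charges +1 DRAM read per iteration even though
// we only use 4 of the 64 bytes that get pulled in.
uint64_t sum_random_elements(const std::vector<uint32_t>& big,
                             const std::vector<uint32_t>& idx) {  // precomputed random indices
    uint64_t sum = 0;
    for (uint32_t i : idx) {
        sum += big[i];  // +1 DRAM read (64 bytes fetched, 4 bytes used)
    }
    return sum;         // estimated cost: idx.size() cache-line fills
}
```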
Similarly, if you’re working in C++, and you’ve got a struct Foo that’s 64 bytes (ie, sizeof(Foo) == 64), and you’ve got an array of Foo structs that’s 100+ MiB, and then millions of times you select a random Foo struct from the array and read several fields from that struct, then that’s still the same access pattern and it’s +1 read per Foo struct that you touch. Once you read any field from that Foo struct, it’s in the L1 cache and you can access all of the other fields at close to zero cost. Note that I deliberately crafted this example so that each Foo struct lines up exactly with one 64-byte cache line. If you did something weird such that the Foo structs aren’t aligned, causing them to straddle two different cache lines, that might force you to read two cache lines. Or if your Foo struct is 128 bytes, depending on which fields you access, that might be +1 or +2 reads. Most languages try to align data to improve performance, and fundamental numeric types (int / float / etc) are always aligned unless you’ve deliberately told the compiler to pack a struct (eg, GCC’s `packed` attribute), used raw pointer math to treat an arbitrary byte-aligned address as a numeric type, used C++ bitfields (which can obviously thwart alignment), or engaged in other equally funky hackery (eg, combining the `union` keyword with almost any other language feature can create professional opportunities to file bugs against the compiler team).
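If you want the compiler to enforce that one-struct-per-cache-line layout rather than hoping for it, here’s a sketch (field names are invented; the important parts are `alignas(64)` and the `static_assert`):

```cpp
#include <cstdint>
#include <cstddef>

// alignas(64) guarantees every Foo in an array starts on a cache-line boundary;
// the static_assert catches anyone who later grows the struct past 64 bytes.
struct alignas(64) Foo {
    uint64_t id;
    double   score;
    uint32_t flags;
    uint32_t counts[10];
    // implicit padding brings sizeof(Foo) up to exactly 64
};
static_assert(sizeof(Foo) == 64, "Foo must occupy exactly one cache line");

// Reading several fields of one randomly chosen Foo is still just +1 DRAM read:
// the first access pulls the whole line into L1 and the rest are nearly free.
uint64_t touch(const Foo* foos, size_t i) {
    return foos[i].id + foos[i].flags + foos[i].counts[3];
}
```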
Count the number of DRAM reads, not the bytes you are using.
OK, so what about sequential access? And what does “faster” even mean?
One “trivial” way that sequential access can be faster: if you sequentially read 1 million 1-byte values, that’s only 1M/64 ≈ 15.6k cache lines, versus 1 million cache lines if you make 1 million random 1-byte reads. This isn’t obvious to people who don’t think about software performance every day, but if you fully internalize my block-oriented mental model, then this example is “trivial” because you’re comparing 15.6k DRAM reads against 1M DRAM reads. Doing less work is obviously faster.
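Written out as the two loops being compared (the 1 million figure is the one from the text; `huge` is assumed to be far bigger than cache):

```cpp
#include <cstdint>
#include <cstddef>

// Sequential: 1,000,000 bytes / 64 bytes per cache line ~= 15,625 DRAM reads.
uint64_t sum_sequential(const uint8_t* buf) {
    uint64_t sum = 0;
    for (size_t i = 0; i < 1000000; ++i) sum += buf[i];
    return sum;
}

// Random: each index points somewhere in a huge buffer, so nearly every read is
// a fresh cache line: ~1,000,000 DRAM reads for the same number of bytes consumed.
uint64_t sum_random(const uint8_t* huge, const uint32_t* idx) {
    uint64_t sum = 0;
    for (size_t i = 0; i < 1000000; ++i) sum += huge[idx[i]];
    return sum;
}
```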
Another “unfair” case is if your random accesses are “serialized”, meaning that your code needs to complete one read to figure out the address of the next read. Traversing linked-lists is a classic example of this because until you read the data for one node, you can’t get the pointer to the next node. This makes linked-lists horrifically slow on modern CPUs, where we expect to achieve throughput by enabling a single thread of execution to be performing 10+ reads in parallel (maybe even 20 to 30 on some chips).
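The classic pointer-chasing shape, for reference (a sketch; assume the nodes are scattered across a big heap):

```cpp
#include <cstdint>

// Each iteration must finish loading the current node before it even knows the
// address of the next one, so the ~60ns DRAM latencies are paid one after another
// instead of 10+ at a time in parallel.
struct Node {
    uint64_t value;
    Node*    next;
};

uint64_t sum_list(const Node* head) {
    uint64_t sum = 0;
    for (const Node* n = head; n != nullptr; n = n->next) {
        sum += n->value;  // serialized DRAM read per node (for a large, scattered list)
    }
    return sum;
}
```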
A more subtle “unfair” case arises when the loop that’s issuing the reads also contains if-statements that sometimes go one way and sometimes the other: when the CPU makes the wrong branch prediction, it has to backtrack and loses the speculative progress it was making on figuring out the next memory address that your program is going to read. When this happens too often, it limits the parallelism of your random reads, and mitigating this problem is one of the most important use-cases for adding software prefetch instructions to code. These “branch mispredicts” have less impact on sequential access patterns because sequential access is fundamentally predictable, so the CPU doesn’t need to speculate through your code to figure out where the next read is going to go (the hardware prefetcher kicks in). It’s also possible to run into a related issue if there are too many assembly ops between your reads, such that the reorder buffer fills up with pending instructions before achieving sufficient read parallelism; this too is cured by issuing software prefetch instructions for the data you will read soon, with the prefetches typically spaced about 32 cache lines in advance of your actual reads. Or, if you simplify your loop such that it’s just doing the random reads, and doesn’t have either of those issues, you’ll typically find that software prefetching has no effect on performance because the reorder buffer was already achieving sufficient parallelism. So if you don’t know how to software prefetch, probably just don’t worry about it, and maybe avoid mixing unpredictable branches into the loops that read from memory.
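For anyone who does want to try it, here’s roughly what that prefetching pattern looks like with GCC/Clang’s `__builtin_prefetch` (a sketch; the distance of 32 is the rule-of-thumb from above, not a tuned value):

```cpp
#include <cstdint>
#include <cstddef>

// Random reads mixed with a data-dependent branch. The prefetch issued ~32
// iterations ahead keeps reads in flight even when the branch predictor guesses
// wrong and the speculative work gets thrown away.
uint64_t sum_flagged(const uint64_t* big, const uint32_t* idx, size_t n) {
    const size_t kAhead = 32;  // roughly 32 cache lines in front of the demand reads
    uint64_t sum = 0;
    for (size_t i = 0; i < n; ++i) {
        if (i + kAhead < n) {
            __builtin_prefetch(&big[idx[i + kAhead]]);  // hint only; result not needed yet
        }
        uint64_t v = big[idx[i]];  // the actual random DRAM read
        if (v & 1) {               // the kind of unpredictable branch that hurts speculation
            sum += v;
        }
    }
    return sum;
}
```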
But even if we stage a “fair” competition between sequentially reading X * 64-bytes versus making X random 64-byte reads, sequential reads still have some subtle efficiency advantages.
If you test performance from a single thread on a machine that’s mostly idle (ie, you’re not competing with other threads for access to the memory bus), then you’ll find that on many chips, you can perform roughly twice as many sequential reads per second as random reads. The explanation for this is that the CPU has clever circuitry that watches your memory access pattern and tries to predict which cache lines your code is going to want next (the “hardware prefetcher”). Why does this hardware prefetching help? Well, this gets a bit deep. My current understanding is that in the random access case, your parallelism is limited by the fact that each core can only have a limited number of L1 misses outstanding at one time. Intel calls this the “Line-Fill Buffer” and it usually has something like 10 to 24 slots (varies between chips), and there are similar mechanisms on AMD and Arm. DRAM physics dictate that reads have ~60ns of latency (and this barely changes as the tech evolves), which is ~400 clock cycles, so that would rate-limit the throughput of random parallel reads to at most 1 per ~40 clock cycles. Versus when the hardware prefetch mechanisms kick in, they don’t consume line-fill slots and can effectively perform more reads in parallel. Sadly, if you issue software prefetch instructions, CPUs seem to treat those as needing a line-fill entry, so it’s only reads triggered by hardware prefetching that are able to achieve greater parallelism than the size of your line-fill buffer.
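Spelled out as arithmetic, using the round numbers above (actual slot counts and latencies vary from chip to chip):

```cpp
// Back-of-envelope bound: with ~10 line-fill slots and ~400 cycles of DRAM latency,
// demand misses alone can't complete more than 10 reads per 400 cycles,
// ie one random read per ~40 clock cycles.
constexpr int    kLineFillSlots = 10;    // 10-24 depending on the chip
constexpr int    kDramLatency   = 400;   // ~60ns expressed in clock cycles
constexpr double kCyclesPerRead = double(kDramLatency) / kLineFillSlots;  // = 40
```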
But this performance difference evaporates when you start benchmarking multi-threaded systems. Because the memory bus is shared between all threads, there’s nowhere near enough bandwidth from the memory subsystem for all threads to sustain the kind of memory read throughput that you can achieve with a single thread that has no competition (eg, 10+ GiB/s for the uncontended thread vs <1 GiB/s per thread when everyone is competing). When saturating the memory bus from multiple threads, I get very close to the same read throughput when testing a largish array (~0.5 GiB) that’s too big for cache, but small enough that I don’t incur any TLB misses in the random access scenario, and even when I use a huge array (~50 GiB) where TLB misses come into the picture, random access is only ~10% slower. Note that while the TLB itself only has a limited number of entries, the page table entries it is reading will often be in the CPU’s cache, so TLB misses add a little bit of latency, but don’t necessarily increase the number of DRAM reads. With huge pages, each page table entry can map as much as 1 GiB of RAM, so it only takes a handful of page table entries to map an entire system’s worth of RAM into the same process, and that small page table is easily cacheable. Note that this observation about sequential and random having roughly the same system-level throughput is what allows me to conclude that the single-threaded advantage for sequential access is caused by the line-fill buffer mechanism, as opposed to an actual savings at the level of the DDR protocol timings.
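The huge-page arithmetic, written out (the 512 GiB machine size is an assumption for illustration):

```cpp
// With 1 GiB huge pages, even a 512 GiB machine needs only 512 page-table
// entries, roughly 4 KiB of data at 8 bytes per entry -- trivially cache-resident.
constexpr unsigned long long kRamBytes   = 512ULL << 30;            // assumed RAM size
constexpr unsigned long long kPageBytes  = 1ULL << 30;              // 1 GiB huge pages
constexpr unsigned long long kNumEntries = kRamBytes / kPageBytes;  // = 512
constexpr unsigned long long kTableBytes = kNumEntries * 8;         // ~4 KiB of entries
```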
So in the end, once you internalize my block-oriented mental model, from a throughput perspective, random 64-byte line-fills cost roughly the same as sequential 64-byte line-fills. You might need to use software prefetching to ensure you get sufficient read parallelism in branch-heavy code. Your code might need to be efficient enough in other ways to make the memory bus your tightest resource limit. And you’ll start to think really hard about how to pack bytes together to maximize the value of each line-fill that your code triggers. But when counting memory bandwidth, it’s just 1 read, 2 reads, 3 reads, 4 reads, multiply by 64 bytes. Ignore the small arrays, count the big arrays, and if you’ve got an intermediate-sized array (or a non-uniform access probability), pretending it’s a big array can give you a worst-case analysis; but if you want the precise answer, first principles are insufficient and you’ll have to use fancy profiling tools to measure your full stack under a realistic loadtest to empirically determine how often your reads hit in the cache when all of your code paths are competing for the same cache space.

Note that when running in a shared hosting environment (eg, if you buy a cloud VM that’s less than half the full machine size, meaning you get a fractional chip), the cache hit-rate and memory-bandwidth analysis in general becomes a la-la land of unknowability, because the other threads sharing your L3 cache and memory bus are running someone else’s code, and what that code is could change from day to day, meaning your code could get faster or slower based on events out of your own control (this “sharing effect” is a big problem with present-day computer architectures).
My final word of advice is this: Don’t optimize anything! The vast majority of software logic is not performance critical, executes in negligible time, and runs on trivially cheap hardware resources that aren’t even fully utilized. Write your code to be clear and concise, and easy to express, then only make it faster if it takes too long to run AND you’ve already written decent tests for it, as well as a benchmark to measure improvement. Or if you want to get into performance engineering, pursue software optimization challenges in your free-time, but don’t make the mistake of mixing over-optimization into your productive coding work, especially not anything that needs to be maintained or updated or used by anyone else.