Flash has a flash translation layer (FTL). It translates logical block addresses (LBAs) into physical addresses ("PHY").
Flash can write at a granularity similar to a memory page (pages of around 4-16 KB). It can erase only whole erase blocks, at a much larger granularity (around 512-ish pages per erase block).
The FTL will try to find free pages to write your data to. In the background, it will also try to move data around to generate unused erase blocks and then erase them.
In flash, seeks are essentially free. That means it no longer matters whether blocks are adjacent. Also, because of the FTL, adjacent LBAs are not necessarily adjacent on the physical layer. And even if you do not rewrite a block, garbage collection may move data around at the PHY layer in order to generate completely empty erase blocks.
The net effect is that positioning as seen from the OS no longer matters, and the OS layer has zero control over adjacency and erase at the PHY layer. Rewriting, defragging, or other OS-level operations cannot control what happens physically at the flash layer.
TRIM is a "blatant layering violation" in the Linus sense: It tells the disk "hardware" what the OS thinks it no longer needs. TRIM'ed blocks can be given up and will not be kept when the garbage collector tries to free up an erase page.
> In flash, seeks are essentially free. That means it no longer matters whether blocks are adjacent.
> The net effect is that positioning as seen from the OS no longer matters, and the OS layer has zero control over adjacency and erase at the PHY layer. Rewriting, defragging, or other OS-level operations cannot control what happens physically at the flash layer.
I don't agree with this. The "OS visible position" is relevant, because it influences what can realistically be written together (multiple larger IOs targeting consecutive LBAs in close time proximity). And writing data in larger chunks is very important for good performance, particularly in sustained write workloads. And sequential IO (in contrast to small random IOs) does influence how the FTL will lay out the data to some degree.
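To illustrate what I mean by "written together": writes to consecutive LBAs that are queued close in time can be coalesced into one large IO before they ever reach the device, while scattered LBAs cannot. A hypothetical sketch of that merging step (mine, not actual block layer code):

```python
# Hypothetical sketch of block-layer-style request merging: writes to consecutive
# LBAs queued close together can be submitted to the device as one large IO.
def merge_adjacent(requests):
    """requests: list of (start_lba, num_blocks) queued close together in time."""
    merged = []
    for start, length in sorted(requests):
        if merged and merged[-1][0] + merged[-1][1] == start:
            merged[-1] = (merged[-1][0], merged[-1][1] + length)   # extend previous IO
        else:
            merged.append((start, length))
    return merged

# Sequential LBAs collapse into a single large write; scattered LBAs stay small.
print(merge_adjacent([(100, 8), (108, 8), (116, 8)]))   # -> [(100, 24)]
print(merge_adjacent([(100, 8), (9000, 8), (42, 8)]))   # -> three separate IOs
```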
Disagree, because my understanding is that OS-visible positions have zero relevance to what will actually be translated to PHYs.
If you feed your NVMe a stream of 1GB writes spread out at completely randomised OS-visible places (LBAs), the FTL may very well write it sequentially and you get solid sustained write performance.
Conversely, you may try to write 1GB of sequential LBAs, and your FTL may very well spread it out all across the physical blocks simply because that's what’s available.
What I'm saying is that sequential read and write workloads are good, but whether the OS considers them sequential or not in terms of LBAs is irrelevant. The controller ignores LBAs and abstracts everything away.
My understanding could be wrong, so please correct me if I am.
That may sometimes be true the first time you write the random data (but in my experience it's often not true even then, and only if you carefully TRIMed the whole filesystem and it was mostly empty). But on later random writes it's rarely true, unless your randomness pattern is exactly the same as in the first run. To make room, the FTL will (often in the background) need to relocate the still-valid parts of the erase-block-sized chunks written in previous runs, just to be able to write out the new random writes. At some point new writes have to wait for this, slowing things down.
Whereas with larger/sequential writes, there's commonly no need for read-modify-write cycles. Entire previous erase-block-sized chunks can simply be marked as reusable once they're replaced by new content - the old data isn't relevant anymore.
This is pretty easy to see by just running benchmarks with sustained sequential and random write IO. But on some devices it'll take a bit - initially the writes all land in a faster area (e.g. SLC flash instead of denser/cheaper MLC/TLC/QLC).
Of course, if all the random writes are >= erase block size, with a consistent alignment to multiples of the write size, then you're not going to see this - it's essentially sequential enough.
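If you want to see the read-modify-write effect without real hardware, here's a rough simulation (my own toy model, illustrative numbers only): a mostly full device, greedy GC that evicts the erase block with the fewest valid pages, and a count of how many pages GC has to copy per user write under sequential vs. random overwrites.

```python
import random

# Illustrative-only simulation of GC copy cost ("write amplification") under
# random vs. sequential overwrites on a mostly full device. Assumptions (mine,
# not measured from any real device): greedy GC, ~7% spare area, page-sized writes.
PAGES_PER_BLOCK = 128
NUM_BLOCKS = 256
LOGICAL_PAGES = int(PAGES_PER_BLOCK * NUM_BLOCKS * 0.93)     # rest is spare area

def simulate(pattern, writes=200_000, seed=0):
    rng = random.Random(seed)
    valid = [set() for _ in range(NUM_BLOCKS)]   # valid LBAs held by each erase block
    used = [0] * NUM_BLOCKS                      # physical pages written in each block
    where = {}                                   # LBA -> erase block currently holding it
    free = list(range(1, NUM_BLOCKS))
    open_blk = 0
    copies = 0

    def new_open_block():
        nonlocal copies
        if free:
            return free.pop()
        # GC: erase the block with the fewest valid pages; its survivors get
        # copied (simplified: back into the freshly erased block).
        b = min((x for x in range(NUM_BLOCKS) if x != open_blk),
                key=lambda x: len(valid[x]))
        copies += len(valid[b])
        used[b] = len(valid[b])
        return b

    def write_page(lba):
        nonlocal open_blk
        while used[open_blk] == PAGES_PER_BLOCK:
            open_blk = new_open_block()
        if lba in where:
            valid[where[lba]].discard(lba)       # the old copy becomes stale garbage
        valid[open_blk].add(lba)
        where[lba] = open_blk
        used[open_blk] += 1

    for lba in range(LOGICAL_PAGES):             # preload: device ends up ~93% full
        write_page(lba)
    copies = 0                                   # only count GC work in the test phase
    for i in range(writes):
        lba = rng.randrange(LOGICAL_PAGES) if pattern == "random" else i % LOGICAL_PAGES
        write_page(lba)
    return copies / writes                       # extra GC copies per user write

for pattern in ("sequential", "random"):
    print(pattern, round(simulate(pattern), 2))
```

On this toy model the sequential run frees blocks that are already fully stale (near-zero copies), while the random run has to relocate several still-valid pages for every page of new data. The absolute numbers mean nothing; the gap is the point.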
Thanks for this part, I feel like this was a crucial piece of information I was missing. It also explains my observations about TRIM not being as important as people claim it is; the firmware on modern flash storage seems more than capable of handling this without OS intervention.
TRIM is useful: it gives the GC important information.
TRIM is not that important as long as the device is not full (less than 80%, generally speaking, but it is very easy to produce pathological cases that are way off in either direction). Once the device fills up above that, it is crucial.
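A crude back-of-envelope for why a threshold around 80% shows up (my assumption: a GC victim holds roughly the device-wide average fraction of live data, which is pessimistic for greedy GC, but the shape of the curve is what matters):

```python
# Crude back-of-envelope for why TRIM matters more as the drive fills up.
# Assumption (mine): a GC'd erase block holds roughly the device-wide average
# fraction `live` of valid pages, so freeing it costs about live/(1 - live)
# extra copies per page of new user data.
def extra_copies_per_user_write(live):
    return live / (1.0 - live)

for pct in (50, 70, 80, 90, 95):
    live = pct / 100
    print(f"{pct}% live: ~{extra_copies_per_user_write(live):.1f} pages copied per page written")

# Without TRIM, deleted-but-not-trimmed data still counts as "live" here, which
# is why the difference barely shows below ~80% but explodes above it.
```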