Flash has a flash translation layer (FTL). It translates logical block addresses (LBAs) into physical addresses ("PHY").
Flash can write at a granularity similar to a memory page (pages of around 4-16 KB). It can erase only whole erase blocks, at a much larger granularity (around 512-ish pages per erase block).
The FTL will try to find free pages to write your data to. In the background, it will also try to move data around to generate unused erase blocks and then erase them.
In flash, seeks are essentially free. That means it no longer matters whether blocks are adjacent. Also, because of the FTL, adjacent LBAs are not necessarily adjacent on the physical layer. And even if you do not rewrite a block, garbage collection may move data around at the PHY layer in order to generate completely empty erase blocks.
The net effect is that positioning as seen from the OS no longer matters, and the OS layer has zero control over adjacency and erase at the PHY layer. Rewriting, defragging, or other OS-level operations cannot control what happens physically at the flash layer.
TRIM is a "blatant layering violation" in the Linus sense: It tells the disk "hardware" what the OS thinks it no longer needs. TRIM'ed blocks can be given up and will not be kept when the garbage collector tries to free up an erase page.
> In flash, seeks are essentially free. That means it no longer matters whether blocks are adjacent.
> The net effect is that positioning as seen from the OS no longer matters, and the OS layer has zero control over adjacency and erase at the PHY layer. Rewriting, defragging, or other OS-level operations cannot control what happens physically at the flash layer.
I don't agree with this. The "OS visible position" is relevant, because it influences what can realistically be written together (multiple larger IOs targeting consecutive LBAs in close time proximity). And writing data in larger chunks is very important for good performance, particularly in sustained write workloads. And sequential IO (in contrast to small random IOs) does influence how the FTL will lay out the data to some degree.
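To illustrate what I mean by "written together": writes to consecutive LBAs that are queued close in time can be coalesced into one large IO before they ever reach the device, while scattered LBAs cannot. A hypothetical sketch of that merging step (mine, not actual block layer code):

```python
# Hypothetical sketch of block-layer-style request merging: writes to consecutive
# LBAs queued close together can be submitted to the device as one large IO.
def merge_adjacent(requests):
    """requests: list of (start_lba, num_blocks) queued close together in time."""
    merged = []
    for start, length in sorted(requests):
        if merged and merged[-1][0] + merged[-1][1] == start:
            merged[-1] = (merged[-1][0], merged[-1][1] + length)   # extend previous IO
        else:
            merged.append((start, length))
    return merged

# Sequential LBAs collapse into a single large write; scattered LBAs stay small.
print(merge_adjacent([(100, 8), (108, 8), (116, 8)]))   # -> [(100, 24)]
print(merge_adjacent([(100, 8), (9000, 8), (42, 8)]))   # -> three separate IOs
```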
Disagree, because my understanding is that OS-visible positions have zero relevance to what will actually be translated to PHYs.
If you feed your NVMe a stream of 1GB writes spread out at completely randomised OS-visible places (LBAs), the FTL may very well write it sequentially and you get solid sustained write performance.
Conversely, you may try to write 1GB of sequential LBAs, and your FTL may very well spread it out all across the physical blocks simply because that's what’s available.
What I'm saying is that sequential read and write workloads are good, but whether the OS considers them sequential or not in terms of LBAs is irrelevant. The controller ignores LBAs and abstracts everything away.
My understanding could be wrong, so please correct me if I am.
That may sometimes be true the first time you write the random data (but in my experience it's often not true even then, and only if you carefully TRIMed the whole filesystem and it was mostly empty). But on later random writes it's rarely true, unless your randomness pattern is exactly the same as in the first run. To make room, the FTL will (often in the background) need to relocate the still-valid parts of the erase-block-sized chunks written in previous runs, just to be able to write out the new random writes. At some point new writes have to wait for this, slowing things down.
Whereas with larger/sequential writes, there's commonly no need for read-modify-write cycles. Entire previous erase-block-sized chunks can simply be marked as reusable once they're replaced by new content - the old data isn't relevant anymore.
This is pretty easy to see by just running benchmarks with sustained sequential and random write IO. But on some devices it'll take a bit - initially the writes all land in a faster area (e.g. SLC flash instead of denser/cheaper MLC/TLC/QLC).
Of course, if all the random writes are >= erase block size, with a consistent alignment to multiples of the write size, then you're not going to see this - it's essentially sequential enough.
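If you want to see the read-modify-write effect without real hardware, here's a rough simulation (my own toy model, illustrative numbers only): a mostly full device, greedy GC that evicts the erase block with the fewest valid pages, and a count of how many pages GC has to copy per user write under sequential vs. random overwrites.

```python
import random

# Illustrative-only simulation of GC copy cost ("write amplification") under
# random vs. sequential overwrites on a mostly full device. Assumptions (mine,
# not measured from any real device): greedy GC, ~7% spare area, page-sized writes.
PAGES_PER_BLOCK = 128
NUM_BLOCKS = 256
LOGICAL_PAGES = int(PAGES_PER_BLOCK * NUM_BLOCKS * 0.93)     # rest is spare area

def simulate(pattern, writes=200_000, seed=0):
    rng = random.Random(seed)
    valid = [set() for _ in range(NUM_BLOCKS)]   # valid LBAs held by each erase block
    used = [0] * NUM_BLOCKS                      # physical pages written in each block
    where = {}                                   # LBA -> erase block currently holding it
    free = list(range(1, NUM_BLOCKS))
    open_blk = 0
    copies = 0

    def new_open_block():
        nonlocal copies
        if free:
            return free.pop()
        # GC: erase the block with the fewest valid pages; its survivors get
        # copied (simplified: back into the freshly erased block).
        b = min((x for x in range(NUM_BLOCKS) if x != open_blk),
                key=lambda x: len(valid[x]))
        copies += len(valid[b])
        used[b] = len(valid[b])
        return b

    def write_page(lba):
        nonlocal open_blk
        while used[open_blk] == PAGES_PER_BLOCK:
            open_blk = new_open_block()
        if lba in where:
            valid[where[lba]].discard(lba)       # the old copy becomes stale garbage
        valid[open_blk].add(lba)
        where[lba] = open_blk
        used[open_blk] += 1

    for lba in range(LOGICAL_PAGES):             # preload: device ends up ~93% full
        write_page(lba)
    copies = 0                                   # only count GC work in the test phase
    for i in range(writes):
        lba = rng.randrange(LOGICAL_PAGES) if pattern == "random" else i % LOGICAL_PAGES
        write_page(lba)
    return copies / writes                       # extra GC copies per user write

for pattern in ("sequential", "random"):
    print(pattern, round(simulate(pattern), 2))
```

On this toy model the sequential run frees blocks that are already fully stale (near-zero copies), while the random run has to relocate several still-valid pages for every page of new data. The absolute numbers mean nothing; the gap is the point.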
Thanks for this part, I feel like this was a crucial piece of information I was missing. It also explains my observations about TRIM not being as important as people claim it is; the firmware on modern flash storage seems more than capable of handling this without OS intervention.
TRIM is useful: it gives the GC important information.
TRIM is not that important as long as the device is not full (less than 80%, generally speaking, but it is very easy to produce pathological cases that are way off in either direction). Once the device fills up above that, it is crucial.
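A crude back-of-envelope for why a threshold around 80% shows up (my assumption: a GC victim holds roughly the device-wide average fraction of live data, which is pessimistic for greedy GC, but the shape of the curve is what matters):

```python
# Crude back-of-envelope for why TRIM matters more as the drive fills up.
# Assumption (mine): a GC'd erase block holds roughly the device-wide average
# fraction `live` of valid pages, so freeing it costs about live/(1 - live)
# extra copies per page of new user data.
def extra_copies_per_user_write(live):
    return live / (1.0 - live)

for pct in (50, 70, 80, 90, 95):
    live = pct / 100
    print(f"{pct}% live: ~{extra_copies_per_user_write(live):.1f} pages copied per page written")

# Without TRIM, deleted-but-not-trimmed data still counts as "live" here, which
# is why the difference barely shows below ~80% but explodes above it.
```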