
Related: "High-bandwidth flash progress and future" (15 comments), https://news.ycombinator.com/item?id=46700384

In an era of RAM shortages and quarterly price increases, Optane remains viable for swap and CPU/GPU cache.



Yeah, I've wondered if we might see a revival of this kind of technology.


I’ve been considering buying eight of the 64GB models and setting them up as equal-priority swap disks (to mitigate each drive's low throughput) for this exact reason.


Can confirm: doing so is awesome. Get some slightly bigger ones and partition them for additional use as a ZIL (SLOG). They're extremely satisfying to use, and it's depressing to remember that we'll never see their like again.


Do you have any more details? This is such a niche idea that I’d be buying blind.


Sure! This is more or less how I'm using Optane in my storage box:

Two U.2 x4 to PCIe x16 riser cards, one loaded with 960GB Intel-branded Optanes, the other with 1.5TB IBM-branded ones. PCIe bifurcation is set up in the BIOS so they all come up properly, at which point they just show as regular NVMe drives. Riser cards like this can easily be substituted with PCIe-to-SAS/OCuLink-to-U.2 cables if that's more accommodating to your chassis.

Once they all come up, partition them for your preferred split of swap and ZFS special. Swap should have all of them activated with the same priority and discard=pages. I also recommend setting up zswap (not zram swap) with lz4 as an additional layer of fast, evictable memory pool, as well as `vm.overcommit_memory=2` and `vm.swappiness=150`. This effectively gives you really good memory tiering for workloads and file cache.
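For anyone wanting to replicate the swap side, here's a rough sketch. Device names, partition numbers, and the zswap parameter values are illustrative placeholders, not my exact config:

```shell
# /etc/fstab entries -- same pri= value stripes swap across all devices
# (devices are hypothetical examples):
#   /dev/nvme0n1p1  none  swap  sw,pri=10,discard=pages  0 0
#   /dev/nvme1n1p1  none  swap  sw,pri=10,discard=pages  0 0

# Enable zswap with lz4 as a compressed, evictable layer in front of swap:
echo lz4 | sudo tee /sys/module/zswap/parameters/compressor
echo 1   | sudo tee /sys/module/zswap/parameters/enabled

# The sysctls mentioned above (persist them in /etc/sysctl.d/ to survive reboot):
sudo sysctl vm.overcommit_memory=2   # strict overcommit accounting
sudo sysctl vm.swappiness=150        # values >100 are valid since kernel 5.8;
                                     # biases toward swapping anon pages rather
                                     # than dropping file cache
```

Note that `discard=pages` issues TRIM as pages are freed, which low-latency devices like Optane handle well.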

When adding the other partitions to ZFS, use `-o ashift=12 special mirror dev dev special mirror dev dev ...`. ZFS special covers all metadata, the intent log (sort of a write cache), and optionally small files. I like to set it up so small files <= 8K get sent there, but you can probably go higher depending on how much capacity you allocate. My ~24T of allocated data ended up being ~150GB of special with 8K small files, and that's with the whole pool configured with deduplication and blake3 for all hashes. Blake3 is fast as heck but has very long hashes, so from a metadata standpoint I'm using the most expensive option. I mitigate that a bit by setting `redundant_metadata=some`, since my metadata is effectively RAID10 anyway.
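In concrete commands, the special vdev setup looks roughly like this. Pool name (`tank`) and device paths are placeholders; adjust to your layout:

```shell
# Add mirrored special vdevs on the Optane partitions, forcing 4K sectors:
sudo zpool add -o ashift=12 tank \
  special mirror /dev/nvme0n1p2 /dev/nvme1n1p2 \
  special mirror /dev/nvme2n1p2 /dev/nvme3n1p2

# Route small files (records <= 8K) to the special vdevs; can also be set
# per dataset instead of pool-wide:
sudo zfs set special_small_blocks=8K tank

# Store fewer redundant metadata copies, since the special vdevs are
# already mirrored:
sudo zfs set redundant_metadata=some tank
```

Be aware that special vdevs are not removable from raidz pools and losing one loses the pool, hence the mirrors.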

With some extra NVMe/Optane allocated to regular ZFS read cache, and all my spinning-rust data VDEVs also in RAID10, it's almost like having the whole array in memory, or at least on fast flash. Taking metadata seeks off your drives and letting writes land nearly instantly on Optane does wonderful things for spinning rust :)
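The read cache is the simplest part; a sketch with a placeholder pool and device name:

```shell
# L2ARC (read cache) needs no redundancy -- losing it only loses cached
# reads -- so devices are added singly rather than mirrored:
sudo zpool add tank cache /dev/nvme4n1p1

# Watch cache hit behavior per vdev:
zpool iostat -v tank 5
```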


In an era of shortages, if there were an Optane factory today, it would be ready to print money...


With secondary-market surplus pricing (~$1/GB), the value accrues to the buyer.


> (~$1/GB)

Isn't that actually crazy good, even insane value for the performance and DWPD you get with Optane, especially with DRAM at ~$15/GB or so? I don't think ~$1/GB NAND is anywhere near that good on durability, even if its raw performance is quite possibly higher.



