
For those who are interested, the design was originally published here:

(Chinese) https://www.high-flyer.cn/blog/3fs/

They have been developing and using this file system internally for several years.

Compared to traditional file systems, it is focused on model-training workloads, which are dominated by random reads. Read caching and prefetching are useless in this case, so they left those features out of the design to improve performance.

I Google-translated some key parts here:

3FS is a special file system because it is used almost exclusively for batch-reading sample data on compute nodes during AI training, accelerating model training through high-speed interaction between compute and storage. This is a large-scale random-read workload, and data that has been read will not be used again soon, so we cannot rely on the most important tool for optimizing file reads, the read cache; even read-ahead is useless. The implementation of 3FS is therefore quite different from that of other file systems.

Specifically, as shown in the figure above, 3FS uses the Linux AIO and io_uring interfaces to perform sample reads, because in the 3FS scenario the file cache provides no benefit at all; it only consumes system memory in a way that is hard for users to control, affecting subsequent tasks. We therefore turned off the file cache and read data only in Direct I/O mode. Note, however, that with Direct I/O the buffer pointer, offset, and length must all be aligned. If users had to do this alignment themselves, it would incur extra memory copies, so we do the alignment inside the file system, which both improves performance and is more convenient for users.
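The alignment described above can be sketched as simple offset arithmetic. This is a minimal illustration assuming a 4096-byte block size; the function name and return convention are mine, not 3FS internals:

```python
# Sketch of the offset/length alignment a Direct-I/O read path must do,
# assuming a 4096-byte block size (illustrative, not actual 3FS code).
BLOCK = 4096

def align_for_direct_io(offset: int, length: int) -> tuple[int, int, int]:
    """Round a user read (offset, length) out to block boundaries.

    Returns (aligned_offset, aligned_length, skip), where `skip` is how many
    bytes at the start of the aligned read precede the user's requested data.
    """
    aligned_offset = offset - (offset % BLOCK)           # round start down
    end = offset + length
    aligned_end = ((end + BLOCK - 1) // BLOCK) * BLOCK   # round end up
    return aligned_offset, aligned_end - aligned_offset, offset - aligned_offset

# A user read of 100 bytes at offset 5000 becomes one aligned 4096-byte read
# starting at 4096; the file system copies out the 100 requested bytes.
off, ln, skip = align_for_direct_io(5000, 100)
```

Doing this inside the file system means the extra copy happens once, in a buffer the file system already owns, instead of forcing every user to over-allocate and re-copy.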



I hope they chose a multiple of 4096 for the alignment to minimize flash read amplification. QLC drives even use 16 KiB pages.


How critical is random reading of training data when assembling batches?

Put another way: in my experience, supporting fast random reads is a challenging problem, while supporting high-throughput sequential reads is fairly straightforward. When is random access to a training set absolutely necessary for training a model?


Imagine you're studying for a test where you are given an image and need to answer the correct class. To prepare, you're given a deck of flashcards with an image on the front and the class on the back.

(Random) You shuffle the deck every time you go through it. You're forced to learn the images and their classifications without relying on any specific sequence, as the data has no signal from sequence order.

(Fixed order) Every time you go through the deck, the images appear in the exact same order. Over time you may start to unconsciously memorize the sequence of flashcards, rather than the actual classification of each image.

When it comes to actually training a model, if the batches are sampled sequentially from a dataset, the model risks learning correlations introduced by the ordering of the data, resulting in poor generalization. In contrast, when you sample the batches randomly, the model is encouraged to learn features from the data itself rather than from artifacts of the ordering.
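The two access patterns in the flashcard analogy can be sketched in a few lines. This is a generic illustration with a seeded per-epoch permutation; the function name is mine, not any particular framework's API:

```python
# Minimal sketch of sequential vs shuffled batch sampling.
# A shuffled epoch turns dataset access into random reads.
import random

def epoch_batches(n_samples: int, batch_size: int, seed: int, shuffle: bool):
    """Yield lists of sample indices for one epoch."""
    order = list(range(n_samples))
    if shuffle:
        # Seeded, so the "random" order is reproducible across reruns.
        random.Random(seed).shuffle(order)
    for i in range(0, n_samples, batch_size):
        yield order[i:i + batch_size]

sequential = list(epoch_batches(8, 4, seed=0, shuffle=False))  # [[0..3], [4..7]]
shuffled = list(epoch_batches(8, 4, seed=0, shuffle=True))     # permuted indices
```

The sequential variant reads the dataset front to back (cache- and readahead-friendly); the shuffled one scatters reads across the whole dataset, which is exactly the workload 3FS targets.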


Why, then, are so many successful models trained with multiple passes over sequential data? Note, I'm not naive in this field, but as an infra person random reads make my life very difficult; if they're not necessary, I'd rather not have to switch to an approach that can't use readahead and other strategies for high-throughput I/O.


Why did they take time to invent an entire FS?


The engineers at DeepSeek seem extremely smart and well-funded, so my guess is they looked at the alternatives and concluded none of them would allow them to make their models work well enough.


They did this back in their trading firm days, and...

Imagine you have a sequence of numbers. You want to randomly select a window of, say, 1024 consecutive numbers as one input to your model. Now say the sequence has n items and you want to sample n/c windows in total (c is a constant, with c << 1024). How do you do a fixed shuffle?

The key point is that the windows we want to read overlap. If we brute-force the fixed shuffle by materializing every window, we have to store 1024/c times more than the original data.

This isn't useful for LLMs, but hey, wonder how it started?


I guess I'm much more of the "materialize the shuffle asynchronously from the training loop" kind of person. I agree the materialization storage cost is very high, but that's normally been a cost I've been willing to accept.

As an ML infra guy I have had to debug a lot of failing jobs over the years, and randomizing datapipes are one of the hardest to debug. Sometimes there will be a "record-of-death" that randomly gets shuffled into a batch, but only causes problems when it is (extremely rarely) coupled with a few other records.

I guess I'll just have to update my priors and accept that inline synchronous randomization with random reads is a useful-enough access pattern in HPC that it should be optimized for. Certainly a lot more work and complexity, hence my question of just how necessary it is.


Yeah, I don't want to do this either. This is a super special case, after exploring alternatives with our researchers it's unfortunately needed. As for record-of-death, we made sure that we do serialize all rng state and have our data pipeline perfectly reproducible even when starting from checkpoint.

Building a system for serving read-only data at NVMe SSD speed (as in IOPS) took surprisingly little effort, and is mostly enough for training data. Kudos to DeepSeek for deciding to spend the extra effort to build a full PFS and share it.


On an SSD, random and sequential reads have nearly the exact same performance. Even on large arrays of spinning rust this is essentially true.


This hasn't been my experience; I see much higher sequential read results compared to random reads on a wide range of storage from low-end home PC SSDs to high end NVME flash storage in large servers.

It's certainly not true on actual hard drives, and never has been. A seek is around 10ms.


By what metric? I think this is close to true for identical block sizes, but most benchmarks test sequential transfers with large 1M blocks and random ones with small 4K blocks. In that case, the fastest NVMe drives are more than twice as fast for sequential transfers as for random ones.
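The block-size effect is easy to see with back-of-the-envelope arithmetic. The drive numbers below are hypothetical, chosen only to show how block size dominates the MB/s figure:

```python
# Back-of-the-envelope: throughput (MB/s) = IOPS * block size.
# Drive numbers are hypothetical round figures, not a specific product.
def mb_per_s(iops: float, block_bytes: int) -> float:
    return iops * block_bytes / 1e6

seq_1m = mb_per_s(7_000, 1 << 20)     # ~7,000 sequential 1 MiB reads/s
rand_4k = mb_per_s(1_000_000, 4096)   # ~1,000,000 random 4 KiB reads/s

# Even with a very strong random-IOPS figure, the sequential benchmark
# reports a higher MB/s purely because each operation moves 256x more data.
```

This is why comparing a 1M-block sequential number against a 4K-block random number says more about block size than about the drive.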

I don't like comparing the two, they're completely different workloads and it's better IMO to look at the IOPS for random transfers, which is where newer, faster SSDs truly excel, and where most people "notice" the performance.


Why is that a random read? Also, is it truly random, or derived from a seed? If it's a PRNG, couldn't they cache?


The randomness comes from a PRNG. They still can't cache, though, because they make many read "passes" through the same data.

If you build a cache that gets hits on the first pass, it won't help for the second and later passes.
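The point about seeded randomness defeating caching can be sketched like this. The seed-mixing scheme and function name are illustrative, not from any real pipeline:

```python
# Sketch: a seeded shuffle is reproducible (same seed -> same order), yet
# each pass/epoch uses a different permutation, so what a cache held after
# pass 1 says nothing about the access order of pass 2.
import random

def pass_order(seed: int, epoch: int, n: int) -> list[int]:
    order = list(range(n))
    # Mix seed and epoch into one int (random.seed rejects tuples in 3.11+).
    random.Random(seed * 1_000_003 + epoch).shuffle(order)
    return order

n = 10_000
first = pass_order(seed=7, epoch=0, n=n)   # re-derivable after a restart
second = pass_order(seed=7, epoch=1, n=n)  # a different permutation

# The items a small cache would hold from early in pass 1 barely overlap
# with what pass 2 asks for early on.
overlap = len(set(first[:100]) & set(second[:100]))
```

So the PRNG buys reproducibility (checkpoint/restart with identical data order), not cacheability.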



