So OP's real point is that fsync() sucks in the context of modern hardware where thousands of I/O reqs may be in flight at any given time. We need more fine-grained mechanisms to ensure that writes are committed to permanent storage, without introducing undue serialization.
Well, there already is slightly more fine-grained control: in the sync version, you can call write() a few times before calling fsync() once, i.e. basically batch up a few writes. That does have the disadvantage that you can't easily queue new writes while waiting for the previous ones. Perhaps you could issue write() calls from another thread while the first one is waiting on fsync() for the previous batch? You could even have lots of threads doing that in parallel, though probably not the thousands that you mentioned. I don't know the nitty-gritty of Linux file I/O well enough to know how well that would work.
As I said, I don't know anything about fsync in io_uring. Maybe that has more control?
An article that did a fair comparison, by someone who actually knows what they're talking about, would be pretty interesting.
I meant more like, perhaps it's possible to concurrently queue fsync for different writes in a way that isn't possible with the sync API. From your link, it appears not (unless they're isolated at non-overlapping byte ranges, but that's no different from what you can do with sync API + threads):
> Note that, while I/O is initiated in the order in which it appears in the submission queue, completions are unordered. For example, an application which places a write I/O followed by an fsync in the submission queue cannot expect the fsync to apply to the write. The two operations execute in parallel, so the fsync may complete before the write is issued to the storage.
So if two writes are for an overlapping byte range, and you wanted to write + fsync the first one then write + fsync the second then you'd need to queue those four operations in application space, ensuring only one is submitted to io_uring at a time.
Unfortunately, I think sync_file_range() provides much weaker guarantees than byte-range fsync() and even byte-range fdatasync().
As I understand it from historical behaviour and documentation, sync_file_range() doesn't push durability barriers down to the underlying storage devices, nor does it ensure that all metadata needed to access the written pages is itself written and made durable, for example when writing to a hole in a sparse file, to the end-hole created by enlarging a file with ftruncate(), or to fallocate'd pages.
As a result, sync_file_range() can only be used as a performance tweak, not for the durability guarantees that fdatasync() / fsync() provide.
I'd be delighted to find this has improved since I last looked, but that's what I recall about sync_file_range().
You can also directly link submitted operations into a chain that will be executed in-order but without ordering dependencies on other operations not submitted as part of the chain.
Postgres claims to have some kind of commit batching, but I couldn't figure out how to turn it on.
I wanted to scrub a table by processing each row, but without holding locks, so I wanted to commit every few hundred rows, but with only ACI and not D, since I could just run the process again. I don't think Postgres supports this feature. It also seemed to be calling fsync much more than once per transaction.
> It also seemed to be calling fsync much more than once per transaction.
If it's called many more times than once per transaction, the likely reason is that wal_buffers is sized too small. Whenever the generated WAL exceeds wal_buffers, postgres flushes the WAL so it does not have to reopen the file later. At that point you have already gotten most of the benefits of batching anyway.
Edit: A second reason is that data pages need to be written out due to cache pressure or such, and that requires the WAL to be flushed first.
Without parallelism, each commit will incur at least one fdatasync (or fsync, or an O_SYNC/O_DSYNC write, depending on configuration). With parallelism, concurrent transactions might be flushed together, reducing the total number of fsyncs.
Some applications, like Apache Kafka, don't immediately fsync every write. This lets the kernel batch writes and also linearize them, both of which add speed. Until synced, the data exists only in the Linux page cache.
To deal with the risk of data loss, multiple such servers are used; the hope is that if one server dies before syncing, another server to which the data was replicated performs the fsync without failure.
I feel like you can try to FAFO with that on a distributed log like Kafka (although also... eww, but also I wonder whether NATS does the same thing or not...)
I would think for something like a database, at most you'd want to have something like the io_uring_prep_fsync others mentioned with flags set to just not update the metadata.
To be clear, I'm envisioning this as a WAL-type scenario: in my head you can get away with just having a separate thread or threads pulling from the WAL and writing to the main DB files... but I've never written a real database, so maybe those thoughts are off base.
The Linux RWF_DSYNC flag sets the Force Unit Access (FUA) bit in write requests. This can be used instead of fdatasync(2) in some cases. It only syncs the specific write request instead of flushing the entire disk write cache.
It looks plausible: XFS's xfs_dio_write_end_io() updates the on-disk file size. Do you have a link to documentation that confirms this is true for Linux or POSIX filesystems?
Edit: POSIX 1003.1-2017 defines fdatasync(2) behavior in 3.384 Synchronized I/O Data Integrity Completion, where it says "For write, when the operation has been completed or diagnosed if unsuccessful. The write is complete only when the data specified in the write request is successfully transferred and all file system information required to retrieve the data is successfully transferred".
So I think POSIX does guarantee that a write at the end of the file with O_DSYNC, or followed by fdatasync(2) (and therefore with Linux RWF_DSYNC), is sufficient. Thank you for pointing out that RWF_DSYNC is sufficient for appends, vlovich123!
Not really: RWF_DSYNC is equivalent to open(2) with O_DSYNC when writing, which is equivalent to write(2) followed by fdatasync(2), and:
> fdatasync() is similar to fsync(), but does not flush modified metadata unless that metadata is needed in order to allow a subsequent data retrieval to be correctly handled. For example, changes to st_atime or st_mtime (respectively, time of last access and time of last modification; see inode(7)) do not require flushing because they are not necessary for a subsequent data read to be handled correctly. On the other hand, a change to the file size (st_size, as made by say ftruncate(2)), would require a metadata flush.