So OP's real point is that fsync() sucks in the context of modern hardware where thousands of I/O reqs may be in flight at any given time. We need more fine-grained mechanisms to ensure that writes are committed to permanent storage, without introducing undue serialization.
Well, there already is slightly more fine-grained control: in the sync version, you can call write() a few times before calling fsync() once, i.e. basically batch up a few writes. That does have the disadvantage that you can't easily queue new writes while waiting for the previous ones. Perhaps you could issue write() calls from another thread while the first one is waiting on fsync() for the previous batch? You could even have lots of threads doing that in parallel, though probably not the thousands that you mentioned. I don't know the nitty-gritty of Linux file I/O well enough to know how well that would work.
As I said, I don't know anything about fsync in io_uring. Maybe that has more control?
An article that did a fair comparison, by someone who actually knows what they're talking about, would be pretty interesting.
I meant more like, perhaps it's possible to concurrently queue fsync for different writes in a way that isn't possible with the sync API. From your link, it appears not (unless they're isolated at non-overlapping byte ranges, but that's no different from what you can do with sync API + threads):
> Note that, while I/O is initiated in the order in which it appears in the submission queue, completions are unordered. For example, an application which places a write I/O followed by an fsync in the submission queue cannot expect the fsync to apply to the write. The two operations execute in parallel, so the fsync may complete before the write is issued to the storage.
So if two writes are for an overlapping byte range, and you wanted to write + fsync the first one then write + fsync the second then you'd need to queue those four operations in application space, ensuring only one is submitted to io_uring at a time.
Unfortunately, I think sync_file_range() provides much weaker guarantees than byte-range fsync() and even byte-range fdatasync().
As I understand it from historical behaviour and documentation, sync_file_range() doesn't push durability barriers down to the underlying storage devices, nor does it ensure that all metadata needed to access the written pages is itself written and made durable, for example when writing to a hole in a sparse file, to the end-hole created by enlarging a file with ftruncate(), or to fallocate'd pages.
As a result, sync_file_range() can only be used as a performance tweak, not for the durability guarantees that fdatasync() / fsync() provide.
I'd be delighted to find this has improved since I last looked, but that's what I recall about sync_file_range().
You can also directly link submitted operations into a chain that will be executed in-order but without ordering dependencies on other operations not submitted as part of the chain.
Postgres claims to have some kind of commit batching, but I couldn't figure out how to turn it on.
I wanted to scrub a table by processing each row, but without holding locks, so I wanted to commit every few hundred rows, but with only ACI and not D, since I could just run the process again. I don't think Postgres supports this feature. It also seemed to be calling fsync much more than once per transaction.
> It also seemed to be calling fsync much more than once per transaction.
If it's called many more times than once per transaction, the likely reason is that wal_buffers is sized too small. Whenever the generated WAL exceeds wal_buffers, postgres flushes the WAL so it does not have to reopen the file later. At that point you have already gotten most of the benefits of batching anyway.
Edit: A second reason is that data pages need to be written out due to cache pressure or such, and that requires the WAL to be flushed first.
Without parallelism, each commit will incur at least one fdatasync (or fsync, or an O_SYNC/O_DSYNC write, depending on configuration). With parallelism, concurrent transactions might be flushed together, reducing the total number of fsyncs.
Some applications, like Apache Kafka, don't immediately fsync every write. This lets the kernel batch writes and also linearize them, both of which add speed. Until synced, the data exists only in the Linux page cache.
To deal with the risk of data loss, multiple such servers are used; the hope is that if one server dies before syncing, another server to which the data was replicated performs the fsync without failure.
I feel like you can try to FAFO with that on a distributed log like Kafka (although also... eww, but also I wonder whether NATS does the same thing or not...)
I would think for something like a database, at most you'd want to have something like the io_uring_prep_fsync others mentioned with flags set to just not update the metadata.
To be clear, I'm envisioning this as a WAL-type scenario: in my head you can get away with just having a separate thread or threads pulling from the WAL and writing to the main DB files... but I've never written a real database, so maybe those thoughts are off base.
The Linux RWF_DSYNC flag sets the Force Unit Access (FUA) bit in write requests. This can be used instead of fdatasync(2) in some cases. It only syncs the specific write request instead of flushing the entire disk write cache.
It looks plausible: XFS's xfs_dio_write_end_io() updates the on-disk file size. Do you have a link to documentation that confirms this is true for Linux or POSIX filesystems?
Edit: POSIX 1003.1-2017 defines fdatasync(2) behavior in 3.384 Synchronized I/O Data Integrity Completion, where it says "For write, when the operation has been completed or diagnosed if unsuccessful. The write is complete only when the data specified in the write request is successfully transferred and all file system information required to retrieve the data is successfully transferred".
So I think POSIX does guarantee that a write at the end of the file with O_DSYNC, or followed by fdatasync(2) (and therefore with Linux RWF_DSYNC), is sufficient. Thank you for pointing out that RWF_DSYNC is sufficient for appends, vlovich123!
Not really: RWF_DSYNC is equivalent to open(2) with O_DSYNC when writing, which is equivalent to write(2) followed by fdatasync(2), and:
> fdatasync() is similar to fsync(), but does not flush modified metadata unless that metadata is needed in order to allow a subsequent data retrieval to be correctly handled. For example, changes to st_atime or st_mtime (respectively, time of last access and time of last modification; see inode(7)) do not require flushing because they are not necessary for a subsequent data read to be handled correctly. On the other hand, a change to the file size (st_size, as made by say ftruncate(2)), would require a metadata flush.