Honestly seems pretty reasonable to me. SQLite operates above the filesystem, and it seems to me the filesystem is the appropriate place to do this kind of error detection/correction. If you need this kind of durability for your data, why are you storing anything on a filesystem that doesn't do this?
I'm going to present a potentially controversial opinion: you should always consider adding checksums.
Checksums are great, easy and almost assuredly cheap. CRC32C has hardware acceleration support on any modern CPU. Writing a masked CRC32C is incredibly cheap and you'd be surprised at all the places you can catch corruption.
The sooner you write a checksum, the more value it adds. If you calculate a checksum later on, such as when you're rotating files on disk or archiving to cold storage, it may be too late. What do those checksums represent? The checksums may be accurate, such as those used by your filesystem, but the data may already have been corrupted before they were calculated.
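A minimal Go sketch of the masked CRC32C mentioned above, assuming the TFRecord-style rotation-and-mask convention (my choice for illustration, not anything SQLite-specific). Go's hash/crc32 uses the CPU's CRC32C instructions when they're available, which is where the "incredibly cheap" part comes from:

```go
package main

import (
	"fmt"
	"hash/crc32"
)

// The Castagnoli table selects CRC32C; Go's hash/crc32 uses the CPU's
// CRC32C instructions (SSE4.2, ARMv8 CRC) when they are available.
var castagnoli = crc32.MakeTable(crc32.Castagnoli)

// maskedCRC32C rotates the checksum and adds a constant so that a
// checksum of a checksum doesn't collide with a checksum of raw data.
// The rotation and mask constant follow the TFRecord convention; they
// are illustrative here, not anything SQLite-specific.
func maskedCRC32C(data []byte) uint32 {
	crc := crc32.Checksum(data, castagnoli)
	return ((crc >> 15) | (crc << 17)) + 0xa282ead8
}

func main() {
	record := []byte("2024-06-01T12:00:00Z sensor=42 value=7")
	fmt.Printf("masked crc32c: 0x%08x\n", maskedCRC32C(record))
}
```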
No matter how fast CRC32 is, you still have to transfer the data from disk to the CPU. I suspect reading the entire DB would introduce unacceptable latency and I/O strain in many use cases for SQLite.
Pages? Postgres apparently does that if enabled. And when a page is read, it's being read anyway; the server just checks the checksum. The only overhead should be recalculating the checksum and validating it.
Now, if a bit flips somewhere you don't touch often, it would probably keep chugging along without noticing. Which kind of makes sense?
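To make the "only overhead is the checksum itself" point concrete, here's a rough Go sketch of a per-page checksum verified at read time. The [4-byte CRC | payload] layout and the function names are made up for illustration; this is not Postgres's actual on-disk format (Postgres uses an FNV-derived data checksum, as noted further down):

```go
package main

import (
	"encoding/binary"
	"errors"
	"fmt"
	"hash/crc32"
)

var table = crc32.MakeTable(crc32.Castagnoli)

// writePage lays a page out as [4-byte little-endian CRC32C | payload].
// The layout is illustrative only, not any real engine's format.
func writePage(payload []byte) []byte {
	page := make([]byte, 4+len(payload))
	binary.LittleEndian.PutUint32(page[:4], crc32.Checksum(payload, table))
	copy(page[4:], payload)
	return page
}

// verifyPage recomputes the payload checksum on read and compares it to
// the stored value -- the page bytes are in memory already, so the only
// extra cost is the checksum computation itself.
func verifyPage(page []byte) ([]byte, error) {
	if len(page) < 4 {
		return nil, errors.New("short page")
	}
	stored := binary.LittleEndian.Uint32(page[:4])
	payload := page[4:]
	if crc32.Checksum(payload, table) != stored {
		return nil, errors.New("page checksum mismatch: likely corruption")
	}
	return payload, nil
}

func main() {
	page := writePage([]byte("row data ..."))
	if _, err := verifyPage(page); err == nil {
		fmt.Println("page ok")
	}
	page[10] ^= 0x01 // simulate a bit flip in the payload
	if _, err := verifyPage(page); err != nil {
		fmt.Println(err)
	}
}
```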
All of it. With a checksums-of-checksums scheme like a Merkle tree, you can effectively and efficiently checksum all the data and keep incremental changes cheap. You only need to update the checksums of the data blocks you touched and of their ancestor nodes in the tree.
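A toy Go sketch of such a scheme, assuming fixed-size blocks and using CRC32C for brevity (real Merkle trees typically use a cryptographic hash). Updating one block recomputes only its leaf and the O(log n) ancestors on the path to the root; untouched blocks are never re-read:

```go
package main

import (
	"encoding/binary"
	"fmt"
	"hash/crc32"
)

var table = crc32.MakeTable(crc32.Castagnoli)

// merkle keeps one checksum per block plus checksums-of-checksums up to
// a single root. levels[0] holds the leaf checksums; the last level has
// exactly one entry, the root.
type merkle struct {
	levels [][]uint32
}

func pairChecksum(a, b uint32) uint32 {
	var buf [8]byte
	binary.LittleEndian.PutUint32(buf[0:4], a)
	binary.LittleEndian.PutUint32(buf[4:8], b)
	return crc32.Checksum(buf[:], table)
}

func newMerkle(blocks [][]byte) *merkle {
	leaves := make([]uint32, len(blocks))
	for i, b := range blocks {
		leaves[i] = crc32.Checksum(b, table)
	}
	m := &merkle{levels: [][]uint32{leaves}}
	for cur := leaves; len(cur) > 1; {
		next := make([]uint32, (len(cur)+1)/2)
		for i := range next {
			right := cur[2*i] // duplicate the last node when the level is odd
			if 2*i+1 < len(cur) {
				right = cur[2*i+1]
			}
			next[i] = pairChecksum(cur[2*i], right)
		}
		m.levels = append(m.levels, next)
		cur = next
	}
	return m
}

// update replaces one block's data and refreshes only its leaf and the
// ancestors on its path to the root.
func (m *merkle) update(i int, block []byte) {
	m.levels[0][i] = crc32.Checksum(block, table)
	for lvl := 0; lvl < len(m.levels)-1; lvl++ {
		parent := i / 2
		left := m.levels[lvl][2*parent]
		right := left
		if 2*parent+1 < len(m.levels[lvl]) {
			right = m.levels[lvl][2*parent+1]
		}
		m.levels[lvl+1][parent] = pairChecksum(left, right)
		i = parent
	}
}

func (m *merkle) root() uint32 { return m.levels[len(m.levels)-1][0] }

func main() {
	blocks := [][]byte{[]byte("a"), []byte("b"), []byte("c"), []byte("d")}
	m := newMerkle(blocks)
	fmt.Printf("root before: 0x%08x\n", m.root())
	m.update(2, []byte("c'")) // only block 2's ancestors are recomputed
	fmt.Printf("root after:  0x%08x\n", m.root())
}
```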
I use SQLite for some "logging-like" things a lot; the file is in the ~1.5 GB range and growing, and every minute some data is logged to it. Having to read 1.5 GB of data from disk every minute just to add a timestamp and one 64-bit number (a few times) seems pointless.
If you already know the checksum for some huge chunk of the message, you don't need to recompute it to append some data and get a new checksum (at least for CRC). On the read side you'd want checksums at whatever granularity you want to be able to read, but for a larger combined CRC checksum you never need to reread data to append or prepend to it.
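For the append case specifically, a running CRC can simply be extended with the new bytes, e.g. via Go's crc32.Update, without re-reading what's already on disk. (Combining two independently computed CRCs, for prepends or parallel chunks, needs zlib crc32_combine-style math, which isn't shown here.) A small sketch:

```go
package main

import (
	"bytes"
	"fmt"
	"hash/crc32"
)

var table = crc32.MakeTable(crc32.Castagnoli)

func main() {
	existing := bytes.Repeat([]byte("already checksummed log data\n"), 1000)
	appended := []byte("new entry: ts=1717243200 value=42\n")

	// Checksum of the existing data, computed once (e.g. at write time).
	runningCRC := crc32.Checksum(existing, table)

	// Appending: extend the running CRC with only the new bytes.
	// The existing data is never re-read.
	runningCRC = crc32.Update(runningCRC, table, appended)

	// Sanity check against checksumming the whole concatenation.
	full := crc32.Checksum(append(existing, appended...), table)
	fmt.Println("incremental == full recompute:", runningCRC == full)
}
```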
CRC32C is not that cheap, even with hardware acceleration. Postgres uses CRC32C for the WAL and it definitely shows up in profiles, despite HW acceleration. And that's at a far lower throughput than you can have with e.g. sequential scans. Which is why Postgres chose a different checksum (FNV) for data checksums...
And if you care about data integrity, you shouldn't be using these. It's ZFS, Btrfs if you don't want to deal with the licensing issue, or bcachefs (though it is pretty new at this point).
That's poor advice. Use a database with checksums enabled like DB2, Oracle, Postgres, SQL Server or SQLite with the Checksum VFS Shim. TFA recommends the latter for good reason.
Running a database on ZFS results in reduced throughput, while Btrfs and Bcachefs are outright performance disasters. [1][2]
> And if you care about data integrity, you shouldn't be using these [ext4 or XFS]
FUD. I've been using ext4 since it came out (and ext2 and ext3 before that), on dozens of computers, and have literally never once had any corruption with it. With XFS I had one instance of filesystem corruption way back in the early 2000s. I've read of _many_ more cases of data loss with ZFS and Btrfs on Linux than I have read about due to ext4 corruption.
If you truly believe this, there’s no reason you would want to add checksumming to SQLite either. Either the data you write can be corrupted or it can’t.
Yeah, I agree. That being said, we do live in a world where non-checksummed filesystems and non-ECC RAM are common, especially on consumer devices. And disk- or file-level checksums are an incomplete solution without ECC RAM anyway.