In the most non-inflammatory way possible: I am not sure I'm convinced of the performance benefits of crossing a JNI boundary to invoke a method in another language, versus letting the JVM JIT and optimize native JVM code.
It would be interesting to see benchmarks here. Do you know if there are any public ones available, specific to QuestDB's interop code?
In certain situations, crossing the JNI boundary can be advantageous. When data resides in "native" memory, outside the Java heap, the coordination of concurrent execution logic can be handled in Java, while the actual execution occurs in C++ or Rust. In this context, the negligible penalty for crossing the JNI boundary once per 10 million rows pales in comparison to the substantial benefits achievable through C++ optimization or CPU-specific assembly.
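To make the trade-off concrete, here is a minimal, hypothetical JNI sketch (the class and function names are illustrative, not QuestDB's actual API): a single native call receives the base address of an off-heap column and a row count, so the boundary is crossed once for the whole batch rather than once per row.

```c
#include <jni.h>
#include <stdint.h>

/*
 * Hypothetical native worker for a Java declaration like
 *   private static native double sumDoubleColumn(long address, long rowCount);
 * The JNI overhead is paid once; the loop then runs over millions of rows
 * of off-heap data, where the C compiler is free to optimize or vectorize.
 */
JNIEXPORT jdouble JNICALL
Java_org_example_Vect_sumDoubleColumn(JNIEnv *env, jclass cls,
                                      jlong address, jlong rowCount) {
    (void) env; (void) cls;  /* no callbacks into the JVM; data is off-heap */
    const double *col = (const double *) (intptr_t) address;
    double sum = 0.0;
    for (jlong i = 0; i < rowCount; i++) {
        sum += col[i];
    }
    return sum;
}
```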
I see... but that seems a little weak considering it's a funded product, the first adjective they use to describe it is "fast", and good old C++ would totally slay it. The author has a C++ background; maybe he could spend an afternoon trying that?
I've lost count of the number of great posts I've read about fast hash tables from e.g. Paul Khuong alone...
One of our distribution channels is Maven Central, where we ship a Java 11-compatible library. Embedded users preclude us from leveraging the latest Java features.
It is not always read-modify-write. There is no evidence of this pattern on Ubuntu when there is no memory pressure. The merge occurs when a block is partially updated after the kernel has lost the state of the block, which can happen under memory pressure.
On x86, and I think every architecture, when you write to a memory mapping that is not already backed by a writable page, the kernel is notified that user code is trying to write. And the kernel needs to fill in the contents of the page, which requires a read if the page isn’t already loaded.
It has to be this way! The write could be a read-modify-write instruction. Or it could be a plain store, but I’ve never heard of hardware with write-only memory with fine enough granularity to make this work.
The sole exception is if the page in question is all zeros and the kernel can know this without needing to read the file. This might sometimes be the case for an append-only database. I don’t know exactly what QuestDB does.
Also:
> As soon as you mmap a file, the kernel allocates page table entries (PTEs) for the virtual memory to reserve an address range for your file,
Nope. It just makes a record of the existence of the mapping. This is called a VMA in Linux. No PTEs are created unless you set MAP_POPULATE.
> but it doesn't read the file contents at this point. The actual data is read into the page when you access the allocated memory, i.e. start reading (LOAD instruction in x86) or writing (STORE instruction in x86) the memory.
What are these LOAD and STORE instructions in x86? There are architectures reasonably described as load-store architectures, and x86 isn’t one of them.
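For what it's worth, the VMA-vs-PTE distinction and the fault-on-first-touch behaviour can be sketched in a few lines of C (Linux-specific, error handling omitted, file name hypothetical):

```c
#define _GNU_SOURCE
#include <fcntl.h>
#include <sys/mman.h>
#include <unistd.h>

int main(void) {
    int fd = open("column.d", O_RDWR);   /* hypothetical data file */
    size_t len = 1 << 24;                /* 16MB mapping */

    /* Lazy mapping: the kernel only records a VMA; no PTEs, no I/O yet. */
    char *lazy = mmap(NULL, len, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);

    /* MAP_POPULATE asks the kernel to pre-fault the range up front. */
    char *eager = mmap(NULL, len, PROT_READ | PROT_WRITE,
                       MAP_SHARED | MAP_POPULATE, fd, 0);

    /* First touch of a lazy page: page fault; the kernel reads the page
     * from the file (unless it knows the block is zero) before the store
     * can complete. */
    lazy[0] = 1;

    munmap(lazy, len);
    munmap(eager, len);
    close(fd);
    return 0;
}
```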
> On x86, and I think every architecture, when you write to a memory mapping that is not already backed by a writable page, the kernel is notified that user code is trying to write. And the kernel needs to fill in the contents of the page, which requires a read if the page isn’t already loaded.
This is very true. Perhaps there wasn't enough context in the article for what it is describing. The read problem started to occur on a database that is subject to a constant write workload. Data is flowing in all the time at a variable rate. Typically blocks are "hot" and get filled fully within seconds, if not milliseconds.
Zeroing the file is an option to try. QuestDB allocates disk space with `posix_fallocate()`, which doesn't have the required flags. We would need to explore `fallocate()`. Thanks.
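If it helps, a hedged sketch of what that exploration might look like (assuming Linux with a filesystem such as ext4 or XFS; the helper name is made up): `fallocate()` with `FALLOC_FL_ZERO_RANGE` marks the range as zero-filled, typically as a metadata-only operation, so a later partial write has no old data that needs reading back.

```c
#define _GNU_SOURCE
#include <fcntl.h>

/* Pre-allocate a chunk (e.g. 16MB) as a guaranteed-zero range instead of a
 * plain posix_fallocate(). On filesystems that support it, the range is
 * recorded as unwritten/zero extents rather than physically written. */
int preallocate_zeroed(int fd, off_t offset, off_t chunk_len) {
    return fallocate(fd, FALLOC_FL_ZERO_RANGE, offset, chunk_len);
}
```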
If you are allocating zeroed space and then writing a whole page quickly, then you may well avoid a read. And doing this under extreme memory pressure will indeed cause the page to be written to disk while only partly written by user code, and it will need to be read back in. Reducing memory pressure is always good.
I would expect quite a bit better performance if you actually write entire pages using pwrite(2) or the io_uring equivalent, though.
Messing with fallocate on a per-page basis is asking for trouble. It changes file metadata, and I expect that to hurt performance.
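A sketch of the pwrite() suggestion, assuming the block is assembled fully in user memory first (names illustrative):

```c
#include <unistd.h>

#define BLOCK_SIZE 4096  /* one page, assembled entirely in user space */

/* Writing a whole page in one syscall means the kernel never has to
 * fault-in (and possibly read) stale page contents under an mmap write. */
int flush_block(int fd, const void *block, off_t file_offset) {
    ssize_t written = pwrite(fd, block, BLOCK_SIZE, file_offset);
    return written == BLOCK_SIZE ? 0 : -1;
}
```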
Great, we are on the same page! `fallocate()` (or the POSIX one) is called on large swathes of the file, 16MB by default. Not too often to hurt performance. I wonder if zeroing the file with `fallocate()` will result in actual disk writes, or whether it is ephemeral?
I am excited for what the Memory API brings. The biggest pain with native memory management for us is "structs", or rather the lack of them. The memory arithmetic is extremely bug-prone, and I personally am very motivated to see the back of it.
FFI - we need to wait and see how this actually evolves.
We are not looking at any of this just yet. While it is undoubtedly fun, there are a few other things we're busy with.
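To illustrate the "structs vs. raw offsets" pain in concrete terms (illustrative C, not QuestDB code): a C struct gives named, compiler-checked field access, whereas off-heap code without structs is stuck with hand-written base-plus-offset arithmetic.

```c
#include <stdint.h>

typedef struct {
    int64_t timestamp;   /* offset 0  */
    double  price;       /* offset 8  */
    int32_t flags;       /* offset 16 */
} trade_rec;             /* hypothetical record layout */

/* With a struct, the compiler owns the offsets. */
static double price_via_struct(const trade_rec *rec) {
    return rec->price;
}

/* Without structs, the same access is manual pointer arithmetic; one wrong
 * constant silently reads the wrong field. Off-heap Java code today looks
 * like this, just with long addresses instead of pointers. */
static double price_via_offsets(const uint8_t *base) {
    return *(const double *) (base + 8);
}
```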
I couldn't agree more. I don't think compiler vectorization is that useful even for the columnar (!) database we're building. The specialized JIT doesn't even use AVX-512, because it's too much effort for little to no gain.
Vectorization (auto or manual) can really help optimize bottlenecks like evaluating simple comparisons once you're out of easy algorithmic wins. It takes so much attention it's only worth doing in the most critical cases IMO.
Auto-vectorization can be fragile.
Manual vectorization is a ton of work and difficult to maintain.
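As a rough illustration of the kind of manually vectorized comparison being discussed (a sketch only; assumes 256-bit AVX support and a column length that is a multiple of 4):

```c
#include <immintrin.h>
#include <stddef.h>
#include <stdint.h>

/* Count how many values in a double column satisfy "value > threshold"
 * using 256-bit vector compares, four lanes at a time. */
int64_t count_greater_than(const double *col, size_t n, double threshold) {
    __m256d limit = _mm256_set1_pd(threshold);
    int64_t count = 0;
    for (size_t i = 0; i < n; i += 4) {
        __m256d v   = _mm256_loadu_pd(col + i);
        __m256d cmp = _mm256_cmp_pd(v, limit, _CMP_GT_OQ);
        /* movemask packs one bit per matching lane; popcount adds them up
         * (GCC/Clang builtin used here for brevity). */
        count += __builtin_popcount(_mm256_movemask_pd(cmp));
    }
    return count;
}
```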
Thank you for sharing the post! I'm Vlad, CTO of QuestDB. We went all the way trying to optimize CSV ingestion with io_uring, and ended up benchmarking against some OLAP peers in the process. Any feedback is greatly appreciated!
A comparison with the old version is actually in the article for the patient reader. It could go to the top, but I don't think it would make a difference. At the end of the day, it is an article on the official QuestDB website, which already gives the reader a spoiler about the bias.
I am intrigued what Timescale is going to publish next.
Agree. And for a blog post it can even have a story like: "We compared with ClickHouse and we were 10x slower; then we looked at this case and made it 100x faster. Thank you, benchmark and ClickHouse developers, for showing us a use case where we could do better."
For me, benchmarking usually goes: "Why does this query take so long? We need to improve it." Sometimes by 1000x.
Right? How do the folks at QuestDB know that their new JIT engine is actually responsible for those performance improvements? My understanding is that, index or not, data is still sorted by time in QuestDB, which is exactly what the ClickHouse engineers are replicating in the new schema.
The query ClickHouse picked on does not actually leverage time order. Perhaps the ClickHouse vendors on this thread can comment on the relevance of the date partitioning for this query. My best guess is that it might help the execution logic create data chunks for a parallel scan.
QuestDB also uses partitions for this purpose, but we additionally calculate chunks dynamically based on the available CPUs to distribute the load across cores more evenly.
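A hedged sketch of that kind of dynamic chunking (not QuestDB's actual code): split the row range into one chunk per online core so a parallel scan stays balanced even when partitions are uneven.

```c
#include <stdint.h>
#include <unistd.h>

typedef struct {
    int64_t row_lo;   /* inclusive */
    int64_t row_hi;   /* exclusive */
} scan_chunk;

/* Fills `chunks` (caller-sized to at least the online CPU count) and
 * returns the number of chunks produced. */
int plan_chunks(int64_t row_count, scan_chunk *chunks) {
    long cpus = sysconf(_SC_NPROCESSORS_ONLN);
    if (cpus < 1) cpus = 1;
    int64_t per_chunk = (row_count + cpus - 1) / cpus;   /* ceil division */
    int n = 0;
    for (int64_t lo = 0; lo < row_count; lo += per_chunk) {
        int64_t hi = lo + per_chunk;
        chunks[n].row_lo = lo;
        chunks[n].row_hi = hi < row_count ? hi : row_count;
        n++;
    }
    return n;
}
```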
That's fair enough and I get the broader point about ClickHouse being rather inflexible wrt query performance. It still seems like the initial sorting key for CH would've been the worst possible one for all benchmark scenarios.
What an extremely unfair comment. Having read QuestDB's blog, it's quite clear they've taken great pains to point out that a single specific benchmark isn't the be-all and end-all of DB analysis.
They quite clearly start out by saying they’re only looking to demonstrate the impact of a specific new DB feature they’ve created, and are using benchmarks that illustrate the difference. They make zero claims that QuestDB is faster than Clickhouse overall, and quite carefully point out that prospective users need to run their own benchmarks on their own data to figure out what DB will work for them.
I’m commenting specifically on the blog post provided by GP, which the parent comment made some pretty derogatory comments about.
I've made no attempt to deeply research what QuestDB have said elsewhere, because I don't care. I don't use, or have a need for, any of the products mentioned in any of the linked articles. I'm only interested in the narrow discussion of the original blog post provided by GP, to which the OP's post is replying.
I am in fact very proud of my team, who worked very hard on both the implementation and the article. It is disappointing to read unfounded insults when we made every effort to be fair.
I appreciate your benchmark and was interested to learn about how QuestDB processes TSBS queries efficiently. I work extensively with ClickHouse and it's always enlightening to learn about how other databases achieve high performance. Your descriptions of the internals are clear and easy to follow, especially since you included comparisons with older versions of QuestDB.
That said, I think I can understand how some users might be a little put off by the comparisons. Your article effectively says "ClickHouse is really slow" without giving readers any easy way to judge what was happening under the covers. I was personally a bit frustrated not to have the time to set up TSBS and dig into what was going on. I therefore appreciated Geoff's effort to look up the results and show that the default index choices didn't make a lot of sense for this particular case. That does not detract from QuestDB's performance, at least from my perspective.
Anyway congratulations on the performance improvement. As a famous character in Star Wars said, "we will watch your career with great interest."
I wonder what "every effort to be fair" means? The first thing you could have done is reach out to the ClickHouse community to ask for optimization suggestions.
"fair" means that we comparing apples to apples. Ad-hoc, unindexed predicate, compiled by QuestDB into AVX2 assembly (using AsmJIT) vs same predicate complied by Clickhouse (I'm assuming by LLVM). One can perhaps view this as comparing SIMD-based scans from both databases. Perhaps we generate better assembly, which incidentally offers better IO.
We all understand that creating a very specific index might improve a specific query's performance. Great, ClickHouse geared the entire table storage model to be ultra-specific to the latitude search. What if you search by longitude, or by another column? Back to the beginning.
JIT-compiled predicates offer arbitrary query optimisation with zero impact on ingestion. This is sometimes useful.
What would you have offered, assuming we had reached out, other than creating an index?
ClickHouse does better than we do in other areas. It JITs more complicated expressions, such as some date functions. It optimises count() queries specifically: for example, we collect the "found" row IDs in an array, whereas ClickHouse does not do this for count(). We still have work to do. On the other hand, we ingested this very dataset about 5x quicker than ClickHouse, which we left out because the article is not about "QuestDB is faster than ClickHouse".
Doesn't matter, since that clearly wasn't the purpose of the article. After all, they were totally happy to add an index for another competing DB as long as they happened to win that comparison. Then they crow about how they beat having an index.
So, maybe do not create specific scenarios for corner cases and then generalize the outcome? And write articles about common scenarios that are important for people who will use the technology on a daily basis.
Full disclosure: I am the CTO of QuestDB and I took part in the JIT implementation. The quote above is not mine; it was written by ClickHouse staff. The "utilizes its full indexing strategy" statement is false and is news to me.
Aggregation is also optimised quite a bit via SIMD and map-reduce; these paths are as fast as the "where" predicates. Aggregation keyed on multiple fields is not as optimal yet. I would also suggest trying our demo site (free and fully open) to see how the queries you use perform.
Cleanup is semi-manual for now. Time partitions can be removed or detached via SQL. We're working on automating that.
> Time partitions can be removed or detached via SQL. We’re working on automating that.
Cool! Will that be continuous queries that can be used for downsampling?
I'm working on load testing and monitoring tools. Since either can produce enough metrics to overflow available storage, the downsampling story ends up as important as write speed for me. I imagine that's true of a lot of metric database scenarios--what happens if they go on...forever?