Ask HN: What do you know about bitemporal databases?

jonahbenton · on Dec 22, 2019

My personal opinion is this is difficult to do correctly and ergonomically in SQL. One has to be clear on segregating the metamodel (the machinery that supports as-of timekeeping) from the model (the actual data). What one winds up wanting is a different sublanguage, an additional as-of syntax on a per table basis, which kind of suggests that is not really the right solution domain.

They are not SQL but Datomic[1] and Crux[2] are first class bitemporal databases, look there for design guidelines and commentary.

1. https://datomic.com/

2. https://opencrux.com/

j-pb · on Dec 22, 2019

Isn't datomic 'just' a temporal DB? Through their transactor design there's always a total order on events which removes the need for the whole "what did you see when you created this" vs. "what would you have seen if you would have had access to the entire distributed state of the system" shenanigans.

jonahbenton · on Dec 23, 2019

It is both.

The transactor design ensures serialization level isolation for writes- but semantically writes are considered "creating facts" and the time at which they are written is the time at which they are "known" and one can ask questions about the state of the world- the collection of facts- that were known as of a particular time. And of course time is also a data type, so one can time based questions on multiple levels.

The model is deeply general- facts are assertions that "this attribute of this entity had this value at this time." One neat capability is that you can "go back in time" and simulate applying transactions that did not occur, supporting counterfactual and hypothetical analytics, and various kinds of testing scenarios.

Attempting to carry through the implications of that model at the application level in SQL is very hard.

j-pb · on Jan 4, 2020

What you describe isn't a bitemporal database though, it's a temporal one.

Bitemporality afaik means that you get two views on your data, one that describes state relative to individual writers (what you saw at write time in a distributed and potentially inconsistent system) and what is actually true considering global state (what if you had had absolute knowledge of the distributed state at write time).

nobodywillobsrv · on Jan 4, 2020

Has anyone managed to wrap CRUX from python? I want to keep all logic in python ideally. Only thing keeping me away.

There is some other project crux that has a lot of python and this pollutes the Google searches.

sharms · on Dec 22, 2019

I have been experimenting with Crux (https://opencrux.com) and it feels very easy to work with. In past systems once they grew, I always needed to implement audit logs / tables / fields, and this solves that and other problems with respect to having a transaction time, having certain data at a certain time, and being able to update history without losing the history of what was updated.

I think many projects try to re-implement in a ad-hoc way, maybe without knowing it, so if you can start with a bitemporal database it gives a big head start.

mnm1 · on Dec 22, 2019

You may want to look into datomic or crux.

https://github.com/juxt/crux https://www.datomic.com/

ripley12 · on Dec 22, 2019

This is one of my favourite topics. Most large DBs I've seen in the wild use some form of temporal tables, but it's often a hacky hand-rolled solution. I'd really like to see more support for bitemporal tables from vendors.

As noted above, the SQL:2011 standard supports bitemporality; here's a good summary from SIGMOD Record: https://cs.ulb.ac.be/public/_media/teaching/infoh415/tempfea...

This is a good summary of vendor support for temporal tables: https://www2.cs.arizona.edu/~rts/sql3.html

Oracle and DB2 both support bitemporal tables, SQL Server only supports system-versioned tables. There is an extension for Postgres that supports system-versioned tables, but it hasn't been updated in 2 years: https://github.com/arkhipov/temporal_tables

In the past, I've manually implemented system-versioning for audit purposes in SQL Server using triggers, T4 templates and the CONTEXT_INFO variable to store user IDs. It worked OK, but schema changes became much more difficult.

filmfact · on Dec 22, 2019

Best to use a database with native bi-temporal support, a query planner can get very confused about tables with bi-temporal columns and few constraints, to the extent that it always fails to push the temporal conditions up through the joins resulting in massive unnecessary materialization.

Also - start thinking about archiving now - n lg n is fast at the start - two years later, not so much.

Good luck

dragonwriter · on Dec 22, 2019

> Best to use a database with native bi-temporal support, a query planner can get very confused about tables with bi-temporal columns and few constraints

Why would they have few constraints, unless the DB lacks adequate support for the kind of constraints that logically apply with temporal data (now, admittedly, many databases do lack that support, but Postgres via exclusion constraints on ranges doesn't, and the functionality, while supporting bi-(or tri- or quad-, or N-, which once you start analyzing the temporal relations that logically apply to data are much more common than one might think, there's no magic universal model)temporal data, isn't limited to just that.

filmfact · on Dec 22, 2019

Thanks - I hadn’t come across the exclusion ranges in postgresql before - I’d need to take a look at how it works in practice.

sixdimensional · on Dec 22, 2019

I agree with filmfact's comments.

I've built a lot of temporal tables in SQL, by hand. Time can be a tricky concept, for both users and developers too. Time travelling is even more confusing for people.

Temporal tables were added to the SQL standard in SQL:2011 [1], so a number of SQL databases support it in the DBMS itself. I have not personally used any of these implementations in my own projects yet, and had been using my own implementations for this previously. But, I have been looking into these features over the last few years and considering switching to using something provided by the DBMS itself.

I have seen a number of different implementations/patterns of temporal support:

1) Record effectivity to/from dates, "current flag" to make it easier for people to find the current row. This can be used in normal 3NF relational databases, but a good example of where this is often applied (I see someone else mentioned it) is in denormalized dimensional models - slowly changing dimension (SCD) type 2. That is more for data warehousing, but the point of the temporal tracking columns is the same.

2) On the note above about dimensional models, a most common pattern is OLTP and OLAP, where transactional/OLTP data is append only with no deletes (or only soft deletes) and the historical/temporal tracking is done in a separate purpose built data warehouse/system (OLAP).

3) "Current" table vs. "history" table, so only current records are in one table and history are in another. This pattern can be helpful, if you have strict constraints on the "current" table, but don't want to enforce them on the "history". For example, imagine a column that needed to be unique. You want to ensure that this constraint is enforced by the database, but you can't enforce it in history necessarily where there would be multiple instances of a row. There are other ways around this problem, but something to think about. Filtered indexes (indexes that only apply to certain rows) can help with this problem.

4) "Audit logs" are not temporal solutions. I don't know how many times vendors have said "oh we track history" when in reality, they only track changes to records, and that is not a temporal database.

5) Data "snapshots" are another solution to being able to go back in time. I haven't used such a solution myself to solve a bi-temporal problem, as I don't think it's really intended for that, but there are solutions that exist that do this. Usually this is implemented by the DBMS or storage system, and it involves storing a snapshot of data on disk and then storing deltas/changes - but usually not for extremely long periods of time.

A few other thoughts:

1) Temporal databases are often append-only, this can actually be an advantage - because you never delete, it opens up certain database technologies as a possibility that only allow append (for example, many serialized data storage formats in HDFS are append only). HOWEVER, keep in mind privacy concerns. If you never delete, but have to comply with privacy laws (such as GDPR, right to be forgotten, etc.) you will have to think of how to address that.

2) Storage naturally grows faster with temporal systems since you can maintain many copies. Physical table/collection/dataset partitioning often comes in handy when you need to scale this pattern. You have to pay special attention to partitioning keys. For example, if you partition by a date tied to how fast your data grows, historical data might end up unevenly distributed throughout partitions. This applies if you are talking about a distributed system (e.g. cluster of nodes) or even a single system with a partitioned dataset across multiple disks, arrays, etc.

EDIT: 3) One simple thing people overlook - what to use as a date/time stamp! Temporal databases that natively support this often have sequence number management internally. If you are doing this yourself, make sure to use either GMT dates, or date/time with timezone offset. It will make your life easier in the long run if you have to deal with global time differences, and since such a system usually lasts a long time, it makes sense to use a standard. I would avoid using Unix timestamps or anything that is too implementation dependent (internal sequence numbers, etc.) if you are going to build it yourself.

[1] https://en.wikipedia.org/wiki/SQL:2011

ripley12 · on Dec 22, 2019

> I would avoid using Unix timestamps or anything that is too implementation dependent (internal sequence numbers, etc.) if you are going to build it yourself.

I generally agree, but this might depend on the use case. Timestamps (as they are provided by the DB or OS) may not be granular enough if you need to record the exact order of changes in a system with a lot of near-simultaneous writes.

I ended up storing both sequence numbers and timestamps in one SQL Server implementation for this reason.

sixdimensional · on Dec 23, 2019

Interestingly, sequence numbers are good for the use case of immutable logs and change data capture. But that is a different use case than temporal tracking. In SQL Server, CDC uses log sequence numbers (LSN) whereas temporal features use datetime2.

I guess I didn't mean to say that sequence numbers shouldn't be used at all, tables that are bitemporal still have keys and identity values, just to use real dates and preferably GMT for the from/to dates for consistency.

Sequence numbers are especially good for audit log forensics so long as the logs are immutable or protected. Gaps, etc. help with detection of deletions, tampering, etc.

I agree with your comment about use case. Good point!

pablobaz · on Dec 22, 2019

For versioning numeric data complications include:

- changing dimensions - handling changing types e.g. A column of strings becoming a compound of floats - efficiently handling appends - efficiently handling corrections of previous data - how to cleanup, remove old versions

The version store storage engine in Arctic provides fast and efficient versioning for numeric data. The code is quite readable so it might be worth having a look at its implementation too.

https://github.com/man-group/arctic/blob/master/README.md

NovemberWhiskey · on Dec 22, 2019

Not sure what language you're planning to use, but if you end up with Java and want an ORM to layer on top of an enterprise-grade database, you could do a lot worse than Reladomo: https://github.com/goldmansachs/reladomo

The ORM has built-in support for various time-series data concepts including bitemporality.

ripley12 · on Dec 22, 2019

That's quite cool, thanks. Documentation for the temporal chaining features in case anyone else is curious: https://goldmansachs.github.io/reladomo-kata/reladomo-tour-d...

cdcarter · on Dec 22, 2019

It sounds like you might only need temporal tables and not the whole complication of _bi_temporal tables.

If you just need to get data at a point in time, you only need to store (so called) application-time. If you need to handle both a “created date” and “effective date” type scenario, then you’ll need bitemporality.

mrfox321 · on Dec 22, 2019

Are you trying to build a backtesting framework?

I've come up with a solution specifically for pandas dataframes. Not sure how similar a solution would be for your use case.

touisteur · on Dec 22, 2019

Hi, do you have a write-up somewhere or a github for this? Very interested.

adamnemecek · on Dec 22, 2019

It's funny, I've never heard that term but looking it up, I find some connections to some things I'm working on (that are not bitemporal databases).

What are some good resources?

A spreadsheet is a bitemporal database of sorts right?

rejschaap · on Dec 22, 2019

I don't see how a spreadsheet is either bitemporal or a database. But you could do some simple bitemporal modeling in a spreadsheet. I would imagine querying to be extremely limited and cumbersome though.

nobodywillobsrv · on Jan 4, 2020

A spreadsheet (with undo) is pretty much exactly a bitemporal database but it only shows you the "latest" version. You can undo to go back in time but it is not very convenient. THere are also DAGs and the code is data. Building this at scale is what a lot of large orgs end up trying to do. Tricky bit is probably the trade of between lazy eval (pull?) and streaming push through the calcs. I think push is needed for low latency but am not 100%.

derision · on Dec 22, 2019

Isn't this just temporal tables in SQL Server?

ripley12 · on Dec 22, 2019

SQL Server supports only half of the SQL:2011 temporal features – it does system-versioned tables but not application-time tables. When I asked MS engineers about it at Build 2018, there were no concrete plans to implement application-time.

a_c · on Dec 22, 2019

I use django for my day to day. I understand its "migration" https://docs.djangoproject.com/en/dev/topics/migrations as a kind of bitemporal database. Would you mind providing more context about your use case so maybe more concrete consideration can be discussed?

sixdimensional · on Dec 22, 2019

Migration is about forward/reverse engineering of the data structure/content. It isn't really what's meant by bitemporal databases, although I see why you might think that. Temporal databases [1] are a specific area of study and pattern.

[1] https://en.wikipedia.org/wiki/Temporal_database

sansnomme · on Dec 22, 2019

https://opencrux.com/

babygoat · on Dec 22, 2019

Is this different than a relational database with slowly changing dimensions?

deterministic · on Dec 23, 2019

Check out "datomic".

j88439h84 · on Dec 22, 2019

There was mentat. Almost.