The hardest part of building any search engine is keeping the index up-to-date with changes made to the underlying data store.
It's a solvable problem, but it's always a lot of work to build and to keep working as the database schema changes in the future.
This is why I really like PostgreSQL FTS: it's good enough that for many projects I don't need to use an external search engine any more - and it's way easier to keep the search index up-to-date than if the index lives in a separate system.
Have you ever considered using PGroonga (https://pgroonga.github.io/) to further extend FTS in Postgres? They've got a drilldown feature that is basically their implementation of facets. I've always wondered what your thoughts on that might be.
I've got a Django setup with Postgres and facets implemented as you detailed in your blog (thank you for all the amazing info over the years, btw!). Now I'm looking to extend it with PGroonga, and I'm also considering Hasura to further extend what I can do while still using Postgres as my single source of truth, without needing to keep any other service in sync.
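For anyone who hasn't tried it, the basic PGroonga setup is roughly the following (a sketch; the table and column names are made up, and my understanding is that drilldown is reached through Groonga's own `select` command via the `pgroonga_command` function rather than plain SQL, so check their docs for the exact incantation):

```sql
CREATE EXTENSION pgroonga;

-- Index a text column with PGroonga instead of (or alongside) tsvector
CREATE INDEX pgroonga_content_index ON memos USING pgroonga (content);

-- &@~ is PGroonga's full-text search operator with query syntax
SELECT * FROM memos WHERE content &@~ 'postgres OR groonga';
```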
Yeah, ZomboDB looks interesting. I guess the main difference is that Groonga is an embedded database, more like SQLite, whereas with Elasticsearch you still gotta spin up an instance and be responsible for it.
I guess you're still responsible for the Groonga DB on the Postgres server too, but theoretically it seems more secure than spinning up another service on a different server.
I retired a dedicated Solr instance and replaced it with PG full text search in a couple of days. The sync inconsistency was creating a ton of support inquiries.
Never missed it. It’s the feature that hooked me on PG.
> ZomboDB brings powerful text-search and analytics features to Postgres by using Elasticsearch as an index type. Its comprehensive query language and SQL functions enable new and creative ways to query your relational data.
> From a technical perspective, ZomboDB is a 100% native Postgres extension that implements Postgres' Index Access Method API. As a native Postgres index type, ZomboDB allows you to CREATE INDEX ... USING zombodb on your existing Postgres tables. At that point, ZomboDB takes over and fully manages the remote Elasticsearch index and guarantees transactionally-correct text-search query results.
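For the curious, the usage described in that quote looks roughly like this (a sketch based on their README; the table name and Elasticsearch URL are hypothetical):

```sql
CREATE EXTENSION zombodb;

-- The index covers the whole row and points at a remote Elasticsearch cluster
CREATE INDEX idxproducts ON products
  USING zombodb ((products.*))
  WITH (url='http://localhost:9200/');

-- ==> is ZomboDB's search operator; the right-hand side is an ES query string
SELECT * FROM products WHERE products ==> 'wireless keyboard';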
I find other things hard in search engines too: dealing with the plethora of human languages and all the requirements we may have for processing them. A mature solution like ES is therefore almost a must in the more demanding cases.
> The hardest part of building any search engine is keeping the index up-to-date with changes made to the underlying data store.
That makes sense, and I'm curious whether you're referring to the challenges that arise from attempting to implement a distributed heterogeneous transaction spanning the system of record (say, an RDBMS) and derived data (the search engine), or were you thinking of something more specific to search engines?
Isn’t it as easy as updating the search index every time a change is made to the database? That’s what we did back in 2012, though the website had only 2 million uniques per month.
It is some work and definitely not as simple as having the search index embedded in the database, but then again almost everybody has something like Celery running for async task execution anyway.
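That "update the index on every write" pattern can be sketched in a few lines, with the Celery hop simulated by a plain in-process list (all names here are made up; in a real Django setup the drain step would be a task function):

```python
# Source of truth, search index, and the pending-update queue
search_index = {}
task_queue = []

def save_document(db, doc_id, body):
    db[doc_id] = body                  # 1. write to the source of truth
    task_queue.append((doc_id, body))  # 2. enqueue the index update

def process_queue():
    # Normally a worker runs this asynchronously
    while task_queue:
        doc_id, body = task_queue.pop(0)
        search_index[doc_id] = body    # 3. apply to the search index
```

The catch is the window between steps 1 and 3: if the process dies after the DB write but before the queue is drained, the index silently goes stale.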
The difficulty is making this reliable, by ensuring that the FTS index does not drift away from its source of truth (a SQL database, for example). What if your process crashes after writing to the SQL database but before writing to the FTS engine? You need some kind of Change Data Capture: track the changes and apply them to the FTS index.
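One common way to get that reliability is a transactional outbox: record the change in the same transaction as the data write, so the crash window disappears, and have a worker drain the outbox into the FTS engine. A rough Postgres-flavored sketch (the `documents` table and its `id` column are hypothetical):

```sql
-- Outbox rows are written atomically with the source rows
CREATE TABLE search_outbox (
    id         bigserial PRIMARY KEY,
    record_id  uuid NOT NULL,
    changed_at timestamptz NOT NULL DEFAULT now()
);

CREATE OR REPLACE FUNCTION enqueue_search_update() RETURNS trigger AS $$
BEGIN
    INSERT INTO search_outbox (record_id) VALUES (NEW.id);
    RETURN NEW;
END;
$$ LANGUAGE plpgsql;

CREATE TRIGGER documents_search_sync
    AFTER INSERT OR UPDATE ON documents
    FOR EACH ROW EXECUTE FUNCTION enqueue_search_update();
```

If the process crashes before the indexer runs, the outbox row survives with the data and the worker catches up later; logical replication / CDC tools get you the same guarantee without the trigger.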
A project I was working on wanted to do this, the basic idea was use Oracle (not my choice) as a source of truth and for OLTP and immediately blow any changes off to SOLR replicas for fulltext searching (for scalability/etc). Big-picture the idea was (imo, never set out directly) that they wanted clients to optimistically search on (and perhaps work on) the data from SOLR but "they also wanted the DB as an authoritative source".
We had it set up so that SOLR would basically run the queries, and return only the UUIDs of the database objects you wanted to return (plus of course any faceting data/etc). Then we would turn around and load the actual results from the RDBMS - it's the source of truth and the data is freshest.

Conveniently, the same function that generates the JSON for SOLR also works fantastic for generating the JSON returned to clients, so there will never be a mismatch between what's generated via DB-JSON and what's generated via ORM-JSON. It also means basically the entire backend layer can be pretty much generic - at the most basic level, you are just hammering off a query to SOLR, getting some UUIDs back, doing a batch-load from the DB, and getting some (uuid, record_version, json_body) tuples back. There is nothing "object-specific" in the backend object handler per se, at least in the search/fetch code. Inserts/updates still need to work directly on the relational model, but all your search/retrieval code is completely generic (beyond table/view names).

Furthermore, there is no "ORM collection spam" where the ORM fires off individual queries to get collections for each row in the batch - since each object is already just a JSON string when it comes back, loading 250 rows will have (theoretically) exactly 250 results. There is no combinatorial blowout from joins, and no hammering the DB with 250 additional queries (which perhaps fire off their own queries in turn). One string per object, done.
I had an intelligent batch-loader built out with temporary tables/etc, so it wasn't even "load 200 rows one at a time"... you inserted the rows you wanted to load, inner joined your temporary table (which is still completely generic and nothing object-specific), and you got 200 results in a single query, fast as hell, without destroying your query planner with IN(:1, :2...), since every query has identical literal sqltext.
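The project described above was on Oracle, but a Postgres-flavored sketch of that temp-table batch load looks something like this (table and column names are hypothetical; the point is that the SELECT's text never changes, so the planner sees one statement regardless of batch size):

```sql
-- One generic staging table per session, dropped automatically
CREATE TEMPORARY TABLE wanted_ids (id uuid PRIMARY KEY) ON COMMIT DROP;

-- Batch-insert the UUIDs SOLR returned, e.g.:
-- INSERT INTO wanted_ids (id) VALUES ('...'), ('...'), ('...');

-- Identical sqltext for every batch, no IN(:1, :2, ...) explosion
SELECT d.id, d.record_version, d.json_body
FROM wanted_ids w
JOIN document_json d ON d.id = w.id;
```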
Instead of directly poking specific records into SOLR, the backend microservice would tell SOLR "there are new records to pull" and SOLR would do what they call a "delta-import": it would actually go out to the database itself over JDBC and do an incremental import of all rows changed since the last import. Since SOLR itself could ensure that only one import was running at a time, it would implicitly "batch" large sets of updates into a single query rather than hammering out a bunch of individual ones.
There's always tradeoffs though - the biggest one that concerned me is that since the DB is what's generating the JSON, you have essentially offloaded the entire string-building onto your DB engine. Yeah C/C++ string-building is a lot faster than Java, but you can't scale-out your database like you can a backend service or SOLR replica. Best case you can use microservices and split your database into a couple pieces, but that has problems too of course. It's possible that maybe Postgres has some read-replica type stuff that might be able to scale that out, but, that wasn't really going to be the issue probably. And if you move the JSON-ifying to an external source, you lose the nice performance benefits in terms of one-row-per-result, now you have the collection/combinatorial problems again.
(I wrote an intelligent object-caching mechanism as well, so it wouldn't rebuild every row every time... it'd only pull the rows that it didn't have in cache and would mix the cache+DB results to build the result list while minimizing DB load. So it's not all as bad as it sounds... ORM blowout overloads the database too and you fix that by caching too.)
One inherent problem with this design is desyncs - this isn't specific to my approach, just a consequence of having a two-phase operation with things happening in the meantime. There are no exclusive locks on SOLR while you're doing an update in the DB. I did add code to watch for this as much as possible and could accept it/log it, retry a couple of times (it will probably fail again on a hot table), try a full rebuild of the SOLR collection, etc., according to the specific situation. But while you can detect missing rows (in SOLR but not in the DB) and stale rows (DB/SOLR have different versions), you could never detect the last kind: it's in the DB but SOLR doesn't know about it. If SOLR can't search it for whatever reason (missing/stale/etc) and you never ask the DB for it, then it doesn't exist.
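Query-time checks like the ones described above can't catch rows the index never learned about; a periodic full reconciliation sweep can, at the cost of pulling complete {id: version} maps from both sides. A minimal sketch of the diff step (pure Python; fetching the maps from the DB and the index is left out, and all names are made up):

```python
def reconcile(db_versions, index_versions):
    """Diff {id: version} maps from the source of truth and the search
    index. A full sweep catches all three failure modes, including the
    otherwise-undetectable "in DB but the index never heard of it" case.
    """
    missing = {i for i in db_versions if i not in index_versions}   # in DB, unknown to index
    stale = {i for i, v in db_versions.items()
             if i in index_versions and index_versions[i] != v}     # versions disagree
    orphaned = {i for i in index_versions if i not in db_versions}  # gone from DB, still indexed
    return missing, stale, orphaned
```

In practice you'd run this off-peak and feed the three sets into targeted re-index / delete jobs rather than a full rebuild.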
As you note, it also means that instead of building Java models/entities you are now (also) building JSON models inside views, and our DBAs were green and already heavily overloaded, so this was not a good thing in terms of team scaling. It definitely pushed the workload into some weird places in that sense. In the long term what I'd have wanted to do is go further into the "ORM building exercise" here and automatically build matching view schemas from a document. A lot of these really map down to very similar Jackson-object-mapper or "hbm2ddl.auto" style code-generation tasks... complex to write the ORM and tedious to write your bindings, but not actually all that difficult. A List<T> at the Java level means you need a (whatever) JSON_LIST_AGG(TABLE_T, ...) element with that field name in the view, etc. Collections do suck in RDBMS; JSON lists of uuids, or lists of populated (or partially populated) child elements, make JDBC behave a lot better.
Another oddity is that now you basically have two (or more) "facades" onto the same database objects... and this messes with "default" caching behavior. After all, writes go into the database on the ORM relational entity, but they come out on the JSON-entity, and you need to tell the cache about that. It's not hard, but you have to write a custom adapter. Also, if you do a batch-insert (we of course used uuids generated in the application layer) or batch-update, you then have to do a batch-select immediately afterwards to retrieve the JSON-ified record, otherwise there's nothing to put into the cache or return to the client, since the application layer itself never has a JSON-serialization mapping (that all lives in the DB).
(In the long term the direction I was trying to point us was that rather than having JSON views just be a straightforward JSONization of a given table/object we could maintain different JSON-builder-views for different SOLR views servicing different types of services if needed... I kinda wonder how far you'll get on a single representation.)
That project was a dumpster fire and after I left I think they've pretty much thrown all this out. Easier to just have everything in DB (not even 21c - no native document support lol) be text and have the application layer do all the JSONifying.
But I'm still proud of turning some pretty nonsensical design requirements into a reasonably cohesive rube-goldberg machine. It's not what I'd have designed, but we had some, uh, overpromising and mismanagement involved, and it was a reasonable attempt at implementing the idea that was sold to the client imo. None of us had any experience in this stack and we were all trying to feel our way through it.
Ironically a bunch of this "full/incremental resync" stuff also ended up being a nice mechanism to bootstrap the SOLR cluster at startup (a full update instead of a delta) and for catching config errors. SOLR and the DB must stay in sync, so, SOLR importer is pointed at the wrong DB (according to the SOLR instance's JDBC URL)? That's a nope, application won't deploy, 503 Fix Your Shit.
(In hindsight I wonder, if we'd gone Postgres (I did argue for it, but that one was actually a legit, understandable business decision...), whether we could have done an FDW and gone with a "push" model by having Postgres publish its changes right to SOLR...)
Sure, that's where the hard computer science is - but if you use a tool like Elasticsearch other people have spent decades solving those problems for you already.
I wrote this tutorial on implementing faceted search with PostgreSQL and Django a while ago: https://simonwillison.net/2017/Oct/5/django-postgresql-facet...
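The core of that faceted-search approach boils down to counting matches per attribute alongside the search itself. A minimal sketch against a hypothetical schema (`documents` with a `body_search` tsvector and a `category` column; note `websearch_to_tsquery` needs PostgreSQL 11+, and the 2017 tutorial may use the older query functions):

```sql
-- Facet counts: how many matching documents fall into each category
SELECT category, count(*) AS n
FROM documents
WHERE body_search @@ websearch_to_tsquery('english', 'faceted search')
GROUP BY category
ORDER BY n DESC;
```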