Making an Endpoint 50x Faster

anton_gogolev · on Dec 7, 2019

The whole "...we don’t want duplicated subscriber lists" is a very arbitrary requirement.

What's more, all this could have been avoided if a proper relational database structure (not even denormalized, GordonS) was there in the first place.

cogman10 · on Dec 7, 2019

Agreed. This is a poor schema made worse.

Querying the json is why this is slow. Pull that garbage out of the json and into a couple of tables and you'd be much faster.

Izkata · on Dec 7, 2019

> Pull that garbage out of the json and into a couple of tables and you'd be much faster.

It looks to me like a user-generated search query, not normalized data. A blob column makes perfect sense if this is the case, and pulling it out into other tables/columns would significantly overcomplicate everything else.

cogman10 · on Dec 7, 2019

More complicated than a custom hashing scheme and required ordering?

I don't buy that. Tables aren't complex.

Izkata · on Dec 8, 2019

Tables aren't complex, but this particular use can get there. Off the top of my head:

1) Complexity inherent in search criteria: Multiple values per field and whether it's AND or OR. (This can go much much further, as ours does, but I'm limiting it here to what's in the blog post)

2) Centralization of criteria: Multiple tables pretty much means a table per field, which are only related by foreign keys back to the central table.

3) Following on from 2, since there's no central location to identify all the relevant tables, it's much easier to miss one in a JOIN.

4) Following on from 1: Serialization/deserialization, for creating new entries and populating the submission form for an "edit" mode. (At minimum you'll have to reconstruct the multiple-values-per-field situation from the tables)

5) Following on from 3 and 4, alternate uses such as adding a search description derived from the criteria.

There's easy ways around some of this, such as in (4) forcing all the app code to go through the same serialization/deserialization (which sidesteps most of the others), but that's no guarantee that there won't be stray code accessing the tables directly for another reason (hitting on (3)), and almost completely erases the benefit of fitting the criteria into a table relation. It's essentially the same as just using a blob field, except a lot more complicated.

Sometimes blob fields really are the right choice, even in a relational database.

sundayedition · on Dec 7, 2019

Agreed, but it's pretty clear this was a legacy app that for reasons unknown, wasn't worth a rewrite.

shrubble · on Dec 7, 2019

Why not use a materialized view?

https://www.postgresql.org/docs/9.3/sql-creatematerializedvi...

mobilemidget · on Dec 7, 2019

"Because order should not matter, both arrays are sorted before they are compared."

I do not understand this line very well. If order does not matter, why sort anything?

ignoramous · on Dec 7, 2019

Consider character sequences: [n, e, t] and [t, e, n]. If you'd want to know if the lists are equivalent (have the same characters, same number of times), then sorting them would make the comparison straight-forward since order doesn't matter.

See: https://leetcode.com/problems/valid-anagram/solution

mobilemidget · on Dec 7, 2019

ah yes but of course, probably did not have enough coffee yet when I read it :) thank you

But then perhaps it is worth storing it sorted ( or as well if original value is important) and making it even faster?

GrayShade · on Dec 7, 2019

They order does not matter and can be arbitrary. But when you compare two entries you need to sort them, because the entries are arrays. It's like the trick of converting sttings to lower-case to get an case-insensitive comparison (which is a bad idea in a non-Latin world, but that's another story).

lowmagnet · on Dec 7, 2019

I applaud the effort put in for performance, but why not just enqueue the change notification and do the sends on the back end? This sort of fiddly process is exactly why deferred action is appropriate.

hawkice · on Dec 7, 2019

You're not wrong, but I think starting with structuring data to be reasonably performant is always the best first step. You can always make it deferred later if you want or need. This way you don't hammer your DB and leave weird structural issues in your model.

GordonS · on Dec 7, 2019

TL;DR: a short article describing a normalised schema, with big performance gains when denormalising data.

Article is very application-specific, but is perhaps a reminder that denormalisation can be useful when working with relational databases.

mhd · on Dec 7, 2019

I wouldn't describe anything with a JSON kitchen sink field as "normalized".

setr · on Dec 7, 2019

If it was normalized, there wouldn't be a problem. You'd just index the subscriber list and postgres would handle the = or in queries properly.

But json to postgres 9.3 is no better than an abitrary string that stores a delimited list.

This is a denormalized workaround for a denormalized schema, with big performance gains.