Yup. These days, “we’ll handle that in code” reads to me as “I don’t understand my tools, and don’t want to learn.” Or hubris. The sheer number of times I’ve seen data integrity errors because someone didn’t think they needed a database-level constraint is too damn high.
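To make the constraint point concrete, here is a minimal sketch using Python's built-in sqlite3 module. The `accounts` table, its columns, and the `try_insert` helper are made up for illustration; the same idea applies to UNIQUE/CHECK/NOT NULL constraints in Postgres or MySQL, though error types and syntax details differ there.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Hypothetical table: the rules live in the schema, not in application code.
conn.execute(
    """
    CREATE TABLE accounts (
        id      INTEGER PRIMARY KEY,
        email   TEXT    NOT NULL UNIQUE,
        balance INTEGER NOT NULL CHECK (balance >= 0)
    )
    """
)


def try_insert(email, balance):
    """Return True if the row was accepted, False if a constraint rejected it."""
    try:
        # "with conn" commits on success and rolls back on an exception.
        with conn:
            conn.execute(
                "INSERT INTO accounts (email, balance) VALUES (?, ?)",
                (email, balance),
            )
        return True
    except sqlite3.IntegrityError:
        return False


first_ok = try_insert("a@example.com", 100)      # accepted
duplicate_ok = try_insert("a@example.com", 50)   # rejected by UNIQUE
negative_ok = try_insert("b@example.com", -1)    # rejected by CHECK
valid_ok = try_insert("c@example.com", 0)        # accepted
row_count = conn.execute("SELECT COUNT(*) FROM accounts").fetchone()[0]
```

Whatever code path writes to the table, including a one-off script that skips the app's validations, hits the same rules.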
The other one (also related) is normalization. They’ll have hundreds of millions of rows of duplicated, low-cardinality strings, because “joins are expensive,” while somehow missing the fact that the increased I/O from reduced page-packing also has a performance cost.
Yeah but the increased I/O is cheaper. It's easier to add another webserver as opposed to upgrading your db server.
And I don't think it's as simple as you make it out to be. Where I work we use, amongst other things, Rails. There are places in our codebase where using joins just isn't feasible because the database would use too much memory; let's just say N is large.
But since we use Rails we can have it query the tables separately and join the results through the defined models. We literally save like 20 seconds in some cases because 1 HUGE query becomes 8 straightforward ones with maximum index usage.
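For readers who haven't seen this pattern: a rough sketch of what "query the tables separately and stitch them together in the app" looks like, written in Python with sqlite3 rather than Rails. It is roughly the shape of what Rails' `preload` does (one query per table, the second using an IN list), but the `authors`/`posts` tables and both function names are invented for illustration and are not taken from the codebase described above.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE authors (id INTEGER PRIMARY KEY, name TEXT NOT NULL);
    CREATE TABLE posts (
        id        INTEGER PRIMARY KEY,
        author_id INTEGER NOT NULL REFERENCES authors(id),
        title     TEXT NOT NULL
    );
    INSERT INTO authors (id, name) VALUES (1, 'ann'), (2, 'bob');
    INSERT INTO posts (id, author_id, title)
    VALUES (1, 1, 'first'), (2, 1, 'second'), (3, 2, 'third');
    """
)


def joined_in_database():
    """One query; the database performs the join."""
    return conn.execute(
        "SELECT p.title, a.name FROM posts p "
        "JOIN authors a ON a.id = p.author_id ORDER BY p.id"
    ).fetchall()


def joined_in_application():
    """Two simple queries; the application matches rows by key."""
    posts = conn.execute(
        "SELECT id, author_id, title FROM posts ORDER BY id"
    ).fetchall()
    author_ids = sorted({author_id for _, author_id, _ in posts})
    placeholders = ",".join("?" * len(author_ids))
    authors = dict(
        conn.execute(
            f"SELECT id, name FROM authors WHERE id IN ({placeholders})",
            author_ids,
        ).fetchall()
    )
    return [(title, authors[author_id]) for _, author_id, title in posts]


db_rows = joined_in_database()
app_rows = joined_in_application()
```

Both produce the same rows here. Which one is faster depends on data volume, indexes, and how much ends up crossing the network, which is what the rest of the thread argues about.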
And because we have this capability in Rails, we would never use something like this, because that would necessitate holding two mental models and having a clear "what do we run where" directive, which honestly is a PITA.
That sounds like you’re either joining on an unindexed column, or have outdated statistics. 8 queries implies 8 tables, which is well under the limit for both Postgres and MySQL where the planner may give up and choose a suboptimal plan. You may have something like “SELECT * FROM foo JOIN bar ON foo.id = bar.foo_id WHERE bar.baz = 'qux'”, and if there’s no composite index on (bar.foo_id, bar.baz), the planner will choose whichever column it thinks is more selective; it then has to go get the value for the other one, and that can be quite expensive at scale. Even if you have a separate index on each of those, there’s no guarantee the planner will decide to merge them.
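A small way to see the composite index being picked up, assuming SQLite (via Python's sqlite3) as a stand-in; Postgres and MySQL planners make their own choices and print their plans differently. The query below is the per-row lookup into `bar` that the join above boils down to, and the index name `idx_bar_foo_id_baz` and the `plan` helper are made up.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE bar ("
    "id INTEGER PRIMARY KEY, foo_id INTEGER NOT NULL, baz TEXT NOT NULL)"
)

# The lookup the join needs for each foo row: match on both columns.
QUERY = "SELECT * FROM bar WHERE foo_id = ? AND baz = ?"


def plan(sql, params):
    """Return SQLite's query plan as one string (detail is the 4th column)."""
    rows = conn.execute("EXPLAIN QUERY PLAN " + sql, params).fetchall()
    return " | ".join(row[3] for row in rows)


plan_without_index = plan(QUERY, (1, "qux"))

# Composite index covering both predicates, as suggested above.
conn.execute("CREATE INDEX idx_bar_foo_id_baz ON bar (foo_id, baz)")
plan_with_index = plan(QUERY, (1, "qux"))

print("before:", plan_without_index)
print("after: ", plan_with_index)
```

The exact plan text varies by SQLite version, so treat the printed output as indicative only. On Postgres the equivalent check would be EXPLAIN (ideally with ANALYZE) on the real query.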
> Yeah but the increased I/O is cheaper. It's easier to add another webserver as opposed to upgrading your db server.
I’m referring to I/O on the DB. Rows are stored in pages that are generally 8 KiB (the Postgres default, and the fixed page size in MS SQL Server) or 16 KiB (InnoDB default). If you can fit 200 rows per page, a given query will probably have to fetch fewer pages than if you can only fit 100 rows per page.
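Back-of-the-envelope version of that, as a sketch. The 40-byte and 80-byte row widths are made-up numbers chosen to land near 200 and 100 rows per page, and the math ignores page headers, per-row overhead, and fill factor, so real figures will be somewhat worse.

```python
import math

PAGE_SIZE_BYTES = 8 * 1024  # 8 KiB page


def rows_per_page(row_bytes, page_bytes=PAGE_SIZE_BYTES):
    """Whole rows that fit in one page, ignoring all per-page overhead."""
    return page_bytes // row_bytes


def pages_needed(row_count, row_bytes, page_bytes=PAGE_SIZE_BYTES):
    """Pages required to hold row_count rows of the given width."""
    return math.ceil(row_count / rows_per_page(row_bytes, page_bytes))


# Hypothetical widths: a small integer key vs. a repeated string value.
narrow = pages_needed(1_000_000, 40)
wide = pages_needed(1_000_000, 80)

print(f"narrow rows: {rows_per_page(40)} per page, {narrow:,} pages")
print(f"wide rows:   {rows_per_page(80)} per page, {wide:,} pages")
```

Doubling the row width roughly doubles the pages a full scan has to read, and halves how many rows the buffer cache can hold.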
> 8 queries implies 8 tables, which is well under the limit for both Postgres and MySQL where the planner may give up and choose a suboptimal plan.
I'm not sure that's true for Postgres. Optimal join ordering is NP-hard, and finding the optimal order by exhaustive search means going through on the order of n! combinations (n = number of joins), which is why Postgres generally falls back on heuristics to figure out join order. 8 is also the default value of the "join_collapse_limit" setting in Postgres, so by default it won't reliably optimize the order of more than 8 joins at a time. Additionally, Postgres switches to its "genetic algorithm" (GEQO), aka testing random combinations of joins, at 12 joins by default (the geqo_threshold setting).
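To put numbers on the n! growth, a quick sketch. It uses n! as the simple upper-bound count of orderings mentioned above; a real planner prunes and does not literally enumerate them all. The two constants are the Postgres defaults named above, and the function name is made up.

```python
import math

JOIN_COLLAPSE_LIMIT_DEFAULT = 8   # Postgres default for join_collapse_limit
GEQO_THRESHOLD_DEFAULT = 12       # Postgres default for geqo_threshold


def join_orderings(n):
    """Number of ways to order n items: n factorial."""
    return math.factorial(n)


orderings = {n: join_orderings(n) for n in (4, 6, 8, 12)}

for n, count in orderings.items():
    print(f"{n:>2} items: {count:,} orderings")
```

Going from 8 to 12 multiplies the search space by about 11,880, which gives some intuition for why the defaults sit where they do.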
I generally agree it's better to use the database to its fullest, but I would say 8 joins is probably the "limit". Internally at work I've advised teams to try to avoid anything more than 6 joins for "hot-path" queries.
> literally save like 20 seconds in some cases because 1 HUGE query becomes 8 straightforward ones with maximum index usage.
I don’t understand how splitting a query up would have any relationship to index utilization; the planner should trivially pick up on it?
Also, are you sure you’re not solving a different problem[0]? Doing joins manually being faster doesn’t smell right, except in the case where data duplication increases the total result set size substantially.
Like, the cost of increased network load from not filtering through the join should outweigh anything else in the equation.