TIL about Verse looks cool I'll have to check it out.
> SQL is not a pipeline, it is a graph.
Maybe it's both? and maybe there will always be hard-to-express queries in SQL, and that's ok?
the RDBMS's relational model is certainly a graph and joins accordingly introduce complexity.
For me, just as creators of the internet regret that subdomains come before domains, I really we could go back in time and have `FROM` be the first predicate and not `SELECT`. This is much more intuitive and lends itself to the idea of a pipeline: a table scan (FROM) that is piped to a projection (SELECT).
I'm as big a SQL stan as the next person and I'm also very skeptical anytime anyone says that SQL needs to be replaced.
At the same time, it's challenging that SQL cannot be iteratively improved and experimented upon.
IMHO, PRQL is a reasonable approach to extending SQL without replacing SQL.
But what I'd love to see is projects like Google's zeta-sql [1] and Substrait [2] get more traction. It would provide a more stable, standardized foundation upon which SQL could be improved, which would make the case for "SQL forever" even more strong.
I agree that CTEs help solve the problem of being able to read a SQL query from top to bottom, but I wouldn't say they're a panacea!
Personally, it's weird to me that `FROM` (scan) comes after `SELECT` (projection). IMHO the datasource should come first!
CTEs don't solve this problem they just let you chain multiple SELECTs together.
A real use case is that it would allow intellisense to kick in a lot earlier!
Instead you have to write `SELECT * FROM my_table` and only after can you edit the `*` and get auto-complete suggestions of the columns from `my_table`
if I could tell myself in 2015 who had just found the feather library and was using it to power my unhinged topic modeling for power point slides work, and explained what feather would become (arrow) and the impact it would have on the date ecosystem. I would have looked at 2026 me like he was a crazy person.
Yet today I feel it was 2016 dataders who is the crazy one lol
Indeed. feather was a library to exchange data between R and pandas dataframes. People tend to bash pandas but its creator (Wes McKinney) has changed the data ecosystem for the better with the learnings coming from pandas.
I know pandas has a lot of technical warts and shortcomings, but I'm grateful for how much it empowered me early in my data/software career, and the API still feels more ergonomic to me due to the years of usage - plus GeoPandas layering on top of it.
Really, prefer DuckDB SQL these days for anything that needs to perform well, and feel like SQL is easier to grok than python code most of the time.
> Really, prefer DuckDB SQL these days for anything that needs to perform well, and feel like SQL is easier to grok than python code most of the time.
I switched to this as well and its mainly because explorations would need to be translated to SQL for production anyways. If I start with pandas I just need to do all the work twice.
chdb's new DataStore API looks really neat (drop in pandas replacement) and exactly how I envisioned a faster pandas could be without sacrificing its ergonomics
Do people bash pandas? If so, it reminds me of Bjarne's quip that the two types of programming languages are the ones people complain about and the ones nobody uses.
He missed talking about the poor extensibility of pandas. It's missing some pretty obvious primitives to implement your own operators without whipping out slow for loops and appending to lists manually.
Yes (mostly) is the answer. You can use arrow as a backend, and I think with v3 (recently released) it's the default.
The harder thing to overcome is that pandas has historically had a pretty "say yes to things" culture. That's probably a huge part of its success, but it means there are now about 5 ways to add a column to a dataframe.
Adding support for arrow is a really big achievement, but shrinking an oversized api is even more ambitious.
I also use polars in new projects. I think Wes McKinney also uses it. If I remember correctly I saw him commenting on some polars memory related issues on GitHub. But a good chunk of polars' success can be attributed to Arrow which McKinney co-created. All the gripes people have with pandas, he had them too and built something powerful to overcome those.
I saw Wes speak in the early days of Pandas, in Berkeley. He solved problems that others just worked around for decades. His solutions are quirky but the work was very solid. His career advanced a lot IMHO for substantial reasons.. Wes personally marched through swamps and reached the other side.. others complain and do what they always have done.. I personally agree with the criticisms of the syntax, but Pandas is real and it was not easy to build it.
love the animations! I’ve been dreaming of doing the same to get people from csvs to something like Lance but with stops at page files, parquet, and Iceberg along the way
same! but if you get it inevitably wrong the first time it gives you the pinyin. but i struggled to get it to transcribe the consonants I was making let alone the tones. i'm pretty sure i'm not as bad as that!
I'm a big SQL stan here and I love the concept and if you ever wanna chat about how it might integrate with dbt let me know :)
conceptual questions:
1) why did you pick SQL?
to increase the Total Addressable Userbase with the thinking that a SQL API means more people can use it than those who know Python or Typescript?
2) What isn't or will never be supported by this relational model?
what are the constraints? Clickhouse comes to mind w/ it's intentionally imposed limitations on JOINs
3) databases are historically the stickiest products, but even today SQL dialects are sticky because of how closely tied they are to the query engine. why do you think users will adopt not only a new dialect but a new engine? Especially given that the major DWH vendors have been relentlessly competing to add AI search vector functionality into their products?
4) mindsdb comes to mind as something similar that's been in the market for a while but I don't hear it come up often. what makes you different?
playground feedback:
1) why are there no examples that:
a) use `JOIN` (that `,` is unhinged syntax imho for an implicit join)
b) don't use `*` (it's cool that there's actual numbers!)
2) i kinda get why the search results defaults to a UI, but as a SQL person I first wanted to know what columns exist. I was happy to see "raw table" was available but it took me a while to find it. might be have raw table and UI output visible at the same time with clear instructions on what columns the query requires to populate the UI
Would love to chat about it, and talk about dbt integration. There's a few use cases that have come up where this would be really helpful. I'll PM you.
1) So we do actually have a python and typescript API, it's just the console web experience is SQL only as it feels the best for that kind of experience. The most important thing though is that it's declarative. This helps keep things relatively simple despite all the configuration complexity, and is also the best for LLMs/agents as they can iterate on the syntax without doc context.
2) Yeah exactly, joins is something we can't do at the moment, and i'm not sure the exact solution their honestly. Under the hood most of Shaped's offline data is built around Clickhouse, and we do want to build a more standard SQL interface just so you can do ad-hoc, analytical queries. We're currently trying to work if we should integrate it more directly with ShapedQL or just keep it as a separate interface (e.g. a ShapedQL tab vs a Clickhouse SQL tab).
3) We didn't really want to create a new SQL dialect, or really a new database. The problem is none of the current databases are well suited for search and recommendations, where you need extremely low latency, scalable, fault-tolerance, but also the ability to query based on a user or session context. One of the big things here is that because Shaped stores the user interactions alongside the item catalog, we can encode real-time vectors based on those interactions all in an embedding query service. I don't think that's possible with any other database.
4) I haven't looked into mindsdb too much, but this is a good reminder for me to deep dive into it later today. From taking a quick pass on it, my guess is the biggest difference is that we're built specifically for real-time search, recommendations and RAG, and that means latency, and ability to integrate click-through-rate models and things becomes vital.
Thanks so much for the playground syntax, have some follow up questions but i'm going to pm you if that's okay. Agreed on the being able to see which columns exist.
reply