Why not SQL for pure declarative queries? Here's an LLM-hallucinated SQL query for the Polars example:
SELECT country, SUM(amount - discount) AS total
FROM purchases
WHERE amount <= (
    SELECT MEDIAN(amount) * 10
    FROM purchases
    WHERE country = purchases.country
)
GROUP BY country;
It might just be an issue of familiarity, but SQL seems the most straightforward and easy to understand to me.
E.g., we can use regexes to query text. Python is a general-purpose language; you can query text without using regexes, but it would be insanity to ignore them completely (I don't know how easy it is to invoke regexes from R). Another example: a bash pipeline can be embedded in Python ("generate --flag | filter arg | sink") without reimplementing it in pure Python (you can, but it would be ugly). No idea how easy it is to invoke shell commands from R. SQL is just another DSL in this case -- use it from Python when it makes the solution more readable.
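To make the "SQL as an embedded DSL" point concrete, Python's stdlib sqlite3 lets you drop into SQL for just the query part (the table and values here are made up for illustration):

```python
import sqlite3

# In-memory database with a toy purchases table
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE purchases (country TEXT, amount REAL, discount REAL)")
con.executemany(
    "INSERT INTO purchases VALUES (?, ?, ?)",
    [("US", 10, 1), ("US", 20, 2), ("DE", 5, 0)],
)

# The query itself is plain SQL, embedded in Python like a regex pattern would be
rows = con.execute(
    "SELECT country, SUM(amount - discount) AS total "
    "FROM purchases GROUP BY country ORDER BY country"
).fetchall()
print(rows)  # [('DE', 5.0), ('US', 27.0)]
```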
It looks like the LLM hallucinated a query that doesn't group by country to get the median. Here's the version generated after asking it to fix that:
SELECT p.country, SUM(p.amount - p.discount) AS total
FROM purchases p
JOIN (
    SELECT country, MEDIAN(amount) * 10 AS median_amount
    FROM purchases
    GROUP BY country
) m ON p.country = m.country
WHERE p.amount <= m.median_amount
GROUP BY p.country;
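For comparison, a sketch of the same logic in pandas (the data is made up; `groupby`/`transform` broadcasts the per-country median back to each row, which plays the role of the subquery):

```python
import pandas as pd

# Hypothetical purchases data matching the query's columns
purchases = pd.DataFrame({
    "country": ["US", "US", "US", "DE"],
    "amount": [10.0, 20.0, 1000.0, 5.0],
    "discount": [1.0, 2.0, 3.0, 0.0],
})

# Per-country median of amount, aligned row-by-row via transform
median = purchases.groupby("country")["amount"].transform("median")

# Drop outliers, then sum (amount - discount) per country
kept = purchases[purchases["amount"] <= median * 10]
total = (kept["amount"] - kept["discount"]).groupby(kept["country"]).sum()
print(total)  # DE 5.0, US 27.0 (the 1000.0 row is excluded)
```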
I just don't get these, to be honest -- besides the fact that the author missed simple things like `df.groupby('var', as_index=False)`, aren't these obviously arbitrary "this is easier my way" complaints? (I did R before all the chaining stuff was popular, and I wouldn't stuff everything into a single command like that even now. It isn't like you get lazy evaluation or any special data-processing magic.)
So I get that people love chaining and the tidyverse -- good for you; I don't. But at least I can acknowledge that between my way and this way, people simply have different preferences, and one is not intrinsically easier.
Norm Matloff has a blog where he essentially argues the opposite of all the tidyverse stuff, https://github.com/matloff/TidyverseSkeptic, but to me it is the same idea in reverse (neither one is obviously easier to learn than the other, IMO).
This is the crux of it, in my opinion. 90% of my time using pandas seems to go on data formatting and cleaning to get info into the perfect form for pandas to do a relatively simple set of operations on. It gives me computational speed in exchange for a ton of fiddly tweaking.
The biggest thing for me is that data frames are collections of series, and not collections of rows. That just seems wrong on such a fundamental level to me.
I commented a bit late on the earlier one so I’ll repeat myself here.
These comparisons feel like they are geared to analysts, not programming oriented data scientists. Which is fine, I suppose.
——- prior comment ——-
This is fine and all (although I’m not impressed by the quality of the python code), but the examples don’t show how to do meta programming.
I often don’t want to manually write out column names, but to specify them programmatically, and similarly for a lot of these other examples. I don’t want to configure them manually.
I haven’t seen examples of that higher-level programming in these various R/Python comparisons. It’s always manual examples.
The examples usually feel like manual analyst query type tasks. The tone in this one strongly reinforces that with text like “oh and Maria asked me to xyz”
> The examples usually feel like manual analyst query type tasks. The tone in this one strongly reinforces that with text like “oh and Maria asked me to xyz”
Can you give an example of what you mean by “meta programming” and “manual” in this context?
I use pandas regularly, but I’m not sure exactly what you mean here. I’m wondering if you’re referring to some techniques I’m not aware of that could be useful.
Well, does the R df$colname behave like pandas df.colname, or is it like df["colname"], which opens up a whole new layer where it can be df[var]?
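On the pandas side of that question, a toy frame shows the distinction: attribute access and bracket access hit the same column, but only the bracket form accepts a variable (the column name here is made up):

```python
import pandas as pd

df = pd.DataFrame({"colname": [1, 2, 3]})

# Attribute access and string indexing refer to the same Series
print(df.colname.equals(df["colname"]))  # True

# Only the bracket form can take a programmatically chosen name
var = "colname"
print(df[var].sum())  # 6
```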
And there are a ton of helpful things, like dict unpacking of named aggs for groupbys. The dictionary can be the result of code, or it can be spelled out.
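A minimal sketch of that pattern, with made-up data -- the named-agg spec is built programmatically and then unpacked into `agg`:

```python
import pandas as pd

df = pd.DataFrame({
    "country": ["US", "US", "DE"],
    "amount": [10, 20, 5],
    "discount": [1, 2, 0],
})

# Build the named-aggregation spec in code instead of spelling it out
cols = ["amount", "discount"]
spec = {f"{c}_total": (c, "sum") for c in cols}

out = df.groupby("country", as_index=False).agg(**spec)
print(out)
```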
Also, a side note: both pandas and R make it easy to chain or pipe a long series of ops. I try to keep the number down, though. In other languages this can be frowned on as a hard-to-debug "train wreck" pattern, especially if it grows too long. Usually I define filters and functions outside the chain and pass them in, rather than inlining the filters or lambdas, as they’re so much easier to debug that way.
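A minimal sketch of that style, assuming a toy frame -- the mask and the transform are hypothetical helpers defined outside the chain and passed in via `.loc`/`.pipe`, so each can be inspected on its own:

```python
import pandas as pd

df = pd.DataFrame({"x": [1, 2, 3, 4], "y": [10, 20, 30, 40]})

# Filter defined outside the chain: easy to print and sanity-check alone
keep_big_x = df["x"] > 2

# Transform defined outside the chain: easy to unit-test alone
def add_ratio(d):
    return d.assign(ratio=d["y"] / d["x"])

# The chain itself stays short and readable
out = df.loc[keep_big_x].pipe(add_ratio)
print(out["ratio"].tolist())  # [10.0, 10.0]
```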
You can also debug by putting each method on a separate line, which allows for debugging by commenting out individual methods one at a time. It enhances readability too.
Example:
(df.method1()
   .method2()
   # .method3()  # comment out one line at a time to debug
   .method4())
While it’s not perfect and it’s not ggplot2, Seaborn is definitely a big improvement over bare matplotlib. You can still use matplotlib to modify the plots it spits out if you want, but the defaults are pretty good most of the time.
I don't understand why base R silently converting integers into characters is seen as a positive? Especially coupled with the comment "In R there's rarely any uncertainty when loading genomic data", surely this just increases uncertainty?
In Python, I have been finding Polars nicer to use:
Not as compact as the R example, but it gets a bit closer than the pandas approach.
- https://pypi.org/project/polars/
- https://github.com/pola-rs/polars/