Why not SQL for pure declarative queries? Here's an LLM-hallucinated SQL query for the Polars example:
SELECT country, SUM(amount - discount) AS total
FROM purchases
WHERE amount <= (
    SELECT MEDIAN(amount) * 10
    FROM purchases
    WHERE country = purchases.country
)
GROUP BY country;
It might just be an issue of familiarity, but SQL seems the most straightforward and easy to understand to me.
E.g., we can use regexes to query text. Python is a general-purpose language; you can query text without using regexes, but it would be insanity to ignore them completely (I don't know how easy it is to invoke regexes from R). Another example: a bash pipeline can be embedded in Python ("generate --flag | filter arg | sink") without reimplementing it in pure Python (you can, but it would be ugly). No idea how easy it is to invoke shell commands from R. SQL is just another DSL in this case -- use it from Python when it makes the solution more readable.
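To make the "SQL as an embedded DSL" point concrete, Python's stdlib sqlite3 lets you drop into SQL for just the query part (the table and values here are made up for illustration):

```python
import sqlite3

# In-memory database with a toy purchases table
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE purchases (country TEXT, amount REAL, discount REAL)")
con.executemany(
    "INSERT INTO purchases VALUES (?, ?, ?)",
    [("US", 10, 1), ("US", 20, 2), ("DE", 5, 0)],
)

# The query itself is plain SQL, embedded in Python like a regex pattern would be
rows = con.execute(
    "SELECT country, SUM(amount - discount) AS total "
    "FROM purchases GROUP BY country ORDER BY country"
).fetchall()
print(rows)  # [('DE', 5.0), ('US', 27.0)]
```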
It looks like the LLM hallucinated a query that doesn't group by country to get the median. Here's the version generated after asking it to fix that:
SELECT p.country, SUM(p.amount - p.discount) AS total
FROM purchases p
JOIN (
    SELECT country, MEDIAN(amount) * 10 AS median_amount
    FROM purchases
    GROUP BY country
) m ON p.country = m.country
WHERE p.amount <= m.median_amount
GROUP BY p.country;
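For comparison, a sketch of the same logic in pandas (the data is made up; `groupby`/`transform` broadcasts the per-country median back to each row, which plays the role of the subquery):

```python
import pandas as pd

# Hypothetical purchases data matching the query's columns
purchases = pd.DataFrame({
    "country": ["US", "US", "US", "DE"],
    "amount": [10.0, 20.0, 1000.0, 5.0],
    "discount": [1.0, 2.0, 3.0, 0.0],
})

# Per-country median of amount, aligned row-by-row via transform
median = purchases.groupby("country")["amount"].transform("median")

# Drop outliers, then sum (amount - discount) per country
kept = purchases[purchases["amount"] <= median * 10]
total = (kept["amount"] - kept["discount"]).groupby(kept["country"]).sum()
print(total)  # DE 5.0, US 27.0 (the 1000.0 row is excluded)
```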
I just don't get these, to be honest -- besides the fact that the author missed simple things like `df.groupby('var', as_index=False)`, aren't these obviously arbitrary "this is easier my way" complaints? (I did R before all the chaining stuff was popular, and I wouldn't stuff everything into a single command like that even now. It isn't like you get lazy evaluation or any special data-processing magic.)
So I get that people love chaining and the tidyverse -- good for you; I don't. But at least I can acknowledge that between my way and this way, people simply have different preferences, and one is not intrinsically easier.
Norm Matloff has a blog where he essentially argues the opposite of all the tidyverse stuff, https://github.com/matloff/TidyverseSkeptic, but to me it is the same idea in reverse (neither one is obviously easier to learn than the other, IMO).
This is the crux of it, in my opinion. 90% of my time using pandas seems to go on data formatting and cleaning to get info into the perfect form for pandas to do a relatively simple set of operations on. It gives me computational speed in exchange for a ton of fiddly tweaking.
The biggest thing for me is that data frames are collections of series, and not collections of rows. That just seems wrong on such a fundamental level to me.
I commented a bit late on the earlier one so I’ll repeat myself here.
These comparisons feel like they are geared to analysts, not programming oriented data scientists. Which is fine, I suppose.
——- prior comment ——-
This is fine and all (although I’m not impressed by the quality of the python code), but the examples don’t show how to do meta programming.
I often don’t want to manually write out column names, but to specify them programmatically, and similarly for a lot of these other examples. I don’t want to configure them manually.
I haven’t seen examples of that higher-level programming in these various R/Python comparisons. It’s always manual examples.
The examples usually feel like manual analyst query type tasks. The tone in this one strongly reinforces that with text like “oh and Maria asked me to xyz”
> The examples usually feel like manual analyst query type tasks. The tone in this one strongly reinforces that with text like “oh and Maria asked me to xyz”
Can you give an example of what you mean by “meta programming” and “manual” in this context?
I use pandas regularly, but I’m not sure exactly what you mean here. I’m wondering if you’re referring to some techniques I’m not aware of that could be useful.
Well, does the R df$colname behave like pandas df.colname, or is it like df["colname"], which opens up a whole new layer where it can be df[var]?
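On the pandas side of that question, a toy frame shows the distinction: attribute access and bracket access hit the same column, but only the bracket form accepts a variable (the column name here is made up):

```python
import pandas as pd

df = pd.DataFrame({"colname": [1, 2, 3]})

# Attribute access and string indexing refer to the same Series
print(df.colname.equals(df["colname"]))  # True

# Only the bracket form can take a programmatically chosen name
var = "colname"
print(df[var].sum())  # 6
```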
And there are a ton of helpful things, like dict unpacking of named aggs for groupbys. The dictionary can be the result of code, or it can be spelled out.
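A minimal sketch of that pattern, with made-up data -- the named-agg spec is built programmatically and then unpacked into `agg`:

```python
import pandas as pd

df = pd.DataFrame({
    "country": ["US", "US", "DE"],
    "amount": [10, 20, 5],
    "discount": [1, 2, 0],
})

# Build the named-aggregation spec in code instead of spelling it out
cols = ["amount", "discount"]
spec = {f"{c}_total": (c, "sum") for c in cols}

out = df.groupby("country", as_index=False).agg(**spec)
print(out)
```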
Also, a side note: both pandas and R make it easy to chain or pipe a long series of ops. I try to keep the number down, though. In other languages this can be frowned on as a hard-to-debug "train wreck" pattern, especially if it grows too long. Usually I define filters and functions outside the chain and pass them in, rather than inlining the filters or lambdas, as they’re so much easier to debug that way.
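A minimal sketch of that style, assuming a toy frame -- the mask and the transform are hypothetical helpers defined outside the chain and passed in via `.loc`/`.pipe`, so each can be inspected on its own:

```python
import pandas as pd

df = pd.DataFrame({"x": [1, 2, 3, 4], "y": [10, 20, 30, 40]})

# Filter defined outside the chain: easy to print and sanity-check alone
keep_big_x = df["x"] > 2

# Transform defined outside the chain: easy to unit-test alone
def add_ratio(d):
    return d.assign(ratio=d["y"] / d["x"])

# The chain itself stays short and readable
out = df.loc[keep_big_x].pipe(add_ratio)
print(out["ratio"].tolist())  # [10.0, 10.0]
```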
You can also debug by putting each method on a separate line, which allows for debugging by commenting out individual methods one at a time. It enhances readability too.
Example:
(df.method1()
   .method2()
   # .method3()  # comment out one line at a time to debug
   .method4())
While it’s not perfect and it’s not ggplot2, Seaborn is definitely a big improvement over bare matplotlib. You can still use matplotlib to modify the plots it spits out if you want, but the defaults are pretty good most of the time.
I don't understand why base R silently converting integers into characters is seen as a positive? Especially coupled with the comment "In R there's rarely any uncertainty when loading genomic data", surely this just increases uncertainty?
In Python, I have been finding Polars nicer to use:
Not as compact as the R example, but it gets a bit closer than the pandas approach.
- https://pypi.org/project/polars/
- https://github.com/pola-rs/polars/