Your original comment, as written, claims that NumPy cannot scale across threads because of the GIL. You admit that's wrong, yet somehow you can't read your own comment back and see that that's exactly what it says. What you actually meant is that combinations of pure Python and NumPy don't scale trivially across threads, which is true but not what you wrote. You were really just thinking of your PyTorch-specific use case, which you evidently haven't figured out how to scale properly, and oversimplified it into a complaint.
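For the record, NumPy releases the GIL inside most large vectorized and BLAS-backed operations, so plain threads can keep multiple cores busy on NumPy work. A minimal sketch, assuming NumPy is installed (the function name `work` is mine, not from this thread):

```python
import numpy as np
from concurrent.futures import ThreadPoolExecutor

# NumPy drops the GIL inside heavy ops like np.dot, so these
# threads can actually run the matrix multiplies concurrently.
arrays = [np.random.rand(300, 300) for _ in range(8)]

def work(a):
    # BLAS-backed call; the GIL is released while it executes
    return float(np.dot(a, a.T).trace())

with ThreadPoolExecutor(max_workers=4) as pool:
    results = list(pool.map(work, arrays))
```

The trick is that each task spends nearly all its time inside the GIL-free BLAS call; if the per-item work were pure Python bytecode, these threads would serialize on the GIL instead.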
> I don't think you understood what you did before and you already wasted a lot of CPUs.
No CPUs were wasted, lol. You're clearly confused about how threads and processes work in Python. You also don't seem to understand hierarchical parallelization, which is simply a pattern that works well when you can better maximize parallelism by combining processes with threads.
There are probably better ways to address your preprocessing problem, but I get the impression you're one of those people using Python only incidentally, out of necessity to run PyTorch jobs, and you're frustrated because you haven't yet come to the realization that you need to learn how to optimize your Python compute workload yourself; PyTorch doesn't do everything for you automatically.