I'm in about the same situation as OP. We have a small cluster of POWER9 machines that has been unmaintained and unused for a while, so I will set it up from scratch. I've been looking into solutions that would be a good fit. For the moment we are just a few students/postdocs, so manual scheduling is feasible, but eventually we would like to make it available to other students at the institution.
My candidates are also:
- Slurm + Ray/Lightning/etc.
- determined.ai (maybe together with Slurm)
Some people recommend a Kubernetes setup with Kubeflow, but I would imagine that is a bit too complex for a small cluster.
Anyone else with experience in this? Any other suggestions?
To make the environments as reproducible as possible, it would be great to also have a setup based on Docker containers and maybe Nix, but I'm not sure whether that is feasible on ppc64. Guix and Spack have also come up in my searches.
edit: typo