I'm a PhD student, and I manage the DL server we have at my lab. While looking for a way to manage workloads and environments so that undergraduate students and other researchers can build reproducible models, I arrived at determined.ai. It seemed interesting enough to share with the HN crowd.
That’s cool. I was wondering how this compares to Ray (which I use with my institution's Slurm-based clusters). The scheduler that determined.ai has seems a lot more granular, which suits the workloads you get with a team of people doing a bunch of deep learning model prototyping. Our debug queue has a five-minute preempt time, which sometimes adds a lot of friction to quick debugging iterations when utilization is maxed out.
I'm in about the same situation as OP. We have a small Power9 cluster that has been unmaintained and unused for a while, so I will set it up from scratch. I've been looking into solutions that would be a good fit. For the moment we are just a few students and postdocs, so manual scheduling is feasible, but eventually we would like to make it available to other students at the institution.
My candidates are also:
- Slurm + Ray/Lightning/etc.
- determined.ai (maybe together with Slurm)
Some people advertise a Kubernetes setup with Kubeflow, but I would imagine that is a bit too complex for a small cluster.
Anyone else with experience in this? Any other suggestions?
To make the environments as reproducible as possible, it would be great to also have a setup based on Docker containers and maybe Nix, but I'm not sure whether that is feasible on ppc64. Guix and Spack have also come up in my searches.
I work for determined.ai, glad you find it useful! Feel free to reach out if you need any help or have questions about Determined. IMO the examples are the best resource for learning how to configure your models and data to work in Determined.