I've always wanted to implement an FFT from scratch and play with it to separate audio waves, but then a full-time job came along. I guess once you separate the vocals from everything else, you can just feed them to a speech-to-text model?
To be completely honest, as a human who doesn't speak English natively, I find some lyrics hard to understand. I've seen native English speakers have this problem too. I think it's only natural for a NN to make the same mistakes.
Source separation is commonly done by applying masks to the spectrogram; deep learning is used to estimate the masks for the different instruments. As you mentioned, this is the approach we will follow in the subsequent steps.
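For context, here's a minimal sketch of how that masking step fits together, assuming a hypothetical trained model `predict_vocal_mask` that maps a magnitude spectrogram to per-bin values in [0, 1]:

```python
# Sketch of spectrogram masking for source separation.
# `predict_vocal_mask` is a stand-in for a trained network.
import numpy as np
from scipy.signal import stft, istft

def separate_vocals(audio, sample_rate, predict_vocal_mask):
    # STFT: complex spectrogram (freq bins x time frames)
    _, _, spec = stft(audio, fs=sample_rate, nperseg=2048)

    # The network predicts a soft mask from the magnitude only;
    # the mixture's original phase is reused for reconstruction.
    mask = predict_vocal_mask(np.abs(spec))  # values in [0, 1]

    # Element-wise masking keeps each bin's vocal energy
    # and suppresses everything else.
    _, vocals = istft(spec * mask, fs=sample_rate, nperseg=2048)
    return vocals

# Flow check with a trivial all-pass mask and placeholder audio:
if __name__ == "__main__":
    sr = 22050
    audio = np.random.randn(sr * 2)  # 2 s of noise as a stand-in signal
    vocals = separate_vocals(audio, sr, lambda mag: np.ones_like(mag))
```

The same structure repeats per source: one mask per instrument, all applied to the same mixture spectrogram.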