You need to study the state of the art a bit more. On a Titan V/Tesla P100 you can't reproduce the BERT_large results its authors achieved with TPUs, because getting there requires substantially larger batch sizes that won't fit into 12/16GB. On a V100/Titan RTX it would fit, but you'd wait ~1 year for a single training session (40 epochs) to finish.
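To make the ~1 year figure easier to sanity-check, here is a rough back-of-envelope sketch. The batch size (256 sequences) and step count (1,000,000, roughly 40 epochs) are the values reported in the BERT paper; the throughput numbers are purely hypothetical placeholders, not measurements from any GPU, and the helper `training_days` is just an illustrative name. It also ignores that part of pre-training ran at a shorter sequence length.

```python
SECONDS_PER_DAY = 86_400


def training_days(batch_size, steps, sequences_per_second):
    """Wall-clock days to process batch_size * steps sequences at a fixed throughput."""
    total_sequences = batch_size * steps
    return total_sequences / sequences_per_second / SECONDS_PER_DAY


# Batch size and step count as reported in the BERT paper.
BATCH_SIZE = 256
STEPS = 1_000_000

# Hypothetical sustained single-GPU throughputs (sequences/second).
# These are assumptions for illustration; plug in your own measured number.
for throughput in (4, 8, 16):
    days = training_days(BATCH_SIZE, STEPS, throughput)
    print(f"{throughput:>3} seq/s -> {days:.0f} days")
```

Under these assumptions, a sustained rate of about 8 sequences per second works out to roughly a year for one run, so the estimate depends almost entirely on the throughput you actually get on your card.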

MS has already published another model based on BERT that is even better. It's unlikely that memory x #GPUs requirements will go down in the foreseeable future; it's more likely that everybody will train models as large as their infrastructure allows if they find something that improves target metrics.