You need to study the state of the art a bit more. On a Titan V or Tesla P100 you can't reproduce the BERT_large results its authors achieved with TPUs, because getting there requires substantially larger batch sizes than will fit into 12/16GB. On a V100 or Titan RTX it would fit, but you'd wait ~1 year for a single training session (40 epochs) to finish.
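To make the memory point concrete, here is a rough back-of-the-envelope sketch, not a measurement. It assumes BERT_large has about 340M parameters and is trained in fp32 with Adam (weights, gradients and two moment buffers); the function names are made up for illustration, and activations, which grow with batch size and sequence length, are not modelled at all. Card sizes are treated as GiB for simplicity.

```python
# Rough static-memory estimate for training BERT_large with Adam in fp32.
# Assumptions (not from the comment above): ~340M parameters, fp32 everywhere,
# activations ignored. Helper names are hypothetical.

BYTES_FP32 = 4
BERT_LARGE_PARAMS = 340_000_000  # approximate published parameter count


def static_training_bytes(num_params: int) -> int:
    """Bytes for weights + gradients + Adam first/second moments (all fp32)."""
    copies = 4  # weights, gradients, Adam m, Adam v
    return num_params * BYTES_FP32 * copies


def headroom_gib(card_gib: float, num_params: int) -> float:
    """Memory left for activations on a card of the given size, in GiB."""
    return card_gib - static_training_bytes(num_params) / 2**30


if __name__ == "__main__":
    for card in (12, 16, 24, 32):
        left = headroom_gib(card, BERT_LARGE_PARAMS)
        print(f"{card} GB card: ~{left:.1f} GiB left for activations")
```

Under these assumptions roughly 5 GiB is gone before a single example is processed, so whatever remains on a 12/16GB card has to hold all activations for the batch, which is what limits the batch size.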
MS already published another model based on BERT that is even better. It's unlikely that memory x #GPUs will go down in the foreseeable future; it's more likely that everybody will start using models as large as their infrastructure allows if they find something that improves target metrics.