I think it's a simplification to compare progress only at the LLM level.
We have had big progress in AI in the last 2 years, but we have to take into account more than text token generation. We have image generation that is not only super realistic, but also lets you just type what you want to modify without learning complicated tools like ComfyUI.
We have text-to-speech and audio-to-audio that is not only very realistic and fluent in many languages but can also express emotions in speech.
We have video generation that gets more realistic every month while taking less computation.
There is big progress in 3D model generation. Speech-to-text is still improving and is fast enough to run on phones, which reduces latency. The next frontier is how AI is applied to robotics. Not to mention areas that are not sexy to end users, such as applications in healthcare.
I mention this because OP focused only on the lack of improvement in text token generation (since GPT 4.0), but those models have become multimodal, and not every AI is based on tokens: some generative AI, for example, is based on diffusion models.
I have a similar feeling. While LLMs have given me a new way to do search/questions, it is the byproducts that feel like the actual game changers. For me, it is vision models and pretty impressive STT and TTS. I am blind, so I have my own reasons why Vision and Speech have so many real world applications for me. Sure, LLMs are still the backbone of the applications emerging, but the real progress in terms of use cases is in the fringes.
Heck, I wrote myself my own personal radio moderator in a few hundred lines of shell, later rewritten in Python, as a simple MPD client. Watch for a queued track that has album art, and pass the track metadata + picture to the LLM. Send the result through a pretty natural-sounding TTS, and queue the resulting sound file before the next track. Suddenly, I had a radio moderator that would narrate album art for me. It gave me a glimpse into a world that wouldn't have been possible before. And while the LLM is basically writing the script, the real magic comes from multimodal and great-sounding TTS.
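For anyone curious, the flow described above can be sketched roughly like this. This is not the original script, just a minimal outline of the same steps. The `Track` class and the `describe`, `synthesize`, and `enqueue` callables are hypothetical stand-ins for the MPD lookup, the multimodal LLM call, the TTS call, and the MPD queue insert; none of them are real library APIs.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class Track:
    """Hypothetical container for what the MPD client knows about a track."""
    artist: str
    title: str
    album: str
    cover: Optional[bytes]  # album art bytes, or None if the track has none


def build_prompt(track: Track) -> str:
    """Text sent to the model together with the cover image."""
    return (
        "You are a radio moderator. Announce the next track, "
        f"'{track.title}' by {track.artist} from the album '{track.album}', "
        "and describe the attached cover art in two or three sentences."
    )


def moderate(
    track: Track,
    describe: Callable[[str, bytes], str],  # assumed: multimodal LLM call
    synthesize: Callable[[str], str],       # assumed: TTS, returns a file path
    enqueue: Callable[[str], None],         # assumed: queue file before the track
) -> Optional[str]:
    """Generate a spoken intro for a queued track that has album art."""
    if track.cover is None:
        return None  # nothing to narrate, leave the queue untouched
    script = describe(build_prompt(track), track.cover)
    audio_path = synthesize(script)
    enqueue(audio_path)
    return audio_path
```

Passing the three steps in as callables keeps the outline independent of any particular model or TTS vendor, and makes it easy to try with stubs before wiring up real services.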
Much potential for really cool looking/sounding PoCs. However, what makes me worry is that there is not much progress on (to me) obvious shortcomings. For instance, OpenAI TTS really can't speak any numbers correctly. Digits maybe, but once you hand it something like "2025" the chance is high it will have pronunciation problems. In the first months, this felt bad but temporarily acceptable. A year later, it feels hilariously sad that nothing has been done to address such a simple yet important issue. You know that something bad is going on when you start to consider expanding numbers to written-out form before passing the message to the TTS. My girlfriend keeps joking that since LLMs, we now have computers that totally cannot compute correctly. And she has a point. Sure, hand the LLM a tool to do calculations, and the situation improves somewhat. But the weakness with numbers seems to be an underlying one, as the TTS problems show.
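The workaround mentioned above, expanding numbers to written-out form before the text reaches the TTS, could look something like the sketch below. It is a simplification under stated assumptions: English only, plain non-negative integers only, and the function names `int_to_words` and `expand_numbers` are made up for this example. Decimals and comma-grouped numbers are deliberately left untouched, and a year is read as a plain number rather than as "twenty twenty-five".

```python
import re

ONES = ["zero", "one", "two", "three", "four", "five", "six", "seven",
        "eight", "nine", "ten", "eleven", "twelve", "thirteen", "fourteen",
        "fifteen", "sixteen", "seventeen", "eighteen", "nineteen"]
TENS = ["", "", "twenty", "thirty", "forty", "fifty", "sixty", "seventy",
        "eighty", "ninety"]
SCALES = ((10**9, "billion"), (10**6, "million"), (10**3, "thousand"))


def int_to_words(n: int) -> str:
    """Spell out a non-negative integer in English."""
    if n < 0:
        raise ValueError("only non-negative integers are handled")
    if n < 20:
        return ONES[n]
    if n < 100:
        tens, rest = divmod(n, 10)
        return TENS[tens] + ("-" + ONES[rest] if rest else "")
    if n < 1000:
        hundreds, rest = divmod(n, 100)
        tail = " " + int_to_words(rest) if rest else ""
        return ONES[hundreds] + " hundred" + tail
    for value, name in SCALES:
        if n >= value:
            head, rest = divmod(n, value)
            tail = " " + int_to_words(rest) if rest else ""
            return int_to_words(head) + " " + name + tail
    raise AssertionError("unreachable")


# Standalone runs of digits only: skip anything glued to letters,
# decimals such as 3.5, and comma-grouped numbers such as 1,000.
_PLAIN_INT = re.compile(r"(?<![\w.,])\d+(?!\w|[.,]\d)")


def expand_numbers(text: str) -> str:
    """Replace plain integers in text with their written-out form."""
    return _PLAIN_INT.sub(lambda m: int_to_words(int(m.group())), text)
```

With this, `expand_numbers("It is 2025 now.")` gives "It is two thousand twenty-five now.", which a TTS engine has a much easier time with than the raw digits.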
Vision models have so many applications for me... However, some of them turn out to be actually unusable in practice. That becomes clear when you use a vision model to read the values off a blood pressure sensor. Take three photos, and you get three slightly different values. Not obviously made up stuff, but numbers that could be. 145/90, 147/93, 142/97. Well, the range might be clear, but actually, you can never be sure. Great for scene and art descriptions, since hallucinations almost fall through the cracks. But I would never use it to read any kind of data, neither OCR'd text nor, gasp, numbers! You can never know if you have been lied to.
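One partial mitigation for the three-photos problem is to only accept a reading when several independent attempts agree exactly. A minimal sketch, assuming the readings have already been extracted as strings by some vision model (the helper name `agreed_reading` is made up here). Note that agreement does not prove correctness, since a model can repeat the same wrong value; it only flags the obviously unstable cases.

```python
from collections import Counter
from typing import Optional, Sequence


def agreed_reading(readings: Sequence[str], min_votes: int) -> Optional[str]:
    """Return the most common reading if at least min_votes attempts agree."""
    if not readings:
        return None
    # Count exact matches after trimming surrounding whitespace.
    value, votes = Counter(r.strip() for r in readings).most_common(1)[0]
    return value if votes >= min_votes else None
```

For the example above, `agreed_reading(["145/90", "147/93", "142/97"], 2)` returns `None`, which is the honest answer: none of the three values can be trusted.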
But still, some of the byproducts of LLMs feel like a real revolution. The moment you realize why Whisper is named the way it is. When you test it on your laptop, and realize that it just transcribed the YouTube video you were rather silently running in the background. Some of this stuff feels like a big jump.
I’m kind of disappointed that the generative AI hype has overshadowed how many non-generative tasks are basically “solved”, especially in vision.
Human-level object recognition can easily be trained up for custom use cases. Image segmentation is amazing. I can take a photo of a document and it’s accurately OCR'd. 10-15 years ago that would have been unfathomable.
I think current LLMs would give AI a much better reputation if they focused on non-generative applications. Sentiment analysis, translation, named entity extraction, etc.: these were all problems that data folks have been wrestling with that could very well be seen as “solved”, and a big win for AI that businesses would be able to confidently integrate into their workflows. Instead they went the generative route, and we have to deal with hallucinations and slop.
Ahh, I wanted to list translation as another "byproduct". That totally feels like solved now.
However, while OCR done by vision models feels neat, I personally don't feel like it changed anything for me. I have been using KNFB Reader and later Seeing AI, and both have sufficiently solved the "OCR a document you just photographed" use case for me. They even aid the picture-taking process by letting me know that a particular edge of the document is not visible.
Besides, I still don't fully understand the actual potential for hallucinations when doing OCR through vision models. I have a feeling there are a number of corner cases which will lead to hallucinations. The tendency to fill in things that might fit but aren't there is rather concerning. I am talking about spelling errors and numerical data.