This fortunately is a solved problem. Or will be, once Amazon, Apple and Google get their heads out of their asses and plug a better voice recognition model into an LLM.
Silly how OpenAI could blow all voice assistants out of the water today, if they just added Android intents as function calls to the ChatGPT app. Yes, the "voice chat mode" is that good.
I know I'm getting close to Torment Nexus territory, but how do you get an LLM to run code as the response? Given that an LLM basically calculates the most probable text that follows a prompt, how do you then go from that response to a function call that flips a light switch? Seems like you'd need some other ML/AI that takes the LLM output, figures out that it most likely means a certain call to an API, and then executes that call.
With Alexa I can program if/then statements, like basically when I say X then do Y. If something like ChatGPT requires the same thing then I don't see the advantage.
> With Alexa I can program if/then statements, like basically when I say X then do Y. If something like ChatGPT requires the same thing then I don't see the advantage.
Yes, I was thinking about even something as simple as if/then, which could be configured in the UI and exposed to GPT-4 as the usual function call stuff.
The advantage here would be twofold:
1. GPT-4 won't need you to speak a weird command language; it's quite good at understanding regular talk and turning it into structured data. It will have no problem understanding things like "oh flip the lights in the living room and run some music, idk, maybe some Beatles", followed by "nah, too bright, tone it down a little", and reliably converting them into data you could feed to your if/else logic.
2. ChatGPT (the app) has a voice recognition model that, unlike Google Assistant, Siri and Alexa, does not suck. It's the first model I've experienced that can convert my casual speech into text with 95%+ accuracy even with lots of ambient noise.
Those are the features the ChatGPT app offers today. Right now, if they added a basic bidirectional Tasker integration (user-configurable "function calls" emitting structured data for Tasker, and the ability for Tasker to add messages into the chat), anyone could quickly DIY something 20x better than Google Assistant.
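To make the "user-configurable function calls" idea a bit more concrete, here is a minimal sketch of how rules entered in a UI might be turned into JSON-Schema-style function definitions of the kind function-calling APIs generally accept. The rule names, fields, and the exact wrapper shape are assumptions for illustration, not anything the ChatGPT app or Tasker provides today.

```python
import json

# Hypothetical rules, as a user might configure them in a UI.
RULES = [
    {
        "name": "set_lights",
        "description": "Set the brightness of the lights in a room",
        "params": {"room": "string", "brightness_percent": "integer"},
    },
    {
        "name": "play_music",
        "description": "Play music by an artist",
        "params": {"artist": "string"},
    },
]


def rule_to_tool(rule):
    """Turn one configured rule into a JSON-Schema-style function definition."""
    return {
        "type": "function",
        "function": {
            "name": rule["name"],
            "description": rule["description"],
            "parameters": {
                "type": "object",
                "properties": {
                    key: {"type": kind} for key, kind in rule["params"].items()
                },
                "required": list(rule["params"]),
            },
        },
    }


# The list that would be sent to the model alongside the conversation.
TOOLS = [rule_to_tool(rule) for rule in RULES]

if __name__ == "__main__":
    print(json.dumps(TOOLS, indent=2))
```

The point of the sketch is that the user only describes what each action takes as input; the model is then responsible for mapping casual speech onto those inputs.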
At some point you've got to get from language to action, yes - in my case, I use the LLM as a multi-stage classifier, mapping from a set of high-level areas of capability, to more focused mappings to specific systems and capabilities. So the first layer of classification might say something like "this interaction was about <environmental control>" where <environmental control> is one of a finite set of possible systems. The next layer might say something like "this is about <lighting>", and the next layer may now have enough information to interrogate using a specific enough prompt (which may be generated based on a capability definition, so for example "determine any physical location, an action, and any inputs regarding colour or brightness from the following input" - which can be generated from the possible inputs of the capability you think you're addressing).
Of course this isn't foolproof, and there still needs to be work defining capabilities of systems, etc. (although these are tasks AI can assist with). But it's promising - "teaching" the system how to do new things is relatively simple, and effectively akin to describing capabilities rather than programming directly.
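A rough sketch of that layered classification, under some loud assumptions: the capability tree is made up, and `keyword_classify` is a trivial keyword matcher standing in for the LLM call so the example runs on its own. In practice each `classify` step would be a prompt asking the model to pick one of the listed options.

```python
# Hypothetical capability tree: system -> capability -> inputs it accepts.
CAPABILITIES = {
    "environmental control": {
        "lighting": ["location", "action", "brightness"],
        "heating": ["location", "temperature"],
    },
    "media": {
        "music": ["artist", "action"],
    },
}


def build_extraction_prompt(fields, utterance):
    """Generate the final, focused prompt from the capability's inputs."""
    return f"Determine {', '.join(fields)} from the following input: {utterance}"


def route(utterance, classify):
    """Narrow down layer by layer, then build a capability-specific prompt."""
    system = classify(utterance, list(CAPABILITIES))
    capability = classify(utterance, list(CAPABILITIES[system]))
    fields = CAPABILITIES[system][capability]
    return system, capability, build_extraction_prompt(fields, utterance)


def keyword_classify(utterance, options):
    """Stand-in for an LLM classifier: picks the first option with a keyword hit."""
    keywords = {
        "environmental control": ["light", "heat", "warm"],
        "media": ["music", "play"],
        "lighting": ["light", "bright", "dim"],
        "heating": ["heat", "warm"],
        "music": ["music", "play"],
    }
    text = utterance.lower()
    for option in options:
        if any(word in text for word in keywords.get(option, [])):
            return option
    return options[0]  # Arbitrary fallback for this sketch.


if __name__ == "__main__":
    print(route("dim the lights in the living room", keyword_classify))
```

Adding a new capability here means adding an entry to the tree, which is the "describing rather than programming" part.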
> If something like ChatGPT requires the same thing then I don't see the advantage.
So LLMs today can do this a few ways. One, they can write and execute code. You can ask for some complex math (e.g. calculate the tip for this bill), and the LLM can respond with a Python program that does the math; the wrapping program then executes it and returns the result. You can scale this up a bit and get creative with the possibilities (e.g. SQL queries, one-off UIs, etc.).
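As a minimal sketch of the "wrapping program" part, assuming the model's reply is already available as a string of Python source: the `GENERATED` snippet below is a hypothetical reply, and `run_generated_code` is a name invented for this example. Note that a subprocess with a timeout is not a sandbox; anything real would need proper isolation.

```python
import subprocess
import sys


def run_generated_code(code, timeout_s=5.0):
    """Run model-written Python in a separate interpreter and return its stdout.

    This only gives process separation and a timeout. It is NOT a security
    sandbox, so untrusted code would need real isolation.
    """
    result = subprocess.run(
        [sys.executable, "-c", code],
        capture_output=True,
        text=True,
        timeout=timeout_s,
    )
    if result.returncode != 0:
        raise RuntimeError(f"generated code failed: {result.stderr.strip()}")
    return result.stdout.strip()


# Hypothetical model reply to "calculate a 20% tip on a $50 bill".
GENERATED = "bill = 50.0\nprint(round(bill * 0.20, 2))"

if __name__ == "__main__":
    print(run_generated_code(GENERATED))
```

The wrapper would then feed the captured output back into the conversation so the model can phrase the final answer.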
You can also use an LLM to “craft a call to an API from <api library>”. Today, Alexa basically works by calling an API. You get a weather API, a timer API, etc. and make them all conform to the Alexa standard. An LLM can one-up it by using any existing API unchanged, as long as there’s adequate documentation somewhere for the LLM.
An LLM won’t revolutionize Alexa type use cases, but it will give it a way to reach the “long tail” of APIs and data retrieval. LLMs are pretty novel for the “write custom code to solve this unique problem” use case.
Yup, from where I see it, the only things holding LLMs back from generating API calls on the fly in a voice chat scenario are probably latency (and to a lesser degree malformed output)
Yeah, the latency is absolutely killing a lot of this. Alexa's first-party APIs are of course tuned and reside in the same datacenter, so they're fast, but a west-coast US LLM trying to control a Philips Hue will discover its calls are crossing the Atlantic, which would probably compete with the LLM itself for how slow it can be.
> and to a lesser degree malformed output
What's cool is that this isn't a huge issue. Most LLMs now have "grammar" controls, where the model doesn't pick just any token as the next one; it picks the highest-probability token that conforms to the grammar. This dramatically helps with well-formed JSON (or XML or...) output.
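A toy illustration of the idea, with everything here made up for the example: the "grammar" is just a list of allowed output strings, the vocabulary has seven tokens, and `chatty_scores` stands in for a model that would rather chat than emit JSON. Real implementations work on the model's actual token scores and a proper grammar, but the filtering step is the same in spirit.

```python
import json

# Toy "grammar": the finished output must be exactly one of these strings.
ALLOWED = ['{"light":"on"}', '{"light":"off"}']

# Toy vocabulary; a real model has tens of thousands of tokens.
VOCAB = ['{"light":', '"on"', '"off"', "}", "Sure", "!", " "]


def is_valid_prefix(text):
    """True if the text can still be extended into an allowed output."""
    return any(target.startswith(text) for target in ALLOWED)


def constrained_decode(score_fn, max_steps=10):
    """Greedy decoding that only considers tokens the grammar permits."""
    out = ""
    for _ in range(max_steps):
        if out in ALLOWED:
            break
        scores = score_fn(out)
        candidates = [tok for tok in VOCAB if is_valid_prefix(out + tok)]
        if not candidates:
            raise ValueError("no token satisfies the grammar")
        # Highest-scoring token among the grammar-conforming ones.
        out += max(candidates, key=lambda tok: scores.get(tok, float("-inf")))
    return out


def chatty_scores(prefix):
    """Stand-in for a model that prefers chatter over JSON."""
    return {
        "Sure": 0.9, "!": 0.8, '{"light":': 0.5,
        '"off"': 0.4, '"on"': 0.3, "}": 0.2, " ": 0.1,
    }


if __name__ == "__main__":
    print(json.loads(constrained_decode(chatty_scores)))
```

Even though the stand-in model scores "Sure" highest at every step, the output still parses as JSON, because non-conforming tokens are never eligible.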
Disagree. Extra latency of adding LLMs to a voice pipeline is not that much compared to doing voice via cloud in the first place. Improved accuracy and handling of natural language queries would be worth it relative to the barely-working "assistants" that people only ever use to set timers, and they can't even handle that correctly half the time.
The basic idea is to instruct the LLM to output some kind of signal in text (often a JSON blob) that describes what it should do, then have a normal program use that JSON to execute some function.
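Roughly, the "normal program" side could look like the sketch below. The JSON shape (`function` plus `arguments`), the handler names, and the error strings are all assumptions for illustration; the handlers just return text where a real one would call the smart-home API.

```python
import json


def set_lights(room, brightness_percent):
    # A real handler would call the lighting API here.
    return f"lights in {room} set to {brightness_percent}%"


def play_music(artist):
    return f"playing {artist}"


# Only functions listed here can ever be triggered by model output.
HANDLERS = {"set_lights": set_lights, "play_music": play_music}


def execute(llm_output):
    """Parse the model's JSON blob and dispatch to the matching function."""
    try:
        call = json.loads(llm_output)
    except json.JSONDecodeError:
        return "error: output was not valid JSON"
    if not isinstance(call, dict):
        return "error: expected a JSON object"
    handler = HANDLERS.get(call.get("function"))
    if handler is None:
        return "error: unknown function"
    try:
        return handler(**call.get("arguments", {}))
    except TypeError as exc:
        return f"error: bad arguments ({exc})"


if __name__ == "__main__":
    reply = '{"function": "set_lights", "arguments": {"room": "living room", "brightness_percent": 50}}'
    print(execute(reply))
```

So no second ML model is needed to interpret the reply: the LLM is asked to produce the structured call directly, and ordinary code validates and runs it.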
Compared with Pi's voice bot, ChatGPT's lacks Pi's personality and zing. Pi is completely free and I've been using it since the beginning of October. For ChatGPT's I have to pay $20, and for a lesser voice bot (its personality and tone of voice are more monotone).
It's staggering to me that Apple has not improved on the UI for "try again" or "keep trying", whether the fault is with Siri itself, or just network conditions. It seems like (relatively) low-hanging fruit, compared to the challenges of improving the engine. (I don't use any other voice assistants, no idea how well they do here.)
"Siri, lights to HALF."
"Siri, lights to HAAAAALF."
"Siri, LIGHTS TO FIFTY PERCENT!"