The robot companies are all getting closer, but I don't think a humanoid robot will do anything at a cheaper cost than an actual human in 2026. It definitely looks like there is a path though.
There are a lot of bad health outcomes built into our society, yes, but by the time people are confronted with the health impacts of cars, agriculture subsidies, for-profit healthcare, etc. it is likely that drugs will be necessary to treat the very real, immediate problems which any given patient has. Reversing the subsidies for things like car-dependency would positively benefit millions of people but it’s a generational change, not something most individuals can do.
I agree about the significance of those large-scale changes; still ...
> Reversing the subsidies for things like car-dependency would positively benefit millions of people but it’s a generational change, not something most individuals can do.
Individuals frequently can choose to not use a car, of course. Still, it's not realistic for everyone or all the time, especially in a society built for automobile use.
> by the time people are confronted with the health impacts of cars, agriculture subsidies, for-profit healthcare, etc. it is likely that drugs will be necessary
My point is that there are other treatments for illness. I doubt it's a coincidence that this patentable technology is so relied on in a hyper-capitalist society; other countries with better health outcomes use far fewer pills, iirc. Who will fund the large-scale study that says a valuable pill is unnecessary?
> Individuals frequently can choose to not use a car, of course
To some extent, yes, but my point was that it’s not realistic for many people because we treat walkable neighborhoods like luxuries. If you wake up in your 40s with a bad back and cardio problems because you live in a suburb and drive everywhere, you can’t roll back the clock and build sidewalks, legalize density, or run decent transit and on average don’t have the money to move somewhere dramatically better.
I think a growing number of people, especially younger ones, realize this is unsustainable but it took generations to get here and it’ll take a while to change trajectories, too. If gas prices had stayed high in the seventies that might have gone differently but a huge percentage of American neighborhoods are designed to minimize physical activity and that’s often enforced by law.
That's what I meant by, it's not realistic for everyone and everywhere.
> I think a growing number of people, especially younger ones, realize this is unsustainable but it took generations to get here and it’ll take a while to change trajectories, too.
Urbanist movements, including walkable communities, are much older than this younger generation. I think within a certain segment - well-educated upper middle class, maybe - it's long had influence.
I think they need to bring those ideas to other segments of society, which they have a hard time doing.
I definitely don’t think that it’s new to the current young generation but I am optimistic that they might have enough political clout to actually make progress. My neighborhood narrowly avoided becoming a highway in the 60s so we have some older folks who have been fighting car culture since before I was born. A lot of people didn’t really care because it was more affordable in the past, but their kids are a lot more motivated because it’s so financially non-viable now.
In the United States, the other big factor was recognizing how much it wasn’t just car culture but racism driving things. Despite the current moment, I get the impression that a lot of people are more aware of how much avoiding sidewalks and transit was driven by racism and just hurt everyone.
>Maybe drugs, or these drugs, aren't the most efficient solutions. Shouldn't we direct resources toward more efficient ones?
Turns out all the low hanging fruit have already been picked, so the only "more efficient ones" left are stuff like gene therapy, which are absurdly expensive, but still theoretically cheaper than a lifetime of care. Unsurprisingly the high sticker price draws much backlash from the public and politicians.
> all the low hanging fruit have already been picked
What is that based on?
Also, I'm not talking about 'low hanging fruit' necessarily; only solutions that become cost effective for vendors if drug prices aren't so extreme.
There's reason to think there is low-hanging fruit: Research is incentivized for the most profitable solutions for the vendors, not the most cost-effective solutions for patients.
>Also, I'm not talking about 'low hanging fruit' necessarily; only solutions that become cost effective for vendors if drug prices aren't so extreme.
>There's reason to think there is low-hanging fruit: Research is incentivized for the most profitable solutions for the vendors, not the most cost-effective solutions for patients.
High drug prices also mean you can charge more for one-off cures. See the gene therapy example above.
Are you trying to suggest that this is an example of a planned economy? Maybe you should look at definitions of planned vs market economies. You still have design and regulation in a market economy.
Why would a China or India care if it were a viable treatment? Unless a country wants to use their population as lab rats, it takes money and scientists to actually confirm a treatment is safe and effective.
Right, this result seems meaningless without a human clinician control.
I'd very much like to see clinicians randomly selected from BetterHelp and paid to interact the same way with the LLM patient and judged by the LLM, as the current methodology uses. And see what score they get.
Ideally this should be done blind (I don't know if BetterHelp allows for therapy through a text chat interface), so that the therapist has no idea it's for a study and isn't trying to "do better" than they would for any average client.
Because while I know a lot of people for whom therapy has been life-changing, I also know of a lot of terrible and even unprofessional therapy experiences.
The results are not meaningless, but they are not comparing humans against LLMs. The goal is to have something that can be used to test LLMs on realistic mental health support.
The main points of our methodology are:
1) prove that it is possible to simulate patients with an LLM. Which we did.
2) prove that an LLM as a Judge can effectively score conversations according to several dimensions that are similar to how clinicians are also evaluated. Which we also did and we show that the average correlation with human evaluators is medium-high.
Given 1) and 2) we can then benchmark LLMs and, as you see, there is plenty of room for improvement. We did not claim anything regarding human performance... it's likely that human performance also needs to improve :) that's another study
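For anyone wondering what the agreement check in 2) looks like in practice, here is a minimal sketch. It assumes Pearson correlation, which is a guess since the comment does not say which measure was used, and the scores are invented for illustration rather than taken from the benchmark.

```python
from math import sqrt

def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length score lists."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two equal-length lists with at least 2 items")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant scores")
    return cov / sqrt(var_x * var_y)

# Invented scores on the 1-6 scale, one entry per conversation (not real data).
judge_scores = [3, 4, 2, 5, 3, 4]
human_scores = [3, 5, 2, 4, 4, 4]

r = pearson(judge_scores, human_scores)
print(f"judge vs. human correlation: {r:.2f}")
```

A value near 1 would mean the LLM judge ranks conversations much like the human evaluators do; it says nothing about whether either set of scores is "good" in absolute terms.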
So the results are meaningful in terms of establishing that LLM therapeutic performance can be evaluated.
But not meaningful in terms of comparing LLMs with human clinicians.
So in that case, how can you justify the title you used for submission, "New benchmark shows top LLMs struggle in real mental health care"?
How are they struggling? Struggling relative to what? For all your work shows, couldn't they be outperforming the average human? Or even if they're below that, couldn't they still have a large net positive effect with few negative outcomes?
I don't understand where the negative framing of your title is coming from.
LLMs have room for improvement (we show that their scores are medium-low on several dimensions).
Maybe the average human also has lots of room for improvement. One thing does not necessarily depend on the other.
The same way we can say that LLMs still have room for improvement on a specific task (let's say mathematics) but the average human is also bad at mathematics...
We don't make any claims about human therapists. Just that LLMs have room for improvement on several dimensions if we want them to be good at therapy. Showing this is the first step to improve them.
But you chose the word "struggle". And now you say:
> Just that LLMs have room for improvement on several dimensions if we want them to be good at therapy.
That implies they're not currently good at therapy. But you haven't shown that, have you? How are you defining that a score of 4 isn't already "good"? How do you know that isn't already correlated with meaningfully improved outcomes, and therefore already "good"?
Everybody has room for improvement if you say 6 is perfection and something isn't reaching 6 on average. But that doesn't mean everybody's struggling.
I take no issue with your methodology. But your broader framing, and title, don't seem justified or objective.
> Right, this result seems meaningless without a human clinician control.
> I'd very much like to see clinicians randomly selected from BetterHelp and paid to interact the same way with the LLM patient and judged by the LLM, as the current methodology uses. And see what score they get.
Does it really matter? Per the OP:
>>> Across all models, average clinical performance stayed below 4 on a 1–6 scale. Performance degraded further in severe symptom scenarios and in longer conversations (40 turns vs 20).
I'd assume a real therapy session has far more "turns" than 20-40, and if model performance starts low and gets lower with longer length, it's reasonable to expect it would be worse than a human (who typically doesn't have the characteristic of becoming increasingly unhinged the longer you talk to them).
> Betterhelp is a nightmare for clients and therapists alike. Their only mission seems to be in making as much money as possible for their shareholders. Otherwise they don't seem at all interested in actually helping anyone. Stay away from Betterhelp.
So taking it as a baseline would bias any experiment against human therapists.
Yes, it absolutely does matter. Look at what you write:
> I'd assume
> it's reasonable to expect
The whole reason to do a study is to actually study as opposed to assume and expect.
And for many of the kinds of people engaging in therapy with an LLM, BetterHelp is precisely where they are most likely to go due to its marketing, convenience, and price. It's where a ton of real therapy is happening today. Most people do not have a $300/hr. high-quality therapist nearby that is available and that they can afford. LLMs need to be compared, first, to the alternatives that are readily available.
And remember that all therapists on BetterHelp are licensed, with a master's or doctorate, and meet state board requirements. So I don't understand why that wouldn't be a perfectly reasonable baseline.
> I love how the top comment on that Reddit post is an affiliate link to an online therapy provider.
Posted 6 months after the post and all the rest of the comments. It's some kind of SEO manipulation. That reddit thread ranked highly in my Google search about Betterhelp being bad, so they're probably trying to piggyback on it.
I’m not against affiliate links. I’m just pro-disclosure especially for something as important as therapy and it seems like maybe you should mention you make $150 for each person that signs up.
This is a good point. We have not tested the clinicians, but I believe they would not score each other perfectly: we also observed some disagreement between their scores, which reflects differing opinions between clinicians.
It is nice to have an accurate measure of things and a human baseline would be additionally helpful too.
Many things can be useful before they reach the level of world's best. Although with AI, non-intuitive failure modes must be taken into consideration too.
Rails tries to integrate more tightly with the front-end, which has caused a lot of churn over the years. Django projects from 10 years ago are still upgradable in a day or two. Rails does include some nice stuff though, but I much prefer Django's code-first database models to Rails' ActiveRecord.
Those Django models are a pain to work with if you have to access the database with any other tool that is not the original Django app. The only sane way to design a database managed by Django models and migrations is not using any inheritance between models or you'll end up with a number of tables, each one adding a few fields. Django ORM will join them for you but you are on your own if you ever have to write queries with some other tool.
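To make the inheritance complaint concrete, here is a rough sketch using only `sqlite3`. The table and column names are hypothetical; they mimic the layout Django's multi-table inheritance produces for a `Restaurant` model that subclasses `Place` in an app assumed to be called `shop` (one table per model, with the child holding a `place_ptr_id` pointer to the parent row).

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Parent model table (hypothetical app "shop", model Place)
    CREATE TABLE shop_place (
        id INTEGER PRIMARY KEY,
        name TEXT NOT NULL
    );
    -- Child model table: only the extra fields, plus a pointer to the parent row
    CREATE TABLE shop_restaurant (
        place_ptr_id INTEGER PRIMARY KEY REFERENCES shop_place(id),
        serves_pizza INTEGER NOT NULL
    );
    INSERT INTO shop_place (id, name) VALUES (1, 'Luigi''s');
    INSERT INTO shop_restaurant (place_ptr_id, serves_pizza) VALUES (1, 1);
""")

# Outside the ORM, reading one "Restaurant" means writing the join yourself.
row = conn.execute("""
    SELECT p.id, p.name, r.serves_pizza
    FROM shop_restaurant AS r
    JOIN shop_place AS p ON p.id = r.place_ptr_id
""").fetchone()
print(row)
```

Every extra level of inheritance adds another table and another join to any query written by hand in a reporting tool or a different service.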
I do agree that Rails' asset stuff has been a giant pain over the years and has not kept up well. On the other hand, some apps that adopted separate Rails APIs and a separate (for example, React) frontend have been fine. You're right though that their opinion here added more headaches than necessary!
I’d agree in that I never liked things like CoffeeScript, but I think that Hotwire, Rails’ frontend solution since 7.0, has been excellent.
Being able to sprinkle just enough JavaScript on server-rendered HTML works really well. That you can now use it in iOS and Android apps too makes it a simpler alternative to React IMHO.
I still much prefer server rendering in a monolith than dealing with GraphQL, backend for frontend and the complexity of micro services and distributed transactions.
BTW Hotwire and Hotwire Native are also options for Django too.
Will there be a large-dose trial? I imagine that’s going to be more difficult, as the weight-loss effect of GLP-1 means you can’t include frail patients in your target group. And GLP-1 at high doses also has some unfavourable side-effects: nausea, vomiting, diarrhea, constipation.
It doesn't say whether the million drones are going to be purchased from a defense contractor. Hopefully the government's order for a million military-hardened drones goes to a commercial US drone company that makes drones for consumers, film, inspections, etc. There would be an expectation that they could tool up to many millions in a time of conflict.
Defense contractors already cover small batches of super-specialized drones.
As a long-time Django user, I would not use Django for this. Django async is probably never the right choice for a green-field project. I would still pick FastAPI/SQLAlchemy over Express and PostHog. There is no way 15 different Node ORMs are going to survive in the long run, plus Drizzle and Prisma seem to be the leaders for now.
FastAPI/SQLAlchemy won’t be more scalable than a typical Django setup. The real bottleneck is the threading model, not the few microseconds the framework spends before handing off to user code. Django running under uWSGI with green threads can outperform Go-based services in some scenarios, largely thanks to how efficient Python’s C ABI is compared to Go.
Any sufficiently complicated C or Fortran program contains an ad hoc, informally-specified, bug-ridden, slow implementation of half of Common Lisp.
If you say, "we don't use an ORM", you will incrementally create helper functions for pulling the data into your language to tweak the data or to build optional filters, and thus will have an ad hoc, informally-specified, bug-ridden, slow implementation of half of an ORM.
There is a time and place for direct SQL code and there is a time and place for an ORM. Normally I use an ORM that has a great escape hatch for raw SQL as needed.
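As an illustration of the kind of helper that tends to accumulate, here is a minimal sketch of an optional-filter function over raw SQL. The `users` table and the `find_users` helper are made up for the example; the point is only that each new optional filter grows this function a little closer to a query builder.

```python
import sqlite3

def find_users(conn, name=None, min_age=None):
    """Build a WHERE clause from whichever optional filters were given."""
    clauses, params = [], []
    if name is not None:
        clauses.append("name = ?")
        params.append(name)
    if min_age is not None:
        clauses.append("age >= ?")
        params.append(min_age)
    sql = "SELECT id, name, age FROM users"
    if clauses:
        sql += " WHERE " + " AND ".join(clauses)
    sql += " ORDER BY id"
    # Values are always passed as parameters, never interpolated into the SQL.
    return conn.execute(sql, params).fetchall()

# Hypothetical table and rows, for illustration only.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT, age INTEGER)")
conn.executemany(
    "INSERT INTO users (name, age) VALUES (?, ?)",
    [("ann", 31), ("bob", 19), ("cy", 45)],
)

print(find_users(conn, min_age=30))
```

Whether this stays a ten-line helper or turns into a home-grown ORM depends on how many such functions the project ends up needing.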
I've always used SQL directly since I stopped using ORMs, and it didn't result in a halfway implemented ORM. Maybe back when there was no jsonb for your blob o' fields cases, it was different.
But yeah don't do a high level lang's job in C or C++
The main advantage of an ORM isn’t query building but its deep integration with the rest of the ecosystem.
In Django, you can change a single field in a model, and that update automatically cascades through to database migrations, validations, admin panels, and even user-facing forms in the HTML.
I'd have to try this for myself before judging it. Apple's CoreData tried and miserably failed to do this, and I wasn't fond of the Laravel ORM either, but Django is probably a better example than those.
The steam generator that the fusion generator connects to might be more expensive than solar at this point. That would mean that even if fusion cost nothing and had infinite amounts of fuel, there would be no customers for its energy on a sunny afternoon.