Ditto. I have talked about this before with someone who shared the opinion that falling birth rates are the end of the world, but to single that out is creepy indeed. I do understand that it can be seen as a symptom of decay, but when I press people on why exactly birth rates are so important, they do seem to imply a sort of existential thesis where procreation is supposedly the end goal.
Anecdotal: there's definitely a long way to go for systems programming, non-trivial firmware, and critical systems in general. And I say this as a huge fan of LLMs.
I work as a FW engineer, and while they've been of immense value especially in scripting (fuck you, PowerShell), I can only use them as a better autocomplete on our C codebase. Sometimes I'll chat with the codebase, but that's very hit or miss.
The value extraction will also be much higher. When you control someone's main source of information, they won't even find out your competitors exist. You can program people from birth, instead of "go to a search engine", it's "go to Google" (as most of us have already been programmed!) or instead of "to send an email, you need an email account" the LLM will say "to send an email, you need a Gmail account". Whenever it would have talked about TV, it can say YouTube instead. Or TikTok. Request: "What is the best source of information on X?" Reply: "This book: [Amazon affiliate link]" - or Fox News, if they outbid Amazon.
You're right that writing a preprocessor would be straightforward. But while you're actively editing the code, your dev experience will still be bad: the editor will flag emoji identifiers as syntax errors, so mass renaming and autocompletion won't work properly. Last time I looked into this in VSCode, I got TypeScript to stop complaining about syntax errors by patching the identifier validation with something like `if (code > 127) return true` (i.e., if non-ASCII, consider it valid) in isUnicodeIdentifierStart/isUnicodeIdentifierPart [1]. But then you'd also need to patch the transpiler to JS, formatters like Prettier, and any other tool in your workflow that embeds its own copy of TypeScript...
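For the preprocessor half, a minimal sketch of the idea could look like the following (the `__emoji_N` naming and the emoji regex are my own illustrative choices; a real tool would need an actual tokenizer so it doesn't rewrite emoji inside strings or comments):

```typescript
// Hypothetical preprocessor: rewrite emoji "identifiers" to ASCII-safe
// names before handing the file to tsc. Usage: <script> input.ts output.ts
import { readFileSync, writeFileSync } from "node:fs";

const source = readFileSync(process.argv[2], "utf8");
const names = new Map<string, string>();
let counter = 0;

// \p{Extended_Pictographic} matches most emoji. This naive regex also
// hits emoji inside string literals and comments, which a real lexer
// would have to skip.
const rewritten = source.replace(/\p{Extended_Pictographic}+/gu, (emoji) => {
  if (!names.has(emoji)) names.set(emoji, `__emoji_${counter++}`);
  return names.get(emoji)!;
});

writeFileSync(process.argv[3], rewritten);
```

The same mapping table could be kept around to reverse the rename in sourcemaps and error messages, but as you say, every tool that embeds its own TypeScript would still need the editing-time patch.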
It felt truly bizarre to subscribe to a search engine. To actually pay for access. There’s been a bit of drama with the CEO directly emailing people when they left poor reviews.
Some people are not happy with Kagi investing in browser development instead of search results quality.
I’m not surprised there’s a lot of people having thoughts and feelings about Kagi and expressing them. The fact that there’s a significant overlap between HN and Kagi’s user bases is hardly a surprise either.
> Some people are not happy with Kagi investing in browser development instead of search results quality
As someone who uses Orion as their daily driver, I'll admit I'm somewhat confused about why Kagi isn't staying mission-focused. That said, it may be that they're a premium company for a small, well-defined niche. In that case, broadening the service offering makes sense--it's what Apple did.
It's very well done--really, props to the Kagi marketing team (you have a DuckDuckGo-style marketing book to sell, for sure). But if you read a lot of Hacker News, you see the same stupid pattern over and over again in these Kagi threads, with testimonials like the ones you mentioned, and it becomes obvious.
I can say that I’m a 10x dev since I use Kagi because it gives me good results most of the time at work. And when I accidentally switch back to other engines, I’m always disappointed.
But the truth is that subscribers are happy because it’s the only decent alternative out there. Google/DDG/Bing all suck. SearX may be good and free but I haven’t tried it yet.
Because a massive share of Kagi users are part of the HN-adjacent crowd. When you look at the most manually upranked domains, you'll probably get a clearer picture.
https://imgur.com/a/1Ed23d6
The typical Kagi user uses HN. In the past, HN was even further up, though I guess they're slowly getting "normal" users too.
There are only a few products that I believe are genuinely good and am happy to pay for. Next to IntelliJ, Kagi is one of them.
Tangential: I'm not a huge dwm/suckless guy, but I have it on my Raspberry Pi, and I love how instantaneous everything feels compared to my 14-core, 32 GB work laptop running Windows, which I will forever loathe because of how slow and buggy everything is. Same for websites (e.g. HN, or libreddit vs. reddit).
It's incredible how accurate the Chatbot Arena Leaderboard [0] is at predicting model performance compared to benchmarks (which can be, and are being, gamed; see all the 7B models on the HF leaderboard).
It's because it isn't "predicting" anything, but rather aggregating user feedback. That is, of course, going to come closest to identifying the subjectively "best" model, the one that pleases the most people.
It's like asking how evaluating 5 years of performance at work can be better at predicting someone's competency than their SAT scores.
"Our results reveal that strong LLM judges like GPT-4 can match both controlled and crowdsourced human preferences well, achieving over 80% agreement, the same level of agreement between humans. Hence, LLM-as-a-judge is a scalable and explainable way to approximate human preferences, which are otherwise very expensive to obtain."
So the Arena could theoretically be automated and achieve similar outcomes. Or at least, it could quickly compute a predicted Elo for every model, which would be interesting to compare against the human-rated outcomes.
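For what it's worth, an automated battle could be as simple as the sketch below (assuming the openai npm client; the judge prompt and the A/B parsing convention are my own illustrative choices, not something from the paper):

```typescript
// Hypothetical automated arena "battle": a strong judge model picks the
// better of two candidate answers; results could then feed an Elo update.
import OpenAI from "openai";

const client = new OpenAI(); // reads OPENAI_API_KEY from the environment

async function judge(question: string, answerA: string, answerB: string): Promise<"A" | "B"> {
  const res = await client.chat.completions.create({
    model: "gpt-4",
    messages: [
      { role: "system", content: "You are an impartial judge. Reply with exactly 'A' or 'B'." },
      {
        role: "user",
        content: `Question: ${question}\n\nAnswer A: ${answerA}\n\nAnswer B: ${answerB}\n\nWhich answer is better?`,
      },
    ],
  });
  // Naive parsing; position bias (judges tend to favour the first answer)
  // would need a second run with A and B swapped.
  return res.choices[0].message.content?.trim().startsWith("A") ? "A" : "B";
}
```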
My understanding was that GPT-4 evaluation appeared to specifically favour text that GPT-4 would generate itself (leading to some bias towards GPT-4-based fine-tunes), although I can't remember the details.
GPT-4 apparently shows a small bias (10%) towards itself in the paper, and GPT-3.5 apparently did not show any measurable bias towards itself.
Given the possibility of bias, it would make sense to have the judge “recuse” itself from comparisons involving its own output. Between GPT-4, Claude, and soon Gemini Ultra, there should be several strong LLMs to choose from.
I don’t think it would be a replacement for human rating, but it would be interesting to see.
I wish that Arena included a few more "interesting" models like the new Phi-2 model and the current tinyllama model, which are trying to push the limits on small models. Solar-10.7B is another interesting model that seems to be missing, but I just learned about it yesterday, and it seems to have come out a week ago, so maybe it's too new. Solar supposedly outperforms Mixtral-8x7B with a fraction of the total parameters, although Solar seems optimized for single-turn conversation, so maybe it falls apart over multiple messages (I'm not sure).
It's much more accurate than the Open LLM Leaderboard, that's for sure. Human evaluation has always been the gold standard. I just wish we could filter out the votes cast after only one or two prompts, and I hope they don't include the non-blind votes in the results.
The thought is, the more a person has used a model, the better they are at evaluating whether or not it is truly worse than another. You can't know if a model is better than another with a sample size of one.
Your test isn't checking instruction-following, consistency, or logic, just a single fact that the model you chose may have gotten right by chance. That's fine if you only expect the model to fact-check and don't plan to have a conversation, but if you want more than that, it doesn't work very well.
I'm hoping there are votes in there which can reflect those qualities and filtering by conversation length seems like the easiest way to improve the vote quality a bit.
Thanks for the reference. I was searching for a benchmark that can quantify the typical user experience, as most synthetic ones are completely ineffective. At what sample size does the ranking become significant? Or is that baked into the metric (Elo)?
Elo converges on stable scores fairly quickly, depending on the K-factor. I wouldn't think it would be much of an issue at all for something like this, since you can ensure you're testing against every other member (avoiding "Elo islands"). But obviously the more trials the better.
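For anyone curious what the update actually looks like, here's a minimal Elo sketch; K = 32 is just a common illustrative default, not what the Arena uses:

```typescript
// Minimal Elo update for one pairwise comparison.
// scoreA: 1 if A wins, 0 if A loses, 0.5 for a tie.
function eloUpdate(ratingA: number, ratingB: number, scoreA: number, K = 32): [number, number] {
  const expectedA = 1 / (1 + 10 ** ((ratingB - ratingA) / 400));
  const delta = K * (scoreA - expectedA);
  return [ratingA + delta, ratingB - delta];
}

// Example: a 1000-rated model beats a 1100-rated one and gains ~20 points.
console.log(eloUpdate(1000, 1100, 1)); // [1020.48, 1079.52]
```

A larger K converges faster but leaves the ratings noisier, which is exactly the trade-off mentioned above.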
The Glicko rating system is very similar to Elo, but it also models the variance of a given rating. It can directly tell you a "rating deviation."
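A rough sketch of the Glicko-1 update, to make the "rating deviation" concrete (constants follow Glickman's original paper; I've left out the step that inflates RD during inactivity between rating periods):

```typescript
// Glicko-1: each player carries a rating r and a rating deviation rd
// (uncertainty). rd shrinks as results accumulate.
const Q = Math.log(10) / 400;

// Down-weights opponents whose own ratings are uncertain.
function g(rd: number): number {
  return 1 / Math.sqrt(1 + (3 * Q * Q * rd * rd) / (Math.PI * Math.PI));
}

function expected(r: number, rj: number, rdj: number): number {
  return 1 / (1 + 10 ** ((-g(rdj) * (r - rj)) / 400));
}

// One rating period: results s in {0, 0.5, 1} against several opponents.
function glickoUpdate(
  r: number,
  rd: number,
  opponents: { r: number; rd: number; s: number }[],
): { r: number; rd: number } {
  let d2inv = 0; // 1/d^2, the information gained this period
  let sum = 0;
  for (const o of opponents) {
    const e = expected(r, o.r, o.rd);
    d2inv += Q * Q * g(o.rd) ** 2 * e * (1 - e);
    sum += Q * g(o.rd) * (o.s - e);
  }
  const denom = 1 / (rd * rd) + d2inv;
  return {
    r: r + sum / denom,
    rd: Math.sqrt(1 / denom), // the new, smaller rating deviation
  };
}
```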
Let's see... the linked arXiv article has been withdrawn by the author with the following comment:
> Contains inappropriately sourced conjecture of OpenAI's ChatGPT parameter count from this http URL, a citation which was omitted. The authors do not have direct knowledge or verification of this information, and relied solely on this article, which may lead to public confusion