simon (@simon@simonwillison.net)

There's a mysterious new, undocumented model in the https://chat.lmsys.org/ arena chat tool called "gpt2-chatbot" - you can access it by selecting "Direct Chat" and then picking it from the big select menu there

It's providing responses that feel significantly more impressive than GPT-4, for both factual-knowledge lookup and logic puzzles. It's possible this is a stealth preview launch of something like GPT-4.5

simon (@simon@simonwillison.net)

It gave me the best ego-prompt response I've seen from any model so far - most models hallucinate details, like claiming I was the CTO of GitHub, but for this one every detail it provided was arguably correct (if a little sycophantic towards the end)

andypiper (@andypiper@macaw.social)

@simon considering the number of “me”s there are to cross-pollinate results around, it was spot-on here too! Interesting.

djh (@djh@chaos.social)

@simon Interested in how we can make sure we're comparing apples to apples here.

For example, this one could potentially use tools like a Wikipedia lookup or similar and return a refined (RAG-ish) result, no?

I believe they're getting really complicated to compare unless we run them locally.
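To make that concern concrete, here is a minimal, purely hypothetical sketch of the kind of RAG-ish Wikipedia lookup described above: retrieve a page summary, prepend it to the prompt, then hand the augmented prompt to whatever model is being tested. Nothing here reflects what lmsys actually runs; the helper names and flow are illustrative only.

```python
# Hypothetical RAG-ish pipeline: fetch a Wikipedia summary, then stuff it
# into the prompt before generation. Not anything lmsys has confirmed.
import requests


def wikipedia_summary(title: str) -> str:
    """Fetch the plain-text summary for a page via the Wikipedia REST API."""
    url = f"https://en.wikipedia.org/api/rest_v1/page/summary/{title}"
    response = requests.get(url, headers={"User-Agent": "rag-sketch/0.1"})
    response.raise_for_status()
    return response.json().get("extract", "")


def build_augmented_prompt(question: str, lookup_title: str) -> str:
    """Prepend retrieved context so the model answers from it rather than from memory."""
    context = wikipedia_summary(lookup_title)
    return f"Context:\n{context}\n\nUsing the context above, answer: {question}"


# A model answering from this prompt will look better-informed than one
# answering from its weights alone - which is exactly why it muddies comparisons.
print(build_augmented_prompt("What is GPT-2?", "GPT-2"))
```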

simon (@simon@simonwillison.net)

@djh this is one reason I'm annoyed at lmsys for not being transparent about the model they are hosting

I don't think it's doing RAG against anything but I'd like to be sure about that!

simon (@simon@simonwillison.net)

Blogged a few more notes here, including the system prompt https://simonwillison.net/2024/Apr/29/notes-on-gpt2-chatbot/

dogzilla (@dogzilla@masto.deluma.biz)

@simon @mpesce How do you evaluate these LLMs? I’m not seeing a huge difference, except between small and large models, and even then it’s not as big as I expected. I assume I’m not using good prompts for this.

simon (@simon@simonwillison.net)

@dogzilla @mpesce it's ferociously difficult!

I have a few prompts I use (starting with the ego-prompt "Who is a Simon Willison?") to get a feel for the size of its knowledge and how likely it is to hallucinate, but for a proper evaluation you really have to spend days using it as a regular tool

Or rely on the various benchmarks, but they don't really tell you much about how useful it will be for your own use-cases
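For what it's worth, a minimal sketch of that probe-prompt workflow might look like the snippet below: hold a small prompt set fixed, run it against a model, and eyeball the answers. The endpoint, API key and model name are placeholders (gpt2-chatbot is only reachable through the arena web UI, not an API).

```python
# Rough sketch of the "few probe prompts" approach: send the same fixed
# prompts to any OpenAI-compatible endpoint and compare the answers by hand.
# base_url, api_key and model are placeholders, not real values.
from openai import OpenAI

PROBE_PROMPTS = [
    "Who is a Simon Willison?",  # ego-prompt: knowledge depth vs. hallucination
    "A farmer needs to cross a river with a wolf, a goat and a cabbage. How?",  # logic puzzle
    "Summarise the plot of Hamlet in three sentences.",  # compression / factual recall
]

client = OpenAI(base_url="https://example.com/v1", api_key="placeholder")  # hypothetical endpoint

for prompt in PROBE_PROMPTS:
    response = client.chat.completions.create(
        model="mystery-model",  # placeholder model name
        messages=[{"role": "user", "content": prompt}],
    )
    print(f"--- {prompt}\n{response.choices[0].message.content}\n")
```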

arnicas (@arnicas@mstdn.social)

@simon how are we sure it’s gpt2?

simon (@simon@simonwillison.net)

@arnicas we know for certain it's not GPT-2, because that came out in 2019 and had a fraction of the capability of modern models. I think the name is deliberately a joke. https://en.m.wikipedia.org/wiki/GPT-2
