simon (@simon@simonwillison.net)

There's a mysterious new, undocumented model in the https://chat.lmsys.org/ arena chat tool called "gpt2-chatbot" - you can access it by selecting "Direct Chat" and then picking it from the big select menu there

It's providing responses that feel significantly more impressive than GPT-4, for both factual-knowledge lookup and logic puzzles. It's possible this is a stealth preview launch of something like GPT-4.5

simon (@simon@simonwillison.net)

It gave me the best ego-prompt response I've seen from any model so far - most models hallucinate details, like claiming I was the CTO of GitHub, but for this one every detail it provided was arguably correct (if a little sycophantic towards the end)

andypiper (@andypiper@macaw.social)

@simon considering the number of “me”s there are to cross-pollinate results around, it was spot-on here too! Interesting.

djh (@djh@chaos.social)

@simon Interested in how we can make sure we're comparing apples to apples here.

For example, this one could potentially use tools like a Wikipedia lookup or similar and return a refined (RAG-ish) result, no?

I believe they're getting really complicated to compare unless we run them locally.
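To make that concern concrete, here is a minimal, purely hypothetical sketch of the kind of RAG-ish Wikipedia lookup described above: retrieve a page summary, prepend it to the prompt, then hand the augmented prompt to whatever model is being tested. Nothing here reflects what lmsys actually runs; the helper names and flow are illustrative only.

```python
# Hypothetical RAG-ish pipeline: fetch a Wikipedia summary, then stuff it
# into the prompt before generation. Not anything lmsys has confirmed.
import requests


def wikipedia_summary(title: str) -> str:
    """Fetch the plain-text summary for a page via the Wikipedia REST API."""
    url = f"https://en.wikipedia.org/api/rest_v1/page/summary/{title}"
    response = requests.get(url, headers={"User-Agent": "rag-sketch/0.1"})
    response.raise_for_status()
    return response.json().get("extract", "")


def build_augmented_prompt(question: str, lookup_title: str) -> str:
    """Prepend retrieved context so the model answers from it rather than from memory."""
    context = wikipedia_summary(lookup_title)
    return f"Context:\n{context}\n\nUsing the context above, answer: {question}"


# A model answering from this prompt will look better-informed than one
# answering from its weights alone - which is exactly why it muddies comparisons.
print(build_augmented_prompt("What is GPT-2?", "GPT-2"))
```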

simon (@simon@simonwillison.net)

@djh this is one reason I'm annoyed at lmsys for not being transparent about the model they are hosting

I don't think it's doing RAG against anything but I'd like to be sure about that!

simon (@simon@simonwillison.net)

Blogged a few more notes here, including the system prompt https://simonwillison.net/2024/Apr/29/notes-on-gpt2-chatbot/

dogzilla (@dogzilla@masto.deluma.biz)

@simon @mpesce How do you evaluate these LLMs? I’m not seeing a huge difference, except between small and large models, and even then it’s not as big as I expected. I assume I’m not using good prompts for this.

simon (@simon@simonwillison.net)

@dogzilla @mpesce it's ferociously difficult!

I have a few prompts I use (starting with the ego-prompt "Who is a Simon Willison?") to get a feel for the size of its knowledge and how likely it is to hallucinate, but for a proper evaluation you really have to spend days using it as a regular tool

Or rely on the various benchmarks, but they don't really tell you much about how useful it will be for your own use-cases
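For what it's worth, a minimal sketch of that probe-prompt workflow might look like the snippet below: hold a small prompt set fixed, run it against a model, and eyeball the answers. The endpoint, API key and model name are placeholders (gpt2-chatbot is only reachable through the arena web UI, not an API).

```python
# Rough sketch of the "few probe prompts" approach: send the same fixed
# prompts to any OpenAI-compatible endpoint and compare the answers by hand.
# base_url, api_key and model are placeholders, not real values.
from openai import OpenAI

PROBE_PROMPTS = [
    "Who is a Simon Willison?",  # ego-prompt: knowledge depth vs. hallucination
    "A farmer needs to cross a river with a wolf, a goat and a cabbage. How?",  # logic puzzle
    "Summarise the plot of Hamlet in three sentences.",  # compression / factual recall
]

client = OpenAI(base_url="https://example.com/v1", api_key="placeholder")  # hypothetical endpoint

for prompt in PROBE_PROMPTS:
    response = client.chat.completions.create(
        model="mystery-model",  # placeholder model name
        messages=[{"role": "user", "content": prompt}],
    )
    print(f"--- {prompt}\n{response.choices[0].message.content}\n")
```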

arnicas (@arnicas@mstdn.social)

@simon how are we sure it’s gpt2?

simon (@simon@simonwillison.net)

@arnicas we know for certain it's not GPT-2, because that came out in 2019 and had a fraction of the capability of modern models. I think the name is deliberately a joke. https://en.m.wikipedia.org/wiki/GPT-2
