@nixCraft My boss and juniors are all gushing over it. They say I MUST use it.
The first problem I gave it, the answer was dead wrong. After I told it why it was wrong, it gave ANOTHER wrong answer. Lather, rinse, repeat. It never came up with a working solution.
Plus, it’s stealing mediocre code. Don’t expect anything decent from it.
@sesamecreek @nixCraft My experience exactly. One half of the students will not use it and nail the exam, and the other half will be the exact opposite. Sadly, I may add.
@nixCraft I'm not allowed to use ChatGPT at $WORK for... reasons, but I did participate in Google's Search Generative Experience, and it was so good initially that it blew me out of the water - it reminded me of Google in the early-to-mid 2000s, when you could find what you needed on the first page, before SEO and advertisements ruined it.
@nixCraft #ChatGPT and #Copilot are this decade’s #StackOverflow for junior programmers, which is ironic considering the latter site fed those services and they’re just remixing it.
@mjgardner @nixCraft it’s gonna be interesting to see how sites like StackOverflow adapt. ChatGPT needs the content, but if everybody uses ChatGPT then no one’s there to feed StackOverflow’s revenue model.
@jeromechoo @mjgardner @nixCraft I feel like this whole "adaptation conundrum" is overblown. Sure, ChatGPT steals traffic from Stack Overflow, just like that co-worker dev who knows everything steals it.
If nobody has asked/solved that particular question you have, ChatGPT won't "know" it either, and hence you need to look/post on Stack Overflow.
Put up a robots.txt equivalent for ChatGPT, sue them when it’s violated, case closed.
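[For reference: OpenAI has since documented a crawler user agent, GPTBot, which it says honors robots.txt. Assuming the crawler behaves as documented, a site wanting to opt out could serve something like this:]

```
# robots.txt — opt this site out of OpenAI's GPTBot crawler
User-agent: GPTBot
Disallow: /
```

[Whether violating such a directive is actionable in court is exactly the open question in this thread.]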
@ftranschel @mjgardner @nixCraft Without any provenance behind the foundational training of ChatGPT and LLMs, there's no real way to prove any content was stolen.
@jeromechoo @mjgardner @nixCraft I disagree. If you look at things like arXiv:2012.07805v2 or arXiv:2302.04460v1, it's somewhat obvious how to build a case that would enable courts to rule in favour of the plaintiff and have the training data checked behind closed doors.
The contextual extraction method employed in arXiv:2012.07805v2 suggests that at least 33 repetitions of a highly distinctive string are required for a 1.5B-parameter model to recall it completely within 10,000 generations.
While promising for PII extraction attacks, I’m not sure we can count on this method to make an obvious case yet for the less differentiated Q&A content on StackOverflow.
@jeromechoo @mjgardner @nixCraft Yes, that is right, but it’s not really a problem. You can make the case with anything you find in order to get the court to check for something that might actually be problematic. The process looks like this:
Find a unique set of tokens (1) that can be extracted with high confidence, and find an example of your data (2) that was purportedly pirated.
In your statement of claim you use (1) to show that (2) must be verified by looking at the training data.
Computer security people also use the word “honeytoken,” while cartographers say “trap street,” “paper town,” or “phantom settlement.”
More colorful expressions include “Nihilartikel” (Latin + German for “nothing article”) and “Mountweazel” (after a fake entry in the 1975 edition of the New Columbia Encyclopedia).
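[The two-step process above can be sketched in code. This is an illustrative toy, not StackOverflow’s actual mechanism: the helper names and the `zq-canary-` prefix are made up, and real extraction attacks are statistical sampling over many generations, not a simple substring check. HMAC ties the token to a secret key so only the owner can regenerate it later to prove authorship:]

```python
import hashlib
import hmac
import secrets


def make_canary(owner_key: bytes, doc_id: str) -> str:
    """Derive a unique, reproducible honeytoken for one document.

    Keyed hashing means the token looks random to everyone else,
    but the owner can recompute it from (key, doc_id) on demand.
    """
    digest = hmac.new(owner_key, doc_id.encode(), hashlib.sha256).hexdigest()[:16]
    return f"zq-canary-{digest}"


def embed(text: str, canary: str) -> str:
    # Plant the token in an inert code comment inside the post.
    return f"{text}\n# {canary}"


def leaked(model_output: str, canary: str) -> bool:
    # Step (1): verbatim recall of the canary is the red flag
    # that justifies step (2), inspecting the training data.
    return canary in model_output


key = secrets.token_bytes(32)
canary = make_canary(key, "stackoverflow/q/12345")
post = embed("def add(a, b):\n    return a + b", canary)
print(leaked(post, canary))                    # planted copy contains it
print(leaked("def add(a, b): ...", canary))    # clean text does not
```

[The 33-repetitions figure cited above suggests why a single planted token may not suffice in practice: the canary would likely need to recur across many scraped copies before a model memorizes it.]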
@ftranschel @mjgardner @nixCraft I think my issue with StackOverflow specifically is with (1). It might be difficult to even find a unique set of tokens in the content.
That’s a purely anecdotal impression, though. I use it a lot but never felt like any of the code I’ve “borrowed” was uniquely identifying or original to StackOverflow.