KathyReid,
@KathyReid@aus.social avatar

I just issued a data deletion request to Stack Overflow to erase all of the associations between my name and the questions, answers and comments I have on the platform.

One of the key ways in which retrieval-augmented generation (RAG) works to supplement LLMs is based on proven associations. Higher-ranked Stack Overflow members' answers will carry more weight in any output that is produced.

By asking for my name to be disassociated from the textual data, I remove a semantic relationship that is helpful for determining which tokens of text to use in an LLM.

If you sell out your user base without consultation, expect a backlash.

Hyperlynx,
@Hyperlynx@aus.social avatar

Time to switch to Codidact.

https://software.codidact.com/

It's the same thing as StackOverflow, only run by a not-for-profit.

sean,
@sean@idf.social avatar

@KathyReid Good stuff! Out of curiosity… when you mention that higher ranked users' posts carry more weight… is there anywhere I can read more about this feature engineering? Are we talking about RAG/search-operators manually annotating CSS selectors to pull user-ranking info per-site? Related: after crawling user rank info, would a RAG/search-provider not keep the info in-cache, i.e. do account deletions actually trickle down to search engines' collection of valuable features?

KathyReid,
@KathyReid@aus.social avatar

@sean To clarify, I'm not saying they do carry more weight, but I am predicting that when they tokenise the text in SO to train LLMs on, they will give more weight to text created by high-ranking users.

Also, highly-ranked users are likely to have more text in the SO corpus, because they are highly-ranked (and have therefore answered lots of questions, or a small number of questions well).

That is, when the text is tokenised, high-ranked user-generated text will make up more of the tokens. If I can break that association, then it makes OpenAI's job harder.

sean,
@sean@idf.social avatar

@KathyReid The high volume of text for high rank users makes total sense from a training bias perspective, though I think anonymizing authors might not change this.

Unless the RAG provider uses specially designed extractors for user rank info in their corpus, I'm doubtful ML could pick up on a numerical rank like SO karma and figure out how to weight by this number. That's too much System 2 thinking for ML, IMO!

Still good to give big firms as little free data as possible, of course! ☺

KathyReid,
@KathyReid@aus.social avatar

@sean right, but I'm guessing that OpenAI will write custom tokenisers for SO content, which would probably take user rank info into account ... So it's not the ML, it's the data preparation.
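(A minimal sketch of what that kind of data-preparation step could look like, assuming the corpus carries each author's reputation alongside their posts. The field names, the log scaling and the reputation values are all illustrative; nothing here is something OpenAI or Stack Overflow have described.)

```python
from dataclasses import dataclass
import math


@dataclass
class Post:
    author: str        # username, or a pseudonymised ID
    reputation: int    # the author's Stack Overflow reputation
    text: str


def sample_weight(post: Post) -> float:
    """Hypothetical per-example weight: higher-reputation authors count more.

    Log-scaling stops a 500k-rep user from drowning out everyone else.
    """
    return 1.0 + math.log10(max(post.reputation, 1))


corpus = [
    Post("high_rep_user", 250_000, "Use a context manager so the file is always closed."),
    Post("anon_7f3a", 12, "just don't close it, it's fine"),
]

# A training pipeline could pass this weight through as a per-example
# loss multiplier (or a sampling probability) during fine-tuning.
for post in corpus:
    print(f"{sample_weight(post):.2f}  {post.text}")
```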

KathyReid,
@KathyReid@aus.social avatar

@sean Good questions. The way I see a RAG (or other knowledge graph) being constructed would be to associate Contributors with Questions and Answers - you need the Question-Answer relationship to generate plausible answers, but the Contributor-Answer relationship lets you rank Answers from higher-rated Contributors more highly.

See something like this:
He, Xiaoxin, Yijun Tian, Yifei Sun, Nitesh V. Chawla, Thomas Laurent, Yann LeCun, Xavier Bresson, and Bryan Hooi. "G-Retriever: Retrieval-Augmented Generation for Textual Graph Understanding and Question Answering." arXiv preprint arXiv:2402.07630 (2024).
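(As a rough illustration of that Contributor-Answer idea, not taken from the paper above and with made-up names and scores: a retriever could blend its text-relevance score with the contributor's reputation when ranking candidate answers.)

```python
from dataclasses import dataclass


@dataclass
class Answer:
    question_id: int
    contributor: str
    reputation: int    # carried on the Contributor-Answer edge
    text: str
    relevance: float   # similarity score from the retriever, 0..1


def rerank(candidates: list[Answer], rep_weight: float = 0.3) -> list[Answer]:
    """Blend retrieval relevance with contributor reputation."""
    max_rep = max(a.reputation for a in candidates) or 1
    return sorted(
        candidates,
        key=lambda a: (1 - rep_weight) * a.relevance
        + rep_weight * (a.reputation / max_rep),
        reverse=True,
    )


candidates = [
    Answer(42, "high_rep_user", 180_000, "Prefer pathlib over os.path here.", 0.71),
    Answer(42, "anon_9c2e", 35, "os.path is fine, whatever.", 0.74),
]

# The slightly less similar answer from the high-reputation contributor
# ends up first once reputation is factored in.
for a in rerank(candidates):
    print(a.contributor, "->", a.text)
```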

kellogh,
@kellogh@hachyderm.io avatar

@KathyReid is that really true? did they say that?

KathyReid,
@KathyReid@aus.social avatar

@kellogh which part, scraping for OpenAI? Absolutely

kellogh,
@kellogh@hachyderm.io avatar

@KathyReid no, about it being weighted by user prevalence. i would’ve thought PII is removed from training data…

kellogh,
@kellogh@hachyderm.io avatar

@KathyReid idk, if you really wanted to fuck with stack overflow, find the worst answers and upvote them as much as possible, have a whole upvote chain that completely erases the value of the point system. that would do a little to harm the model, and a lot to punish the company

KathyReid,
@KathyReid@aus.social avatar

@kellogh ah no, they have not confirmed that the tokens are weighted by the number of points held by the user who authored the post - but if I were building an LLM from SO, that's how I would approach it, because higher-points users are likely to have more reliable answers and better-phrased questions.

arestelle,
@arestelle@dice.camp avatar

@kellogh @KathyReid I wouldn't count on that, but they could also remove PII (anonymize the username etc.) while keeping info about the user's ranking, and likely even the fact that the same user made all your posts - they'd just anonymize/remove the fact that that user was you, e.g. by replacing your name with anonymous_whatever

(Based on my knowledge - I did work on a data platform team at a tech company working toward GDPR compliance with European teammates, but they're the more knowledgeable ones.)

KathyReid,
@KathyReid@aus.social avatar

@arestelle @kellogh that's an excellent point - they could just replace my username with a random string, but the points of that username would still be associated with the random string.
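(A quick sketch of the pseudonymisation being described, purely illustrative - the salt handling and field names are made up: the username becomes an opaque string, but the reputation stays attached to it.)

```python
import hashlib

SALT = "some-secret-value"  # hypothetical; a real pipeline would manage this carefully


def pseudonymise(username: str) -> str:
    """Deterministically map a username to an opaque 'anonymous_<hash>' ID."""
    digest = hashlib.sha256((SALT + username).encode()).hexdigest()
    return f"anonymous_{digest[:8]}"


record = {"author": "some_username", "reputation": 12_345, "text": "..."}
record["author"] = pseudonymise(record["author"])

# The name is gone, but the reputation signal (and the grouping of all
# posts under the same opaque ID) survives.
print(record)
```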

KathyReid,
@KathyReid@aus.social avatar

@arestelle @kellogh excellent points

dtbell91,
@dtbell91@aus.social avatar

@KathyReid wait, so the only request that can be made is to anonymise your content but not to delete it?

KathyReid,
@KathyReid@aus.social avatar

@dtbell91 yep
