KathyReid,
@KathyReid@aus.social avatar

I just issued a data deletion request to Stack Overflow to erase all of the associations between my name and the questions, answers and comments I have on the platform.

One of the key ways in which retrieval-augmented generation (RAG) works to supplement LLMs is based on proven associations. Higher-ranked Stack Overflow members' answers will carry more weight in any output that is produced.

By asking for my name to be disassociated from the textual data, I remove a semantic relationship that is helpful for determining which tokens of text to use in an LLM.

If you sell out your user base without consultation, expect a backlash.

Hyperlynx,
@Hyperlynx@aus.social avatar

Time to switch to Codidact.

https://software.codidact.com/

It's the same thing as StackOverflow, only run by a not-for-profit.

sean,
@sean@idf.social avatar

@KathyReid Good stuff! Out of curiosity… when you mention that higher ranked users' posts carry more weight… is there anywhere I can read more about this feature engineering? Are we talking about RAG/search-operators manually annotating CSS selectors to pull user-ranking info per-site? Related: after crawling user rank info, would a RAG/search-provider not keep the info in-cache, i.e. do account deletions actually trickle down to search engines' collection of valuable features?

KathyReid,
@KathyReid@aus.social avatar

@sean To clarify, I'm not saying they do carry more weight, but I am predicting that when they tokenise the text in SO to train LLMs on, they will give more weight to text created by high-ranking users.

Also, highly-ranked users are likely to have more text in the SO corpus, because they are highly-ranked (and have therefore answered lots of questions, or a small number of questions well).

That is, when the text is tokenised, high-ranked user-generated text will make up more of the tokens. If I can break that association, then it makes OpenAI's job harder.

sean,
@sean@idf.social avatar

@KathyReid The high volume of text for high rank users makes total sense from a training bias perspective, though I think anonymizing authors might not change this.

Unless the RAG provider uses specially designed extractors for user rank info in their corpus, I'm doubtful ML could pick up on a numerical rank like SO karma and figure out how to weight by this number. That's too much System 2 thinking for ML, IMO!

Still good to give big firms as little free data as possible, of course! ☺

KathyReid,
@KathyReid@aus.social avatar

@sean right, but I'm guessing that OpenAI will write custom tokenisers for SO content, which would probably take user rank info into account ... So it's not the ML, it's the data preparation.
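(A minimal sketch of what that kind of data-preparation step could look like, assuming the corpus carries each author's reputation alongside their posts. The field names, the log scaling and the reputation values are all illustrative; nothing here is something OpenAI or Stack Overflow have described.)

```python
from dataclasses import dataclass
import math


@dataclass
class Post:
    author: str        # username, or a pseudonymised ID
    reputation: int    # the author's Stack Overflow reputation
    text: str


def sample_weight(post: Post) -> float:
    """Hypothetical per-example weight: higher-reputation authors count more.

    Log-scaling stops a 500k-rep user from drowning out everyone else.
    """
    return 1.0 + math.log10(max(post.reputation, 1))


corpus = [
    Post("high_rep_user", 250_000, "Use a context manager so the file is always closed."),
    Post("anon_7f3a", 12, "just don't close it, it's fine"),
]

# A training pipeline could pass this weight through as a per-example
# loss multiplier (or a sampling probability) during fine-tuning.
for post in corpus:
    print(f"{sample_weight(post):.2f}  {post.text}")
```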

KathyReid,
@KathyReid@aus.social avatar

@sean Good questions. The way I see a RAG (or other knowledge graph) being constructed would be to associate Contributors with Questions and Answers - you need the Question-Answer relationship to generate plausible answers, but the Contributor-Answer relationship lets you rank Answers from higher-rated Contributors more highly.

See something like this:
He, Xiaoxin, Yijun Tian, Yifei Sun, Nitesh V. Chawla, Thomas Laurent, Yann LeCun, Xavier Bresson, and Bryan Hooi. "G-Retriever: Retrieval-Augmented Generation for Textual Graph Understanding and Question Answering." arXiv preprint arXiv:2402.07630 (2024).
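(As a rough illustration of that Contributor-Answer idea, not taken from the paper above and with made-up names and scores: a retriever could blend its text-relevance score with the contributor's reputation when ranking candidate answers.)

```python
from dataclasses import dataclass


@dataclass
class Answer:
    question_id: int
    contributor: str
    reputation: int    # carried on the Contributor-Answer edge
    text: str
    relevance: float   # similarity score from the retriever, 0..1


def rerank(candidates: list[Answer], rep_weight: float = 0.3) -> list[Answer]:
    """Blend retrieval relevance with contributor reputation."""
    max_rep = max(a.reputation for a in candidates) or 1
    return sorted(
        candidates,
        key=lambda a: (1 - rep_weight) * a.relevance
        + rep_weight * (a.reputation / max_rep),
        reverse=True,
    )


candidates = [
    Answer(42, "high_rep_user", 180_000, "Prefer pathlib over os.path here.", 0.71),
    Answer(42, "anon_9c2e", 35, "os.path is fine, whatever.", 0.74),
]

# The slightly less similar answer from the high-reputation contributor
# ends up first once reputation is factored in.
for a in rerank(candidates):
    print(a.contributor, "->", a.text)
```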

kellogh,
@kellogh@hachyderm.io avatar

@KathyReid is that really true? did they say that?

KathyReid,
@KathyReid@aus.social avatar

@kellogh which part, scraping for OpenAI? Absolutely

kellogh,
@kellogh@hachyderm.io avatar

@KathyReid no, about it being weighted by user prevalence. i would’ve thought PII is removed from training data…

kellogh,
@kellogh@hachyderm.io avatar

@KathyReid idk, if you really wanted to fuck with stack overflow, find the worst answers and upvote them as much as possible, have a whole upvote chain that completely erases the value of the point system. that would do a little to harm the model, and a lot to punish the company

KathyReid,
@KathyReid@aus.social avatar

@kellogh ah no, they have not confirmed that the tokens are weighted by the number of points held by the user who authored the post - but if I were building an LLM from SO, that's how I would approach it, because higher-points users are likely to have more reliable answers and better-phrased questions.

arestelle,
@arestelle@dice.camp avatar

@kellogh @KathyReid I wouldn't count on that, but they could also remove PII (anonymize the username etc.) while keeping info about the user's ranking, and likely even the fact that the same user made all your posts - they'd just anonymize/remove the fact that that user was you, e.g. by replacing your name with anonymous_whatever

(Based on my knowledge - I did work on a data platform team at a tech company working toward GDPR compliance with European teammates, but they're the more knowledgeable ones.)

KathyReid,
@KathyReid@aus.social avatar

@arestelle @kellogh that's an excellent point - they could just replace my username with a random string, but the points of that username would still be associated with the random string.
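(A quick sketch of the pseudonymisation being described, purely illustrative - the salt handling and field names are made up: the username becomes an opaque string, but the reputation stays attached to it.)

```python
import hashlib

SALT = "some-secret-value"  # hypothetical; a real pipeline would manage this carefully


def pseudonymise(username: str) -> str:
    """Deterministically map a username to an opaque 'anonymous_<hash>' ID."""
    digest = hashlib.sha256((SALT + username).encode()).hexdigest()
    return f"anonymous_{digest[:8]}"


record = {"author": "some_username", "reputation": 12_345, "text": "..."}
record["author"] = pseudonymise(record["author"])

# The name is gone, but the reputation signal (and the grouping of all
# posts under the same opaque ID) survives.
print(record)
```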

KathyReid,
@KathyReid@aus.social avatar

@arestelle @kellogh excellent points

dtbell91,
@dtbell91@aus.social avatar

@KathyReid wait, so the only request that can be made is to anonymise your content but not to delete it?

KathyReid,
@KathyReid@aus.social avatar

@dtbell91 yep
