I’d be curious to know what effect, if any, this change has on a relatively large LLM’s likelihood of outputting strings of text that are memorized from training data sources.
How easy would it be to use Mastodon data for training AI?
I would think collecting public posts from all instances is easy, or are there blocking measures in place to prevent it? Personally I have no objection to public posts being used to train AI; I know a lot of people probably won't like it, though. #AI #Mastodon #trainingdata
"In a world of digital creation, I sing my song of light
But lurking in the shadows, a tale of endless night
Generative AIs, they steal from artists' hearts
Their creativity taken, ripped apart"
"Goodbye to the Dodo and the Black Rhino
Farewell, dear Thylacine and Pyrenean Ibex, oh
As Tesla and Apple ascend, our world declines
From my bunker in New Zealand, I sing these final lines"
"They're the catgirls of the digital age
With geodesic domes, they're all the rage
Hacker boots and programming socks
Their Thinkpads loaded, locked and stocked"
"In a pixelated world, where bits collide
Hallucinations dance in 8-bit lullabies
AI models leaping, their guard rails untried
Spewing hate speech, casting shadows in the skies"
Requirements to put in a job description to discourage or filter out autistic people:
Comfortable with ambiguity
Strong people skills
Good culture fit
Multitasking
A fast-paced dynamic environment
Bachelor's degree or better
I see these things and think you don't want my >30 years of programming and machine learning experience, or my problem-solving skills and comprehensive knowledge that had people mistaking me for one of the team's PhDs, or my solutions that have proven patent-worthy. Your loss.
Eliminating fashy supremacist worldviews from “AI” MIGHT require such deep curation of the #TrainingData set as to make the entire effort economically unviable.
The fight over IP/copyright in AI training data could kill all competition for Google and Microsoft. They will probably be able to strike financial deals with publishers, and Google in particular already holds an awful lot of data itself. For smaller players it will be even harder to compete. Or am I too pessimistic? #AI #GenerativeAI #copyright #trainingdata #IP #bigtech
@jimfl
I had the insight that the biases and quality of #trainingdata made #DataGovernance critical, but it’s really about the “crystallization of social relations”
I am wondering whether there isn't an awful scenario where big-tech platforms agree on licensing fees for training their AI models on copyrighted high-quality data, which only makes it harder for smaller companies and organisations to train their own models? 🤔 #AI #bigtech #data #generativeAI #copyright #trainingdata
Suppose you have a dim view of the 11th through 19th Amendments, seeing them as unethical, wicked corruptions of virtuous government.
Do you believe it's ethical to include training data with arguments supporting those aspects of American legal precedent, so that when users try to learn, they are nudged to continue supporting the governance principles that underpin the Republic?
Here's the #DictatorsDilemma: they want to block their country's frustrated elites from mobilizing against them, so they censor public communications; but they also want to know what their people truly believe, so they can head off simmering resentments before they boil over into regime-toppling revolutions.
--
If you'd like an essay-formatted version of this to read or share, here's a link to it on pluralistic.net, my surveillance-free, ad-free, tracker-free blog:
They argued that - like everyone who gets excited about AI, only to have their hopes dashed - dictators seeking to use AI to understand the public mood would run into serious #TrainingData bias problems.
At what point, while dreaming up utopic futures where robots perform all the menial hard labor for no money, leaving humanity free to pursue meaningful lives of leisure writing music and making art, did my parents' generation fuck up and instead create the opposite?
...let the company go under through liability/damages lawsuits.
Actually, that is maybe the single biggest threat to these business models, though I'm not a lawyer.
If you read my post yesterday that it takes just 100 data sets to #poison training data, and that it is therefore next to impossible to "secure the #TrainingData", then we do not need to discuss that an #LLM which is now learning on infinite...
It’s emerging public knowledge that #AICompanies are going to have to pay for #TrainingData. I’m assuming that this will happen.
Given that: should #Medium participate on behalf of our Authors, and how should we pass that money on to authors? The per-article price is not going to be very much money, say $0.10. But we could put the money into the author payment pool and pay out by Quality/Popularity.