ppatel,
@ppatel@mstdn.social

Note that the training data heavily relies on the Bible and its translations. Lots of bias there.

Meta unveils open-source models it says can identify 4,000+ spoken languages and produce speech for 1,000+ languages, an increase of 40x and 10x respectively.

https://www.technologyreview.com/2023/05/22/1073471/metas-new-ai-models-can-recognize-and-produce-speech-for-more-than-1000-languages/

objectinspace,

@ppatel Where is the bias, exactly? The goal (as I understand from reading the piece) is to recognize and translate speech to text from different languages. The Bible has been translated into thousands of different languages, so it makes sense to me why they would use it as a source. Plus, it's all on GitHub anyway, so anyone can make whatever improvements to the model they want. Not seeing the issue here.

ppatel,

@objectinspace "However, the team warns the model is still at risk of mistranscribing certain words or phrases, which could result in inaccurate or potentially offensive labels. They also acknowledge that their speech recognition models yielded more biased words than other models, albeit only 0.7% more."

objectinspace,

@ppatel Yes, I did read the article; my question was how? He just states that it is biased, the end. I get how it is biased historically, but it's not trying to learn about the world, it's just trying to learn languages. Are there other sources that would have been more appropriate to start from?

ppatel,

@objectinspace Consider the source and what the system is doing. You're looking at probabilistic prediction. The Bible favors a set of language constructs that prioritize a certain level of patriarchal language. This then gets translated into other languages. Language is a social construct. As such, what is used as input is going to determine how the end result turns out.
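
Rough sketch of what I mean, with a made-up two-line corpus (none of this is the MMS data, it's just to show the mechanism):

```python
# Toy next-word predictor: predictions are nothing more than corpus frequencies.
from collections import Counter, defaultdict

# Invented mini-corpus for illustration only.
corpus = ("he spoke to his servants . he said unto his sons . "
          "she listened . he commanded his people .").split()

# Count how often each word follows another (bigram counts).
following = defaultdict(Counter)
for prev, nxt in zip(corpus, corpus[1:]):
    following[prev][nxt] += 1

# P(next word | "he") is just relative frequency in the training text,
# so whatever the source text favors, the model favors.
counts = following["he"]
total = sum(counts.values())
for word, n in counts.most_common():
    print(f"P({word!r} | 'he') = {n / total:.2f}")

# The pronoun imbalance of the source carries straight through:
print("count('he') =", corpus.count("he"), "| count('she') =", corpus.count("she"))
```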

objectinspace,

@ppatel Gotcha. Haven't there been a number of efforts to try and... open up the Bible by translating it in ways that convey the meaning with more inclusive language? I suppose the sheer number of translations would sort of flatten that out, though. I wonder if they used multiple translations from the same language? Anyway, it seems like a problem that would befall pretty much anything from that time period. Can you get away with using more modern texts, if the goal is to get at the building blocks of language?

ppatel,

@objectinspace The training techniques they're using are pretty amazing. But in order to speed things up, they kinda got lazy. At least, that's my opinion. They could have commissioned multiple works. It would have cost a lot more. They could have also spent a little time with this data and massaged it more. The one advantage is that they're open-sourcing this model. If they're also open-sourcing the training data, even better.

objectinspace,

@ppatel Can you include multiple sources in the same model? I assume so, but I don't really know how this stuff works. You're probably right though, it would be better if they had done that first. But they kind of have to rush it, I mean... everyone else is releasing something... gotta move fast! (TM)

ppatel,

@objectinspace Yes. They can have multiple sources. In fact, the training would be more effective with multiple sources. I imagine the training data is pretty large as it is, given the Bible's size and a thousand translations. And then add audio to that.
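
Rough back-of-envelope on the size, where every number is my guess rather than anything from their paper:

```python
# All figures below are assumptions for a rough estimate, not Meta's numbers.
WORDS_PER_BIBLE = 800_000   # approximate word count of a full Bible
NUM_LANGUAGES = 1_100       # roughly one translation per supported language
BYTES_PER_WORD = 6          # average UTF-8 word plus a space, very rough

text_bytes = WORDS_PER_BIBLE * NUM_LANGUAGES * BYTES_PER_WORD
print(f"text: ~{text_bytes / 1e9:.1f} GB")    # ~5 GB: small by modern standards

# The audio side dwarfs the text: assume ~30 hours of read speech per
# language at 16 kHz, mono, 16-bit samples.
HOURS_PER_LANGUAGE = 30
audio_bytes = NUM_LANGUAGES * HOURS_PER_LANGUAGE * 3600 * 16_000 * 2
print(f"audio: ~{audio_bytes / 1e12:.1f} TB")  # ~4 TB uncompressed
```

So the text alone is modest; it's the paired audio that makes the dataset big.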

objectinspace,

@ppatel Like... they could use, say, the Quran, or Homer's Iliad if they wanted to do straight fiction, but I feel like those would run into the same problems.

arush,

@ppatel @objectinspace Add to this that it depends on which translations they're using. Assuming it's a public domain one, like the King James or any in that translation family except the New King James, yeah there are going to be some definite problems with regard to all kinds of bias.

objectinspace,

@arush @ppatel That was my thought too: there are a ton of different translations, and then translations of translations. It would be interesting to dig a little deeper into that.

objectinspace,

@arush @ppatel Like if you were to give this model the original source text of the Bible and ask it to translate it back into English, whose version would it most resemble? Would it be a new translation?

ppatel,

@objectinspace @arush Not only that, but the translation wouldn't resemble itself if you ran it a second time or a third.
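
Toy illustration of why, with an invented probability split: generation usually samples from the model's distribution instead of always taking the top word.

```python
# Sampling-based generation: repeated runs can pick different words.
import random

def pick_translation(dist: dict) -> str:
    """Sample one rendering according to the model's (made-up) probabilities."""
    words = list(dist)
    return random.choices(words, weights=list(dist.values()), k=1)[0]

# Suppose the model is 60/30/10 split on how to render one source word;
# "servant"/"slave"/"bondservant" is a real translation choice, but the
# numbers here are invented.
dist = {"servant": 0.6, "slave": 0.3, "bondservant": 0.1}

for run in range(1, 4):
    print(f"run {run}:", pick_translation(dist))
```

Fix the random seed and you'd get repeatability, but off-the-shelf sampling won't give it to you.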

arush,

@objectinspace @ppatel So that would be difficult. For starters, you're dealing with three source languages, two of which are very much gendered, do not include a neutral gender, and default to masculine most of the time, but have some pretty peculiar exceptions. Also, word and phrase meaning is heavily dependent on surrounding context, and it's not a one-to-one match between linguistic gender and human gender even when you're not dealing with something English+

arush,

@objectinspace @ppatel has as neutral. My usual advice is that if you want a decent translation, there are several caveats: First, you don't want a literal translation. Second, asking for a translation that conveys what the authors actually meant is impossible; you can only have one that conveys how it was understood during specific periods of time. +

arush,

@objectinspace @ppatel Third, you should have multiple translations, not just one or two. And fourth, whatever translations you have need an extensive footnote apparatus that explains linguistic concepts, grammar, and the cultural stuff that impacts language. So yeah, in short, this was actually a very bad idea.

ppatel,

@arush @objectinspace So much of the nuance of language would get lost. I suspect, however, that these models aren't supposed to produce anything beyond basic to intermediate levels of translation. If they manage to extend the use of these models to do additional training, that would be far more interesting. The speech generation component of this model interests me far more than anything else.
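
If the released checkpoints get the usual Hugging Face packaging, trying the TTS side might look something like this; the checkpoint name, model class, and 16 kHz output are my assumptions, not details from the article:

```python
# Sketch: synthesize speech with an MMS TTS checkpoint via transformers.
# "facebook/mms-tts-eng" and the VITS-based API are assumptions on my part.
import torch
import scipy.io.wavfile
from transformers import VitsModel, AutoTokenizer

model = VitsModel.from_pretrained("facebook/mms-tts-eng")
tokenizer = AutoTokenizer.from_pretrained("facebook/mms-tts-eng")

inputs = tokenizer("In the beginning was the word.", return_tensors="pt")
with torch.no_grad():
    waveform = model(**inputs).waveform  # shape: (1, num_samples)

# Write the float waveform out at the model's sampling rate.
scipy.io.wavfile.write(
    "mms_tts_sample.wav",
    rate=model.config.sampling_rate,
    data=waveform.squeeze().numpy(),
)
```

Swapping the language code in the checkpoint name would presumably be how the 1,000+ languages get exposed, if this packaging holds.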

arush,

@ppatel @objectinspace Yeah but if they're looking for basic to intermediate, they really should have picked something else IMHO. And yeah the speech stuff looks really interesting.

objectinspace,

@arush @ppatel I am curious, what would have made for a better base? I'm sure there's lots of stuff that would have worked, but I can see the logic in going with something public domain and as widely translated as the Bible.

ppatel,

@objectinspace @arush I don't disagree that they should have looked to something else. As I said in one of my earlier posts, I think they were lazy and went with the most common thing they could think of. Lowest common denominator is lowest for a good reason, as we know. Now that they know the techniques for training these models, they could have gotten other source material and had it translated by humans following guidelines.

ppatel,

@objectinspace "While the scope of the research is impressive, the use of religious texts to train AI models can be controversial, says Chris Emezue, a researcher at Masakhane, an organization working on natural-language processing for African languages, who was not involved in the project."
