ppatel,
@ppatel@mstdn.social

Note that the training data heavily relies on the Bible and its translations. Lots of bias there.

Meta unveils open-source models it says can identify 4,000+ spoken languages and produce speech for 1,000+ languages, an increase of 40x and 10x respectively.

https://www.technologyreview.com/2023/05/22/1073471/metas-new-ai-models-can-recognize-and-produce-speech-for-more-than-1000-languages/

objectinspace,

@ppatel Where is the bias, exactly? The goal (as I understand from reading the piece) is to recognize and translate speech to text from different languages. The Bible has been translated into thousands of different languages, so it makes sense to me why they would use it as a source. Plus, it's all on GitHub anyway, so anyone can make whatever improvements to the model they want. Not seeing the issue here.

ppatel,

@objectinspace "However, the team warns the model is still at risk of mistranscribing certain words or phrases, which could result in inaccurate or potentially offensive labels. They also acknowledge that their speech recognition models yielded more biased words than other models, albeit only 0.7% more."

objectinspace,

@ppatel Yes, I did read the article; my question was how? He just states that it is biased, the end. I get how it is biased historically, but it's not trying to learn about the world, it's just trying to learn languages. Are there other sources that would have been more appropriate to start from?

ppatel,

@objectinspace Consider the source and what the system is doing. You're looking at probabilistic prediction. The Bible favors a set of language constructs that prioritize a certain level of patriarchal language. This then gets translated into other languages. Language is a social construct. As such, what is used as input is going to determine how the end result turns out.
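
Rough sketch of what I mean, with a made-up two-line corpus (none of this is the MMS data, it's just to show the mechanism):

```python
# Toy next-word predictor: predictions are nothing more than corpus frequencies.
from collections import Counter, defaultdict

# Invented mini-corpus for illustration only.
corpus = ("he spoke to his servants . he said unto his sons . "
          "she listened . he commanded his people .").split()

# Count how often each word follows another (bigram counts).
following = defaultdict(Counter)
for prev, nxt in zip(corpus, corpus[1:]):
    following[prev][nxt] += 1

# P(next word | "he") is just relative frequency in the training text,
# so whatever the source text favors, the model favors.
counts = following["he"]
total = sum(counts.values())
for word, n in counts.most_common():
    print(f"P({word!r} | 'he') = {n / total:.2f}")

# The pronoun imbalance of the source carries straight through:
print("count('he') =", corpus.count("he"), "| count('she') =", corpus.count("she"))
```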

objectinspace,

@ppatel Gotcha. Haven't there been a number of efforts to try and... open up the Bible by translating it in ways that convey the meaning with more inclusive language? I suppose the sheer number of translations would sort of flatten that out, though. I wonder if they used multiple translations from the same language? Anyway, it seems like a problem that would befall pretty much anything from that time period. Can you get away with using more modern texts, if the goal is to get at the building blocks of language?

ppatel,

@objectinspace The training techniques they're using are pretty amazing. But in order to speed things up, they kinda got lazy. At least, that's my opinion. They could have commissioned multiple works. It would have cost a lot more. They could have also spent a little time with this data and massaged it more. The one advantage is that they're open-sourcing this model. If they're also open-sourcing the training data, even better.

objectinspace,

@ppatel Can you include multiple sources in the same model? I assume so, but I don't really know how this stuff works. You're probably right though, it would be better if they had done that first. But they kind of have to rush it, I mean... everyone else is releasing something... gotta move fast! (TM)

ppatel,

@objectinspace Yes. They can have multiple sources. In fact, the training would be more effective with multiple sources. I imagine the training data is pretty large as it is, given the Bible's size and a thousand translations. And then add audio to that.
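
Rough back-of-envelope on the size, where every number is my guess rather than anything from their paper:

```python
# All figures below are assumptions for a rough estimate, not Meta's numbers.
WORDS_PER_BIBLE = 800_000   # approximate word count of a full Bible
NUM_LANGUAGES = 1_100       # roughly one translation per supported language
BYTES_PER_WORD = 6          # average UTF-8 word plus a space, very rough

text_bytes = WORDS_PER_BIBLE * NUM_LANGUAGES * BYTES_PER_WORD
print(f"text: ~{text_bytes / 1e9:.1f} GB")    # ~5 GB: small by modern standards

# The audio side dwarfs the text: assume ~30 hours of read speech per
# language at 16 kHz, mono, 16-bit samples.
HOURS_PER_LANGUAGE = 30
audio_bytes = NUM_LANGUAGES * HOURS_PER_LANGUAGE * 3600 * 16_000 * 2
print(f"audio: ~{audio_bytes / 1e12:.1f} TB")  # ~4 TB uncompressed
```

So the text alone is modest; it's the paired audio that makes the dataset big.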

objectinspace,

@ppatel Like... they could use, say, the Quran, or Homer's Iliad if they wanted to do straight fiction, but I feel like those would run into the same problems.

arush,

@ppatel @objectinspace Add to this that it depends on which translations they're using. Assuming it's a public domain one, like the King James or any in that translation family except the New King James, yeah there are going to be some definite problems with regard to all kinds of bias.

objectinspace,

@arush @ppatel That was my thought too: there are a ton of different translations, and then translations of translations. It would be interesting to dig a little deeper into that.

objectinspace,

@arush @ppatel Like if you were to give this model the original source text of the Bible and ask it to translate it back into English, whose version would it most resemble? Would it be a new translation?

ppatel,

@objectinspace @arush Not only that, but the translation wouldn't resemble itself if you ran it a second time or a third.
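
Toy illustration of why, with an invented probability split: generation usually samples from the model's distribution instead of always taking the top word.

```python
# Sampling-based generation: repeated runs can pick different words.
import random

def pick_translation(dist: dict) -> str:
    """Sample one rendering according to the model's (made-up) probabilities."""
    words = list(dist)
    return random.choices(words, weights=list(dist.values()), k=1)[0]

# Suppose the model is 60/30/10 split on how to render one source word;
# "servant"/"slave"/"bondservant" is a real translation choice, but the
# numbers here are invented.
dist = {"servant": 0.6, "slave": 0.3, "bondservant": 0.1}

for run in range(1, 4):
    print(f"run {run}:", pick_translation(dist))
```

Fix the random seed and you'd get repeatability, but off-the-shelf sampling won't give it to you.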

arush,

@objectinspace @ppatel So that would be difficult. For starters, you're dealing with three source languages, two of which are very much gendered, do not include a neutral gender, and default to masculine most of the time, but have some pretty peculiar exceptions. Also, word and phrase meaning is heavily dependent on surrounding context, and it's not a one-to-one match between linguistic gender and human gender even when you're not dealing with something English+

arush,

@objectinspace @ppatel has as neutral. My usual advice is that if you want a decent translation, there are several caveats: First, you don't want a literal translation. Second, asking for a translation that conveys what the authors actually meant is impossible; you can only have one that conveys how it was understood during specific periods of time. +

arush,

@objectinspace @ppatel Third, you should have multiple translations, not just one or two. And fourth, whatever translations you have need an extensive footnote apparatus that explains linguistic concepts, grammar, and the cultural stuff that impacts language. So yeah, in short, this was actually a very bad idea.

ppatel,

@arush @objectinspace So much of the nuance of language would get lost. I suspect, however, that these models aren't supposed to produce anything beyond basic to intermediate levels of translation. If they manage to extend the use of these models to do additional training, that would be far more interesting. The speech generation component of this model interests me far more than anything else.
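
If the released checkpoints get the usual Hugging Face packaging, trying the TTS side might look something like this; the checkpoint name, model class, and 16 kHz output are my assumptions, not details from the article:

```python
# Sketch: synthesize speech with an MMS TTS checkpoint via transformers.
# "facebook/mms-tts-eng" and the VITS-based API are assumptions on my part.
import torch
import scipy.io.wavfile
from transformers import VitsModel, AutoTokenizer

model = VitsModel.from_pretrained("facebook/mms-tts-eng")
tokenizer = AutoTokenizer.from_pretrained("facebook/mms-tts-eng")

inputs = tokenizer("In the beginning was the word.", return_tensors="pt")
with torch.no_grad():
    waveform = model(**inputs).waveform  # shape: (1, num_samples)

# Write the float waveform out at the model's sampling rate.
scipy.io.wavfile.write(
    "mms_tts_sample.wav",
    rate=model.config.sampling_rate,
    data=waveform.squeeze().numpy(),
)
```

Swapping the language code in the checkpoint name would presumably be how the 1,000+ languages get exposed, if this packaging holds.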

arush,

@ppatel @objectinspace Yeah but if they're looking for basic to intermediate, they really should have picked something else IMHO. And yeah the speech stuff looks really interesting.

objectinspace,

@arush @ppatel I am curious, what would have made for a better base? I'm sure there's lots of stuff that would have worked, but I can see the logic in going with something public domain and as widely translated as the Bible.

ppatel,

@objectinspace @arush I don't disagree that they should have looked to something else. As I said in one of my earlier posts, I think they were lazy and went with the most common thing they could think of. Lowest common denominator is lowest for a good reason, as we know. Now that they know the techniques for training these models, they could have gotten other source material and had it translated by humans following guidelines.

ppatel,

@objectinspace "While the scope of the research is impressive, the use of religious texts to train AI models can be controversial, says Chris Emezue, a researcher at Masakhane, an organization working on natural-language processing for African languages, who was not involved in the project."
