Do you work with #voice or #speech #data? You might contribute data, write data specifications for collection, perform filtering or pre-processing, train #ASR or #TTS models, or design or perform evaluations on #ML speech models.
If so, I’d love your help to understand current #dataset #documentation practices, and what we can do to make them better, as part of my #PhD #research.
The #survey takes 10-20 minutes to complete, and you can opt in to win one of three gift cards valued at AUD $50 each.
Research Protocol 2021/427 approved by #ANU Human Research Ethics Committee
I connected #Raleigh's #dataset of #trees to a #Wikipedia API query for finding nearby items of interest, and then to an OpenAI API query so the trees could describe themselves and the area around them.
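A minimal sketch of how a pipeline like this might be wired together, not the poster's actual code. It assumes the tree record supplies a latitude, a longitude, and a species name (hypothetical fields), uses the MediaWiki `geosearch` list to find nearby Wikipedia articles, and builds a text prompt that could then be sent to a chat model. The network calls themselves are left out so the example runs offline.

```python
from urllib.parse import urlencode

WIKIPEDIA_API = "https://en.wikipedia.org/w/api.php"


def geosearch_url(lat, lon, radius_m=500, limit=5):
    """Build a MediaWiki geosearch query URL for articles near a point."""
    params = {
        "action": "query",
        "list": "geosearch",
        "gscoord": f"{lat}|{lon}",  # "latitude|longitude"
        "gsradius": radius_m,       # search radius in metres
        "gslimit": limit,           # maximum number of results
        "format": "json",
    }
    return f"{WIKIPEDIA_API}?{urlencode(params)}"


def nearby_titles(response):
    """Extract article titles from a decoded geosearch JSON response."""
    return [item["title"] for item in response.get("query", {}).get("geosearch", [])]


def tree_prompt(species, titles):
    """Compose a prompt asking a language model to speak as the tree.

    The wording is an illustrative assumption, not the original prompt.
    """
    places = ", ".join(titles) if titles else "nothing notable nearby"
    return (
        f"You are a {species} tree in Raleigh. Describe yourself "
        f"and the area around you, which includes: {places}."
    )


# Hypothetical tree record and a hand-written stand-in for the API response.
url = geosearch_url(35.78, -78.64)
sample_response = {"query": {"geosearch": [{"title": "Pullen Park"}]}}
prompt = tree_prompt("willow oak", nearby_titles(sample_response))
```

In a real run, the JSON fetched from `url` would replace `sample_response`, and `prompt` would be passed to whichever language model API is in use.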
[#commemorations] I'm currently finalising a #jeudedonnees (dataset) devoted to national celebrations and commemorations in #France since 1970, so over the coming days I'll be posting a short #quizz on the subject ⤵️
"The Data Liberation Project is an initiative to identify, obtain, reformat, clean, document, publish, and disseminate government datasets of public interest."
A very nice dataset from Malpedia with all the deobfuscated strings from their dataset. The repository contains the result of the FLARE FLOSS tool applied to all unpacked and dumped samples in Malpedia.
Delighted to be able to publicise a paper that was presented at the @ALTAnlp 2023 Workshop at the end of last year, co-authored with my #PhD supervisor, Associate Professor @eltwilliams, and written as part of my research at #ANU School of Cybernetics.
Titled "Right the docs: Characterising voice dataset documentation practices used in machine learning", it combines exploratory interviews and documentation analysis to characterise how large voice datasets (e.g. #LibriSpeech, @mozilla's #CommonVoice, and several others) document their #metadata.
Unsurprisingly, it finds that current #dataset #documentation practices do not meet the needs of the #ML practitioners who use these datasets.
We show, once again, in the words of Nithya Sambasivan: "everyone wants to do the model work, but nobody wants to do the data work" ...
Reid, K., Williams, E.T., 2023. Right the docs: Characterising voice dataset documentation practices used in machine learning, in: Muresan, S., Chen, V., Casey, K., David, V., Nina, D., Koji, I., Erik, E., Stefan, U. (Eds.), Proceedings of the 21st Annual Workshop of the Australasian Language Technology Association. Association for Computational Linguistics, Melbourne, Australia, pp. 51–66.
"A dataset of 77,000+ (distinct) candidates across 57,000+ US elections for mayor, city council, school board, county executive, county legislature, sheriff, and prosecutor."
Do you need data related to #covid19 for your research? 👀📚 Look no further! 👇
Our #PanDDemiC portal has been filled with new datasets, varying from citizens' attitudes on the #pandemic to its impact on migration, labor and municipalities. ✅
"The Arsenical Books Database — part of the Winterthur Museum and the University of Delaware’s Poison Book Project — has identified hundreds of examples of 19th-century books that used [green pigments containing arsenic] in their covers and other binding components."
Need help on saving reddit threads (for post-blackout reasons) to Obsidian
AI-TRIGGER WARNING: I've asked ChatGPT to revise my writing because it was ass (writing a stream of coherent looking text is not my forte). Proceed at your own discretion....