"The Arsenical Books Database — part of the Winterthur Museum and the University of Delaware’s Poison Book Project — has identified hundreds of examples of 19th-century books that used [green pigments containing arsenic] in their covers and other binding components."
Delighted to be able to publicise a paper that was presented at the @ALTAnlp 2023 Workshop at the end of last year, co-authored with my #PhD supervisor, Associate Professor @eltwilliams, and written as part of my research at #ANU School of Cybernetics.
Titled "Right the docs: Characterising voice dataset documentation practices used in machine learning", it combines exploratory interviews and documentation analysis to characterise how large voice datasets (e.g. #LibriSpeech, @mozilla's #CommonVoice, and several others) document their #metadata.
Unsurprisingly, it finds that current #dataset #documentation practices do not meet the needs of the #ML practitioners who use these datasets.
We show, once again, that in the words of Nithya Sambasivan, "everyone wants to do the model work, but nobody wants to do the data work" ...
Reid, K., Williams, E.T., 2023. Right the docs: Characterising voice dataset documentation practices used in machine learning, in: Muresan, S., Chen, V., Casey, K., David, V., Nina, D., Koji, I., Erik, E., Stefan, U. (Eds.), Proceedings of the 21st Annual Workshop of the Australasian Language Technology Association. Association for Computational Linguistics, Melbourne, Australia, pp. 51–66.
"The Data Liberation Project is an initiative to identify, obtain, reformat, clean, document, publish, and disseminate government datasets of public interest."
"A dataset of 77,000+ (distinct) candidates across 57,000+ US elections for mayor, city council, school board, county executive, county legislature, sheriff, and prosecutor."
A very nice dataset from Malpedia with all the deobfuscated strings from their dataset. The repository contains the result of the FLARE FLOSS tool applied to all unpacked and dumped samples in Malpedia.
[#commemorations] I'm currently finalising a dataset (#jeudedonnees) devoted to national celebrations and commemorations in #France since 1970, so over the coming days I'll be posting a short quiz (#quizz) on the subject ⤵️
I connected #Raleigh's #dataset of #trees to a #Wikipedia API query for finding nearby items of interest, and then to an OpenAI API query, so the trees could describe themselves and the area around them.
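A rough sketch of how such a pipeline might be wired together, not the poster's actual code. It builds a MediaWiki `geosearch` query URL for a tree's coordinates, pulls the page titles out of the JSON response, and assembles a prompt that could be sent to a language model. The tree record fields (`species`, `lat`, `lon`), the function names, and the prompt wording are all assumptions made for illustration; the HTTP request and the OpenAI call are deliberately left out.

```python
from urllib.parse import urlencode

WIKIPEDIA_API = "https://en.wikipedia.org/w/api.php"


def build_geosearch_url(lat, lon, radius_m=500, limit=5):
    """Build a MediaWiki geosearch URL for pages near a coordinate."""
    params = {
        "action": "query",
        "list": "geosearch",
        "gscoord": f"{lat}|{lon}",  # latitude|longitude
        "gsradius": radius_m,       # search radius in metres
        "gslimit": limit,           # maximum number of pages returned
        "format": "json",
    }
    return f"{WIKIPEDIA_API}?{urlencode(params)}"


def nearby_titles(response):
    """Extract page titles from a parsed geosearch JSON response."""
    return [item["title"] for item in response.get("query", {}).get("geosearch", [])]


def build_tree_prompt(tree, titles):
    """Assemble a prompt asking a model to speak as the tree (hypothetical wording)."""
    places = ", ".join(titles) if titles else "nothing notable"
    return (
        f"You are a {tree['species']} growing at ({tree['lat']}, {tree['lon']}). "
        "Describe yourself and your surroundings in the first person. "
        f"Nearby points of interest: {places}."
    )


# Hypothetical tree record; a real dataset's field names will differ.
tree = {"species": "willow oak", "lat": 35.78, "lon": -78.64}
url = build_geosearch_url(tree["lat"], tree["lon"])
```

Fetching `url`, parsing the JSON, and passing the result of `build_tree_prompt` to a chat model would complete the loop; those steps depend on the HTTP client and API credentials in use.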
Do you need data related to #covid19 for your research? 👀📚 Look no further! 👇
Our #PanDDemiC portal has been filled with new datasets, varying from citizens' attitudes on the #pandemic to its impact on migration, labor and municipalities. ✅
The MIT researchers found that #MachineLearning models trained for autocaptioning with their dataset consistently generated captions that were precise, semantically rich, and described data trends and complex patterns.
Researchers teach an #AI to write better chart captions.
A new #dataset can help scientists develop automatic systems that generate richer, more descriptive captions for online charts for #blind people.
@ErikJonker It doesn't stop at "reading": training an AI means creating derivative works. The fact that something is online does not mean it is in the public domain. People put things online within a context, for a particular purpose. The "grab all you can" approach that Big Tech is now using to fill its datasets ignores this completely. #AI #BigTech #MoveFastAndBreakThings #dataset #ethiek
Has anybody done a Subject Access Request to Nectar/Tesco Clubcard? Given that they should hold pretty much all of a customer's grocery and fuel purchases for the last 10+ years, it should make an interesting inflation data set.