#dataset - kbin.social

KathyReid, 1 year ago to random

Do you work with #voice or #speech #data? You might contribute data, write data specifications for collection, perform filtering or pre-processing, train #ASR or #TTS models, or design or perform evaluations on #ML speech models.

If so, I’d love your help to understand current #dataset #documentation practices, and what we can do to make them better as part of my #PhD #research

The #survey takes 10-20 minutes to complete, and you can opt in to win one of 3 gift cards valued at $AUD 50 each.

Research Protocol 2021/427 approved by #ANU Human Research Ethics Committee

https://anu.au1.qualtrics.com/jfe/form/SV_cSFODa5osYtm96e



researchbuzz, 6 months ago to Raleigh

I connected #Raleigh's #dataset of #trees to a #Wikipedia API query for finding nearby items of interest, and then to an OpenAI API query so the trees could describe themselves and the area around them.

As you do when you're a weirdo


+ eyesquash, anathema_device, botwiki, stefan

stefan, 7 months ago to ilaughed

A haunting collection of roughly 10,000 recordings of nuclear weapons tests from the 1940s to the 1960s.

"The films are equal parts terrifying and fascinating."

https://www.beautifulpublicdata.com/nuclear-weapon-test-films

via @Beautifulpublicdata

#data #dataset #video #AtomicBomb #NuclearPower #war #science


daieuxetdailleurs, 4 months ago to France French

[#commemorations] I am currently finalising a #jeudedonnees (dataset) devoted to national celebrations and commemorations in #France since 1970, so over the coming days I'll be offering a little #quizz on the subject ⤵️

PS: the #dataset will be released as #opendata, and of course I'm preparing a few #dataviz ...

#histodons #histoire #CetteAnnéeLà #Ephemeride @geneafr @archivistodon #memoire #memoires #histoirenationale #archives


ruthpozuelo, 7 months ago (edited 7 months ago) to datascience

What is the best way to share a dataset with non-technical users and invite them to collaborate on it?

#data #dataset #datasets #datascience #database


+ Jdreben

vyr, 1 year ago to random

anyone have a good collection of Fedi spam messages and account bios? huge bonus points for account metadata. boosts welcome.

#FediAdmin #FediMods #MastoAdmin #MastoMods #spam #dataset


+ oblomov, thegibson, michael

stefan, 5 days ago (edited 5 days ago) to journalism

Learn how to request a dataset of all the databases an agency maintains with @muckrock's latest #FOIAFriday webinar:

https://youtube.com/watch?v=9-Do81pKSmM

Next session is on June 14 and you can sign up here: https://us02web.zoom.us/webinar/register/WN_2U6FCIpWRve_Odo_VUXtmw#/registration

#foia #journalism #CitizenJournalism #data #dataset #dataviz #webinar


stefan, 11 months ago to internet

I just uploaded the final data backup from my Popular Twitter bots project: https://www.kaggle.com/datasets/fourtonfish/popular-twitter-bots

It looks like it lost access to Twitter's API on April 27. It was fun while it lasted!

https://twitterbots.glitch.me

#dataviz #dataset #kaggle #twitter #TwitterBots #TwitterPI


+ botwiki

alatitude77, 1 month ago to Discord

Billions of public #Discord messages may be sold through a #scraping service | #dataset #llms #training #machinelearning #artificialintelligence https://arstechnica.com/tech-policy/2024/04/billions-of-public-discord-messages-may-be-sold-through-a-scraping-service/


stefan, 2 months ago to random

"The Data Liberation Project is an initiative to identify, obtain, reformat, clean, document, publish, and disseminate government datasets of public interest."

https://www.data-liberation-project.org

Interesting initiative, and they're looking for volunteers: https://www.data-liberation-project.org/get-involved/

#data #dataset #government #govtech


+ objectinspace

adulau, 3 months ago to opensource

A very nice dataset from Malpedia with all the deobfuscated strings from their dataset. The repository contains the result of the FLARE FLOSS tool applied to all unpacked and dumped samples in Malpedia.

🔗 https://github.com/malpedia/malpedia-flossed

#dataset #opensource #malpedia #infosec #research #malware #opendata


KathyReid, 1 month ago (edited 1 month ago) to ML

Delighted to be able to publicise a paper that was presented at the @ALTAnlp 2023 Workshop at the end of last year, co-authored with my #PhD supervisor, Associate Professor @eltwilliams, and written as part of my research at #ANU School of Cybernetics.

Titled "Right the docs: Characterising voice dataset documentation practices used in machine learning", it combines exploratory interviews and documentation analysis to characterise how large voice datasets (e.g. #LibriSpeech, @mozilla's #CommonVoice, and several others) document their #metadata.

Unsurprisingly, it finds that the #dataset #documentation practices seen currently do not meet the needs of the #ML practitioners who use these datasets.

We show, once again, in the words of Nithya Sambasivan - "everyone wants to do the model work, but nobody wants to do the data work" ...

https://aclanthology.org/2023.alta-1.6/

#RightTheDocs #WriteTheDocs

Citation:

Reid, K., Williams, E.T., 2023. Right the docs: Characterising voice dataset documentation practices used in machine learning, in: Muresan, S., Chen, V., Casey, K., David, V., Nina, D., Koji, I., Erik, E., Stefan, U. (Eds.), Proceedings of the 21st Annual Workshop of the Australasian Language Technology Association. Association for Computational Linguistics, Melbourne, Australia, pp. 51–66.


mwfc, 3 months ago to random

Is anyone aware of a freely available #SHM (Structural Health Monitoring) database with data from Europe?

I am curious to correlate data from various sources.

I am especially interested in building damage due to ground movement.

#dataset #minig #quakes #buildings #damage


mtxvp, 1 month ago to random

Blog >> AWS Free Datasets: Part 2
https://blog.mtxvp.com/aws-free-datasets-part-2/?m
#dataset #opendata


+ Wen

daieuxetdailleurs, 1 month ago to archivistodon French

[#veille] More than 7,000 requests for assistance from unemployed artists and craftspeople of the Seine indexed (1930-1962) - La Revue française de Généalogie https://www.rfgenealogie.com/infos/plus-de-7-000-demandes-d-aide-d-artistes-et-artisans-chomeurs-de-la-seine-indexees

#dataset #opendata at: https://data.culture.gouv.fr/explore/dataset/secours-aux-artistes-et-artisans

#archivesnationales #20esiecle #archives #histoire #patrimoine #beauxarts #Seine #opencontent #GLAMarchives #Paris @geneafr @archivistodon


stefan, 3 months ago to random

"A dataset of 77,000+ (distinct) candidates across 57,000+ US elections for mayor, city council, school board, county executive, county legislature, sheriff, and prosecutor."

https://www.nature.com/articles/s41597-023-02792-x

Download: https://osf.io/mv5e6/files/osfstorage

via @dataisplural.

#data #dataviz #dataset #elections


a, 6 months ago to Futurology

The Common Crawl September/October 2023 crawl archive (CC-MAIN-2023-40) has been released.

100 TiB (compressed) of freshly crawled web data that can be used in your next data mining project.

🔗 https://data.commoncrawl.org/crawl-data/CC-MAIN-2023-40/index.html

#commoncrawl #dataset #opendata #open #research


+ jasonnab

itnewsbot, 7 months ago to machinelearning

AI-Powered Snore Detector Shakes the Pillow So You Won’t - If you snore, you’ll probably find out about it from someone. An elbow to the ribs... - https://hackaday.com/2023/10/14/ai-powered-snore-detector-shakes-the-pillow-so-you-dont-have-to/ #machinelearning #training #dataset #snoring #haptic #apnea #sleep #snore #cnn #ai


regroup_horizon, 7 months ago to politicaltheory

Do you need data related to #covid19 for your research? 👀📚 Look no further! 👇

Our #PanDDemiC portal has been filled with new datasets, varying from citizens' attitudes on the #pandemic to its impact on migration, labor and municipalities. ✅

See here: https://panddemic.regroup-horizon.eu/

#covid19research #covid19data #pandemicresearch #pandemicresponses #covidresearch #dataset #regroup #pandemicpolitics #pandemic

@politicalscience @politicaltheory @sociology


daieuxetdailleurs, 5 months ago to random French

One of my goals in (professional) life is to appear among @datagouvfr's favourite #dataset picks 😋 ❤️ (https://www.data.gouv.fr/fr/posts/suivi-des-sorties-novembre-2023)
#jeudedonnées #opendata #GLAMarchives #archivesnationales #viedarchiviste


+ wikipedia_fr

daieuxetdailleurs, 5 months ago to France French

[#opendata] New dataset: mentions of climatic and natural events since the end of the 18th century in #France and #Algérie

More than 1,000 events (and this is only the beginning), drawn from the inventories of:

cathedral works (19th century)

public calamities (1950s and 1960s)

https://data.culture.gouv.fr/explore/dataset/mentions-evenements-climatiques-dans-les-archives

#climat #meteo #dataset #archives #inondation #tempete #seisme #histodons #histoire #GLAMarchives @archivistodon @geneafr #archivesnationales


stefan, 12 days ago to history

"The Arsenical Books Database — part of the Winterthur Museum and the University of Delaware’s Poison Book Project — has identified hundreds of examples of 19th-century books that used [green pigments containing arsenic] in their covers and other binding components."

https://sites.udel.edu/poisonbookproject/

Via @dataisplural.

#history #books #data #dataset


boilingsteam, 6 months ago to linux

RedPajama v2 Open Dataset with 30T Tokens for Training LLMs: https://together.ai/blog/redpajama-data-v2
#linux #llm #redpajama #dataset #training #open


+ doboprobodyne

boilingsteam, 19 days ago to llm

Building a Large Japanese Web Corpus for Large Language Models: https://arxiv.org/abs/2404.17733 #llm #japanese #dataset #corpus #training
