#dataset - kbin.social

stefan, 5 days ago (edited 5 days ago) to journalism

Learn how to request a dataset of all the databases an agency maintains with @muckrock's latest #FOIAFriday webinar:

https://youtube.com/watch?v=9-Do81pKSmM

Next session is on June 14 and you can sign up here: https://us02web.zoom.us/webinar/register/WN_2U6FCIpWRve_Odo_VUXtmw#/registration

#foia #journalism #CitizenJournalism #data #dataset #dataviz #webinar

alatitude77, 1 month ago to Discord

Billions of public #Discord messages may be sold through a #scraping service | #dataset #llms #training #machinelearning #artificialintelligence https://arstechnica.com/tech-policy/2024/04/billions-of-public-discord-messages-may-be-sold-through-a-scraping-service/

mtxvp, 1 month ago to random

Blog >> AWS Free Datasets: Part 2
https://blog.mtxvp.com/aws-free-datasets-part-2/?m
#dataset #opendata

stefan, 2 months ago to random

"The Data Liberation Project is an initiative to identify, obtain, reformat, clean, document, publish, and disseminate government datasets of public interest."

https://www.data-liberation-project.org

Interesting initiative, and they're looking for volunteers: https://www.data-liberation-project.org/get-involved/

#data #dataset #government #govtech

daieuxetdailleurs, 4 months ago to France French

[#commemorations] I'm currently finalizing a #jeudedonnees (dataset) on national celebrations and commemorations in #France since 1970, so over the next few days I'll be posting a little #quizz on the subject ⤵️

PS: the #dataset will be released as #opendata, and of course I'm preparing a few #dataviz ...

#histodons #histoire #CetteAnnéeLà #Ephemeride @geneafr @archivistodon #memoire #memoires #histoirenationale #archives

daieuxetdailleurs, 5 months ago to random French

One of my goals in (professional) life is to make it into @datagouvfr's #dataset favorites 😋 ❤️ (https://www.data.gouv.fr/fr/posts/suivi-des-sorties-novembre-2023)
#jeudedonnées #opendata #GLAMarchives #archivesnationales #viedarchiviste

researchbuzz, 6 months ago to Raleigh

I connected #Raleigh's #dataset of #trees to a #Wikipedia API query for finding nearby items of interest and then to an OpenAI API query so the trees could describe themselves and the area around them.

As you do when you're a weirdo

boilingsteam, 6 months ago to linux

RedPajama v2 Open Dataset with 30T Tokens for Training LLMs: https://together.ai/blog/redpajama-data-v2
#linux #llm #redpajama #dataset #training #open

a, 6 months ago to Futurology

The Common Crawl September/October 2023 Crawl Archive (CC-MAIN-2023-40) has been released.

100 TiB (compressed) of freshly crawled web data that can be used in your next data mining project.

🔗 https://data.commoncrawl.org/crawl-data/CC-MAIN-2023-40/index.html

#commoncrawl #dataset #opendata #open #research

ruthpozuelo, 7 months ago (edited 7 months ago) to datascience

What is the best way to share a dataset with non-technical users and invite them to collaborate on it?

#data #dataset #datasets #datascience #database

stefan, 7 months ago to ilaughed

A haunting collection of roughly 10,000 recordings of nuclear weapons tests from the 1940s to the 1960s.

"The films are equal parts terrifying and fascinating."

https://www.beautifulpublicdata.com/nuclear-weapon-test-films

via @Beautifulpublicdata

#data #dataset #video #AtomicBomb #NuclearPower #war #science

lysander07, 8 months ago to Futurology

The "Wikidata Research Articles Dataset" comprises peer-reviewed full research papers about Wikidata from its first decade of existence (2012-2022).
https://refubium.fu-berlin.de/handle/fub188/40510
@hcc_research
#Wikidata #research #dataset #knowledgegraphs #bibliography
via @wikiresearch (but posted on #twitter)

ppatel, 10 months ago to machinelearning

The MIT researchers found that #MachineLearning models trained for autocaptioning with their dataset consistently generated captions that were precise, semantically rich, and described data trends and complex patterns.

Researchers teach an #AI to write better chart captions.

A new #dataset can help scientists develop automatic systems that generate richer, more descriptive captions for online charts for #blind people.

https://news.mit.edu/2023/researchers-chart-captions-ai-vistext-0630

#accessibility #a11y #GenerativeAI

ben, 10 months ago to random

Has anybody done a Subject Access Request to Nectar/Tesco Clubcard? Given that they should have pretty much all the grocery and fuel purchases for the last 10+ years, it should make an interesting inflation data set.

#inflation #gdpr #dataset

Need help on saving reddit threads (for post-blackout reasons) to Obsidian

AI-TRIGGER WARNING: I've asked ChatGPT to revise my writing because it was ass (writing a stream of coherent looking text is not my forte). Proceed at your own discretion....

stefan, 11 months ago to internet

I just uploaded the final data backup from my Popular Twitter bots project: https://www.kaggle.com/datasets/fourtonfish/popular-twitter-bots

It looks like it lost access to Twitter's API on April 27. It was fun while it lasted!

https://twitterbots.glitch.me

#dataviz #dataset #kaggle #twitter #TwitterBots #TwitterPI

vyr, 1 year ago to random

anyone have a good collection of Fedi spam messages and account bios? huge bonus points for account metadata. boosts welcome.

#FediAdmin #FediMods #MastoAdmin #MastoMods #spam #dataset

KathyReid, 1 year ago to random

ICYMI: Do you work with #voice or #speech #data?

You might be a #linguist, or an #ML #engineer, doing things like data specifications, filtering or pre-processing or training #ASR, #STT or #TTS models, or you might work in #fairness or #bias evaluation.

If so, I’d love your help to understand current #dataset #documentation practices, and what we can do to make them better as part of my #PhD #research 🤓 ⌨️ 🎤

The #survey takes 10-20 minutes to complete, and you can opt in to win one of 3 gift cards valued at $AUD 50 each.

Research Protocol 2021/427 approved by #ANU Human Research Ethics Committee

Boosts appreciated 💕

https://anu.au1.qualtrics.com/jfe/form/SV_cSFODa5osYtm96e

KathyReid, 1 year ago to random

Do you work with #voice or #speech #data? You might contribute data, write data specifications for collection, perform filtering or pre-processing, train #ASR or #TTS models, or design or perform evaluations on #ML speech models.

If so, I’d love your help to understand current #dataset #documentation practices, and what we can do to make them better as part of my #PhD #research

The #survey takes 10-20 minutes to complete, and you can opt in to win one of 3 gift cards valued at $AUD 50 each.

Research Protocol 2021/427 approved by #ANU Human Research Ethics Committee

https://anu.au1.qualtrics.com/jfe/form/SV_cSFODa5osYtm96e

johentsch, 1 year ago to music

Hi Fediverse, #introduction
Currently I'm spending a lot of my time on the computer researching into #music #corpora in order to finish my #phd @ #epfl by the end of 2023. My main subject is #musicTheory and I'm trying to measure stylistic differences between tonal languages of the last four centuries through #statistics on #harmony (#stylometry).
I'm here to connect with people who are interested in #dh #DataScience #machinelearning #opendata #dataset #foss #privacy #musicianship #funk #techno

KathyReid, 1 month ago (edited 1 month ago) to ML

Delighted to be able to publicise a paper that was presented at the @ALTAnlp 2023 Workshop at the end of last year, co-authored with my #PhD supervisor, Associate Professor @eltwilliams, and written as part of my research at #ANU School of Cybernetics.

Titled "Right the docs: Characterising voice dataset documentation practices used in machine learning", it combines exploratory interviews and documentation analysis to characterise how large voice datasets (e.g. #LibriSpeech, @mozilla's #CommonVoice, and several others) document their #metadata.

Unsurprisingly, it finds that the #dataset #documentation practices seen currently do not meet the needs of the #ML practitioners who use these datasets.

We show, once again, that, in the words of Nithya Sambasivan, "everyone wants to do the model work, but nobody wants to do the data work" ...

https://aclanthology.org/2023.alta-1.6/

#RightTheDocs #WriteTheDocs

Citation:

Reid, K., Williams, E.T., 2023. Right the docs: Characterising voice dataset documentation practices used in machine learning, in: Muresan, S., Chen, V., Casey, K., David, V., Nina, D., Koji, I., Erik, E., Stefan, U. (Eds.), Proceedings of the 21st Annual Workshop of the Australasian Language Technology Association. Association for Computational Linguistics, Melbourne, Australia, pp. 51–66.

daieuxetdailleurs, 1 month ago to archivistodon French

[#veille] More than 7,000 requests for aid from unemployed artists and craftspeople of the Seine indexed (1930-1962) - La Revue française de Généalogie https://www.rfgenealogie.com/infos/plus-de-7-000-demandes-d-aide-d-artistes-et-artisans-chomeurs-de-la-seine-indexees

#dataset #opendata at: https://data.culture.gouv.fr/explore/dataset/secours-aux-artistes-et-artisans

#archivesnationales #20esiecle #archives #histoire #patrimoine #beauxarts #Seine #opencontent #GLAMarchives #Paris @geneafr @archivistodon

boilingsteam, 19 days ago to llm

Building a Large Japanese Web Corpus for Large Language Models: https://arxiv.org/abs/2404.17733 #llm #japanese #dataset #corpus #training

stefan, 12 days ago to history

"The Arsenical Books Database — part of the Winterthur Museum and the University of Delaware’s Poison Book Project — has identified hundreds of examples of 19th-century books that used [green pigments containing arsenic] in their covers and other binding components."

https://sites.udel.edu/poisonbookproject/

Via @dataisplural.

#history #books #data #dataset

mwfc, 3 months ago to random

Is anyone aware of a freely available #SHM (Structural Health Monitoring) database with data from Europe?

I am curious to correlate some data from various sources.

I am especially interested in building damage due to ground movement.

#dataset #minig #quakes #buildings #damage
