stefan, (edited) to journalism
@stefan@stefanbohacek.online

Learn how to request a dataset of all the databases an agency maintains with @muckrock's latest #FOIAFriday webinar:

https://youtube.com/watch?v=9-Do81pKSmM

Next session is on June 14 and you can sign up here: https://us02web.zoom.us/webinar/register/WN_2U6FCIpWRve_Odo_VUXtmw#/registration

#foia #journalism #CitizenJournalism #data #dataset #dataviz #webinar

alatitude77, to Discord
@alatitude77@mastodon.social

mtxvp, to random
@mtxvp@mastodon.social

stefan, to random
@stefan@stefanbohacek.online

"The Data Liberation Project is an initiative to identify, obtain, reformat, clean, document, publish, and disseminate government datasets of public interest."

https://www.data-liberation-project.org

Interesting initiative, and they're looking for volunteers: https://www.data-liberation-project.org/get-involved/

daieuxetdailleurs, to France French
@daieuxetdailleurs@framapiaf.org

Currently finalizing a […] devoted to national celebrations and commemorations in […] since 1970, so over the coming days I will be sharing a short […] on the subject ⤵️

PS: the […] will be put in […], and I am of course preparing a few ...

@geneafr @archivistodon

daieuxetdailleurs, to random French
@daieuxetdailleurs@framapiaf.org

One of my goals in (professional) life is to be featured among the favorites of @datagouvfr 😋 ❤️ (https://www.data.gouv.fr/fr/posts/suivi-des-sorties-novembre-2023)

researchbuzz, to Raleigh
@researchbuzz@researchbuzz.masto.host

I connected 's of to a API query for finding nearby items of interest and then to an OpenAI API query so the trees could describe themselves and the area around them.

As you do when you're a weirdo

boilingsteam, to linux
@boilingsteam@mastodon.cloud

RedPajama v2 Open Dataset with 30T Tokens for Training LLMs: https://together.ai/blog/redpajama-data-v2
#linux #llm #redpajama #dataset #training #open

a, to Futurology
@a@paperbay.org

The Common Crawl September/October 2023 Crawl Archive (CC-MAIN-2023-40) has been released.

100 TiB (compressed) of freshly crawled web data that can be used in your next data mining project.

🔗 https://data.commoncrawl.org/crawl-data/CC-MAIN-2023-40/index.html

ruthpozuelo, (edited) to datascience
@ruthpozuelo@mastodon.social

What is the best way to share a dataset with non-technical users and invite them to collaborate on it?

stefan, to ilaughed
@stefan@stefanbohacek.online

A haunting collection of roughly 10,000 recordings of nuclear weapons tests from the 1940s to the 1960s.

"The films are equal parts terrifying and fascinating."

https://www.beautifulpublicdata.com/nuclear-weapon-test-films

via @Beautifulpublicdata

#data #dataset #video #AtomicBomb #NuclearPower #war #science

lysander07, to Futurology

The "Wikidata Research Articles Dataset" comprises peer-reviewed full research papers about Wikidata from its first decade of existence (2012-2022).
https://refubium.fu-berlin.de/handle/fub188/40510
@hcc_research

via @wikiresearch (but posted on )

ppatel, to machinelearning
@ppatel@mstdn.social

The MIT researchers found that models trained for autocaptioning with their dataset consistently generated captions that were precise, semantically rich, and described data trends and complex patterns.

Researchers teach an to write better chart captions.

A new can help scientists develop automatic systems that generate richer, more descriptive captions for online charts for people.

https://news.mit.edu/2023/researchers-chart-captions-ai-vistext-0630

ben, to random
@ben@hardill.me.uk

Has anybody done a Subject Access Request to Nectar/Tesco Clubcard? Given that they should have pretty much all of my grocery and fuel purchases for the last 10+ years, it should make an interesting inflation data set.

#inflation #gdpr #dataset

stefan, to internet
@stefan@stefanbohacek.online

I just uploaded the final data backup from my Popular Twitter bots project: https://www.kaggle.com/datasets/fourtonfish/popular-twitter-bots

It looks like it lost access to Twitter's API on April 27. It was fun while it lasted!

https://twitterbots.glitch.me

#dataviz #dataset #kaggle #twitter #TwitterBots #TwitterPI

vyr, to random

anyone have a good collection of Fedi spam messages and account bios? huge bonus points for account metadata. boosts welcome.

KathyReid, to random
@KathyReid@aus.social

ICYMI: Do you work with or ?

You might be a , or an , doing things like data specifications, filtering or pre-processing or training , or models, or you might work in or evaluation.

If so, I’d love your help to understand current practices, and what we can do to make them better as part of my 🤓 ⌨️ 🎤

The takes 10-20 minutes to complete, and you can opt in to win one of 3 gift cards valued at $AUD 50 each.

Research Protocol 2021/427 approved by Human Research Ethics Committee

Boosts appreciated 💕

https://anu.au1.qualtrics.com/jfe/form/SV_cSFODa5osYtm96e

KathyReid, to random
@KathyReid@aus.social

Do you work with or ? You might contribute data, write data specifications for collection, perform filtering or pre-processing, train or models, or design or perform evaluations on speech models.

If so, I’d love your help to understand current practices, and what we can do to make them better as part of my

The takes 10-20 minutes to complete, and you can opt in to win one of 3 gift cards valued at $AUD 50 each.

Research Protocol 2021/427 approved by Human Research Ethics Committee

https://anu.au1.qualtrics.com/jfe/form/SV_cSFODa5osYtm96e

johentsch, to music
@johentsch@hostux.social

Hi Fediverse,
Currently I'm spending a lot of my time on the computer researching into in order to finish my @ by the end of 2023. My main subject is and I'm trying to measure stylistic differences between tonal languages of the last four centuries through on ().
I'm here to connect with people who are interested in

KathyReid, (edited) to ML
@KathyReid@aus.social

Delighted to be able to publicise a paper that was presented at the @ALTAnlp 2023 Workshop at the end of last year, co-authored with my supervisor, Associate Professor @eltwilliams, and written as part of my research at School of Cybernetics.

Titled "Right the docs: Characterising voice dataset documentation practices used in machine learning", it combines both exploratory interviews and documentation analysis to characterise how large voice datasets - e.g. , @mozilla's , and several others, document their .

Unsurprisingly, it finds that the practices seen currently do not meet the needs of the practitioners who use these datasets.

We show, once again, in the words of Nithya Sambasivan - "everyone wants to do the model work, but nobody wants to do the data work" ...

https://aclanthology.org/2023.alta-1.6/

Citation:

Reid, K., Williams, E.T., 2023. Right the docs: Characterising voice dataset documentation practices used in machine learning, in: Muresan, S., Chen, V., Casey, K., David, V., Nina, D., Koji, I., Erik, E., Stefan, U. (Eds.), Proceedings of the 21st Annual Workshop of the Australasian Language Technology Association. Association for Computational Linguistics, Melbourne, Australia, pp. 51–66.

daieuxetdailleurs, to archivistodon French
@daieuxetdailleurs@framapiaf.org

boilingsteam, to llm
@boilingsteam@mastodon.cloud

Building a Large Japanese Web Corpus for Large Language Models: https://arxiv.org/abs/2404.17733 #llm #japanese #dataset #corpus #training

stefan, to history
@stefan@stefanbohacek.online

"The Arsenical Books Database — part of the Winterthur Museum and the University of Delaware’s Poison Book Project — has identified hundreds of examples of 19th-century books that used [green pigments containing arsenic] in their covers and other binding components."

https://sites.udel.edu/poisonbookproject/

Via @dataisplural.

mwfc, to random
@mwfc@chaos.social

Is anyone aware of a freely available (Structural Health Monitoring) database with data from Europe?

I am curious to correlate some data from various sources.

I am especially interested in building damage due to ground movement.
