a, to Futurology
@a@paperbay.org avatar

Common Crawl September/October 2023 Crawl Archive (CC-MAIN-2023-40) has been released.

100 TiB (compressed) of freshly crawled web data that you can use in your next data mining project.

🔗 https://data.commoncrawl.org/crawl-data/CC-MAIN-2023-40/index.html
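
If you want to poke at the new crawl, here is a minimal sketch (assuming the requests and warcio packages, and an arbitrary record limit) that downloads the crawl's WARC path listing and streams the first few captured URLs from the first WARC file:

    import gzip

    import requests
    from warcio.archiveiterator import ArchiveIterator

    CRAWL = "CC-MAIN-2023-40"
    BASE = "https://data.commoncrawl.org"

    # warc.paths.gz lists every WARC file in the crawl.
    listing = requests.get(f"{BASE}/crawl-data/{CRAWL}/warc.paths.gz", timeout=60)
    listing.raise_for_status()
    warc_paths = gzip.decompress(listing.content).decode("utf-8").splitlines()
    print(f"{len(warc_paths)} WARC files in {CRAWL}")

    # Stream the first WARC file and print the target URLs of a few response records.
    with requests.get(f"{BASE}/{warc_paths[0]}", stream=True, timeout=60) as resp:
        resp.raise_for_status()
        for i, record in enumerate(ArchiveIterator(resp.raw)):
            if record.rec_type == "response":
                print(record.rec_headers.get_header("WARC-Target-URI"))
            if i >= 50:
                break

Each WARC file is on the order of a gigabyte compressed, so streaming a single one keeps the experiment cheap.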

ButterflyOfFire, to random French
@ButterflyOfFire@mstdn.fr avatar

Does this page work for you? 👀

http://urlsearch.commoncrawl.org/

#CommonCrawl

pluralistic, to random
@pluralistic@mamot.fr avatar

The crybabies who freak out about The Communist Manifesto appearing on university curricula clearly never read it - chapter one is basically a long hymn to capitalism's flexibility and inventiveness, its ability to change form and adapt itself to everything the world throws at it and come out on top:

https://www.marxists.org/archive/marx/works/1848/communist-manifesto/ch01.htm#007

1/

pluralistic,
@pluralistic@mamot.fr avatar

Many of the biggest "open AI" companies are totally opaque when it comes to training data. Google and OpenAI won't even say how many pieces of data went into their models' training - let alone which data they used.

Other "open AI" companies use publicly available datasets like #ThePile and #CommonCrawl. But you can't replicate their models by shoveling these datasets into an algorithm. Each one has to be groomed - labeled, sorted, de-duplicated, and otherwise filtered.
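
To make that grooming step concrete, here is a minimal sketch of two of those passes, a crude length filter plus exact-duplicate removal; the normalization rule, threshold, and function names are illustrative, not anyone's actual pipeline:

    import hashlib
    import re

    def normalize(text: str) -> str:
        """Lowercase and collapse whitespace so trivially different copies hash alike."""
        return re.sub(r"\s+", " ", text.lower()).strip()

    def groom(documents, min_chars=200):
        """Yield documents that pass a crude length filter and are not exact duplicates."""
        seen = set()
        for doc in documents:
            text = normalize(doc)
            if len(text) < min_chars:          # drop very short fragments
                continue
            digest = hashlib.sha256(text.encode("utf-8")).hexdigest()
            if digest in seen:                 # exact-duplicate removal
                continue
            seen.add(digest)
            yield doc

    corpus = [
        "An example web page about model training.  " * 20,
        "an example web page about model training. " * 20,  # near-identical copy
        "Too short to keep.",
    ]
    print(len(list(groom(corpus))))  # -> 1

Real pipelines layer much more on top: near-duplicate detection, language identification, quality classifiers, and so on.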

28/

tallison, to infosec
@tallison@mastodon.social avatar

I've gotten a bunch of followers over the last coupla days.

For those interested, and especially those interested in PDFs, please take a look at our fairly newly released 8-million-file / 8 TB PDF corpus, derived from Common Crawl and then augmented by our team:

https://digitalcorpora.org/corpora/file-corpora/cc-main-2021-31-pdf-untruncated/
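
For anyone who wants to peek inside a file from a corpus like this, a quick sketch using the Apache Tika Python bindings (tika-python, which needs a Java runtime); the file name is a placeholder, not an actual corpus path:

    from tika import parser  # tika-python; starts a local Tika server on first use

    parsed = parser.from_file("example.pdf")       # placeholder: any PDF pulled from the corpus
    print(parsed["metadata"].get("Content-Type"))  # e.g. application/pdf
    text = parsed["content"] or ""                 # content can be None for image-only PDFs
    print(text[:500])                              # first 500 characters of extracted text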

tallison, to random
@tallison@mastodon.social avatar
AmyDentata, to random

deleted_by_author

bornach,
@bornach@masto.ai avatar

@morganmay @AmyDentata
Well, there is a thought process, but it is unfortunately a very human one that is:

• contaminated by the biases perpetuated in subreddits hoovered up by ,
• cleaned, labelled and fine-tuned by low-paid offshore workers in the gig economy,
• edited by the user's own vulnerability to simple confidence tricks perpetrated by pushers
willoremus, to random
@willoremus@mastodon.social avatar

This visual deep dive into one of the largest AI language datasets is nonstop fascinating, jaw-dropping, and troubling, and anyone who is remotely interested in how LLMs really work, their biases, or intellectual property should read it. https://www.washingtonpost.com/technology/interactive/2023/ai-chatbot-learning/

tallison,
@tallison@mastodon.social avatar

@willoremus I ❤️ that Google uses #CommonCrawl and thereby the fruits of #ApacheTika and #ApacheNutch.
