#WebScraping - kbin.social

remixtures, 7 days ago to internet Portuguese

#SocialMedia #USA #Twitter #Copyright #WebScraping: "A US district judge William Alsup has dismissed Elon Musk's X Corp's lawsuit against Bright Data, a data-scraping company accused of improperly accessing X (formerly Twitter) systems and violating both X terms and state laws when scraping and selling data.

X sued Bright Data to stop the company from scraping and selling X data to academic institutes and businesses, including Fortune 500 companies.

According to Alsup, X failed to state a claim while arguing that companies like Bright Data should have to pay X to access public data posted by X users."

https://arstechnica.com/tech-policy/2024/05/elon-musks-x-tried-and-failed-to-make-its-own-copyright-system-judge-says/

joe, 10 days ago to ai

A few weeks back, I thought about getting an AI model to return the “Flavor of the Day” for a Culver’s location. If you ask Llama 3:70b “The website https://www.culvers.com/restaurants/glendale-wi-bayside-dr lists “today’s flavor of the day”. What is today’s flavor of the day?”, it doesn’t give a helpful answer.

https://i0.wp.com/jws.news/wp-content/uploads/2024/05/Screenshot-2024-05-09-at-12.29.28%E2%80%AFPM.png?resize=1024%2C690&ssl=1

If you ask ChatGPT 4 the same question, it gives an even less useful answer.

https://i0.wp.com/jws.news/wp-content/uploads/2024/05/Screenshot-2024-05-09-at-12.33.42%E2%80%AFPM.png?resize=1024%2C782&ssl=1

If you check the website, today’s flavor of the day is Chocolate Caramel Twist.

https://i0.wp.com/jws.news/wp-content/uploads/2024/05/Screenshot-2024-05-09-at-12.41.21%E2%80%AFPM.png?resize=1024%2C657&ssl=1

So, how can we get a proper answer? Ten years ago, when I wrote “The Milwaukee Soup App”, I used Kimono (which is long dead) to scrape the soup of the day. You could also write a fiddly script to scrape the value manually. It turns out that there is another option, though: ScrapeGraphAI. ScrapeGraphAI is a web scraping Python library that uses LLMs and direct graph logic to create scraping pipelines for websites, documents, and XML files. Just say which information you want to extract and the library will do it for you.

Let’s take a look at an example. The project has an official demo where you need to provide an OpenAI API key, select a model, provide a link to scrape, and write a prompt.

https://i0.wp.com/jws.news/wp-content/uploads/2024/05/Screenshot-2024-05-09-at-12.35.29%E2%80%AFPM.png?resize=1024%2C660&ssl=1

As you can see, it reliably gives you the flavor of the day (in a nice JSON object). It goes even further, though: if you point it at the monthly calendar, you can ask it for the flavor of the day and soup of the day for the remainder of the month, and it can do that as well.

https://i0.wp.com/jws.news/wp-content/uploads/2024/05/Screenshot-2024-05-09-at-1.14.43%E2%80%AFPM.png?resize=1024%2C851&ssl=1

Running it locally with Llama 3 and Nomic

I am running Python 3.12 on my Mac, but running pip install scrapegraphai to install the dependencies throws an error there. The project lists Python 3.8+ as a prerequisite, so I downloaded 3.9 and installed the library into a new virtual environment.

Let’s see what the code looks like.

You will notice that just like in yesterday’s How to build a RAG system post, we are using both a main model and an embedding model.
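A minimal sketch of what such a script might look like, assuming ScrapeGraphAI's SmartScraperGraph class and a local Ollama server on its default port serving llama3 as the main model and nomic-embed-text as the embedding model. The helper names build_graph_config and scrape_flavor are hypothetical, and the config keys follow the library's Ollama examples from around the time of this post, so they may differ in other versions.

```python
def build_graph_config(base_url="http://localhost:11434"):
    """Return a config with a main model and an embedding model.

    Assumption: both models are served by a local Ollama instance.
    """
    return {
        "llm": {
            "model": "ollama/llama3",
            "temperature": 0,
            "format": "json",  # ask Ollama for JSON output
            "base_url": base_url,
        },
        "embeddings": {
            "model": "ollama/nomic-embed-text",
            "base_url": base_url,
        },
    }


def scrape_flavor(url):
    """Ask the scraping graph for the flavor of the day on one page."""
    # Imported lazily so the config helper works without the library installed.
    from scrapegraphai.graphs import SmartScraperGraph

    graph = SmartScraperGraph(
        prompt="What is today's flavor of the day?",
        source=url,
        config=build_graph_config(),
    )
    return graph.run()
```

Calling scrape_flavor("https://www.culvers.com/restaurants/glendale-wi-bayside-dr") should then return the answer as a Python dictionary, provided Ollama is running and both models have been pulled.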

So, what does the output look like?

https://i0.wp.com/jws.news/wp-content/uploads/2024/05/Screenshot-2024-05-09-at-2.28.10%E2%80%AFPM.png?resize=1024%2C800&ssl=1

At this point, if you want to harvest flavors of the day for each location, you can do so pretty simply. You just need to loop through each of Culver’s location websites.

Have a question, comment, etc? Please feel free to drop a comment, below.

https://jws.news/2024/how-to-use-ai-to-make-web-scraping-easier/

#AI #ChatGPT #llama3 #LLM #Ollama #Python #ScrapegraphAi #WebScraping

remixtures, 1 month ago to internet Portuguese

#SocialMedia #SocialNetworks #WebScraping #Discord: "An online service is scraping Discord servers en masse, archiving and tracking users’ messages and activity across servers including what voice channels they join, and then selling access to that data for as little as $5. Called Spy Pet, the service’s creator says it scrapes more than ten thousand Discord servers, and besides selling access to anyone with cryptocurrency, is also offering the data for training AI models or to assist law enforcement agencies, according to its website.

The news is not only a brazen abuse of Discord’s platform, but also highlights that Discord messages may be more susceptible to monitoring than ordinary users assume. Typically, a Discord user’s activity is spread across disparate servers, with no one entity, except Discord itself, able to see what messages someone has sent across the platform more broadly. With Spy Pet, third-parties including stalkers or potentially police can look up specific users and see what messages they’ve posted on various servers at once.

“Have you ever wondered where your friend hangs out on Discord? Tired of basic search tools like Discord.id? Look no further!” Spy Pet’s website reads. It claims to be tracking more than 14,000 servers, 600 million users, and includes a database of more than 3 billion messages." https://www.404media.co/a-spy-site-is-scraping-discord-and-selling-users-messages/

remixtures, 2 months ago to ai Portuguese

#AI #GenerativeAI #MidJourney #StabilityAI #GeneratedImages #WebScraping: "On Wednesday, Midjourney banned all employees from image synthesis rival Stability AI from its service indefinitely after it detected "botnet-like" activity suspected to be a Stability employee attempting to scrape prompt and image pairs in bulk. Midjourney advocate Nick St. Pierre tweeted about the announcement, which came via Midjourney's official Discord channel.

Prompts are the written instructions (like "a cat in a car holding a can of a beer") used by generative AI models such as Midjourney and Stability AI's Stable Diffusion 3 (SD3) to synthesize images. Having prompt and image pairs could potentially help the training or fine-tuning of a rival AI image generator model."

https://arstechnica.com/information-technology/2024/03/in-ironic-twist-midjourney-bans-rival-ai-firm-employees-for-scraping-its-image-data/

rennerocha, 2 months ago to python

Now it is official! I will be presenting the tutorial "Gathering data from the web using Python" at #PyConUS !!!

https://us.pycon.org/2024/schedule/presentation/30/

Who will be there? 🙂

#WebScraping #Scrapy #Tutorial #Python #Pittsburgh

stefano, 3 months ago to fediverse

Dear friends of the #BSDCafe and the #Fediverse,
Bytedance is connecting to our server every few seconds. As I don't understand why and, moreover, have already had problems with their rude behaviour, I've added this rule to nginx.conf:

# case sensitive matching
if ($http_user_agent ~ (Bytespider)) {
    return 403;
}

# case insensitive matching
if ($http_user_agent ~* (bytespider)) {
    return 403;
}

They should be out, at least for now.
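The difference between the two nginx operators can be illustrated with a short Python sketch: ~ is a case-sensitive regular-expression match and ~* is a case-insensitive one. This is only an illustration of the matching logic, not a replacement for the nginx rule. The function names and the sample User-Agent strings are hypothetical.

```python
import re


def matches_case_sensitive(user_agent, pattern="Bytespider"):
    """Mimic nginx's `~` operator: regex match that respects case."""
    return re.search(pattern, user_agent) is not None


def matches_case_insensitive(user_agent, pattern="bytespider"):
    """Mimic nginx's `~*` operator: regex match that ignores case."""
    return re.search(pattern, user_agent, re.IGNORECASE) is not None


def status_for(user_agent):
    """Return 403 for a blocked crawler, 200 otherwise."""
    return 403 if matches_case_insensitive(user_agent) else 200


# Illustrative User-Agent strings (assumed, not taken from real logs).
crawler = "Mozilla/5.0 (compatible; Bytespider)"
lowercase_crawler = "mozilla/5.0 (compatible; bytespider)"
browser = "Mozilla/5.0 (X11; Linux x86_64) Firefox/124.0"

print(status_for(crawler), status_for(lowercase_crawler), status_for(browser))
```

The case-insensitive rule also catches every string the case-sensitive one does, so on its own it would be enough.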

#Privacy #Security #WebScraping #OnlinePrivacy #FediverseSecurity

remixtures, 3 months ago to meta Portuguese

#Meta #Facebook #WebScraping #DataScraping: "One year after Meta sued a data-scraping company, a federal judge this week threw out Meta's breach-of-contract claim because the defendant obtained only public data from Facebook and Instagram.

Meta sued Bright Data in January 2023, making claims of breach of contract and tortious interference with contract. Bright Data is an Israeli company that collects data from various websites and offers related products to businesses.

"Bright Data concedes that it was bound to Meta's Terms while it had Facebook and Instagram accounts, and that it sells data collected from Facebook and Instagram," US District Judge Edward Chen wrote in a ruling issued Tuesday. "However, even viewing the evidence in the light most favorable to the non-moving party (Meta)... the Facebook and Instagram Terms do not bar logged-off scraping of public data; perforce it does not prohibit the sale of such public data. Therefore, the Terms cannot bar Bright Data's logged-off scraping activities."

Meta alleged that Bright Data violated Facebook and Instagram policies by developing and using "unauthorized automation software to scrape data from Facebook and Instagram, including users' profile information, followers, and posts that users have shared with others." The case is in US District Court for the Northern District of California."

https://arstechnica.com/tech-policy/2024/01/facebook-suffers-big-loss-in-lawsuit-against-data-scraping-company/

graham_knapp, 4 months ago to python French

#python #Nantes meetup on 8 February with wagtail cms, #django, #webscraping
https://www.meetup.com/nantes-python-meetup/events/298651245/

barefootstache, 4 months ago to random

#DailyBloggingChallenge (153/200)

There are two main ways to #scrape a #website, either actively or passively.

Active scraping is the process of using a trigger to actively scrape the already loaded webpage.

Passive scraping is the process of having the tool navigate to the webpage and scrape it.

The main difference is how one is getting to the loaded #webpage.

#WebsiteScraping

barefootstache, 4 months ago
#DailyBloggingChallenge (157/200)

When actively scraping, the main starting function is
document.querySelectorAll()
This returns a NodeList, which one typically iterates over with a for-loop.

On each item, either querySelector or querySelectorAll is applied recursively until all the specific data instances are extracted.

This data is then saved in a format that suits later processing, either as an object in an array or as a string, and is stored in localStorage, sessionStorage, or IndexedDB, or downloaded via a temporary link.
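The same select, loop, extract, and save pattern can be sketched outside the browser. The example below is a Python analogue that uses only the standard library's html.parser instead of the DOM's querySelectorAll, so it illustrates the flow rather than the browser API itself. The class name, the sample markup, and the "item" CSS class are all hypothetical.

```python
import json
from html.parser import HTMLParser


class ItemParser(HTMLParser):
    """Collect the link text and href of every <li class="item">."""

    def __init__(self):
        super().__init__()
        self.items = []        # extracted objects, one per matching <li>
        self._current = None   # item being built, or None outside a match
        self._in_link = False  # True while inside an <a> of a matching <li>

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "li" and "item" in (attrs.get("class") or "").split():
            self._current = {"title": "", "href": None}
        elif tag == "a" and self._current is not None:
            self._current["href"] = attrs.get("href")
            self._in_link = True

    def handle_data(self, data):
        if self._in_link and self._current is not None:
            self._current["title"] += data.strip()

    def handle_endtag(self, tag):
        if tag == "a":
            self._in_link = False
        elif tag == "li" and self._current is not None:
            self.items.append(self._current)
            self._current = None


def extract_items(html):
    """Return a list of dicts, one per matching list item."""
    parser = ItemParser()
    parser.feed(html)
    parser.close()
    return parser.items


SAMPLE = (
    '<ul>'
    '<li class="item"><a href="/a">First</a></li>'
    '<li class="item"><a href="/b">Second</a></li>'
    '<li class="other"><a href="/c">Skip</a></li>'
    '</ul>'
)

items = extract_items(SAMPLE)   # objects in an array
as_string = json.dumps(items)   # or a single string, ready to store
print(as_string)
```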

#WebScraping #VanillaJS #WebDev

remixtures, 5 months ago to internet Portuguese

#SocialMedia #Twitter #WebScraping #CFAA #Musk: "The recent lawsuit that X Corp., formerly Twitter, filed against a nonprofit called the Center for Countering Digital Hate illustrates the ongoing threat to researchers—whether they’re nonprofit researchers, academics, journalists—who engage in public interest investigations of platforms and often speak critically about platforms. They will often find things that the platforms are not happy for them to publicize.

In this case, the Center for Countering Digital Hate published reports that talked about what it termed hate speech and misinformation that remained on the Twitter platform. In doing this research, they had to scrape public information on Twitter. They analyzed posts at scale and they argued that Twitter allowed content to remain up that violated Twitter’s own policies on content. X Corp. sued CCDH and their theory was that CCDH violated the terms of service and that that’s a breach of contract.

They’re seeking tens of millions of dollars in damages based on the reputational harm to them of these reports, which they say caused advertisers to flee."

https://themarkup.org/hello-world/2023/12/16/how-elon-musk-is-trying-to-make-web-scraping-dangerous-again

jameswalters, 5 months ago to python

Thanks so much @pyohio! It was a privilege to speak, and I had a blast giving the talk. 💚️

If you'd like links to my slides as well as additional resources on web scraping, check out my blog:

http://james.walters.click/

#pyohio #python #scrapy #webscraping

jameswalters, 5 months ago to python

My @pyohio talk is almost up! Join me in 15 minutes for a crash course in web scraping with Scrapy.

https://www.youtube.com/watch?v=4kcLgHDQicg

#pyohio #python #webscraping #scrapy

stefan, 5 months ago to RSS

In the latest addition to my series of Pipedream.com tutorials I will walk you through turning your RSS feed into a Mastodon bot.

https://stefanbohacek.com/blog/turn-an-rss-feed-into-a-mastodon-bot

This time, no code needed!

#rss #automation #mastodon #pipedream #NoCode

stefan, 5 months ago

I've been also thinking about writing a tutorial on #WebScraping for research, which is already a gray area, but now with all these "AI" startups and companies doing that to make money off of everything we put online...I don't know, I don't want someone to use my work for that 😬

remixtures, 5 months ago to uk Portuguese

#UK #SocialMedia #TikTok #WebScraping #Media #News #Journalism: "The Chinese owner of TikTok has been accused of using UK news sites to train up its rival to ChatGPT without permission or fair payment.

Publishers including The Guardian, Daily Mail and The Telegraph are believed to have been targeted by a bot operated by the Beijing-based tech giant Bytedance.

The company has said its bot, dubbed Bytespider, has been deployed for “search optimisation” purposes.

However, news organisations are concerned that their articles are being used without permission to train chatbots and have raised concerns about copyright violations."

https://www.telegraph.co.uk/business/2023/12/11/tiktok-bytedance-scraping-uk-news-sites-train-chatgpt-rival/

smach, 5 months ago to python

If you want to get started doing Web scraping with Python, this tutorial is very well done.
By Cody Winchester for the NICAR @IRE_NICAR (National Institute for Computer Assisted Reporting - that is, data journalism) conference early this year
https://github.com/cjwinchester/nicar23-python-scraping

#python #WebScraping #scraping @IRE_NICAR @python

rennerocha, 6 months ago to python Portuguese

I have finished preparing my tutorial "Raspando Dados Da Internet Com Python" (Scraping Data from the Internet with Python), which will be presented at #PythonBrasil2023 next week in Caxias do Sul! https://github.com/rennerocha/pybr2023-tutorial
Who's going? 🙂
#Python #WebScraping

Imoptimal, 6 months ago to ai

Until there's comprehensive regulation against the #webScraping of data used for training of #AI models (that benefits only the #bigTech), there are some tools that can help.

This #Kudurru looks interesting, and it offers a service in the form of a #WordPress plugin too.

https://www.wired.com/story/kudurru-ai-scraping-block-poisoning-spawning/

stefan, 7 months ago to Futurology

Question for #researchers, what are some legitimate cases when you had to use web scraping?

#research #WebScraping

remixtures, 7 months ago to ai Portuguese

#AI #GenerativeAI #Google #Bard #Copyright #WebScraping: "As compared to the class actions against Open AI, this class action seems to be directed even more precisely to the core issue of the Gen AI tools – their alleged training via resources made public on the internet and/or protected under copyright laws – combining potential legal issues on both the IP and the privacy fronts (not to mention due to bias in the algorithms). Whatever the result of such class actions, this seems to be a timely occasion for parties to clarify and judges to assess legitimacy of Gen AI tools based on a deep analysis of the technical functioning and composition of the training datasets. The fact that this is the main goal of the class action seems to be supported by the proposals introduced by the plaintiffs for a governance scheme for all Gen AI models."

https://copyrightblog.kluweriplaw.com/2023/10/03/generative-ai-the-us-class-action-against-google-bard-and-other-ai-tools-for-web-scraping/

wraptile, 7 months ago to python

#sqlite being used as a message queue for #Python

https://github.com/litements/litequeue

Ideal for small projects that need a persistent, easy to package msg queue. Great for #webScraping
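To show the general idea of a persistent queue backed by SQLite, here is a minimal sketch built on Python's standard sqlite3 module. It does not use litequeue's own API. The TinyQueue class, its table layout, and the sample URLs are hypothetical and only illustrate the technique.

```python
import sqlite3


class TinyQueue:
    """A very small FIFO message queue stored in a SQLite database."""

    def __init__(self, path=":memory:"):
        # Pass a file path instead of ":memory:" to persist across runs.
        self.conn = sqlite3.connect(path)
        self.conn.execute(
            "CREATE TABLE IF NOT EXISTS queue ("
            "id INTEGER PRIMARY KEY AUTOINCREMENT, "
            "payload TEXT NOT NULL, "
            "done INTEGER NOT NULL DEFAULT 0)"
        )
        self.conn.commit()

    def put(self, payload):
        """Append a message to the end of the queue."""
        with self.conn:  # commits on success, rolls back on error
            self.conn.execute(
                "INSERT INTO queue (payload) VALUES (?)", (payload,)
            )

    def pop(self):
        """Return the oldest pending message, or None if there is none."""
        with self.conn:
            row = self.conn.execute(
                "SELECT id, payload FROM queue "
                "WHERE done = 0 ORDER BY id LIMIT 1"
            ).fetchone()
            if row is None:
                return None
            self.conn.execute(
                "UPDATE queue SET done = 1 WHERE id = ?", (row[0],)
            )
            return row[1]


# Usage: queue up URLs to scrape, then work through them in order.
q = TinyQueue()
q.put("https://example.com/page/1")
q.put("https://example.com/page/2")
first = q.pop()
second = q.pop()
empty = q.pop()
```

This sketch assumes a single worker. Several processes sharing one database file would need stricter locking than shown here.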

villares, 7 months ago (edited 7 months ago) to random Portuguese

I hate Google, but #GoogleColab can be handy.
We had some teaching examples of #webscraping with #selenium on colab because installing the webdriver locally can be challenging for some students.
Google broke selenium scraping on colab. :((
Any #MyBinder suggestions? Something else..?

remixtures, 9 months ago to ai Portuguese

#AI #GenerativeAI #WebScraping #DataMining #Privacy #DataProtection: "What kind of info are we comfortable forking over to the AIs, if any? Right now we are in the midst of a destabilizing moment. It’s alarming, yes, but it’s also an opportunity to renegotiate what we do and do not want to hand over to tech giants that have been gathering our personal data for decades now. But to make those sorts of decisions, first we have to know where we stand. What are the websites and apps we use every day doing with our data? Are they using it to train their AI systems? What can we do about it if so?

A good rule of thumb, to begin with: If you are posting pictures or words to a public-facing platform or website, chances are that information is going to be scraped by a system crawling the internet gathering data for AI companies, and very likely used to train an AI model of one kind or another. If it hasn’t already."

https://www.latimes.com/business/technology/story/2023-08-16/column-its-not-just-zoom-how-websites-and-apps-harvest-your-data-to-build-ai

doctorambient, 9 months ago to ai

Has #creativecommons thought about #ai and licenses that exclude that?

I've done a lot of work that I released as CC-BY, but I can't do that anymore, because I don't consent to have my work scraped for AI. That violates the BY part!

Any ideas about good #licenses to deal with that?

I want to share, but I don't want my work to be absorbed into a model that will take my job from me or not cite me.

#webscraping #machinelearning #gpt #openai #copyright #BillionairesAreEvil

veroandi, 10 months ago to privacy

Some web scraping job tasks:

Utilise advanced techniques to bypass blocking technologies, including CAPTCHAs, IP blocking, and user agent restrictions
&

Ensure compliance with applicable data privacy regulations and ethical web scraping practices

https://competitormonitor.com/text-web-scraping-specialist-remote-page-173.html
#privacy #joboffer #scraping #webscraping #python #ethics

Twitter sues four unknown entities for 'unlawful data scraping' | Engadget (www.engadget.com)

Elon Musk blamed scraping by these unknown entities for Twitter's decision to put a cap on how many tweets a user can see per day.