The rise of artificial intelligence (AI) has been disruptive. Things are changing rapidly. And it seems like this technology is posing new moral, ethical, and existential questions each day.
There are plenty of stories and opinions to choose from. But one recent incident caught my eye.
A website owner claimed that their site was being “hammered” by a content scraping bot. The tool, img2dataset, catalogs large volumes of images for use in AI tools like Stable Diffusion.
The site’s owner opened an issue on the tool’s GitHub repository. He was advised to opt out of scraping. To do so, he’d have to add specific headers to his website.
This is our new reality. These tools are grabbing all manner of content – copyrighted images included. They’re regurgitating it to their users. Indeed, it’s the world’s biggest mash-up.
What’s more, it’s up to website owners to specify that they don’t want to participate. Is this as outrageous as it sounds? Let’s examine the issue and what it means for website owners.
Scraping Website Content for Profit Isn’t New
On one level, a tool scraping your website isn’t a novel idea. Search engines have been both indexing content and displaying relevant bits in results for years. In addition, RSS has allowed for retrieving text and images since the early days of the web.
And companies like Google have profited massively from these efforts. The more data they collect, the better results they provide. Thus, the more eyeballs they attract. That results in bigger ad revenue.
It’s been the way of the world for a few decades now. Therefore, it’s no surprise that other companies are taking a similar approach.
After all, an AI developer needs a good source of content to “train” its tool. What better way to do so than by collecting as much data as possible? For them, the web is the gift that keeps on giving.
So, the mere fact that a bot is visiting your website and cataloging content isn’t a big deal. But maybe that’s where the similarities end.
Is There Any Benefit for Website Owners?
The big difference is in who benefits. When a search engine indexes your website, you stand to gain something. Better rankings mean more visitors – and potentially more customers. And if you practice search engine optimization (SEO), you’re asking Google to visit.
AI bots may not rise to the level of an uninvited guest. But they’re not exactly visiting to your benefit, either.
For example, when you ask ChatGPT to write code, it’s not thinking back to the computer science course it took in college. The tool is tapping into previously scraped content. True, it may not be a line-for-line copy (although sometimes it is). But the language model is using what it has “learned” to produce an answer.
Similarly, generating an image of Elon Musk riding a unicorn isn’t magic (sorry to spoil the fun). The various visual components had to come from somewhere. Original (and potentially copyrighted) images are key ingredients.
In both scenarios, the beneficiaries are the AI tool and the end user. The sources used to generate this content? They have more bot traffic added to their monthly bandwidth usage.
The developer of img2dataset has a slightly different take. Among their responses to concerns about requiring an opt-out:
“You will have many opportunities in the years to come to benefit from AI. I hope you see that sooner rather than later. As creators you have even more opportunities to benefit from it.”
Their logic seems to suggest that we’ll all benefit from AI at some point. So, allowing the tool to scrape your content is good for humanity. Or something like that.
To Block or Not to Block?
The decision of whether to block AI from scraping your website is complex. Or it requires multiple stages, at least.
Perhaps the easiest part is identifying your philosophy. Are you OK with your content being scraped? If so, carry on. If not, the other parts of the equation are more complicated.
For one, there’s no universal way to opt out of all AI scraping. The headers for blocking img2dataset work only for that tool. That means keeping track of popular tools and finding methods for blocking them.
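To make the opt-out idea concrete, here is a minimal sketch of adding such a header to every response from a Python (WSGI) site. It assumes the tool honors an X-Robots-Tag header carrying the directives noai and noimageai, which is my reading of the img2dataset documentation and not something the GitHub issue above spells out; check the tool’s current README before relying on it. The names opt_out_middleware, demo_app, and collect_headers are hypothetical helpers made up for this example.

```python
from wsgiref.util import setup_testing_defaults

# Assumption: these X-Robots-Tag directives are the ones the scraper checks.
# Verify them against the tool's own documentation before relying on them.
OPT_OUT_DIRECTIVES = ("noai", "noimageai")


def opt_out_middleware(app):
    """Wrap a WSGI app so every response carries an X-Robots-Tag opt-out header."""
    def wrapped(environ, start_response):
        def patched_start_response(status, headers, exc_info=None):
            # Append the opt-out header to whatever the app already set.
            extra = ("X-Robots-Tag", ", ".join(OPT_OUT_DIRECTIVES))
            return start_response(status, list(headers) + [extra], exc_info)
        return app(environ, patched_start_response)
    return wrapped


def demo_app(environ, start_response):
    """A stand-in for your real site (hypothetical)."""
    start_response("200 OK", [("Content-Type", "text/plain")])
    return [b"hello"]


def collect_headers(app):
    """Call the app once with a fake request and return its response headers."""
    environ = {}
    setup_testing_defaults(environ)
    captured = {}

    def start_response(status, headers, exc_info=None):
        captured.update(headers)

    b"".join(app(environ, start_response))  # drain the response body
    return captured


if __name__ == "__main__":
    print(collect_headers(opt_out_middleware(demo_app)))
```

Most sites would set this header in their web server or CDN configuration instead of in application code. Either way, the limitation described above still applies: a header like this only helps with tools that choose to respect it.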
And companies like Google and Microsoft are further complicating the conversation. Both own search engines. You likely want them to index your website. But they also have AI tools. Where is the line drawn between these different products?
For its part, Google’s Bard claims that it doesn’t scrape content from websites (I asked!). But in the same conversation, it also says that websites are a part of where it gets data. Make what you will of those answers.
If you’d like to block all manner of AI tools, it won’t be easy. But maybe not for long. I can envision services that will cater to website owners who want nothing to do with content scraping. They may allow us to block these tools more efficiently.
But until such time, this seems like a losing battle. AI is inevitable. And who has time to catalog every new app that hits the market? Plus, it may be difficult to block these tools without also negatively impacting SEO.
Website Owners Must Fend For Themselves
Not everyone will be as impacted as the frustrated user in our introduction. In that case, it appears that img2dataset was indexing a large volume of images. Unless you’re in the same boat, your site probably won’t experience any problems.
But the issue goes much deeper. It should make us think about how we value our content. And we should question what sort of rights (if any) these tools have. Can they simply take what they want? Or should there be guidelines outlining what is and isn’t permissible?
Meaningful regulation of the industry could be months or even years away. In the interim, website owners are left to fend for themselves.
As part of the effort, it’s important to make your voice heard. Encourage companies to make opting out of scraping a transparent process. Express your concerns to elected officials and others of influence.
It may not slow down the onslaught of AI tools. But it could prevent things from getting too far out of hand. That will benefit us all.
AI Tools Are Scraping Your Website. Is That a Good Thing? Medianic.