Topic: Scraping

Just wondering about this topic. Is there a way to prevent scraping? I have many friends who have had their work scraped from sites like ArtStation etc. for use in AI-generated images and libraries.

Without starting a debate on the ethics of this, I was just wondering if there is a way to prevent scraping of one's gallery. Or is it already prevented when we disable downloading of images? Is this something I could do with a line of code?

Cheers,

Sim.

Re: Scraping

When you extract the Showkase zip file, you'll find a file named showkase.robots.txt.
You can open this file in a text editor to read more about it (in the comments at the top of the file).
Here are the full contents of the file.

# robots.txt
#
# A robots.txt file prevents the crawling and indexing of certain parts
# of your site by web crawlers and spiders run by sites like Yahoo!
# and Google. By telling these "robots" where not to go on your site,
# you save bandwidth and server resources.
# This file will be ignored unless it is called robots.txt and is in the root of your host.
# You can either rename this file to robots.txt or copy the content to your existing robots.txt
#
# If you have installed Showkase in the root of your host then:
# 1. Copy this file to the web root OR
# 2. If you have an existing robots.txt file add this content to it.
#
# If you have installed Showkase in a subdirectory then:
# 1. Copy this file to the web root OR
# 2. If you have an existing robots.txt file add this content to it.
# 3. Edit the paths so they point to the files you want to disallow
#
# For more information about the robots.txt standard, see:
# http://www.robotstxt.org/wc/robots.html
#
# For syntax checking, see:
# http://www.sxw.org.uk/computing/robots/check.html

User-agent: *
Crawl-delay: 10
Disallow: /_data/
Disallow: /_library/
Disallow: /_showkase/
Disallow: /_smarty/
Disallow: /_themes/
Disallow: /_trash/
Disallow: /_viewers/
Disallow: /admin/
Disallow: /readme.html

By default, the file (once renamed and placed in the correct location) will disallow crawling throughout the Showkase admin area of your web server.
If you want to disallow everything (public pages as well as the Showkase admin), then use the following (instead of the stock code):

User-agent: *
Disallow: /
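To see what each rule set tells a well-behaved crawler, here is a minimal sketch using Python's standard urllib.robotparser module. The rules are copied from the file above; the crawler name "ExampleBot" and the test paths (/index.php, /admin/index.php) are made up for illustration. It only shows how a compliant crawler would read the rules, not whether any particular scraper chooses to obey them.

```python
from urllib.robotparser import RobotFileParser

# Stock rules, copied from showkase.robots.txt.
STOCK_RULES = """\
User-agent: *
Crawl-delay: 10
Disallow: /_data/
Disallow: /_library/
Disallow: /_showkase/
Disallow: /_smarty/
Disallow: /_themes/
Disallow: /_trash/
Disallow: /_viewers/
Disallow: /admin/
Disallow: /readme.html
"""

# Alternative rules that disallow the whole site.
BLOCK_ALL_RULES = """\
User-agent: *
Disallow: /
"""


def build_parser(rules_text):
    """Parse robots.txt text without fetching anything over the network."""
    parser = RobotFileParser()
    parser.parse(rules_text.splitlines())
    return parser


def is_allowed(parser, path, agent="ExampleBot"):
    """Return True if a compliant crawler named `agent` may fetch `path`.

    "ExampleBot" is a hypothetical crawler name used only for this sketch.
    """
    return parser.can_fetch(agent, path)


stock = build_parser(STOCK_RULES)
block_all = build_parser(BLOCK_ALL_RULES)

# Hypothetical paths: one public page and one admin page.
for path in ("/index.php", "/admin/index.php"):
    print(path, "stock:", is_allowed(stock, path),
          "block-all:", is_allowed(block_all, path))
```

With the stock rules, the public page is allowed and the admin page is refused; with the block-all rules, both are refused.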

I hope this helps.

Re: Scraping

Thanks Steven. I might be wrong but I'm assuming the public_html folder of a website is the webroot. I renamed the existing showkase.robots.txt file in that location to robots.txt.

Is it really that simple?
No other settings? Lines 25 to 35 are activated by doing so?

If so...GREAT. Much appreciated.

Sim.

Re: Scraping

You're welcome!

... I'm assuming the public_html folder of a website is the webroot.

Yes (although not all web servers are the same and the root directory on others may be labelled something else such as 'htdocs').
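As a quick way to picture "the root of your host": whatever the folder is called on the server, crawlers look for the file at /robots.txt directly under the domain, even if the gallery itself lives in a subdirectory. A small sketch of that lookup, using only the standard library (the example.com addresses are placeholders, not a real site):

```python
from urllib.parse import urlsplit, urlunsplit


def robots_txt_url(page_url):
    """Return the address where a crawler would look for robots.txt.

    Only the scheme and host of `page_url` are kept; the path, query
    string and fragment are replaced by "/robots.txt".
    """
    parts = urlsplit(page_url)
    return urlunsplit((parts.scheme, parts.netloc, "/robots.txt", "", ""))


# Placeholder address: a gallery installed in a subdirectory.
print(robots_txt_url("https://www.example.com/gallery/index.php?page=2"))
```

If the file exists on the server but cannot be opened in a browser at the address this returns, it is probably not in the web root.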

Is it really that simple?

Yes. The 'robots.txt' file has been a web standard for a long time now and all major web search engines ought to respect it. I cannot guarantee that they all do, but I'm sure there would be a huge outcry if they did not.
It's a lot easier to tell search engines not to crawl and index a web site than it is to ensure that a site is crawled and indexed.

No other settings? Lines 25 to 35 are activated by doing so?

That's right... no other settings. The actual lines of code in the file are not commented out. The file just needs to be renamed to become active.
Just replace lines 25 to 35 with the following to disallow everything (if you like):

User-agent: *
Disallow: /
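If you would rather script the rename step than do it by hand, something along these lines should work. It is only a sketch: the function name install_robots_txt and both folder arguments are my own placeholders, and it assumes Showkase's showkase.robots.txt is still under its original name. It deliberately refuses to overwrite an existing robots.txt, since in that case the file's comments say to merge the content by hand, and it does not edit the paths for a subdirectory install.

```python
from pathlib import Path
import shutil


def install_robots_txt(showkase_dir, web_root):
    """Copy showkase.robots.txt into the web root as robots.txt.

    Both arguments are placeholder folder paths for this sketch, e.g.
    the Showkase install folder and "public_html". They may be the
    same folder when Showkase is installed in the root of the host.
    """
    source = Path(showkase_dir) / "showkase.robots.txt"
    target = Path(web_root) / "robots.txt"

    if target.exists():
        # Do not clobber an existing file; its rules need merging by hand.
        raise FileExistsError(
            f"{target} already exists; add the Showkase rules to it manually"
        )

    shutil.copyfile(source, target)
    return target


# Example call (placeholder paths, adjust to your own server layout):
# install_robots_txt("/home/user/public_html", "/home/user/public_html")
```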

Incidentally, I notice that the links in the file's comments (for more information) are no longer active.
Check out these links instead:
(1) http://www.robotstxt.org/robotstxt.html
(2) https://developer.mozilla.org/en-US/doc … Robots.txt
(3) https://en.wikipedia.org/wiki/Robots.txt