Web scraping may seem harmless, especially if you’re the one doing it. After all, you’re just sending requests and downloading data. It can’t be that bad, right?
While one person scraping data from a website might not have a large impact, once more people and companies join in on the fun, the strain put on servers becomes immense. As a result, administrators will often start banning any bot-like activity.
Yet avoiding blocks is just one side of the coin. There’s also an oft-cited legal aspect to scraping, particularly around data acquisition. Nearly everywhere in the world, legislation protects certain types of data. In other cases, Terms of Service forbid using a web crawler to scrape data.
As a result of all of these hurdles, best web scraping practices have arisen. Taking heed allows web scrapers to acquire all the data they need without hurting the performance of the server or breaking legal agreements.
Web scraping best practices
#1 Avoid sending too many requests
A great rule of thumb in web scraping is to apply the “minimum effective dose”: send only as many requests as you need to acquire the data. Optimize your web scraper so that it avoids loops and doesn’t revisit pages it has already crawled recently.
Additionally, requests per second is another factor to consider. Humans browse significantly more slowly than a headless browser blazing through URLs. A high request rate not only puts enormous strain on the web server’s resources, but it also makes your web crawler an easy target.
Of course, a lot of trial and error will be involved before you find the right balance between speed and carefulness in your browsing pattern. Some websites publish a “crawl-delay” directive in their “robots.txt” file, but such hints are relatively rare.
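The throttling described above can be sketched with a simple rate limiter. The two-second default interval here is an arbitrary assumption, not a universal rule; tune it per site:

```python
import time

class RateLimiter:
    """Enforce a minimum interval between outgoing requests."""

    def __init__(self, min_interval=2.0):
        self.min_interval = min_interval
        self._last = None  # time of the previous request, if any

    def wait(self):
        # Sleep just long enough to keep requests `min_interval` apart.
        if self._last is not None:
            remaining = self.min_interval - (time.monotonic() - self._last)
            if remaining > 0:
                time.sleep(remaining)
        self._last = time.monotonic()
```

Call `wait()` before every request: the first call returns immediately, and later calls pause only as long as needed to keep the rate down.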
#2 Respect “robots.txt”
While a lot of web scraping projects might start out as someone test-driving code on a data source, they should eventually take heed of “robots.txt”. Webmasters write the file specifically to give guidelines to web crawlers.
Upholding “robots.txt” not only greatly reduces the likelihood of receiving a ban, but it also makes the process easier. A website owner might choose to forbid access to specific parts of the site; others might ban specific user agents. Knowing and respecting these restrictions up front makes web crawling a lot simpler for you.
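Python’s standard library can parse these rules for you. A minimal sketch, using a made-up robots.txt for illustration:

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt rules for an example site.
robots_txt = """\
User-agent: *
Disallow: /private/
Crawl-delay: 5
"""

parser = RobotFileParser()
parser.parse(robots_txt.splitlines())

# Check whether a given path may be fetched before requesting it.
print(parser.can_fetch("my-crawler", "https://example.com/private/data"))  # False
print(parser.can_fetch("my-crawler", "https://example.com/products"))      # True
print(parser.crawl_delay("my-crawler"))  # 5
```

In a real crawler you would call `parser.set_url(...)` and `parser.read()` to load the live file instead of parsing a string.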
#3 Rotate user agents
A user agent is part of the HTTP request headers sent with every connection request. In simple terms, a user agent is a short message telling the web server what kind of device, OS, browser, and environment you’re using to connect. Servers then use that information to display content correctly. A typical user agent string looks like this:
Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:78.0) Gecko/20100101 Firefox/78.0 Waterfox/78.14.0
Sending multiple requests in quick succession with a single user agent is generally a bad idea. Basic bot detection tools can be trigger-happy, banning a visitor outright just for sending several requests under one user agent.
Rotating user agents means changing the headers sent in order to make it seem as if a different device is making each request. They can be set to absolutely anything (although it’s recommended to make them look plausible). You can even pretend to be Googlebot (also not recommended) by using its user agent.
If needed, publicly available databases can be used to acquire legitimate user agents. All the web scraper then needs to do is implement an intelligent rotation system that isn’t easily detectable.
While user agents won’t solve all of your ban-related issues, especially if you’re facing intelligent anti-crawling mechanisms, they will certainly improve your web scraper’s survival rate.
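One minimal way to rotate user agents, assuming a hand-picked pool of plausible strings (the values below are illustrative, not a vetted list):

```python
import random
import urllib.request

# Illustrative pool; in practice, source current strings from a public database.
USER_AGENTS = [
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:78.0) Gecko/20100101 Firefox/78.0",
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0 Safari/537.36",
    "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/119.0 Safari/537.36",
]

def build_request(url):
    """Attach a randomly chosen user agent to each outgoing request."""
    return urllib.request.Request(
        url, headers={"User-Agent": random.choice(USER_AGENTS)}
    )
```

Random choice is the simplest scheme; smarter rotation would also keep the user agent consistent within a browsing session, since a “device” that changes mid-session is itself a bot signal.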
Get 2 GB of free proxies per month. No payments or commitments.
#4 Use proxies
Just like using the same user agent for multiple requests is a bad idea, so is using the same IP address. Frequently changing your IP address through proxy servers is so vital to successful data extraction that we would consider it one of the most important web scraping practices.
Even relatively unsophisticated anti-scraping mechanisms will be able to detect ill-meaning bots if they don’t use proxy services. There’s only so much user agent rotation can do if all the requests are coming from one IP.
Combining the rotation of user agents with proxy services, however, provides a lot of protection against anti-scraping techniques. Most of them track IP addresses, request headers, and a few other points of data in order to differentiate between a bot and a real user. Changing the data points often means evading a block.
IP change frequency and the choice of proxies (whether residential or data center proxies) highly depends on the use case, the data source, and many other factors. It’s an exceedingly complex topic, which cannot be explained in a few sentences.
We do, however, recommend picking dedicated proxies that are paid per IP, such as our own. Shared proxies raise the likelihood that someone else gets the IP blocked.
We’d like to emphasize, however, that without proxies there’s no automated data extraction. Using the same IP address makes it easy for websites to block web scraping and to shut you off from accessing data.
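A bare-bones rotation over a fixed proxy list might look like this; the addresses are placeholders for your provider’s real endpoints:

```python
import itertools
import urllib.request

# Placeholder proxy gateways; substitute endpoints from your provider.
PROXY_POOL = itertools.cycle([
    "http://203.0.113.10:8080",
    "http://203.0.113.11:8080",
    "http://203.0.113.12:8080",
])

def next_proxy():
    """Hand out proxies round-robin so consecutive requests use different IPs."""
    return next(PROXY_POOL)

def opener_for(proxy):
    """Build a urllib opener that routes traffic through the given proxy."""
    handler = urllib.request.ProxyHandler({"http": proxy, "https": proxy})
    return urllib.request.build_opener(handler)
```

Per request, `opener_for(next_proxy()).open(url)` sends the traffic through the next IP in the pool. Production setups usually combine this with the user agent rotation above so both fingerprints change together.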
#5 Only scrape publicly accessible data
While there are currently no laws directly regulating web scraping, data privacy and security legislation is plentiful. As web scraping, in one way or another, is data acquisition, all data-related laws are directly applicable.
Of course, the best course of action is to always consult your legal counsel before using a web scraper. They will be able to guide you through the process and help you avoid getting into trouble.
Industry practices and legal cases, however, point towards one direction – only publicly accessible data can be scraped. Out of publicly accessible data, only non-personal (or non-identifying) information can be acquired.
As usual, the definitions and applications of the law are full of nuance and intricacies. Yet it can be narrowed down somewhat. In a legal sense, the best web scraping practice is to avoid all data behind log-ins, all data about persons, and anything that might in any shape or form be related to a person.
In a practical sense, web scraping social media is in many, if not all, cases off-limits as it’s mostly personal data behind a log-in. Scraping websites such as ecommerce marketplaces might be allowed if the Terms of Service do not forbid it.
#6 Don’t violate copyright
When bots scrape websites, they usually download a copy of the HTML code and store it in memory. Whether it’s just a part or the entire document, when the scraping process extends across many web pages, copyright infringement can occur.
As such, web scraping only publicly accessible data is not enough. It’s entirely possible that the data or file is publicly accessible, but subject to copyright. Downloading it would be an infringement.
As a result, data sources should be carefully selected and curated over time. Running web scraping bots without care and constant monitoring can land you in trouble.
How to find out if you’ve been banned?
Even with the best web scraping bots, optimal user agent rotation, and proper use of a proxy server, getting banned occasionally is inevitable. Since most bans are IP-based and a simple change of proxy would quickly lift them, many web server owners implement blocks in more creative ways.
In fact, many websites with advanced anti-scraping measures will start off slowly. If they suspect someone of using web scrapers with ill intent, they will often begin by serving a CAPTCHA. Bots, as you may well know, have a hard time solving CAPTCHAs.
While it’s the first sign of trouble, being served a CAPTCHA is by no means the end of your web scraping journey. There are plenty of CAPTCHA-solving services that can be integrated to get past it. Alternatively, you can simply change the user agent and IP address and attempt to reconnect to the page.
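Before retrying with a new identity, the scraper first has to notice the soft block. A crude heuristic sketch; the status codes and keyword below are assumptions, since every site signals blocks differently:

```python
def looks_blocked(status_code, body):
    """Guess whether a response is a soft block rather than real content.

    The markers below (403/429 and a 'captcha' keyword) are assumptions;
    adapt them to the signals the target site actually emits.
    """
    if status_code in (403, 429):
        return True
    return "captcha" in body.lower()
```

When this returns True, rotate to a fresh proxy and user agent before the next attempt rather than hammering the same endpoint.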
Things get creative and complicated from here onwards. Clever webmasters will set up traps that force the bot to keep running but deliver no data. In turn, that makes it harder for the web scraper to resolve issues. Methods that are frequently used are:
- Honeypot links. These are links in pages that are not visible to humans and nearly impossible to click. Yet, they exist in the source code. Whenever a bot downloads and extracts URLs, they will find the honeypot link and visit it. That usually results in a ban.
- Incorrect content. One of the most devious anti-bot methods. If the website detects bot-like activity, the content is switched to display the wrong data. In turn, it takes a significant amount of time before the block is discovered.
- Custom anti-bot page. Some webmasters will use the 301 (Moved Permanently) HTTP status code to trick bots into visiting useless pages.
- Nondescript error messages. Finally, an interesting technique used by website owners is to block a suspect IP address but display error messages that are either vague or, supposedly, indicate server-side issues.
In many cases, there’s barely any way to resolve these. No built-in browser tool or the like will, for example, instantly reveal that incorrect content is being displayed or that an error message is faked. Only a few, like honeypots, can be avoided with some clever coding.
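For instance, honeypot links hidden with inline styles can be filtered out while extracting URLs. This sketch only catches the simplest hiding tricks (inline `display:none`, `visibility:hidden`, or the `hidden` attribute); real sites hide traps in many other ways:

```python
from html.parser import HTMLParser

class VisibleLinkExtractor(HTMLParser):
    """Collect hrefs while skipping links a human could never see."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        attrs = dict(attrs)
        style = (attrs.get("style") or "").replace(" ", "").lower()
        if "display:none" in style or "visibility:hidden" in style:
            return  # invisible link: likely a honeypot
        if "hidden" in attrs:
            return
        if attrs.get("href"):
            self.links.append(attrs["href"])

page = '<a href="/real">ok</a><a href="/trap" style="display: none">x</a>'
extractor = VisibleLinkExtractor()
extractor.feed(page)
print(extractor.links)  # ['/real']
```

Links hidden via external stylesheets or JavaScript would slip through this filter; catching those requires rendering the page.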
While blocks are unavoidable, they can be heavily mitigated. Through the clever use of proxies, user agents, and best practices, the survival rate of any scraper can be significantly improved. You shouldn’t, however, use these tips to acquire data that is off-limits. No best practices can save you from legal troubles.