Bot Detection: How Can We Scrape The Web Without Getting Blocked?

Disclaimer: Special credit to Dariusz Niespodziany of the GitHub community for writing this brilliant article.

Whether you’re just getting started with web scraping and wondering why your solution isn’t working, or you’ve been building crawlers for a while and are stuck on a page that says you’re a bot and can’t proceed, keep reading.

In recent years, anti-bot solutions have evolved. More and more websites are implementing security measures, ranging from the simple, such as filtering IP addresses based on their geolocation, to the sophisticated, such as the in-depth study of browser characteristics and behavioral analysis. All of this increases the difficulty and cost of web scraping content compared to a few years ago. It is, however, still possible. I’ve highlighted a few tips that you might find useful below.

How Do You Build A Bot That Is Undetectable?

The following is a list of specialized solutions that we have used to circumvent various anti-bot measures. Depending on your use case, you may need one or more of them:


| Scenario/use case | Solution | Example |
|---|---|---|
| Short-lived sessions without auth | Pool of rotating IP addresses | Useful when scraping websites such as Amazon, Walmart, or public LinkedIn pages, that is, any website that does not require registration. You plan a large number of brief sessions and can afford to be blocked occasionally. |
| Geographically limited websites | Region-specific pool of IP addresses | Useful when a website uses a firewall such as Cloudflare’s to block entire geographies from accessing it. |
| Extended sessions after sign-in | Reliable pool of IP addresses and a stable set of browser fingerprints | The most common case is social media automation, where you build a tool that automates social media accounts, for example to manage ads more efficiently. |
| JavaScript-based detection | Common evasion libraries, such as puppeteer-extra-plugin-stealth | Numerous websites use FingerprintJS, which can be readily circumvented by adding open-source plugins such as the aforementioned puppeteer stealth plugin to your existing software. |
| Detection based on browser fingerprinting techniques | Natural-looking browser fingerprints that cover the entire surface the target website’s JavaScript checks | This is one of the most sophisticated scenarios. Credit card processors such as Adyen or Stripe are popular examples. They build a highly detailed browser fingerprint to detect credit card fraud and prompt the user for additional authorization. |
| Unique set of detection techniques | Custom bot software written specifically to target the website’s unique detection surface | Examples include sneaker-selling websites and e-commerce stores that are reportedly under attack by custom-made bot software. |
| Simple custom-made detection techniques | Before digging into any of the above: if you’re targeting a smaller website, you may find that all you need is a Scrapy script with some adjustments and an inexpensive data-center proxy. | |
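
The rotating-pool idea from the first scenario can be sketched in a few lines of Python. This is only a minimal illustration, not the tooling used by the author: the class name `RotatingProxyPool`, its methods, and the `proxy-*.example` addresses are all hypothetical placeholders.

```python
import itertools
from typing import Dict, List, Set


class RotatingProxyPool:
    """Round-robin pool that skips proxies marked as blocked (hypothetical helper)."""

    def __init__(self, proxies: List[str]) -> None:
        if not proxies:
            raise ValueError("at least one proxy is required")
        self._proxies = list(proxies)
        self._blocked: Set[str] = set()
        self._cycle = itertools.cycle(self._proxies)

    def mark_blocked(self, proxy: str) -> None:
        # Call this when the target site starts rejecting requests from this proxy.
        self._blocked.add(proxy)

    def next_proxy(self) -> str:
        # Look at each proxy at most once per call, so a fully blocked pool
        # raises instead of looping forever.
        for _ in range(len(self._proxies)):
            candidate = next(self._cycle)
            if candidate not in self._blocked:
                return candidate
        raise RuntimeError("every proxy in the pool is blocked")

    def as_requests_proxies(self) -> Dict[str, str]:
        # Mapping of scheme to proxy URL, one entry per scheme.
        proxy = self.next_proxy()
        return {"http": proxy, "https": proxy}


# Placeholder addresses; substitute the endpoints of your own proxy provider.
pool = RotatingProxyPool([
    "http://proxy-a.example:8080",
    "http://proxy-b.example:8080",
    "http://proxy-c.example:8080",
])
```

The dictionary returned by `as_requests_proxies()` has the scheme-to-URL shape that HTTP clients such as `requests` accept through their `proxies` argument, so each short session can pick up a fresh address. A real pool would usually also let blocked proxies cool down and return to rotation; that is left out here to keep the sketch short.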

 

Anti-bot Software Providers List

This is a non-exhaustive list of vendors that offer cutting-edge anti-bot solutions to businesses ranging from small e-commerce sites to Fortune 500 corporations:

  • Akamai Bot Manager by Akamai

  • Advanced Bot Protection by Imperva (formerly Distil Networks)

  • DataDome Bot Protection

  • PerimeterX

  • Shape Security

  • Cloudflare Bot Management

  • Barracuda Advanced Bot Protection

  • HUMAN

  • Kasada

  • Alibaba Cloud Anti-Bot Service

  • Travatar

How To Know Who Is Getting You Blocked

Become a member of the extra.community. There is an automated tester named Botty McBotface that employs numerous intricate ways to ascertain the precise type of protection used by a tested website (credits to berstend and others from #insiders).

 

Available stealth browsers with automation features

Note: These are not fully recommended, as some of them may contain malware. Use at your own discretion.

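| Stealth Browser | Puppeteer | Selenium | Evasions | SDK/Tooling | Origin |
|---|---|---|---|---|---|
| GoLogin | ✔️ | ✔️ | 🤮 | 👍 | 🇺🇸 + 🇷🇺 |
| Incogniton | ✔️ | ✔️ | 🤮 | ✔️ | |
| ClonBrowser | ✔️ | ✔️ | 🤮 | ✔️ | |
| MultiLogin | ✔️ | ✔️ | 🤮 | ✔️ | 🇪🇪 + 🇷🇺 |
| Indigo Browser | ✔️ | ✔️ | 🤮 | ✔️ | |
| GhostBrowser | | | | 👍 | |
| Kameleo | ✔️ | ✔️ | 🤮 | ✔️ | |
| AntBrowser | | | | | 🇷🇺 |
| CheBrowser | | | 🤮/✔️ | 👍 | 🇷🇺 |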

Legend: 🤮 – Evasion based on noise. ❌ – No. ✔️ – Acceptable (with support libraries or not). 👍 – Very nice.

 

Bypassing Bot Detection: A Technical Perspective

I examine several aspects of evasion tactics used to circumvent bot detection systems employed by major internet websites in this article. I write about both technical and non-technical subjects, including advice and citations to scholarly papers.

 

The technical findings presented below are based on observations made while running web scraping scripts against websites protected by the leading anti-bot solution manufacturers for several months.

 

This section is continually being updated. Over time, I’ll attempt to make it more structured in appearance and feel.

 

Random, maybe useful

puppeteer-extra-plugin-stealth 😈

✔️ Win / ❌ Fail / 🤷 Tie :

  • ✔️ Client Hints – Shipped recently. In line with Chromium cpp implementation.

  • ✔️ General navigator and window properties

  • ✔️ Chrome plugins and native extensions – This includes both Widevine DRM extension, as well as Google Hangouts, safe-browsing etc.

  • 🤷 p0f – detects the host OS from the TCP packet structure – Not possible to fix via Puppeteer APIs. Used in Akamai Bot Manager to match against JS and browser headers (Client Hints and User-Agent). There is a detailed explanation of the issue. The most reliable evasion seems to be not spoofing the host OS at all, or using OSfooler-ng.

  • 🤷 Browser dimensions – Although the stealth plugin provides the window.outerdimensions evasion, it won’t work without the correct configuration on a non-default OS in headless mode; it almost always fails when the viewport size >= screen resolution (low-resolution display on the host).

  • ❌ core-estimator – This can detect a mismatch between navigator.hardwareConcurrency and the SW/WW execution profile. It is not possible to limit or bump the ServiceWorker/WebWorker thread limit via existing Puppeteer APIs.

  • ❌ WebGL extensions profiling – desc. tbd

  • ❌ RTCPeerConnection when behind a proxy – Applies to both SOCKS and HTTP(S) proxies.

  • ❌ Performance.now – desc. tbd (red pill)

  • ❌ WebGL profiling – desc. tbd

  • ❌ Behavior Detection – desc. tbd (events, params, ML+AI buzz)

  • ❌ Font fingerprinting – desc. tbd (list+version+renderer via HTML&canvas)

  • ❌ Network Latency – desc. tbd (integrity check: proxy det., JS networkinfo, dns resolv profiling&timing)

  • ❌ Battery API – desc. tbd

  • ❌ Gyroscope and other (mostly mobile) device sensors – desc. tbd
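
Several items in the list above come down to cross-checking values that should agree with each other. The following Python sketch shows what such a check could look like from the detection side, assuming the detector has the request headers plus the viewport and screen sizes reported by JavaScript. It is not any vendor's actual logic: the function names are hypothetical, the User-Agent parsing is deliberately crude, and the viewport test is simplified to "strictly larger than the screen".

```python
from typing import Dict, List, Tuple


def platform_from_user_agent(user_agent: str) -> str:
    """Very rough OS guess from a User-Agent string (illustrative only)."""
    ua = user_agent.lower()
    # Order matters: Android UAs contain "Linux", iPhone UAs contain "Mac OS X".
    if "android" in ua:
        return "Android"
    if "iphone" in ua or "ipad" in ua:
        return "iOS"
    if "windows" in ua:
        return "Windows"
    if "mac os x" in ua:
        return "macOS"
    if "linux" in ua:
        return "Linux"
    return "Unknown"


def find_inconsistencies(
    headers: Dict[str, str],
    viewport: Tuple[int, int],
    screen: Tuple[int, int],
) -> List[str]:
    """Return human-readable descriptions of mismatching signals."""
    normalized = {name.lower(): value for name, value in headers.items()}
    problems: List[str] = []

    # Client Hints send the platform as a quoted string, e.g. "Windows".
    ua_platform = platform_from_user_agent(normalized.get("user-agent", ""))
    hinted = normalized.get("sec-ch-ua-platform", "").strip().strip('"')
    if hinted and hinted != ua_platform:
        problems.append(
            f"platform mismatch: User-Agent={ua_platform}, Client Hint={hinted}"
        )

    # A viewport bigger than the physical screen is a typical headless artifact.
    if viewport[0] > screen[0] or viewport[1] > screen[1]:
        problems.append("viewport larger than reported screen")

    return problems
```

A scraper that spoofs a Windows User-Agent on a Linux host with a small virtual display would trip both checks here, which is why the list suggests not spoofing the host OS at all.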

Multilogin, Kameleo and others 💰🤠

  • ❌ General navigator and window properties – As per the Multilogin documentation, custom browser builds typically lag behind the latest additions made by browser vendors. In this case, a modified Chromium M7X is used (almost 10 versions behind at the time of writing).

  • 🤷 Font masking – Font fingerprinting still leaks the host OS due to the use of different font rendering backends on Win/Lin/Mac. However, the basic “font whitelisting” technique can help to slightly rotate the browser fingerprint.

  • ❌ Inconsistencies – Profile misconfiguration leads to early detection of property/behavior inconsistencies.

  • ❌ Native extensions – Unlike puppeteer-extra-plugin-stealth, custom Chromium builds such as ML and Kameleo provide at most an override for the native plugins and extensions shipped with Google Chrome.

  • ❌ AudioContext APIs and WebGL property override – Manipulation of original canvas and audio waveform can be detected with custom JS.

  • ✔️ Audio and GL noise


Fingerprint test pages

These websites may be useful for testing web scraping software against fingerprinting techniques.

| Test page | Notes |
|---|---|
| https://bot.incolumitas.com/ | Very helpful and useful collection of tests |
| https://plaperdr.github.io/morellian-canvas/Prototype/webpage/picassauth.html | Canvas fingerprinting on steroids |
| https://pixelscan.net/ | Not 100% reliable, as it often reports Chrome as “inconsistent” after a new update, but worth checking because the author adds interesting new detection features every now and then |
| https://browserleaks.com/ | Doesn’t need introduction 😉 |
| https://f.vision/ | Good quality test page from some 🇷🇺 guys |
| https://www.ipqualityscore.com/ip-reputation-check | Commercial service with a free reputation check against popular blacklists |
| https://antcpt.com/eng/information/demo-form/recaptcha-3-test-score.html | ReCaptcha score as well as some interesting notes on how to optimize captcha solving costs |
| https://ja3er.com/ | SSL/TLS fingerprint |
| https://fingerprintjs.com/demo/ | Good for basic tests – from people who claim they can create unique fingerprints “99.5%” of the time |
| https://coveryourtracks.eff.org/ | |
| https://www.deviceinfo.me/ | |
| https://amiunique.org/ | |
| http://uniquemachine.org/ | |
| http://dnscookie.com/ | |
| https://whatleaks.com/ | |
| https://antcpt.com/eng/information/demo-form/recaptcha-3-test-score.html | Check your reCaptcha score |

 

Non-technical Notes

I’d like to offer a general observation to anyone who is evaluating or considering anti-bot software for their website. Anti-bot software is a waste of time. It’s snake oil sold to non-technical people at a high price.

Blocking bot traffic is predicated on the assumption that you (or your technology supplier) can differentiate between bots and legitimate users. This is accomplished through the use of a variety of privacy-invading approaches. None of them have been demonstrated to be effective against specialized site scraping technologies. Anti-bot software is entirely focused on minimizing low-cost bot traffic. It increases the cost and complexity of scraping but does not eliminate it totally.

 

Vendors of anti-bot software employ one of two types of detection techniques:

Binary Detection

This applies when no specialized web scraping software is used. The vendor can identify malicious traffic based on information revealed by the scraper itself, such as the User-Agent header and connection parameters.

As a result, only bots that were not specifically designed to scrape websites are blocked. This will satisfy the majority of managers, as the overall amount of bad traffic decreases, and it may appear as though there is no more bot traffic on the website. Wrong.
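
A minimal sketch of what binary detection can look like, assuming the only input is the request headers. The pattern list and the `is_obvious_bot` function are illustrative, not a real vendor rule set; the substrings are the default User-Agent prefixes of a few common HTTP clients and of headless Chrome.

```python
import re
from typing import Mapping

# Default User-Agent fragments of common non-browser clients (illustrative list).
AUTOMATION_UA_PATTERN = re.compile(
    r"python-requests|scrapy|curl|wget|go-http-client|headlesschrome",
    re.IGNORECASE,
)


def is_obvious_bot(headers: Mapping[str, str]) -> bool:
    """Flag requests that openly reveal automation in their headers."""
    normalized = {name.lower(): value for name, value in headers.items()}

    user_agent = normalized.get("user-agent", "")
    if not user_agent:
        return True  # real browsers always send a User-Agent
    if AUTOMATION_UA_PATTERN.search(user_agent):
        return True  # the client announced itself as a tool
    if "accept-language" not in normalized:
        return True  # browsers normally send this header on page loads
    return False
```

Anything that copies a real browser's headers passes this kind of check untouched, which is exactly the limitation described above.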

Traffic Clustering

Advanced web scrapers employ residential proxies and intricate evasion strategies to trick anti-bot software into believing the web scraper is a legitimate user. Due to the technical limitations of web browsers, no detection tool can reliably tell such a scraper apart from a real visitor.

In this situation, the vendor will most likely only be able to cluster the unwanted traffic by identifying patterns in bot activity and behavior. This is where browser fingerprinting comes into play. The issue with blocking traffic in this case is that it can be a risky operation if the bots successfully impersonate real people: blocking the bots may also render the website inaccessible to legitimate visitors.
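
The clustering idea can be illustrated with a short Python sketch, assuming each logged request carries the client IP and a handful of fingerprint attributes collected in the browser. The attribute names in `FINGERPRINT_KEYS`, the `min_ips` threshold, and both function names are assumptions made for this example, not a description of any vendor's pipeline.

```python
import hashlib
from collections import defaultdict
from typing import Dict, Iterable, Set, Tuple

# Hypothetical attribute set; real fingerprints use far more signals.
FINGERPRINT_KEYS = ("user_agent", "screen", "timezone", "webgl_renderer")


def fingerprint_id(attributes: Dict[str, str]) -> str:
    """Collapse the selected attributes into a short, stable identifier."""
    canonical = "|".join(attributes.get(key, "") for key in FINGERPRINT_KEYS)
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()[:16]


def suspicious_clusters(
    request_log: Iterable[Tuple[str, Dict[str, str]]],
    min_ips: int = 3,
) -> Dict[str, Set[str]]:
    """Group requests by fingerprint and keep those seen from many IPs."""
    ips_by_fingerprint: Dict[str, Set[str]] = defaultdict(set)
    for ip, attributes in request_log:
        ips_by_fingerprint[fingerprint_id(attributes)].add(ip)
    # One fingerprint hopping across many addresses suggests a proxy pool.
    return {
        fp: ips for fp, ips in ips_by_fingerprint.items() if len(ips) >= min_ips
    }
```

The false-positive risk mentioned above shows up directly in this sketch: thousands of identical phones on different mobile networks would produce the same fingerprint from many IPs and land in a "suspicious" cluster alongside the bots.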