Bot Detection: How Can We Scrape The Web Without Getting Blocked?

Disclaimer: Special credit to Dariusz Niespodziany of the GitHub community, who wrote the brilliant article this piece is based on.

 

Whether you’re just getting started with web scraping and wondering why your solution isn’t working, or you’ve been dealing with crawlers for a while and are now stuck on a page that says you’re a bot and can’t proceed, keep reading.

 

In recent years, anti-bot solutions have evolved. More and more websites are implementing security measures, ranging from the simple, such as filtering IP addresses based on their geolocation, to the sophisticated, such as the in-depth study of browser characteristics and behavioral analysis. All of this increases the difficulty and cost of web scraping content compared to a few years ago. It is, however, still possible. I’ve highlighted a few tips that you might find useful below.

How Do You Build A Bot That Is Undetectable?

The following lists common scenarios and the specialized services we utilized to circumvent various anti-bot measures. Depending on your application, you may require one or more of them:


Scenario/use case: Short-lived sessions without authentication
Solution: Rotating pool of IP addresses
Example: Advantageous when scraping websites such as Amazon, Walmart, or public LinkedIn pages, i.e. any website that does not require registration. You intend to run a large number of brief sessions and can afford to be blocked occasionally.

Scenario/use case: Geographically restricted websites
Solution: Region-specific pool of IP addresses
Example: Useful when a website uses a firewall such as Cloudflare’s to prevent entire geographies from accessing it.

Scenario/use case: Extended sessions after sign-in
Solution: Reliable pool of IP addresses and a stable set of browser fingerprints
Example: The most frequent case here is social media automation, where you develop a program that automates social media accounts in order to manage ads more efficiently.

Scenario/use case: JavaScript-based detection
Solution: Common evasion libraries, such as puppeteer-extra-plugin-stealth
Example: Numerous websites use FingerprintJS, which can be readily circumvented by combining open-source plugins such as the aforementioned Puppeteer stealth plugin with your existing software.

Scenario/use case: Detection via browser fingerprinting techniques
Solution: Natural-looking browser fingerprints, i.e. ones that cover the entire surface the target website’s JavaScript solution validates
Example: One of the most sophisticated situations. Credit card processors such as Adyen or Stripe are popular examples: a highly sophisticated browser fingerprint is used to detect credit card theft and prompt the user for additional authorization.

Scenario/use case: Exceptional set of detection techniques
Solution: Bot software developed specifically for the target website’s unique detection surface
Example: Websites that sell sneakers, and e-commerce stores purportedly under attack by custom-made bot software.

Scenario/use case: Simple custom-made detection techniques
Solution: Before digging into any of the above, if you’re targeting a smaller website, you may find that all you need is a Scrapy script with some adjustments and an inexpensive data-center proxy.
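For the first scenario, the rotation logic itself can live in a few lines of code. Below is a minimal, hypothetical sketch of a rotating IP pool in JavaScript; the proxy URLs are placeholders, and a real setup would plug the selected proxy into your HTTP client or browser launcher.

```javascript
// Minimal sketch of a rotating IP pool for short-lived sessions.
// The proxy URLs are placeholders, not real endpoints.
class ProxyPool {
  constructor(proxies) {
    this.proxies = [...proxies];
    this.cursor = 0;
  }

  // Hand out proxies in round-robin order.
  next() {
    if (this.proxies.length === 0) throw new Error('proxy pool exhausted');
    const proxy = this.proxies[this.cursor % this.proxies.length];
    this.cursor += 1;
    return proxy;
  }

  // Drop a proxy once the target starts blocking it (e.g. repeated 403s).
  retire(proxy) {
    this.proxies = this.proxies.filter((p) => p !== proxy);
  }
}

const pool = new ProxyPool([
  'http://203.0.113.1:8080',
  'http://203.0.113.2:8080',
  'http://203.0.113.3:8080',
]);

const first = pool.next(); // use for one brief session
pool.retire(first);        // blocked? remove it and move on
```

In a real scraper, the value returned by next() would be passed to your HTTP client’s proxy option for each new session.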

 

Anti-bot Software Providers List

This is an incomplete list of organizations that offer cutting-edge anti-bot solutions to enterprises ranging from small e-commerce sites to Fortune 500 corporations:

 

Akamai Bot Manager by Akamai

Advanced Bot Protection by Imperva (formerly Distil Networks)

DataDome Bot Protection

PerimeterX

Shape Security

Cloudflare Bot Management

Barracuda Advanced Bot Protection

HUMAN

Kasada

Alibaba Cloud Anti-Bot Service

Travatar

How To Know Who Is Getting You Blocked

Become a member of the extra.community. It hosts an automated tester named Botty McBotface that employs numerous intricate techniques to ascertain the precise type of protection a tested website uses (credits to berstend and others from #insiders).

 

Available stealth browsers with automation features

Note: These are not fully recommended, as some of them may contain malware. Use at your own discretion.

GoLogin – Puppeteer: ✔️ | Selenium: ✔️ | Evasions: 🤮 | SDK/Tooling: 👍 | Origin: 🇺🇸 + 🇷🇺

Incogniton – Puppeteer: ✔️ | Selenium: ✔️ | Evasions: 🤮 | SDK/Tooling: ✔️

ClonBrowser – Puppeteer: ✔️ | Selenium: ✔️ | Evasions: 🤮 | SDK/Tooling: ✔️

MultiLogin – Puppeteer: ✔️ | Selenium: ✔️ | Evasions: 🤮 | SDK/Tooling: ✔️ | Origin: 🇪🇪 + 🇷🇺

Indigo Browser – Puppeteer: ✔️ | Selenium: ✔️ | Evasions: 🤮 | SDK/Tooling: ✔️

GhostBrowser – SDK/Tooling: 👍

Kameleo – Puppeteer: ✔️ | Selenium: ✔️ | Evasions: 🤮 | SDK/Tooling: ✔️

AntBrowser – Origin: 🇷🇺

CheBrowser – Evasions: 🤮/✔️ | SDK/Tooling: 👍 | Origin: 🇷🇺

Legend: 🤮 – Evasion based on noise. ❌ – No. ✔️ – Acceptable (with or without support libraries). 👍 – Very nice.

 

Bypassing Bot Detection: A Technical Perspective

In this section, I examine several aspects of the evasion tactics used to circumvent the bot detection systems employed by major websites. I cover both technical and non-technical subjects, including advice and references to scholarly papers.

 

The technical findings presented below are based on observations made over several months of running web scraping scripts against websites protected by the leading anti-bot vendors.

 

This section is continually being updated. Over time, I’ll attempt to make it more structured in appearance and feel.

 

Random, maybe useful

puppeteer-extra-plugin-stealth 😈

✔️ Win / ❌ Fail / 🤷 Tie:

  • ✔️ Client Hints – Shipped recently. In line with Chromium cpp implementation.

  • ✔️ General navigator and window properties

  • ✔️ Chrome plugins and native extensions – This includes the Widevine DRM extension, as well as Google Hangouts, safe-browsing, etc.

  • 🤷 p0f – detects the host OS from TCP packet structure – Not possible to fix via Puppeteer APIs. Used in Akamai Bot Manager to match against JS and browser headers (Client Hints and User-Agent). There is a detailed explanation of the issue. The most reliable evasion seems to be not spoofing the host OS at all, or using OSfooler-ng.

  • 🤷 Browser dimensions – Although the stealth plugin provides a window.outerdimensions evasion, it won’t work without the correct config on a non-default OS in headless mode; it almost always fails when the viewport size >= screen resolution (a low-resolution display on the host).

  • ❌ core-estimator – This can detect a mismatch between navigator.hardwareConcurrency and the ServiceWorker/WebWorker execution profile. It is not possible to limit or bump the ServiceWorker/WebWorker thread limit via existing Puppeteer APIs.

  • ❌ WebGL extensions profiling – desc. tbd

  • ❌ RTCPeerConnection when behind a proxy – Applies to both SOCKS and HTTP(S) proxies.

  • ❌ Performance.now – desc. tbd (red pill)

  • ❌ WebGL profiling – desc. tbd

  • ❌ Behavior Detection – desc. tbd (events, params, ML+AI buzz)

  • ❌ Font fingerprinting – desc. tbd (list+version+renderer via HTML&canvas)

  • ❌ Network Latency – desc. tbd (integrity check: proxy det., JS networkinfo, dns resolv profiling&timing)

  • ❌ Battery API – desc. tbd

  • ❌ Gyroscope and other (mostly mobile) device sensors – desc. tbd
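To make the browser-dimensions point above concrete, here is a rough sketch of the kind of consistency check a detection script can run. In a real page these values come from window and screen; plain objects stand in for them here, and the numbers are illustrative.

```javascript
// Sketch of a browser-dimensions consistency check, as a detector might run it.
// In a real page these values come from window.* and screen.*.
function dimensionsConsistent(env) {
  return (
    env.outerWidth >= env.innerWidth &&
    env.outerHeight >= env.innerHeight &&
    env.screenWidth >= env.outerWidth &&
    env.screenHeight >= env.outerHeight
  );
}

// Typical headful browser: the window fits on the screen.
const headful = {
  innerWidth: 1280, innerHeight: 667,
  outerWidth: 1280, outerHeight: 752,
  screenWidth: 1920, screenHeight: 1080,
};

// Misconfigured headless setup: viewport larger than the reported screen.
const headless = {
  innerWidth: 1920, innerHeight: 1080,
  outerWidth: 1920, outerHeight: 1080,
  screenWidth: 800, screenHeight: 600,
};

console.log(dimensionsConsistent(headful));  // true
console.log(dimensionsConsistent(headless)); // false
```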

Multilogin, Kameleo and others 💰🤠

  • ❌ General navigator and window properties – As per the Multilogin documentation, custom browser builds typically lag behind the latest additions made by browser vendors. In this case a modified Chromium M7X is used (almost 10 versions behind at the time of writing).

  • 🤷 Font masking – Font fingerprinting still leaks the host OS due to the different font rendering backends used on Windows/Linux/macOS. However, the basic “font whitelisting” technique can help to slightly rotate the browser fingerprint.

  • ❌ Inconsistencies – Profile misconfiguration leads to early property/behavior inconsistency detection.

  • ❌ Native extensions – Unlike puppeteer-extra-plugin-stealth, custom Chromium builds such as ML and Kameleo provide at most an override for the native plugins and extensions shipped with Google Chrome.

  • ❌ AudioContext APIs and WebGL property override – Manipulation of original canvas and audio waveform can be detected with custom JS.

  • ✔️ Audio and GL noise


Fingerprint test pages

These websites may be useful for testing fingerprinting techniques against web scraping software:

https://bot.incolumitas.com/ – Very helpful and useful collection of tests

https://plaperdr.github.io/morellian-canvas/Prototype/webpage/picassauth.html – Canvas fingerprinting on steroids

https://pixelscan.net/ – Not 100% reliable, as it often reports “inconsistent” for Chrome after a new update, but worth checking, as the author adds interesting new detection features every now and then

https://browserleaks.com/ – Needs no introduction 😉

https://f.vision/ – Good-quality test page from some 🇷🇺 guys

https://www.ipqualityscore.com/ip-reputation-check – Commercial service with a free reputation check against popular blacklists

https://antcpt.com/eng/information/demo-form/recaptcha-3-test-score.html – Check your reCaptcha score, plus some interesting notes on how to optimize captcha-solving costs

https://ja3er.com/ – SSL/TLS fingerprint

https://fingerprintjs.com/demo/ – Good for basic tests; from people who believe and claim they can create unique fingerprints “99.5%” of the time

https://coveryourtracks.eff.org/

https://www.deviceinfo.me/

https://amiunique.org/

http://uniquemachine.org/

http://dnscookie.com/

https://whatleaks.com/

 

Non-technical Notes

I’d like to offer a general observation to anyone who is evaluating or considering anti-bot software for their website. Anti-bot software is a waste of time. It’s snake oil supplied to non-technical individuals at a high price.

 

Blocking bot traffic is predicated on the assumption that you (or your technology supplier) can differentiate between bots and legitimate users. This is accomplished through the use of a variety of privacy-invading approaches. None of them have been demonstrated to be effective against specialized site scraping technologies. Anti-bot software is entirely focused on minimizing low-cost bot traffic. It increases the cost and complexity of scraping but does not eliminate it totally.

 

Vendors of anti-bot software employ one of two types of detection techniques:

Binary Detection

This type of detection catches bots that don’t use specialized web scraping software. The vendor identifies malicious traffic based on information the scraper reveals about itself, such as the User-Agent header and connection parameters.

 

As a result, only bots that are not specifically designed to scrape websites are blocked. This satisfies the majority of managers, as the overall amount of bad traffic decreases, and it may appear as though there is no more bot traffic on the website. Wrong.
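As a sketch of how shallow binary detection is, the toy check below flags traffic purely from the headers a client volunteers. The header values and the regex are illustrative, not any vendor’s actual rules.

```javascript
// Toy version of a binary (signature-based) bot check: no fingerprinting,
// just information the client volunteers in its request headers.
// Header keys are lowercased, as Node.js presents them server-side.
function looksLikeBot(headers) {
  const ua = headers['user-agent'] || '';
  if (/python-requests|scrapy|curl|wget|httpclient/i.test(ua)) return true;
  // Real browsers always send an Accept-Language header.
  if (!headers['accept-language']) return true;
  return false;
}

// A naive scraper using library defaults is trivially identified.
const naive = { 'user-agent': 'python-requests/2.28.1' };

// A scraper that mimics a browser passes this (and only this) level of check.
const disguised = {
  'user-agent':
    'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 ' +
    '(KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36',
  'accept-language': 'en-US,en;q=0.9',
};

console.log(looksLikeBot(naive));     // true
console.log(looksLikeBot(disguised)); // false
```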

Traffic Clustering

Advanced web scrapers employ residential proxies and intricate evasion strategies to make anti-bot software believe the web scraper is a legitimate user. Due to the technical limitations of web browsers, no detection tool exists that can reliably catch this.

 

In this situation, the vendor will most likely only be able to cluster the unwanted traffic by identifying trends in bot activity and behavior. This is where browser fingerprinting comes into play. The issue with blocking traffic in this case is that it may prove perilous if bots successfully impersonate real users: there is a risk that blocking bots will render the website inaccessible to legitimate visitors.

Growth Hacks: 10 Ways to Optimize Your Instagram Sources for Follow/Unfollow Method

You’re a newbie in the social media automation industry; you found Jarvee and would like to grow your accounts with it. You’ve started setting up your accounts, picked the best settings according to automation experts, and bought high-quality proxies; you haven’t experienced any blocks, and the follow and unfollow tools are working very well. But wait: you notice your IG followers aren’t even growing.

 

 

If this scenario is familiar to you, we believe you should read this.

 

Oftentimes, it’s not our content that’s the culprit for not getting the followers we want, but our inability to utilize great follow sources. You may regularly publish interest-piquing content, but it’s useless if it isn’t showcased to the right audience.

 

Hence, below are some tips we have to help you get the right follow sources for your accounts. Read on to know more…

 

  1. Account’s Competitors

 

Well, I believe having your business’s competitors as follow sources is common practice among Jarvee automators, but more often than not this is one of the effective techniques that is easily forgotten. These follow sources are not only for IG business accounts: if you own a personal account, competitors may mean accounts whose followers might also be interested in your content and would be an awesome addition to your own followers.

 

You may use the dropdown button to get access to your main account’s related profiles. From there, you will get great choices of follow sources you can check. 

 

 

  2. Quality Over Quantity

 

Through the years of doing Jarvee automation, one thing I’ve noticed is that clients often suggest adding follow sources with big follower bases to their accounts. While one might think this is a great idea, most often it’s not. We might be tempted by the cliché aphorism “the more, the merrier,” but for follow sources, stick with those with “more followers” and “merrier” engagement rates.

 

The reason is that while it’s fascinating to look at accounts with a lot of followers, sometimes those followers are not even real or active on Instagram. Following them on IG would therefore be a waste of time, as we can’t expect them to follow us back. Use follow sources with the most ideal engagement rates. Some automators opt for sources with 10k or fewer followers to decrease the chance of picking follow sources with fake followers.

 

In checking accounts’ engagement rates, you may click this link to explore the available free tools to use.



  3. Use Jarvee Tools

 

Now, within Jarvee, we can use the “Users and Hashtags” and “Scrape Tool” features to find accounts similar to yours and to the follow sources your clients provided. You may also want to scrape users based on your preferred location.

 

In doing this, make sure to use scraper accounts added as regular ones. Never use your main accounts for scraping, as this will burn them.

 

  4. Follow Mutual Followers

 

Another way to improve your chances of getting a follow back is to make use of your new followers. In other words, check the new followers you’ve gained to see whether they can be great sources themselves.

 

This is because when you follow them and they see that you have mutual followers/followings, they will most likely be encouraged to follow back. There is no assurance of this, but it works most of the time.

 

  5. Consider More Places!

 

Go out of your comfort zone. If you’ve been limiting yourself to followers within your area, begin attracting followers from nearby locations.

 

This depends on the client’s preference, but if you want more followers and a wider reach for your niche, you might try following users in the same niche in a different location that is still close to your chosen area. For example, if you target resorts in the city of Chicago, you might also consider places like Kansas City.

 

  6. What Are Your Interests?

 

A lot of clients want follow sources in niches similar to their own, and while this is a great idea, it has downsides as well. Remember that the followers inside those follow sources are not limited to one interest, one hobby, or one type of content. Say your client has a resort: what else are they interested in? Sports, online games, mountain adventures? Consider these other niches that show up in the client’s account as well. They might not be the main niche you want to target, but you can still get favorable followers from them.

 

Don’t cling only to competitors in the same field. Broaden your horizons. People are naturally multipotentialites.

 

  7. Google It!

 

Take advantage of the tools available on the internet to help you find the follow sources you need. There is plenty of software out there offering fantastic services for searching Instagram users by keywords, number of posts, followers, gender, niche, and so on.



  8. Be Resourceful.

 

If you are managing multiple accounts and have been adding a different set of sources to each one, try updating the follow-back ratio of each source and saving the highest performers. If some main accounts have the same or closely related niches, add the best sources to the accounts that need them. Don’t lock yourself into the idea that accounts with different niches must have different sources: as long as a source has great followers, go with it.




  9. Cut Them Off!

 

Seasoned and newbie automators alike know how laborious Instagram automation is nowadays, given Instagram’s ever-changing algorithm, so one thing that helps us get through the day is to lower the number of follows and unfollows compared to our old limits, to avoid potential blocks.

 

However, this move also affects the follow-back ratio of our accounts. To make up for it, don’t forget to regularly update the follow-back ratio data and delete the sources with poor rates. Given the current situation, you might want to replace follow sources with a follow-back ratio below 15%.
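The pruning rule above is easy to automate outside Jarvee. Here is a rough JavaScript sketch; the source names and counts are made-up examples.

```javascript
// Sketch: compute follow-back ratios and drop sources under a 15% threshold.
// Source names and counts are made-up examples.
const MIN_RATIO = 0.15;

const sources = [
  { name: 'competitor_a', followed: 200, followedBack: 50 }, // 25%
  { name: 'competitor_b', followed: 180, followedBack: 18 }, // 10%
  { name: 'competitor_c', followed: 150, followedBack: 45 }, // 30%
];

const keep = sources
  .map((s) => ({ ...s, ratio: s.followedBack / s.followed }))
  .filter((s) => s.ratio >= MIN_RATIO)
  .sort((a, b) => b.ratio - a.ratio); // best sources first

console.log(keep.map((s) => s.name)); // ['competitor_c', 'competitor_a']
```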

 

  10. Rank Them

 

Apart from deleting poorly performing follow sources, it’s also recommended to take advantage of Jarvee’s “selection ranking” feature. Remember that the higher a source’s ranking number, the more follow actions will be performed from that source. Assign the highest number to the sources with the highest follow-back ratios, since the data has already shown you have the best chance of gaining followers from them.

 



Disclaimer: Remember that Jarvee automation is not “one size fits all”. These tips might help some accounts grow faster but may not be effective for yours. Experiment with all of them and observe what works best for you.

9 Anti-detect Browsers for Automators and Data Scrapers

 

You might be one of those aspiring growth hackers and social media automation newbies who wonder: how do seasoned automators and growth experts succeed in growing accounts and scraping data while keeping accounts safe and managed in one place? What’s their secret? Will I be able to do that too?

Hard as it may seem, it’s no secret that anti-detect browsers have paved the way for successful growth hacking endeavors.

As technology has advanced over the years, websites have come up with more and more security measures to block growth hackers from scraping their data, and social media giants now prohibit businesses from creating and managing multiple accounts. In response, experts have developed countermeasures to address this dilemma.

While these security measures vary from simple to advanced, it is definitely possible to bypass them. Below are some anti-detect browser tools you might want to try when social media growth and scraping are on your mind. We compiled a list of the anti-detect browsers we know, together with detailed instructions on how to integrate proxies with them:


  1. Gologin 

https://gologin.com/gologin-integration

  2. Incogniton

https://incogniton.com/knowldedge%20center/proxies/

  3. Multilogin

https://docs.multilogin.com/l/en/article/hCyoVkjyHI-using-ip-authenticated-proxies

  4. IndigoBrowser

https://you-proxy.com/blog/indigo_browser_proxy_settings/

https://proxy-seller.com/blog/how_to_set_up_a_proxy_in_indigo_browser?utm_source=google.com&utm_medium=organic&utm_campaign=google.com&utm_referrer=google.com

  5. Ghost Browser

https://support.ghostbrowser.com/article/344-using-proxies

https://support.ghostbrowser.com/article/365-ghost-proxy-control

  6. Kameleo

https://help.kameleo.io/hc/en-us/articles/360003298577-Built-in-proxy-manager-HTTP-SOCKS-SSH-

  7. Antbrowser

https://you-proxy.com/blog/antbrowser_proxy_settings/

  8. CheBrowser

https://blog.chebrowser.site/quick-start

  9. ClonBrowser

           https://www.clonbrowser.com/help/proxy-protocol-settings

On the other hand, if you prefer not to use these browsers, there is a simple way to browse and connect accounts without your profiles and activities being tracked or linked to your personal devices. Just follow these steps.

For Google Chrome:

  1. In Google Chrome, add the extension BP Proxy Switcher

  2. Open the extension and add a proxy.

  3. Click “OK” and you’ll now be able to browse through the proxy. To check that you are really connected to it, go to ipinfo.io and verify that your IP location matches the proxy you are using.
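If you want to automate that last check, a small script can query ipinfo.io and compare the reported exit IP with the proxy you configured. This is a sketch; the expected IP below is a placeholder for your proxy’s address.

```javascript
// Verify that traffic really exits through the configured proxy.
// The expected IP is a placeholder; use your proxy's address.
const EXPECTED_PROXY_IP = '203.0.113.7';

function matchesProxy(reportedIp, expectedIp) {
  return reportedIp.trim() === expectedIp;
}

// Usage (requires network access; fetch is built into Node 18+):
// fetch('https://ipinfo.io/json')
//   .then((res) => res.json())
//   .then(({ ip }) => {
//     if (!matchesProxy(ip, EXPECTED_PROXY_IP)) {
//       console.warn(`Leak: traffic exits via ${ip}, not the proxy`);
//     }
//   });
```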

For Firefox:

As for the proxy manager for Firefox, we recommend using Simple Proxy Switcher.

We hope these will help you get started with your automation and scraping journey. For more guides and information about growth hacking, social media growth and scraping, feel free to contact us.