Scrapebox + Proxies: What Are The Best Settings to Use?

Krizzia Paolyn

Senior Automator at The Social Proxy.

Scrapebox is an incredibly versatile tool that businesses of all sizes use for a wide range of purposes. For example, small businesses can use it to scrape data on their competitors and primary keywords. Larger firms can use it to scrape product details, collect aggregate data for research, or simply collect data on an audience of consumers via Twitter or another social media platform.

It is also a risky tool to use. Scrapebox is not bound by human rules; it does exactly what you instruct it to do. If what you tell it to do violates the terms and conditions of the website you’re scraping, your IP address or account may be restricted or suspended. To a website being hammered by data requests, Scrapebox can look very much like a DDoS attack, and websites today take DDoS attacks very seriously. So what settings should you use to make sure the tool is used ethically and safely?

Defining Scrapebox

To begin, let us define Scrapebox. If you already use Scrapebox and are aware of its capabilities, you can skip to the next section. Its developers call the software “the Swiss Army Knife of SEO” because of its versatility and multi-purpose nature. As you might guess, it’s a scraper. You can point it at a webpage, and it will retrieve data from that page, or you can point it at a list of URLs, and it will retrieve data from each one.

Scrapebox is essentially an automation tool, which means that it relies heavily on web proxies. Proxies are intermediary servers that relay your traffic through their own IP addresses. As a result, they’re very useful for getting around IP blocks and rate limits. Scrapebox, for example, can swiftly scrape the top ten Google search results for a list of 1,000 keywords. However, after a certain number of fast hits, Google determines that a specific IP address is making an excessive number of rapid requests to its servers. Google then responds with a captcha, which slows the software down.

When the same job is spread across 1,000 different IP addresses, or even 200 in a cycle, the traffic coming from any single IP address is far lower. Google no longer sees a single individual making a thousand separate requests in ten minutes; it sees 200 people making five requests apiece in ten minutes. That is a much more manageable level of traffic and one that Google will not blink at.
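To make the arithmetic concrete, here is a minimal Python sketch of round-robin rotation. It is not how Scrapebox works internally; the keyword names, proxy addresses, and the assign_round_robin helper are all made up for illustration. It simply spreads 1,000 lookups over 200 proxies and counts how many land on each address.

```python
from collections import Counter
from itertools import cycle


def assign_round_robin(requests, proxies):
    """Pair each request with the next proxy in a repeating cycle."""
    return list(zip(requests, cycle(proxies)))


# Hypothetical inputs: 1,000 keyword lookups spread over 200 proxy addresses.
keywords = [f"keyword-{i}" for i in range(1000)]
proxies = [f"10.0.0.{i}:8080" for i in range(200)]

assignments = assign_round_robin(keywords, proxies)
per_proxy = Counter(proxy for _, proxy in assignments)

window_minutes = 10
busiest = max(per_proxy.values())
print(f"busiest proxy: {busiest} requests in {window_minutes} minutes")
print(f"that is {busiest / window_minutes:.1f} requests per minute per IP")
```

With these numbers every proxy carries five requests in the ten-minute window, or one request every two minutes per IP address.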

The point is that Scrapebox enables you to quickly access massive amounts of data that are ordinarily inaccessible. Numerous websites provide data APIs that you can use to extract data. As an example, consider Facebook’s Graph API. If you have a Facebook application with API access, you can retrieve certain types of data. However, if you lack API access or require data that the Facebook API does not provide, you can use Scrapebox to obtain it.

Scrapebox circumvents API constraints, obtains data that APIs cannot supply, avoids rate limits, and automates multi-step tasks that would otherwise require many repeated calls and data filtering through a conventional API.

What Are Its Features?

Scrapebox includes a variety of capabilities for scraping different types of data in different contexts.

You can provide it with a list of keywords, and it will visit several search engines to collect the results for those terms.
You can provide it with a single term or a list of keywords, and it will generate a much longer list of offshoot keywords using search engine autocomplete recommendations.
You can provide it with a list of proxy IP addresses, and it will verify each one to determine what type of proxy it is, what protocol it uses, and whether it is still operational.
You can provide it with a list of URLs, and it will leave blog comments on each URL, as many or as few as you choose.
You can provide it with a list of links, and it will analyze them for HTTP status codes, originating pages, anchor text, and so on in order to determine the legitimacy of your backlink profile.
You can provide it with a list of URLs, and it will check their Alexa rank.
You can feed it a list of URLs, and it will scrape the corresponding article data.
You can provide it with a URL, and it will search for and identify broken links on that site.
You can provide it with a list of URLs, and it will retrieve each page’s Page Authority.

The final handful of those functions, along with another half-dozen or so, are all available as add-ons to the Scrapebox application.

Numerous add-ons require more than the standard license to use. By the way, the standard license costs roughly $100. They claim it’s a limited-time offer of $97 rather than $197, although I’ve never seen it at full price.

Black Hat Warning

Numerous elements of Scrapebox technically violate a site’s terms of service. For example, Google’s developer terms of service specify that you will not seek to evade API constraints. Naturally, the primary reason for this is money. If a website sells API access, the owner does not want users to use third-party software to obtain the same data without paying for it. Furthermore, scraping consumes server resources, which might be costly. It can even consume all available bandwidth on smaller servers, thus shutting down the service for legitimate visitors.

All of this applies to scraping in general. That is why you use proxy IPs: to avoid detection. While all of this technically violates the conditions of numerous sites, a scraper is very rarely so flagrant and blatant as to get discovered.

The legality of data scraping is a contentious question at the moment. Several court proceedings are currently pending to decide what is and is not legal. This site provides an excellent overview of the current situation.

Other aspects of Scrapebox may be considerably more harmful. For example, widespread automated posting of blog comments is a classic spam strategy. Even if you attempt to be reasonable and valuable in your comments, you will still end up publishing a large number of ineffective or spammy ones unless you give each one individual attention. Scrapebox can spin content, but it lacks artificial intelligence, does not use machine learning, and cannot provide contextually appropriate remarks. Not only can these features place you on widely used blacklists such as Akismet, but they can also give your brand an incredibly poor reputation.

Scrapebox is, at its core, a tool. If used responsibly and ethically, it can provide a great deal of value. On the other hand, if you push it to its limits, you expose yourself to significant risk, with no one to blame but yourself.

What Are The Ideal Settings for Scrapebox?

Of course, many of you did not come here for the ethical lecture or an overview of the program’s capabilities. You came here because the title promised settings you could actually use. I guess you’ve scrolled down long enough to reach the right section.

First

To begin, you should check with the provider of the proxy servers you are using. Certain proxies support only one connection or request at a time. Others are unlimited. Some have limits of roughly ten, fifty, or one hundred. This limit governs the number of threads you can send to each proxy.

If you set the number of threads too high, your proxies may be blacklisted or caught in speed-filter captchas. In addition, if you set it higher than the proxy administrator permits, you risk having your access to those proxies terminated, especially if you’re using a private proxy list. Generally, it’s prudent to begin small and work your way up. After all, you do not need to obtain your data immediately; you can always run the software overnight.

If your proxy provider specifies a maximum thread count, use a value less than that. If they do not specify a limit, choose a suitable value for your internet connection and intended use.
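The rule of thumb above can be sketched in a few lines of Python. This is an illustration of the logic, not a Scrapebox setting: choose_thread_count, the 0.8 safety margin, the placeholder fetch function, and the example.com URLs are all assumptions introduced here.

```python
from concurrent.futures import ThreadPoolExecutor


def choose_thread_count(provider_max, preferred, safety_margin=0.8):
    """Return a thread count that stays within the provider's limit.

    provider_max: limit stated by the proxy provider, or None if unstated.
    preferred: the count you would pick based on your own connection.
    safety_margin: fraction of the provider limit to stay within (assumed).
    """
    if provider_max is None:
        return max(1, preferred)
    ceiling = max(1, int(provider_max * safety_margin))
    return max(1, min(preferred, ceiling))


def fetch(url):
    # Placeholder for a real request; it just returns the URL unchanged.
    return url


urls = [f"https://example.com/page/{i}" for i in range(20)]
workers = choose_thread_count(provider_max=10, preferred=50)

# The pool never runs more than `workers` requests at the same time.
with ThreadPoolExecutor(max_workers=workers) as pool:
    results = list(pool.map(fetch, urls))

print(f"using {workers} threads for {len(results)} URLs")
```

Starting with a small preferred value and raising it between runs follows the "begin small and work your way up" advice, while the ceiling keeps you under the provider's stated maximum.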

Second

Second, use a backconnect proxy if possible. A standard proxy is a single server with a single IP address through which you relay your traffic. A backconnect proxy is a pool of many different machines and IP addresses. Your traffic enters the pool, exits through one of its addresses to retrieve your data, and then returns to you.

Randomness is the primary advantage of a backconnect proxy swarm. If you use ten proxies, a site like Google can still identify the same behavior in the same pattern originating from ten distinct IP addresses. It can connect them all together to determine what you’re actually doing. When ten distinct machines form a backconnect swarm, the probability of observing a regular pattern is significantly reduced. The greater the swarm, the less patterning there is likely to be. You can learn more about backconnect proxies by visiting this page.

If you’re scraping results based on keywords, you should take advantage of as many keyword variations as possible.

You can pay Scrapebox for an add-on that suggests keywords for you, but it costs extra. Alternatively, you can use a free online service to get a list of keyword variations. Such a service starts with a single keyword and gives you every autocomplete suggestion available, beginning with the most popular options and then working through the alphabet. If it completes the alphabet without being stopped, it moves on to the first keyword it generated and repeats the process with that term, and so on down the list for as long as you let it run. In less than a minute, you can produce thousands of keywords.
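The expansion process described above is essentially a breadth-first walk over autocomplete suggestions. The following Python sketch shows the idea under stated assumptions: expand_keywords is a hypothetical helper, and fake_suggest is a stand-in with made-up data in place of a real autocomplete source, so nothing here calls an actual service.

```python
from collections import deque
from string import ascii_lowercase


def expand_keywords(seed, suggest, limit=50):
    """Breadth-first keyword expansion.

    For each term, ask `suggest` for completions of the bare term, then of
    the term followed by each letter a-z. New suggestions are queued and
    expanded in turn until `limit` keywords have been collected.
    """
    found = []
    seen = {seed}
    queue = deque([seed])
    while queue and len(found) < limit:
        term = queue.popleft()
        prefixes = [term] + [f"{term} {letter}" for letter in ascii_lowercase]
        for prefix in prefixes:
            for keyword in suggest(prefix):
                if keyword not in seen:
                    seen.add(keyword)
                    found.append(keyword)
                    queue.append(keyword)
                    if len(found) >= limit:
                        return found
    return found


def fake_suggest(prefix):
    """Stand-in for a real autocomplete source (hypothetical data)."""
    corpus = [
        "proxy server", "proxy settings", "proxy list",
        "proxy server address", "proxy settings windows",
    ]
    return [p for p in corpus if p.startswith(prefix) and p != prefix]


keywords = expand_keywords("proxy", fake_suggest, limit=10)
print(keywords)
```

The seen set prevents the same suggestion from being queued twice, and limit plays the role of "for as long as you let it run".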

If you’re looking for data on articles, links, keywords, or anything else that isn’t Google-specific, you can consider scraping Bing instead. There are two reasons for this. First, Bing is far laxer than Google when it comes to scraping. It is less concerned with rate limits or bot blocking, and its automated defenses work less diligently to prevent scraping. So scraping Bing is, in essence, simpler.

Second, Bing almost certainly draws directly on a large portion of Google’s results. Google even presented evidence of this in 2011, to which Microsoft’s response was essentially “so what?” Thus, there is a reasonable likelihood that the data you obtain from Bing is consistent with that obtained from Google. As long as an occasional error is acceptable, Bing can provide completely usable data.

One setting that is unique to Scrapebox is the set of search engines that will be scanned. Of course, it covers Google, Yahoo, and Bing. It also supports Rambler, BigLobe, Goo, Blekko, Ask, and Clusty, among others.

If you do not require data from some of these search engines, or if the results they return are of low volume or value, uncheck them and stop scanning them. Continuing to scan them is a waste of CPU cycles, electricity, and bandwidth if the data is useless to you.
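One way to decide which engines to uncheck is to run a short trial and compare how many results each engine returns. The sketch below assumes you have recorded those counts yourself; the numbers, the 100-result threshold, and the engines_worth_scanning helper are invented for illustration.

```python
def engines_worth_scanning(trial_counts, min_results=100):
    """Keep engines whose trial run returned at least `min_results` URLs."""
    return sorted(
        engine for engine, count in trial_counts.items() if count >= min_results
    )


# Hypothetical result counts from a short trial run over the same keywords.
trial_counts = {
    "Google": 940,
    "Bing": 870,
    "Yahoo": 610,
    "Rambler": 35,
    "Clusty": 12,
}

enabled = engines_worth_scanning(trial_counts)
print(enabled)
```

Engines that fall below the threshold are the ones to uncheck before the full run.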

Third

Finally, you can modify the timeout settings. If you’re using backconnect proxies or a private proxy list, you can set the timeout to a low value, such as 15-30 seconds. While shorter timeouts enable faster data harvesting, they can also overload proxies, causing you to be temporarily dropped from the proxy. Public proxies, which are already somewhat slow, should have longer timeouts. A duration of between 30 and 90 seconds is recommended here.
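These ranges can be written down as a small lookup table. The Python sketch below is only a way of organizing the numbers from this article; the pick_timeout helper and its cautious and strict_target flags are assumptions, and the 90-second value for strict targets comes from the conclusion of this article.

```python
# Ranges in seconds, taken from the text: private or backconnect proxies
# tolerate short timeouts, while public proxies need longer ones.
TIMEOUT_RANGES = {
    "private": (15, 30),
    "backconnect": (15, 30),
    "public": (30, 90),
}
STRICT_TARGET_TIMEOUT = 90


def pick_timeout(proxy_kind, strict_target=False, cautious=True):
    """Return a timeout in seconds for the given proxy kind.

    cautious=True picks the upper end of the range; False picks the lower.
    strict_target=True overrides the range with the longest timeout.
    """
    if strict_target:
        return STRICT_TARGET_TIMEOUT
    low, high = TIMEOUT_RANGES[proxy_kind]
    return high if cautious else low


print(pick_timeout("private", cautious=False))
print(pick_timeout("public"))
print(pick_timeout("backconnect", strict_target=True))
```

Defaulting to the upper end of each range trades speed for fewer dropped connections, which matches the cautious approach recommended throughout this article.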

Conclusion

If you’re using a limited proxy list or know you’ll be scraping a large amount of data directly from a picky site like Facebook or Google, use a longer timeout, typically 90 seconds. This helps ensure that you are not caught and filtered by captchas. It will collect data more slowly but more reliably.
