Planning on scraping LinkedIn? LinkedIn is the digital Rolodex of the current era, with over 500 million users. If you do not already have an account, it is probably worth creating one: you can rub elbows with industry titans, stalk old high school friends, and plan your next career move.
However, LinkedIn has a whole different connotation for scrapers. Rather than personally connecting with members in the industry, scrapers view LinkedIn as a goldmine of personal information.
Then there are LinkedIn company profiles, which are distinct from individual user profiles and offer a scraper an entirely different set of data.
Why scrape it at all? The answer should be self-evident: to gather all of that data. User profiles include information such as names, email addresses, industries, and skills. Company profiles list current employees, job postings, and various other critical data.
LinkedIn is a living directory of individuals and businesses in the workforce, and its members keep their information current. That makes the data unusually valuable.
Naturally, you cannot scrape all of the data mentioned above. You can, however, scrape some of it.
Getting it right involves several moving parts. Consider the following:
Selecting a scraping application is critical, not least because many of them are fee-based. You’ll want a firm grasp of the software itself, as well as of what you hope to accomplish on LinkedIn, so that you earn a healthy return on your investment.
Once you’ve chosen an application, you’ll need to tweak two critical settings within it. This applies to scraping in general, but it is essential on LinkedIn, which detects scrapers more aggressively than most websites.
In scraping software, “threads” refers to the number of open connections used to scrape. The more threads, the faster the scraping; but also the faster you will be flagged and banned.
The safest choice is to use a single thread per proxy. That is what an actual human being looks like, and anything more will raise suspicion eventually. Many scrapers, on the other hand, run up to ten threads per proxy.
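The one-thread-per-proxy rule can be sketched in code. This is a minimal illustration, not a full scraper: the proxy endpoints are placeholders, and the fetch itself is stubbed out where a real request through the proxy would go.

```python
import queue
import threading

# Hypothetical proxy endpoints -- substitute your own.
PROXIES = ["http://proxy1:8080", "http://proxy2:8080", "http://proxy3:8080"]

def scrape_worker(proxy, urls, results):
    """Drain URLs one at a time through a single proxy: one open
    connection per proxy, like one human visitor per IP."""
    while True:
        try:
            url = urls.get_nowait()
        except queue.Empty:
            return
        # Placeholder for the real fetch, which would go through `proxy`.
        results.append((proxy, url))

def scrape(urls_to_fetch):
    urls = queue.Queue()
    for u in urls_to_fetch:
        urls.put(u)
    results = []  # list.append is atomic in CPython, safe across threads
    # Exactly one thread per proxy -- the cautious ratio described above.
    threads = [threading.Thread(target=scrape_worker, args=(p, urls, results))
               for p in PROXIES]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return results
```

Because the worker pulls from a shared queue, adding proxies speeds things up without ever exceeding one connection per IP.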
Timeouts are the second key aspect to consider while altering your application’s scraping settings. Timeouts are the length of time elapsed between a server responding to a proxy and the proxy initiating a new request.
If your timeouts are set to ten seconds, your proxy will submit another request to the server for information after ten seconds of inactivity.
Many scrapers use a very short timeout: one or two seconds. This produces results quickly, since the scraper fires off fresh requests for information almost constantly.
Avoid this. Set your timeouts to a long duration, between 30 and 60 seconds. This gives the server a long rest before that particular proxy sends another request.
By setting your timeouts to a high value, you evade much of LinkedIn’s detection and avoid overwhelming them with repetitive queries.
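The timeout advice amounts to a randomized pause between requests. A sketch, assuming the defaults of 30 and 60 seconds from above (randomizing within the range keeps the gaps from looking machine-regular):

```python
import random
import time

def polite_pause(low=30.0, high=60.0):
    """Sleep for a randomized 30-60 second interval between requests,
    so the gap between hits never looks machine-regular. Returns the
    delay actually used."""
    delay = random.uniform(low, high)
    time.sleep(delay)
    return delay
```

Call `polite_pause()` after each request completes, per proxy, before issuing the next one.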
You are free to try scraping LinkedIn’s public pages the same way you would any scrape that begins with a search engine. You enter search phrases that include “LinkedIn.com,” which will return Google results pointing to specific LinkedIn pages.
Your scraper can then retrieve information from these publicly accessible pages and provide it to you. You’ll be scraping both Google and LinkedIn in this context, so you’ll want to avoid setting off any of their warning bells.
You can be quite specific with this by looking up company pages on LinkedIn, such as Microsoft, Google, or Apple, through a search engine. You would do this by searching for “Apple LinkedIn” and then scraping the results.
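Building those search queries is mechanical. A minimal sketch (the company list is illustrative; a real scraper would then fetch each results page and pull out the linkedin.com links):

```python
from urllib.parse import urlencode

def company_queries(companies):
    """Build one Google search URL per company, of the form
    '<company> LinkedIn', ready for the scraper to fetch."""
    base = "https://www.google.com/search?"
    return [base + urlencode({"q": "%s LinkedIn" % c}) for c in companies]
```

Feeding these URLs into the scraper, one per request and behind the thread and timeout settings above, keeps the Google side of the scrape quiet as well.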
This will, however, limit you to public pages, which you may not want.
Private pages are a different story. When users join LinkedIn, they are told that their information will be kept confidential, will not be sold to other firms, and will be used exclusively for internal purposes.
However, there are other reasons to scrape this data. Perhaps you’re on the lookout for programmers in a particular city, or open positions in a new state, or you’re scraping for research purposes. Those seem acceptable to me; reselling the data for profit does not.
Scraping LinkedIn private pages requires you first to create an account. Once you’ve completed this and logged into LinkedIn, you’ll be free to conduct as many searches as you like. Bear in mind that this account is not intended to connect with individuals but rather to provide access to LinkedIn to scrape.
After creating an account, you’ll need to determine what you want to search for. If you search for Microsoft employees, a large number of individuals will appear. You can instruct the scraper to capture whatever data is available to you in the absence of a connection. Essentially, the name, position, and, on occasion, the email address.
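Capturing those visible fields usually means pattern-matching the page you receive. A hedged sketch: the class names below are invented for illustration, not LinkedIn’s real markup, so you would inspect the actual pages and adjust the patterns.

```python
import re

def extract_public_fields(html):
    """Pull the handful of fields visible without a connection:
    name, position, and (occasionally) an email address. The
    'profile-name' / 'profile-title' class names are illustrative."""
    fields = {}
    name = re.search(r'class="profile-name">([^<]+)<', html)
    title = re.search(r'class="profile-title">([^<]+)<', html)
    email = re.search(r'[\w.+-]+@[\w-]+\.[\w.]+', html)
    if name:
        fields["name"] = name.group(1)
    if title:
        fields["position"] = title.group(1)
    if email:
        fields["email"] = email.group(0)
    return fields
```

Returning a dict per profile keeps the output easy to dump to CSV or a database later.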
Much of the information remains hidden until you connect with others, at which point you’re essentially running a standard LinkedIn account.
By performing the actions mentioned above, you are running direct automation inside LinkedIn. The risk of being caught is extremely high here, so be sure to follow the thread and timeout guidelines outlined above.
Additionally, ensure that you create the account using a single proxy IP address and then scrape on that account. This is all about resembling a human being. The majority of humans do not log onto LinkedIn using a different IP address every few hours. They access it with a single IP address: their own.
If you make the account using a proxy IP, scrape the account using the same proxy IP, and configure all your parameters appropriately, you dramatically lower your chances of being blocked or banned.
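Pinning an account to one IP can be done by binding every request to a single proxy for the account’s whole lifetime. A stdlib sketch, assuming an HTTP proxy URL (the address is a placeholder):

```python
import urllib.request

def opener_for_account(proxy):
    """Build one opener permanently bound to one proxy, so every
    request this account ever makes leaves from the same IP --
    just like a real user on a home connection."""
    handler = urllib.request.ProxyHandler({"http": proxy, "https": proxy})
    opener = urllib.request.build_opener(handler)
    # Keep the User-Agent consistent for the life of the account, too.
    opener.addheaders = [("User-Agent", "Mozilla/5.0")]
    return opener
```

Create the account through this opener, then do all subsequent scraping through the same one; never mix openers (and thus IPs) within a single account.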
Depending on the size of your scrape, you may require many proxies. The basic guideline is that the more proxies you have, the better, especially when scraping a challenging website.
If you want to use a single proxy per account and wish to quickly harvest a large amount of data, consider starting with 50 accounts and 50 proxies.
If you want to use more proxies per account (which I do not recommend), choose between 100 and 200 and rotate them frequently so they are not detected, blocked, banned, or blacklisted.
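If you do rotate, the usual pattern is simple round-robin over the pool. A minimal sketch:

```python
import itertools

def make_rotator(proxies):
    """Round-robin rotation: each call hands back the next proxy in
    the pool, cycling endlessly so no single IP carries every request."""
    pool = itertools.cycle(proxies)
    return lambda: next(pool)
```

Each account keeps its own rotator so the rotation pattern, like everything else, stays consistent per account.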
The fewer proxies you use, the more frequently they will be discovered. Because this is always an experiment, double-check everything.
Scraping LinkedIn requires proxies and a certain amount of nerve. You must be genuinely motivated to do it, because it will not be easy and may result in banned IP addresses or even a lawsuit; exercise caution. Know why you’re mining LinkedIn, then work diligently toward those precise goals.