How to Scrape Indeed Jobs Data

When it comes to analyzing job market trends, web scraping lets you extract data directly from web pages; in particular, it can pull job postings from sites like Indeed. Because most job sites restrict listings geographically, block suspicious IP addresses, and deploy CAPTCHAs, scraping them typically requires proxies.

In this Indeed job scraping tutorial, you'll learn how to scrape real job listings from Indeed using Python and BeautifulSoup. Using The Social Proxy's mobile proxy, you'll bypass geographical limitations by simulating a New York-based IP address. By the end, you'll have scraped the data into a CSV file for further analysis and visualization.

Why scrape Indeed jobs data?

Indeed is a leading online job search site with millions of active listings across all industry sectors. According to Statista, Indeed had more than 500 million unique visitors in 2023, making it a market leader in the job search industry.

Indeed's job description and salary data are useful to recruiters, market researchers, and job seekers. You can track salary rates for specific occupations and geographical locations, gauge the demand for particular skills, and keep tabs on broader market trends. Indeed also surfaces employer preferences and job posting patterns, which can inform both employers and job seekers.

Setting up the scraping environment

To begin scraping job data from Indeed, you’ll need to set up the following tools in your development environment:

  • Python: click here to download.
  • The Social Proxy mobile proxy: click here to set it up.

Once you’ve downloaded Python and configured the proxy, switch to the newly created proxy using a proxy switcher service. For this demonstration, you can use the BP Proxy Switcher Chrome plugin.

Log in to The Social Proxy dashboard and navigate to the tab titled “Proxies.” Click the copy icon below your proxy and select “Host:Port:User:Pass,” as shown in the image below.

Launch the proxy switching service and fill in the designated field with the text you copied.

Once it’s been pasted, select it from the dropdown.

After successfully setting up the proxy, confirm that it's working.
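For a quick programmatic check, you can route a request through the proxy and confirm the IP address it reports. Here's a minimal sketch, assuming the requests library is installed (pip install requests); the proxy credentials below are placeholders, not real values.

import requests

# Placeholder credentials -- substitute the Host:Port:User:Pass values from your dashboard
proxy = 'http://USER:PASS@HOST:PORT'

response = requests.get(
    'https://api.ipify.org?format=json',
    proxies={'http': proxy, 'https': proxy},
    timeout=10,
)
print(response.json())  # should report the proxy's New York IP, not your own

Once the proxy checks out, set up the development environment.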

Run the command below in your terminal to create and activate a new virtual environment.

python3 -m venv scraper
source scraper/bin/activate    # for Windows users, run scraper\Scripts\activate

Next, install the required dependencies using the command below.

pip install pandas selenium bs4

Run the command below to create a file that'll hold the scraping script.

touch scraper.py

A New York proxy is essential when scraping Indeed data because the website displays region-specific listings, and access may be restricted based on geolocation. New York City, one of the largest job markets in the world, offers opportunities across nearly every industry. Using proxies to scrape Indeed data with a New York IP address will allow you to unlock job listings targeting this dynamic market, giving you a full view of local opportunities that might otherwise be hidden.

A step-by-step guide to scraping Indeed job listings

Now that your development environment has been fully set up, we can start scraping. Modern websites like Indeed rely heavily on JavaScript to dynamically load content, particularly for listings, images, and other interactive elements. To scrape such content, we’ll use Selenium to automate a browser, load JavaScript, and then scrape the fully rendered page.

Add the following imports at the top of the scraper.py file.

import pandas as pd
from bs4 import BeautifulSoup
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

Next, configure Selenium to use the ChromeDriver.

driver = webdriver.Chrome()  # Make sure you've installed the driver.
url = 'https://www.indeed.com/jobs?q=web+developer&l=New+York%2C+NY'

# Open the webpage
driver.get(url)

# Wait until the job postings are loaded
WebDriverWait(driver, 20).until(EC.presence_of_all_elements_located((By.CLASS_NAME, 'jobTitle')))
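Note that a proxy switcher extension only applies to your regular Chrome profile; the browser instance Selenium launches won't inherit it. If you want the scripted browser to route through the proxy as well, one option is to pass it via Chrome options. Here's a minimal sketch, assuming an unauthenticated Host:Port endpoint (username/password proxies typically need an extension or a tool such as selenium-wire):

# Optional: route the Selenium-driven browser through the proxy directly
options = webdriver.ChromeOptions()
options.add_argument('--proxy-server=http://HOST:PORT')  # placeholder host and port
driver = webdriver.Chrome(options=options)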
				
			

Parse the loaded HTML content using BeautifulSoup.

html = driver.page_source

# Parse the loaded HTML with BeautifulSoup
soup = BeautifulSoup(html, 'html.parser')

Open Indeed in your browser, right-click on one of the job listings and select “Inspect”. This will open the Developer Tools, where you’ll be able to analyze the underlying code.

As seen in the image above, the class jobTitle holds the job posting title.

Add the following line to your scraper.py file to extract all job titles on the page.

# Extract all job titles from the page
job_titles = soup.find_all('h2', class_='jobTitle')
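To confirm that the selector matched, you can print the first few titles before moving on:

# Quick sanity check: print the first five job titles
for title in job_titles[:5]:
    print(title.text.strip())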
				
			

Next, inspect the page to find the company name and location element.

As seen in the image below, the data-testid attributes company-name and text-location identify the company name and location, respectively.

Add the following lines to your scraper.py file to extract all company names and locations.

# Extract all company names and company locations from the page
company_names = soup.find_all('span', attrs={"data-testid": "company-name"})

job_locations = soup.find_all('div', attrs={"data-testid": "text-location"})

Now you can inspect the page to find the job summary element.

As seen in the image below, the job summary is held by a div with the class string heading6 tapItem-gutter css-1rgici5 eu4oa1w0.

Add the following line to your scraper.py file to extract all job summaries.

# Extract all job summaries from the page
job_descriptions = soup.find_all('div', class_='heading6 tapItem-gutter css-1rgici5 eu4oa1w0')
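Keep in mind that classes like css-1rgici5 are auto-generated and change whenever Indeed updates its styling, so this is the most fragile selector in the script. One hedge is to fall back to matching the more stable-looking part of the class string; the selector below is an assumption and may itself need updating:

# Fallback: match any div whose class attribute contains the stable substring
if not job_descriptions:
    job_descriptions = soup.select('div[class*="tapItem-gutter"]')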
				
			

Saving and structuring the scraped data

Now that the data is ready, let’s present it in a format that’s easier to read and understand.

First, create lists to store the job data. Add the code block below to your scraper.py file:

titles = []
companies = []
locations = []
descriptions = []

Transform the data, but keep in mind that some job listings may have missing values (e.g., a job listing without a location).

# Iterate through the job data and store it in corresponding lists
for i in range(max(len(job_titles), len(company_names), len(job_locations), len(job_descriptions))):
    titles.append(job_titles[i].text.strip() if i < len(job_titles) else "N/A")
    companies.append(company_names[i].text.strip() if i < len(company_names) else "N/A")
    locations.append(job_locations[i].text.strip() if i < len(job_locations) else "N/A")

    if i < len(job_descriptions):
        description = job_descriptions[i].find_all('li')
        desc_text = [li.text.strip() for li in description]
        descriptions.append('; '.join(desc_text) if desc_text else "N/A")
    else:
        descriptions.append("N/A")

The code block above iterates through job listings. It uses the `range()` function with the maximum length of these lists to ensure all available data is captured. For each iteration, it appends the corresponding information to separate lists (`titles`, `companies`, `locations`, and `descriptions`), using a conditional expression to add the text content if the index is valid, or “N/A” if it’s out of range.

For job descriptions, it finds all `<li>` elements, extracts their text, and joins them with semicolons, defaulting to “N/A” where data is missing.

Use Pandas to export the data into a CSV file.

df = pd.DataFrame({
    'Title': titles,
    'Company': companies,
    'Location': locations,
    'Description': descriptions
})

# Export to CSV
df.to_csv('indeed_jobs.csv', index=False)
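Once the CSV is written, you can load it back for a quick sanity check or a first pass at analysis, for example counting listings per location:

# Reload the exported file and summarize it
jobs_df = pd.read_csv('indeed_jobs.csv')
print(jobs_df.head())
print(jobs_df['Location'].value_counts().head(10))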
				
			

Finally, close the browser by adding the code below:

driver.quit()

Execute your script by running the command below:

python3 scraper.py
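The script above only captures the first page of results. Indeed appears to paginate via a start query parameter in increments of 10; verify this in your own browser, since URL structures change. A sketch of extending the scraper across several pages might look like this:

# Hypothetical pagination loop: fetch the first five result pages
base_url = 'https://www.indeed.com/jobs?q=web+developer&l=New+York%2C+NY'
for offset in range(0, 50, 10):
    driver.get(f'{base_url}&start={offset}')
    WebDriverWait(driver, 20).until(
        EC.presence_of_all_elements_located((By.CLASS_NAME, 'jobTitle'))
    )
    soup = BeautifulSoup(driver.page_source, 'html.parser')
    # ...then repeat the extraction steps above for each page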


				
			

Conclusion

In this tutorial, you learned how to scrape job listings from Indeed using a proxy. This approach gives you access to geographically restricted job postings and makes CAPTCHAs and IP blocks far less likely, since the proxy masks your real IP address and helps prevent the website from blocking you.

You can further improve the script by integrating other job sites, supporting multiple search queries, strengthening error handling, optimizing scraping speed, and adding data analysis tools. Continuously refining your scraper and incorporating new data sources will help you build a robust system for job market data. Whether you're a job seeker, a recruiter, or a market researcher, knowing how to scrape Indeed job data will keep you in tune with the ever-changing employment market.
