How to Build an Amazon Scraper

With an estimated 8,928 orders per minute, Amazon is the largest online shopping site in the United States. If you intend to target a similar market, you’ve probably thought about developing a competitive pricing strategy. You might want to compare prices of similar products to make more informed purchasing decisions or scrape product reviews for natural language processing (NLP). All of these involve extracting and analyzing data from Amazon. In this tutorial, you’ll learn how to build a dataset from Amazon product pages and scrape similar products and reviews.

While scraping data from Amazon is legal, Amazon takes various measures to hinder frequent automated scraping. These restrictions can be bypassed using The Social Proxy’s residential proxies. In this Amazon product scraper tutorial, you’ll learn how to use tools like Python and the BeautifulSoup library to scrape product data. At the end, you’ll export the data to a CSV file for further analysis.

Understanding Amazon’s website structure

For this tutorial, you’ll use an office armchair to illustrate how an Amazon product page is structured and show how to identify key webpage components to focus on during scraping. Once you open the product link, you’ll be directed to a page that looks like the image below.

Every Amazon product page contains the following elements:

  • Product information: title, price, product description, specifications, etc.
  • Similar products section
  • Customer reviews and ratings
  • Metadata: Amazon Standard Identification Number (ASIN), product dimensions, and seller information

You’ll need to inspect the HTML layout of a product page to extract specific information. This will allow you to locate the web page elements containing the data you need and simplify the extraction process. We’ll use Chrome Developer Tools to review how to inspect HTML elements.

To open Chrome Developer Tools, press Ctrl + Shift + I, or click the three vertical or horizontal dots in the top-right corner (depending on your browser) and select More tools > Developer tools. This will display the Chrome Developer Tools panel alongside the Amazon product page. Click the element-selection icon within the red square box (see image below) and drag your cursor over any product information you want to inspect. The corresponding HTML element containing the selected information will be automatically highlighted in the Chrome Developer Tools layout. Familiarize yourself with the product page and observe the HTML elements in the layout.

The image below shows the HTML elements that will be highlighted when you hover over the product title. The product title is located within an HTML element with the tag name "span" and the following attributes: id="productTitle" and class="a-size-large product-title-word-break".

To locate the product ASIN, look for the "data-asin" attribute, whose value is a unique block of 10 characters. On every Amazon web page, the product’s ASIN is assigned to the data-asin attribute of a specific HTML element, as shown in the image below.
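
As a quick sketch of how you might read that attribute in code once you have a parsed page (the soup object is created in Step 4; searching for the first element that carries a data-asin attribute is an assumption based on the layout described here):

# find the first element that carries a data-asin attribute
asin_tag = soup.find(attrs={"data-asin": True})
asin = asin_tag["data-asin"] if asin_tag else "ASIN not found"
print(asin)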

Keeping the Chrome Developer Tools layout open, scroll down the page until you reach the header “Discover similar items.” As before, move your cursor over each image box and observe which sections in the Chrome Developer Tools panel contain the information you need, such as the product name and price.

Now scroll down the product page to locate the customer review section. Hover over the items within the red square boxes in the image below and observe the elements that are automatically highlighted in the Chrome Developer Tools layout. In most cases, you will find that the information within an HTML element is nested inside another HTML element.

Scraping Amazon using Python and proxies will boost efficiency, minimize errors, and save you time. You’ll also be able to scrape multiple pages without getting worn out, inserting the wrong data, or getting blocked by Amazon’s anti-scraping mechanisms.

Challenges with scraping Amazon

Amazon values the data it gathers from millions of users on its platform, which is why it actively protects it. Although Amazon does not have a specific policy against scraping, you are likely to encounter some restrictions while scraping data automatically from its website. Amazon has implemented anti-scraping mechanisms that make it difficult to scrape data for long without getting blocked.

Some of these mechanisms include:

  • CAPTCHAs: Amazon uses CAPTCHA to ensure that a user is human and not a bot. These challenges involve simple tasks, such as object identification, and are triggered when Amazon detects rapid requests or requests from suspicious IP addresses. This helps them limit automated scrapers that can’t solve CAPTCHAs.
  • IP rate limiting: Amazon limits the number of requests an IP address can make within a specified time frame. IP addresses that exceed these thresholds are subject to getting temporarily or permanently blocked.
  • Dynamic content loading: To optimize user experience, Amazon often loads content dynamically. This makes it more difficult for scrapers to extract data directly from the HTML. To scrape data from a site with dynamic content, you’ll need a scraper that can intercept network requests.

Using a rotating residential proxy, such as The Social Proxy’s residential proxy, can help bypass these anti-scraping mechanisms.
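
Even with a proxy in place, it’s worth checking each response for signs of a block before parsing it. Below is a minimal sketch; the HTTP 503 status and the “captcha” marker are assumptions based on what Amazon’s robot-check page typically returns, not an official API:

def looks_blocked(page):
    # Amazon commonly answers throttled scrapers with HTTP 503
    if page.status_code == 503:
        return True
    # the robot-check page asks the visitor to solve a captcha
    return "captcha" in page.text.lower()

Call this on each response and back off, or rotate the proxy, whenever it returns True.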

Tools and setup

To build this scraper, make sure you have the following:
  • Python: the programming language we’ll be using.
  • Requests: a library that handles HTTP(S) calls and fetches web pages.
  • BeautifulSoup4: a Python library for extracting information from web pages.
  • Pandas: a Python library for data manipulation. We’ll use it to create the CSV file.
  • re: Python’s built-in regular expression module, used for pattern matching.
  • os: a built-in Python module for interacting with the operating system.
  • Your residential proxy credentials from The Social Proxy.

To get started:

  • Download a version of Python that’s compatible with your operating system from python.org.
  • Install a virtual environment package: Navigate to your project folder using your command line interface (CLI) and execute the following command to install Pipenv:

pip install pipenv
  • Create a virtual environment for this project in that folder. Run pipenv shell in your terminal.
  • Install required packages: To download the necessary packages, use this command:

pipenv install requests pandas beautifulsoup4

Follow this guide to set up a residential proxy with The Social Proxy.

You’ll need to add the host, username, and password information to your Amazon scraper script, so copy these credentials from your dashboard.
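
Before scraping, you can optionally confirm that traffic is flowing through the proxy by requesting an IP-echo service. The snippet below is a minimal check; httpbin.org/ip is used purely as an example endpoint, and the placeholder credentials are assumptions:

import requests

proxy_url = "http://USERNAME:PASSWORD@<residential>.thesocialproxy.com:10000"
proxies = {"http": proxy_url, "https": proxy_url}

# the echoed address should belong to the proxy, not your machine
print(requests.get("https://httpbin.org/ip", proxies=proxies, timeout=30).json())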

A step-by-step guide to scraping Amazon product data

In this section, you’ll create a Python file and write functions to scrape and extract data from this Amazon product page. The following steps will guide you through the process.

Step 1

Create a new Python file in your project folder and save it with a preferred name, for example, amazon.py. Import the required packages into the script. Copy the URL and assign it to the url variable.

# import the required libraries
import os
import re
import requests
import pandas as pd
from bs4 import BeautifulSoup

url = "https://www.amazon.com/HINOMI-Ergonomic-Foldable-Suitable-Computer/dp/B0CHRQWNCH/?_encoding=UTF8&pd_rd_w=ehyH1&content-id=amzn1.sym.4bba068a-9322-4692-abd5-0bbe652907a9&pf_rd_p=4bba068a-9322-4692-abd5-0bbe652907a9&pf_rd_r=7VDVZANP4F104RGPP3ZT&pd_rd_wg=piGQB&pd_rd_r=fe8c79ea-20d0-46b2-bbba-af6e30e2c1aa&ref_=pd_hp_d_btf_nta-top-picks&th=1"

Step 2

To avoid being blocked by Amazon, configure The Social Proxy’s residential proxies. Copy your credentials from the proxy dashboard and assign them to the appropriate variables. Replace the values of proxy_host, proxy_username, and proxy_password with your own host, username, and password.

# add your proxy details
proxy_host = "<residential>.thesocialproxy.com"  # add your host
proxy_port = "10000"
proxy_username = "USERNAME"  # add your username
proxy_password = "PASSWORD"  # add your password
proxy_url = f"http://{proxy_username}:{proxy_password}@{proxy_host}:{proxy_port}"

proxies = {"http": proxy_url, "https": proxy_url}

Step 3

Write the script to extract data from Amazon. Configure your scraper to mimic a typical browser by adding headers and their corresponding values to your script. Create a new session object from the requests library and include the “proxies” defined in Step 2. Use the session object to make a request to the specified URL.

header = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/96.0.4664.93 Safari/537.36',
    'Accept-Language': 'en-US,en;q=0.9'
}

# set up a session and attach the proxy to it
session = requests.Session()
session.proxies = proxies

# make a request to the web page through the proxy
page = session.get(url, headers=header)
# stop if the page did not load successfully
assert page.status_code == 200

Step 4

Use the BeautifulSoup library to extract the product title, price, and description. Instantiate BeautifulSoup to parse the page content, and use the find or find_all functions to gather the necessary information. The find function returns the first matching element, while find_all returns a list of all elements that share a common tag name. To better understand the difference between find and find_all, refer to this article. Pass the HTML tag names and attributes that correspond to the product title, price, and description as arguments to find or find_all.
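
As a quick illustration of the difference, using the soup object created in the snippet below:

# find returns the first matching element (or None if nothing matches)
first_bullet = soup.find('li', class_='a-spacing-mini')
# find_all returns a list of every matching element
all_bullets = soup.find_all('li', class_='a-spacing-mini')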

soup = BeautifulSoup(page.content, 'html.parser')

# extract the title
title = soup.find('span', attrs={'id': 'productTitle'}).get_text(strip=True)
print(title)

# extract the price
price = soup.find('span', attrs={'class': 'a-offscreen'}).get_text(strip=True)
print(price)

# extract the description
feature_bullets = soup.find('div', id='feature-bullets')
if feature_bullets:
    ul = feature_bullets.find('ul', class_='a-unordered-list')
    if ul:
        # extract text from each 'li' tag within the 'ul'
        for li in ul.find_all('li', class_='a-spacing-mini'):
            bullet_text = li.get_text(strip=True)
            print(bullet_text)

Run the Python file in your terminal. You should receive an output similar to the one in the image below.

Make sure you delete the Step 4 code snippet from your script before you test the functions in Step 5.

Step 5

Extract additional data, such as similar products and customer reviews (reviewer name, rating, location, and review text).
Writing Python functions enhances code reusability, which is why we’re going to package all the code snippets into Python functions. We’ll combine the code snippets from Steps 3 and 4 into two functions. The first function, get_page_content, accepts two arguments, url and proxies, and returns the parsed page content from BeautifulSoup. The second, get_product_information, extracts the product information as demonstrated in Step 4.

def get_page_content(url, proxies):

    header = {
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/96.0.4664.93 Safari/537.36',
        'Accept-Language': 'en-US,en;q=0.9'
    }

    # set up a session and attach the proxy to it
    session = requests.Session()
    session.proxies = proxies

    # access the website page through the proxy
    page = session.get(url, headers=header)
    # check that the website loaded successfully
    assert page.status_code == 200

    # parse the HTML content of the page
    soup = BeautifulSoup(page.content, 'html.parser')

    return soup


def get_product_information(soup):
    # get the title of the product
    title = soup.find('span', attrs={'id': 'productTitle'}).get_text(strip=True)
    # get the price of the product
    price = soup.find('span', attrs={'class': 'a-offscreen'}).get_text(strip=True)

    feature_bullets = soup.find('div', id='feature-bullets')
    # an empty list where the description bullets will be collected
    description = []
    if feature_bullets:
        ul = feature_bullets.find('ul', class_='a-unordered-list')
        if ul:
            # extract text from each 'li' tag within the 'ul'
            for li in ul.find_all('li', class_='a-spacing-mini'):
                description.append(li.get_text(strip=True))

    # combine all bullets into a single text
    # (an empty string if no description was found)
    text = ' '.join(description)

    # return the title, price, and description of the product
    return title, price, text

Next, write a function to extract the title and price of products similar to the one we scraped. These products are listed under the “Discover similar items” section of the webpage.
The get_similar_products() function takes a single argument: one item box from that section. It uses the .find method to locate the elements that hold the item’s title and price, and it returns them as a tuple for each similar product.

def get_similar_products(item):
    # extract the product title from the image alt text
    product_tag = item.find('img', class_='shopbylook-btf-image-elem')
    product_title = product_tag['alt'] if product_tag else 'No product_tag alt text found'

    # extract the price of the similar product
    price_tag = item.find('span', class_='a-offscreen')
    price = price_tag.get_text(strip=True) if price_tag else 'No price found'

    # extract the price fraction
    fraction_tag = item.find('span', class_='a-price-fraction')
    price_fraction = fraction_tag.get_text(strip=True) if fraction_tag else 'No price fraction found'

    # combine the price and price fraction into a single price,
    # e.g., $200.20
    final_price = price + "." + price_fraction

    similar_product_data = product_title, final_price

    return similar_product_data

Test the functions with the code snippet below:

# Delete this code block after running your script
soup = get_page_content(url, proxies)
title, price, description = get_product_information(soup)
items_section = soup.find_all('div', class_='shopbylook-btf-item-box')
for item in items_section:
    print(get_similar_products(item))

Write a function to extract the reviewer’s name, the reviewer’s location, the rating, and the comment from the review section of the page. As in previous cases, pass the HTML tag names and attributes that contain the required information to the find function. Use the regex module, imported as re, to extract the country name and the rating.

# The function that gets the product review information.
def get_review_details(review):

    # extract the reviewer's name
    name = review.find("span", class_="a-profile-name").get_text(strip=True)

    # default values in case an element is missing from the review
    country, rating, review_text = "Unknown", "Unknown", ""

    location_tag = review.find("span", class_='a-size-base a-color-secondary review-date')
    if location_tag:
        location = location_tag.get_text(strip=True)
        # extract the country name using regex
        country_match = re.search(r'Reviewed in ([a-zA-Z\s]+) on', location)
        country = country_match.group(1) if country_match else "Unknown"

    # extract the reviewer's rating
    rating_tag = review.find("span", class_='a-icon-alt')
    if rating_tag:
        rate = rating_tag.get_text(strip=True)
        # extract the numeric rating using regex
        rating_match = re.search(r'(\d+\.\d+|\d+)', rate)
        rating = rating_match.group(1) if rating_match else "Unknown"

    # extract the review text
    review_body = review.find('div', class_='review-text-content')
    if review_body:
        review_text = review_body.find('span').get_text(separator=' ').strip()

    return name, country, rating, review_text

Test the get_review_details() function with the code snippet below:

# Delete this code block after running your script
review_details = []
foreign_reviews = soup.find_all('div', class_='a-section review aok-relative cr-desktop-review-page-0')
for review in foreign_reviews:
    name, country, rating, review_text = get_review_details(review)
    review_details.append([name, country, rating, review_text])
print(review_details)

Note: Remember to delete all the function-testing code snippets from your Python script after the tests.

Storing scraped data in CSV format

The code block below defines a function called extract_info_to_dataframe. It takes two arguments, calls the functions written earlier, creates a DataFrame, and saves the DataFrame as a CSV file. The pandas library, imported as pd, is used to convert the lists into DataFrame columns and to save the result as a CSV file using the .to_csv function. Add the code block to your script:

file_path = 'amz_data.csv'

def extract_info_to_dataframe(url, proxies):

    # get the page content
    soup = get_page_content(url, proxies)

    # extract the product information
    title, price, description = get_product_information(soup)
    product_info = [
        {"Title": title,
         "Price": price,
         "Description": description}]

    # extract the similar products
    similar_products = []
    items_section = soup.find_all('div', class_='shopbylook-btf-item-box')
    if not items_section:
        print("No items found with the class 'shopbylook-btf-item-box'.")
    else:
        for item in items_section:
            product_title, final_price = get_similar_products(item)
            similar_products.append({
                "Similar Product Title": product_title,
                "Similar Product Price": final_price})

    # extract the reviews
    review_details = []

    # reviews from users in countries outside the USA
    foreign_reviews = soup.find_all('div', class_='a-section review aok-relative cr-desktop-review-page-0')
    for review in foreign_reviews:
        name, country, rating, review_text = get_review_details(review)
        review_details.append({
            "Reviewer Name": name,
            "Location": country,
            "Rating": rating,
            "Review Text": review_text
        })

    # reviews from users in the USA
    us_reviews = soup.find_all('div', attrs={'data-hook': 'review', 'class': 'a-section review aok-relative'})
    for review in us_reviews:
        name, country, rating, review_text = get_review_details(review)
        review_details.append({
            "Reviewer Name": name,
            "Location": country,
            "Rating": rating,
            "Review Text": review_text
        })

    # the product information as a one-row DataFrame
    h = pd.DataFrame.from_dict(product_info)

    # convert the similar-products list to a pandas Series of dictionaries
    i = pd.Series(similar_products)

    # convert the reviews list to a pandas Series of dictionaries
    j = pd.Series(review_details)

    # this produces a DataFrame in which the similar-product dictionaries
    # sit in a column named 0 and the review dictionaries in a column named 1
    df = pd.concat([h, i, j], axis=1)

    # expand the dictionaries in column 0 into their own columns
    sp = pd.json_normalize(df[0])
    sp.columns = ['Similar Products Title', 'Similar Products Price']

    # expand the dictionaries in column 1 into their own columns;
    # the key order in review_details is name, location, rating, text
    rp = pd.json_normalize(df[1])
    rp.columns = ['Reviewer Name', 'Location', 'Rating', 'Review Text']

    # concatenate the new columns with the original DataFrame
    df = pd.concat([df, sp], axis=1)
    df = pd.concat([df, rp], axis=1)
    # drop the now-redundant dictionary columns
    df.drop(columns=[0, 1], inplace=True)

    # delete any existing output file before saving
    if os.path.exists(file_path):
        os.remove(file_path)

    # save the DataFrame as a CSV file
    df.to_csv(file_path, index=False)

if __name__ == "__main__":
    extract_info_to_dataframe(url, proxies)

To scrape the data into a CSV file, run your script in your terminal, then look for amz_data.csv in your project folder. It will appear as shown in the image below.
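
To verify the export, you can load the file back with pandas:

import pandas as pd

df = pd.read_csv('amz_data.csv')
print(df.head())              # first few rows
print(df.columns.tolist())    # column names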

View the entire script for scraping Amazon with Python on GitHub.

Why scrape Amazon data?

Amazon data provides comprehensive product listings, descriptions, specifications, similar products, and user reviews and ratings from all over the world. This rich product information holds immense value for various stakeholders in the e-commerce business. Whether you’re a consumer looking to monitor price changes and compare similar products and reviews, a business owner tracking competitor prices and customer feedback, or a developer who wants to build a unique Amazon data product, scraping gives you access to the latest information.

Scraping Amazon data provides insights that can empower business owners to make informed decisions for strategic planning. For example, you can scrape reviews on products to understand consumer preferences regarding products and service delivery. These insights can then be incorporated into targeted advertising to enhance customer service practices.

Data scientists can use this information to build models that predict product prices, perform market analysis, and conduct sentiment analysis. These insights provide entrepreneurs with a competitive advantage over other players in the same niche. Additionally, a custom data application can be developed for specific stakeholders. For instance, an application can track price changes for certain products over a given period.
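
As a sketch of that last idea, reusing the get_page_content and get_product_information functions from this tutorial, a script run on a schedule (for example, via cron) could append a timestamped price row to a history file. The file name and schedule here are assumptions:

import csv
from datetime import datetime, timezone

def log_price(url, proxies, path='price_history.csv'):
    soup = get_page_content(url, proxies)            # from this tutorial
    title, price, _ = get_product_information(soup)  # from this tutorial
    with open(path, 'a', newline='') as f:
        csv.writer(f).writerow([datetime.now(timezone.utc).isoformat(), title, price])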

Conclusion

With just a few steps, you can scrape all kinds of product and review data from Amazon. Whether you’re a business owner or e-commerce marketer looking to analyze products and compare prices, a data analyst building Amazon portfolio projects, or a developer or entrepreneur creating tools for Amazon users, you can benefit from scraping Amazon.

The Social Proxy’s residential proxies let you bypass Amazon’s rate limits, CAPTCHAs, and IP blocks by rotating IPs, hiding your real IP address, and mimicking a real browser during your scraping activities. Build your next idea using these residential proxies and explore all the potential use cases that come to mind without the risk of being blocked.
