While scraping data from Amazon is legal, Amazon takes various measures to hinder frequent automated scraping. These restrictions can be bypassed using The Social Proxy’s residential proxies. In this Amazon product scraper tutorial, you’ll learn how to use Python and the BeautifulSoup library to scrape product data. At the end, you’ll export the data into a CSV file for further analysis.
For this tutorial, you’ll use an office armchair listing to illustrate how an Amazon product page is structured and to identify the key webpage components to focus on during scraping. Once you open the product link, you’ll be directed to a page that looks like the image below.
Every Amazon product page contains the same core elements: a product title, an ASIN (Amazon Standard Identification Number), a price, a product description, a “Discover similar items” section, and customer reviews.
You’ll need to inspect the HTML layout of a product page to extract specific information. This will allow you to locate the web page elements containing the data you need and simplify the extraction process. We’ll use Chrome Developer Tools to review how to inspect HTML elements.
To open Chrome Developer Tools, press Ctrl + Shift + I, or click on the three vertical/horizontal dots in the top-right corner (depending on your browser) and select More tools > Developer tools. This will display the Chrome Developer Tools panel alongside the Amazon product page. Click on the icon within the red square box (see image below) and drag your cursor over any product information you want to inspect. This will automatically highlight the corresponding HTML element in the Chrome Developer Tools layout. Familiarize yourself with the product page and observe the HTML elements in the layout.
The image below shows the HTML element that is highlighted when you hover over the product title. The product title is located within an HTML element with the tag name span and the attributes id="productTitle" and class="a-size-large product-title-word-break".
To locate the product ASIN, look for the attribute data-asin (a unique block of 10 characters). On every Amazon web page, a product’s ASIN is assigned to the data-asin attribute of a specific HTML element, as shown in the image below.
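The main script in this tutorial doesn’t extract the ASIN, but if you need it, here’s a minimal sketch of how you could pull it out with BeautifulSoup. Note that page_html is a placeholder for the raw HTML you’ll fetch through the proxy in Step 4, and the 10-character filter reflects the ASIN length mentioned above.
from bs4 import BeautifulSoup
# page_html is a hypothetical variable holding the product page's HTML
soup = BeautifulSoup(page_html, 'html.parser')
# several elements can carry data-asin, and some carry it empty,
# so match only values that are exactly 10 characters long
asin_tag = soup.find(attrs={'data-asin': lambda v: v is not None and len(v) == 10})
if asin_tag:
    print('ASIN:', asin_tag['data-asin'])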
Keeping the Chrome Developer Tools layout open, scroll down the page until you reach the header “Discover similar items.” As before, move your cursor over each image box and observe which sections within the Chrome Developer Tools hold the information you need, such as the product name and price.
Now scroll down the product page to locate the customer review section. Hover over the items within the red square boxes in the image below and observe the elements that are automatically highlighted in the Chrome Developer Tools layout. In most cases, you will find that the information within an HTML element is nested inside another HTML element.
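In practice, reaching nested data means chaining find calls: locate the outer element first, then search inside it. The fragment below is a made-up example that mimics this structure; it is not Amazon’s actual markup.
from bs4 import BeautifulSoup
# contrived fragment: the review text sits in a span nested inside a div
html = '<div class="review-text-content"><span>Great chair!</span></div>'
soup = BeautifulSoup(html, 'html.parser')
# chain the lookups: outer div first, then the span inside it
outer = soup.find('div', class_='review-text-content')
inner = outer.find('span') if outer else None
print(inner.get_text(strip=True) if inner else 'Not found')  # Great chair!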
Scraping Amazon using Python and proxies will boost efficiency, minimize errors, and save you time. You’ll also be able to scrape multiple pages without getting worn out, inserting the wrong data, or getting blocked by Amazon’s anti-scraping mechanisms.
Amazon values the data it gathers from millions of users on its platform, which is why it actively protects it. Although Amazon does not have a specific policy against scraping, you are likely to encounter restrictions while scraping its website automatically. Amazon has implemented anti-scraping mechanisms that make it difficult to scrape data for long without getting blocked.
Some of these mechanisms include rate limiting, CAPTCHAs, and IP blocks.
Using a rotating residential proxy, such as The Social Proxy’s residential proxy, can help bypass these anti-scraping mechanisms.
To get started, install pipenv, then use it to install the required packages:
pip install pipenv
pipenv install requests pandas beautifulsoup4
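Optionally, activate the new virtual environment so that Python and the installed packages resolve inside it:
pipenv shell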
Follow this guide to set up a residential proxy with The Social Proxy.
You’ll need to add the host, username, and password to your Amazon scraper script, so copy the relevant information.
In this section, you’ll create a Python file and write functions to scrape and extract data from this Amazon product page. The following steps will guide you through the process.
Create a new Python file in your project folder and save it with a preferred name, for example, amazon.py. Import the required packages into the script. Copy the URL and assign it to the url variable.
#import the required libraries
import os
import re
import requests
import pandas as pd
from bs4 import BeautifulSoup
url ="https://www.amazon.com/HINOMI-Ergonomic-Foldable-Suitable-Computer/dp/B0CHRQWNCH/?_encoding=UTF8&pd_rd_w=ehyH1&content-id=amzn1.sym.4bba068a-9322-4692-abd5-0bbe652907a9&pf_rd_p=4bba068a-9322-4692-abd5-0bbe652907a9&pf_rd_r=7VDVZANP4F104RGPP3ZT&pd_rd_wg=piGQB&pd_rd_r=fe8c79ea-20d0-46b2-bbba-af6e30e2c1aa&ref_=pd_hp_d_btf_nta-top-picks&th=1"
To avoid being blocked by Amazon, configure The Social Proxy’s residential proxies. Copy your credentials from the proxy dashboard and assign them to the appropriate variables. Replace proxy_host, proxy_username, and proxy_password with the corresponding values for host, username, and password, respectively.
#add your proxy details
proxy_host = ".thesocialproxy.com" #add your host
proxy_port = "10000"
proxy_username = "USERNAME" #add your username
proxy_password = "PASSWORD" #add your password
#embed the credentials in the proxy URL
proxy_url = f"http://{proxy_username}:{proxy_password}@{proxy_host}:{proxy_port}"
#use the proxy for both HTTP and HTTPS traffic
proxies = { "http": proxy_url, "https": proxy_url }
Write the script to extract data from Amazon. Configure your scraper to mimic a typical browser by adding headers and their corresponding values to your script. Create a new session object from the requests library and include the proxies defined in Step 2. Use the session object to make a request to the specified URL.
header = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/96.0.4664.93 Safari/537.36',
    'Accept-Language': 'en-US,en;q=0.9'
}
# Set up a session
session = requests.Session()
session.proxies = proxies
# make a request to a web page through the proxy
page = session.get(url, headers=header)
assert page.status_code == 200
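The assert is fine for a quick test. If you’d rather fail with a more descriptive error, requests also offers raise_for_status() as an optional variation:
#optional alternative to the bare assert: raise_for_status() raises
#requests.HTTPError with the status code on any 4xx/5xx response
page = session.get(url, headers=header, timeout=30)
page.raise_for_status()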
Use BeautifulSoup to extract the product title, price, and description. Instantiate the BeautifulSoup library to parse the page, then use the find or find_all functions to gather the necessary information: find returns the first matching element, while find_all returns a list of all matching elements (a short contrast example follows this step). To better understand the difference between find and find_all, refer to this article. Pass the HTML elements that correspond to the product title, price, and description as arguments to either function.
soup = BeautifulSoup(page.content, 'html.parser')
# Extract the title
title = soup.find('span', attrs={'id': 'productTitle'}).get_text(strip=True)
print(title)
# Extract the price
price = soup.find('span', attrs={'class': 'a-offscreen'}).get_text(strip=True)
print(price)
# Extract the description
feature_bullets = soup.find('div', id='feature-bullets')
if feature_bullets:
    ul = feature_bullets.find('ul', class_='a-unordered-list')
    if ul:
        # Extract text from each 'li' tag within the 'ul'
        for li in ul.find_all('li', class_='a-spacing-mini'):
            bullet_text = li.get_text(strip=True)
            print(bullet_text)
Run the Python file in your terminal. You should receive an output similar to the one in the image below.
Make sure you delete the Step 4 code snippet from your script before you test the functions in Step 5.
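As mentioned in Step 4, find and find_all behave differently. Here’s a quick contrast on a contrived HTML fragment (not Amazon’s markup); run it separately if you want to see the difference firsthand.
from bs4 import BeautifulSoup
html = '<ul><li>first</li><li>second</li><li>third</li></ul>'
soup = BeautifulSoup(html, 'html.parser')
print(soup.find('li').get_text())  #'first': only the first match
print([li.get_text() for li in soup.find_all('li')])  #['first', 'second', 'third']: every match, as a list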
Extract additional data, such as similar products and customer reviews (reviewer name, rating, location, and review text).
Writing Python functions enhances code reusability, which is why we’re going to package all the code snippets into Python functions. We’ll combine the code snippets from Steps 3 and 4 into two functions. The first function, get_page_content, accepts two arguments, url and proxies, and returns the parsed website data from the BeautifulSoup library. The second, get_product_information, extracts the product information as demonstrated in Step 4.
def get_page_content(url, proxies):
    header = {
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/96.0.4664.93 Safari/537.36',
        'Accept-Language': 'en-US,en;q=0.9'
    }
    # Set up a session
    session = requests.Session()
    # add the proxy to the session
    session.proxies = proxies
    # access the website page through the proxy
    page = session.get(url, headers=header)
    # check that the website was loaded successfully
    assert page.status_code == 200
    # parse and return the HTML content of the page
    soup = BeautifulSoup(page.content, 'html.parser')
    return soup
def get_product_information(soup):
    # get the title of the product
    title = soup.find('span', attrs={'id': 'productTitle'}).get_text(strip=True)
    # get the price of the product
    price = soup.find('span', attrs={'class': 'a-offscreen'}).get_text(strip=True)
    feature_bullets = soup.find('div', id='feature-bullets')
    # an empty list where the description text will be saved
    description = []
    if feature_bullets:
        ul = feature_bullets.find('ul', class_='a-unordered-list')
        if ul:
            # extract text from each 'li' tag within the 'ul'
            for li in ul.find_all('li', class_='a-spacing-mini'):
                bullet_text = li.get_text(strip=True)
                description.append(bullet_text)
    # combine all description bullets into a single text
    text = ' '.join(description)
    # return the title, price, and description of the product
    return title, price, text
Write two functions to extract the title and price of products similar to the one we scraped. These products are listed under the “Discover similar items” section of the webpage.
Pass only one argument into the get_similar_products() function. This function uses the .find method to search for the elements that hold each similar item’s title and price, and it returns the title and price for each similar product.
def get_similar_products(item):
    # Extract the product title from the image's alt text
    product_tag = item.find('img', class_='shopbylook-btf-image-elem')
    product_title = product_tag['alt'] if product_tag else 'No product_tag alt text found'
    # Extract the price of the similar product
    price_tag = item.find('span', class_='a-offscreen')
    price = price_tag.get_text(strip=True) if price_tag else 'No price found'
    # Extract the price fraction
    fraction_tag = item.find('span', class_='a-price-fraction')
    price_fraction = fraction_tag.get_text(strip=True) if fraction_tag else 'No price fraction found'
    # combine the price and price fraction into a single price, e.g., $200.20
    final_price = price + "." + price_fraction
    similar_product_data = product_title, final_price
    return similar_product_data
Test the functions with the code snippet below:
#Delete this code block after running your script
soup = get_page_content(url, proxies)
title, price, description = get_product_information(soup)
items_section = soup.find_all('div', class_='shopbylook-btf-item-box')
for item in items_section:
    print(get_similar_products(item))
Write a function to extract the reviewer’s name, reviewer’s location, rating, and the comment from the reviewer section of the page. As in previous cases, pass the HTML elements that contain the required information through the find function. Use the regex module, imported as re, to extract the country name and the rating.
# The function that extracts the review details
def get_review_details(review):
    # default values in case any element is missing from the review
    country = "Unknown"
    rating = "Unknown"
    review_text = ""
    # Extract the reviewer's name
    name = review.find("span", class_="a-profile-name").get_text(strip=True)
    # Extract the reviewer's location
    location_tag = review.find("span", class_='a-size-base a-color-secondary review-date')
    if location_tag:
        location = location_tag.get_text(strip=True)
        # Extract the country name using regex
        country_match = re.search(r'Reviewed in ([a-zA-Z\s]+) on', location)
        country = country_match.group(1) if country_match else "Unknown"
    # Extract the reviewer's rating
    rating_tag = review.find("span", class_='a-icon-alt')
    if rating_tag:
        rate = rating_tag.get_text(strip=True)
        # extract the numeric rating using regex
        rating_match = re.search(r'(\d+\.\d+|\d+)', rate)
        rating = rating_match.group(1) if rating_match else "Unknown"
    # Extract the review text
    review_body = review.find('div', class_='review-text-content')
    if review_body:
        review_text = review_body.find('span').get_text(separator=' ').strip()
    return name, country, rating, review_text
Test the get_review_details() function with the code snippet below:
#Delete this code block after running your script
foreign_reviews = soup.find_all('div', class_='a-section review aok-relative cr-desktop-review-page-0')
review_details = []
for review in foreign_reviews:
    name, country, rating, review_text = get_review_details(review)
    review_details.append([name, country, rating, review_text])
print(review_details)
Note: Remember to delete all of the test code snippets from your Python script after testing.
The code block below is a function called extract_info_to_dataframe. It takes two arguments, calls the other functions, creates a DataFrame, and saves the DataFrame as a CSV file. The pandas library, imported as pd, converts the lists into a DataFrame through the pandas.DataFrame function and saves the DataFrame as a CSV file using the .to_csv function. Add the code block to your script:
file_path = 'amz_data.csv'
def extract_info_to_dataframe(url, proxies):
    # Get the page content
    soup = get_page_content(url, proxies)
    # Extract the product information
    title, price, description = get_product_information(soup)
    product_info = [
        {"Title": title,
         "Price": price,
         "Description": description}]
    # Extract the similar products
    similar_products = []
    items_section = soup.find_all('div', class_='shopbylook-btf-item-box')
    if not items_section:
        print("No items found with the class 'shopbylook-btf-item-box'.")
    else:
        for item in items_section:
            product_title, final_price = get_similar_products(item)
            similar_products.append({
                "Similar Product Title": product_title,
                "Similar Product Price": final_price})
    # Extract the reviews
    review_details = []
    # Reviews from users in countries outside the USA
    foreign_reviews = soup.find_all('div', class_='a-section review aok-relative cr-desktop-review-page-0')
    for review in foreign_reviews:
        name, country, rating, review_text = get_review_details(review)
        review_details.append({
            "Reviewer Name": name,
            "Location": country,
            "Rating": rating,
            "Review Text": review_text
        })
    # Reviews from users in the USA
    us_reviews = soup.find_all('div', attrs={'data-hook': 'review', 'class': 'a-section review aok-relative'})
    for review in us_reviews:
        name, country, rating, review_text = get_review_details(review)
        review_details.append({
            "Reviewer Name": name,
            "Location": country,
            "Rating": rating,
            "Review Text": review_text
        })
    # Build a DataFrame from the product information
    h = pd.DataFrame.from_dict(product_info)
    # convert the similar-products list to a pandas Series
    i = pd.Series(similar_products)
    # convert the review list to a pandas Series
    j = pd.Series(review_details)
    # this produces a DataFrame where the similar-product dictionaries
    # sit in a column named 0 and the review dictionaries in a column named 1
    df = pd.concat([h, i, j], axis=1)
    # Expand the dictionaries in column 0 into their own columns
    sp = pd.json_normalize(df[0])
    sp.columns = ['Similar Products Title', 'Similar Products Price']
    # Expand the dictionaries in column 1 into their own columns
    rp = pd.json_normalize(df[1])
    rp.columns = ['Reviewer Name', 'Location', 'Rating', 'Review Text']
    # Concatenate the new columns with the original DataFrame
    df = pd.concat([df, sp], axis=1)
    df = pd.concat([df, rp], axis=1)
    # Drop the now-redundant dictionary columns
    df.drop(columns=[0, 1], inplace=True)
    if os.path.exists(file_path):
        # If the file already exists, delete it
        os.remove(file_path)
    # Save the new DataFrame (it will overwrite if the file existed)
    df.to_csv(file_path, index=False)
if __name__ == "__main__":
    extract_info_to_dataframe(url, proxies)
To scrape the data into a CSV file, run your script in your terminal. Next, look for amz_data.csv in your project folder; it will appear as shown in the image below.
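You can also verify the export programmatically by reading the file back with pandas:
import pandas as pd
#quick sanity check: load the exported file and preview it
df = pd.read_csv('amz_data.csv')
print(df.shape)  #(rows, columns)
print(df.head())  #first five rows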
View the entire script for scraping Amazon with Python on GitHub.
Amazon data provides comprehensive product listings, descriptions, specifications, similar products, and user reviews and ratings from all over the world. This rich product information holds immense value for various stakeholders in the e-commerce business. Whether you’re a consumer looking to monitor price changes and compare similar products and reviews, a business owner tracking competitor prices and customer feedback, or a developer who wants to build a unique Amazon data product, you can access the latest information.
Scraping Amazon data provides insights that can empower business owners to make informed decisions and strategic plans. For example, you can scrape product reviews to understand consumer preferences regarding products and service delivery, then use those insights to sharpen targeted advertising and improve customer service.
Data scientists can use this information to build models that predict product prices, perform market analysis, and conduct sentiment analysis. These insights provide entrepreneurs with a competitive advantage over other players in the same niche. Additionally, a custom data application can be developed for specific stakeholders. For instance, an application can track price changes for certain products over a given period.
With just a few steps, you can scrape all kinds of product and review data from Amazon. Whether you’re a business owner or e-commerce marketer looking to analyze products and compare prices, a data analyst building Amazon portfolio projects, or a developer or entrepreneur creating tools for Amazon users, you can benefit from scraping Amazon.
The Social Proxy’s residential proxy lets you bypass Amazon’s rate limits, CAPTCHAs, and IP blocks. It rotates IPs, hiding your real IP address and mimicking a real browser during your scraping activities. Build your next idea using this residential proxy and explore every use case that comes to mind without the risk of being blocked.