In a world where many businesses compete within the same niche, positive customer reviews can be invaluable in making a company stand out while providing important feedback that can be used to improve a business. Trustpilot is a world-leading online review platform with data from over 890,000 businesses, making it a central hub for reputation management and market analysis. With companies receiving more and more reviews each day, it can be hard to keep up. That’s where web scraping comes in: it can be used to automate the collection and storage of review data.
Web scraping is the automated extraction of data from the web using a programming language of your choice. Since web scraping can slow websites down, many sites implement CAPTCHAs, rate limiting, or IP blocking to detect and limit suspicious activity like frequent requests or large-scale data access. That’s where proxies come into play. A proxy shields your IP address and prevents it from being banned or blocked during scraping. In this tutorial, you’ll learn how to use the Selenium web scraping tool with Python to overcome Trustpilot’s anti-scraping defenses, analyze the data, and export it into a CSV file for further analysis.
To find a review for the company of your choice, follow these steps:
1. Visit Trustpilot’s website and enter a company’s name in the search bar.
2. A section titled “Reviews” will appear, showing ratings from 1 to 5 stars.
3. Click “Filter” to filter reviews by star rating, date, popular mentions, location, etc.
4. To filter reviews by verified status or those with replies, select the review option.
After making your selection, you’ll see each review arranged in white boxes, listed by date. Important review data includes the review title, review content, reviewer name, reviewer location, date posted, and date of experience.
To scrape this data, you need to pay attention to HTML tags such as <div>, <span>, and <p>, which carry the class names required to extract these elements. To locate them, open your browser’s developer tools (for example, right-click a review in Chrome and select “Inspect”); the Elements panel highlights the tag and class name of whatever element you hover over.
It’s important to consider that Trustpilot uses dynamic loading, meaning reviews only fully load when a user scrolls down the page. Also, pagination on Trustpilot requires using the “Next” button or clicking the next page number to advance.
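Here’s a minimal sketch of one way to handle that dynamic loading (an illustration on our part, not the tutorial’s own approach, which uses ActionChains later on). It assumes a driver object like the one initialized later in this tutorial, and scrolls until the page height stops growing:

import time

def scroll_to_bottom(driver, pause=2, max_scrolls=10):
    # Scroll down repeatedly until the page height stops growing
    last_height = driver.execute_script("return document.body.scrollHeight")
    for _ in range(max_scrolls):
        driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
        time.sleep(pause)  # give dynamically loaded reviews time to render
        new_height = driver.execute_script("return document.body.scrollHeight")
        if new_height == last_height:
            break  # no new content loaded; we've reached the real bottom
        last_height = new_height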
To start scraping Trustpilot’s data, you’ll need Python with a virtual environment, the Selenium and Selenium Wire packages, Google Chrome, and your mobile proxy credentials from The Social Proxy.
With the virtual environment activated, import the necessary modules:
# Import necessary modules for web scraping
from seleniumwire import webdriver  # Selenium Wire is needed for proxy support
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.common.exceptions import (
    NoSuchElementException,
    TimeoutException,
    ElementClickInterceptedException,
    StaleElementReferenceException,
)  # To handle common Selenium errors that come up
from selenium.webdriver.common.action_chains import ActionChains
import time
import csv
The time and csv modules will be used later in the code.
Now you’ll need your mobile proxy credentials. See The Social Proxy’s documentation for more information about how to set up a mobile proxy.
Mobile proxies are effective for scraping sites with strict security. Once you have access to a mobile proxy, use variables to set it up as shown:
proxy_host = "miami1.thesocialproxy.com"  # Example: "us.socialproxy.com"
proxy_port = "10000"  # Example: "12345"
proxy_username = "YOUR_USERNAME"
proxy_password = "YOUR_PASSWORD"
proxy_url = f"http://{proxy_username}:{proxy_password}@{proxy_host}:{proxy_port}"
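Before wiring the proxy into Selenium, you can optionally verify that it works with a quick request. This is a sketch of our own; the requests library and the httpbin.org endpoint are our additions, not part of the original setup:

import requests

# Confirm the proxy is reachable and see the exit IP it reports
proxies = {"http": proxy_url, "https": proxy_url}
try:
    resp = requests.get("https://httpbin.org/ip", proxies=proxies, timeout=15)
    print("Proxy OK, exit IP:", resp.json()["origin"])
except requests.RequestException as e:
    print("Proxy check failed:", e)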
Configure Selenium to use your proxy with Selenium Wire:
# Configure proxies with Selenium Wire
seleniumwire_options = {
    "proxy": {
        "http": proxy_url,
        "https": proxy_url,
    }
}
Configure Chrome Options for Selenium WebDriver:
# Configure Chrome Options
chrome_options = Options()
chrome_options.add_argument("--ignore-certificate-errors")
chrome_options.add_argument("--ignore-ssl-errors")
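Optionally (this flag is our addition, not part of the original setup), you can run Chrome in headless mode so that no browser window opens while the scraper runs:

# Optional: run Chrome without a visible window
chrome_options.add_argument("--headless=new")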
Initialize Chrome WebDriver:
# Initialize Chrome WebDriver, passing in the proxy configuration
driver = webdriver.Chrome(
    seleniumwire_options=seleniumwire_options, options=chrome_options
)
driver.get("https://www.trustpilot.com/review/thesocialproxy.com")
Next, you’ll scroll to elements and close cookie banners. Since Trustpilot uses dynamic loading, write a function to scroll to a specific element:
# Function to scroll to a specific element
def scroll_to_element(element):
    actions = ActionChains(driver)  # To perform complex actions in Selenium
    actions.move_to_element(element).perform()
Handle cookie banners with Selenium:
def close_cookie_banner():
    try:
        # Locate the cookie accept button by ID and wait until it is clickable
        cookie_button = WebDriverWait(driver, 5).until(
            EC.element_to_be_clickable((By.ID, "onetrust-accept-btn-handler"))
        )
        cookie_button.click()
        print("Cookie banner closed")
    except (NoSuchElementException, TimeoutException):
        print("No cookie banner found or unable to close it")
To scrape reviews across multiple pages, use the following function:
def click_next_page():
    try:
        # Locate the "Next" button by its name attribute and handle potential errors
        next_button = WebDriverWait(driver, 10).until(
            EC.presence_of_element_located((By.NAME, "pagination-button-next"))
        )
        scroll_to_element(next_button)
        try:
            next_button.click()
        except ElementClickInterceptedException:
            # If a normal click fails, fall back to a JavaScript click
            driver.execute_script("arguments[0].click();", next_button)
        return True
    except (NoSuchElementException, TimeoutException):
        print("Next page button not found.")
        return False
To extract the review data, use the following function (the except clause at the end handles reviews with missing or stale elements):
def get_reviews():
    # Wait for the reviews container to load before searching for review cards
    WebDriverWait(driver, 10).until(
        EC.presence_of_element_located(
            (By.CLASS_NAME, "styles_reviewsContainer__3_GQw")
        )
    )
    # Find every review card on the page and print its details
    elements = driver.find_elements(By.CLASS_NAME, "styles_reviewCardInner__EwDq2")
    print(f"Number of reviews found: {len(elements)}")
    for el in elements:
        try:
            head = el.find_element(
                By.CSS_SELECTOR,
                ".typography_heading-s__f7029.typography_appearance-default__AAY17",
            )
            content = el.find_element(By.CLASS_NAME, "styles_reviewContent__0Q2Tg")
            reviewer = el.find_element(By.CLASS_NAME, "link_internal__7XN06")
            date_posted = el.find_element(By.CLASS_NAME, "styles_reviewHeader__iU9Px")
            print(f"Title: {head.text}")
            content_text = content.text
            reviewer_text = reviewer.text
            # The reviewer's location and the date of experience sit on the
            # last line of their respective elements, so split on newlines
            reviewer_text_array = reviewer_text.split("\n")
            content_text_array = content_text.split("\n")
            date_of_experience = content_text_array[-1]
            location = reviewer_text_array[-1]
            print(f"Content: {content_text}")
            print(f"Reviewer: {reviewer_text}")
            print(f"Date Posted: {date_posted.text}")
            print("------------------------------------------------------------------")
        except (NoSuchElementException, StaleElementReferenceException) as e:
            print(f"Error extracting review details: {str(e)}")
Write a “for” loop to scrape through multiple pages of reviews:
for i in range(3):
    # 3 is the number of pages you want to scrape, so edit as needed
    print(f"PAGE NUMBER {i + 1}")
    get_reviews()
    if not click_next_page():
        print("No more pages to scrape.")
        break
    # Wait for the page to load after clicking next (this is why time was imported)
    time.sleep(5)
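As an optional tweak (our suggestion, not part of the original tutorial), you can randomize the delay between pages so the request pattern looks less robotic:

import random

time.sleep(random.uniform(3, 7))  # replaces the fixed time.sleep(5)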
When scraping is complete, close the browser:
driver.quit()
Congratulations! You have successfully scraped your desired data from Trustpilot!
Extracting the data is only the first step; to prepare it for analysis, you’ll need to store it in a CSV file.
Add the following code above the get_reviews() function, making sure that get_reviews() is indented one level deeper so it sits inside the with block, as shown below. Note that the for loop that calls get_reviews() must also stay inside the with block; otherwise the CSV file will already be closed when you try to write to it.
# Create csv file and write the header row
with open('review_1.csv', mode='w', newline='', encoding='utf-8') as csv_file:
    csv_writer = csv.writer(csv_file)
    csv_writer.writerow(["Title", "Content", "Reviewer", "Date Posted",
                         "Date of Experience", "Location"])

    def get_reviews():
Then add a csv_writer.writerow() call inside get_reviews(), at the end of the try block, just after the separator print and before the existing error handler:
            print("------------------------------------------------------------------")
            # Write the review data to the CSV file
            csv_writer.writerow([head.text, content.text, reviewer.text,
                                 date_posted.text, date_of_experience, location])
        except (NoSuchElementException, StaleElementReferenceException) as e:
            print(f"Error extracting review details: {str(e)}")
This creates a CSV file with the necessary information. The review data is now at your fingertips to do as you wish.
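As a quick sketch of that further analysis (assuming the pandas library is installed; it isn’t used elsewhere in this tutorial), you could load the CSV and summarize it:

import pandas as pd

# Load the scraped reviews and take a first look
df = pd.read_csv("review_1.csv")
print(df.head())  # preview the first few reviews
print(df["Location"].value_counts())  # where reviewers are located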
Trustpilot is a trusted platform for reviewing companies worldwide. This article provides a step-by-step guide to help you scrape Trustpilot’s reviews using The Social Proxy’s mobile proxy, store the data in a CSV file, and perform further analysis.
You can extract reviews from Trustpilot using your preferred scraping tool, such as Puppeteer, BeautifulSoup, or Selenium. Having development experience, or hiring a developer, will make the process simpler, especially when you need proxies.
Trustpilot reviews are valuable for gathering customer insights, evaluating a company’s communication, and identifying necessary product improvements. Someone may have already shared feedback about a service you’re considering—why not check it out?
There is a plethora of data you can scrape from Trustpilot, but for reviews, the most relevant include review text, ratings, and reviewer details. Other scrapable data is discussed in the section on understanding Trustpilot’s layout.
Web scraping of publicly available data is generally legal, though regulations such as the General Data Protection Regulation (GDPR) restrict how personal data can be collected and used. The businesses you scrape may also impose restrictions on how you can use their data, especially for commercial purposes, so it’s a good idea to check Trustpilot’s terms and conditions for any limitations.
Trustpilot’s public pages can be scraped, but high-frequency scraping can result in your IP getting blocked. We highly recommend using The Social Proxy’s mobile proxy to prevent your IP address from getting blocked.