Craigslist is a popular site for posting all kinds of items for sale, and real estate dealers are particularly fond of it because of the on-site traffic it generates. That makes it an ideal source for investors to scrape and analyze, whether to track the market or to develop algorithms that sharpen their investment decisions.
Platforms like Craigslist aren’t so easy to scrape, however: they employ various mechanisms to keep scrapers and bots from extracting data from the site. This is usually done with CAPTCHAs, which bots cannot solve without human help, or rate limiting, which disrupts data streaming. Luckily, proxy services can make your traffic look legitimate and avoid CAPTCHA confrontations in the first place. This article will demonstrate how to build a scraper from scratch in Python that uses proxies provided by The Social Proxy to bypass CAPTCHAs and rate limiting.
Experienced investors understand the value of data collection, especially in competitive markets like real estate. Popular websites with a broad user base like Craigslist hold enormous amounts of data, which makes them prime scraping targets. Collecting this data creates a competitive edge, allowing you to continuously monitor Craigslist listings for details such as price, surroundings, physical features, and dimensions.
With the advent of AI, this numerical and factual data can be fed to ML models that analyze it and surface insights, making it easy for investors to stay up to date with the latest listings through a summary.
This whole setup can live in your personal workspace as a dashboard. Having your own space to monitor data and experiment gives you a head start over competitors. The core of the setup, however, is the scraper, and it is the most challenging part to develop.
Platforms like Craigslist employ various mechanisms to keep bots from crawling their website, such as detecting suspicious traffic. When Craigslist suspects suspicious activity, it asks the visitor to solve a CAPTCHA – easy for humans, but difficult for bots.
Hence, the best approach is to keep these sites from suspecting our traffic in the first place by mimicking legitimate users. This convinces Craigslist that the visitor is legitimate and avoids triggering a CAPTCHA. Still, this alone is not enough for platforms with stricter detection; in those cases, we would need a CAPTCHA-solving service.
In addition, these websites block IPs when too many requests hit their servers. Using your own IP address puts you at risk of getting blacklisted from the site, which would disrupt your analytics. For this reason, we need an intermediary service that stands between you and Craigslist and exposes its IP instead of yours. If that IP gets blocked, your own access is unaffected. At the same time, this middleware must support a high request threshold and high throughput.
The Social Proxy provides high-quality proxies built to handle exactly this problem. The IP address attached to your traffic is responsible for a significant share of it: anti-scraping mechanisms are smart enough to detect scraping attempts and ban the offending IP from the site. If you use your own IP address, chances are you will get blacklisted.
For this purpose, The Social Proxy offers proxy services that are fast, reliable, and ethically sourced from devices worldwide. Craigslist never sees your IP address; all the traffic you send goes through the proxy servers, which shield it. If a proxy IP gets blocked, you can simply switch to another one.
Web scraping is the automated process of extracting information from a web page. A web app serves users HTML, CSS, and JavaScript files containing all the necessary data, which are processed and rendered in the user’s browser. Instead of rendering them, a scraper can fetch these files while pretending to be an ordinary user and parse them to extract the data.
Various techniques are used to mimic the behavior of an ordinary user, like using proxies. We will use a mobile proxy from The Social Proxy in this demonstration. Mobile proxies route your traffic through servers using the cellular network, making the target website believe that the traffic is coming from mobile devices. Since mobile devices are not popular options for scraping purposes, the traffic looks less suspicious.
These measures help reduce suspicion and the number of CAPTCHAs you encounter. Fewer detections mean less overhead from solving CAPTCHAs, and routing through different IPs lets you hide your own address and sidestep rate limiting. You can also switch servers whenever you want to keep the data stream smooth.
To understand the scraping process in depth, we will use the plain requests module in Python to handle HTTP(S) requests and fetch web pages. We’ll also use BeautifulSoup, a powerful Python package, to parse the web pages and extract the data.
To start, make sure you have Python installed or click here to install Python. This project will be built in a virtual environment. Skip to Step 2 if you already have one.
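If you don’t have one yet, you can create and activate a virtual environment with Python’s standard venv module (the environment name `venv` below is just an example):

python3 -m venv venv
source venv/bin/activate  # on Windows: venv\Scripts\activate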
To exit this environment, use the command `deactivate`.
Now we can install all of the dependencies using Pip3.
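The scraper only needs two external packages, requests and beautifulsoup4 (the json, time, and random modules ship with Python):

pip3 install requests beautifulsoup4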
Create a file called `scraper.py`, and you are ready to start building the program.
For this demonstration, we’ll use the New York Craigslist. Since we’re dealing with real estate, click on the real estate section on the home page, and you’ll be taken to this page. This is the target page we’ll be scraping.
Right-click on one of the listings and select “Inspect Element” on your browser. This will open the Developer Tools, where you can analyze the underlying code.
It only takes a quick glance to notice that each of these listings is wrapped in a list tag (<li>) in the HTML document. Digging a little deeper, we find that the list tags are arranged in a structure like the one sketched below.
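Here is a simplified, representative sketch of that markup (the class name matches the selector we use later in the code; the inner elements are approximate and may vary from listing to listing):

<li class="cl-static-search-result" title="...">
  <a href="https://newyork.craigslist.org/...">
    <!-- the listing title, price, and location appear inside this anchor -->
  </a>
</li>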
The href attribute inside each of these list tags is what interests us. It provides the link to the individual property page. Clicking one of these links will land you on a page like the one shown here:
Right-click anywhere on this page and click on “View Page Source”. This will disclose all of the links to the images, content, and other information.
If you explore more listings, you’ll notice that some pages contain images and location information while others don’t, so we need a scraper flexible enough to handle every type of page.
We’re going to scrape the real estate listing page and extract all URLs to individual property pages, then we’ll visit each page and extract all the information available. At the end, we’ll convert the data to JSON format, so you can integrate it with other data manipulation tools.
To start, you’ll need to import all the required modules into the code:
from bs4 import BeautifulSoup
import requests
import json
import time
import random
Next, we’ll set up the proxies and configure them for The Social Proxy. Head to The Social Proxy dashboard and copy your proxy credentials into these fields:
# Proxy details from The Social Proxy dashboard
proxy_host = ".thesocialproxy.com"
proxy_port = "10000"
proxy_username = "USERNAME"
proxy_password = "PASSWORD"

# Build the authenticated proxy URL and use it for both HTTP and HTTPS traffic
proxy_url = f"http://{proxy_username}:{proxy_password}@{proxy_host}:{proxy_port}"
proxies = { "http": proxy_url, "https": proxy_url }
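Before scraping, you can optionally confirm the proxy is active by sending a quick request to an IP-echo service (api.ipify.org here is just an example endpoint, not part of the original setup):

# Optional sanity check: the IP returned should belong to the proxy, not your machine
check = requests.get("https://api.ipify.org?format=json", proxies=proxies)
print(check.json())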
First we’ll pull the web page for the listings page:
session = requests.Session()
session.proxies = proxies
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.124 Safari/537.36',
    'Accept-Language': 'en-US,en;q=0.9',
    'Accept-Encoding': 'gzip, deflate, br',
    'Referer': 'https://example.com',
    'Connection': 'keep-alive'
}
response = session.get("https://newyork.craigslist.org/search/rts", headers=headers)
To do so, we create a session with requests. The requests module lets you send a GET request directly without a session, but here the session lets us mimic a real browser and apply the same configuration to every request that follows.
The whole response from Craigslist should now be stored in the response variable. For example, `response.text` contains the HTML content sent by Craigslist.
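Before parsing, it’s also worth confirming the request actually succeeded. A minimal check (not part of the original script) might look like this:

# Stop early if Craigslist did not return a successful response
if response.status_code != 200:
    raise RuntimeError(f"Request failed with status code {response.status_code}")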
Now we can use BeautifulSoup to parse the web page:
soup = BeautifulSoup(response.text, 'html.parser')
items = soup.find_all('li', class_='cl-static-search-result')
hrefs = []
for item in items:
    link = item.find('a')
    if link and 'href' in link.attrs:
        hrefs.append(link['href'])
Here, we instantiate a `soup` object, handing BeautifulSoup the `response.text` (the HTML document) along with its built-in HTML parser.
We then use the `soup` object to find all list items with the `cl-static-search-result` class, which we identified in the page source of this web page in the browser.
Then we create an empty `hrefs` list, iterate over all the items found by BeautifulSoup, and store each anchor’s href URL in the list. We’ll iterate over this list again later to fetch the data for the individual properties.
We need to define a function named `scrape_listing`:
def scrape_listing(url, session):
    response = session.get(url)
    soup = BeautifulSoup(response.text, 'html.parser')

    # Template for the data we want to collect from each listing
    listing_data = {
        "url": url,
        "images": [],
        "header": "",
        "content": "",
        "price": "",
        "location": "",
        "posting_date": "",
        "post_id": "",
        "contact_info": "",
        "additional_info": {}
    }

    title = soup.find("span", id="titletextonly")
    if title:
        listing_data["header"] = title.text.strip()

    content = soup.find("section", id="postingbody")
    if content:
        listing_data["content"] = content.text.strip()

    image_tags = soup.find_all("img")
    for img in image_tags:
        if 'src' in img.attrs:
            listing_data["images"].append(img['src'])

    price = soup.find("span", class_="price")
    if price:
        listing_data["price"] = price.text.strip()

    # Not every listing includes a location, so guard against missing tags
    title_text = soup.find("span", class_="postingtitletext")
    location = title_text.find("small") if title_text else None
    if location:
        listing_data["location"] = location.text.strip(" ()")

    date = soup.find("time", class_="date timeago")
    if date:
        listing_data["posting_date"] = date.get("datetime")

    post_id = soup.find("p", class_="postinginfo")
    if post_id:
        listing_data["post_id"] = post_id.text.replace("post id:", "").strip()

    contact_info = soup.find("a", class_="show-contact")
    if contact_info:
        listing_data["contact_info"] = contact_info.get("data-href")

    # Key-value attributes such as bedrooms, square footage, and so on
    attrs_groups = soup.find_all("p", class_="attrgroup")
    for group in attrs_groups:
        spans = group.find_all("span")
        for span in spans:
            key_value = span.text.split(":")
            if len(key_value) == 2:
                key, value = key_value
                listing_data["additional_info"][key.strip()] = value.strip()
            else:
                listing_data["additional_info"][span.text.strip()] = True

    return listing_data
This function is long, but keep in mind that it repeats similar actions. It takes a `url` (the listing URL) and a `session` (the requests session we defined previously). We send a GET request to the URL, fetch the individual listing’s web page, and create a BeautifulSoup object with the HTML parser.
Since we want to build the data up in JSON format, `listing_data` provides a template for arranging it. Now the repetitive part begins. For each field in the `listing_data` dictionary, we extract the corresponding data from its unique HTML element. If the data (which may or may not be present) exists, we store it in `listing_data`. We do this for every field, trying to extract as much information as possible.
Finally, the function returns the `listing_data` dictionary with all the information scraped while traversing the page.
results = []
for href in hrefs:
    result = scrape_listing(href, session)
    results.append(result)

data = []
for result in results:
    data.append(json.dumps(result))
    print(json.dumps(result, indent=2))
We first define the `results` list. Then we iterate over all the links in the `hrefs` list, which we filled earlier, passing each one to `scrape_listing` along with the session (configured with The Social Proxy proxies). The dictionary returned for each URL gets appended to the `results` list.
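To put even less pressure on Craigslist’s rate limits, you could pause briefly between requests inside this loop. Here’s a minimal sketch using the time and random modules we imported earlier (the original script sends the requests back to back; the 1–3 second range is just an example):

for href in hrefs:
    result = scrape_listing(href, session)
    results.append(result)
    # Hypothetical randomized delay to space out requests
    time.sleep(random.uniform(1, 3))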
Finally, we iterate over all the entries in the `results` list, serialize each one with `json.dumps`, and append it to the newly defined `data` list. We also print the JSON on the terminal. If you followed all these steps, you’ll see output where each JSON object resembles the one below:
{
  "url": "https://newyork.craigslist.org/wch/rts/d/rhinebeck-man-lift-rental-45-ft/7784450668.html",
  "images": [
    "https://images.craigslist.org/01010_hWUeiCh4Ypm_0cb087_600x450.jpg",
    "https://images.craigslist.org/01010_hWUeiCh4Ypm_0cb087_50x50c.jpg",
    "https://images.craigslist.org/00P0P_6ZnvgGZHgNC_0cb087_50x50c.jpg",
    "https://images.craigslist.org/00101_8tXa0ZAxRE6_0cb087_50x50c.jpg",
    "https://images.craigslist.org/00707_69R1jQtCQo1_0cb087_50x50c.jpg",
    "https://images.craigslist.org/00m0m_hfMXknwSqSS_0cb087_50x50c.jpg"
  ],
  "header": "Man Lift Rental 45 ft Articulating - $1500/month",
  "content": "QR Code Link to This Post\n\n\nReach New Heights with Our 45 ft Articulating Man Lift!\n\nTRACK DRIVEN = YOU DON'T GET STUCK IN THE MUD\n\nNeed to access those hard-to-reach areas? Our 45 ft articulating man lift is the perfect solution for your project. With its exceptional maneuverability and outreach, you'll be able to tackle any job with ease.\n\nKey Features:\n\n45 ft maximum working height\nArticulating boom for ultimate flexibility\nSmooth and precise controls\nSafe and reliable operation\nIdeal for:\n\nConstruction\nTree trimming\nPainting\nMaintenance\nAnd much more!\nAffordable Monthly Rental:\n\nRent our 45 ft articulating man lift for only $1500 per month!\n\nContact us today to reserve your lift!\n\n8 four 5 - 706 - 1 8 zero zero",
  "price": "",
  "location": "",
  "posting_date": "2024-09-13T13:01:48-0400",
  "post_id": "Posted\n \n 2024-09-13 13:01",
  "contact_info": "",
  "additional_info": {}
}
And there you go! While writing this for the New York Craigslist, executing this script gathered 148 property listings in JSON format. With The Social Proxy, you can make hundreds of requests per second and gather endless amounts of data without the fear of getting banned from Craigslist.
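If you’d rather keep the results than just print them, you could also write everything to a file for later analysis. A minimal sketch (the filename listings.json is just an example):

# Persist all scraped listings to disk as a single JSON array
with open("listings.json", "w", encoding="utf-8") as f:
    json.dump(results, f, indent=2)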
Scraping is a powerful technique that can be used to fetch valuable information from popular websites like Craigslist. There are countless ways to use this data; it can be used in mathematical operations to derive various parameters not provided by these platforms, to create graphs and charts, to create personal dashboards, or to scout out optimal investments. It can also be fed to AI/ML models. There are endless possibilities that can help set you apart from other investors in this competitive field.
The Social Proxy helps you scrape data by providing various ways to switch IPs, mimic legitimate users, bypass rate limits, hide your IP address, prevent you from getting blocked, etc. You can reference this article to better understand the basic implementation of the scraper with our mobile proxies, which can be extended to build your unique data collection and analysis application.