Social media platforms like Twitter have become powerful tools for sharing information and promoting discussion. The downside, however, is that they also spread fake news. This rings especially true for specific keywords or topics, ranging from health-related terms to current events.
The goal of this blog is to equip data scientists, developers, social media analysts, and misinformation researchers with the tools to tackle disinformation. As fake news gains traction, the misinformation it spreads can lead to real-world consequences, making tweet accuracy verification more pressing than ever.
This tutorial will explore how to build a disinformation detector for Twitter. We’ll walk through scraping Twitter posts related to specific trending keywords, analyzing their content using Natural Language Processing (NLP), and implementing a machine learning model to assess the accuracy of tweets.
Setting up a virtual environment is essential to manage dependencies and keep your Python project separate from other projects on your machine. Here’s a guide on how to set up a virtual environment in Python:
Before setting up a virtual environment, make sure you have Python installed on your system by running the following command:
python --version
Output:
If you don’t have Python installed, you can download it from the official website.
Next, install the virtualenv package:
pip install virtualenv
Navigate to your project directory (Downloads in this example) and create a virtual environment using the following commands:
cd Downloads
virtualenv myenv
This command creates a folder named myenv (or any name you choose) containing your virtual environment.
Once we’ve created a virtual environment, we’ll need to activate it. On Windows, run:
myenv\Scripts\activate
On macOS or Linux, run source myenv/bin/activate instead. Your terminal prompt will change to show the virtual environment name (for example, (myenv)), indicating that it’s active.
Now that the virtual environment is active, you can install your project-specific dependencies without affecting your global Python environment.
For example, use the following command to install beautifulsoup4, requests, pandas, and selenium:
pip install beautifulsoup4 requests pandas selenium
That’s it! You’ve successfully set up and managed your virtual environment.
In order to scrape Twitter posts without getting blocked, you can use The Social Proxy Scraper API. To obtain access, you’ll need to create an account and get your Consumer Key and Consumer Secret Key.
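Note: the snippets in this tutorial hard-code the keys for simplicity. In practice, you may prefer to load them from environment variables; here’s a minimal sketch, assuming hypothetical variable names SP_CONSUMER_KEY and SP_CONSUMER_SECRET:
import os

# Hypothetical environment variable names; adjust to however you store your credentials
CONSUMER_KEY = os.environ["SP_CONSUMER_KEY"]
CONSUMER_SECRET = os.environ["SP_CONSUMER_SECRET"]

# Same header format used by the scraping scripts below
HEADERS = {"Api-Key": f"{CONSUMER_KEY}:{CONSUMER_SECRET}"}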
In this step, we’ll collect tweets mentioning a specific keyword (e.g. #ClimateChange) using The Social Proxy Scraper API.
import time
import json
import requests

base_url = "https://scraping-api.thesocialproxy.com/twitter/v0/search/top"
BEARER_TOKEN = "XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX"
BEARER_SECRET = "XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX"
HEADERS = {
    "Api-Key": f"{BEARER_TOKEN}:{BEARER_SECRET}",
}

query = {"query": "climate+change"}
is_next = True
cursor = None
total_data = []
page_count = 1
MAX_PAGES = 1000  # Set a reasonable limit for pages

try:
    while is_next and page_count <= MAX_PAGES:
        response = requests.get(base_url, timeout=120, headers=HEADERS, params=query)
        response_json = response.json()
        response_tweets = response_json.get("tweets", [])
        for tweet in response_tweets:
            tweet_text = tweet["tweet"].get("full_text", "")
            total_data.append({"tweet": tweet_text, "sentiment": -1})
        new_cursor = response_json.get("cursor", None)
        if new_cursor and new_cursor != cursor:
            cursor = new_cursor
            query["cursor"] = cursor
            is_next = True
        else:
            is_next = False
        time.sleep(2)
        print(f"Done with page: {page_count}")
        page_count += 1
    with open("total_tweets.json", "w", encoding="utf-8") as f:
        json.dump(total_data, f)
    print("Done!")
except KeyboardInterrupt:
    print("\nProcess interrupted by user.")
    with open("total_tweets.json", "w", encoding="utf-8") as f:
        json.dump(total_data, f)
    print("Progress saved.")
except Exception as e:
    with open("total_tweets.json", "w", encoding="utf-8") as f:
        json.dump(total_data, f)
    print(f"An error occurred: {e}")
else:
    with open("total_tweets.json", "w", encoding="utf-8") as f:
        json.dump(total_data, f)
    print("Done!")
Let’s review the code block above line by line.
Line 1–3: Import libraries
import time
import json
import requests
json: Used to work with JSON data (storing and retrieving data)
time: Used for pausing between requests (to avoid overloading the API)
requests: Used to make HTTP requests to the Twitter scraping API
Line 6-13: API credentials
base_url = "https://scraping-api.thesocialproxy.com/twitter/v0/search/top"
BEARER_TOKEN = "XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX"
BEARER_SECRET = "XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX"
HEADERS = {
"Api-Key": f"{BEARER_TOKEN}:{BEARER_SECRET}",
}
These lines define the base URL for the Twitter scraping API, as well as API credentials for authentication. The HEADERS dictionary stores the API key for authorization.
Line 15-25: Search query and loop variables
query = {"query": "climate+change"}
is_next = True
cursor = None
total_data = []
page_count = 1
MAX_PAGES = 1000 # Set a reasonable limit for pages
query: Defines the search term as “climate+change”.
is_next: Boolean flag to control the loop (continues scraping as long as there are more pages).
cursor: Stores a value that retrieves subsequent pages of results.
total_data: Empty list to store all scraped tweets.
page_count: Keeps track of the current page being scraped.
MAX_PAGES: Sets a limit on the number of pages to scrape (prevents infinite loops).
Line 27-52: Main loop for scraping
try:
    while is_next and page_count <= MAX_PAGES:
        response = requests.get(base_url, timeout=120, headers=HEADERS, params=query)
        response_json = response.json()
        response_tweets = response_json.get("tweets", [])
        for tweet in response_tweets:
            tweet_text = tweet["tweet"].get("full_text", "")
            total_data.append({"tweet": tweet_text, "sentiment": -1})
        new_cursor = response_json.get("cursor", None)
        if new_cursor and new_cursor != cursor:
            cursor = new_cursor
            query["cursor"] = cursor
            is_next = True
        else:
            is_next = False
        time.sleep(2)
        print(f"Done with page: {page_count}")
        page_count += 1
The try block handles the main scraping, and the while loop keeps running as long as is_next is True and the page limit hasn’t been reached. Inside the loop, the code:
Sends a GET request to the API with the current query parameters.
Parses the JSON response and pulls out the list of tweets.
Appends each tweet’s full text to total_data (with a placeholder sentiment of -1).
Checks the cursor returned by the API; if a new cursor is present, it’s added to the query so the next iteration fetches the next page of results, otherwise the loop stops.
Pauses for two seconds between requests, prints the page that was just processed, and increments the page counter.
Line 54-73: Save results and error handling
    with open("total_tweets.json", "w", encoding="utf-8") as f:
        json.dump(total_data, f)
    print("Done!")
except KeyboardInterrupt:
    print("\nProcess interrupted by user.")
    with open("total_tweets.json", "w", encoding="utf-8") as f:
        json.dump(total_data, f)
    print("Progress saved.")
except Exception as e:
    with open("total_tweets.json", "w", encoding="utf-8") as f:
        json.dump(total_data, f)
    print(f"An error occurred: {e}")
else:
    with open("total_tweets.json", "w", encoding="utf-8") as f:
        json.dump(total_data, f)
    print("Done!")
The try-except-else block handles completion and errors:
If you interrupt the script (Ctrl+C), the KeyboardInterrupt handler saves whatever has been collected so far to total_tweets.json and reports that progress was saved.
Any other exception is caught by the generic handler, which also saves the collected data and prints the error message.
If the loop finishes without an exception, the full set of tweets is written to total_tweets.json and “Done!” is printed.
Once we’ve scraped the tweets, we can clean and preprocess the data. Let’s start by importing the libraries:
# Import libraries
import re
import nltk
import joblib
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from nltk.sentiment import SentimentIntensityAnalyzer
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import accuracy_score, classification_report
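The scraper above saved the raw tweets to total_tweets.json, while the rest of this tutorial reads them from tweets.csv. Here’s a minimal sketch, assuming total_tweets.json is in your working directory, of how you might convert one to the other:
import json
import pandas as pd

# Load the tweets collected by the scraper (a list of {"tweet": ..., "sentiment": ...} records)
with open("total_tweets.json", "r", encoding="utf-8") as f:
    scraped = json.load(f)

# Write them out as tweets.csv; the default index becomes the unnamed column we rename below
pd.DataFrame(scraped).to_csv("tweets.csv")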
Read the dataset:
# Change display setting to show full text
pd.set_option('display.max_colwidth', None)
tweet = pd.read_csv("tweets.csv")
tweet.head()
Output:
From the output, we can see that the first column doesn’t have a meaningful name, so we’ll rename it to make the output more intuitive.
#Rename the column
tweet1 = tweet.rename(columns={'Unnamed: 0': 'Id'}, inplace=False)
tweet1.head()
Output:
Now that we have a better output, we can move on to data cleaning and remove links, hashtags, mentions, emojis, and other irrelevant characters from the tweets.
# Data cleaning - remove hashtags, URLs, emojis, and special characters
# Define function to clean tweets
def clean_tweet(tweet1):
    tweet1 = re.sub(r'http\S+', '', tweet1)  # Remove URLs
    tweet1 = re.sub(r'#\w+', '', tweet1)  # Remove hashtags
    tweet1 = re.sub(r'@[A-Za-z0-9_]+', '', tweet1)  # Remove mentions
    tweet1 = re.sub(r'[^A-Za-z\s]', '', tweet1)  # Remove special characters (keep only letters and spaces)
    tweet1 = re.sub(r'\s+', ' ', tweet1).strip()  # Remove extra spaces
    return tweet1

# Apply the clean_tweet function to the 'tweet' column
tweet1['cleaned_tweet'] = tweet1['tweet'].apply(clean_tweet)
tweet1.tail()
Output:
#Keyword-Specific Feature Extraction
tweet1['keyword_present'] = tweet1['cleaned_tweet'].str.contains("climate change")
tweet1['has_hashtags'] = tweet1['tweet'].apply(lambda x: re.findall(r"#(\w+)", x))
print(tweet1[['cleaned_tweet', 'keyword_present', 'has_hashtags']])
The above code extracts the following features from the tweets:
‘keyword_present’: A boolean indicating whether the tweet contains the keyword “climate change”.
‘has_hashtags’: A list of hashtags present in the tweet.
Output:
Now we’ll use Natural Language Processing (NLP) for disinformation detection, analyzing each tweet’s content and the context in which the keyword is used as a first step toward verifying its accuracy.
#Using NLP for Contextual Understanding
nltk.download('vader_lexicon')
# Sentiment Analysis
sia = SentimentIntensityAnalyzer()
tweet1['sentiment'] = tweet1['cleaned_tweet'].apply(lambda x: sia.polarity_scores(x)['compound'])
# Label sentiment: Positive (1), Neutral (0), Negative (-1)
tweet1['sentiment_label'] = tweet1['sentiment'].apply(lambda x: 1 if x > 0 else (-1 if x < 0 else 0))
# Display the first few rows with sentiment analysis
print(tweet1.head())
Output:
Here’s a breakdown:
nltk.download('vader_lexicon'): Downloads the VADER lexicon used by the sentiment analyzer.
SentimentIntensityAnalyzer(): Creates the VADER sentiment analyzer.
sentiment: Stores the compound polarity score (between -1 and 1) computed for each cleaned tweet.
sentiment_label: Maps the compound score to a label: positive (1), negative (-1), or neutral (0).
Here’s a step-by-step guide to train and build a Twitter disinformation model that differentiates between true and false information about climate change.
#Train model
# Feature Extraction with TF-IDF
tfidf = TfidfVectorizer(max_features=5000)
X = tfidf.fit_transform(tweet1['cleaned_tweet']).toarray()
# Labels (For illustration, assume 0 = Not disinformation, 1 = Disinformation)
# You will need labeled data for this step
y = tweet1['keyword_present'] # Placeholder label
# Train-test split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)
# Model training
model = LogisticRegression()
model.fit(X_train, y_train)
# Model evaluation
y_pred = model.predict(X_test)
print(f"Accuracy: {accuracy_score(y_test, y_pred)}")
print(classification_report(y_test, y_pred))
TF-IDF is a common technique for representing text data numerically. It weighs how often a word appears within a document against how common that word is across the entire dataset. The above code snippet performs keyword-specific disinformation detection and does the following:
Converts the cleaned tweets into TF-IDF feature vectors (capped at 5,000 features).
Uses the keyword_present column as a placeholder label; you’ll need properly annotated data for real disinformation labels.
Splits the data into training (70%) and test (30%) sets.
Trains a logistic regression classifier and evaluates it using accuracy and a classification report.
Output:
Save the vectorizer and model:
# Save the TF-IDF vectorizer
joblib.dump(tfidf, 'tfidf_vectorizer.pkl')
# Save the Logistic Regression model
joblib.dump(model, 'logistic_model.pkl')
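Keep in mind that keyword_present is only a stand-in so the pipeline runs end to end. To detect real disinformation, you’d train on manually annotated tweets; here’s a minimal sketch, assuming a hypothetical labeled_tweets.csv with cleaned_tweet and label columns (0 = not disinformation, 1 = disinformation):
import joblib
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical manually annotated dataset
labeled = pd.read_csv("labeled_tweets.csv")

# Vectorize the annotated tweets and use the human-assigned labels as the target
tfidf = TfidfVectorizer(max_features=5000)
X = tfidf.fit_transform(labeled['cleaned_tweet']).toarray()
y = labeled['label']  # 0 = not disinformation, 1 = disinformation

# Train and save exactly as in the placeholder pipeline above
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)
model = LogisticRegression()
model.fit(X_train, y_train)
joblib.dump(tfidf, 'tfidf_vectorizer.pkl')
joblib.dump(model, 'logistic_model.pkl')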
Once the model has been trained, we’ll integrate it with the scraper. Each time a new batch of tweets containing the keyword gets scraped, the model classifies whether or not they spread disinformation.
Let’s write a script to integrate it with the scraper:
import re
import joblib
import requests

BASE_URL = "https://scraping-api.thesocialproxy.com/twitter/v0/search/top"
BEARER_TOKEN = "XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX"
BEARER_SECRET = "XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX"
HEADERS = {"Api-Key": f"{BEARER_TOKEN}:{BEARER_SECRET}"}

query = {"query": "climate+change"}

# Load the saved model and vectorizer
tfidf = joblib.load("tfidf_vectorizer.pkl")
model = joblib.load("logistic_model.pkl")


# Define function to clean tweets
def clean_tweet(value: str):
    value = re.sub(r"http\S+", "", value)  # Remove URLs
    value = re.sub(r"#\w+", "", value)  # Remove hashtags
    value = re.sub(r"@[A-Za-z0-9_]+", "", value)  # Remove mentions
    value = re.sub(r"[^A-Za-z\s]", "", value)  # Remove special characters (keep only letters and spaces)
    value = re.sub(r"\s+", " ", value).strip()  # Remove extra spaces
    return value


def get_model_prediction(x):
    x_numpy = tfidf.transform(x).toarray()
    pred = model.predict(x_numpy)
    result = ([*x], [*pred])
    return result


def extract_tweets():
    response = requests.get(BASE_URL, timeout=120, headers=HEADERS, params=query)
    response_json = response.json()
    response_tweets = response_json.get("tweets", [])
    total_tweets = []
    for tweet_object in response_tweets:
        tweet = tweet_object["tweet"].get("full_text", "")
        if tweet:
            cleaned_tweet = clean_tweet(value=tweet)
            total_tweets.append(cleaned_tweet)
    return total_tweets


if __name__ == "__main__":
    scraped_tweets = extract_tweets()
    cleaned_tweets, model_prediction = get_model_prediction(x=scraped_tweets)
    for tweet_str, prediction in zip(cleaned_tweets, model_prediction):
        print(f"Tweet: {tweet_str}")
        print(f"Prediction: {prediction}\n\n")
Let’s go over what the above code block does line by line.
Line 1–3: Importing libraries
import re
import joblib
import requests
import re: Imports the re module for regular expressions used in text cleaning.
import joblib: Imports the joblib module for loading saved models and vectorizer.
import requests: Imports the requests module for making HTTP requests to the Social Proxy API.
Line 5-9: The Social Proxy API credentials
BASE_URL = "https://scraping-api.thesocialproxy.com/twitter/v0/search/top"
BEARER_TOKEN = "XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX"
BEARER_SECRET = "XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX"
HEADERS = {"Api-Key": f"{BEARER_TOKEN}:{BEARER_SECRET}"}
BASE_URL: Defines the base URL of The Social Proxy API.
BEARER_TOKEN and BEARER_SECRET: Define the API credentials for accessing The Social Proxy API.
HEADERS: Creates a dictionary containing the API key for authentication.
Line 11: Search query
query = {"query": "climate+change"}
query: Defines the search query for the API call. Here, it’s set to “climate+change”.
Line 13-16: Loading saved model and vectorizer
# load the saved model and vectorizer
tfidf = joblib.load("tfidf_vectorizer.pkl")
model = joblib.load("logistic_model.pkl")
tfidf = joblib.load(“tfidf_vectorizer.pkl”): Loads the pre-trained TF-IDF vectorizer from a file named “tfidf_vectorizer.pkl”.
model = joblib.load(“logistic_model.pkl”): Loads the pre-trained logistic regression model from a file named “logistic_model.pkl”.
Line 19-28: Clean tweet function
# Define function to clean tweets
def clean_tweet(value: str):
value = re.sub(r"http\S+", "", value) # Remove URLs
value = re.sub(r"#\w+", "", value) # Remove hashtags
value = re.sub(r"@[A-Za-z0-9_]+", "", value) # Remove mentions
value = re.sub(
r"[^A-Za-z\s]", "", value
) # Remove special characters (keep only letters and spaces)
value = re.sub(r"\s+", " ", value).strip() # Remove extra spaces
return value
def clean_tweet(value: str): Defines a function that takes a string (tweet text) as input and returns a cleaned version. Uses regular expressions to remove URLs, hashtags, mentions, special characters (except letters and spaces), and extra whitespace.
Line 31-39: Get model prediction function
def get_model_prediction(x):
    x_numpy = tfidf.transform(x).toarray()
    pred = model.predict(x_numpy)
    result = ([*x], [*pred])
    return result
def get_model_prediction(x): Takes a list of strings (tweets) as input, converts the tweets to a TF-IDF representation using the loaded vectorizer, and makes predictions on the transformed tweets using the loaded model. Returns a tuple containing the original tweets and the corresponding model predictions (1 for disinformation, 0 for not).
Line 42-58: Extract tweets function
def extract_tweets():
    response = requests.get(BASE_URL, timeout=120, headers=HEADERS, params=query)
    response_json = response.json()
    response_tweets = response_json.get("tweets", [])
    total_tweets = []
    for tweet_object in response_tweets:
        tweet = tweet_object["tweet"].get("full_text", "")
        if tweet:
            cleaned_tweet = clean_tweet(value=tweet)
            total_tweets.append(cleaned_tweet)
    return total_tweets
def extract_tweets(): Makes a GET request to The Social Proxy scraping API using the defined URL, headers, and search query, parses the JSON response and extracts the list of tweets from the response, cleans each tweet using the clean_tweet function, and returns a list of cleaned tweets.
Line 60-68: Main execution
if __name__ == "__main__":
    scraped_tweets = extract_tweets()
    cleaned_tweets, model_prediction = get_model_prediction(x=scraped_tweets)
    for tweet_str, prediction in zip(cleaned_tweets, model_prediction):
        print(f"Tweet: {tweet_str}")
        print(f"Prediction: {prediction}\n\n")
if __name__ == “__main__”: This block executes only when the script is run directly (not imported as a module).
Calls the extract_tweets function to get a list of cleaned tweets related to climate change.
Uses the get_model_prediction function to get predictions for those tweets.
Iterates through the cleaned tweets and corresponding predictions, printing each tweet and its predicted classification (either True or False).
Output:
Dealing with false positives and negatives is an important aspect of building a reliable detector. Improving feature selection, fine-tuning the model, or factoring in additional context via trusted news sources can help minimize inaccuracies.
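One simple way to reduce false positives, for example, is to act only on high-confidence predictions. Here’s a minimal sketch, assuming the logistic regression model and TF-IDF vectorizer trained above, that uses predict_proba with a confidence threshold:
import joblib

tfidf = joblib.load("tfidf_vectorizer.pkl")
model = joblib.load("logistic_model.pkl")

def classify_with_confidence(tweets, threshold=0.8):
    # Flag a tweet as disinformation only when the model is at least `threshold` confident
    features = tfidf.transform(tweets).toarray()
    probabilities = model.predict_proba(features)[:, 1]  # probability of the positive (disinformation) class
    labels = []
    for p in probabilities:
        if p >= threshold:
            labels.append(("disinformation", p))
        elif p >= 0.5:
            labels.append(("needs manual review", p))
        else:
            labels.append(("likely genuine", p))
    return labels

# Example usage with a hypothetical cleaned tweet
print(classify_with_confidence(["climate change is a hoax invented by scientists"]))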
This Twitter scraping tutorial explains how to scrape tweets that contain specific keywords, clean the data, and train a machine learning model to detect false information on Twitter. By focusing on trending topics that are prone to misinformation, you can implement an effective and timely detector. Identifying fake news early helps prevent its negative impacts, and keyword-based disinformation detectors give professionals a practical way to maintain information integrity and fight the harmful effects of false information.
Explore The Social Proxy’s services to take your Twitter scraping for disinformation detection further. The full code used in this article can be found in the GitHub repo.