How to Build a Disinformation Detector for X (Twitter): Verifying Trending Hashtag Information

Social media platforms like Twitter have become a powerful tool for sharing information and promoting discussion. The downside, however, is that they also spread fake news, especially around specific keywords and topics ranging from health-related terms to current events.

The goal of this blog is to equip data scientists, developers, social media analysts, and misinformation researchers with the tools to tackle disinformation. As fake news gains traction, the misinformation it spreads can lead to real-world consequences, making tweet verification more pressing than ever.

This tutorial will explore how to build a disinformation detector for Twitter. We’ll walk through the process of scraping Twitter posts related to specific trending keywords, analyzing content using Natural Language Processing (NLP), and implementing a machine learning model for Twitter disinformation to determine the accuracy of tweets.

A step-by-step guide to scraping Twitter posts and detecting false information on specific keywords

Step 1: Set up your environment

Setting up a virtual environment is essential to manage dependencies and keep your Python project separate from other projects on your machine. Here’s a guide on how to set up a virtual environment in Python:

Before setting up a virtual environment, make sure you have Python installed on your system by running the following command:

				
					python --version
				
			

The output shows the version of Python installed on your system.

If you don’t have Python installed, you can download it from the official website.

  • Python 3.3+ comes with a built-in module called venv to create virtual environments. However, you can also install virtualenv for more flexibility.
    To install virtualenv, run:
				
					pip install virtualenv
				
			

Navigate to your project directory and create a virtual environment using the following command:

				
cd Downloads
virtualenv myenv

				
			

This command creates a folder named myenv (or any name you choose) containing your virtual environment.
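If you prefer the built-in venv module mentioned earlier, the equivalent command is:

python -m venv myenv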

Once we’ve created a virtual environment, we’ll need to activate it. On Windows, run:

				
					myenv\Scripts\activate
				
			

On macOS or Linux, run source myenv/bin/activate instead. Either way, your terminal prompt will change to show the virtual environment name (for example, a (myenv) prefix), indicating that it’s active.

Now that the virtual environment is active, you can install your project-specific dependencies without affecting your global Python environment.

For example, use the following command to install beautifulsoup4, requests, pandas, and selenium:

				
					pip install beautifulsoup4 requests pandas selenium
				
			

That’s it! You’ve successfully set up your virtual environment and installed your project dependencies.
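Optionally, record the installed dependencies in a requirements file so the environment can be recreated later:

pip freeze > requirements.txt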

To scrape Twitter posts without getting blocked, you can use The Social Proxy Scraper API. To obtain access, you’ll need to create an account and get your Consumer Key and Consumer Secret Key.
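Rather than hardcoding those credentials in your scripts, you can keep them in environment variables. Below is a minimal sketch; the variable names SOCIALPROXY_CONSUMER_KEY and SOCIALPROXY_CONSUMER_SECRET are placeholders of our own choosing, not names required by the API:

import os

# Read the credentials from environment variables set beforehand (placeholder names)
CONSUMER_KEY = os.environ["SOCIALPROXY_CONSUMER_KEY"]
CONSUMER_SECRET = os.environ["SOCIALPROXY_CONSUMER_SECRET"]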

Step 2: Scrape Twitter posts for a specific keyword using The Social Proxy Scraper API

In this step, we’ll collect tweets mentioning a specific keyword (e.g. #ClimateChange) using The Social Proxy Scraper API.

				
					import time
import json
import requests

base_url = "https://scraping-api.thesocialproxy.com/twitter/v0/search/top"
BEARER_TOKEN = "XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX"
BEARER_SECRET = "XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX"


HEADERS = {
    "Api-Key": f"{BEARER_TOKEN}:{BEARER_SECRET}",
}

query = {"query": "climate+change"}

is_next = True

cursor = None

total_data = []

page_count = 1

MAX_PAGES = 1000  # Set a reasonable limit for pages

try:
    while is_next and page_count <= MAX_PAGES:

        response = requests.get(base_url, timeout=120, headers=HEADERS, params=query)

        response_json = response.json()

        response_tweets = response_json.get("tweets", [])

        for tweet in response_tweets:
            tweet_text = tweet["tweet"].get("full_text", "")
            total_data.append({"tweet": tweet_text, "sentiment": -1})

        new_cursor = response_json.get("cursor", None)

        if new_cursor and new_cursor != cursor:
            cursor = new_cursor
            query["cursor"] = cursor
            is_next = True

        else:
            is_next = False

        time.sleep(2)
        print(f"Done with page: {page_count}")
        page_count += 1

    with open("total_tweets.json", "w", encoding="utf-8") as f:
        json.dump(total_data, f)

    print("Done!")

except KeyboardInterrupt:
    print("\nProcess interrupted by user.")
    with open("total_tweets.json", "w", encoding="utf-8") as f:
        json.dump(total_data, f)
    print("Progress saved.")

except Exception as e:
    with open("total_tweets.json", "w", encoding="utf-8") as f:
        json.dump(total_data, f)
    print(f"An error occurred: {e}")

				
			

Let’s review the code block above line by line.
Line 1–3: Importing libraries

				
					import time
import json
import requests
				
			

json: Used to work with JSON data (storing and retrieving data)
time: Used for pausing between requests (to avoid overloading the API)
requests: Used to make HTTP requests to the Twitter scraping API

Line 6-13: API credentials

				
					base_url = "https://scraping-api.thesocialproxy.com/twitter/v0/search/top"
BEARER_TOKEN = "XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX"
BEARER_SECRET = "XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX"


HEADERS = {
    "Api-Key": f"{BEARER_TOKEN}:{BEARER_SECRET}",
}
				
			

These lines define the base URL for the Twitter scraping API, as well as API credentials for authentication. The HEADERS dictionary stores the API key for authorization.

Line 15-25: Search query and loop variables

				
					query = {"query": "climate+change"}

is_next = True

cursor = None

total_data = []

page_count = 1

MAX_PAGES = 1000  # Set a reasonable limit for pages
				
			

query: Defines the search term as “climate+change”.
is_next: Boolean flag to control the loop (continues scraping as long as there are more pages).
cursor: Stores a value that retrieves subsequent pages of results.
total_data: Empty list to store all scraped tweets.
page_count: Keeps track of the current page being scraped.
MAX_PAGES: Sets a limit on the number of pages to scrape (prevents infinite loops).

Line 27-52: Main loop for scraping

				
					try:
    while is_next and page_count <= MAX_PAGES:


        response = requests.get(base_url, timeout=120, headers=HEADERS, params=query)


        response_json = response.json()


        response_tweets = response_json.get("tweets", [])


        for tweet in response_tweets:
            tweet_text = tweet["tweet"].get("full_text", "")
            total_data.append({"tweet": tweet_text, "sentiment": -1})


        new_cursor = response_json.get("cursor", None)


        if new_cursor and new_cursor != cursor:
            cursor = new_cursor
            query["cursor"] = cursor
            is_next = True


        else:
            is_next = False


        time.sleep(2)
        print(f"Done with page: {page_count}")
        page_count += 1

				
			

The try block handles the main scraping.
The while loop continues as long as is_next is True and the page limit isn’t reached.

You should also note that inside the loop:

  • An HTTP GET request is sent to the scraping API with the search query and headers.
  • The response is converted to JSON format (response_json).
  • Tweets are extracted from the response (response_tweets).
  • The loop iterates through each tweet, retrieving its full text and appending it to the total_data list with a sentiment placeholder (-1).
  • A new cursor value for pages of results is retrieved.
  • If a new cursor exists and differs from the previous one, it’s used to update the search query for the next page.
  • Otherwise, is_next is set to False, signaling the end of scraping.
  • A 2-second delay is added between requests to avoid overwhelming the API.
  • The current page number is printed.
  • The page count is incremented.

Line 54-73: Save results and error handling

				
    with open("total_tweets.json", "w", encoding="utf-8") as f:
        json.dump(total_data, f)

    print("Done!")

except KeyboardInterrupt:
    print("\nProcess interrupted by user.")
    with open("total_tweets.json", "w", encoding="utf-8") as f:
        json.dump(total_data, f)
    print("Progress saved.")

except Exception as e:
    with open("total_tweets.json", "w", encoding="utf-8") as f:
        json.dump(total_data, f)
    print(f"An error occurred: {e}")

				
			

try-except block:

  • try: This block contains the main scraping code. If an exception occurs within it, execution jumps to the matching except block; otherwise, the collected tweets are written to disk and "Done!" is printed.
  • except KeyboardInterrupt: This block handles the specific exception KeyboardInterrupt, which is raised when the user interrupts the program (e.g. by pressing Ctrl+C). The tweets collected so far are saved before exiting.
  • except Exception as e: This block catches any other type of exception that might occur. The variable e holds the exception object, allowing you to print its details for debugging, and the partial results are saved as well.
  • with open("total_tweets.json", "w", encoding="utf-8") as f: This line opens the file "total_tweets.json" in write mode ("w") with UTF-8 encoding. The with statement ensures that the file closes automatically, even if an exception occurs.
  • json.dump(total_data, f): This line writes the total_data object to the open file f in JSON format.
  • print("Done!"): This line prints a message indicating that the process has completed successfully.
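One thing the script above does not do is check the HTTP status code before parsing the response, so a rate-limit or server error could cause response.json() to fail or return unexpected data. A small optional hardening step (a sketch, not part of the original script) is to retry inside the loop whenever the request doesn't succeed:

        response = requests.get(base_url, timeout=120, headers=HEADERS, params=query)

        # If the request was not successful, wait briefly and retry instead of parsing the body
        if response.status_code != 200:
            print(f"Request failed with status {response.status_code}, retrying...")
            time.sleep(10)
            continue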

Step 3: Data preprocessing and keyword-specific analysis

Once we’ve scraped the tweets, we can clean and preprocess the data. Let’s start by importing the libraries:

				
					#import libraries

import re
import nltk
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from nltk.sentiment import SentimentIntensityAnalyzer
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import accuracy_score, classification_report
				
			

The scraper in Step 2 saved the tweets to total_tweets.json, while the rest of this step works from a CSV file named tweets.csv.
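If you're following along from Step 2, here's a minimal conversion sketch. It assumes total_tweets.json sits in the working directory; the row index written by to_csv becomes the 'Unnamed: 0' column we rename shortly:

# Convert the scraped JSON from Step 2 into the CSV used below
pd.read_json("total_tweets.json").to_csv("tweets.csv")

With tweets.csv in place, read the dataset: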

				
					# Change display setting to show full text
pd.set_option('display.max_colwidth', None)

tweet = pd.read_csv("tweets.csv")
tweet.head()

				
			

Output:

From the output, we can see that the first column is unnamed, so we’ll rename it to make the output more intuitive.

				
					#Rename the column


tweet1 = tweet.rename(columns={'Unnamed: 0': 'Id'}, inplace=False)
tweet1.head()
				
			

Output:

Now that we have a better-structured output, we can move on to data cleaning and remove links, hashtags, mentions, emojis, and other irrelevant characters from the dataset.

				
					#Data cleaning - Remove hashtags,urls, emojis and special characters
#Define function to clean tweets

def clean_tweet(tweet1):
    tweet1 = re.sub(r'http\S+', '', tweet1)  # Remove URLs
    tweet1 = re.sub(r'#\w+', '', tweet1)     # Remove hashtags
    tweet1 = re.sub(r'@[A-Za-z0-9_]+', '', tweet1)  # Remove mentions
    tweet1 = re.sub(r'[^A-Za-z\s]', '', tweet1)     # Remove special characters (keep only letters and spaces)
    tweet1 = re.sub(r'\s+', ' ', tweet1).strip()    # Remove extra spaces
    return tweet1

# Apply the clean_tweet function to the 'tweet' column
tweet1['cleaned_tweet'] = tweet1['tweet'].apply(clean_tweet)
tweet1.tail()

				
			

Output:

				
					#Keyword-Specific Feature Extraction

tweet1['keyword_present'] = tweet1['cleaned_tweet'].str.contains("climate change")
tweet1['has_hashtags'] = tweet1['tweet'].apply(lambda x: re.findall(r"#(\w+)", x))
print(tweet1[['cleaned_tweet', 'keyword_present', 'has_hashtags']])
				
			

The above code extracts the following features from the tweets:

  • keyword_present: A boolean indicating whether the tweet contains the keyword "climate change".
  • has_hashtags: A list of hashtags present in the tweet.


Output:

Now we’ll apply Natural Language Processing (NLP) to analyze the content of each tweet and the context in which the keyword is used, laying the groundwork for disinformation detection.

				
					#Using NLP for Contextual Understanding

nltk.download('vader_lexicon')

# Sentiment Analysis
sia = SentimentIntensityAnalyzer()

tweet1['sentiment'] = tweet1['cleaned_tweet'].apply(lambda x: sia.polarity_scores(x)['compound'])

# Label sentiment: Positive (1), Neutral (0), Negative (-1)
tweet1['sentiment_label'] = tweet1['sentiment'].apply(lambda x: 1 if x > 0 else (-1 if x < 0 else 0))

# Display the first few rows with sentiment analysis
print(tweet1.head())
				
			

Output:

Here’s a breakdown:

  • nltk.download('vader_lexicon'): This line downloads the VADER sentiment lexicon from the NLTK library. VADER (Valence Aware Dictionary and sEntiment Reasoner) is a lexicon- and rule-based sentiment analysis tool designed specifically for social media text, and the lexicon must be downloaded before use.
  • sia = SentimentIntensityAnalyzer(): This line creates an instance of SentimentIntensityAnalyzer from the NLTK library. The analyzer object is used to perform sentiment analysis on the tweets.
  • tweet1['sentiment'] = tweet1['cleaned_tweet'].apply(lambda x: sia.polarity_scores(x)['compound']): This line adds a new column named sentiment to the tweet1 DataFrame by applying a lambda function to each row of the 'cleaned_tweet' column. The lambda function calls the analyzer's polarity_scores method, which returns a dictionary of scores; the 'compound' key represents the overall sentiment polarity and is stored in the sentiment column.
  • tweet1['sentiment_label'] = tweet1['sentiment'].apply(lambda x: 1 if x > 0 else (-1 if x < 0 else 0)): This line adds another column named sentiment_label by converting each compound score into a categorical label: 1 if the score is greater than 0 (positive), -1 if it is less than 0 (negative), and 0 if it equals 0 (neutral).
  • print(tweet1.head()): This line prints the first few rows of the modified tweet1 DataFrame, which now includes the 'cleaned_tweet' column, the VADER 'sentiment' polarity score, and the categorical 'sentiment_label'.
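To get a feel for what the analyzer returns, you can score a single piece of text directly (an illustrative input; the exact numbers will vary):

# polarity_scores returns a dict with 'neg', 'neu', 'pos', and 'compound' keys;
# 'compound' is a normalized score between -1 (most negative) and +1 (most positive)
print(sia.polarity_scores("Climate change reports are exaggerated and misleading"))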

Step 4: Build a keyword-specific disinformation detection model

Here’s a step-by-step guide to train and build a Twitter disinformation model that differentiates between true and false information about climate change.

				
					#Train model
# Feature Extraction with TF-IDF

tfidf = TfidfVectorizer(max_features=5000)
X = tfidf.fit_transform(tweet1['cleaned_tweet']).toarray()

# Labels (For illustration, assume 0 = Not disinformation, 1 = Disinformation)
# You will need labeled data for this step
y = tweet1['keyword_present']  # Placeholder label

# Train-test split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# Model training
model = LogisticRegression()
model.fit(X_train, y_train)

# Model evaluation
y_pred = model.predict(X_test)
print(f"Accuracy: {accuracy_score(y_test, y_pred)}")
print(classification_report(y_test, y_pred))
				
			

TF-IDF is a common technique for representing text data numerically: it weighs how frequently a word appears in a document against how common that word is across the entire dataset. The code snippet above does the following:

  1. Converts textual tweet data into numerical features using TF-IDF.
  2. Trains a logistic regression model to differentiate between truthful and potentially disinformation-related tweets.
  3. Evaluates the model’s performance to assess its effectiveness in identifying disinformation.

Output:
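Keep in mind that keyword_present is only a stand-in target so the pipeline runs end to end; it says nothing about truthfulness. With a genuinely labeled dataset, you would train on that label instead. A sketch, assuming a hypothetical column named is_disinformation that holds 0/1 annotations:

# Hypothetical: swap the placeholder target for real annotations (0 = not disinformation, 1 = disinformation)
y = tweet1['is_disinformation']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)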

Save the vectorizer and model:

				
import joblib

# Save the TF-IDF vectorizer
joblib.dump(tfidf, 'tfidf_vectorizer.pkl')
# Save the Logistic Regression model
joblib.dump(model, 'logistic_model.pkl')
				
			

Step 5: Implementing the keyword-specific disinformation detector

Once the model has been trained, we’ll integrate it with the scraper. Each time a new batch of tweets containing the keyword gets scraped, the model classifies whether or not they spread disinformation.

Let’s write a script to integrate it with the scraper:

				
					import re
import joblib
import requests

BASE_URL = "https://scraping-api.thesocialproxy.com/twitter/v0/search/top"
BEARER_TOKEN = "XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX"
BEARER_SECRET = "XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX"

HEADERS = {"Api-Key": f"{BEARER_TOKEN}:{BEARER_SECRET}"}

query = {"query": "climate+change"}

# load the saved model and vectorizer

tfidf = joblib.load("tfidf_vectorizer.pkl")
model = joblib.load("logistic_model.pkl")

# Define function to clean tweets
def clean_tweet(value: str):
    value = re.sub(r"http\S+", "", value)  # Remove URLs
    value = re.sub(r"#\w+", "", value)  # Remove hashtags
    value = re.sub(r"@[A-Za-z0-9_]+", "", value)  # Remove mentions
    value = re.sub(
        r"[^A-Za-z\s]", "", value
    )  # Remove special characters (keep only letters and spaces)
    value = re.sub(r"\s+", " ", value).strip()  # Remove extra spaces
    return value


def get_model_prediction(x):

    x_numpy = tfidf.transform(x).toarray()

    pred = model.predict(x_numpy)

    result = ([*x], [*pred])

    return result


def extract_tweets():
    response = requests.get(BASE_URL, timeout=120, headers=HEADERS, params=query)

    response_json = response.json()

    response_tweets = response_json.get("tweets", [])

    total_tweets = []

    for tweet_object in response_tweets:
        tweet = tweet_object["tweet"].get("full_text", "")

        if tweet:
            cleaned_tweet = clean_tweet(value=tweet)
            total_tweets.append(cleaned_tweet)

    return total_tweets

if __name__ == "__main__":

    scraped_tweets = extract_tweets()

    cleaned_tweets, model_prediction = get_model_prediction(x=scraped_tweets)

    for tweet_str, prediction in zip(cleaned_tweets, model_prediction):
        print(f"Tweet: {tweet_str}")
        print(f"Prediction: {prediction}\n\n")
				
			

Let’s go over what the above code block does line by line.

Line 1–3: Importing libraries

				
					import re
import joblib
import requests


				
			

import re: Imports the re module for regular expressions used in text cleaning.
import joblib: Imports the joblib module for loading saved models and vectorizer.
import requests: Imports the requests module for making HTTP requests to the Social Proxy API.

Line 5-9: The Social Proxy API credentials

				
					BASE_URL = "https://scraping-api.thesocialproxy.com/twitter/v0/search/top"
BEARER_TOKEN = "XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX"
BEARER_SECRET = "XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX"

HEADERS = {"Api-Key": f"{BEARER_TOKEN}:{BEARER_SECRET}"}
				
			

BASE_URL: Defines the base URL of The Social Proxy API.
BEARER_TOKEN and BEARER_SECRET: Define the API credentials for accessing The Social Proxy API.
HEADERS: Creates a dictionary containing the API key for authentication.

Line 11: Search query

				
					query = {"query": "climate+change"}
				
			

query: Defines the search query for the API call. Here, it’s set to "climate+change".

Line 13-16: Loading saved model and vectorizer

				
					# load the saved model and vectorizer
tfidf = joblib.load("tfidf_vectorizer.pkl")
model = joblib.load("logistic_model.pkl")
				
			

tfidf = joblib.load(“tfidf_vectorizer.pkl”): Loads the pre-trained TF-IDF vectorizer from a file named “tfidf_vectorizer.pkl”.
model = joblib.load(“logistic_model.pkl”): Loads the pre-trained logistic regression model from a file named “logistic_model.pkl”.

Line 19-28: Clean tweet function

				
					# Define function to clean tweets
def clean_tweet(value: str):
    value = re.sub(r"http\S+", "", value)  # Remove URLs
    value = re.sub(r"#\w+", "", value)  # Remove hashtags
    value = re.sub(r"@[A-Za-z0-9_]+", "", value)  # Remove mentions
    value = re.sub(
        r"[^A-Za-z\s]", "", value
    )  # Remove special characters (keep only letters and spaces)
    value = re.sub(r"\s+", " ", value).strip()  # Remove extra spaces
    return value
				
			

def clean_tweet(value: str): Defines a function that takes a string (tweet text) as input and returns a cleaned version. Uses regular expressions to remove URLs, hashtags, mentions, special characters (except letters and spaces), and extra whitespace.
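As a quick sanity check, here's what the function does to a made-up tweet (illustrative input only):

# URLs, hashtags, mentions, punctuation, and emojis are stripped, and extra whitespace is collapsed
print(clean_tweet("Check this out! https://t.co/abc #ClimateChange @user 🌍"))
# -> "Check this out"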

Line 31-39: Get model prediction function

				
					def get_model_prediction(x):

    x_numpy = tfidf.transform(x).toarray()

    pred = model.predict(x_numpy)

    result = ([*x], [*pred])

    return result
				
			

def get_model_prediction(x): Takes a list of strings (tweets) as input, converts them to a TF-IDF representation using the loaded vectorizer, and makes predictions on the transformed tweets using the loaded model. It returns a tuple containing the original tweets and the corresponding predictions; with the placeholder labels from Step 4 these are True/False values, and with properly labeled training data they would indicate disinformation versus not.
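As a quick usage example (illustrative input; with the placeholder target from Step 4, the prediction is a True/False flag rather than a verified disinformation label):

# The input should already be cleaned, since the model was trained on cleaned text
texts, predictions = get_model_prediction(["climate change is accelerating faster than expected"])
print(texts, predictions)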

Line 42-58: Extract tweets function

				
def extract_tweets():
    response = requests.get(BASE_URL, timeout=120, headers=HEADERS, params=query)

    response_json = response.json()

    response_tweets = response_json.get("tweets", [])

    total_tweets = []

    for tweet_object in response_tweets:
        tweet = tweet_object["tweet"].get("full_text", "")

        if tweet:
            cleaned_tweet = clean_tweet(value=tweet)
            total_tweets.append(cleaned_tweet)

    return total_tweets
				
			

def extract_tweets(): Makes a GET request to The Social Proxy scraping API using the defined URL, headers, and search query, parses the JSON response and extracts the list of tweets from the response, cleans each tweet using the clean_tweet function, and returns a list of cleaned tweets.

Line 60-68: Main execution

				
					if __name__ == "__main__":

    scraped_tweets = extract_tweets()

    cleaned_tweets, model_prediction = get_model_prediction(x=scraped_tweets)

    for tweet_str, prediction in zip(cleaned_tweets, model_prediction):
        print(f"Tweet: {tweet_str}")
        print(f"Prediction: {prediction}\n\n")
				
			

if __name__ == "__main__": This block executes only when the script is run directly (not imported as a module). It calls the extract_tweets function to get a list of cleaned tweets related to climate change, uses the get_model_prediction function to get predictions for those tweets, and then iterates through the cleaned tweets and corresponding predictions, printing each tweet and its predicted classification (either True or False).


Output:

Dealing with false positives and negatives is an important aspect of building a reliable detector. Improving feature selection, fine-tuning the model, or factoring in additional context via trusted news sources can help minimize inaccuracies.
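For example, instead of the default 0.5 decision threshold you can work with the model's predicted probabilities and choose a cutoff that matches your tolerance for false positives versus false negatives. A minimal sketch, assuming the vectorizer and model saved earlier and a hypothetical threshold of 0.7:

import joblib

tfidf = joblib.load("tfidf_vectorizer.pkl")
model = joblib.load("logistic_model.pkl")

def classify_with_threshold(texts, threshold=0.7):
    # Probability of the positive (disinformation) class for each tweet
    probs = model.predict_proba(tfidf.transform(texts).toarray())[:, 1]
    # Flag a tweet only when the model is sufficiently confident
    return [(text, prob >= threshold, prob) for text, prob in zip(texts, probs)]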

Conclusion

This Twitter scraping tutorial explains how to scrape tweets that contain specific keywords, clean the data, and train a machine learning model to detect false information on Twitter. By focusing on trending topics that are prone to misinformation, you can build an effective and timely detector. Identifying fake news is imperative to preventing its negative impacts, and keyword-specific disinformation detection helps professionals maintain information integrity and fight the harmful effects of false information.

Explore The Social Proxy services to enhance your skills on Twitter post scraping for disinformation. The full code used in this article can be found in the GitHub repo.
