My team and I are interested in President Donald Trump's tweets on the popular social media platform Twitter because they deliver this controversial political figure's opinions and thoughts straight to the public. From multiple sources, we were able to retrieve a large CSV file of his tweets spanning the early days of his 2016 presidential campaign through March 5th, 2017. These are the questions we plan to answer by the end:
1. On average, do Trump's tweets tend to be positive or negative?
2. What are Trump's most used words in his tweets?
3. What emotions do Trump’s tweets convey?
# Set-up
import pandas as pd
import numpy as np
import csv
from nltk.tokenize import TweetTokenizer
from collections import OrderedDict, defaultdict, Counter
import seaborn as sns # for visualization
import statsmodels.api as sm
import statsmodels.formula.api as smf
import matplotlib
import matplotlib.pyplot as plt # plotting
df = pd.read_csv('./data/Trump-Tweets.csv')
df.columns
Here, we read in the entire CSV file, convert it to a DataFrame, and list all the columns in the dataset.
The resource we will use to conduct the sentiment analysis is the NRC Word-Emotion Association Lexicon, which associates words with eight basic emotions (anger, fear, anticipation, trust, surprise, sadness, joy, and disgust) as well as with positive or negative sentiment. The categorizations were done manually by expert annotators. These will be our new columns, which provide more information about the content of the tweets. In addition, we have added a variable called "polarity" that measures the magnitude of a tweet's sentiment by simply subtracting the negative column from the positive column. Last but not least, we added the variable "sentiment" that labels each tweet as positive, negative, or neutral based on the polarity.
# Gather and isolate all the tweets
tweets = df['Text']
wordList = defaultdict(list)
emotionList = defaultdict(list)
# Process the text file and put the lexicon into a dictionary.
with open('./data/NRC-Emotion-Lexicon-Wordlevel-v0.92.txt', 'r') as f:
    reader = csv.reader(f, delimiter='\t')
    for word, emotion, present in reader:
        if int(present) == 1:
            wordList[word].append(emotion)
            emotionList[emotion].append(word)
tt = TweetTokenizer()
def generate_emotion_count(string, tokenizer):
    emoCount = Counter()
    # Use the tokenizer passed in rather than the global tt
    for token in tokenizer.tokenize(string):
        token = token.lower()
        emoCount += Counter(wordList[token])
    return emoCount
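To see what this counter logic produces, here is a quick sanity check against a toy word list. The two entries below are hypothetical stand-ins for the real NRC file, and a plain whitespace split stands in for the TweetTokenizer:

```python
from collections import Counter, defaultdict

# Toy stand-in for wordList; these two entries are hypothetical, not from the NRC file
wordList = defaultdict(list)
wordList["great"] = ["joy", "positive", "trust"]
wordList["sad"] = ["negative", "sadness"]

def generate_emotion_count(text):
    # A plain whitespace split stands in for TweetTokenizer here
    emoCount = Counter()
    for token in text.split():
        emoCount += Counter(wordList[token.lower()])
    return emoCount

counts = generate_emotion_count("Great rally and sad news")
print(counts)  # each matched word contributes one count per associated emotion
```

Unmatched tokens simply contribute nothing, which is why many tweets end up with zero counts for several emotions.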
# Go through all the tweets with the lexicon and figure out the emotions and sentiments.
emotionCounts = [generate_emotion_count(tweet, tt) for tweet in tweets]
emotion = pd.DataFrame(emotionCounts, index=tweets.index)
# Fill in 0s if the emotion is not detected.
emotion = emotion.fillna(0)
# Join the original dataset and the sentiment and emotion counts
df = pd.concat([df, emotion], axis=1, join='inner')
# Adds polarity and sentiments
df['polarity'] = df['positive'] - df['negative']
sentiments = []
for polarity in df['polarity']:
    if polarity > 0:
        sentiments.append("positive")
    elif polarity < 0:
        sentiments.append("negative")
    else:
        sentiments.append("neutral")
df['sentiment'] = sentiments
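The labeling loop above can also be written in vectorized form with `np.select`. This is just a sketch of the same rule, using a toy frame in place of the real tweet DataFrame:

```python
import numpy as np
import pandas as pd

# Toy positive/negative word counts standing in for the real columns
df = pd.DataFrame({"positive": [3, 0, 1], "negative": [1, 2, 1]})

df["polarity"] = df["positive"] - df["negative"]
# Label each row by the sign of its polarity
df["sentiment"] = np.select(
    [df["polarity"] > 0, df["polarity"] < 0],
    ["positive", "negative"],
    default="neutral",
)
```

The result is identical to the loop: rows with polarity 2, -2, and 0 get "positive", "negative", and "neutral" respectively.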
In this section, we want to showcase what the dataset looks like after this manipulation and cleanup. Below are the shape of the dataset and the variable type of each column.
df.shape
df.dtypes
In the dataset, there are now 8165 Tweets dated from July 16th, 2015 to March 5th, 2017 and 17 columns. Below is a sample of what the dataset looks like now.
df.sample(n=10)
To answer our second question, we needed to determine the most frequently used words by Trump. To do this, we separated each tweet into individual words and then took a count of each word.
We then used a library called 'wordcloud' to visualize our data better.
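The counting step can be sketched as below. The two tweets are a hypothetical sample standing in for `df['Text']`, and a regex split stands in for the TweetTokenizer; the resulting `Counter` is the kind of frequency mapping that wordcloud's `WordCloud.generate_from_frequencies` can consume:

```python
import re
from collections import Counter

# Two hypothetical tweets standing in for df['Text']
tweets = ["Make America great again!", "We will make America strong again!"]

# Tally lowercase word frequencies across all tweets
word_counts = Counter()
for tweet in tweets:
    word_counts.update(re.findall(r"[a-z']+", tweet.lower()))

print(word_counts.most_common(3))
```

In practice you would also drop stopwords ("the", "and", and so on) before visualizing, or the cloud is dominated by function words.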
We are interested in all of the new columns that we have added, along with the date and the number of favorites and retweets. We first want to see the distribution of emotions in his tweets. For some of the emotions, we see a value of 0, and that's because the lexicon does not associate some of Trump's words with any emotion. As we examine the emotion counts, we are not so interested in the distribution of each individual emotion, because it is almost impossible to pinpoint one emotion for one tweet. In addition, the emotion columns are more of a tally of how many times each emotion appears in a tweet. We are more intrigued by the overall distribution of emotions across his 8000+ tweets. Below is a bar chart of that data.
# Bar chart of total emotion counts
emotions = ['anger', 'anticipation', 'disgust', 'fear', 'joy', 'sadness', 'surprise', 'trust']
emotion_sums = [df[e].sum() for e in emotions]
pos = np.arange(len(emotions))
width = 1.0
ax = plt.axes()
ax.set_xticks(pos + (width / 2))
ax.set_xticklabels(emotions)
# Graph the histogram
rects = plt.bar(pos, emotion_sums, width, color='b')
# Attach a text label for each bar
def autolabel(rects):
    for rect in rects:
        height = rect.get_height()
        ax.text(rect.get_x() + rect.get_width()/2., 1.05*height,
                '%d' % int(height),
                ha='center', va='bottom')
autolabel(rects)
plt.show()
It is surprising to us that trust receives the highest count. We thought anger or disgust would take the lead because Trump quite often bashes his opponents, the state of our country, and President Obama's policies, but perhaps using words that inspire trust is part of the reason he has so many loyal supporters.
Next, we want to examine the distribution of polarity so that we can try to figure out whether Trump's emotions and sentiments change over time. Later in the analysis, we would like to identify major dates during the election and see if we can pinpoint the exact date(s) Trump's emotions and sentiments shift.
df['polarity'].describe()
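One way to look for shifts over time, assuming the CSV's date column parses cleanly, is to average polarity per month; the three rows below are hypothetical stand-ins for the real data:

```python
import pandas as pd

# Hypothetical mini-frame; in the notebook, Date and polarity come from the real dataset
df = pd.DataFrame({
    "Date": pd.to_datetime(["2016-07-01", "2016-07-15", "2016-08-03"]),
    "polarity": [2, -1, 1],
})

# Mean polarity per calendar month
monthly = df.groupby(df["Date"].dt.to_period("M"))["polarity"].mean()
print(monthly)
```

Sharp drops or spikes in that monthly series would be the natural places to start matching against major campaign dates.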
Last but not least, we are also interested in the distribution of favorites and retweets. We want to see how the sentiment and emotions in Trump's tweets affect people's responses on Twitter. Below is a scatter plot of those two columns.
plt.scatter(df['Favorites'], df['Retweets'], alpha=0.5)
plt.xlim(-100, 51000)
plt.ylim(-100, 21000)
plt.xlabel('Favorites')
plt.ylabel('Retweets')
plt.show()
We group the observations by sentiment and provide a sample of the data, as well as the number of rows in each grouping. We are not so interested in the neutral tweets because it is hard to determine which side of the sentiment divide they fall on.
positive_tweets = df.groupby('sentiment').get_group('positive')
positive_tweets.sample(n=2)
positive_tweets.shape
negative_tweets = df.groupby('sentiment').get_group('negative')
negative_tweets.sample(n=2)
negative_tweets.shape
Next, we plot both subsets to examine the favorite and retweet counts in scatter plots.
plt.scatter(positive_tweets['Favorites'], positive_tweets['Retweets'], alpha=0.5)
plt.xlim(-100, 51000)
plt.ylim(-100, 21000)
plt.xlabel('Favorites - Positive')
plt.ylabel('Retweets - Positive')
plt.show()
plt.scatter(negative_tweets['Favorites'], negative_tweets['Retweets'], alpha=0.5)
plt.xlim(-100, 51000)
plt.ylim(-100, 21000)
plt.xlabel('Favorites - Negative')
plt.ylabel('Retweets - Negative')
plt.show()
Surprisingly, Trump issued more than double the number of positive tweets as negative tweets. However, when he published a negative tweet, its favorite and retweet counts tended to be much higher than those of a positive tweet.
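One way to quantify that engagement gap is to compare median favorites and retweets per sentiment group; the numbers below are made up for illustration, since the real values live in the DataFrame's Favorites and Retweets columns:

```python
import pandas as pd

# Hypothetical engagement numbers standing in for the real Favorites/Retweets columns
df = pd.DataFrame({
    "sentiment": ["positive", "positive", "negative", "negative"],
    "Favorites": [100, 300, 900, 1100],
    "Retweets": [50, 70, 400, 600],
})

# Median engagement per sentiment group
medians = df.groupby("sentiment")[["Favorites", "Retweets"]].median()
print(medians)
```

Medians are a reasonable choice here because a handful of viral tweets can drag the mean far away from the typical tweet.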
For the bivariate analysis, we use the correlation data from the pandas library and generate a heat map with the seaborn library to see how the variables covary. We decided to run this over the entire dataframe, even though some columns might not be relevant, just in case an interesting trend pops up.
df.corr()  # newer pandas versions require df.corr(numeric_only=True) on mixed-type frames
sns.heatmap(df.corr(), square=True)
plt.show()
Although not all the squares are worth examining, we are glad to see that some of the data in this section is consistent with the earlier sections. For example, there is a stronger positive correlation between negative sentiment and the favorite and retweet counts than between positive sentiment and those counts.
After this exercise, our assumptions are shattered. We are surprised at how many more positive tweets Trump sends than negative ones, as well as by the finding that his tweets generally contain many words associated with trust. We were quite certain his tweets would lean toward the more negative emotions.
Our group is definitely limited by the lexicon that we are using to analyze Trump's tweets. Perhaps we need to think about using multiple lexicons to see if we are obtaining similar results across the board. We are also worried about our method of calculating the sentiment. Right now, we are simply taking the difference between the positive and negative word counts, but should neutral be when both counts are 0 or is it valid to claim that the tweet is also neutral if the two counts cancel each other out?