Analysis of Donald Trump's Tweets

My team and I are interested in President Donald Trump's tweets from the popular social media platform Twitter because they deliver the opinions and thoughts of this controversial political figure straight to the public. From multiple sources, we were able to retrieve a large CSV file of his tweets, spanning from the early days of his 2016 presidential campaign all the way up to March 5th, 2017. These are the questions we plan on answering at the end:

1. On average, do Trump's tweets tend to be positive or negative?
2. What are Trump's most used words in his tweets?
3. What emotions do Trump’s tweets convey?

Data Preparation

In [1]:
# Set-up
import pandas as pd
import numpy as np
import csv as csv
from nltk.tokenize import TweetTokenizer
from collections import OrderedDict, defaultdict, Counter
import seaborn as sns # for visualization
import statsmodels.api as sm
import statsmodels.formula.api as smf
import matplotlib
import matplotlib.pyplot as plt # plotting
In [2]:
df = pd.read_csv('./data/Trump-Tweets.csv')
df.columns
Out[2]:
Index(['Tweet_Num', 'Date', 'Text', 'Favorites', 'Retweets'], dtype='object')

Here, we read the entire CSV file into a dataframe and display all of the columns in the dataset.

The resource we will be utilizing to conduct the sentiment analysis is the NRC Word-Emotion Association Lexicon, which associates words with eight basic emotions (anger, fear, anticipation, trust, surprise, sadness, joy, and disgust) as well as with positive or negative sentiment. The categorizations were done manually by human annotators. These become our new columns, which provide more information about the content of each tweet. In addition, we have added a variable called "polarity" that captures the magnitude of a tweet's sentiment: it is simply the negative column subtracted from the positive column. Last but not least, we added the variable "sentiment" that labels the tweet as positive, negative, or neutral based on the polarity.

In [3]:
# Gather and isolate all the tweets
tweets = df['Text']
wordList = defaultdict(list)
emotionList = defaultdict(list)

# Process the text file and put the lexicon into a dictionary.
with open('./data/NRC-Emotion-Lexicon-Wordlevel-v0.92.txt', 'r') as f:
    reader = csv.reader(f, delimiter='\t')
    for word, emotion, present in reader:
        if int(present) == 1:
            wordList[word].append(emotion)
            emotionList[emotion].append(word)
            
tt = TweetTokenizer()

def generate_emotion_count(string, tokenizer):
    emoCount = Counter()
    for token in tokenizer.tokenize(string):
        token = token.lower()
        # wordList is a defaultdict, so unknown tokens yield an empty list.
        emoCount += Counter(wordList[token])
    return emoCount

# Go through all the tweets with the lexicon and figure out the emotions and sentiments.
emotionCounts = [generate_emotion_count(tweet, tt) for tweet in tweets]
emotion = pd.DataFrame(emotionCounts, index=tweets.index)

# Fill in 0s if the emotion is not detected.
emotion = emotion.fillna(0)
In [4]:
# Join the original dataset and the sentiment and emotion counts
df = pd.concat([df, emotion], axis=1, join='inner')

# Adds polarity and sentiments 
df['polarity'] = df['positive'] - df['negative']

sentiments = []
for polarity in df['polarity']:
    if polarity > 0:
       sentiments.append("positive")
    elif polarity < 0:
       sentiments.append("negative")
    else:
       sentiments.append("neutral")
    
df['sentiment'] = sentiments

Describing the Data Structure

In this section, we want to showcase what the dataset looks like now after some manipulation and cleaning up. Below is the shape of the dataset and the variable type of each column.

In [5]:
df.shape
Out[5]:
(8165, 17)
In [6]:
df.dtypes
Out[6]:
Tweet_Num         int64
Date             object
Text             object
Favorites         int64
Retweets          int64
anger           float64
anticipation    float64
disgust         float64
fear            float64
joy             float64
negative        float64
positive        float64
sadness         float64
surprise        float64
trust           float64
polarity        float64
sentiment        object
dtype: object

In the dataset, there are now 8,165 tweets, dated from July 16th, 2015 to March 5th, 2017, and 17 columns. Below is a sample of what the dataset looks like now.

In [7]:
df.sample(n=10)
Out[7]:
Tweet_Num Date Text Favorites Retweets anger anticipation disgust fear joy negative positive sadness surprise trust polarity sentiment
5226 5237 16-05-07 I am going to keep our jobs in the US and tota... 19114 6576 0.0 1.0 0.0 0.0 0.0 1.0 0.0 1.0 0.0 0.0 -1.0 negative
4809 4820 16-03-24 NATO is obsolete and must be changed to additi... 15376 4949 1.0 0.0 1.0 1.0 0.0 1.0 1.0 1.0 0.0 0.0 0.0 neutral
2946 2950 15-12-06 Trump is not controlled by donorsspecial int... 2422 831 0.0 0.0 0.0 0.0 0.0 0.0 1.0 0.0 1.0 0.0 1.0 positive
2450 2453 15-11-10 OuTsTaNdInG Is obvious that Jr has much of ... 1182 459 0.0 0.0 0.0 1.0 2.0 1.0 3.0 0.0 0.0 2.0 2.0 positive
7937 7948 17-01-20 Today we are not merely transferring power fro... 111258 20552 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 neutral
2673 2676 15-11-21 When you become president we will avenge all... 2304 718 0.0 0.0 0.0 0.0 0.0 0.0 1.0 0.0 0.0 1.0 1.0 positive
695 696 15-08-07 I really enjoyed the debate tonight even thoug... 9651 4042 0.0 1.0 0.0 0.0 1.0 0.0 3.0 0.0 1.0 2.0 3.0 positive
1867 1868 15-10-18 Just reported by CNN that the Trump halo effec... 2550 1011 1.0 0.0 0.0 1.0 1.0 1.0 2.0 1.0 1.0 1.0 1.0 positive
6086 6097 16-07-29 In Hillary Clintons America things get worse ... 26662 12781 0.0 0.0 0.0 1.0 0.0 1.0 0.0 1.0 0.0 0.0 -1.0 negative
1691 1692 15-10-13 You Cant Stump the Trump 6406 6057 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 1.0 0.0 0.0 neutral

Univariate Analysis

To answer our second question, we needed to determine the most frequently used words by Trump. To do this, we separated each tweet into individual words and then took a count of each word.
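A minimal sketch of that counting step, using a couple of made-up tweet strings in place of `df['Text']` (the notebook tokenizes with nltk's `TweetTokenizer`; a plain `split()` is used here just to illustrate the tally):

```python
from collections import Counter

# Hypothetical strings standing in for df['Text'].
sample_tweets = [
    "Make America great again",
    "America will be great again",
]

word_counts = Counter()
for tweet in sample_tweets:
    # Lowercase so "America" and "america" are tallied together.
    word_counts.update(word.lower() for word in tweet.split())

# On the real data, word_counts.most_common(10) yields the top-10 list.
print(word_counts.most_common(3))
```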

Trump's Top 10 Most Frequently Used Words

We then used a library called 'wordcloud' to visualize our data better.

Wordcloud Visualization of Trump's Most Used Words

We are interested in all of the new columns that we have added, along with the date and the number of favorites and retweets. We first want to see the distribution of emotions in his tweets. For some of the emotions, we see a value of 0, because the lexicon does not associate some of Trump's words with any emotion. As we examine the emotion counts, we are not so interested in the distribution of each individual emotion, because it is almost impossible to pinpoint one emotion for one tweet. In addition, the emotion columns are more of a tally of how many times that emotion surfaces in the tweet. We are more intrigued by the overall distribution of emotions across his 8,000+ tweets. Below is a bar chart of that data.

In [8]:
# Bar chart of total emotion counts
emotions = ['anger', 'anticipation', 'disgust', 'fear', 'joy', 'sadness', 'surprise', 'trust']
emotion_sums = [df[emotion].sum() for emotion in emotions]

pos = np.arange(len(emotions))
width = 1.0
ax = plt.axes()
ax.set_xticks(pos + (width / 2))
ax.set_xticklabels(emotions)

# Draw the bars
rects = plt.bar(pos, emotion_sums, width, color='b')

# Attach a text label for each bar
def autolabel(rects):
    for rect in rects:
        height = rect.get_height()
        ax.text(rect.get_x() + rect.get_width()/2., 1.05*height,
                '%d' % int(height),
                ha='center', va='bottom')
autolabel(rects)
plt.show()

It is surprising to us that trust receives the highest count. We thought anger or disgust would take the lead because Trump bashes his opponents, the state of our country, and President Obama's policies quite often, but perhaps using words that instill trust is the reason why he has so many loyal supporters.

Next, we want to examine the distribution of polarity so that we can try to figure out whether Trump's emotions and sentiments change over time. Later in the analysis, we would like to identify major dates during the election and see if we can pinpoint the exact date(s) Trump's emotions and sentiments shift.

In [9]:
df['polarity'].describe()
Out[9]:
count    8165.000000
mean        0.323209
std         1.279625
min        -6.000000
25%         0.000000
50%         0.000000
75%         1.000000
max         7.000000
Name: polarity, dtype: float64
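The change-over-time question can be sketched by parsing the Date column and averaging polarity per month. The rows below are made up for illustration; only the column names and the 'yy-mm-dd' date format follow the samples shown earlier.

```python
import pandas as pd

# Made-up rows standing in for the real dataframe; the 'yy-mm-dd' date format
# matches the sample rows, but these polarity values are invented.
sketch = pd.DataFrame({
    'Date': ['15-07-16', '15-08-07', '16-03-24', '17-01-20'],
    'polarity': [1.0, 3.0, 0.0, -1.0],
})
sketch['Date'] = pd.to_datetime(sketch['Date'], format='%y-%m-%d')

# Average polarity per calendar month; spotting a jump or drop in this series
# is one way to look for pivotal dates in the campaign.
monthly = sketch.groupby(sketch['Date'].dt.to_period('M'))['polarity'].mean()
print(monthly)
```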

Last but not least, we are also interested in the distribution of favorites and retweets. We want to see how the sentiment and emotions in Trump's tweets affect people's responses on Twitter. Below is a scatter plot of those two columns.

In [10]:
plt.scatter(df['Favorites'], df['Retweets'], alpha=0.5)
plt.xlim(-100, 51000)
plt.ylim(-100, 21000)
plt.xlabel('Favorites')
plt.ylabel('Retweets')
plt.show()

Univariate Analysis by Category

We group the observations by sentiment and provide a sample of the data, as well as the number of rows in each group. We are not so interested in the neutral tweets because it is hard to determine which side of the sentiment scale they fall on.

In [11]:
positive_tweets = df.groupby('sentiment').get_group('positive')
positive_tweets.sample(n=2)
Out[11]:
Tweet_Num Date Text Favorites Retweets anger anticipation disgust fear joy negative positive sadness surprise trust polarity sentiment
4525 4534 16-03-06 I will be interviewed on this morning Enjoy 6605 1476 0.0 1.0 0.0 0.0 1.0 0.0 1.0 0.0 0.0 1.0 1.0 positive
3850 3859 16-01-28 Our ReTrumplican Trump support group of 9500... 4441 1338 0.0 1.0 0.0 1.0 0.0 0.0 1.0 0.0 1.0 0.0 1.0 positive
In [12]:
positive_tweets.shape
Out[12]:
(3358, 17)
In [13]:
negative_tweets = df.groupby('sentiment').get_group('negative')
negative_tweets.sample(n=2)
Out[13]:
Tweet_Num Date Text Favorites Retweets anger anticipation disgust fear joy negative positive sadness surprise trust polarity sentiment
6060 6071 16-07-27 No matter what Bill Clinton says and no matter... 40786 10906 1.0 0.0 1.0 0.0 0.0 1.0 0.0 0.0 1.0 0.0 -1.0 negative
6977 6988 16-10-17 I will sign the first bill to repeal #Obamacar... 39756 14929 0.0 0.0 0.0 0.0 0.0 1.0 0.0 1.0 0.0 0.0 -1.0 negative
In [14]:
negative_tweets.shape
Out[14]:
(1585, 17)

Next, we graph both subsets of the dataset to examine the favorite and retweet counts in scatter plots.

In [15]:
plt.scatter(positive_tweets['Favorites'], positive_tweets['Retweets'], alpha=0.5)
plt.xlim(-100, 51000)
plt.ylim(-100, 21000)
plt.xlabel('Favorites - Positive')
plt.ylabel('Retweets - Positive')
plt.show()
In [16]:
plt.scatter(negative_tweets['Favorites'], negative_tweets['Retweets'], alpha=0.5)
plt.xlim(-100, 51000)
plt.ylim(-100, 21000)
plt.xlabel('Favorites - Negative')
plt.ylabel('Retweets - Negative')
plt.show()

Surprisingly, Trump issued more than double the number of positive tweets as negative tweets. However, when he published a negative tweet, the favorite and retweet counts tended to be much higher than for a positive tweet.
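That tendency can be checked directly with a groupby over the sentiment label. As a hedged sketch, the rows below reuse a few favorite/retweet figures from the sample tables above rather than the full dataframe; on the real data the same one-liner, `df.groupby('sentiment')[['Favorites', 'Retweets']].mean()`, applies.

```python
import pandas as pd

# Hypothetical rows built from a handful of favorite/retweet figures seen in
# the sample tables; not the full 8,165-tweet dataset.
sketch = pd.DataFrame({
    'sentiment': ['positive', 'positive', 'negative', 'negative'],
    'Favorites': [9651, 6605, 40786, 39756],
    'Retweets': [4042, 1476, 10906, 14929],
})

# Mean engagement per sentiment group.
engagement = sketch.groupby('sentiment')[['Favorites', 'Retweets']].mean()
print(engagement)
```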

Bivariate analysis

For the bivariate analysis, we use the correlation matrix from the pandas library and generate a heat map with the seaborn library to see how the variables covary. We decided to conduct this over the entire dataframe, even though some data might not be relevant, just in case an interesting trend pops up.

In [17]:
df.corr()
Out[17]:
Tweet_Num Favorites Retweets anger anticipation disgust fear joy negative positive sadness surprise trust polarity
Tweet_Num 1.000000 0.583231 0.530601 0.073896 0.022659 0.047936 0.096024 -0.012950 0.082636 0.014020 0.060119 -0.121790 0.010273 -0.046505
Favorites 0.583231 1.000000 0.895866 0.093647 0.044141 0.079879 0.128372 0.039444 0.122194 0.069911 0.095723 -0.061914 0.062710 -0.031240
Retweets 0.530601 0.895866 1.000000 0.096620 0.018558 0.077048 0.129011 0.014691 0.121908 0.035259 0.093792 -0.066974 0.043102 -0.057484
anger 0.073896 0.093647 0.096620 1.000000 0.171143 0.626407 0.620694 0.118188 0.729193 0.101578 0.653393 0.213901 0.135115 -0.427261
anticipation 0.022659 0.044141 0.018558 0.171143 1.000000 0.067284 0.133885 0.598991 0.142526 0.420231 0.168185 0.389041 0.396646 0.222005
disgust 0.047936 0.079879 0.077048 0.626407 0.067284 1.000000 0.487620 0.026048 0.622289 0.040481 0.569666 0.095019 0.072083 -0.399880
fear 0.096024 0.128372 0.129011 0.620694 0.133885 0.487620 1.000000 0.016559 0.611319 0.049416 0.572460 0.090770 0.087923 -0.385468
joy -0.012950 0.039444 0.014691 0.118188 0.598991 0.026048 0.016559 1.000000 0.049901 0.617947 0.105668 0.459002 0.543210 0.436994
negative 0.082636 0.122194 0.121908 0.729193 0.142526 0.622289 0.611319 0.049901 1.000000 0.058183 0.746742 0.144957 0.096740 -0.647837
positive 0.014020 0.069911 0.035259 0.101578 0.420231 0.040481 0.049416 0.617947 0.058183 1.000000 0.082592 0.298409 0.617434 0.722796
sadness 0.060119 0.095723 0.093792 0.653393 0.168185 0.569666 0.572460 0.105668 0.746742 0.082592 1.000000 0.205958 0.139632 -0.453898
surprise -0.121790 -0.061914 -0.066974 0.213901 0.389041 0.095019 0.090770 0.459002 0.144957 0.298409 0.205958 1.000000 0.305319 0.127364
trust 0.010273 0.062710 0.043102 0.135115 0.396646 0.072083 0.087923 0.543210 0.096740 0.617434 0.139632 0.305319 1.000000 0.404179
polarity -0.046505 -0.031240 -0.057484 -0.427261 0.222005 -0.399880 -0.385468 0.436994 -0.647837 0.722796 -0.453898 0.127364 0.404179 1.000000
In [18]:
sns.heatmap(df.corr(), square=True)
plt.show()

Although not all of the squares are worth examining, we are glad to see that some of the data in this section are consistent with the earlier sections. For example, negative sentiment has a stronger positive correlation with the favorite and retweet counts than positive sentiment does.

Summary

After this exercise, our assumptions are shattered. We are surprised at how many more positive tweets Trump sends than negative tweets, as well as finding out that, generally, his tweets actually contain many words that instill trust. We were very certain that his tweets were going to lean toward the more negative emotions.

Our group is definitely limited by the lexicon we are using to analyze Trump's tweets. Perhaps we should consider using multiple lexicons to see if we obtain similar results across the board. We are also worried about our method of calculating sentiment. Right now, we simply take the difference between the positive and negative word counts, but should a tweet only count as neutral when both counts are 0, or is it valid to claim the tweet is also neutral when the two counts cancel each other out?
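One way to act on that concern, as a hedged sketch: keep the current polarity rule but reserve "neutral" for tweets with no sentiment-bearing words at all, labeling equal nonzero counts as "mixed" instead. The function below illustrates that alternative; it is not the rule the notebook actually uses.

```python
def label_sentiment(pos, neg, strict_neutral=False):
    """Label a tweet from its positive/negative word counts.

    With strict_neutral=True, 'neutral' requires both counts to be zero;
    equal nonzero counts are labeled 'mixed' instead.
    """
    if pos > neg:
        return "positive"
    if neg > pos:
        return "negative"
    if strict_neutral and pos > 0:
        return "mixed"
    return "neutral"

print(label_sentiment(2, 2))                       # neutral under the current rule
print(label_sentiment(2, 2, strict_neutral=True))  # mixed under the stricter rule
```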