“I’ve just received a call from Secretary Clinton,” said Donald Trump late in the night of Nov 8, 2016. The result of the 58th presidential election, an intense battle between Donald Trump and Hillary Clinton, remains a major upset that surprised many around the world. But was it really unpredictable? What can we learn about this election if we look only at the quotes attributed to the two candidates on the Internet during this historic election year? Such an analysis could give a glimpse into how a major democratic election works, and into the role that speech and its representation play in swaying citizens toward one party or another. Media biases have deep implications for our lives and have grown more extreme in this noisy world. With this in mind, and using only quotes by these two candidates, we will try to understand how they ran their political campaigns.

Although the outcome was not impossible to predict, just a few years earlier very few Americans would have believed that Donald Trump would one day sit in the White House.


The Dataset

This data story is based on Quotebank, a corpus of close to 200 million quotes extracted from the Internet using state-of-the-art technology. The dataset spans August 2008 to April 2020, with quotes extracted from more than 377 000 web domains.

This investigation uses only quotes from 2016 whose predicted speaker is either “Hillary Clinton” or “Donald Trump”. It is important to note that Quotebank ran into technical issues while scraping 2016 quotations, which resulted in three periods with very few extracted quotes. For this analysis, any quote falling inside these periods was removed. After this and a few other (more trivial) filtering steps, 140 000 quotes remain for the analysis. The dataset contains a disproportionately high share of Trump quotations. The distribution of quotes is presented in the two graphs below.
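The filtering described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the pipeline actually used: the record keys (`speaker`, `date`, `text`), the sample records, and above all the dates in `SPARSE_PERIODS` are hypothetical placeholders, since the text does not give the real dates of the three sparse periods.

```python
from collections import Counter
from datetime import date

CANDIDATES = {"Hillary Clinton", "Donald Trump"}

# Placeholder windows only: the real sparse periods are not stated in the text.
SPARSE_PERIODS = [
    (date(2016, 1, 10), date(2016, 1, 20)),
    (date(2016, 5, 1), date(2016, 5, 7)),
    (date(2016, 9, 15), date(2016, 9, 25)),
]


def in_sparse_period(day, periods=SPARSE_PERIODS):
    """Return True if `day` falls inside any (start, end) window, inclusive."""
    return any(start <= day <= end for start, end in periods)


def keep_quote(quote):
    """Keep 2016 quotes by one of the two candidates, outside sparse periods."""
    day = quote["date"]
    return (
        day.year == 2016
        and quote["speaker"] in CANDIDATES
        and not in_sparse_period(day)
    )


# Made-up records with placeholder text (no real quotations).
quotes = [
    {"speaker": "Donald Trump", "date": date(2016, 3, 1), "text": "quote A"},
    {"speaker": "Hillary Clinton", "date": date(2016, 7, 28), "text": "quote B"},
    {"speaker": "Donald Trump", "date": date(2016, 1, 15), "text": "quote C"},
    {"speaker": "Bernie Sanders", "date": date(2016, 2, 1), "text": "quote D"},
    {"speaker": "Donald Trump", "date": date(2015, 12, 31), "text": "quote E"},
    {"speaker": "Donald Trump", "date": date(2016, 11, 9), "text": "quote F"},
]

kept = [q for q in quotes if keep_quote(q)]

# Share of the remaining quotes attributed to each speaker.
counts = Counter(q["speaker"] for q in kept)
shares = {speaker: n / len(kept) for speaker, n in counts.items()}
```

Computing `shares` right after filtering makes the imbalance between the two speakers explicit before any further analysis is run.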


The reader is encouraged to keep the distribution of quotes in mind, in particular each speaker’s share, because the data story refers to it and its implications throughout. It is essential to highlight a few caveats:

Quotations could have been falsely assigned to the candidates during the Quotebank extraction process.
Quotations might not have been said by the speaker for other reasons (fake news, human error, etc.).
The content of a quote does not necessarily allow us to draw a conclusion about a person’s beliefs or manner of speech.

As a result of these caveats, we encourage everyone to take this report for what it is:

An interesting (and fun) investigation into what conclusions can be drawn from how the Internet quoted Hillary Clinton and Donald Trump during the 2016 election year.

Media bias

What is the bias?

Before diving into any analysis of Trump’s and Clinton’s speech, it is important to assess the bias in our dataset. In our case the dataset is a list of quotes coming from various media outlets, whose coverage of two highly polarizing figures is potentially biased. Since our speakers are politicians, we should know whether our quotes come mostly from left- or right-leaning outlets. AllSides presents news articles in context by keeping track of the general political bias of news outlets. In this part we use their data, which we obtained on Kaggle.

Our quotes come from nearly 5 200 different websites, while the AllSides database contains about 400 media outlets, each with an assigned bias (“Left”, “Left-center”, “Right”…) and information on how many people agree with this rating. You can rate the outlets yourself on their website. Although their ratings are informed by more than public votes, this should still raise an eyebrow: it is likely that the people who vote come predominantly from one political side, which would, ironically, bias our bias ratings.
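One way to attach a bias label to each quote is to reduce the quote’s URL to its domain and look that domain up in the bias table. The sketch below assumes that every rated outlet can be mapped to a web domain, which the text does not confirm; the `OUTLET_BIAS` entries, the URLs, and the helper names `domain_of` and `bias_of` are hypothetical.

```python
from urllib.parse import urlparse

# Hypothetical excerpt of a domain -> bias table (not real AllSides entries).
OUTLET_BIAS = {
    "example-left.com": "Left",
    "example-center.org": "Center",
    "example-right.net": "Right",
}


def domain_of(url):
    """Return the lowercase host of `url` without a leading 'www.'."""
    host = urlparse(url).hostname or ""
    if host.startswith("www."):
        host = host[len("www."):]
    return host


def bias_of(url, table=OUTLET_BIAS):
    """Return the bias label for the outlet behind `url`, or None if unrated."""
    return table.get(domain_of(url))


# Made-up quote URLs for illustration.
quote_urls = [
    "https://www.example-left.com/politics/article-1",
    "http://example-right.net/news/article-2",
    "https://blog.unrated-site.example/post-3",
]

labels = [bias_of(u) for u in quote_urls]

# Fraction of quotes whose source has a bias rating.
coverage = sum(label is not None for label in labels) / len(labels)
```

With roughly 5 200 source websites and about 400 rated outlets, many lookups will return `None`, so tracking `coverage` shows how much of the dataset the bias analysis actually covers.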

Bias           Median confidence*
Left           2.24
Left-center    0.97
Center         0.90
Right-center   1.08
Right          1.97

*Confidence is the ratio of “agree” votes to “disagree” votes (#Agree / #Disagree).
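The median confidence per bias category can be computed as sketched below. The outlet records and their vote counts are made up for illustration, and the keys `bias`, `agree`, and `disagree` are assumed names; skipping outlets with no “disagree” votes is a choice of this sketch, not something stated in the text.

```python
from collections import defaultdict
from statistics import median

# Made-up outlets and vote counts, for illustration only.
outlets = [
    {"name": "Outlet A", "bias": "Left", "agree": 300, "disagree": 100},
    {"name": "Outlet B", "bias": "Left", "agree": 200, "disagree": 100},
    {"name": "Outlet C", "bias": "Left", "agree": 100, "disagree": 100},
    {"name": "Outlet D", "bias": "Center", "agree": 50, "disagree": 100},
    {"name": "Outlet E", "bias": "Center", "agree": 150, "disagree": 100},
    {"name": "Outlet F", "bias": "Center", "agree": 10, "disagree": 0},
]


def median_confidence_by_bias(records):
    """Group outlets by bias and take the median of agree / disagree."""
    ratios = defaultdict(list)
    for record in records:
        if record["disagree"] == 0:
            continue  # ratio undefined; skipped in this sketch
        ratios[record["bias"]].append(record["agree"] / record["disagree"])
    return {bias: median(values) for bias, values in ratios.items()}


result = median_confidence_by_bias(outlets)
```

A ratio above 1 means more voters agree with the rating than disagree with it; the median is less sensitive than the mean to a few outlets with very lopsided votes.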

About 39% of the outlets are classified as “Center”, 38% as “Left-center” or “Left”, and 23% as “Right-center” or “Right”. Perhaps unsurprisingly, the ratings are less equivocal for more extreme outlets: the median confidence is about 2 for “Left” and “Right”, against roughly 1 for the other categories.

Can we detect media bias?

Now that we have collated the bias data with our quotes, the intriguing question is how the sentiment of a quote relates to the bias of the outlet that published it. Perhaps a left-wing outlet would pick more negative quotes from Trump and more positive quotes from Clinton?