You will click this link.
It is on the top.

A data story through Wikispeedia and beyond.


Wikipedia is a beast. But it is also your best friend. Haven’t we all gone down the Wikipedia rabbit hole, starting the day researching Napoleon and ending up on the page about his mother's house? Here, the link structure of Wikipedia is quite helpful. Napoleon's page links to the page about his mother's life, where you can find that she lived at Palazzo Bonaparte in Rome until she died at 85 years of age.

This is the core idea that should determine the link structure of Wikipedia: each page should link to related pages.



This intuition is what the game Wikispeedia builds on. The game challenges the player to get from page A to page B using only the links on each page. Players have developed many interesting strategies, including getting to the United States page as fast as possible, since it is the most well-connected page.
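To make the game concrete: finding the fewest clicks from page A to page B amounts to a shortest-path search over the link graph. Below is a minimal sketch using breadth-first search. The link graph and the function name are made up for illustration and are not taken from the Wikispeedia data.

```python
from collections import deque

# Hypothetical toy link graph: page -> pages it links to (not real Wikispeedia data).
LINKS = {
    "Napoleon": ["France", "Corsica"],
    "Corsica": ["France", "Mediterranean Sea"],
    "France": ["United States", "Europe"],
    "Europe": ["France"],
    "Mediterranean Sea": ["Europe"],
    "United States": ["Jazz", "France"],
    "Jazz": [],
}


def shortest_click_path(links, start, target):
    """Return one shortest list of pages from start to target, or None if unreachable."""
    if start == target:
        return [start]
    previous = {start: None}  # page -> page we arrived from
    queue = deque([start])
    while queue:
        page = queue.popleft()
        for next_page in links.get(page, []):
            if next_page in previous:
                continue  # already reached via a path that is at least as short
            previous[next_page] = page
            if next_page == target:
                # Walk back through the predecessors to rebuild the path.
                path = [next_page]
                while previous[path[-1]] is not None:
                    path.append(previous[path[-1]])
                return path[::-1]
            queue.append(next_page)
    return None


path = shortest_click_path(LINKS, "Napoleon", "Jazz")
print(" -> ".join(path))
```

In this toy graph the shortest route passes through the well-connected United States page, which mirrors the hub strategy described above.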

Try it yourself!

Do you find the links of 2007 Wikipedia intuitive? What strategies do you find effective?


Our Dataset

Our dataset combines two sources: Wikispeedia data from Stanford University and Wikipedia data from the website itself.

Collaborative authorship is the key that allowed Wikipedia to grow into the beast that it is. Wikipedia has around 44.8 million registered users who can contribute. In November 2022, more than 120 000 individuals submitted updates to Wikipedia pages. For every page of the Stanford dataset that still exists, we scraped its network information from Wikipedia for each year from 2008 to 2022.

On top of the network, Wikispeedia also gave us the paths that players took while playing the game. Stanford University collected data on ~76000 of these games, recording users' navigation through ~4600 Wikipedia pages in 2007.

What does our data look like?

Let's take a first look at the categories of the Wikipedia pages.
Science is the most common category, followed by Geography.

Great, but what about the pages themselves?

Next, we computed an embedding of the network generated by Wikispeedia and projected it down to three dimensions. The closer the pages appear to each other, the closer they are in semantic space. In general, pages of the same category appear close to each other. In particular, if we filter by Science, we can see that there are dense clusters in Physics and Chemistry. Additionally, notice that under Science, the Bird cluster is much closer to the Dinosaur cluster. In contrast, pages in Geography are more spread out. Wikispeedia players apparently find it easier to relate places to other topics.

This relationship of semantic similarity is also present across pages.

Try selecting Music and IT only from the category selection on the right side of the Wikipedia graph. Notice that the IT page closest to the largest Music cluster is Napster, a music "sharing" website popular in 2007.

What are the differences between the Wikispeedia paths and the actual Wikipedia pages?

Now select only Music in all 3 graphs, including the Wikispeedia paths. In Wikipedia, the main Music cluster forms a crescent shape, with classical music pages like The Rite of Spring and Venus and Adonis on one end and modern music like Iron Maiden and Queen on the other. In contrast, the classical and modern music clusters are more distinctly separated in the Wikispeedia paths embeddings. This suggests that Wikispeedia players find the relationships within these clusters much more intuitive than those between them.

Try playing with the graph and drawing connections between different pages!

But what does this mean?

The in-degree and out-degree of a page, that is, the number of links pointing to it and the number of links leaving it, are not good predictors of how "important" the page is. A set of obscure pages that are well connected among themselves should not gain importance from that alone.
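To illustrate why raw degree counts can mislead, here is a small sketch that tallies in-degree and out-degree from a list of links. The edge list is invented for illustration: three obscure pages link only to each other, while one well-known page receives links from unrelated pages.

```python
from collections import Counter

# Hypothetical edge list of (source page, target page); names are illustrative only.
EDGES = [
    # Three obscure pages that link only among themselves.
    ("Obscure_A", "Obscure_B"), ("Obscure_A", "Obscure_C"),
    ("Obscure_B", "Obscure_A"), ("Obscure_B", "Obscure_C"),
    ("Obscure_C", "Obscure_A"), ("Obscure_C", "Obscure_B"),
    # A well-known page that receives links from two unrelated pages.
    ("France", "United_States"), ("Jazz", "United_States"),
    ("United_States", "France"),
]


def degree_counts(edges):
    """Return (in_degree, out_degree) Counters keyed by page name."""
    in_degree, out_degree = Counter(), Counter()
    for source, target in edges:
        out_degree[source] += 1
        in_degree[target] += 1
    return in_degree, out_degree


in_degree, out_degree = degree_counts(EDGES)
for page in sorted(set(in_degree) | set(out_degree)):
    print(f"{page:15s} in={in_degree[page]} out={out_degree[page]}")
```

By these counts the three obscure pages look just as connected as United_States, even though no outside page links to them.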

This is where PageRank comes in. PageRank is a metric that measures the centrality of a page in the network: a page ranks highly when it is linked to by other highly ranked pages, not merely by many pages. Since our research question involves the development of pages and their incoming and outgoing links, PageRank is a suitable proxy for how important a page is, and tracking how these PageRank values evolve over time can give us insight into how the network is changing.
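For readers who want to see the mechanics, below is a minimal sketch of PageRank computed by power iteration. It assumes the commonly used damping factor of 0.85 and spreads the rank of pages without outgoing links uniformly over all pages. The toy graph and function name are illustrative, and this is not necessarily how the values in this story were computed.

```python
def pagerank(links, damping=0.85, iterations=100, tol=1e-10):
    """Approximate PageRank for a dict mapping page -> list of linked pages."""
    pages = set(links)
    for targets in links.values():
        pages.update(targets)
    pages = sorted(pages)
    n = len(pages)
    rank = {page: 1.0 / n for page in pages}  # start from a uniform distribution
    for _ in range(iterations):
        # Rank held by pages without outgoing links is spread over all pages.
        dangling = sum(rank[page] for page in pages if not links.get(page))
        new_rank = {page: (1.0 - damping) / n + damping * dangling / n for page in pages}
        for page in pages:
            targets = links.get(page, [])
            if targets:
                share = damping * rank[page] / len(targets)
                for target in targets:
                    new_rank[target] += share
        change = sum(abs(new_rank[page] - rank[page]) for page in pages)
        rank = new_rank
        if change < tol:
            break
    return rank


# Hypothetical toy graph (assumption): page -> pages it links to.
TOY_LINKS = {
    "United_States": ["France"],
    "France": ["United_States"],
    "Jazz": ["United_States"],
    "Napoleon": ["France", "United_States"],
    "Obscure_A": ["Obscure_B"],
    "Obscure_B": ["Obscure_A"],
}

ranks = pagerank(TOY_LINKS)
for page, value in sorted(ranks.items(), key=lambda item: item[1], reverse=True):
    print(f"{page:15s} {value:.3f}")
```

The values sum to 1, so each can be read as the share of attention a page receives. In this toy graph United_States ranks highest because it collects rank from several pages, while the two obscure pages only pass rank back and forth between themselves.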

PageRank values become more evenly distributed over time. This suggests that pages with very low PageRank are being developed and are becoming better linked.

Let us see how PageRank changes across categories.

The PageRank distribution of each category evolves over the years. The shaded distribution represents the PageRank distribution of the Wikispeedia graph. The most prominent difference can be seen in the Countries category: in the Wikipedia graph, its PageRank distribution is concentrated towards high values. In other words, Countries pages are considered to be very central. Wikispeedia users, however, seem to disagree: according to their paths, the Countries category is less important than the actual Wikipedia graph would suggest.

Wikipedia: An Evolving Network

Now let's take a look at which pages changed the most in terms of links. Each word cloud contains the top 20 pages with the greatest percentage increase in links. The size of each word is proportional to the increase.
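The ranking behind each word cloud can be sketched as follows, assuming "increase" means the relative change in a page's link count between two snapshots, expressed as a percentage. The counts, page names, and function name below are invented for illustration.

```python
def top_link_growth(links_before, links_after, k=20):
    """Return the k pages with the largest percentage increase in link count."""
    growth = {}
    for page, before in links_before.items():
        after = links_after.get(page)
        if after is None or before == 0:
            # Skip pages missing from the later snapshot or without a baseline.
            continue
        growth[page] = 100.0 * (after - before) / before
    return sorted(growth.items(), key=lambda item: item[1], reverse=True)[:k]


# Hypothetical link counts per page in two snapshots (assumption, not real data).
LINKS_2008 = {"Page_A": 10, "Page_B": 40, "Page_C": 5, "Page_D": 0}
LINKS_2022 = {"Page_A": 30, "Page_B": 50, "Page_C": 20}

top = top_link_growth(LINKS_2008, LINKS_2022, k=2)
for page, percent in top:
    print(f"{page}: +{percent:.0f}%")
```

Pages that start with zero links are skipped here because a percentage increase is undefined for them; a different treatment may be more appropriate depending on the data.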