Common knowledge is shared, but people reason from it in different ways.

Introduction

Wikipedia is usually viewed as a store for an incredible amount of information. In reality, it is also a largely connected graph of articles with semantic and contextual similarities.

A real-life analogy: cities and towns represent articles, and the blue links (hyperlinks within the articles) in Wikipedia are the roads that connect them. Some cities are interconnected by highways, so that more people can travel between them more directly and quickly.

With Wikipedia alone, we cannot view the traffic on these roads, since we have no knowledge of how people hop from one idea to another. That’s where the Wikispeedia dataset comes in.

Wikispeedia is an online game in which players must reach a target wiki article from an unrelated start article by clicking links within the articles. The links that players click can be considered the traffic between cities. The path taken is recorded during gameplay, and these paths are the main subject of our analysis!

Let’s view people’s connections in the Wikispeedia game as trips between cities and towns, and start the discovery journey!

How are the knowledge cities connected together?

Some cities have a lot of roads connected to them; these are the hubs of the graph. Hubs are great for identifying commonly known ideas, but they don’t show how ideas are connected and used together. So instead, we decided to look at how we get from one idea to another, which gives us a better picture of how people think.

We came up with the idea of using common sub-paths of length greater than 1, which represent how ideas are logically linked together. We observed many recurring common paths in the Wikispeedia games. We call them highways, as they are the most commonly used paths between many articles and are heavily reused across games. We would like to compare them against real-world Wikipedia data.

Using the Wikispeedia dataset

We mainly focused on paths_finished.tsv from the Wikispeedia dataset. It contains hashedIpAddress, timestamp, durationInSec, path and rating; we only use path here. The path consists of the starting article and every article the player clicked. Some paths contain a back sign “<”, which marks a return to the previous article. We removed the sign together with the mis-selected article it undoes, so that abandoned clicks do not distort the paths. There is also a file of unfinished paths. We chose not to use it because there are about twice as many finished paths (51318) as unfinished paths (24875), and finished paths are more valuable for our research.
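The back-click cleaning described above can be sketched as follows. This is a minimal illustration, not necessarily the exact code used in the analysis. It assumes that the articles in the path column are separated by “;” and that each “<” undoes exactly one previously clicked article. The function name clean_path is hypothetical.

```python
def clean_path(raw_path, sep=";", back="<"):
    """Return the list of articles in a path with back-clicks resolved.

    Assumptions (not confirmed by the text above): articles are separated
    by `sep`, and each `back` token undoes the most recent article.
    """
    visited = []
    for step in raw_path.split(sep):
        if step == back:
            # Drop the mis-selected article the player backed out of.
            if visited:
                visited.pop()
        else:
            visited.append(step)
    return visited


example = clean_path("Asteroid;Mars;<;Earth;Africa")
```

With this approach, two consecutive back signs remove the two most recent articles, so the cleaned path only contains the articles that actually led to the target.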

Highways

As with any road trip, sometimes taking a highway is the fastest way to get to the destination. The common subpaths in Wikispeedia games are those highways that connect multiple clusters of cities and allow game players to travel fast.

We began our analysis by looking into the highways! We define highways in our dataset as roads reused at least 6 times across all the recorded data. As the heavy-tailed distribution shows, thousands of “roads” are used 2-5 times, whereas roads used 6 or more times number only in the hundreds, which makes them significantly rarer.
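One way to extract highways under this definition is sketched below. It is an illustration under stated assumptions rather than the exact procedure used here: paths are already cleaned lists of article names, a subpath is a contiguous run of articles, and “length greater than 1” is read as at least two links (three articles). The names count_subpaths, find_highways and min_links are hypothetical.

```python
from collections import Counter


def count_subpaths(paths, min_links=2):
    """Count every contiguous subpath with at least `min_links` links.

    Assumption: subpath length is measured in links, so min_links=2
    means subpaths of at least three articles.
    """
    counts = Counter()
    for path in paths:
        n = len(path)
        for start in range(n):
            # A subpath with k links spans k + 1 articles.
            for end in range(start + min_links + 1, n + 1):
                counts[tuple(path[start:end])] += 1
    return counts


def find_highways(counts, min_uses=6):
    """Keep only the subpaths used at least `min_uses` times."""
    return {sub: c for sub, c in counts.items() if c >= min_uses}


# Toy data: six games share one route, a seventh shares its first part.
games = [["A", "B", "C", "D"]] * 6 + [["A", "B", "C"]]
subpath_counts = count_subpaths(games)
highways = find_highways(subpath_counts, min_uses=6)
```

Note that this counts occurrences, so a subpath repeated within a single game is counted more than once. Counting each subpath at most once per game would be an equally reasonable choice.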

Path counts for subpaths

Once we have the list of all possible subpaths, we need to set a threshold: how many times does a path need to be used for it to count as a road/highway? To decide, we plotted each usage count against the number of subpaths that were used that many times.
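The data behind such a plot can be computed as in the sketch below. It assumes a mapping from each subpath to its usage count is already available. The names usage_distribution and subpaths_at_or_above, and the toy numbers, are hypothetical and only for illustration.

```python
from collections import Counter


def usage_distribution(subpath_counts):
    """Return sorted (usage_count, number_of_subpaths) pairs."""
    return sorted(Counter(subpath_counts.values()).items())


def subpaths_at_or_above(distribution, threshold):
    """Number of distinct subpaths used at least `threshold` times."""
    return sum(n for uses, n in distribution if uses >= threshold)


# Toy counts, made up for illustration only.
toy_counts = {"p1": 2, "p2": 2, "p3": 3, "p4": 6, "p5": 9}
distribution = usage_distribution(toy_counts)
rare_roads = subpaths_at_or_above(distribution, threshold=6)
```

Plotting the first element of each pair on the x-axis and the second on the y-axis, typically with logarithmic axes for a heavy-tailed distribution, shows how quickly the number of subpaths drops as the threshold increases.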