WIKISPEEDIA! A PATH FROM WIKI TO REAL WORLD?
Just guess yes or no and scroll down to find the answer

Introduction

About Wikispeedia

What is Wikispeedia?

Wikispeedia is a human-computation game in which players navigate from a given source article to a given target article by clicking only Wikipedia links, so every connection between articles in a path reflects human intuition. The dataset contains human navigation paths on Wikipedia collected through Wikispeedia from 2011 to 2014. We discarded the 2014 data because it contains significantly fewer paths than the other three years.

Total number of paths by year
Finished and unfinished paths

First Glimpse of Transport Hubs

The Wikispeedia dataset can be viewed as an article "map" made of many criss-crossing navigation paths. For each year, we identify the 100 most popular articles, ranked by click-through, and call them Wikispeedia "transport hubs".
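The hub ranking described above can be sketched as a simple visit count. This is a minimal illustration, not the exact procedure used here: it assumes each path is a semicolon-separated string of article names in which "<" marks a back-click, and it counts every visited article, including the start and the target. The function names and the three toy paths are hypothetical.

```python
from collections import Counter


def count_clicks(paths):
    """Count how often each article is visited across all paths."""
    clicks = Counter()
    for path in paths:
        for step in path.split(";"):
            if step != "<":  # assumption: "<" is a back-click, not an article
                clicks[step] += 1
    return clicks


def top_hubs(paths, n=100):
    """Return the n most visited articles, most visited first."""
    return [article for article, _ in count_clicks(paths).most_common(n)]


# Toy paths for illustration only (not taken from the dataset).
paths = [
    "Asteroid;Earth;United_States;Viking",
    "Cat;Europe;<;United_States;Dog",
    "Zebra;Africa;Earth",
]
hubs = top_hubs(paths, n=2)
print(hubs)
```

Whether the start and target articles should count as clicks is a design choice; dropping the first and last step of each path would restrict the ranking to intermediate articles.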

Transport Hubs of 2011
Transport Hubs of 2012
Transport Hubs of 2013

Do hot articles accelerate the reach of the target?

Because Wikispeedia is a path game, we can study the distance from transport hubs to other articles. We conducted a causal analysis of the average shortest distance from the starting article to the destination. The coefficient of the fitted linear regression model has a small p-value, so it is reasonable to believe that transport hubs make the game less difficult and have a higher throughput rate.
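As a rough sketch of this kind of regression, the example below fits a one-variable ordinary least squares model. The text does not state exactly which variables were regressed, so the setup is an assumption: the regressor is a binary indicator (1 for a transport hub, 0 for a matched random article) and the outcome is the average shortest distance. The numbers are toy values chosen for illustration, and the function names are hypothetical.

```python
from statistics import mean


def ols_fit(x, y):
    """Fit y = intercept + slope * x by ordinary least squares."""
    x_bar, y_bar = mean(x), mean(y)
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    slope = sxy / sxx
    return slope, y_bar - slope * x_bar


def slope_t_statistic(x, y):
    """t statistic of the slope; compare against a t distribution with n - 2 dof."""
    slope, intercept = ols_fit(x, y)
    n = len(x)
    residuals = [yi - (intercept + slope * xi) for xi, yi in zip(x, y)]
    sigma2 = sum(r * r for r in residuals) / (n - 2)  # residual variance
    x_bar = mean(x)
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    return slope / (sigma2 / sxx) ** 0.5


# Toy data: 1 = transport hub, 0 = matched random article (illustrative only).
is_hub = [1, 1, 1, 1, 0, 0, 0, 0]
avg_distance = [2.0, 2.5, 2.0, 2.5, 3.0, 3.5, 3.0, 3.5]

slope, intercept = ols_fit(is_hub, avg_distance)
t_stat = slope_t_statistic(is_hub, avg_distance)
print(slope, intercept, t_stat)
```

With a binary regressor the slope equals the difference between the two group means, so a negative slope with a large absolute t statistic (and therefore a small p-value) would indicate shorter distances for hubs.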

 



Inspiration
Players' behaviour creates these transport hubs, and our findings suggest that people's subconscious choices do bring them closer to their destination.

What makes people choose these transport hubs? What do these transport hubs have in common? Do they all contain richer semantic information? Or did something in real life make them easier to recall?

Can we place Wiki transport hubs on a semantic level?

Following human intuition, we expect that the larger a node's degree in the network graph, the closer the node should be to a cluster centroid. So are the transport hubs of Wikispeedia also semantic hubs?

How do we measure the semantic information content of an article?
We use the cosine distances between their document embeddings.
  1. We first obtain an embedding vector for each article using a Doc2Vec model, so that each article is represented as a vector of length 300. Unlike a traditional word embedding model, Doc2Vec learns a vector for the entire document alongside the word embeddings. Because the document vector is learned directly from the text, it takes word order into account and carries better semantic information.

  2. We clustered the articles with K-means based on cosine distance. We set K to 16 by analysing the category structure of the articles, and after clustering we checked the results against these 16 true categories.
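The distance measure behind the two steps above can be sketched as follows. This is a minimal illustration with 2-dimensional toy vectors instead of the 300-dimensional Doc2Vec embeddings; the function names are hypothetical, and measuring each article against its nearest centroid is an assumption about how the averages were computed.

```python
import math


def cosine_distance(u, v):
    """1 - cosine similarity; 0 for parallel vectors, 1 for orthogonal ones."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / (norm_u * norm_v)


def mean_distance_to_nearest_centroid(vectors, centroids):
    """Average, over all vectors, of the cosine distance to the closest centroid."""
    total = sum(min(cosine_distance(v, c) for c in centroids) for v in vectors)
    return total / len(vectors)


# Toy 2-D embeddings and centroids (illustrative only).
centroids = [(1.0, 0.0), (0.0, 1.0)]
all_articles = [(1.0, 0.0), (0.0, 2.0), (1.0, 1.0)]
hub_articles = [(1.0, 1.0)]

print(mean_distance_to_nearest_centroid(all_articles, centroids))
print(mean_distance_to_nearest_centroid(hub_articles, centroids))
```

Comparing the two averages, one for hubs and one for all articles, is the kind of check that would show whether hubs sit closer to the cluster centres than a typical article.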
Clustering results
1. Average cosine distance to centroids

Group                Average cosine distance
Between hubs         0.8658
Among all articles   0.8659
2. Reduce the dimensionality by PCA for visualization

Conclusion
Wikispeedia transport hubs do not cluster at the centroids of the clusters. Their mean distance to the centroids is almost equal to that of random nodes. So no, rich semantic information is not what makes them more likely to be chosen.

Are country-type transport hubs connected to GDP?

We note that country names are a very typical kind of transport hub. To place them in the real world and better understand the relationship between transport hubs and real-world conditions, we import the national GDP dataset provided by the World Bank. The dataset contains the GDP of 266 countries, and we pick out the transport hubs in the country category using the dataset categories.tsv.
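The selection step can be sketched as a filter followed by a join. This is only an illustration under stated assumptions: the category labels are assumed to contain the word "Countries", the three inputs are plain dictionaries keyed by article name, and all names and values below are placeholders rather than real data.

```python
def country_hubs_with_gdp(clicks, categories, gdp):
    """Keep hubs categorised as countries that also have a GDP entry.

    Returns (article, clicks, gdp) rows sorted by clicks, hottest first.
    """
    rows = [
        (article, count, gdp[article])
        for article, count in clicks.items()
        # assumption: country categories contain the word "Countries"
        if "Countries" in categories.get(article, "") and article in gdp
    ]
    return sorted(rows, key=lambda row: row[1], reverse=True)


# Placeholder inputs (not real click counts or GDP figures).
clicks = {"Country_A": 300, "Telephone": 500, "Country_B": 200, "Country_C": 100}
categories = {
    "Country_A": "subject.Countries",
    "Country_B": "subject.Countries",
    "Country_C": "subject.Countries",
    "Telephone": "subject.Everyday_life",
}
gdp = {"Country_A": 1.0e12, "Country_B": 1.0e11}  # Country_C has no GDP entry

rows = country_hubs_with_gdp(clicks, categories, gdp)
print(rows)
```

In practice the article names and the country names in the GDP table may be spelled differently (for example underscores versus spaces), so a name-normalisation step would probably be needed before the join.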

Now, the question is whether the hotness of transport hubs is related to the GDP of each country. After plotting the figures, we found that, in general, the hotness of a country hub is in line with that country's GDP. But why? GDP is a comprehensive indicator of a country's strength. A country with a higher GDP has more influence in many areas and is linked to more events, so its Wikipedia page links to more topics. This can explain why GDP is in line with transport hub hotness.

Hotness and GDP among countries, 2011

 

Hotness and GDP among countries, 2012

 

Hotness and GDP among countries, 2013

Does Google Trends have a positive relationship with transport hubs?

We have found that national GDP is in line with the transport hubs. Does Google Trends also have a positive relationship with them? Using the pytrends API, we retrieved the Google Trends data for the transport hubs in 2011, 2012, and 2013. Surprisingly, the Google Trends ranks do not have a positive relationship with the transport hub ranks, as shown in the following figures.

We can conclude that the transport hubs have no relationship with Google Trends. Why? Is Google Trends less accurate than the GDP data? The answer is that Wikispeedia is task-oriented: players look for the hubs that can bring them to their destination rather than the hubs they are most familiar with.

Google Trends only reflects what is popular on the internet or what people are familiar with. This explains why the hubs are more closely related to GDP than to Google Trends.
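One way to quantify a rank comparison like this is Spearman's rank correlation, sketched below. This is an assumption about how the two rankings could be compared, not a description of the analysis behind the figures. The simple formula used here assumes there are no ties, and the scores are toy values.

```python
def ranks(values):
    """Rank 1 = largest value. Assumes all values are distinct (no ties)."""
    order = sorted(range(len(values)), key=lambda i: values[i], reverse=True)
    result = [0] * len(values)
    for rank, index in enumerate(order, start=1):
        result[index] = rank
    return result


def spearman_rho(x, y):
    """Spearman rank correlation for untied data: 1 - 6 * sum(d^2) / (n (n^2 - 1))."""
    rank_x, rank_y = ranks(x), ranks(y)
    n = len(x)
    d_squared = sum((a - b) ** 2 for a, b in zip(rank_x, rank_y))
    return 1 - 6 * d_squared / (n * (n * n - 1))


# Toy scores for four hubs (illustrative only).
wikispeedia_clicks = [500, 400, 300, 200]
trend_scores = [20, 80, 10, 60]

rho = spearman_rho(wikispeedia_clicks, trend_scores)
print(rho)
```

A value near 1 would mean the two rankings agree, a value near -1 that they are reversed, and a value near 0 that they are unrelated.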

Rank of transport hubs in Google Trends and Wikispeedia, 2011

 

Rank of transport hubs in Google Trends and Wikispeedia, 2012

 

Rank of transport hubs in Google Trends and Wikispeedia, 2013

Conclusion

Does a path from wiki to the real world exist?


Accelerate? Yes!

To find out whether there is a "path" from wiki to the real world, we first defined "transport hubs": the 100 articles that appear most frequently in the paths people choose when playing Wikispeedia. We found a strong relationship between players' success rate in completing the game and the occurrence of transport hubs. We then conducted a causal analysis, using linear regression, of the average shortest distance from the starting article to the destination. We matched a random collection of articles against the transport hub articles for comparison, and concluded that the hubs help people complete the game faster.

Where does the magic of transport hubs come from?

Next, we wanted to know what gives the transport hubs this effect. The way we defined the hubs gave us a hint: we assumed that the hubs were likely related to the popularity of these words. To test this assumption, we collected the Google Trends data for these words for the period when people played the game (mainly 2011, 2012, and 2013). However, after running the experiments, we concluded that there is hardly any relationship between the hubs and their popularity.

Not popularity, but common sense

When we were about to give up, we looked at the hubs from another angle. Could the hubs be chosen based on people's common sense and their lifelong familiarity with the world? After all, Wikispeedia is task-oriented, and people tend to use the words they know best. Exploring this view, we soon found that most hubs were names of countries and regions. In general, a country's GDP can represent its power and influence, so it was reasonable to expect the ranking of countries as hubs to be positively correlated with their GDP ranking. We then collected the GDP data and confirmed this.

In conclusion, the transport hubs are chosen based on people's common sense and their lifelong familiarity with the world rather than the popularity of the words, which actually runs against our own common sense.








Acknowledgements


  1. West, Robert & Pineau, Joelle & Precup, Doina. (2009). Wikispeedia: An Online Game for Inferring Semantic Distances between Concepts. IJCAI International Joint Conference on Artificial Intelligence. 1598-1603.
  2. Le, Quoc & Mikolov, Tomas. (2014). Distributed Representations of Sentences and Documents. 31st International Conference on Machine Learning, ICML 2014. 4.
  3. World Bank (https://data.worldbank.org/).
  4. Google Trends (https://www.google.com/trends).

The team

Zhan Li

Data Science

Yuheng Lu

Electrical and Electronics Engineering

Yichen Liu

Energy Science

Xiyun Fu

Energy Science