Wikispeedia is a human-computation game in which users are asked to navigate from a given source article to a given target article by clicking only Wikipedia links, which means every connection between articles in a path comes from human intuition. This dataset contains human navigation paths on Wikipedia collected through Wikispeedia. It covers 2011 to 2014, but we discarded the 2014 data because that year has far fewer records than the other three.
The Wikispeedia dataset can be viewed as an article "map" full of criss-crossing navigation paths. On this map we look for the 100 most popular words, ranked by click-through, and call them the Wikispeedia "transport hubs".
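The ranking by click-through can be sketched as a simple frequency count over paths. This is a minimal illustration, not the project's actual code: it assumes each path is a semicolon-separated string of article names in which `<` marks a back-click, it counts the starting article as an appearance, and the helper names and sample paths are hypothetical.

```python
from collections import Counter


def count_clicks(paths):
    """Count how often each article appears across all paths."""
    counts = Counter()
    for path in paths:
        for article in path.split(";"):
            if article != "<":  # assumed back-click marker, not an article
                counts[article] += 1
    return counts


def top_hubs(paths, k=100):
    """Return the k most frequent articles as (article, count) pairs."""
    return count_clicks(paths).most_common(k)


# Toy paths for illustration only (not real Wikispeedia data).
sample_paths = [
    "Cat;Europe;United_States;Jazz",
    "Tea;China;<;India;United_States",
    "Moon;Earth;Europe;France",
]
hubs = top_hubs(sample_paths, k=2)
```

With the real data, `k=100` would give the hub list used in the rest of the analysis, provided the path column follows the assumed format.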
Because Wikispeedia is a path game, we can study the access distance from transport hubs to other articles. We conducted a causal analysis of the average shortest distance from the starting article to the destination. Since the coefficient of the fitted linear regression model has a small p-value, it is reasonable to believe that transport hubs make the game less difficult and have a higher throughput rate.
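The quantity being compared, the average shortest distance from an article to the others, can be computed with a breadth-first search over the link graph. The sketch below covers only that distance computation, not the regression or the matching step. The adjacency dictionary, function names, and toy graph are assumptions made for illustration.

```python
from collections import deque


def shortest_distances(links, source):
    """Breadth-first search: clicks needed from source to each reachable article."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        node = queue.popleft()
        for nxt in links.get(node, ()):
            if nxt not in dist:
                dist[nxt] = dist[node] + 1
                queue.append(nxt)
    return dist


def average_distance(links, source):
    """Mean shortest distance from source to every other reachable article."""
    dist = shortest_distances(links, source)
    others = [d for node, d in dist.items() if node != source]
    return sum(others) / len(others) if others else float("nan")


# Toy directed link graph for illustration only.
toy_links = {
    "Europe": ["France", "Germany", "Cat"],
    "France": ["Europe", "Wine"],
    "Germany": ["Europe"],
    "Cat": ["Mouse"],
    "Mouse": ["Cheese"],
    "Cheese": ["France"],
    "Wine": [],
}
hub_avg = average_distance(toy_links, "Europe")
non_hub_avg = average_distance(toy_links, "Cat")
```

In this toy graph the well-connected article has the smaller average distance, which is the kind of difference the regression would then test on the real graph.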
Following human intuition, we expected that the larger a node's degree in the network graph, the closer it should be to the centroids. So are the transport hubs of Wikispeedia also semantic hubs?
Article pairs | Average cosine distance |
---|---|
Between hubs | 0.8658 |
Among all articles | 0.8659 |
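An average pairwise cosine distance of the kind reported in the table can be computed as below. This is only a sketch: the source does not say how articles were embedded, so the two-dimensional vectors, the article names, and the function names here are placeholders.

```python
import math
from itertools import combinations


def cosine_distance(u, v):
    """Cosine distance: 1 minus the cosine similarity of two vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / (norm_u * norm_v)


def mean_pairwise_distance(vectors):
    """Average cosine distance over all unordered pairs of vectors."""
    pairs = list(combinations(vectors, 2))
    return sum(cosine_distance(u, v) for u, v in pairs) / len(pairs)


# Placeholder embeddings for illustration only (real ones would be high-dimensional).
toy_embeddings = {
    "Europe": [1.0, 0.0],
    "France": [0.0, 1.0],
    "Cat": [1.0, 1.0],
}
hub_names = ["Europe", "France"]  # hypothetical hub subset

between_hubs = mean_pairwise_distance([toy_embeddings[n] for n in hub_names])
among_all = mean_pairwise_distance(list(toy_embeddings.values()))
```

Comparing `between_hubs` with `among_all` on the real embeddings gives the two rows of the table.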
We note that country names are a very typical kind of transport hub. To place them in a real-world context and better understand how transport hubs relate to real-world conditions, we imported the national GDP dataset provided by the World Bank. The dataset contains the GDP of 266 countries, and we selected the transport hubs in the country category using the dataset file categories.tsv.
Now the question is whether the hotness of transport hubs is related to the GDP of each country. After plotting the figures, we found that, in general, the hotness of transport hubs is in line with each country's GDP. But why? GDP is a comprehensive indicator that reflects a country's strength. A country with a higher GDP has more influence in many areas and is therefore linked to more events. As a result, its Wikipedia page can link to more topics. This can explain why GDP is in line with transport hubs.
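One way to quantify whether two rankings are "in line" is a Spearman rank correlation. The source only reports a visual comparison of plots, so this statistic is an assumption rather than the method actually used. The numbers below are made up for illustration and are not real click counts or GDP figures.

```python
def ranks(values):
    """Rank values from 1 (smallest); tied values share their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = avg
        i = j + 1
    return result


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sum((a - mx) ** 2 for a in x) ** 0.5
    sy = sum((b - my) ** 2 for b in y) ** 0.5
    return cov / (sx * sy)


def spearman(x, y):
    """Spearman correlation: Pearson correlation of the ranks."""
    return pearson(ranks(x), ranks(y))


# Made-up values for four hypothetical country hubs.
hub_clicks = [900, 400, 650, 120]
gdp_values = [40, 10, 25, 5]
rho = spearman(hub_clicks, gdp_values)
```

A value of `rho` near 1 would indicate that the two rankings agree, a value near 0 that they are unrelated.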
Since we found that national GDP is in line with the transport hubs, the next question is whether Google Trends also has a positive relationship with them. Using the pytrends API, we retrieved the Google Trends data for the transport hubs in 2011, 2012, and 2013. Surprisingly, the Google Trends ranks do not have a positive relationship with the transport hub ranks, as shown in the following figures.
We can conclude that the transport hubs have no relationship with Google Trends. Why? Is Google Trends less accurate than the GDP data? The answer is that the Wikispeedia game is task-oriented: players look for the hubs that can bring them to their destination rather than the hubs they are most familiar with. Google Trends only reflects what is popular on the internet or what people are familiar with. This explains why the hubs are more closely related to GDP than to Google Trends.
To find out whether there is a “path” from the wiki to the real world, we first define “transport hubs”: the 100 words that appear most frequently in the paths people choose when playing Wikispeedia. We first found a strong relationship between players’ success rate in completing the game and the occurrence of transport hubs. We then conducted a causal analysis of the average shortest distance from the starting article to the destination using linear regression. We matched a random collection of articles against the transport hub articles and concluded that the hot words help people complete the game faster.
Next we wanted to know which property of the transport hubs produces this effect. The way we defined the transport hubs gave us a first hint: we assumed that they were likely related to the popularity of these words. To test this assumption, we collected the Google Trends data for these words for the period when people played the game (mainly 2011, 2012, and 2013, so we collected trends for these three years). However, after conducting the experiments, we concluded that there is hardly any relationship between the hubs and their popularity.
When we were about to give up, we took another view of the hubs. Could the hubs have been chosen through people’s common sense and their lifelong familiarity with the world? After all, Wikispeedia is task-oriented, and people tend to use the words most familiar to them. We wanted to explore this view. We soon found that most hubs were names of countries and regions. In general, a country's GDP can represent its power and influence, so it was reasonable to believe that the ranking of countries as hubs was positively correlated with their GDP ranking. We then collected the GDP data, which confirmed this.
In conclusion, the transport hubs are chosen based on people’s common sense and their lifelong familiarity with the world rather than on the popularity of the words, which actually runs against our own common sense as well.
Data Science
Electrical and Electronics Engineering
Energy Science
Energy Science