Other variants

The next hop in our random walk is selected uniformly randomly from the set of neighbors. In [] use a technique for performing search on networks that selects as next hop the neighbor with the highest degree. To test this approach for our problem, we modified the random walk to select the next hop proportionally to the degree of the node. This approach also follows the intuition that, normally, when looking for information on Citeseer users will tend to select links to more cited papers when traversing the citation graph. The results, however, are semantically meaningless. The clusters are highly fragmented and noncontiguous. The main reason for this was that the random walk tended to visit highly connected nodes as expected, but these nodes caused the random walk to lose focus and jump out of the vicinity of the original seed node. This was caused by the scale-free nature of the graph, i.e. there exist a few highly interconnected nodes, and the probability that any node in the graph is linked to one of those hubs is relatively high. This, combined with the bias towards high-degree nodes caused the random walk to get stuck hopping between a small subset of the highly connected nodes. Since the probabilities of reaching the hub nodes from seed nodes are similar, the clusters compete over the hub nodes, and this leads to fragmentation.

Another approach is to select the next hop inversely proportionally to the degree of the node. This, on the other hand, biases the walk towards low-degree nodes and the walk frequently gets stuck hopping back and forth between pairs of poorly connected nodes.

The random walk we are using is memoryless. Intuitively, an intelligent surfer hopping between the publications in the citation graph should remember the pages previously visited and avoid visiting them again. We implemented that in our algorithm but did not observe any major difference in produced, Note that in order to implement this algorithm it is necessary to run the actual Monte Carlo simulation of the random walk. It is no longer possible to express the walk as a system of linear equations. We keep the same uniform next hop selection as before. The results do not differ much from the memoryless variant. The reason for that is that in highly connected parts of the graph it is relatively rare to visit the same node again. On the other hand the probability of doing so is higher when visiting low degree nodes, but these nodes themselves are rarely visited. Therefore the event of visiting another node again is rare overall.

Ali Salehi 2005-04-29