The next hop in our random walk is selected uniformly randomly from
the set of neighbors. In  use a technique for performing search on
networks that selects as next hop the neighbor with the highest degree.
To test this approach for our problem, we modified the random walk to
select the next hop proportionally to the degree of the node.
This approach also follows the intuition that, normally, when looking for
information on Citeseer users will tend to select links to more cited papers
when traversing the citation graph. The results, however, are
semantically meaningless. The clusters are highly fragmented and
noncontiguous. The main reason for this was that the random walk tended
to visit highly connected nodes as expected, but these nodes caused the
random walk to lose focus and jump out of the vicinity of the original
seed node. This was caused by the scale-free nature of the graph,
i.e. there exist a few highly interconnected nodes, and the probability
that any node in the graph is linked to one of those hubs is relatively
high. This, combined with the bias towards high-degree nodes caused the
random walk to get stuck hopping between a small subset of the highly connected
nodes. Since the probabilities of reaching the hub
nodes from seed nodes are similar, the clusters compete over
the hub nodes, and this leads to fragmentation.
Another approach is to select the next hop inversely proportionally
to the degree of the node. This, on the other hand, biases the walk
towards low-degree nodes and the walk frequently gets stuck
hopping back and forth between pairs of poorly connected nodes.
The random walk we are using is memoryless. Intuitively, an intelligent
surfer hopping between the publications in the citation graph should
remember the pages previously visited and avoid visiting them again. We
implemented that in our algorithm but did not observe any major
difference in produced, Note that in order to implement this algorithm
it is necessary to run the actual Monte Carlo simulation of the random
walk. It is no longer possible to express the walk as a system of linear
equations. We keep the same uniform next hop selection as before.
The results do not differ much from the memoryless variant. The reason
for that is that in highly connected parts of the graph it is relatively
rare to visit the same node again. On the other hand the probability of
doing so is higher when visiting low degree nodes, but these nodes themselves are
rarely visited. Therefore the event of visiting another node
again is rare overall.