Graph coherence issue

Group 1 (presented by Grégory T.)

They checked the coherence of the two Citeseer databases, the Web-based one and the downloaded OAI one:
  1. Web-based: fairly accurate, for both ref links and isRefBy links.
  2. OAI: isRefBy links are (almost) always wrong, while ref links are mostly right.
They propose an explanation, verified on numerous examples (about 30 papers across three fields of research): context ids and paper ids seem to be mixed up. If a document A (id: 3311) references a paper B in the HTML version, then B is indeed listed as referenced by A, since this version is coherent. In the OAI version, on the other hand, B is listed as referenced by the context with id 3311, which belongs to another field.
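
The coherence check described above can be sketched as follows. This is only an illustration, not the group's actual code: it assumes each database has been loaded into two dictionaries, `refs` and `is_ref_by` (hypothetical names), mapping a paper id to the set of ids it cites or is cited by.

```python
def incoherent_links(refs, is_ref_by):
    """Return the (citing, cited) pairs that appear in one direction only.

    refs[a]      -- set of ids that paper a references (assumed layout)
    is_ref_by[b] -- set of ids that reference paper b (assumed layout)
    """
    forward = {(a, b) for a, cited in refs.items() for b in cited}
    backward = {(a, b) for b, citing in is_ref_by.items() for a in citing}
    # In a coherent database both sets are equal, so the difference is empty.
    return forward ^ backward


# Toy data: paper 3311 references paper 42.
refs = {3311: {42}}
coherent = {42: {3311}}   # isRefBy agrees with ref
broken = {42: {9999}}     # isRefBy points at an unrelated id

print(incoherent_links(refs, coherent))        # set()
print(sorted(incoherent_links(refs, broken)))  # [(3311, 42), (9999, 42)]
```

On the toy data, the coherent case yields no mismatch, while the broken case reports both the missing and the spurious link.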

Moreover, they noticed that most papers have no references at all. They provide plots as shown below.
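
A minimal sketch of how the share of papers without references could be measured, assuming the same hypothetical `refs` dictionary layout (paper id to set of cited ids) and a list `all_ids` of every paper id in the database:

```python
from collections import Counter


def out_degree_histogram(refs, all_ids):
    """Map each out-degree to the number of papers having it."""
    return Counter(len(refs.get(pid, ())) for pid in all_ids)


# Toy data: five papers, only two of which cite anything.
all_ids = [1, 2, 3, 4, 5]
refs = {1: {2, 3}, 4: {1}}

hist = out_degree_histogram(refs, all_ids)
share_without_refs = hist[0] / len(all_ids)

print(dict(sorted(hist.items())))  # {0: 3, 1: 1, 2: 1}
print(share_without_refs)          # 0.6
```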


Group 2 (presented by Christophe)

Group 3 (presented by Ali)

They focus on the partitioning algorithm problem.

Tom's summary

Graph partitioning issue

Group 1

They focus only on the graph coherence issue.

Group 2 (presented by Alex)
SCC table (strongly connected components, by size):

  Vertices   Components
         1      338 049
         2         3755
        10            6
      8880            1
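
A table of this kind can be produced along the following lines. This is a sketch only: it uses Kosaraju's algorithm (the notes do not say which method Group 2 used) and assumes the graph is given as a list of `(citing, cited)` pairs plus a list of all vertex ids.

```python
from collections import Counter, defaultdict


def scc_size_table(edges, vertices):
    """Return {component size: number of SCCs of that size} (Kosaraju)."""
    succ, pred = defaultdict(list), defaultdict(list)
    for u, v in edges:
        succ[u].append(v)
        pred[v].append(u)

    # Pass 1: iterative DFS on the original graph, recording completion order.
    order, seen = [], set()
    for root in vertices:
        if root in seen:
            continue
        seen.add(root)
        stack = [(root, iter(succ[root]))]
        while stack:
            node, neighbours = stack[-1]
            for nxt in neighbours:
                if nxt not in seen:
                    seen.add(nxt)
                    stack.append((nxt, iter(succ[nxt])))
                    break
            else:
                # All successors explored: the node is finished.
                order.append(node)
                stack.pop()

    # Pass 2: explore the reversed graph in decreasing completion order;
    # each exploration collects exactly one strongly connected component.
    sizes, assigned = Counter(), set()
    for root in reversed(order):
        if root in assigned:
            continue
        assigned.add(root)
        stack, size = [root], 0
        while stack:
            node = stack.pop()
            size += 1
            for prev in pred[node]:
                if prev not in assigned:
                    assigned.add(prev)
                    stack.append(prev)
        sizes[size] += 1
    return dict(sizes)


# Toy graph: papers 1 and 2 cite each other; 3, 4 and 5 are singletons.
edges = [(1, 2), (2, 1), (2, 3), (3, 4)]
table = scc_size_table(edges, vertices=[1, 2, 3, 4, 5])
print(sorted(table.items()))  # [(1, 3), (2, 1)]
```

Both passes are iterative rather than recursive, which avoids hitting Python's recursion limit on a graph with hundreds of thousands of vertices.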

Weakly Connected Components (WCC)

Group 3 (presented by ...)

They explored the data structures available for the algorithm. They tried linked lists with a randomized algorithm, but performance was very poor: merging a single node took more than 1 second. Tom suggests hash tables instead.
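
To illustrate Tom's suggestion, here is a hedged sketch of a node merge on hash tables. It assumes the merge is a contraction step on an undirected multigraph without self-loops, stored as nested dictionaries; the notes do not specify the exact operation, and the function name `merge_nodes` is hypothetical.

```python
def merge_nodes(adj, keep, drop):
    """Merge node `drop` into node `keep`, in place.

    adj[u][v] is the number of parallel edges between u and v. Nested
    dicts are hash tables, so each neighbour update costs O(1) on average
    and the whole merge costs O(degree of drop).
    """
    for nbr, count in adj.pop(drop).items():
        del adj[nbr][drop]  # nbr no longer points at the dropped node
        if nbr == keep:
            continue        # a keep-drop edge would become a self-loop: discard
        adj[keep][nbr] = adj[keep].get(nbr, 0) + count
        adj[nbr][keep] = adj[nbr].get(keep, 0) + count


# Toy triangle a-b-c: merging b into a leaves a double edge a-c.
adj = {
    "a": {"b": 1, "c": 1},
    "b": {"a": 1, "c": 1},
    "c": {"a": 1, "b": 1},
}
merge_nodes(adj, keep="a", drop="b")
print(adj)  # {'a': {'c': 2}, 'c': {'a': 2}}
```

With linked lists, finding and updating the entry for a given neighbour requires a scan of the list; with hash tables the lookup is direct, which is the point of the suggestion.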

Group 2 (presented by Eda)

Eda contacted several researchers who have worked on this problem:
  1. David Karger (MIT): has an implementation, but it relies on a library [1]
  2. Other authors, who have no implementation of their algorithms [2,3,4]
She also found new publications with interesting results, and some libraries useful for this problem:
  1. METIS library
  2. LEDA library
  3. CPLEX (linear programming)

Tom's conclusion

Whether the graph is treated as directed or undirected does not matter. The two views differ only when two papers cite each other. We should not focus on this problem, since a precise algorithm on a modified graph may achieve better results than a heuristic on the original graph.
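
As a small illustration of this remark, the sketch below collapses a directed citation graph into an undirected one and lists the mutual citations, which are the only places where the two graphs differ in edge count. The edge-list input format and the function name `to_undirected` are assumptions made for the example.

```python
def to_undirected(edges):
    """Collapse directed citation edges into undirected ones.

    Returns (undirected_edges, mutual_pairs). Each undirected edge is a
    frozenset of two paper ids; mutual_pairs holds the pairs of papers
    that cite each other.
    """
    directed = {(u, v) for u, v in edges if u != v}  # ignore self-citations
    undirected = {frozenset(edge) for edge in directed}
    mutual = {frozenset((u, v)) for u, v in directed if (v, u) in directed}
    return undirected, mutual


# Toy data: papers 1 and 2 cite each other.
edges = [(1, 2), (2, 1), (2, 3), (3, 4)]
undirected, mutual = to_undirected(edges)

print(len(undirected))                    # 3
print([sorted(pair) for pair in mutual])  # [[1, 2]]
```

Four directed edges become three undirected ones; the single lost edge corresponds to the one mutual citation.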

Finally, he announces that the project is due on April 28, since we are a bit late.


[1] Rounding Algorithms for a Geometric Embedding of Minimum Multiway Cut. Karger et al. 1999 (1.3438 performance ratio)
[2] An Improved Approximation Algorithm for Multiway Cut. Calinescu et al. 2000 (1.5 - 1/k performance ratio)
[3] A 2-Approximation Algorithm for the Directed Multiway Cut Problem. Naor et al. (approximation factor of 2)
[4] Multiway Cuts in Directed Graphs and Node Weighted Graphs. Garg et al. (2 log(k) performance ratio)