Group 1 (presented by Grégory T.)

They checked the coherence of the two Citeseer data sources, the Web-based version and the downloaded OAI version:

- Web-based: represents reality fairly accurately, for both ref links and isRefBy links.
- OAI: isRefBy links are (almost) always wrong, while ref links are mostly correct.
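
The notes do not say how the coherence check was done. One plausible way to express it: every ref link from A to B should be mirrored by an isRefBy entry for A on B, and vice versa. The sketch below is only an illustration under that assumption; the function name, the dictionary layout and the toy paper identifiers are all hypothetical.

```python
def find_incoherent_links(refs, is_ref_by):
    """Return (missing_back, missing_forward) as sets of (citing, cited) pairs.

    refs[a]      : set of papers that a cites (the "ref" links)
    is_ref_by[b] : set of papers that cite b (the "isRefBy" links)
    """
    missing_back = set()     # a cites b, but b does not list a in isRefBy
    missing_forward = set()  # b lists a in isRefBy, but a does not cite b
    for a, cited in refs.items():
        for b in cited:
            if a not in is_ref_by.get(b, set()):
                missing_back.add((a, b))
    for b, citing in is_ref_by.items():
        for a in citing:
            if b not in refs.get(a, set()):
                missing_forward.add((a, b))
    return missing_back, missing_forward


# Hypothetical toy data, not taken from Citeseer.
refs = {"p1": {"p2", "p3"}, "p2": {"p3"}}
is_ref_by = {"p2": {"p1"}, "p3": {"p1", "p4"}}
missing_back, missing_forward = find_incoherent_links(refs, is_ref_by)
```

A source in which both returned sets are empty is coherent in this sense; a large `missing_forward` set would match the kind of isRefBy problem reported for the OAI data.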

Moreover, they noticed that most papers have no references at all. They provided plots illustrating this.

Group 2 (presented by Christophe)

- They corroborate the findings of Group 1.
- They note that the graph is very sparse and that the XML files contain only links to papers that are themselves in Citeseer (the dark blue links in the Citations paragraph of the HTML version).

Group 3 (presented by Ali)

They focus on the partitioning algorithm problem.

Tom's summary

- Contexts are useful for clustering
- But we are going to use a graph using only references.
- Marc proposes to contact Citeseer to make them aware of this issue

Group 1

They focus only on the graph coherence issue.

Group 2 (presented by Alex)

- There is one big strongly connected component (SCC) in the graph built from references only.

| Vertices per component | Number of components |
|---|---|
| 1 | 338 049 |
| 2 | 3755 |
| 10 | 6 |
| 8880 | 1 |
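
A size histogram like the one above can be produced by computing the strongly connected components and counting how many components have each size. The notes do not say which tool Group 2 used; the following is a minimal sketch using Kosaraju's two-pass algorithm, written iteratively so that it does not hit Python's recursion limit on large graphs. The function name and the toy graph are assumptions made for illustration.

```python
from collections import Counter, defaultdict


def scc_size_histogram(vertices, edges):
    """Map component size -> number of strongly connected components of that size."""
    graph = defaultdict(list)
    reverse = defaultdict(list)
    for u, v in edges:
        graph[u].append(v)
        reverse[v].append(u)

    # Pass 1: record vertices in order of DFS completion on the original graph.
    visited = set()
    order = []
    for start in vertices:
        if start in visited:
            continue
        visited.add(start)
        stack = [(start, iter(graph[start]))]
        while stack:
            node, neighbours = stack[-1]
            for nxt in neighbours:
                if nxt not in visited:
                    visited.add(nxt)
                    stack.append((nxt, iter(graph[nxt])))
                    break
            else:
                # All neighbours explored: the node is finished.
                order.append(node)
                stack.pop()

    # Pass 2: explore the reversed graph in decreasing completion order;
    # each exploration collects exactly one strongly connected component.
    assigned = set()
    sizes = Counter()
    for start in reversed(order):
        if start in assigned:
            continue
        assigned.add(start)
        size = 0
        stack = [start]
        while stack:
            node = stack.pop()
            size += 1
            for prev in reverse[node]:
                if prev not in assigned:
                    assigned.add(prev)
                    stack.append(prev)
        sizes[size] += 1
    return dict(sizes)


# Hypothetical toy graph: a 3-cycle (a, b, c) plus a chain c -> d -> e.
toy_vertices = ["a", "b", "c", "d", "e"]
toy_edges = [("a", "b"), ("b", "c"), ("c", "a"), ("c", "d"), ("d", "e")]
histogram = scc_size_histogram(toy_vertices, toy_edges)
```

On the toy graph the cycle forms one component of size 3, and `d` and `e` are each a component of size 1.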

Weakly Connected Components (WCC)

Group 3 (presented by ...)

They explored the data structures available for the algorithm. They tried linked lists with a random algorithm, but performance was very poor: a single node merge took more than 1 second. Tom suggests using hash tables instead.
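
To illustrate Tom's suggestion, here is a minimal sketch of a node merge on a hash-table adjacency structure. It is not Group 3's code: the dict-of-dicts layout (Python dictionaries are hash tables), the function name `merge_nodes` and the toy graph are assumptions. With this layout, the cost of a merge is proportional to the degree of the removed node rather than to the size of the graph.

```python
def merge_nodes(adj, keep, remove):
    """Merge vertex `remove` into vertex `keep` in an undirected multigraph.

    adj[u][v] holds the number (or total weight) of edges between u and v,
    stored symmetrically: adj[u][v] == adj[v][u].
    """
    for neighbour, weight in adj.pop(remove).items():
        if neighbour == remove:
            continue                      # ignore a self-loop on the removed vertex
        del adj[neighbour][remove]        # neighbour no longer points at `remove`
        if neighbour == keep:
            continue                      # edges between keep and remove disappear
        # Re-attach the edge to `keep`, accumulating parallel edges.
        adj[keep][neighbour] = adj[keep].get(neighbour, 0) + weight
        adj[neighbour][keep] = adj[neighbour].get(keep, 0) + weight


# Hypothetical toy graph: triangle a-b-c plus a pendant edge c-d.
adj = {
    "a": {"b": 1, "c": 1},
    "b": {"a": 1, "c": 1},
    "c": {"a": 1, "b": 1, "d": 1},
    "d": {"c": 1},
}
merge_nodes(adj, "a", "b")
```

After the merge, the two former edges a-c and b-c are represented as a single entry of weight 2 between `a` and `c`.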

Group 2 (presented by Eda)

Eda contacted several researchers who have worked on this problem:

- David Karger (MIT): has an implementation, but it relies on a library [1]
- Other authors, who have no implementation of their algorithms [2,3,4]

- Metis library
- Ledas Library
- CPLEX (linear programming)

Whether the graph is treated as directed or undirected does not matter much. It makes a difference only when two papers cite each other. We should not focus on this problem, since a precise algorithm on a modified graph may achieve better results than a heuristic on the original graph.
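
The point about mutual citations can be seen in a small sketch that drops edge directions. This is an illustration only: the function name and the toy edges are hypothetical, and collapsing a mutual citation into a single edge (rather than, say, an edge of weight 2) is a modelling assumption that the notes do not settle.

```python
def to_undirected(directed_edges):
    """Collapse directed citation edges into a set of undirected edges."""
    undirected = set()
    for u, v in directed_edges:
        if u != v:  # ignore self-citations
            undirected.add(frozenset((u, v)))
    return undirected


# Hypothetical toy data: p1 and p2 cite each other, p2 also cites p3.
directed = [("p1", "p2"), ("p2", "p1"), ("p2", "p3")]
undirected = to_undirected(directed)
```

Here the three directed edges become two undirected ones; the only information lost is that `p1` and `p2` cite each other.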

Finally, he announces that the project is due on April 28, since we are a bit late.

References

[1] Rounding Algorithms for a Geometric Embedding of Minimum Multiway Cut. Karger et al. 1999 (1.3438 performance ratio)

[2] An Improved Approximation Algorithm for Multiway Cut. Calinescu et al. 2000 (1.5 - 1/k performance ratio)

[3] A 2-Approximation Algorithm for the Directed Multiway Cut Problem. Naor et al. (approximation factor of 2)

[4] Multiway Cuts in Directed and Node Weighted Graphs. Garg et al. (2 log(k) performance ratio)