Group Reports

Group GGS

Presented by Alex

They tried to use linear programming to solve this problem as described in [1, Section 2.1]. In their formulation, they had the following variables:

Given a graph $G = (V,E)$, let $C \subseteq E$ be a cut of this graph (i.e., a set of edges whose removal separates the seed nodes from each other). With one boolean variable $x_e \in \{0,1\}$ per edge $e \in E$, where $x_e = 1$ means $e \in C$, the linear problem to solve has the following objective function:

$$\min \sum_{e \in E} x_e$$

In their graph, the number of boolean variables of this problem is roughly 1,540,000 (considering the graph without the contexts and with only the “References” edges). There is one constraint per path in the graph between two anchors. The constraints are the following:

$$\sum_{e \in P} x_e \ge 1 \quad \text{for every path } P \text{ between two anchors}$$

The difficulty is that the number of such constraints is very high: they found more than 348,410 directed paths between the first two anchors alone.
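To get a feel for why the number of path constraints explodes, the following minimal Python sketch enumerates the simple directed paths between two anchors of a small made-up graph and prints one covering constraint per path. The graph `toy`, the helper `simple_paths` and the variable naming `x[u->v]` are illustrative assumptions, not the group's actual code; the constraint form (at least one edge of every anchor-to-anchor path must be in the cut) is the standard path formulation of the multicut problem.

```python
from typing import Dict, List, Set, Tuple

Edge = Tuple[str, str]


def simple_paths(adj: Dict[str, List[str]], src: str, dst: str) -> List[List[Edge]]:
    """Enumerate all simple directed paths from src to dst as lists of edges."""
    paths: List[List[Edge]] = []

    def dfs(node: str, visited: Set[str], edges: List[Edge]) -> None:
        if node == dst:
            paths.append(list(edges))
            return
        for nxt in adj.get(node, []):
            if nxt not in visited:
                visited.add(nxt)
                edges.append((node, nxt))
                dfs(nxt, visited, edges)
                edges.pop()
                visited.remove(nxt)

    dfs(src, {src}, [])
    return paths


# Hypothetical toy graph: anchors "a" and "b" joined through two layers.
toy = {
    "a": ["x1", "x2"],
    "x1": ["y1", "y2"],
    "x2": ["y1", "y2"],
    "y1": ["b"],
    "y2": ["b"],
}

constraints = simple_paths(toy, "a", "b")
for path in constraints:
    # One LP constraint per path: at least one of its edges is cut.
    print(" + ".join(f"x[{u}->{v}]" for u, v in path) + " >= 1")
```

Each additional layer of width 2 doubles the number of paths, which is why an explicit list of path constraints quickly becomes impractical on a graph with more than a million edges.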

The problem is still too large to be solved by standard LP algorithms. Indeed, state-of-the-art implementations can handle up to 680,000 variables (according to this benchmark). Moreover, this linear program is actually a zero-one linear program, which is hard to solve exactly (although approximations exist).

Tom suspects that there exists a smaller LP that can be solved.

Reference

[1] M.-C. Costa, L. Létocart and F. Roupin: Minimal Multicut and Maximal Integer Multiflow: a Survey, European Journal of Operational Research, 2005

Group MST

Presented by Marc

First, they tried to make the graph smaller by removing parts that are not relevant when searching for a minimal partition. They simplify the graph by applying the following rules (considering the graph as undirected):

• Nodes of degree 0 (isolated nodes) can be ignored.
• Nodes of degree 1: the single edge adjacent to such a node can be ignored since it will never be in a minimal cut (if it were, removing it from the cut would give a smaller cut that is still a valid partition, assuming that no anchor has degree 1).
• Nodes of degree 2: at most one of the two edges will be removed in a minimal cut (assuming the node is not an anchor). Therefore, we can replace the two adjacent edges by a single edge between the two neighbouring nodes (if such an edge already exists, its weight is increased by one instead).

They tried to extend this approach to nodes of higher degree, but there is no exact reduction in those cases. Nevertheless, they suggest that a heuristic could be built on this idea of simplification extended to nodes of higher degree.
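The three rules above can be sketched in Python as follows. This is a minimal sketch, not the group's implementation: it assumes the graph is undirected, has no self-loops, and is stored as nested dictionaries (`adj[u][v]` is the weight of edge {u, v}). For a degree-2 node it gives the bypass edge the smaller of the two incident weights, which reduces to the "+1" rule on unit weights; that generalisation to weighted edges is an assumption made here.

```python
from typing import Dict, Set

Graph = Dict[str, Dict[str, int]]  # adj[u][v] = weight of undirected edge {u, v}


def simplify(adj: Graph, anchors: Set[str]) -> Graph:
    """Apply the degree-0/1/2 rules until none applies; returns a reduced copy."""
    g: Graph = {u: dict(nbrs) for u, nbrs in adj.items()}
    work = [u for u in g if u not in anchors]
    while work:
        v = work.pop()
        if v not in g or v in anchors:
            continue
        nbrs = g[v]
        if len(nbrs) > 2:
            continue  # no exact rule for higher degrees
        if len(nbrs) == 2:
            (a, wa), (b, wb) = nbrs.items()
            # Bypass v: cutting the path a-v-b costs the cheaper of its two edges.
            w = min(wa, wb)
            g[a][b] = g[a].get(b, 0) + w
            g[b][a] = g[b].get(a, 0) + w
        # Degree 0, 1 or 2: v and its incident edges disappear.
        for n in nbrs:
            del g[n][v]
            work.append(n)  # the neighbour's degree may have dropped
        del g[v]
    return g


# Hypothetical example: anchors s and t, a chain s-p-q-t, a leaf l and an isolated node i.
example: Graph = {
    "s": {"p": 1},
    "p": {"s": 1, "q": 1, "l": 1},
    "q": {"p": 1, "t": 1},
    "t": {"q": 1},
    "l": {"p": 1},
    "i": {},
}
reduced = simplify(example, anchors={"s", "t"})
print(reduced)
```

The worklist re-examines neighbours after every removal because deleting a leaf can turn its neighbour into a new degree-1 or degree-2 node.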

Using the aforementioned simplifications, they succeeded in reducing the number of nodes from 1,150,000 down to ~ 600,000, and finally 595,000 nodes when considering only the weakly connected component that contains all anchors.

Second, they looked for more information about the (modified) graph. They found that the distance between anchors is very small (between 2 and 4), a point worth noting for heuristics that rely on the distance between nodes. They briefly mentioned that local optimization might be used to improve the solution after the heuristics have been applied.

Third, they looked at the cost of a trivial cut to get a baseline against which results can be compared. The trivial cut is the one in which every anchor is in a component with only one node except for the one with highest degree whose component contains all the other nodes. They found that the cost of that cut was 3948.
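As an illustration of how such a baseline can be computed, here is a minimal Python sketch on a made-up graph. The function names `cut_cost` and `trivial_partition`, the toy edge list and the anchor names are assumptions for the example; the cost is taken to be the number of edges whose endpoints end up in different components.

```python
from typing import Dict, Iterable, List, Set, Tuple

Edge = Tuple[str, str]


def cut_cost(edges: Iterable[Edge], component_of: Dict[str, str]) -> int:
    """Number of edges whose endpoints lie in different components."""
    return sum(1 for u, v in edges if component_of[u] != component_of[v])


def trivial_partition(nodes: Iterable[str], edges: List[Edge], anchors: Set[str]) -> Dict[str, str]:
    """Every anchor is alone, except the highest-degree one, which takes all other nodes."""
    nodes = list(nodes)
    degree = {n: 0 for n in nodes}
    for u, v in edges:
        degree[u] += 1
        degree[v] += 1
    hub = max(anchors, key=lambda a: degree[a])
    # Components are labelled by their anchor.
    return {n: (n if n in anchors and n != hub else hub) for n in nodes}


# Hypothetical toy instance with three anchors a, b, c.
toy_nodes = ["a", "b", "c", "x", "y"]
toy_edges = [("a", "x"), ("a", "y"), ("a", "b"), ("b", "x"), ("c", "y"), ("x", "y")]
toy_anchors = {"a", "b", "c"}

partition = trivial_partition(toy_nodes, toy_edges, toy_anchors)
baseline = cut_cost(toy_edges, partition)
print(baseline)
```

On this toy instance the isolated anchors are `b` and `c`, so the baseline is the number of edges touching either of them.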

Tom suggested that it could be interesting to require a minimum number of nodes per component (e.g. 1000) and that the normalized cut might be used for that purpose. He mentioned that local optimizations are heavily used in CAD (for instance, Lin-Kernighan).

Group BGS

Presented by Khaled

They continued to work on the algorithm they mentioned last Tuesday (the one that performs randomized contractions). They have not yet obtained a result because they lack a sufficiently efficient data structure for representing the graph. They tried binary search trees, but the running time is still too high. Tom suggested that the simplifications presented by Marc might help. They now want to try hash tables.

They expect the time required for one contraction to increase in a first phase (because the degree of the nodes increases) and then to decrease (because fewer nodes remain).

Finally, they mentioned that each run must be fast because the algorithm has to be run several times to get better results.

Further Discussions

Normalized Cut

Dirk mentioned that the normalized cut (as he presented it last Tuesday) does not imply components of similar sizes (similar sizes are not what we are looking for, according to Tom). A cut in which one of the components is much smaller than the others is penalized only if the number of edges between it and the others is not much lower than between components of similar sizes.

Consequently, Tom proposes that we measure the quality of the solution using both the unnormalized and the normalized cut formally defined as follows.

Considering an (undirected) graph $G = (V,E)$, define a $K$-terminal cut ($K = 16$) to be a partition $\{V_1, \dots, V_K\}$ of $V$ such that each $V_i$, $i \in \{1, \dots, K\}$, contains one anchor.

Unnormalized cut:

Normalized cut:

Considering the aforementioned trivial partition, the value to beat for the unnormalized cut is 3948 and for the normalized cut, .

CCVisu

Dirk showed the results of applying the energy-based layout algorithm he presented last Tuesday to some graphs. First, he demonstrated that the algorithm works as expected on simple graphs (like a 7-node satellite configuration) and on a random graph with several well-separated clusters (the probability of an edge between nodes in the same cluster is much higher than the probability of an edge between nodes in different clusters). Second, he ran it with different combinations of gravity and repulsion forces on the first 5,000 edges of the citation graph. The interesting point is that the result contains clusters of nodes whose papers are from the same area.

It was also suggested that studying the graph over time, i.e. considering the graph year by year and iteratively building a solution, could be interesting.

Power-law Distribution

Tom recalled that the GGS group mentioned that, according to [2], the citation graph of the computer science literature follows a power-law distribution. A power-law distribution is such that the fraction of nodes with degree $k$ is proportional to $k^{-e}$, where $e$ is a constant. [2] claims that for the citation graph they consider, $e \approx 1.7$. A natural question is whether this holds for our graph and, in particular, which value of $e$ we would obtain. If it does, can we take advantage of it by using heuristics that work particularly well on power-law distributed graphs?
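One rough way to check this on our graph is to fit a straight line to the degree histogram on a log-log scale; the negated slope is then an estimate of the exponent. The sketch below is only an illustration: `estimate_exponent` is a hypothetical helper, and ordinary least squares on the log-log histogram is a simple estimator chosen here for brevity, not necessarily the method used in the cited report. It needs at least two distinct positive degrees.

```python
import math
from collections import Counter
from typing import Iterable


def estimate_exponent(degrees: Iterable[int]) -> float:
    """Least-squares slope of log(fraction) against log(k), negated.

    Returns e such that the fraction of nodes of degree k behaves like k**(-e).
    Nodes of degree 0 are ignored because log(0) is undefined.
    """
    counts = Counter(d for d in degrees if d > 0)
    total = sum(counts.values())
    xs = [math.log(k) for k in counts]
    ys = [math.log(c / total) for c in counts.values()]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    slope = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / sum(
        (x - mean_x) ** 2 for x in xs
    )
    return -slope


# Synthetic degree sequence whose counts are exactly proportional to k**(-2).
synthetic = [1] * 6400 + [2] * 1600 + [4] * 400 + [8] * 100
estimated = estimate_exponent(synthetic)
print(estimated)
```

On real data the tail of the histogram is noisy, so the estimate should be read as an indication rather than a precise value.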

Reference

[2] Y. An, J. Janssen, E.E. Milios: Characterizing and Mining the Citation Graph of the Computer Science Literature, Technical Report CS-2001-02, Faculty of Computer Science, Dalhousie University, 2001

Analysis of Randomized Algorithms

Tom presented the analysis of the randomized algorithm for the 2-terminal problem. The algorithm can be written in the following way:

Consider an undirected weighted graph $G = (V, E, \mathrm{weight})$ where $\mathrm{weight} : E \to \mathbb{R}^{+}$. For a non-weighted graph, initially, $\forall (u,v) \in E$, $\mathrm{weight}(u,v) = 1$.

Contract(V, E, weight)
  while |V| > 2 do
    pick at random an edge (u,v) ∈ E with probability proportional to weight(u,v)
    modify the graph as follows:
      V := V \ {v}
      E := (E \ {(v,w) : w ∈ V}) ∪ {(u,w) : (v,w) ∈ E, w ≠ u}
      weight(u,w) := weight(u,w) + weight(v,w)   for every moved edge (v,w), taking weight(u,w) = 0 if (u,w) is new
  od
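The contraction loop can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not an efficient implementation: the graph is assumed connected, undirected and without self-loops, it is stored as nested dictionaries (hash tables), and the edge list is rebuilt at every step, which costs time proportional to the number of edges per contraction. The name `contract` and the example graph are made up for the sketch.

```python
import random
from typing import Dict, Hashable, Optional

Graph = Dict[Hashable, Dict[Hashable, float]]  # adj[u][v] = weight of edge {u, v}


def contract(adj: Graph, rng: Optional[random.Random] = None) -> float:
    """Run one sequence of random contractions; return the weight of the cut found."""
    rng = rng or random.Random()
    g: Graph = {u: dict(nbrs) for u, nbrs in adj.items()}
    while len(g) > 2:
        # Every undirected edge appears twice (once per direction), so the
        # choice is still proportional to the edge weight.
        edges = [(u, v, wt) for u, nbrs in g.items() for v, wt in nbrs.items()]
        u, v, _ = rng.choices(edges, weights=[wt for _, _, wt in edges])[0]
        # Merge v into u: every edge (v, w) becomes (u, w), weights are summed,
        # and the contracted edge (u, v) itself disappears.
        for w, wt in g.pop(v).items():
            del g[w][v]
            if w != u:
                g[u][w] = g[u].get(w, 0) + wt
                g[w][u] = g[w].get(u, 0) + wt
    # Two super-nodes remain; the weight between them is the value of the cut.
    (_, nbrs), _ = g.items()
    return sum(nbrs.values())


# Hypothetical example: two triangles joined by a single edge (minimum cut = 1).
two_triangles: Graph = {
    1: {2: 1, 3: 1},
    2: {1: 1, 3: 1},
    3: {1: 1, 2: 1, 4: 1},
    4: {3: 1, 5: 1, 6: 1},
    5: {4: 1, 6: 1},
    6: {4: 1, 5: 1},
}
rng = random.Random(0)
results = [contract(two_triangles, rng) for _ in range(200)]
print(min(results))
```

A single run may return a cut that is not minimal, which is why the example keeps the best value over many independent runs.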

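To analyze this algorithm, we must answer the following question: what is the probability that a contraction step picks an edge $(u,v)$ across the optimal cut? Denote this probability by $p$.

Suppose the optimal cut has capacity $k$, and let $n = |V|$.

Clearly $p = \frac{k}{S}$, where $S$ denotes the sum of all edge weights, so we want a lower bound on $S$. We know that $\sum_{v} \mathrm{weight}(w,v) \ge k$ for every fixed node $w$ (otherwise we could find a better cut: $\{w\}$ and $V \setminus \{w\}$). Summing over all $n$ nodes counts every edge twice, so $2S \ge nk$.

Therefore, $S \ge \frac{nk}{2}$.

Consequently, $p \le \frac{k}{nk/2} = \frac{2}{n}$.

Probability that a chosen edge is “good” $\ge 1 - \frac{2}{n} = \frac{n-2}{n}$. The same argument applies after each contraction, with $n$ replaced by the current number of nodes.

Probability that each iteration chooses a good edge $\ge \frac{n-2}{n} \cdot \frac{n-3}{n-1} \cdots \frac{2}{4} \cdot \frac{1}{3} = \frac{2}{n(n-1)} > \frac{2}{n^2}$.

This last figure is a lower bound on the probability that one run of the algorithm finds an optimal cut. Running the algorithm $\frac{n^2}{2}$ times, the probability of finding an optimal cut is at least $1 - \left(1 - \frac{2}{n^2}\right)^{n^2/2} \approx 1 - \frac{1}{e}$ (for large values of $n$).

To sum up, if this algorithm is run $O(n^2)$ times, the probability of finding the optimal cut is constant.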
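A quick numerical check makes the "constant probability" claim concrete. The sketch below assumes the standard per-run success bound of $2/n^2$ for the contraction algorithm and, as an illustrative choice, $n^2/2$ independent runs; the function name `success_probability` is hypothetical.

```python
import math


def success_probability(n: int, runs: int) -> float:
    """Lower bound on P(at least one of `runs` independent runs finds the optimal cut)."""
    p_single = 2 / (n * n)  # assumed per-run lower bound
    return 1 - (1 - p_single) ** runs


for n in (10, 100, 1000):
    runs = n * n // 2
    print(n, runs, round(success_probability(n, runs), 4))

# Limiting value for large n.
print("limit", round(1 - 1 / math.e, 4))
```

The bound stays above $1 - 1/e$ for every $n$ and approaches it from above, so the number of runs needed for a fixed success probability grows only quadratically with the number of nodes.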

Linear Programming Formulations

The LP formulations of the max-flow and min-cut problems were presented. The duality of the two problems is reflected by the duality of their corresponding linear programs. Details on this point can be found in a document available from the course web page.