As our data set we use the file provided by Citeseer: an XML file containing the meta-information about the papers (name, references, abstract, etc.) and the structure of the citation network, with roughly $41,000,000$ lines in total. In order to use the dataset, we need to parse the resource files. We subsequently explain the different parsing approaches we attempted.

One of the most commonly used tools for this kind of pattern extraction is the Unix grep command. We tried to automate the extraction of patterns from the file using grep; after an initial test run, the estimated running time to completion was approximately 11 hours (about 1000 lines per second). Several variants of grep are available, such as fgrep, which does not support the AND operator between patterns, and agrep, which provides approximate matching. Another approach is to parse the file directly in Java. In a sample run, the file could be read without parsing at $350,000$ lines per second, while the parsing rate was $75,000$ lines per second. Using an implementation of the Boyer-Moore string-search algorithm raised the rate to $200,000$ lines per second.
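To illustrate the kind of scanner involved, a minimal Boyer-Moore search using only the bad-character rule might look like the following sketch; the class name and the sample pattern are illustrative, not taken from our implementation:

```java
import java.util.Arrays;

// Boyer-Moore substring search (bad-character rule only).
// Preprocesses the pattern once, then scans text right-to-left per window,
// skipping ahead by the bad-character distance on a mismatch.
public class BoyerMoore {
    private final int[] right = new int[256]; // rightmost index of each char in the pattern
    private final String pattern;

    public BoyerMoore(String pattern) {
        this.pattern = pattern;
        Arrays.fill(right, -1);
        for (int j = 0; j < pattern.length(); j++)
            right[pattern.charAt(j)] = j;
    }

    /** Returns the index of the first occurrence of the pattern in text, or -1. */
    public int search(String text) {
        int m = pattern.length(), n = text.length();
        for (int i = 0; i <= n - m; ) {
            int j = m - 1;
            while (j >= 0 && pattern.charAt(j) == text.charAt(i + j))
                j--;
            if (j < 0)
                return i; // full window matched
            char c = text.charAt(i + j); // mismatched character in the text
            int skip = j - (c < 256 ? right[c] : -1);
            i += Math.max(1, skip);
        }
        return -1;
    }

    public static void main(String[] args) {
        BoyerMoore bm = new BoyerMoore("<title>");
        System.out.println(bm.search("<record><title>Foo</title></record>")); // prints 8
    }
}
```

Because the scan skips several characters per mismatch, this kind of search can run noticeably faster than character-by-character parsing, which is consistent with the speed-up we observed.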

In order to store the parsed information for future use, several methods can be considered. The first is direct access to a database, storing the data there; since this operation would be network I/O bound (SQL requests, etc.), we decided against pursuing it. Another approach is to use the object serialization capability of Java. As initial runs showed, deserializing the object takes a considerable amount of time and memory, since the Virtual Machine needs to create a large number of objects on the heap. The final solution is to use an embedded database optimized for key-based in-memory access: the Berkeley DB Java Edition. Loading the database into memory takes around 5 seconds and requires around 100MB of storage.
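The serialization approach we ruled out can be sketched as follows; the class name, file name, and toy entries are illustrative, not taken from our code. The key point is that every reload must rebuild the entire object graph on the heap:

```java
import java.io.*;
import java.util.HashMap;

// Sketch of storing parsed data via Java object serialization.
// On deserialization the Virtual Machine re-creates every key and value
// object on the heap, which is what makes reloading large maps costly.
public class SerializationDemo {

    // Write the map to disk, then read it back in full.
    @SuppressWarnings("unchecked")
    static HashMap<String, String> roundTrip(HashMap<String, String> map, File file)
            throws IOException, ClassNotFoundException {
        try (ObjectOutputStream out = new ObjectOutputStream(new FileOutputStream(file))) {
            out.writeObject(map);
        }
        try (ObjectInputStream in = new ObjectInputStream(new FileInputStream(file))) {
            return (HashMap<String, String>) in.readObject();
        }
    }

    public static void main(String[] args) throws Exception {
        HashMap<String, String> citations = new HashMap<>();
        citations.put("paper-1", "paper-2,paper-3"); // toy citation entry

        long start = System.nanoTime();
        HashMap<String, String> loaded = roundTrip(citations, new File("citations.ser"));
        long ms = (System.nanoTime() - start) / 1_000_000;
        System.out.println(loaded.size() + " entries reloaded in " + ms + " ms");
    }
}
```

With millions of entries, the per-object allocation cost of `readObject` dominates, whereas an embedded key-value store such as Berkeley DB Java Edition avoids materializing the whole data set at once.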

Ali Salehi 2005-04-29