

Literature Categorization Through Conceptual Associative Spatial Graphs

Submitted by J. van den Berg, 28.02.2002, IBA-E


Scientific and technical literature is often fragmented, so answering a query frequently requires combining information from several sources. Professionals regularly perform literature searches to identify articles that are relevant to a question and related to the topic searched. Categorizing the available literature guides the search and makes it more efficient. Currently, literature is typically categorized by assigning each article to one of a set of standard categories. However, categorization based on the actual associations among published material is much more effective and helps professionals focus on the relevant material and topics. Intelligent computer assistance for automatically building databases of such associations is thus a valuable service.


Erasmus University Rotterdam has developed a new methodology to find associations between related concepts in scientific texts. To construct the representation, all concepts in the articles are first identified. Each concept is assigned to a vertex in a graph. To trace related concepts, an arc is added between two vertices if the corresponding concepts co-occur in an article. Furthermore, relevancy weights for related concepts (based on the number of co-occurrences) are represented in an associative conceptual space (ACS). The combination of the graph and the ACS is called a conceptual associative spatial graph (CASG). After training, the distribution of the concepts in the space reflects co-occurrence: concepts that frequently appear together in the articles are located close to one another. Hence, CASG is an example of a self-organizing system, to a certain extent similar to self-organizing maps. The distances between concepts in the ACS are used as the weights of the arcs in the graph. The CASG allows its users to find paths between concepts, provided that these concepts are linked via other concepts in the literature.
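
The graph part of this construction can be illustrated with a minimal Python sketch. It is not the prototype's implementation: the function names and the sample concepts are hypothetical, and the ACS training step is not reproduced. As a loudly stated assumption, the arc cost is taken to be 1 / (co-occurrence count) as a stand-in for ACS distance, so that frequently co-occurring concepts count as "close". Paths are then found with Dijkstra's algorithm.

```python
import heapq
from collections import defaultdict
from itertools import combinations


def build_cooccurrence_graph(articles):
    """Map each concept to its neighbours and their co-occurrence counts.

    `articles` is an iterable of concept collections, one per article.
    """
    counts = defaultdict(lambda: defaultdict(int))
    for concepts in articles:
        # Every unordered pair of distinct concepts in one article co-occurs once.
        for a, b in combinations(sorted(set(concepts)), 2):
            counts[a][b] += 1
            counts[b][a] += 1
    return counts


def shortest_path(graph, source, target):
    """Dijkstra search; arc cost 1/count is an ASSUMED stand-in for ACS distance."""
    best = {source: 0.0}
    previous = {}
    queue = [(0.0, source)]
    while queue:
        cost, node = heapq.heappop(queue)
        if node == target:
            # Walk the predecessor links back to the source.
            path = [node]
            while node in previous:
                node = previous[node]
                path.append(node)
            return path[::-1]
        if cost > best.get(node, float("inf")):
            continue  # stale queue entry
        for neighbour, count in graph.get(node, {}).items():
            new_cost = cost + 1.0 / count
            if new_cost < best.get(neighbour, float("inf")):
                best[neighbour] = new_cost
                previous[neighbour] = node
                heapq.heappush(queue, (new_cost, neighbour))
    return None  # the concepts are not linked in the literature


# Hypothetical toy corpus: each set holds the concepts found in one article.
articles = [
    {"gene A", "disease X"},
    {"disease X", "drug Y"},
    {"gene A", "disease X"},
]
graph = build_cooccurrence_graph(articles)
path = shortest_path(graph, "gene A", "drug Y")
```

In this toy corpus "gene A" and "drug Y" never appear in the same article, yet a path exists through "disease X". This is the kind of indirect link the CASG is meant to expose.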

Status and results

The methodology has been implemented in a prototype system, and the algorithms for constructing a CASG have been tested. Simulations using artificially constructed test sets have shown convergence to a configuration that reflects the co-occurrence of concepts: clusters of frequently co-occurring concepts are always separated from each other. Paths linking arbitrary concepts can be found instantly. The methodology has also been tested on a set of articles published in Nature Genetics over a period of more than four years. Training was efficient (it completed within 3 hours), after which searching was possible. A performance comparison between this method and competing methods is planned; definitions of precision and recall for paths are needed for this purpose.

Adaptivity and portability

At present, literature categorization is based on the pre-defined categories of library cataloguing systems. Categorization through an intelligent system such as CASG is adaptive in the sense that it follows the actual relations as perceived and used by the contributors to the literature. CASG itself, however, is not adaptive in the EUNITE sense. In the future, adapting CASG to newly available data is bound to become more important, since the generation of literature is a dynamic and ever-changing process. Another interesting question is how to adapt a categorization system trained on data from one field so that it can also function in another field.

More information

J. van den Berg and M. Scheumie. Information retrieval systems using an associative conceptual space. In Proceedings of the 7th European Symposium on Artificial Neural Networks, pages 351-356. Bruges, Belgium, 1999.

C.C. van der Eijk. Knowledge Discovery in Scientific Literature. Master's Thesis. Erasmus University Rotterdam, the Netherlands.
