Human Centered Computing

Extracting bibliographic data from textual documents

Supervisor: Claudia Müller-Birn
Subject: Information Retrieval
Degree: Master's thesis


There are two motivations for this thesis. On the one hand, many historical documents have been digitized in one of our projects. These documents often contain references as unstructured text, which can only be analyzed at large scale once they have been converted into a structured format. On the other hand, users reading research papers often want to add the cited references to their personal literature database. At the moment, this is often cumbersome because the user has to switch work contexts (from the PDF viewer to a Web browser or literature database).

In both cases, a concept needs to be designed, developed, and tested that translates semi-structured information from these documents into a structured data format. The goal is to segment the (marked) text into individual fields such as author, title, venue, and year, and then to compare the results with existing citation retrieval tools such as Web of Science.
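To illustrate what segmenting a reference string into fields could look like, here is a minimal rule-based sketch in Python. It is not the approach the thesis prescribes, which remains to be designed. The pattern `ACM_STYLE` and the function `segment_reference` are hypothetical names introduced for this example, and the pattern assumes references of the form "Authors. Year. Title. Venue ...", as used by two of the entries in the literature list below.

```python
import re
from typing import Dict, Optional

# Hypothetical pattern for one citation style only: "Authors. Year. Title. Venue..."
# Real documents mix many styles, so this is an assumption, not a general solution.
ACM_STYLE = re.compile(
    r"^(?P<author>.+?)\.\s+"          # author list, up to the period before the year
    r"(?P<year>(?:19|20)\d{2})\.\s+"  # four-digit publication year
    r"(?P<title>.+?)\.\s+"            # title, up to the first following period
    r"(?P<venue>.+?)\.?\s*$"          # everything else: venue, publisher, pages
)


def segment_reference(reference: str) -> Optional[Dict[str, str]]:
    """Split a reference string into fields, or return None if the style is unknown."""
    match = ACM_STYLE.match(reference.strip())
    if match is None:
        return None
    return match.groupdict()


if __name__ == "__main__":
    example = (
        "Michael Granitzer, Maya Hristakeva, Kris Jack, and Robert Knight. 2012. "
        "A comparison of metadata extraction techniques for crowdsourced "
        "bibliographic metadata management. In Proceedings of the 27th Annual ACM "
        "Symposium on Applied Computing (SAC '12). ACM, New York, NY, USA, 962-964."
    )
    fields = segment_reference(example)
    for name, value in (fields or {}).items():
        print(f"{name}: {value}")
```

A reference in any other style, or a title that itself contains a period followed by a space, is not handled by this single pattern. That limitation is one reason a thesis on this topic would need to evaluate more robust extraction techniques against existing tools.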


  • Sunita Sarawagi. 2007. Information Extraction. Foundations and Trends in Databases 1, 3, 261–377.

  • Mario Lipinski, Kevin Yao, Corinna Breitinger, Joeran Beel, and Bela Gipp. 2013. Evaluation of header metadata extraction approaches and tools for scientific PDF documents. In Proceedings of the 13th ACM/IEEE-CS joint conference on Digital libraries (JCDL '13). ACM, New York, NY, USA, 385-386. 

  • Michael Granitzer, Maya Hristakeva, Kris Jack, and Robert Knight. 2012. A comparison of metadata extraction techniques for crowdsourced bibliographic metadata management. In Proceedings of the 27th Annual ACM Symposium on Applied Computing (SAC '12). ACM, New York, NY, USA, 962-964.