Duration: January 2022 - December 2026
Subsidy provider: NWO Large-Scale Infrastructure Programme
Subsidy size: 3.8 million euro
Remarkable: The GLOBALISE infrastructure makes it much easier for researchers to conduct research in the so-called ‘Overgekomen Brieven en Papieren’ (‘Letters and papers received’, OBP).
Valorisation: We expect to deliver a first prototype of the research infrastructure in early 2024; all tools and data will be available by the end of 2026.

The GLOBALISE project will make it much easier to do research with the so-called ‘Overgekomen Brieven en Papieren’ (‘Letters and papers received’, OBP) of the Dutch East India Company (VOC). These documents from the 17th and 18th centuries not only provide a view of the organization of the VOC and the colonized societies under its rule, but are also brimming with unique data about the peoples and regions with which it came into contact.

Unique world heritage made digitally accessible

The OBP is the most interesting and important document series in the VOC archives. It consists of almost 5 million handwritten pages that were sent from Batavia to the Dutch Republic. Many documents have not yet been studied. This is not only because of the size of the series, but also because of the language barrier and the difficult handwriting. GLOBALISE will make it easier for everyone in the world to conduct research with these documents. The entire series was recently scanned by the National Archives in The Hague.

From scan to knowledge graph

GLOBALISE will make this resource more accessible by first converting the handwritten documents into computer-readable text using advanced automatic handwritten text recognition techniques. Next, we will train language models to recognize entities (such as people, places, goods, and ships), events (such as diplomatic missions, ship voyages, wars, and rebellions), and dates in the text. Subsequently, we will try to link these data (many millions of entities and events) to a “digital encyclopaedia” of entities and events that we are compiling from the different sources used by the project.

The identification of entities and events through links to data in the digital encyclopaedia will first be carried out automatically and then checked and corrected manually. In addition, we will label the entities and events with terms from a GLOBALISE thesaurus and then place all data in their original context in a knowledge graph. As this will be a large-scale undertaking, we plan to invite guest researchers and the wider interested public to help us annotate and enrich the texts.
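As a rough illustration of the linking step described above, the following Python sketch matches recognized mentions against a tiny stand-in encyclopaedia, turns the matches into simple subject-predicate-object statements of the kind a knowledge graph stores, and sets aside unmatched mentions for manual review. It is not the project's actual implementation: the identifiers, the predicate names (`label`, `type`, `mentionedOn`), the page identifiers, and the exact-match lookup are all assumptions made for this example.

```python
from dataclasses import dataclass

# Hypothetical mini "encyclopaedia": identifier -> (label, type, spelling variants).
# The identifiers and entries are invented for illustration only.
ENCYCLOPAEDIA = {
    "place:batavia": ("Batavia", "Place", {"batavia"}),
    "org:voc": ("Dutch East India Company", "Organisation",
                {"voc", "dutch east india company"}),
}


@dataclass(frozen=True)
class Mention:
    text: str     # the string recognized in the transcribed text
    page_id: str  # hypothetical identifier of the page it was found on


def link_mention(mention):
    """Return the identifier whose spelling variants contain the mention, or None."""
    key = mention.text.strip().lower()
    for entity_id, (_label, _type, variants) in ENCYCLOPAEDIA.items():
        if key in variants:
            return entity_id
    return None


def to_triples(mentions):
    """Build triples for linked mentions; collect unlinked ones for manual review."""
    triples, needs_review = [], []
    for mention in mentions:
        entity_id = link_mention(mention)
        if entity_id is None:
            needs_review.append(mention)
            continue
        label, entity_type, _variants = ENCYCLOPAEDIA[entity_id]
        triples.append((entity_id, "label", label))
        triples.append((entity_id, "type", entity_type))
        # Keep the link back to the page so the data stays in its original context.
        triples.append((entity_id, "mentionedOn", mention.page_id))
    return triples, needs_review


mentions = [Mention("Batavia", "page-0001"), Mention("Bantam", "page-0001")]
triples, needs_review = to_triples(mentions)
```

Here "Bantam" ends up in `needs_review` only because the toy encyclopaedia has no entry for it. A realistic linker would also need to handle spelling variation and ambiguous names, which is where the manual checks come in.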

The encyclopaedia, the thesaurus and the linguistic model describing the relationships between entities and events will all be shared in English. This will make it possible for researchers with limited knowledge of (old) Dutch to track down the data relevant to their work. An easily accessible user interface will allow anyone to wander through the data, create search queries and generate overview visualizations. In this way, GLOBALISE will help accelerate and broaden research on early-modern Asia and the VOC.

A long tradition

With the GLOBALISE project, the Huygens Institute is once again making an important historical resource digitally accessible to the public. The GLOBALISE team is building on the expertise of earlier infrastructure projects, such as Golden Agents and REPUBLIC. In addition, GLOBALISE fits into a long tradition of making VOC sources more widely accessible: between 1960 and 2017, the Huygens Institute and its predecessors published fourteen volumes of transcriptions and editorial notes of the General Letters, which are a series of summary reports within the much larger collection of documents that GLOBALISE processes.

GLOBALISE is a collaboration of the Huygens Institute with the International Institute of Social History, the Digital Infrastructure department of the KNAW Humanities Cluster, the Computational Linguistics & Text Mining Lab of the Vrije Universiteit Amsterdam, the CREATE research program of the University of Amsterdam, and the Dutch National Archives. The project team consists of historians, computational linguists, data specialists and software developers. We expect to deliver a first prototype of the research infrastructure in early 2024; all tools and data will be available open access by the end of 2026.