Asian Directories and Chronicles case study

Overview

The Asian Directories and Chronicles project is based at the Europa Institute Basel (EIB), where it is the subject of multiple PhD projects, and it is a major research focus for Data Futures. It has already produced more than ten research collections using new analysis techniques, and has supported scholars at Aix-Marseille University, ENS-Lyon and Heidelberg University. It is also one of Data Futures' primary evaluators for Web Annotation Data Model workflows, offering a major leap in annotation workflow technologies for sustainable scholarly annotation.

Background

The series of British books called the Asian Directories and Chronicles was published between 1860 and 1940. Each volume provides listings of active corporations, foreign residents and government agencies of all nationalities for that year, together with their addresses in countries including Borneo, China, Indo-China, Japan, Korea and the Philippines. However, although well known to historians and political scientists, these publications are complex, and no longue-durée analysis had been attempted prior to this project.

By 1937, the seventy-fifth year of publication of the Directories, volumes exceeded 2,000 dense pages, compiled annually from a multiplicity of local sources and research, and had become an indispensable source of information for both Western states and communities of foreigners living in Asia. They included treaties, coverage of conflicts, changes in extra-territorial jurisdiction and courts of law, and currencies and taxes; the addresses of all corporations, institutions and consulates; and the occupations, employers and addresses of all the foreign residents. They also provided a host of other information about weights and measures, public holidays, festivals and traditions: almost every aspect of global lifestyle in Asia, from the Buddhist association in Hong Kong to the many branches of Christian missions, Arab and Jewish schools, and from insurance policies to the varieties of horse-racing and golf clubs. In short, they supplied global information at a glance: commerce, law, power and society.

The structure of these publications is typical of corpora known as 'suites', which describe continuing phenomena at regular chronological intervals, and this makes it essential to examine many volumes in order to exploit them effectively. Being global in reach, they were, in the 19th and early 20th centuries, necessarily compiled through aggregation of other independent local resources as well as through original research and reportage, because of the breadth of countries addressed. Moreover, they met ambitious annual printing and distribution deadlines in a world before electronic information technology and air transport, which makes their production all the more remarkable. However, the constraints this imposed on consistent treatment of information from different resources must be recognized. As a consequence, the same entity appearing in volumes from adjacent years is often represented differently: for example, persons' names are replaced elsewhere with initials, and job descriptions do not use a rigorous vocabulary but instead reflect, instance by instance, the discretion of individual contributors. Layered upon this was the rapidly evolving representation of South East Asian nations by different European states, manifest in the gradual transition from a range of transliterations (e.g. Shanghae, from the two characters 上海, or 'Zaan he', meaning 'upon-the-sea') towards wider standardization. Irregular page layout, reflecting the very diverse information in the Directories, and concessions in paper weight and print quality, made necessary by the high page count and transitory nature of these publications, also compound the difficulties of digital extraction in more concrete ways.
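
The two kinds of variation described above, full names against initials and older against later transliterations, can be illustrated with a minimal sketch. It is not the project's method. It assumes entries are written as 'Surname, Given names'; the sample persons, the function names and the one-entry variant table are hypothetical, and only the Shanghae spelling comes from the text.

```python
# Illustrative sketch only: a naive test of whether two directory entries
# COULD refer to the same person. All names below are hypothetical.

# Hypothetical table mapping older transliterations to a later standard form.
PLACE_VARIANTS = {"shanghae": "shanghai"}


def normalize_place(place):
    """Lower-case a place name and map a known variant to its standard form."""
    key = place.strip().lower()
    return PLACE_VARIANTS.get(key, key)


def split_name(entry):
    """Split 'Surname, Given names' into (surname, [given-name tokens])."""
    surname, _, given = entry.partition(",")
    tokens = given.lower().replace(".", " ").split()
    return surname.strip().lower(), tokens


def tokens_compatible(a, b):
    """A single letter is treated as an initial and matches on first letter."""
    if len(a) == 1 or len(b) == 1:
        return a[0] == b[0]
    return a == b


def may_be_same_person(entry_a, entry_b):
    """True if surnames agree and given names agree wherever both are present."""
    surname_a, given_a = split_name(entry_a)
    surname_b, given_b = split_name(entry_b)
    if surname_a != surname_b:
        return False
    # zip() stops at the shorter list, so a missing given name never
    # rules out a match.
    return all(tokens_compatible(x, y) for x, y in zip(given_a, given_b))


print(may_be_same_person("Smith, John William", "Smith, J. W."))
print(may_be_same_person("Smith, John", "Smith, James"))
print(normalize_place("Shanghae"))
```

A rule this permissive only proposes candidate matches. Two different people sharing a surname and initials would pass it, which is why such cases still call for other evidence, such as address or employer, or for expert judgment.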

In consequence, previous considerations of automated analysis of the Directories using conventional techniques indicated that teams of experts and an extended project would be needed even to make progress with individual volumes. The requirement to analyze significant 'runs' of volumes in order to gain any insight into the suite as a whole created a task of overwhelming proportions. Overcoming the challenges of accurate digitization led only to problems of disambiguating multiple representations of the same entity, which required the experience of very skilled contributors. Of course, the issues associated with developing high-quality disambiguated entity data, beyond accurate capture of printed text, are not new in computational methods in the humanities. Analyzing the huge diversity of detailed information condensed in the Directories during the nineteenth century and the first half of the twentieth enables us to better understand the development of the global information society. We can show how information was used as raw material, contributing decisively to the creation of global expertise. This would provide a remarkable new lens with which to better understand the turbulence of the 21st century. However, the real challenge of this project is to address the disambiguation of digitized information at a fundamental level before attempting to draw any quantitative conclusions from this suite.