The Geocoder: A Tool for Geo-Referencing Place Names

The Geocoding Tool

The majority of the datasets used in this project did not contain geo-referenced place-name data, so a tool was required to add coordinates to all individually identifiable place names. To facilitate geo-referencing on such a large scale, the Humanities Research Institute at the University of Sheffield developed an online geocoding tool that carries out automated matching and supports manual checking. It was implemented as a Java servlet, controlled via an AJAX-style web interface. The geocoder also contains simple logging and analysis tools, which can be used to evaluate and optimise its performance on specific datasets.

Challenges in Geo-Referencing Place Name Data

A number of features of the place-name data make geocoding a difficult process. In several of the datasets a "place name" may in fact consist of an address comprising several "segments", which are separated in varying ways, for example by the words "in" or "at" or by a comma. Spelling variations, especially in manuscript sources, create further serious problems for matching, as do variations in naming between different types of source. To deal with these issues the geocoder required an extensive set of string-processing rules for segmenting and normalising place names, and these rules were extended and refined during development to improve the success rate of automated linking.
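The kind of segmenting and normalising rule described above can be illustrated with a minimal Python sketch. It is not the geocoder's own code (which was written in Java and used a far more extensive rule set); the separator pattern, the abbreviation table and the function names are assumptions made for illustration only.

```python
import re

# Assumed separators: a comma, or the words "in" / "at" between segments.
SEPARATORS = re.compile(r"\s*,\s*|\s+(?:in|at)\s+", re.IGNORECASE)

# Hypothetical abbreviation table; a real rule set would be much larger.
ABBREVIATIONS = {"ct": "court", "sq": "square", "la": "lane"}


def normalise(segment: str) -> str:
    """Lower-case a segment, drop punctuation and expand abbreviations."""
    text = re.sub(r"[^\w\s]", " ", segment.lower())  # punctuation -> space
    words = [ABBREVIATIONS.get(word, word) for word in text.split()]
    return " ".join(words)


def segment_place_name(raw: str) -> list[str]:
    """Split a raw place name into normalised segments, most specific first."""
    parts = SEPARATORS.split(raw.strip())
    return [normalise(part) for part in parts if part.strip()]


print(segment_place_name("Red Lion Ct. in Fleet Street, St. Bride"))
```

A rule this simple would wrongly split any place name that itself contains "in" or "at", which is one reason the rules had to be refined against real data.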

The existence of multiple locations with the same name was also an issue, and a hierarchical place-name gazetteer was used to aid identification and disambiguation where possible. Additional indexes were manually created for variant spellings and alternative names; for places that lie outside the boundaries of the Rocque map or are not identified on it; for frequently occurring markup errors; and for places that are not individually identifiable (such as "Stables").
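As a rough illustration of how a hierarchical gazetteer and the supplementary indexes can work together, the following Python sketch resolves a normalised name to coordinates only when exactly one candidate remains. The gazetteer entries, parish labels, coordinates and function name are all invented placeholders, not data or code from the project.

```python
# Hypothetical gazetteer: (place name, containing parish) -> map coordinates.
# All entries and coordinates are placeholders for illustration.
GAZETTEER = {
    ("king street", "parish a"): (100, 200),
    ("king street", "parish b"): (300, 400),
    ("fleet street", "parish c"): (250, 350),
}

# Hypothetical index of variant spellings -> gazetteer spelling.
VARIANTS = {"kings street": "king street", "fleete street": "fleet street"}

# Hypothetical index of names that are not individually identifiable.
UNMAPPABLE = {"stables"}


def lookup(name, parent=None):
    """Return coordinates if the name resolves to exactly one entry, else None."""
    name = VARIANTS.get(name, name)
    if name in UNMAPPABLE:
        return None
    candidates = {key: xy for key, xy in GAZETTEER.items() if key[0] == name}
    if parent is not None:
        parent = VARIANTS.get(parent, parent)
        narrowed = {key: xy for key, xy in candidates.items() if key[1] == parent}
        if narrowed:  # only narrow when the parent actually helps
            candidates = narrowed
    if len(candidates) == 1:
        return next(iter(candidates.values()))
    return None  # unmatched or still ambiguous: a case for manual linking


print(lookup("king street"))               # ambiguous without a parish
print(lookup("kings street", "parish b"))  # variant spelling plus parish
```

In this sketch an ambiguous name is deliberately left unmatched rather than guessed.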

The geocoder also provides a tool for adding explicit manual links in cases where automated linking fails, primarily for place names that match multiple locations on the map and can only be disambiguated by hand. In the future, the geocoder may be adapted to facilitate crowd-sourcing in order to improve the accuracy of the geo-referencing.

Accuracy

The complexity of the data also makes it difficult to measure the geocoder's success rate. For example, in a place name consisting of a street name and a parish, the geocoder may successfully identify the parish but fail to match the street. Even when a match at any level is counted as a success, rates vary considerably between datasets. The Old Bailey Proceedings, a printed text that records place names in a relatively formal and consistent manner, achieved a much higher success rate (66%) than the diverse manuscript sources in London Lives (38%). Geocoding should therefore be seen as an ongoing process, with further refinements planned to improve the results.
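The difference between counting any match and counting only matches at the most specific level can be made concrete with a short Python sketch. The record representation (a list of booleans per place name, ordered from most to least specific segment) and the function names are assumptions for illustration; the sample records are invented and do not reproduce the project's figures.

```python
from collections import Counter


def match_level(segments_matched):
    """Classify one record by how specifically it was matched.

    segments_matched lists one boolean per segment, ordered from the most
    specific segment (e.g. street) to the least specific (e.g. parish).
    """
    if not any(segments_matched):
        return "none"
    if segments_matched[0]:
        return "full"      # the most specific segment was located
    return "partial"       # only a broader segment was located


def success_rates(records):
    """Return the percentage of records matched at any level and in full."""
    total = len(records)
    if total == 0:
        return {"any": 0.0, "full": 0.0}
    counts = Counter(match_level(record) for record in records)
    return {
        "any": 100 * (counts["full"] + counts["partial"]) / total,
        "full": 100 * counts["full"] / total,
    }


# Invented sample: street+parish matched, parish only, nothing, single segment.
sample = [[True, True], [False, True], [False, False], [True]]
print(success_rates(sample))
```

Reporting both figures side by side shows how much of an apparent success rate depends on partial matches.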

The results list for each search of a dataset in Locating London shows the first 200 hits, together with the percentage of those hits that can be mapped. Each hit is labelled either "Show on map" or "This record could not be mapped"; in the latter case, checking the text of the place name may make clear why the location could not be mapped. Clicking on "Show More Hits" displays additional tranches of 200 hits.

Downloadable Geocoder

In the future, we hope to make the geocoder available as an open-source downloadable tool for use by similar projects.