Engineering the Anthroponymical Lexicon of the Historical Low Countries

User-Assisted Data Extraction from Archival Sources Resulting in an Auto-Generated Online Collaborative Database

Dr Wouter Soudan, November 2012

The research project described here encompasses the editing and publication of a large onomastical data set of early medieval anthroponyms (personal names), alongside the technical implementation of its database. We discuss the principal methodological strategies for implementing a semi-automated, user-assisted text analysis procedure, packaged into a web application that allows a community of trusted users to edit, amend and augment the pre-processed base material, thereby collaboratively maintaining an online, editable database of archival sources. It is a case study of text and data mining (information retrieval), involving the automated indexing of a large corpus of digitized plain-text documents, the output of which is a relational database with an upfront querying layer, allowing for customisable search queries that may reveal patterns and extract relationships from a collection of unstructured text documents, such that we may eventually succeed in uncovering the social graph of the early Middle Ages.¹²³

A digital future for Gysseling’s legacy

In 1960, the prolific scholar of Dutch Maurits Gysseling (1919–1997), an esteemed palaeographer and the well-known editor of the Diplomata Belgica (1950), published his Toponymical Lexicon of the Low Countries. At the time, it was — and it still remains — a groundbreaking work of reference for the study of toponymy, onomastics, early Netherlandic, and the history of the early Middle Ages in general. It had cost Gysseling quite a few decades of palaeographic scrutiny and the devotion of a lifetime.

Maurits Gysseling, Toponymisch Woordenboek van België, Nederland, Luxemburg, Noord-Frankrijk en West-Duitsland (vóór 1226). 1960

The print edition of the Toponymical Lexicon is presented as a dictionary of place names, arranged in alphabetical order. For each entry, Gysseling records the many variant orthographies, each assertion holding dozens of scrupulous references to its archival origins.

Handling vast amounts of data

The base material from which Gysseling compiled his Toponymical Lexicon had been his own work, too. From the 1930s on, he had ploughed through the archives from Friesland to Pas-de-Calais, had microfilm sent over from Munich and London, and scrupulously transcribed a nation’s archival heritage. Onto these transcripts he copied place names and the names of persons, as well as the relevant contexts in which they occur.

A pile of weary shoeboxes was awaiting me at my new desk (2009) at the University of Antwerp…

Several dozen weary boxes holding thousands of such excerpt sheets bear witness to Gysseling’s lifelong devotion. One may ask how he was able to manage such a huge undertaking and bring the edition of a dictionary with many thousands of entries to a successful end. How could he turn a pile of papers into two large volumes of the utmost complexity and scholarly scrutiny? With the patience of a monk and the perseverance of an old-school scholar, to be sure.

From thousands of handwritten excerpt sheets to two large volumes of structured entries.

However, a few years after the publication of the Toponymical Lexicon, Gysseling admitted that the compilation of the companion Anthroponymical Lexicon from the same source material was “an onerous task, of which the completion will not be for tomorrow,” fully aware as he was of the sheer multitude of entries required. He had in fact already started preparing the work on this companion lexicon.

Small index cards, handwritten and typewritten, give us an idea of how Gysseling may have envisioned his Anthroponymical Lexicon.

In his legacy, we have about 15,000 small index cards, clearly the onset of his dictionary. But he indeed had to leave this enormous work incomplete. A few weeks before he died, in 1997, Gysseling expressed the explicit wish that this important work be carried out, in order to disclose the highly momentous material on which he had spent the greater part of his scholarly pursuits. But no successor would take on the heavy task.

In his will, Gysseling entrusted his records to professor Jozef Van Loon, with the request that the latter would oversee the posthumous realization of his major scholarly pursuit. It would take Van Loon over a decade to find the time to honor Gysseling’s wish and obtain a project grant from the Flemish Fund for Scientific Research. By then, Van Loon was familiar with the task at hand, as he (in association with Tom J. de Herdt) had edited Gysseling’s Toponymisch Woordenboek and converted it into an online database.

The creation of a digital Anthroponymical Lexicon would be much harder. The Toponymical Lexicon had existed in print since 1960, and was already converted into electronic format in the late 1970s. Its anthroponymical pendant, on the other hand, was barely more than an idea, for which Gysseling had left a few disparate annotations on how such a dictionary might be arranged.

As it turned out, finding a collaborator to develop the software framework was even more difficult. Van Loon’s requirements were quite exacting: his assistant had to read (medio-)Latin, be a trained historian, preferably a medievalist, and — oh, almost forgot — should also “know a bit about computers.” I had just finished my PhD in Art History (as a student of the Age of Enlightenment, I was/am not a big fan of the Dark Ages), and I have a bachelor’s degree in classical languages (and thus am only a half-trained linguist). But I do know a bit about computers… My assignment was simple enough, though: I was tasked (starting January 2009) with taking that huge pile of Gysseling’s papers and turning it into a digital lexicon, preferably in the form of a fully searchable online database.

9,860 typescripts were found at the University of Leuven, in the basements of the Faculty of Arts, where they had been gathering dust for two decades.

Luckily for me, Gysseling’s transcripts were found to have been copied in typed format sometime in the 1970s at the University of Leuven. It turned out that about 90% of Gysseling’s excerpt sheets were indeed available as typewritten copies, so we would only need to type over some 1,500 remaining excerpt sheets.

Analogue and digital typewriting

Gysseling’s original handwritten transcripts were of course useless for a computer-assisted edition. But the typescripts had been produced with a regular typewriter and a clearly legible typeface, and therefore were good candidates for digitization. We could easily turn them into a machine-readable electronic format using Optical Character Recognition (OCR) software.

Converting into a machine-readable format

In principle, both the handwritten transcripts and the typescripts should cover identical data. However, this turned out not always to be the case: occasionally, the typist inadvertently introduced alterations, omissions, errors and typos into the typescript copies.

As we compare Gysseling’s original handwritten transcript with the typewritten copy, we note several alterations, omissions, and typos.

Moreover, the OCR process turned out not to be a trivial task either: many other errors sneaked into the electronic text. The many paleographic peculiarities of the transcripts and their unresolved sigla, and the mix of Latin, Old French and Middle Netherlandic, make for a hermetic text that is hard to read even for an expert, let alone a computer. Numerical values proved especially troublesome and error-inducing, as they may represent dates, page numbers, file sheets or record marks. The computer also often mistakes the letter u for a v and vice versa.

Cases of reading errors that require the intervention of a human editor to be disambiguated.

An interesting palaeographic use case

An accurate transcription of graphemic variants does matter, especially in the case of <u> vs. <v>, as the centuries from which our documents originate are precisely the age in which the Latin alphabet went through a transformational phase. From before the 11th until at least the 14th century, the round <u> and the sharp <v> were not yet distinct letters with different semantic, orthographic or phonemic values, but could be interchanged as key letters of differing styles of calligraphy. It is therefore all the more interesting when we find written testimony of the evolutionary waverings those graphemes went through.

Gysseling’s transcripts include an excerpt from an 11th-century manuscript signed by several scribes, who wrote their respective names in their own hand. One of them literally stands out, as he is the only one who deliberately writes his name, Itesboldvs, with a sharp <v>, while his colleagues write their name endings -us with a round <u> — all the more reason for Itesboldvs to ascribe himself the epithet “scriptor optimvs”, as he, unlike his peers, is truly the “best penman” of them all… Note also how they disagree on the “right” orthography for the bilabial fricative/approximant /β/ or labial-velar approximant /w/ phoneme(s), i.e. our modern digraph double-u (French: double v): UU, VU (and in other manuscripts we also found Uu, Vu, and VV).

Gysseling’s transcripts are scrupulously faithful to the letter: the difference between the round <u> and the sharp <v> may be deliberately intended…

Auto-“magical” batch corrections

It would take a great deal of effort to weed all OCR reading errors due to source-specific issues out of the digitized text. A careful examination of the processed documents showed, however, that we could establish common patterns of erroneous character recognition.

Some of the many OCR reading errors we encountered

Subsequently, these patterns were compiled into a comprehensive sequence of find-and-replace actions, by which we were eventually able to detect, highlight, and even “automagically” correct huge amounts of erroneous data rather swiftly.

The conversion of the scanned typescript into a machine-readable text file using OCR, in its turn, inevitably introduced even more errors. But thanks to our software, many of those could be corrected “automagically” with a few hundred batch processes.
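
A minimal sketch of what such a batch correction pass might look like; the patterns shown are illustrative placeholders, not our actual correction tables:

    <?php
    // Sketch of a batch correction pass over the OCR output.
    // The patterns are illustrative placeholders, not the project's actual tables.
    function correctOcrOutput($text)
    {
        $corrections = [
            '/\b[Il](\d+)\b/' => '1$1',    // capital I or lowercase l misread for a leading digit 1
            '/(\d)O\b/'       => '${1}0',  // letter O misread for a zero after a digit
            '/[ \t]{2,}/'     => ' ',      // collapse superfluous spacing
        ];
        return preg_replace(array_keys($corrections), array_values($corrections), $text);
    }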

Crowdsourcing

However, as conversion errors will inevitably remain, and require human intelligence to be detected, we needed to devise a method by which stubborn errors could be corrected transparently by human editors, too.

From the outset of the project, we thought of input by an authorized community of users as a quintessential feature of the final online edition and web application. Unlike similar electronic edition projects, our digital dictionary would be ‘on its own’, without the financial means for a permanently assigned editor who would correct errors, add newly found documents to the corpus, or scrutinize emendation requests by users of the database. We thus had to embrace ‘crowdsourcing’ as a feasible methodology for scholarly research.

The implementation had to be designed in such a way as to establish a flexible yet trustworthy environment for managing the contributions of an a priori unknown community of independent users. A proper access-permissions management system now ensures that only accredited users are able to make editorial adjustments to our source material. On top of that, we implemented a robust revision control system, by which we can revert at all times to earlier versions of the texts and see who edited what, and exactly when.

All output data is free to everyone. Users who want to edit the source material, must however authenticate — they’ll be assigned a role as an editor, or as a reviewer.

Securing the integrity of the data

We use several online authentication protocols for user login, of which Facebook is probably the most widely used. User accounts, read-write access, user-generated content, tagging and rating have all become well established features of today’s ubiquitous ‘web apps’, yet they remain rather foreign to the Internet ventures of the digital humanities. We nevertheless believe that our future users will feel comfortable using their already curated online ID to identify themselves to our application.

Within this highly secure framework we created an online text editor that runs in a web browser and through which each and every visitor of our website can compare the scans of the original documents with the electronic text that forms the basis of our database.

In the editor application, all excerpts are available as a triple set: Gysseling’s original transcript and the 1970s typewritten sheet, next to the digitized plain text. The electronic text looks like the typescript, but it is editable. Users can correct OCR reading errors and typos, and remove superfluous spacing. As they do so, the edit state of the record is tracked, until the user is satisfied and saves his changes. All users are allowed to review the text and make edits. Only trusted users are granted special rights and can approve or reject edits by other users: logged in as privileged users (reviewers), they validate the edits on the document.

We developed a custom text editor, accessible through all (modern) browsers. Reviewers see the digitized facsimile of Gysseling’s original excerpt, alongside its typewritten copy and the digital plain-text output, on which they can make edits. Editors can approve or reject the edits made by the reviewers.
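
Behind the scenes, the edit state of each record might be tracked along these lines; a minimal sketch, in which the class and state names are assumptions rather than the application’s actual code:

    <?php
    // Sketch of how the edit state of a single excerpt record is tracked.
    // Class and state names are assumptions made for illustration.
    class ExcerptRevision
    {
        const DRAFT     = 'draft';      // a contributor is still editing
        const SUBMITTED = 'submitted';  // saved and awaiting review
        const APPROVED  = 'approved';   // validated by a privileged user
        const REJECTED  = 'rejected';   // sent back to the contributor

        public $state = self::DRAFT;

        public function submit()
        {
            $this->state = self::SUBMITTED;
        }

        public function review($accepted)
        {
            // Only reachable by users holding the reviewing privilege.
            $this->state = $accepted ? self::APPROVED : self::REJECTED;
        }
    }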

Software developers around the world use Git to maintain their program source code, and so do we for our in-house development. It occurred to me that Git could be used not only for managing program code, but also for keeping proper version tracking of our archival source material.

A simplified schema of the application’s architecture.

As this simplified diagram shows, the idea is that each user has his own dedicated cloned copy of the entire repository, holding all of the corpus’s source text files. Users can then make edits on their own copy only and “commit” these changes to another copy of the repository, to which only trusted users have access. As soon as these editors accept the submitted changes, they are “pushed” to the “master” branch, which holds a clean copy of all the text documents and forms the basis for the dynamically indexed database. It is to this database that all queries are directed, ensuring that our users and visitors are always guaranteed the most up-to-date and validated search results.
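
In practice this boils down to ordinary Git operations performed server-side on the user’s clone. A rough sketch of what a contributor’s “save” might trigger; the paths, arguments and shell-out approach are assumptions for illustration, not the application’s actual code:

    <?php
    // Rough sketch of the Git plumbing behind a contributor's "save" action.
    // Paths, arguments and the shell-out approach are assumptions for illustration.
    function commitUserEdit($userRepoPath, $file, $message, $author)
    {
        $cmd = sprintf(
            'cd %s && git add %s && git commit --author=%s -m %s',
            escapeshellarg($userRepoPath),
            escapeshellarg($file),
            escapeshellarg($author),
            escapeshellarg($message)
        );
        exec($cmd, $output, $status);
        return $status === 0;  // once a reviewer accepts, the commit is merged into master
    }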

Side-by-side

If we were to involve users in improving the scope and quality of our source material, we had to give them access to the original handwritten documents, so that they could make detailed comparisons and verify the digitized text. It was clear that Gysseling’s original handwritten excerpts would be a great benchmark alongside the typescripts and could offer a fall-back, so that the ongoing reviewing of the digitized text by our users could take advantage of a side-by-side comparison of the different “edition states”, or versions, of a document. We fancied that such a comparison would be a key feature for all future users of our database, too.

“Dit is de eerste fiche die ik gemaakt heb!” (“This is the first index card I made!”), Gysseling writes at the bottom of excerpt #304. It’s the very first of many thousands that are to follow. Some Easter eggs bear witness to Gysseling’s frugality and to the limited means of his time…
Side-by-side: from the original medieval manuscript, via an early twentieth-century transcript and a late twentieth-century typescript, to our early twenty-first-century digital file.

And the crowning piece would of course be to supply in such a comparison not only Gysseling’s material, but also the very document from which he had excerpted the text. Users could then verify the quality of our electronic edition themselves, and they could help to improve the database.

As national archives disclose their collections online and, in the near future, publish electronic facsimiles of medieval manuscripts, we already have everything in place for users to easily provide links to the images, or to upload their own scanned copies.

Errare humanum est

It goes without saying that providing users with all available data greatly benefits research, scrutiny, mutual criticism, and thus the progress of knowledge. Withholding information indeed contradicts the very essence of scientific research. Not that researchers must be suspected of acting in bad faith, but the mere fact that erring is human is already a liability. We simply cannot trust the authority of a single man, however diligent and meticulous he may be, even when he is considered a primus of his domain.

We saw that Gysseling’s excerpts are highly faithful transcriptions of the original source documents, to the very letter. Yet, very understandably, due to the laborious, monotonous work of transcribing, Gysseling too occasionally let down his guard.

There’s a 12th-century charter of which I found the microfilmed original in Gysseling’s archives. (Gysseling deemed it a forgery, but it is unclear from which indications he infers that judgment.) Gysseling’s transcription literally reads: “Ludolfus abbas s̅c̅ı̅ Laurentii in paludeplaude (sic).” The facsimile instead clearly shows: “Ludolfus abbas sc̅ı̅ Laurencij inplaude”. Gysseling substituted the final j (at the end of “Laurencij”) with an i — likely for reasons of orthographic normalization (which in itself is a contestable resolve). However, he also transcribed a t where the original has a c (which is not altogether implausible, given the presumable pronunciation in medio-Latin). Furthermore, in the original “inplaude” is written as one word, while Gysseling’s transcript has “in plaude”, in two words.

Ludolfus abbas sc̅ı̅ Laurencij inplaude

Gysseling’s error is understandable, as he probably got entangled in a twofold self-correction while restoring the error in the original that he had instinctively rectified in advance. Gysseling read “inplaude”, but, out of automatism, he wrote down “in palude”. Then he noticed that the original was wrong, making him rectify it into “in paludeplaude”, adding “(sic)” to indicate that not he, but the 12th-century scribe had erred. However, Gysseling left the word space in between “in” and “palude”, and he overlooked that he had wrongly corrected the scribe in the preceding word “Laurencij”. The more precise transcription would thus have been: “Ludolfus abbas sc̅ı̅ Laurencij inplaude (sic)”.

This trifling use case does not place Gysseling’s unquestioned prestige as a prominent paleographer under suspicion. It only shows, I assume, that a crowdsourced approach is methodologically justified, especially when dealing with large volumes of data, where the chance of human error is directly proportional to increasing fatigue.

Letters, marks, characters, glyphs and scribal scribbles

On this facsimile reproduction of an original 12th-century manuscript, we can read several of the many interesting paleographic signs, letters and marks that were widely in use in medieval scriptoria.

As an experienced paleographer, Gysseling read these abbreviational and other marks flawlessly. In transcribing them, however, he decided to resolve them. In his handwritten excerpt we see that the abbreviation stroke above the letter b has been transcribed as “er”, and that the 9-shaped abbreviation mark at the end of “Hardbertus” has been resolved by Gysseling as well. The contracted “ecclesie”, however, has not been resolved, and neither has the double p at the start of “prepositus”; in his handwritten excerpt, Gysseling kept the abbreviational overline. On a typewriter it was not possible to render these overlines, and during the typing-over they were lost, whereas the resolved abbreviations “ber” and “us” did get through.

With present-day digital encoding and font technology, we are however able to re-transcribe the original scribal hand as a faithful diplomatic rendition — and the users of our crowd-sourced text editor can help us in achieving this goal.

Encoding: Unicode, MUFI, and OpenType

We do need many special sorts, though, for the encoding of a truly diplomatic edition of our palaeographic sources. Trivial though it may seem, character encoding is still a crucially important issue. Encoding issues still cause software bugs, especially when dealing with extraordinary text material. The digital humanities have special requirements, which must be dealt with carefully, while software consultants do not always have that special kind of philological scrutiny…

Special sorts required for a truly diplomatic rendering of the palaeographic source, encoded using Unicode, MUFI and a custom web font. We had to devise a complex mix of the available encoding standards.

Characters are encoded through a convention whereby letters and other symbols are paired with a numerical value or “code point”, so that they can be stored and exchanged between platforms. The Unicode standard already provides a large set of special letters and sorts, although some rare ones are still wanting.

In an effort to fill this lacuna, medievalists and linguists have created the Medieval Unicode Font Initiative (MUFI) standard, which inventories graphemes found in Western paleographic sources, partly as a subset of Unicode and partly as an addition, pending the official adoption of those characters by the Unicode consortium. While we gratefully make use of MUFI, some graphemes commonly used in medieval source texts are still identified neither in Unicode nor in MUFI, or they have an ambiguous semantic meaning.
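
Where the existing standard suffices, the encoding is straightforward. A minimal sketch (assuming PHP 7 string escapes) of how the abbreviated “sancti” from the charter quoted earlier can be composed from plain Unicode combining marks:

    <?php
    // Encoding the abbreviated "sancti" ("sc̅ı̅") with plain Unicode: base letters
    // followed by U+0305 COMBINING OVERLINE. Characters that are still missing
    // from Unicode fall back to MUFI code points, served by our custom web font.
    $abbreviated = "s" . "c\u{0305}" . "\u{0131}\u{0305}";  // s, c + overline, dotless i + overline
    echo $abbreviated;  // renders as "sc̅ı̅" in a font that supports combining marks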

Some of the more frequent breviatures and contractions, displayed using smart OpenType ligature substitutions.

Moreover, we sometimes need to display ligatures and contractions that cannot be hard-encoded, but nevertheless require special treatment for readability. We therefore designed our own dedicated paleographic typeface and developed the proper web fonts, with powerful OpenType glyph substitution features, in order to make them displayable in regular web browsers.

Teaching the computer how to read (Latin)

The challenge presented by our project is simply stated. Once all of the material is available in a clean electronic format, how are these vast collections of unstructured, plain text to be transformed into a fully searchable database at which users can fire complex queries that generate fine-grained results? All the data is in there. But we need to have it structured. Unless properly instructed, a computer program is unable to recognize particular data types and to discriminate between them out of the box. It needs to be told how to interpret strings of characters, or, if you will, it needs to be taught how to read.

Natural Language Processing

Given a feed of plain text, our goal is to have the software recognize the data types that are of relevance to us. We have a vast body of text files, from which we want to draw an index. The entries of this index are in principle an infinite collection of tokens, arbitrarily spread across the text.

So, we need to collect these tokens from the text in order to generate the index. Similar tokens are grouped under the same lemma, each one paired with a link to its place in the text. Once such indexes are generated, we can take any instance in the list and point back to its origin. And thus, we will indeed have generated a searchable database.

First we collect all the “words” that are of interest, and we list them. The same words are grouped together, while we keep a reference to where we found them. Once the list is ready, we will have an index where we can look up all the places where a specific word occurs in the text.
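
In its simplest form, this amounts to building an inverted index. A minimal sketch, in which the tokenizer is a naive placeholder for the lexical analysis described below:

    <?php
    // Minimal sketch of the index-building step: collect the tokens of interest,
    // group identical forms, and keep a back-reference to where each one was found.
    function tokenize($text)
    {
        // Naive placeholder: split on anything that is not a letter.
        return preg_split('/[^\p{L}]+/u', $text, -1, PREG_SPLIT_NO_EMPTY);
    }

    function buildIndex(array $corpus)  // excerpt id => plain text
    {
        $index = [];
        foreach ($corpus as $excerptId => $text) {
            foreach (tokenize($text) as $token) {
                $index[$token][] = $excerptId;  // token => places where it occurs
            }
        }
        return $index;
    }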

The piece of software responsible for the recognition of entities is usually referred to as a lexical analyzer: it reads in a stream of characters, identifies instances of data types, and labels them as belonging to a certain type.

We will be looking for page numbers and dates.

In our case, we needed to extract, for example, the excerpt sheets’ unique numbers, in order to procure handles to the records in our database. A simple approach to this procedure of tokenization and subsequent data extraction would be a method known as ‘Named Entity Recognition’ (NER). This requires the precompilation of a reference list of unique words or phrases the computer must recognize. In the case of our sheet numbers, this is evidently the wrong approach: if we were able to provide in advance a comprehensive list of reference numbers, then we would already have at our disposal the list that the computer is supposed to generate.

Regular Expressions

Rather than specifying plain keywords that need to match exactly, the more ingenious approach is to define abstract patterns which the incoming character strings need only fit loosely. A well-established technique for constructing such patterns involves what are referred to in programming as ‘regular expressions’ (Regex).

    public function getPattern()
    {
        // Matches a trailing token set between hyphens, e.g. "-123-".
        return '/-[\w\s\.]*-$/';
    }

However, we may want our analyzer to be able to recognize document dates, too, which are also numerical values. Hence, we need to disambiguate and confine the prospective tokens within a numerical range. Instead of pre-compiling a list of years, we can simply define a pattern matching all strings of three or four digits, with values between 500 and 1250.

Part of a detector function that uses a regular expression to match dates between the years 500 and 1250.
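
By way of illustration, such a detector might look as follows; a sketch reconstructed from the description above, not the project’s actual code:

    <?php
    // Sketch of a year detector: match any three- or four-digit number
    // between 500 and 1250. An illustrative reconstruction only.
    function findYearCandidates($text)
    {
        $pattern = '/\b(?:[5-9]\d{2}|1[01]\d{2}|12[0-4]\d|1250)\b/';
        preg_match_all($pattern, $text, $matches);
        return $matches[0];
    }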

Indeed, the human mind does not memorize an exhaustive vocabulary of years either, but, assisted by the constraints imposed by the immediate context, is able to recognize a number consisting of a sequence of digits as a year.

Artificial Intelligence

The same line of reasoning applies in somewhat less obvious cases. For example, Gysseling was not familiar with all personal names in use during the Middle Ages, either. On the contrary, he wanted to compile a lexicon of these names from the sources. He was able to recognize a name when he read one, though, not because he had learned it beforehand, but because he understood the characteristic pattern of an anthroponym. This was in fact exactly how he proceeded with the compilation of his Toponymical Lexicon: he went through his excerpt sheets, underlining each and every place name he encountered and recognized through an understanding of the text’s meaning.

Gysseling went through thousands of his handwritten excerpts, marking all place names. We would design our software to do roughly the same thing for personal names: iterate over all the documents in our corpus, recognize anthroponymical forms, and store them in a database. Programming an intellectual process of the human mind — it’s quite a challenge!

We would need our software to do the same for names of persons. If we were to use a plain vanilla NER approach, we would again need to list all possible instances of all anthroponyms beforehand, and in the end we would again only obtain more or less the same list as the one we had provided. This was obviously impracticable.

A few of the many hundreds of Germanic personal names in use during the early Middle Ages.

We had to start somewhere, though. Doing as Gysseling did, we hand-compiled a list from randomly selected text samples in our corpus, hoping to find a common pattern. In modern orthography, a personal name is of course capitalized, which helps. But in medieval paleography, this rule does not always apply.

More significant in the latter context are the morphemic units an anthroponym is typically composed of. We dubbed them “onyms” or “name pieces”. We notice, for example, that many names end in “baldus”, or “boldus”, or “aldus”, “chardus”, “fridus”, “godus”, “gysus”, “helmus”, “waldus”, “winus”, and so on.

Germanic names: variations on a theme

The same pattern of common morphological units applies at the beginning of personal names. We thus isolated these units, coupled them to their etymological Proto-Germanic stems, compressed these further into a list of unique tokens, extended that list again with findings from other hand-compiled samples, and, finally, used it to compile a list of combinations of beginning and ending name pieces.

Compiling a list of instances of anthroponymical forms into lists of etymological word parts, and, eventually, into an abstract pattern, from which we will generate all conceivable forms: onym + (linking phoneme) + onym + flectional suffix.

Now for each instance on this list of conjectural prototypes, we could establish a pattern. Each anthroponym of Germanic origin can indeed be rewritten as a formula, consisting of an “onym”, which may or may not be followed by a linking phoneme, and another “onym”.

Medio-Latin flectional suffixes, written as regular expressions, reckoning with scribal variations.

The great majority of the documents in our corpus are however in Latin, and thus most anthroponymical assertions are inflected. One single personal name can therefore take many different forms, all of which need to be detected automagically and assigned to their corresponding nominal form or lemma.

Instead of generating an almost infinite list of such flectional variants, we devised computable patterns that implement the Latin flectional cases and capture uninflected Germanic forms, too. With these formulae in place, we are finally able to express anthroponyms as computable patterns.

Finally, we’re able to express anthroponyms as a computable pattern. A single formula matches an almost infinite list of variant names that may or may not appear in the documents.
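
A minimal sketch of how such a formula can be compiled into a single regular expression; the onym and suffix lists shown here are tiny illustrative samples, not our actual tables:

    <?php
    // Compile the formula onym + (linking phoneme) + onym + flectional suffix
    // into one regular expression. The lists are tiny illustrative samples.
    $firstOnyms  = ['bald', 'hard', 'sig', 'wulf'];
    $secondOnyms = ['bald', 'bert', 'frid', 'win'];
    $suffixes    = ['us', 'um', 'i', 'o', 'e', ''];   // medio-Latin flection; '' keeps uninflected forms

    $pattern = '/\b(' . implode('|', $firstOnyms) . ')'
             . '[aeiou]?'                               // optional linking phoneme
             . '(' . implode('|', $secondOnyms) . ')'
             . '(' . implode('|', $suffixes) . ')\b/iu';

    preg_match_all($pattern, 'Ego Sigfridus et frater meus Baldwinus testes sumus', $m);
    // $m[0] now holds the matched forms, e.g. "Sigfridus" and "Baldwinus".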

Equipped with but a few lists, a handful of abstract patterns, and some rules to recognize the several data types in our source texts, we finally taught the computer to read: to detect, recognize and interpret human language, much as a human would rehearse the vocabulary of a foreign language along with its morphological and syntactical rules.

Our detectors recognize card index numbers, archive and document sigla, years and dates, other metadata, such as document place, granter and beneficiary, personal names (in all their many morphological and orthographic appearances), place names (both as proper nouns and adjectivated), offices, functions and roles, and even personal relationships. We finally taught the computer to read!

We start with a blank sheet, in which the computer now recognizes excerpt page numbers, references to archives and document numbers, many different expressions of dates, other metadata such as document place, granter and beneficiary, personal names in their manifold flectional forms and orthopalaeographic variants, as well as place names, offices, functions and roles, and even mentions of personal relationships between individuals.

Sine dato

One more thing. Lots of documents in our corpus are stigmatized with the dreaded “sine dato”, and cannot be dated. For some, Gysseling attempted an approximation, based on such indications as the material condition of the substrate, the document’s historical context, and the characteristics of the scribal hand. Still, we found that a staggering 40% of the documents processed by our data mining software remain undated. That’s a disappointing number indeed: for many use cases — and at least from an historiographical perspective — lots of our search results are thus rather useless. Or are they?

Let’s search for all occurrences of the name “Balduinus”. While we key in our search term, we see the nifty search bar auto-completing as we type, suggesting what we might be looking for. It’s quite powerful. We can browse through the options, picking the exact orthography we are looking for, or select the hyperlemma to get all results, regardless of their orthography or flection.

When we fire the query, the result set is gathered from the database and presented as a list of all the archival excerpts in our corpus which contain a match. The list is organized hierarchically, starting at the conjectural Proto-Germanic hyperlemma, followed by the normalized nominative singular form, followed by the flection and the verbatim orthography as found in the excerpt. “Balduinus” is of course asserted in lots of documents, for it was a traditional Christian name given to the newborn, future counts of Flanders. So there are a lot of results, indeed. And remember, they are not the input of a human editor, but the output of a computer program that did all of the data mining automatically…

In front of the lemma, there’s a little histogram icon. When clicked, it pops up a chart plotting a statistical histogram, showing how the different orthographies of “Balduinus” have been used throughout the ages, all tagged with a certain dating. It seems “Balduinus” was indeed a popular name in the late 12th and early 13th centuries.

However, when we take a closer look at the individual results, we notice that most excerpts are dated, while for some (as mentioned earlier, about 40% on average) our extractor could unfortunately not retrieve a date from the excerpt text. For each individual excerpt we have the relevant metadata available: archive sigil, document number and, when there is one, a date. When the date is absent, we still have the histogram icon available. How can that be? Well, in such cases, the data plot shows a slightly different graph: it’s the composite histogram for all the anthroponyms in the document, based on their occurrences elsewhere, in dated documents. This means we have an extrapolation of all datable assertions of a name, suggesting a computed dating for the document, based on statistics. In fact, our software has already generated approximate datings for all dateless documents in our collection! I imagine that medievalists using these documents might get enthused to hear about our findings and discuss the validity of our calculated suggested datings, so that we can improve on the algorithm. The true power of information technology for historiography and the digital humanities in general…

A statistical plot of all the occurrences we have in our corpus of all variant orthographies of the name “Balduinus”. We developed an algorithm that is able to make educated guesses even for undated documents. The improbable peak in our histogram, around the year 755, evinces a small bug in our date detection — data visualizations like these prove particularly helpful for uncovering errors in the data, which in a mere tabular format would be too impervious to debug.
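
The extrapolation itself boils down to aggregating dated attestations; a simplified sketch, in which the data layout is an assumption for illustration and the actual algorithm is certainly more refined:

    <?php
    // Simplified sketch of the dating extrapolation for an undated excerpt:
    // combine the dated occurrences of every name it contains into one
    // composite histogram. Data layout is assumed for illustration.
    function suggestDating(array $namesInExcerpt, array $datedOccurrences)
    {
        // $datedOccurrences: lemma => list of years in which that name is attested
        $histogram = [];
        foreach ($namesInExcerpt as $lemma) {
            foreach ($datedOccurrences[$lemma] ?? [] as $year) {
                $histogram[$year] = ($histogram[$year] ?? 0) + 1;
            }
        }
        ksort($histogram);   // composite histogram by year; its peak suggests a dating
        return $histogram;
    }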

Data modeling for infinite use cases

The computer has read all of the text files in our corpus and produced orderly structures from them, each node properly typed and enriched with extra metadata: precise information on archives and document references and their condition (which may change, and which users can therefore edit), the qualification of dates (their precision, say, or their literal presence in the source text), and places, with extra metadata such as territories, geographical areas and geolocations, which will be useful for plotting data on auto-generated maps. Personal names, of course, can be browsed, examined and edited in great detail.

The data as it is extracted from the text, and saved into the database. We can now browse through, discover, improve on, and edit all the many properties of each entity.
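
To give an idea of the shape of these typed nodes, a minimal sketch; the class and property names are assumptions for illustration, not the actual schema:

    <?php
    // Minimal sketch of two of the typed nodes the extractor produces.
    // Class and property names are assumptions for illustration.
    class DateAssertion
    {
        public $year;        // e.g. 1187
        public $precision;   // exact, approximate, range, …
        public $literal;     // is the date literally present in the source text?
        public $excerptId;   // back-reference to the excerpt it was found in
    }

    class PlaceAssertion
    {
        public $name;        // as asserted in the excerpt
        public $territory;   // territory or geographical area
        public $geolocation; // latitude/longitude, for plotting on auto-generated maps
        public $excerptId;
    }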

With such a detailed data store in place, we can now have our users perform a myriad of fine-grained search operations. Thanks to the state-of-the-art technologies we’re using, querying our large database and filtering results is very fast.

A “Facebook” of the Middle Ages

I can only show you so much in half an hour. There’s a lot more under the hood, and more to come. For instance: we now have a vast collection of personal names, all dated. These names actually denote individuals, people interacting with each other, saying they are someone’s brother, mother, father or son; a bishop brought his servant as a witness, the count brought his steward.

Eventually, we will be able to unveil the social graph of late medieval times. When algorithms traverse documents, find connections in the Big Data of our archives, re-establish relationships between persons, and profile their faces, we will have a “Facebook” of the Dark Ages.

We can analyze these relationships and uncover the social graph of the early Middle Ages. We will thus in fact be revealing a veritable ‘Facebook’ of the Dark Ages, concealed within the documents, and bridging the gap between linguistics and historiography.

Wrapping up

I must say I am myself quite excited about what is already becoming possible, when I consider where we have come from. When I accepted the challenge, four years ago, the mission seemed impossible. There was a huge pile of papers that had been untouched for a few decades; they had to be turned into a database. But the project was on a tight budget: even if we had had a team of trained and committed editors, they could not have done the job in three or four years.

So, we had to come up with a particular strategy. We rejected the notion of a one-time, single-purpose solution, whereby our massive text material would be edited and marked up by hand. We instead opted for a generalist and reusable approach to accomplish every such task, and thus we sought to develop our own dedicated software.

Our application now offers a digital work environment, using web-based tools, wherein crowdsourcing is embraced as a feasible scholarly method for the edition of archival sources in an ongoing process of improvement.

At this very moment, our software generates one of the largest onomastical databases ever, in only four to five hours. Obviously, this achievement is not the output of a magical “black box”: rather than slogging our way through thousands of documents manually, we invested the time we had available in the research and development of a general-purpose methodology and a reusable strategy to tackle the problem. A lot of effort was put into hand-compiling lists of sample data; it required domain knowledge of the precise nature of the data, of historical linguistics, onomastics and diplomatics, all while having the right skills in information architecture. At the same time we dived into computer science, explored and experimented with several state-of-the-art database technologies, and took our time to make technology decisions through trial and error.

The greatest benefit of our strategy is that the tools we designed may well be of great value to all kinds of electronic text edition projects dedicated to the digital disclosure of archival resources.


A big thank you is due to my brother, Pieter Soudan, who holds a master’s degree in Computer Science and runs a web development shop. Without Pieter, the implementation of the ideas and methodologies discussed in this paper and employed in the project could never have come to fruition. Moreover, Pieter must be credited with the design of the dating extrapolation algorithm.

How can we be of service?

Software development nears completion, and the application has almost finished its first job successfully. Our software has proved its worth and is up for other challenges and new projects.

If your project involves the electronic edition of large document collections, then you might look into our web-based text editor, which offers a crowdsourced solution. If you’re involved with layered critical text editions using standard mark-up, such as HTML, XML and TEI, then you may want to look into our data extraction software. If you’re already curating meticulously annotated electronic editions, then our data visualization techniques can help you offer new statistical insights to your users.

These are but a few buzzwords. The bottom line is: our software is reusable, and could be further customized for all sorts of electronic editions. I’d be glad if I could be of assistance.

If you want to see what was presented in this paper live, then do have a look at our website at AntroponymischWoordenboek.be. You can try the search feature and browse through our database. The crowdsourced text editing application, too, will soon launch in public beta. If you’re interested in a sneak preview of the text-editor app, or if you want to sign up as a private beta-tester, please drop me a line.

AntroponymischWoordenboek.be

If you’re interested in learning more, or want to inquire how the software we are developing may be of use to your own projects, please get in touch!


  1. The present text is based on my talk for the 9th Conference of the European Society for Textual Scholarship (ESTS 2012) (Amsterdam, 2012.11.23), to which are added some extra chapters from talks for the Koninklijke Academie voor Nederlandse Taal- en Letterkunde (KANTL) (Ghent, 2013.01.16), the Royal Academy of Belgium (Brussels, 2013.01.28), and the ING Huygens Instituut (The Hague, 2013.02.05).

  2. The slides for my talk in The Hague can be downloaded as a pdf.

  3. A prior version of this text was accepted for publication by Literary and Linguistic Computing, but as I abhor the pesky back-and-forth of the peer-review process and its inherent byzantine nitpicking, I felt that I might as well just publish the damned thing myself. Cheers!