Automatically enforcing consistent naming of places

Last week I wrote about how new actors find their way into museum-digital’s controlled vocabulary for actors during imports. One of the first steps detailed in the post is the automatic cleanup of the actor’s name and the application of some rules to ensure a consistent naming of actors.

For time names, cleaning is much more extensive, both during imports and when working directly in musdb. Time names in the sense of museum-digital’s controlled vocabulary describe clearly defined times: timespans, years, or full dates. The number of possible input values is thus already significantly limited. In fact, it is small enough, and timespans, years, etc. are uniform enough, that time names can be parsed automatically. Where that is possible, they are rewritten to their default form in the controlled vocabulary and translated into some 30 languages.

In the case of place names, a similar cleaning and consolidation was until now limited to stripping leading and trailing spaces and commas and removing indicators of uncertainty (e.g. trailing question marks). Since the weekend, place names are rewritten more extensively to ease the work of the vocabulary editors and to ensure more immediately consistent data entry, both in musdb and when importing.

General rewriting

1. Simple spelling issues

The very first step in rewriting entered place names remains the removal of superfluous characters. As such, duplicate white spaces, trailing commas, etc. are removed.

Example: , Berlin , > Berlin
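
This first step can be sketched in a few lines of Python. This is only an illustration, not the actual musdb code; the function name strip_superfluous is made up for this post.

```python
import re


def strip_superfluous(name: str) -> str:
    """Remove superfluous whitespace and stray commas (illustrative only)."""
    name = re.sub(r"\s+", " ", name)   # collapse runs of whitespace
    name = re.sub(r"\s+,", ",", name)  # no space before a comma
    return name.strip(" ,")            # trim spaces and commas at both ends


print(strip_superfluous(" ,  Berlin , "))  # prints "Berlin"
```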

2. Removal of indicators for uncertainty

Next, known indicators of uncertainty are used to set the dedicated flag describing the certainty of the link to the place, and are then stripped from the name. As they describe the relation between object and place, they are not actually part of the place name and can thus be removed safely.

Example 1: Berlin ? > Berlin
Example 2: Maybe Budapest > Budapest
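
A minimal sketch of this step, assuming a short, hypothetical list of prefixes (the real list of known indicators is not given in this post) and a made-up function name. It returns the cleaned name together with the uncertainty flag.

```python
# Hypothetical examples of uncertainty prefixes; the real list is longer.
UNCERTAIN_PREFIXES = ("maybe", "probably", "vielleicht")


def split_uncertainty(name: str) -> tuple[str, bool]:
    """Return (cleaned name, is_uncertain)."""
    name = name.strip()
    uncertain = False

    # Trailing question marks signal uncertainty.
    if name.endswith("?"):
        uncertain = True
        name = name.rstrip("? ")

    # So do leading words such as "Maybe".
    for prefix in UNCERTAIN_PREFIXES:
        if name.lower().startswith(prefix + " "):
            uncertain = True
            name = name[len(prefix):].lstrip()
            break

    return name, uncertain


print(split_uncertainty("Berlin ?"))        # ('Berlin', True)
print(split_uncertainty("Maybe Budapest"))  # ('Budapest', True)
```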

3. Removal of duplicates in enumerations

Commas are used in input place names in two ways. On the one hand, they may be used to further specify a place name (Beijing, PRC). On the other hand, they are often used in import data to designate multiple place names at once (Beijing, Tokyo, Nairobi; this contradicts the logic of a database altogether and needs to be cleaned up manually by the vocabulary editors).

In both cases, duplicate names in the enumeration are superfluous and can be removed.

Example 1: Berlin, Germany, Germany > Berlin, Germany
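
As a rough sketch (the function name is invented, and the comparison here is case-sensitive, which may differ from the real implementation), removing duplicates while keeping the original order could look like this:

```python
def drop_duplicate_parts(name: str) -> str:
    """Remove repeated entries from a comma-separated place name."""
    parts = [part.strip() for part in name.split(",")]
    # dict.fromkeys keeps the first occurrence of each part, in order.
    unique = dict.fromkeys(part for part in parts if part)
    return ", ".join(unique)


print(drop_duplicate_parts("Berlin, Germany, Germany"))  # Berlin, Germany
```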

4. Language-dependent rewriting

The next steps in rewriting entries depend on the language of the entry. In musdb, the user’s interface language is used to guess the language they are entering data in. In the case of imports, the default language of the instance of museum-digital that a museum imports to is used.

4.1. Extension of common abbreviations

Where there are abbreviations, there are also unabbreviated names. And surely, both will be used. This inevitably leads to duplicate entries. Some common abbreviations are thus automatically rewritten to a canonical form.

Example (German): Adalbertstr. 12 (Berlin) > Adalbertstraße 12 (Berlin)
Example (Hungarian): Vaci u. 12 (Budapest) > Vaci utca 12 (Budapest)
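
The following sketch shows how such language-dependent replacements might be organized. The table covers only the two abbreviations from the examples above; the structure, the language codes used as keys, and the function name are assumptions for illustration.

```python
import re

# Per-language list of (pattern, replacement). Only a tiny, illustrative
# subset; the lookahead makes sure the abbreviation ends the word.
ABBREVIATIONS = {
    "de": [
        (re.compile(r"str\.(?=\s|$)"), "straße"),     # Adalbertstr. -> Adalbertstraße
        (re.compile(r"\bStr\.(?=\s|$)"), "Straße"),   # Berliner Str. -> Berliner Straße
    ],
    "hu": [
        (re.compile(r"\bu\.(?=\s|$)"), "utca"),       # Vaci u. -> Vaci utca
    ],
}


def expand_abbreviations(name: str, lang: str) -> str:
    """Expand known abbreviations for the given language."""
    for pattern, replacement in ABBREVIATIONS.get(lang, []):
        name = pattern.sub(replacement, name)
    return name


print(expand_abbreviations("Adalbertstr. 12 (Berlin)", "de"))
print(expand_abbreviations("Vaci u. 12 (Budapest)", "hu"))
```

Because the rules are keyed by language, a Hungarian "u." is left alone in a German entry and vice versa.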

4.2. Reordering names in commas based on indicators for more specific place names

As stated above (3.), commas may either indicate a specification of a single place, or they may indicate that the entry actually refers to more than one place. Some components of a name can be used to determine almost certainly that the former is the case, and which of the listed places is the specific one and which is a superordinate place named mainly for clarification. Common indicators of this kind are “street”, “plaza”, and “pier”.

If there is exactly one comma and such a name component is encountered, the entered place name can be rewritten to contain the less specific name only in brackets.

Example (German): Berlin, Adalbertstraße 12 > Adalbertstraße 12 (Berlin)
Example (Hungarian): Vaci utca 12, Budapest > Vaci utca 12 (Budapest)

If both names contain such an indicator, no rewriting is applied. “Vaci utca 12, Vaci utca 13” is thus not rewritten.
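
The rule can be sketched as follows. The indicator lists are short, hypothetical stand-ins, and the simple substring check is cruder than what a production implementation would likely use.

```python
# Hypothetical, incomplete lists of name components marking a specific place.
SPECIFIC_INDICATORS = {
    "de": ("straße", "platz"),
    "hu": ("utca", "körút"),
}


def has_indicator(part: str, lang: str) -> bool:
    lowered = part.lower()
    return any(ind in lowered for ind in SPECIFIC_INDICATORS.get(lang, ()))


def reorder_specific(name: str, lang: str) -> str:
    """Rewrite 'A, B' to 'specific (general)' if exactly one part is specific."""
    parts = [part.strip() for part in name.split(",")]
    if len(parts) != 2:  # only applies if there is exactly one comma
        return name

    first, second = parts
    first_specific = has_indicator(first, lang)
    second_specific = has_indicator(second, lang)
    if first_specific == second_specific:  # both or neither: leave untouched
        return name

    specific, general = (first, second) if first_specific else (second, first)
    return f"{specific} ({general})"


print(reorder_specific("Berlin, Adalbertstraße 12", "de"))
print(reorder_specific("Vaci utca 12, Vaci utca 13", "hu"))  # unchanged
```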

4.3. Budapest special: Extending the names of districts

Street names in Budapest are usually given together with the name of the district. These districts are referred to in a number of ways. If they are referred to using Roman numerals, the numeral is automatically expanded to the canonical form.
This rewrite is only applied if the language is set to Hungarian.

Example: Petőfi Sándor utca 3. Budapest, IV. > Petőfi Sándor utca 3. (Budapest, 4. kerület)
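
A possible sketch of this rewrite is shown below. The regular expression only covers the input shape from the example above, the function names are invented, and the Roman numeral conversion does not reject malformed numerals; the range check relies on Budapest having 23 districts.

```python
import re

ROMAN_VALUES = {"I": 1, "V": 5, "X": 10}

# Matches "<street> Budapest, <roman numeral>." at the end of the name.
DISTRICT = re.compile(r"^(?P<street>.+?)\s+Budapest,\s*(?P<roman>[IVX]+)\.?$")


def roman_to_int(numeral: str) -> int:
    """Convert a Roman numeral made of I, V and X (no validation)."""
    total = 0
    for i, char in enumerate(numeral):
        value = ROMAN_VALUES[char]
        # A smaller value before a larger one is subtracted (IV = 4).
        if i + 1 < len(numeral) and value < ROMAN_VALUES[numeral[i + 1]]:
            total -= value
        else:
            total += value
    return total


def expand_budapest_district(name: str, lang: str) -> str:
    if lang != "hu":  # only applied for Hungarian
        return name
    match = DISTRICT.match(name.strip())
    if match is None:
        return name
    number = roman_to_int(match["roman"])
    if not 1 <= number <= 23:  # Budapest has 23 districts
        return name
    return f"{match['street']} (Budapest, {number}. kerület)"


print(expand_budapest_district("Petőfi Sándor utca 3. Budapest, IV.", "hu"))
```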

4.4. Reordering names in commas based on country names

Similar to the rewrites described in 4.2., country names can be used to indicate a hierarchical relationship between two places in a comma-separated list. If one of the names is a country name and the other is not, it is likely that the non-country place lies within the given country. The comma can then be replaced with brackets and the name reordered into the preferred form “specific (unspecific)”.
This check is also applied to names separated by hyphens.

Example (German): Budapest, Ungarn > Budapest (Ungarn)
Example (Hungarian): Berlin-Németország > Berlin (Németország)

There are, however, some common cases in which this logic does not apply, most notably cardinal directions. If such a term is found to be the non-country part, the rewrite is not applied. West-Deutschland thus remains West-Deutschland.
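
A minimal sketch of this check, with tiny hard-coded sets standing in for the full list of country names and for the list of cardinal directions; all names in the code are made up for illustration.

```python
import re

# Tiny stand-ins for the full lists, keyed by language.
COUNTRIES = {
    "de": {"Ungarn", "Deutschland"},
    "hu": {"Magyarország", "Németország"},
}
CARDINAL_DIRECTIONS = {
    "de": {"Nord", "Süd", "Ost", "West"},
    "hu": {"Észak", "Dél", "Kelet", "Nyugat"},
}


def reorder_by_country(name: str, lang: str) -> str:
    """Rewrite 'Place, Country' or 'Place-Country' to 'Place (Country)'."""
    parts = [part.strip() for part in re.split(r"[,-]", name)]
    if len(parts) != 2:
        return name

    countries = COUNTRIES.get(lang, set())
    first, second = parts
    first_is_country = first in countries
    second_is_country = second in countries
    if first_is_country == second_is_country:  # both or neither
        return name

    place, country = (second, first) if first_is_country else (first, second)
    # Cardinal directions ("West-Deutschland") are not places of their own.
    if not place or place in CARDINAL_DIRECTIONS.get(lang, set()):
        return name
    return f"{place} ({country})"


print(reorder_by_country("Budapest, Ungarn", "de"))
print(reorder_by_country("West-Deutschland", "de"))  # unchanged
```

Note that this simple split leaves names with more than one separator, such as hyphenated place names followed by a country, untouched.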

The list of names of countries and historical countries used stems from Wikidata (thanks!).

Where do these rewrites apply?

The rewrites listed above are now implemented in musdb, the import tool, and nodac.

They have also been used to consolidate existing place names, allowing us to identify some 500 duplicate place entries over the weekend (almost 0.7 percent of the whole vocabulary). Clearly, identifying similar cases of regularly appearing, varied ways of expressing the same thing, and determining a canonical way of naming places in such cases, holds a lot of potential for reducing the editors’ workload and improving data quality for everybody.
