Infrastructure | museum-digital: blog

md:down (Temporarily)

Joshua Ramon Enslin — Tue, 24 Feb 2026 08:32:35 +0000

Between 6 a.m. and 9:15 a.m., museum-digital’s main server unfortunately experienced another outage. The immediate outage has been resolved now.

The larger issue remained the same as so often during the rocky months last year:

When IIIF processes stop working, they keep holding on to RAM until the server is eventually overloaded
With no available RAM left, the database (MySQL) falls over

To bring back the server, we needed to:

Stop the web server
Restart PHP (freeing up RAM)
Restart MySQL (bringing it back)
Restart the web server
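
The order of these steps matters, so it can help to write it down as a script. The following is only a minimal sketch, assuming a systemd-based host. The unit names `nginx`, `php-fpm` and `mysql` are placeholders and not necessarily the ones used on museum-digital's server. By default, the commands are only printed.

```python
import subprocess

# Hypothetical systemd unit names; adjust them to the actual host.
WEB_SERVER = "nginx"
PHP_FPM = "php-fpm"
DATABASE = "mysql"


def recovery_commands(web=WEB_SERVER, php=PHP_FPM, db=DATABASE):
    """Return the recovery steps in the order described above."""
    return [
        ["systemctl", "stop", web],     # 1. stop accepting new requests
        ["systemctl", "restart", php],  # 2. restart PHP, freeing up RAM
        ["systemctl", "restart", db],   # 3. bring the database back
        ["systemctl", "start", web],    # 4. serve requests again
    ]


def recover(dry_run=True):
    """Print each step; only execute it when dry_run is False."""
    for command in recovery_commands():
        print(" ".join(command))
        if not dry_run:
            subprocess.run(command, check=True)
```

Calling `recover()` prints the four commands without running them; `recover(dry_run=False)` would execute them and stop at the first failing step.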

This was, to some extent, to be expected: PHP ran without issue for some weeks, and the outage shows in the real world how close we can get. As such, the solution will be to further reduce available resources for the IIIF setup and regularly restart PHP.

For some background, see also:

Trimming.

Joshua Ramon Enslin — Mon, 29 Dec 2025 01:10:16 +0000

In the last weeks we struggled with server stability. As written before, the critical, resource-heavy and publicly available tasks have for a long time been the generation of timelines (and thus complicated search queries) and on the other hand those involving the processing or generation of large files; namely the IIIF API and PDF generation.

In the last post, I detailed how we severely restricted the availability of the public PDF generation functionalities in museum-digital according to available system resources. That, as it turned out, was not enough to bring reliable stability to our systems. After the server fell over on December 26th once more, we hence moved the IIIF Image API into the same PHP setup used for PDF generation – meaning that any user/IP can only request the API 10 times a minute and that for any instance of museum-digital, only one PHP worker serves it. This allowed us to severely reduce the maximum available resources per worker for the frontend outside of those two use cases (where the IIIF Image API may use up to 80 MB of RAM, no other part of the frontend will go beyond 5). Since then, the system runs as smoothly as if AI scraping had never become an issue.
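
To illustrate what a limit of 10 requests per minute and IP means, here is a minimal sketch of a sliding-window rate limiter. It is not how the limit is actually enforced at museum-digital (this post does not go into that); the class name and its parameters are made up for the example.

```python
import time
from collections import defaultdict, deque


class SlidingWindowLimiter:
    """Allow at most `limit` requests per `window` seconds for each key."""

    def __init__(self, limit=10, window=60.0, clock=time.monotonic):
        self.limit = limit
        self.window = window
        self.clock = clock
        self._hits = defaultdict(deque)  # key (e.g. an IP) -> request times

    def allow(self, key):
        """Return True and record the request if the key is under its limit."""
        now = self.clock()
        hits = self._hits[key]
        # Forget requests that have left the window.
        while hits and now - hits[0] >= self.window:
            hits.popleft()
        if len(hits) >= self.limit:
            return False
        hits.append(now)
        return True
```

A request handler would call `allow(client_ip)` first and answer with an error (for example HTTP 429) whenever it returns `False`.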

A Limited Goodbye to IIIF & Server-Side Image Manipulation

Now, what does that mean in practice? On the one hand, we have not fully removed the IIIF image API. All links generated using it remain valid and will be served, even if comparatively slowly.

On the other hand the user experience with viewing the images in a IIIF viewer will be significantly worse, even though this strongly depends on the IIIF viewer. The most popular “full” IIIF viewers being Mirador and Universal Viewer, significant problems (or a complete inability to use an object’s images) are to be expected with Mirador. Mirador in its default configuration loads multiple segments of an image separately to then assemble the displayed image from those – with the creation of the segments happening on the server, thus consuming resources centrally. It also seems to set extremely low limits on accepted response times, which museum-digital’s IIIF Image API now regularly exceeds due to the aggressive rate limiting. Simply looking at the demo installation of Universal Viewer, the software seems to be much more targeted in its API calls and might still work well despite the restrictions.

As far as I know, there are no published numbers on the market share of the different IIIF image viewers, nor on whether IIIF viewers external to whoever provides the API are regularly used at all. The most jaded – and likely true – assumption would be that the share of users who use IIIF without a viewer hosted next to the API is minuscule and that most users will use one of the viewers mentioned above. Our experience, once again, seems to support that hypothesis: We released our implementation of IIIF 2 in 2020, but essentially nobody noticed before we also started hosting a IIIF viewer.

As we do use Mirador as a viewer, assume the “visible” IIIF image API at museum-digital to be more or less broken. Developers and those making direct use of the API without our installation of Mirador can still benefit from the API. But those are comparatively few.

The radical restriction of resources provided to the IIIF Image API is thus likely indeed a goodbye to IIIF, if a limited one. The basic idea is great – to create a unified way to reference parts of an image (or later a wider media file) and annotate it. In times of significantly increased bot activity, reduced funds, and foreseeably rising hosting costs, our example may be an early sign that the decision to realize that aim by specifying an API to be implemented by the data providers restricts the ability to fully support IIIF to very well resourced institutions. And as funds are shrinking, that is fewer and fewer institutions. Let’s hope that the most basic need IIIF wished to fulfill can be achieved in a different way in the future; one that is accessible to anybody. Realistically this means that computing would need to happen on the client PCs, not on a server.

To end the saga on a more positive note: Since we limited the IIIF Image API, our systems run wonderfully smoothly again and we were able to reduce the overall rate limiting on the rest of museum-digital’s portals. We will monitor the situation and increase the limit slowly to allow more simultaneous API requests without risking stability.

Communications

Second, the whole ordeal posed a challenge to our communication channels. If any significant error occurs anywhere on museum-digital, I personally am sent an encrypted error message via mail. Usually. In this case, the primary component falling over was the PHP server, which is also responsible for managing the sending of mails. If a service fell over, the primary way to learn of it was receiving mails about that instead. Reaction times were thus worse than they needed to be. This means that we need to improve our monitoring.

On the other hand there was the issue of explaining what was going on. We had a thread about it in the forum, which few people read. We had the blog posts. Which few people read. We lack (or lacked) a unified source of information about current events that we can assume people to read. The blog could and should be exactly that.

At the top right of the login screen of musdb, the two most recent blog posts from the respective region as well as from the “development” category of the blog have been shown for years. Then we turned on the “remember me” feature by default, which means that people only very rarely see the login page at all anymore.

The first page most users see upon logging in or opening musdb while logged in is the dashboard, the default subsection of which previously offered a summary of the database contents a user has access to, a tile for writing personal notes to oneself, a tile with messages from the respective regional administrators, a tile for the integration of a discourse forum, and links to the museum elsewhere on the web.

The summary of database contents and the links to the museum elsewhere are certainly useful. The other features not so much. Checking their actual use revealed that barely anybody used any of the note-taking features (likely also because musdb itself offers better alternatives elsewhere), while the discourse integration has not been in use for years. The very first features one sees when opening musdb were thus largely unused, wasting space that could be filled with a feed of relevant blog entries.

And so we removed the unused features and replaced them with a more prettily designed feed. This feed now contains the two newest blog posts from the development feed in the user’s language, the regional or national feed (again in the user’s language) as well as – importantly – the English-language development feed. None of the most recent development-related posts were translated to any language other than the original English, mainly because the time was better spent trying to alleviate or fix the issues than describing them in yet another language. Besides, most people know enough English to grasp the posts. And for those who do not: Community contributions to the blog – also translations for those who do not – are always welcome.

The dashboard in musdb now features a feed of recent news relevant to the development of museum-digital and whatever is going on regionally. The posts are sorted chronologically.

This content is licensed under a Creative Commons Attribution 4.0 International license.

Cleaning Out Our Closet

Joshua Ramon Enslin — Mon, 22 Dec 2025 17:20:40 +0000

Since the last post (i.e. the update to PHP 8.5 amid an onslaught of AI scrapers) and the later introduction of much stricter per-IP rate limiting, the stability issues around md are better – but they are not yet completely resolved.

As such, we have expanded our efforts in rewriting and reformulating key resource-intensive functionalities for increased stability. Different from before, we have also started to fully remove or disable functionalities that are simply not tenable anymore under the current conditions.

PDF Generation

Thus far, there were two basic types of PDFs that were generated (on the server side) in museum-digital’s portals: PDF representations of object pages (“data sheet”) on the one hand and PDFs encapsulating all images of an object in one document for easy printing.

The latter was – simply by nature of its envisioned task – extremely resource-intensive. All image files had to be loaded from disk, embedded into the PDF, compressed and served. The option had thus been available for fewer and fewer objects. Where it was originally available for any object with more than three images, it was later limited to objects with fewer than 40 images. As such, its availability was increasingly hard to communicate clearly, while its usefulness was relatively reduced with the introduction of a new download option for all images of an object. Its natural resource-intensiveness remained a problem, however, and as scrapers will click any link they can find, this type of PDF generation continued to be used quite regularly (every few seconds before the recent surge in bot activity). As of last week, the functionality has been entirely removed.

The “data sheet” PDF generation has been further limited as well. As stated in the previous blog post, its usefulness is significantly reduced with the introduction of a print stylesheet (you will get better results simply pressing CTRL + P on an object page and printing the page to PDF). Nevertheless, it remained rather popular and has not been removed entirely. To reduce its impact on server stability, we however further limited its availability: If the server load is any higher than comfortable, the PDF will not be generated and an error message will appear. If the load is high (up from around 70% of comfortable) and the user’s browser language is not the default language of an instance of museum-digital, the same error message will appear.
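
The two rules can be written as a single check. This is a sketch under assumptions: the function and parameter names are invented for illustration, and "load" stands for whatever load figure the server actually measures.

```python
def may_generate_pdf(load, comfortable_load, user_lang, default_lang,
                     high_fraction=0.7):
    """Decide whether a data sheet PDF should be generated right now."""
    # Rule 1: above the comfortable load, nobody gets a PDF.
    if load > comfortable_load:
        return False
    # Rule 2: from around 70% of the comfortable load upwards, only users
    # whose browser language is the instance's default language are served.
    if load >= high_fraction * comfortable_load and user_lang != default_lang:
        return False
    return True
```

With a comfortable load of 8, for example, a German-language user on a German instance is still served at a load of 6, while an English-language user on the same instance sees the error message.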

Failed Search Pages

If a search query for objects fails, users are forwarded to a failed search page, on which suggestions for alternative search queries are made. This is essentially the same as Google automatically suggesting corrections when search terms contain typos. Identifying the alternatives and offering previews for each is not free. As these are merely suggestions, their benefit and general accuracy fluctuate from case to case.

Now, looking at the logs, we had a large number of queries for non-existing entities – obviously scrapers who were trying out different IDs after analyzing the URL scheme. Each of those queries was executed and then forwarded to the failed search page, triggering the loading of suggestions and previews and thus further using resources on the server for little benefit (besides getting more links to scrape). We have now introduced a similar logic to the limitations on the data sheet PDF generation. Suggestions and previews are only generated when server load is comparatively low, with non-primary language users being slightly disadvantaged vis-a-vis primary-language users in an instance.

Timelines

Timelines remain popular – and a problem. A very common type of query we would see in our logs would combine timelines with searches by start and end date. This was likely due to another possible loop of endless URL generation for scrapers – specify a timeline until it forwards to search pages for a given timespan, then open the timeline for that timespan. Exactly that behavior has now been made impossible. If a search by a timeline (“start after”, “end before”) has been set, timelines will not be offered in the sidebar anymore. Trying to generate them for such a search using URL manipulation or the API will return an error page.
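
In code, the rule boils down to a single guard. The parameter names `start_after` and `end_before` below are hypothetical stand-ins for the two search options named above, not necessarily the names used in museum-digital's URLs.

```python
def timeline_allowed(search_params):
    """Timelines are refused once the search is already limited by date."""
    date_filters = ("start_after", "end_before")  # hypothetical names
    return not any(search_params.get(name) for name in date_filters)
```

The sidebar would only offer timelines when this returns `True`, and a direct request for a timeline would be answered with an error page otherwise.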

Search: Cleanup, Image Search & Checking Entity Existence Early

A messier set of optimizations hit the core of the object search. In around 2021, we introduced a new search logic. Almost all pages relying on the core search logic – search overview pages, maps for objects, timelines – were adjusted to work with the new logic. The only exception to this was the image search. Still, as the new search logic re-used some of the old search logic’s functions, we kept both as separate classes, which grew over time. Simply loading the new search logic took about 1 ms (without OPcache enabled, measured through PHPBench). This sounds like little, but hints at a lack of modularization of the code and gains relevance with many unpredictable requests with servers automatically spinning up and down.

And indeed, in writing the new search logic, we did not thoroughly separate HTML generation, query building and database querying. With last week’s updates, there are now separate classes for each of these, and functionalities relevant only to the old search functions have been moved to the class managing the image search logic. This reduces startup time for only the new / main search logic by about half (ca. 0.6 ms).

Second, we reduced the available search options for image searches. The remaining search parameters are either those actually relevant to the images or those linked to the controlled vocabularies. As a positive side effect, this also solves some issues in communication: Explaining the difference between searching images by their own license and searching them by the license of the (unrelated) metadata of the objects they are linked to is complicated.

Finally, as stated above, the logs revealed a lot of queries for objects linked to e.g. either entirely non-existent places or places that are not linked to any object in the instance of museum-digital altogether. When a place or tag is queried, we hence check whether there exists any public mention of the entity in the current instance of museum-digital during query building. If there is no link at all, it is clear early on that a more detailed (i.e. costly) query combining the search by that entity with other parameters will not return any results.
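
The idea of checking early can be sketched as follows. The example uses Python's built-in SQLite module as a stand-in for MySQL, and the table and column names (`objects`, `object_places`, `is_public`) are invented for the illustration; they are not museum-digital's actual schema.

```python
import sqlite3


def place_is_linked(conn, place_id):
    """Cheap check: is the place linked to at least one public object?"""
    row = conn.execute(
        "SELECT 1 FROM object_places op "
        "JOIN objects o ON o.id = op.object_id "
        "WHERE op.place_id = ? AND o.is_public = 1 LIMIT 1",
        (place_id,),
    ).fetchone()
    return row is not None


def search_objects(conn, place_id, title_part):
    """Run the costly combined query only if the place is linked at all."""
    if not place_is_linked(conn, place_id):
        return []  # no public link at all, so no combination can match
    rows = conn.execute(
        "SELECT o.id FROM objects o "
        "JOIN object_places op ON op.object_id = o.id "
        "WHERE op.place_id = ? AND o.is_public = 1 AND o.title LIKE ? "
        "ORDER BY o.id",
        (place_id, f"%{title_part}%"),
    ).fetchall()
    return [row[0] for row in rows]


def demo_connection():
    """Build a tiny in-memory database for trying the functions out."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(
        """
        CREATE TABLE objects (id INTEGER PRIMARY KEY, title TEXT, is_public INTEGER);
        CREATE TABLE object_places (object_id INTEGER, place_id INTEGER);
        INSERT INTO objects VALUES (1, 'Oil painting', 1), (2, 'Draft sketch', 0);
        INSERT INTO object_places VALUES (1, 10), (2, 20);
        """
    )
    return conn
```

The existence check stops at the first matching row (`LIMIT 1`), which is what makes it much cheaper than the combined query it guards.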

The Current Situation

All these improvements help, but a look at the current real-world numbers is warranted. On the one hand, the database server now often falls down to half or even less of the expected server load. This is a positive sign for system stability outside of peak times.

On the other hand, there are noticeable spikes in the morning (around 10:20 in Germany) and in the afternoon (starting around 5 p.m.). The spike in the morning is likely related to the start of workdays and has led to the server falling over multiple times last week. This can likely be fixed only with a further tuning of the PHP-FPM settings. The spikes in the afternoon and early evening on the other hand remain hard to explain, but are altogether much less critical.

We’re on it.


Updates, AI scrapers, and Resilience

Joshua Ramon Enslin — Tue, 09 Dec 2025 00:11:30 +0000

Between Thursday last week (November 27th) and yesterday (December 6th), museum-digital has seen its most unstable week in about four years. Now that the dust has settled a bit, there’s finally some time to discuss what happened and how we managed to tackle the multiple issues leading to the (very noticeable) instability.

Background

Scrapers

There were (or are) two factors simultaneously pushing our servers to their limits and requiring changes. On the one hand, scraping of museum-digital has gotten even more aggressive. Where we usually had something around 10-30 requests per second across all of museum-digital a year ago, we had around 300 two weeks ago. Right now it’s often between 500 and 700. This number excludes any access to static files.

As I’ve written elsewhere, the scrapers are mostly noticeable by coming from IP ranges in Asia or (to a lesser extent) the US. On the other hand, the IPs change constantly and user-agents etc. resemble regular users. Likely they simply use an actual Chrome browser for scraping. Which is to say, attempting to block them is futile. Worse yet, attempts to block scrapers would likely also impact some real users.

Fortunately, museum-digital is run on dedicated servers paid for by time rather than by compute. The onslaught of scrapers thus has no financial impact on us. But the scrapers still use resources, and as they try to scrape as many different pages as possible, it is much harder to optimize for them than it is to optimize for actual human users (see this article on a similar issue at Wikimedia).

Either way, AI scrapers can result in improvements. Viewed positively, they essentially act as a free stress test on a service and enforce efficiency in all aspects. If most pages are optimized for performance already, scrapers will find the unoptimized ones and bring down a service by overusing those. Which is to say, they help to identify yet unoptimized scripts/pages/classes and enforce that necessary changes are made. At museum-digital, there are three main weak spots that are hard to optimize: timelines, image manipulation (including the IIIF API), and PDF generation.

PHP

On November 20th PHP 8.5 was released. Thus far, museum-digital had been running on PHP 8.3 for web hosting and PHP 8.4 on the command line. When we attempted to update to 8.4 last year, the server fell over. This was mainly caused by the IIIF API (and thus, image manipulation via libvips).

Dependencies at museum-digital are (as is pretty much universal with PHP) handled using the package manager Composer. When we set up a new instance of museum-digital, Composer (managed on version 8.4) required PHP 8.4 or later to run – the new instance, being stuck on version 8.3 for hosting, thus could not run.

That leaves two options: Either to set up composer using PHP 8.3 again, or to simply update everything to the current version. While PHP 8.3 will be supported until 2027, it is generally advisable to update when possible. So updating it was.

Importantly, PHP at museum-digital is run via PHP-FPM. Before the update, we had one socket running per subdomain. This means that if a PHP process serving the frontend stopped working for any reason, users in musdb were impacted as well.

Upgrading PHP to version 8.5

Once we upgraded the PHP version to 8.5 on Thursday, the same problems we faced with PHP 8.4 appeared again. The server would run rather smoothly for some hours, then more and more PHP processes would die and PHP-FPM would fall over for a given subdomain, and users would get a 504 gateway timeout error. Again, the IIIF API and image manipulation were the main causes of PHP-FPM getting stuck. Of course, the number of AI scrapers continuing to use the site did not help.

PHP-FPM settings

A natural first point to consider was the configuration of PHP-FPM. PHP-FPM knows three basic modes for running an application:

ondemand: You define a maximum number of processes the application may use. When a new request is made, idle processes get used. If there is no idle process, PHP-FPM starts a new one. After a specified number of requests or a given number of seconds, an old process is closed. This is primarily aimed at being able to scale way down – if there are no requests, there will be no processes (which is to say, fewer resources are used). On the other hand, starting new processes takes time.
static: You define a number of processes that should always be running for the application. This means that there should always be processes already started and ready for use, but it also means that those processes take up resources even when they are little used. Which is to say, this is useful if one has a high and constant stream of users.
dynamic: You define a maximum number of processes, as well as how many processes should always be running for immediate use, and a (minimum and maximum) number of spare processes to keep running. PHP-FPM then manages whether more processes should be started or whether one of the already running ones should be used. This, in theory, is useful if one wants to reliably and quickly serve users, expects some use all the time, but wants the server to dynamically scale up and down as needed.

With museum-digital spread out over around 80 subdomains, we had thus far used the ondemand mode for most subdomains. Only the largest and most used instances / subdomains of museum-digital were run using dynamic mode. With the update to PHP 8.4 and then 8.5, the behavior of the ondemand mode seems to have changed. If one process dies, the whole subdomain seems to go down with it (I have not found any documentation on this, but it’s evident from the last two weeks).

We hence moved critical subdomains impacted by the errors (which is to say, any “regular” instance of museum-digital) to dynamic mode. As dynamic mode enforces stricter limits on how many processes can be run respective to the available hardware (which is to say, dynamic mode requires a better-written configuration), this also meant that we needed to adjust the specified numbers of processes per subdomain according to their use.
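
To make the relation between the numbers concrete, the following sketch renders a pool fragment for dynamic mode and checks the constraints before writing anything. The directive names (`pm`, `pm.max_children`, `pm.start_servers`, `pm.min_spare_servers`, `pm.max_spare_servers`, `pm.max_requests`) are PHP-FPM's own; the helper function, the pool name, the socket path and all numbers are made up for illustration and are not our production values. The fragment is also incomplete as a pool definition (it leaves out `user`, `group` and the like).

```python
def render_dynamic_pool(name, listen, max_children, start_servers,
                        min_spare_servers, max_spare_servers,
                        max_requests=500):
    """Render a PHP-FPM pool fragment that uses pm = dynamic."""
    # To my knowledge, PHP-FPM refuses to start a dynamic pool if these
    # relations do not hold, so they are checked here before rendering.
    if not 0 < min_spare_servers <= start_servers <= max_spare_servers:
        raise ValueError("need 0 < min_spare <= start_servers <= max_spare")
    if max_spare_servers > max_children:
        raise ValueError("max_spare_servers must not exceed max_children")
    lines = [
        f"[{name}]",
        f"listen = {listen}",
        "pm = dynamic",
        f"pm.max_children = {max_children}",
        f"pm.start_servers = {start_servers}",
        f"pm.min_spare_servers = {min_spare_servers}",
        f"pm.max_spare_servers = {max_spare_servers}",
        f"pm.max_requests = {max_requests}",  # recycle workers after N requests
    ]
    return "\n".join(lines) + "\n"
```

Calling `render_dynamic_pool("frontend-example", "/run/php/frontend-example.sock", 20, 4, 2, 6)` returns a fragment that could be pasted into a pool file, while asking for more spare processes than `max_children` allows raises an error instead of producing a configuration PHP-FPM would reject.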

To actually grasp real use of a subdomain including bots, we turned to the logs we keep for about a week (and then rotate out). In server logs, usually one line corresponds to a single request. With a small script, we loop over all the different subdomains and check how many requests were made. To be really sure that only requests to relevant PHP scripts are counted, we filter them by the presence of the substring “php” before counting. The result for today between 1 a.m. and 4 p.m. looks as follows:

| Requests count in instance                         |      Total |      musdb |        PDF | 
| -----                                              |      ----- |      ----- |      ----- | 
| agrargeschichte.museum-digital.de                  |     341508 |       1245 |        719 | 
| bawue.museum-digital.de                            |     454228 |      12559 |       6819 | 
| bayern.museum-digital.de                           |     176291 |          0 |        158 | 
| berlin.museum-digital.de                           |     223280 |      14917 |       6814 | 
| brandenburg.museum-digital.de                      |      63286 |       6927 |       3873 | 
| bremen.museum-digital.de                           |     221208 |          0 |       2026 | 
| bund.museum-digital.de                             |        261 |        167 |          5 | 
| collectors.museum-digital.de                       |     108398 |        449 |        648 | 
| hamburg.museum-digital.de                          |      35489 |          0 |         11 | 
| hessen.museum-digital.de                           |      50932 |       7962 |       2486 | 
| meckpomm.museum-digital.de                         |      94177 |         11 |        139 | 
| nds.museum-digital.de                              |     137703 |       4105 |       4134 | 
| owl.museum-digital.de                              |     427667 |       1258 |       2412 | 
| rheinland.museum-digital.de                        |      64838 |       1753 |       1276 | 
| rlp.museum-digital.de                              |     207944 |       7405 |       7532 | 
| sachsen.museum-digital.de                          |     120931 |      16117 |       6034 | 
| saarland.museum-digital.de                         |        210 |          0 |          1 | 
| smb.museum-digital.de                              |     228542 |          0 |      11517 | 
| sh.museum-digital.de                               |      21098 |          0 |         48 | 
| st.museum-digital.de                               |     317913 |       6243 |       6217 | 
| thue.museum-digital.de                             |     117893 |          0 |        495 | 
| westfalen.museum-digital.de                        |     101584 |       2033 |       3310 | 
| br.museum-digital.org                              |      43413 |          0 |         16 | 
| jateng.id.museum-digital.org                       |        211 |          0 |          0 | 
| jatim.id.museum-digital.org                        |      23410 |          0 |        159 | 
| lazio.it.museum-digital.org                        |        295 |          0 |          0 | 
| ma.pl.museum-digital.org                           |        385 |          0 |          0 | 
| noe.at.museum-digital.org                          |     906386 |          0 |        369 | 
| tirol.at.museum-digital.org                        |        537 |          0 |          7 | 
| vbg.at.museum-digital.org                          |         96 |          0 |          0 | 
| wien.at.museum-digital.org                         |     472305 |        586 |       3243 | 
| ulster.ie.museum-digital.org                       |      28869 |          0 |          2 | 
| connacht.ie.museum-digital.org                     |        392 |          0 |          0 | 
| va.srb.museum-digital.org                          |       5599 |          0 |         22 | 
| ko.rou.museum-digital.org                          |       9036 |        635 |        567 | 
| mm.rou.museum-digital.org                          |        235 |          0 |          0 | 
| ca.usa.museum-digital.org                          |       3946 |          0 |          0 | 
| ma.usa.museum-digital.org                          |        357 |          0 |          0 | 
| ny.usa.museum-digital.org                          |      19576 |          0 |        294 | 
| syddanmark.dk.museum-digital.org                   |        675 |          0 |          9 | 
| de.pt.museum-digital.org                           |       1241 |          0 |         29 | 
| zh.ch.museum-digital.org                           |     233280 |        512 |        650 | 
| ba.hu.museum-digital.org                           |      99927 |       1901 |         72 | 
| be.hu.museum-digital.org                           |     100830 |        244 |       3005 | 
| bk.hu.museum-digital.org                           |     489446 |         55 |       3985 | 
| bu.hu.museum-digital.org                           |     213616 |       6206 |       5753 | 
| bz.hu.museum-digital.org                           |     598550 |        680 |       1788 | 
| cs.hu.museum-digital.org                           |      88585 |          0 |       1054 | 
| fe.hu.museum-digital.org                           |     199812 |          7 |        215 | 
| gs.hu.museum-digital.org                           |     216680 |       4215 |        912 | 
| hb.hu.museum-digital.org                           |      61250 |          0 |         65 | 
| he.hu.museum-digital.org                           |      26312 |          7 |         26 | 
| jn.hu.museum-digital.org                           |      11970 |          0 |        131 | 
| ke.hu.museum-digital.org                           |     370219 |       2959 |       1680 | 
| no.hu.museum-digital.org                           |     119487 |          0 |       1545 | 
| pe.hu.museum-digital.org                           |     603846 |       2957 |       1446 | 
| so.hu.museum-digital.org                           |     308116 |       6151 |       6698 | 
| sz.hu.museum-digital.org                           |        116 |          0 |          0 | 
| to.hu.museum-digital.org                           |      52406 |          0 |       1229 | 
| va.hu.museum-digital.org                           |     184231 |       2839 |       1666 | 
| ve.hu.museum-digital.org                           |    1015509 |       3672 |        296 | 
| za.hu.museum-digital.org                           |        199 |          0 |          6 | 
| ce.cz.museum-digital.org                           |          3 |          0 |          0 | 
| ccc.cz.museum-digital.org                          |         17 |          0 |          0 | 
| academia.hu.museum-digital.org                     |       9158 |          0 |         13 | 
| cherkasy.ua.museum-digital.org                     |      25567 |          0 |         26 | 
| chernihiv.ua.museum-digital.org                    |       3258 |         99 |        156 | 
| dnipro.ua.museum-digital.org                       |      26725 |          0 |        109 | 
| donetsk.ua.museum-digital.org                      |         17 |          0 |          0 | 
| ivfr.ua.museum-digital.org                         |        722 |          0 |          9 | 
| kharkiv.ua.museum-digital.org                      |      12932 |          0 |         39 | 
| kyiv.ua.museum-digital.org                         |     436482 |       5967 |       1351 | 
| kyivska.ua.museum-digital.org                      |       2159 |          0 |         79 | 
| lviv.ua.museum-digital.org                         |     163358 |        188 |        274 | 
| poltava.ua.museum-digital.org                      |       7657 |        284 |          3 | 
| odesa.ua.museum-digital.org                        |         93 |          0 |          1 | 
| rivne.ua.museum-digital.org                        |      59510 |         65 |        156 | 
| sumy.ua.museum-digital.org                         |      35890 |        303 |          3 | 
| ternopil.ua.museum-digital.org                     |     150700 |         37 |        184 | 
| zhytomyr.ua.museum-digital.org                     |          3 |          0 |          0 | 
| vinnytsia.ua.museum-digital.org                    |      14229 |          0 |          0 | 
| volyn.ua.museum-digital.org                        |      16705 |          0 |        485 | 
| zakarpattia.ua.museum-digital.org                  |       2865 |          0 |         30 | 
| zaporizhzhia.ua.museum-digital.org                 |      24348 |        338 |         56 | 
| scotland.museum-digital.org                        |          0 |          0 |          0 | 
| md.museum-digital.org                              |          0 |          0 |          0 | 
| demo.museum-digital.org                            |         12 |          2 |          0 | 
| goethehaus.museum-digital.de                       |     260072 |          0 |         85 | 
| lmw.museum-digital.de                              |     326724 |          0 |         65 | 
| gedenkstaetten.museum-digital.de                   |       3474 |          0 |          0 | 
| turcica.museum-digital.de                          |      75533 |          0 |          1 | 
| nat.museum-digital.de                              |    1238860 |          0 |       4657 | 
| at.museum-digital.org                              |     631578 |          0 |         89 | 
| cz.museum-digital.org                              |          2 |          0 |          0 | 
| dk.museum-digital.org                              |       5415 |          0 |          4 | 
| hu.museum-digital.org                              |     359619 |          0 |       2827 | 
| id.museum-digital.org                              |       8030 |          0 |          0 | 
| ie.museum-digital.org                              |       2073 |          0 |          0 | 
| it.museum-digital.org                              |         78 |          0 |          0 | 
| rou.museum-digital.org                             |       8277 |          0 |        466 | 
| pl.museum-digital.org                              |        142 |          0 |          0 | 
| pt.museum-digital.org                              |          0 |          0 |          0 | 
| srb.museum-digital.org                             |        565 |          0 |          0 | 
| ua.museum-digital.org                              |     232115 |          0 |        805 | 
| usa.museum-digital.org                             |       3752 |          0 |         34 | 
| ch.museum-digital.org                              |      53417 |          0 |          1 | 
| global.museum-digital.org                          |     727690 |          0 |       2199 |

Note that the number of requests obviously is also impacted by bots changing attention – once a scraper is done with one subdomain, they turn to the next. The elevated number of requests in ve.hu.museum-digital.org is normal, but still starkly exaggerated when compared to other days. The Germany-wide instance is persistently the most frequented one, usually the global one is second at around 80% of requests.

Now equipped with actual numbers, we could scale the PHP-FPM to a much more suitable configuration than before (we had thus far never bothered counting actual requests, instead relying on the number of objects).

A second step in the PHP-FPM configuration was to reduce the impact the problems had. Previously there was one shared configuration and socket per subdomain. On the one hand, this meant that stuck processes in the frontend impacted users in musdb (and vice-versa). On the other hand, some constraints on resource usage cannot be set on a per-directory level but must be set per PHP-FPM socket / server (see the PHP documentation on user.ini and the list of php.ini directives). As the frontend and musdb have different requirements (frontend: low maximum memory use, short timeouts, no file uploads, generally strict settings; musdb: long timeouts for uploads, generally more lenient), being able to configure them independently of each other is useful in general.

We thus separated the configuration for the frontend, musdb, and PDF generation in the frontend; providing dedicated sockets for each. The frontend has a reduced priority on the system overall, strict constraints on how it may be used, etc. The settings are stricter than they were before. musdb has an elevated priority and more lenient settings (file uploads, longer timeouts), in fact more lenient than before. Finally, PDF generation is a special case as it offers no real benefit over the browser’s print tool (see MDN on print CSS), while being resource-intensive. As such, it has a far reduced priority and very strict settings.

With the separated configuration and sockets, we can now better tailor the configuration to each application’s needs and have the added benefit of problems in one application not impacting the other.

Code

As we had already prepared the codebase for PHP 8.4 awaiting an eventual upgrade, the upgrade to PHP 8.5 only required minimal changes. Aside from the deprecation of the functions finfo_close() and curl_close(), references to which were accordingly removed from the code, the update necessitated no further work.

Scaling in Software

Improving the PHP configuration was not enough to fix the issues, especially with the now increased number of requests from bots. To get some breathing room, we adjusted the most resource-intensive pages.

Frontend

In the frontend these are, again, the IIIF API, PDF generation, and timelines. Finally, we made changes to the pages for failed searches to better handle high load situations.

Image pages

The IIIF API was used for the main image pages in the frontend. We used (and use) Mirador as a IIIF viewer. Simply opening an image page thus meant three requests to fetch different regions of an image. Zooming into the image triggered further requests to fetch the relevant parts of the image. Cropping the image to the requested region with IIIF happens on the server (which is no problem if there are few users, but is turning into a problem when you have hundreds of requests per second).

We thus changed the default of image pages: The new default image page is the old, non-IIIF one. As features like zooming into images, that Mirador comes with, are popular and useful and the old image page did not support those, we worked to improve the page. To do so, we rely on OpenLayers, a library we already use for maps. Besides including maps from tile servers, OpenLayers also supports loading simple image files – which we do here. The image is hence loaded once in full size and zooming etc. happen entirely in the browser.

Taking the opportunity, we improved the page overall. An often-noticed problem of image pages thus far was that users who opened image pages coming from external services (think Google Images) had problems identifying that the image was an object image and that there is further object data to be found on object pages. The updated image pages now come with a header stating the name of the image, the name of the object and the name of the institution. Note that many images do not feature a dedicated title; musdb uses the object name as a default image title in that case, which is why the object title will often appear twice in the header. Maybe this can be used as an encouragement for the colleagues working in musdb to more consistently set expressive image titles in the future.

Also new is a mini map at the bottom left, displaying where in the wider context of the image one has currently zoomed in, as well as the ability to link exactly the region one has currently zoomed into. To enable the latter, the URL updates as one zooms or navigates around the image. Somebody else opening the same URL will then open exactly the same image region the linking person was viewing when copying the URL. Finally, we set specific Content Security Policies relevant to the currently opened media. If the displayed media entry is an internally stored image, no external images need to be allowed to load. If the displayed media entry is an audio file stored on archive.org, archive.org needs to be whitelisted as a source for audio files – but only archive.org and no other page. Previously, embedding images from anywhere on the net was allowed, increasing the potential damage an attacker may cause.
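The per-media Content Security Policy described above could be sketched roughly as follows. This is an illustration in Python, not museum-digital's actual (PHP) code: the function name, the media type labels, and the choice of a same-origin baseline are assumptions. Only the directive names (default-src, img-src, media-src) are standard CSP.

```python
from typing import Optional


def content_security_policy(media_type: str, external_host: Optional[str] = None) -> str:
    """Build a Content-Security-Policy value tailored to one media entry.

    media_type and external_host are hypothetical inputs for this sketch.
    """
    # Closed baseline: only same-origin resources may load.
    directives = {
        "default-src": ["'self'"],
        "img-src": ["'self'"],
        "media-src": ["'self'"],
    }
    if external_host is not None:
        # Allow exactly one external host, and only for the matching resource type.
        key = "img-src" if media_type == "image" else "media-src"
        directives[key].append("https://" + external_host)
    return "; ".join(name + " " + " ".join(values) for name, values in directives.items())


# An internally stored image needs no external sources at all.
print(content_security_policy("image"))
# An audio file hosted on archive.org allows archive.org for media only.
print(content_security_policy("audio", "archive.org"))
```

The point of the design is that the allow-list is derived from the single media entry being displayed, so nothing beyond what that page actually needs is permitted.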

Making the use of Mirador a secondary, non-default option reduced the need for server-side image manipulation and the corresponding resource use significantly. The IIIF API remains largely unchanged, but its use must now be requested explicitly.

PDF generation

As stated above, PDF generation offers few advantages over the browser’s print functionality in combination with object pages. On the contrary, the PDFs generated using the frontend’s templates feature less information. But they come with the file ending “.pdf” and seem to be extremely popular with bots. On the other hand, PDF generation means, among other things, loading whatever images are to be embedded into the PDF and manipulating them to fit into the PDF. The resulting files are significantly larger than the corresponding HTML files and thus also use more of the available bandwidth.

The update to handle PDF generation respective to resource usage was already introduced in the last months: publicly linked PDFs are now only generated if overall load on the server is low, if a user has set their browser language to any language different from a museum-digital instance’s default language. As most scrapers do not bother to change their browser language (which means they come with either none, English or Chinese), this means they will mostly be unable to trigger the generation of PDFs. They see an error page instead.

Failed Search Pages

If a user tries to execute a search query without any results, they will get suggestions for similar search terms – similar to how Google will ask one searching for “Berrlin”, if they meant “Berlin”. Trying to identify suitable suggestions obviously costs resources and whether the suggestions are actually what a user wanted is by nature hit or miss – it’s suggestions after all. In the case of scrapers, suggesting alternative search queries offers them a never-ending stream of possible search queries to run and keep scraping the subdomain with – to nobody’s benefit (not even the scrapers’, as they likely got the same content with other search queries already).

We thus now use the same function used to identify whether PDFs should be generated for a user to check if search suggestions should be provided. If a user comes with a non-default browser language and resource use is high, no suggestions will be provided.

Timelines

Timeline pages as implemented in museum-digital’s frontend offer another source of endless links and search queries, as they link to further and further specifications of the time searched by. Again, an improvement already introduced months ago, was to better parse queries by time: If a user searches for objects that are linked to times “after 1920” and “after 1930”, the latter already includes the former. “After 1920 and after 1930” means exactly the same as “after 1930”. Which is one join instead of two – half the resource usage.
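The simplification of overlapping time constraints can be illustrated with a small sketch. It assumes, purely for illustration, that the constraints are combined with AND and represented as (kind, year) pairs; the function name and data shape are hypothetical and do not reflect museum-digital's actual query parser.

```python
def simplify_time_filters(filters):
    """Collapse redundant "after"/"before" year constraints combined with AND.

    "after 1920 AND after 1930" is equivalent to "after 1930", so only the
    strictest bound of each kind needs to reach the database.
    """
    after = [year for kind, year in filters if kind == "after"]
    before = [year for kind, year in filters if kind == "before"]

    simplified = []
    if after:
        # The latest "after" bound implies all earlier ones.
        simplified.append(("after", max(after)))
    if before:
        # The earliest "before" bound implies all later ones.
        simplified.append(("before", min(before)))
    return simplified


print(simplify_time_filters([("after", 1920), ("after", 1930)]))
```

Each constraint that is dropped this way is one join the database no longer has to perform.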

A minor improvement we noticed on the side was the impact of automatic redirects in the timelines. Say, a user searches objects by their link to a given tag and then generates a timeline for said objects. If all objects were created in the 20th century, the timeline will automatically redirect so as to “zoom” into a more appropriate time scale than from the big bang to now. Until the last weekend, script execution was not stopped when that redirect happened – which means that all database queries for the time from the big bang to now were still executed even though the user never got to see them. That is now fixed.

The Anti-Climactical Solution

All of those changes got the frontend more or less stable. Problems with uploading images remained however. Finally, the only thing that helped was uninstalling libvips (which we use for image manipulation) and reinstalling it. That seems to have fixed the issues.

Especially as the number of requests from scrapers continues to increase, the current strategy outlined above seems to be fruitful. By reducing the use (and sometimes the availability altogether) of especially resource-intensive and – depending on the context – not very useful functionalities, much stability can be gained.

The update seems to finally be largely completed (aside from maybe some further fine-tuning of the PHP-FPM configuration) and museum-digital is stable despite the bot problem, while we haven’t had to take more drastic or costly actions yet – such as blocking or adding additional servers.

This content is licensed under a Creative Commons Attribution 4.0 International license.

Making Interoperability Easy

Joshua Ramon Enslin — Mon, 24 Nov 2025 15:56:37 +0000

Interoperability has been one of the focal issues around museum-digital practically since its inception. Offering different, simple ways to bring data into the system was a necessary requirement to even think of what we do. And offering simple ways to get the data out of the system again is just good practice – though all too often neglected.

To that end, there have traditionally been two primary ways for data retrieval. In musdb, one could run batch exports and receive a ZIP with some form of XML files. One per object, with the objects matching the results of any given object search.

On the other hand, there is the public API. Using URL manipulation, one can access the (primary) contents of each page in a machine-readable way. To access the JSON representation of an object’s published metadata, where the object’s ID is 7141 in the Hesse instance of museum-digital (URL: https://hessen.museum-digital.de/object/7141), one simply has to insert json into the path: https://hessen.museum-digital.de/json/object/7141.
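As a minimal sketch of this URL manipulation, the following Python helper prefixes the path of a page URL with the output format. The helper name is hypothetical; only the URL pattern is taken from the example above, and no request is actually sent.

```python
from urllib.parse import urlsplit, urlunsplit


def to_api_url(page_url: str, fmt: str = "json") -> str:
    """Turn a frontend page URL into its API counterpart by prefixing the path."""
    parts = urlsplit(page_url)
    # /object/7141 becomes /json/object/7141
    return urlunsplit(parts._replace(path="/" + fmt + parts.path))


print(to_api_url("https://hessen.museum-digital.de/object/7141"))
```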

Next to the default JSON output, additional APIs are offered wherever suitable for a given data type. For objects, the primary additional output method is a LIDO API.

Thus far, the main limitation of the public API was that it only allowed one object (or institution, collection, etc.) to be queried at a time.

Querying Object Metadata in Batches

After a significant refactoring of the code to load object data for object pages – primarily to improve caching and allow for parallelized requests to the database – we are now finally able to offer APIs for querying object metadata in batch. Thanks to grouped database queries, performance and resource usage scale nicely. Taking simply the currently most recent objects in the Germany-wide instance of museum-digital: Loading all object data of one object and presenting it in JSON takes 0.0087 seconds and loading and generating the JSON for the 100 most recent objects takes 0.197 seconds (or 0.00197 per object). Note that not all queries for all aspects of an object’s metadata are grouped yet; performance may thus get even better over time. This also does not yet account for the overhead of the many HTTP requests one would previously need to execute to get each object’s metadata one by one – real performance improvements are thus even greater.

Now, how to access object metadata in batches?

The batch access is linked with the search API and reuses its main query parameter (“s”). Say, if one is searching for objects related to Berlin (a.k.a. the place of the ID 61), the URL of the respective search page would be https://global.museum-digital.org/objects?s=place%3A61. The corresponding API for retrieving all of the objects’ published metadata would then be https://global.museum-digital.org/export/json/place:61?limit=24&offset=0. Like the search page itself and its primary API (/json/objects), the full batch export API is paginated with a maximum of currently 100 objects per page being returned.
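To walk through a paginated batch export, a client could generate the page URLs as in the sketch below. The function name and the total parameter are assumptions made for illustration (how a client learns the total number of results is not covered here); the URL pattern, the limit and offset parameters, and the maximum of 100 objects per page follow the description above.

```python
from urllib.parse import urlencode


def batch_export_urls(base, query, total, page_size=100, fmt="json"):
    """Yield one batch export URL per page for a search query such as "place:61"."""
    if not 1 <= page_size <= 100:
        raise ValueError("page_size must be between 1 and 100")
    for offset in range(0, total, page_size):
        params = urlencode({"limit": page_size, "offset": offset})
        yield base + "/export/" + fmt + "/" + query + "?" + params


for url in batch_export_urls("https://global.museum-digital.org", "place:61", total=60, page_size=24):
    print(url)
```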

In addition to URL manipulation, the batch export API is linked in the menu of object search results pages.

Currently, the batch export of full object metadata is available for JSON and LIDO (XML) representations of the object data. More can rather easily be added later, should a demand arise.

OAI

Implementing a performant way to export full object metadata in bulk was one of the two main missing components for the long-missing implementation of an OAI-PMH API.

OAI is a standard tailored towards data harvesting. Say, an external service like the German Digital Library or Worldcat wants to do something with external data from diverse sources, e.g. to also display objects or implement a search across the different collections / libraries to find which one has which object / book. To be able to do so, they need to be able to access the respective data in some way. Ideally, using a common standard that describes how to query data, helps to identify any data sets that need to be updated or added (or deleted), and finally presents a uniform way to access the data periodically. That, exactly, is OAI-PMH.

In a nutshell: OAI-PMH allows other services to copy all (published) data from another service in a machine-readable way and can thus significantly improve reuse in aggregation. Of course this only applies to technical questions; legally, potential re-users need to comply with the metadata license applied by the initial data provider regardless of the (technical) means of access.

Since last week, museum-digital provides an OAI-PMH API at /oai respective to a given subdomain. E.g.: https://hessen.museum-digital.de/oai. As of now, the OAI-PMH API provides access to the objects’ metadata using LIDO (XML) and the mandatory OAI-DC format.

Note that there are some caveats remaining for now: First, the LIDO representation of object metadata is not (and can by definition not be) as complete and fine-grained as the JSON API. It is also not exactly similar to the LIDO as returned by exports from musdb (one is formed natively in PHP, the other using XSLT, leading to divergent development paths). Also, the LIDO output lists different identifiers from the ones used by the OAI-PMH API and the OAI-DC representations otherwise.

Finally, the OAI-PMH API at museum-digital does not implement OAI-PMH data sets to group collections. Instead, it follows the existing search logic (essentially providing a new endpoint per query). Example searches via OAI might thus look as follows:

All objects from the Agrargeschichte instance of museum-digital, represented in OAI-DC: https://agrargeschichte.museum-digital.de/oai?verb=ListRecords&metadataPrefix=oai_dc
All objects linked to Berlin (place #61), in the Berlin instance of museum-digital, represented in LIDO: https://berlin.museum-digital.de/oai/place:61?verb=ListRecords&metadataPrefix=lido
All objects of the Freies Deutsches Hochstift, Frankfurt am Main, represented in LIDO: https://hessen.museum-digital.de/oai/institution:1?verb=ListRecords&metadataPrefix=lido
All objects from Baranya with only their identifiers: https://ba.hu.museum-digital.org/oai?verb=ListIdentifiers&metadataPrefix=lido
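The example URLs above all follow the same pattern, which can be sketched as a small helper. The function name and argument names are hypothetical; verb and metadataPrefix are the standard OAI-PMH request parameters, and the optional search segment mirrors the query-per-endpoint approach described above.

```python
from urllib.parse import urlencode


def oai_url(instance, verb, metadata_prefix, search=None):
    """Build an OAI-PMH request URL for a museum-digital instance.

    search is an optional search query such as "place:61" that is appended
    to the /oai path instead of using OAI-PMH sets.
    """
    path = "/oai" if search is None else "/oai/" + search
    params = urlencode({"verb": verb, "metadataPrefix": metadata_prefix})
    return "https://" + instance + path + "?" + params


print(oai_url("berlin.museum-digital.de", "ListRecords", "lido", search="place:61"))
```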

Credits

Credits where credit is due: the Städel Museum deserves praise for their implementation of OAI-PMH. For me personally, seeing the Städel’s API was the first time I saw a visibly stateless OAI-PMH implementation, enabled by the ingenious idea of using machine-readable, JSON-encoded resumption tokens over UIDs that have to be resolved on the server side. I spent weeks (or months?) making museum-digital’s frontend stateless. By following the Städel’s example, it can remain so even while offering an OAI-PMH API.
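To illustrate the idea of a stateless resumption token, the following sketch packs the paging state into JSON so that nothing has to be stored on the server between requests. The field names and the base64 wrapping are assumptions for illustration only; neither the Städel's nor museum-digital's actual token format is reproduced here.

```python
import base64
import json


def encode_token(state: dict) -> str:
    """Serialize the paging state into a URL-safe resumption token."""
    raw = json.dumps(state, sort_keys=True, separators=(",", ":")).encode("utf-8")
    return base64.urlsafe_b64encode(raw).decode("ascii")


def decode_token(token: str) -> dict:
    """Recover the paging state from a token; no server-side lookup is needed."""
    return json.loads(base64.urlsafe_b64decode(token.encode("ascii")))


# Hypothetical state: which query, which format, and where to continue.
state = {"search": "place:61", "metadataPrefix": "lido", "offset": 100}
token = encode_token(state)
print(token)
print(decode_token(token))
```

Because the client can read and alter such a token, a real implementation would need to validate every decoded value before using it.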

This content is licensed under a Creative Commons Attribution 4.0 International license.

Maintenance / Outage: Search Database not Available for Today

Joshua Ramon Enslin — Wed, 11 Jun 2025 15:54:52 +0000

Some minutes ago, our search server crashed. To bring everything back to normal, we will have to re-index, which will take a few hours. As that is being done, most of the frontend will remain operational without issue. musdb will, depending on the instance, be unavailable into the night.

Bringing back character-driven search for inventory numbers in musdb

Joshua Ramon Enslin — Sat, 29 Mar 2025 22:59:32 +0000

If you search for “run”, you want to find entries (objects, blog posts, etc.), that mention “ran”. If you search for inventory numbers like “*1”, you want to find “0001”. These are fundamentally different categories of search. In the first case, you want to have a language-aware full-text search. In the latter case, you simply want to work with characters. In technical terms: Inventory numbers are strings (groups of signs or characters), but not common “text”.
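The difference can be made concrete with a character-level wildcard match, sketched here with Python's standard fnmatch module. This only illustrates the kind of matching meant, not how musdb implements it. Note that fnmatch also treats "?" and "[...]" as special characters, which may or may not be desired for inventory numbers.

```python
from fnmatch import fnmatchcase


def match_inventory_numbers(pattern, inventory_numbers):
    """Return the inventory numbers matching a wildcard pattern, character by character."""
    return [number for number in inventory_numbers if fnmatchcase(number, pattern)]


# "*1" matches anything ending in the character "1", with no language awareness.
print(match_inventory_numbers("*1", ["0001", "0010", "A-1", "12"]))
```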

When musdb and museum-digital’s frontend received their last large-scale update to their respective object search functions around 2021, enabling actually complex search requests across almost all of the fields and data types linkable to objects, this was – among others – made possible by our use of Manticore, a dedicated search server. Traditional relational databases excel at searches by indexes – pre-defined, known search parameters, that one prepares for searchability beforehand – while search servers like Manticore are reasonably good at that and excel at full-text searches.

In moving the search to Manticore, all searches in free text fields were defined as full-text searches. This was mostly the right decision: Full-text searches are the way to go with objects’ titles, descriptions, and the like. But two specific types of fields posed a challenge, because – as indicated above – they usually run on formalized strings that operate very differently from prose: objects’ locations and their inventory numbers. The software does not know about institution-specific and non-standardized rules of formalization, but users do. Hence, the preferred way for those specific types of fields is character-level searching.

For managing locations, we have in the meantime introduced spaces as a dedicated category capable of hierarchization as well as advanced features like the storage of sensor data. Objects’ locations can now simply be expressed as a link to a space, which is by far the superior way when compared to the legacy free text field. If one does so, one can search for objects exactly in a given space, those that are located within it or its sub-spaces (e.g. a box in a depot room), etc. A migration tool from the legacy free-text field to the controlled spaces module is available through musdb’s dashboard. “Fixing” the issue of character-driven searches vs. full-text searches in locations is thus a least-priority issue – a better alternative is available anyway.

With inventory numbers on the other hand, there is no alternative to character-driven searching.

Laying the Foundations: From MySQL to Manticore and (Somewhat) Back

The basis for the expansion of search capabilities for objects was the introduction of a dedicated search server running Manticore. As the number of requests increased, this proved to be a blessing and – to some extent – a curse. Manticore offers more and better search options than a classic relational database, but it does not achieve the same level of stability. As long as queries remain index-bound and not concerned with text, the performance is roughly similar on our hardware (both are about as quick, even with subqueries in MySQL; MySQL uses more resources, but is much more stable). If a query concerns a free-text field on the other hand, there is almost no comparison. Manticore offers a multitude of additional features at a great performance.

As stability had become an issue for a while, we adjusted the search to be able to use Manticore or MySQL as a backend, depending on which was more suitable in a specific context. In practice, this means that each search parameter is translated into a query string for Manticore and – if possible – for MySQL. If all search parameters have a MySQL equivalent, the search will be performed using the MySQL backend. Otherwise, Manticore will be used.
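The negotiation between the two backends can be sketched as follows. The class, field names, and query strings are hypothetical stand-ins; the only rule taken from the text is that MySQL is used when every search parameter has a MySQL equivalent, and Manticore otherwise.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class SearchParameter:
    """One search parameter, translated for each backend where possible."""
    manticore_query: str
    mysql_query: Optional[str] = None  # None: no MySQL equivalent exists


def choose_backend(parameters):
    """Use MySQL only if every parameter can be expressed in MySQL."""
    if all(p.mysql_query is not None for p in parameters):
        return "mysql"
    return "manticore"


# Placeholder query strings, not real Manticore or MySQL syntax from musdb.
tag = SearchParameter("tag-helmets", "tag_id = 1")
place = SearchParameter("place-europe", "place_id = 2")
fulltext = SearchParameter("fulltext-helmets")

print(choose_backend([tag, place]))       # index-bound only
print(choose_backend([fulltext, place]))  # contains a full-text parameter
```

With an empty parameter list the check passes trivially, so a search for all accessible objects also ends up on MySQL.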

This simple way of negotiating which backend is more suitable works only as long as one of the alternatives (Manticore) supports all search options, while the other (MySQL) is preferable in a subset of the search contexts. Which is to say, character-driven searches in inventory numbers break the negotiation logic – they work somewhat well in MySQL, but do not work in Manticore.

Breaking the Logic / Mitigating Confusion

Up to this weekend, all search options were compatible with each other:

If one searches for all objects one has access to, both Manticore and MySQL can handle the query. MySQL will be used.
If one searches for all “helmets” (tag) from “Europe” (place), both Manticore and MySQL can handle the query. MySQL will be used.
If one searches for “helmets” (full-text) from “Europe” (place), MySQL can only sufficiently handle the search by place, while Manticore can meet both search requirements. Manticore will be used.

Character-driven searches by inventory numbers break that compatibility. If one were to search for “helmets” (full-text) with inventory numbers starting with 1 (“1*”), the search parameter “helmets” could only be satisfied by Manticore, while the character-driven search by inventory numbers can only be satisfied by MySQL. Which is to say, the combined search cannot be executed.

Due to popular demand, we introduced character-driven searches for inventory numbers back into musdb. As there is no way to sensibly combine all search parameters anymore, given our circumstances, we had to reduce the resulting confusion. For this, there are theoretically two ways.

The theoretically cleaner way would have been to disable the extension of search queries by full-text-focused parameters once an inventory number had been searched. As a full-text search by inventory number is theoretically still possible, the opposite direction (setting a full-text search first, then searching by inventory number) might still have been acceptable, as it would not have led to visibly different results. The basic idea of this solution would have been to prevent users from performing combined searches that are not possible in the targeted way. But if users actually managed in some way, the confusion would have been major. Worse yet, it would have been hard to explain – or rather, it would have been hard to find an appropriate spot in the UI for an explanation – why certain search options are suddenly disabled.

The alternative route we chose is to allow users to do the impossible combinations, perform the search as best as we can (by transforming the character-driven search by inventory number into a full-text search), and aggressively warn about the likelihood of unexpected results when trying to perform such combined searches. This solution looks unpolished, but it is transparent about the imperfections of the software, and it allows users to find their own solutions to actually perform the combined searches they want. The simplest such solution would be to first search by inventory number, move all the objects into a watch list, and then search by the watch list and combine that search with the full-text search.

This content is licensed under a Creative Commons Attribution 4.0 International license.

A Concordance Checker for Preparing Imports to museum-digital

Joshua Ramon Enslin — Thu, 23 Jan 2025 15:00:40 +0000

When one runs an import to museum-digital – specifically one focused on internal collection management data – there is a chance to encounter errors of unmatched entries. The import tool identified that one tried to import a yet unknown value to what is a controlled field in musdb. Common issues appear especially with actor roles and entry types.

Say, a museum’s previous database used actor roles over an event structure to express who created an object. As such, the museum entered that the object has a linked actor X that is linked to the object as a “main creator” and a linked time Y marked as the “creation time”. During the import, these roles (“main creator” and “creation time”) are then translated to museum-digital’s event types to form an event: The object was created by actor X at the time Y. This works, because the terms “main creator” and “creation time” have been matched to the creation event type.

If a term is not yet matched to a corresponding value of a controlled list in museum-digital, the importer will simply abort the import. On the one hand, this wastes resources on an import that cannot be completed anyway. On the other, it is tedious: one discovers the still unmatched entries only one by one.

A Small New Tool

A small new tool, available at concordance.museum-digital.org, makes the process a bit less tedious. Users can upload all the import data from a given field (e.g. the actor roles) – one per line – and check whether they are already matched using the concordance lists or not.

For entries that are not yet matched, the tool will offer selection boxes to perform the matching using the graphical user interface. Once all entries have been matched, one can then generate the relevant lines of code to enter the missing entries to the concordance list upon the click of a button.
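The core of such a check can be sketched in a few lines. The concordance list below is a made-up stand-in (the real lists live in museum-digital's open source repositories), and the case-insensitive comparison is an assumption of this sketch rather than documented behaviour of the tool.

```python
def check_concordance(lines, concordance):
    """Split input terms (one per line) into already matched and still unmatched ones."""
    matched = {}
    unmatched = []
    for line in lines:
        term = line.strip()
        if not term:
            continue  # skip empty lines
        key = term.casefold()
        if key in concordance:
            matched[term] = concordance[key]
        elif term not in unmatched:
            unmatched.append(term)  # report each unknown term only once
    return matched, unmatched


# Hypothetical concordance list: source term -> target event type.
concordance = {"main creator": "creation", "creation time": "creation"}
matched, unmatched = check_concordance(
    ["Main creator", "creation time", "", "Engraver", "Engraver"], concordance
)
print(matched)
print(unmatched)
```

All unmatched terms are collected in a single pass, which is exactly what spares the user from rerunning the import once per missing entry.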

While simply checking and extending the relevant open source lists should be trivial even to most non-technical users, this way is certainly more convenient. Importantly, it also removes the need to run the import multiple times until one does not encounter errors caused by unmatched entries anymore. And, well, it certainly is also more convenient to match to regular human language values than to the internal IDs of the target values.

The concordance checker’s MIT-licensed code can be found here.

This content is licensed under a Creative Commons Attribution 4.0 International license.

A Calendar is a Commitment

Joshua Ramon Enslin — Tue, 14 Mar 2023 00:39:54 +0000

Last year, we started a monthly user meet-up. As things go, we managed to continue the series at a stable time slot for some months – and then we did not anymore. People’s calendars are of course an issue, but another major one was simply that there were no consistently pre-determined meeting URLs.

Over the weekend, we have added a new feature to the project page, www.museum-digital.org: a calendar for trainings, meetings, and the like. The events for this calendar are pulled from various shared calendars via the respective iCalendar files and compiled into lists of the upcoming events.
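Compiling upcoming events from several iCalendar files could look roughly like the sketch below. It is a deliberately simplified stand-in for a real iCalendar parser: it reads only DTSTART and SUMMARY, and ignores time zones, folded lines, and recurring events. The function name and the sample data are hypothetical.

```python
from datetime import datetime


def upcoming_events(ics_texts, now):
    """Collect (start, summary) pairs of future events from iCalendar texts, sorted by start."""
    events = []
    for text in ics_texts:
        current = None
        for line in text.splitlines():
            if line == "BEGIN:VEVENT":
                current = {}
            elif line == "END:VEVENT" and current is not None:
                if "start" in current and current["start"] >= now:
                    events.append((current["start"], current.get("summary", "")))
                current = None
            elif current is not None and ":" in line:
                name, value = line.split(":", 1)
                name = name.split(";", 1)[0]  # drop parameters such as ;TZID=...
                if name == "DTSTART":
                    # Date-only values have 8 characters, date-times at least 15.
                    fmt = "%Y%m%d" if len(value) == 8 else "%Y%m%dT%H%M%S"
                    current["start"] = datetime.strptime(value[:15], fmt)
                elif name == "SUMMARY":
                    current["summary"] = value
    return sorted(events)


calendar_a = "BEGIN:VCALENDAR\nBEGIN:VEVENT\nDTSTART:20230404T170000\nSUMMARY:User meet-up\nEND:VEVENT\nEND:VCALENDAR"
calendar_b = "BEGIN:VCALENDAR\nBEGIN:VEVENT\nDTSTART:20230320\nSUMMARY:Training\nEND:VEVENT\nEND:VCALENDAR"
for start, summary in upcoming_events([calendar_a, calendar_b], datetime(2023, 3, 14)):
    print(start.isoformat(), summary)
```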

While the primary source for events thus far is the calendar compiled by the German association for museum-digital, the museum-digital e.V., it also includes events in English. We can thus schedule the meetings ahead of time with meeting URLs set ahead of time without having to write a blog post every time. On the other hand, (publicly) scheduled events will be kept for sure.

And thus, publishing the calendar also means that this time around, we will surely do better and continue the series of monthly user meet-ups on every Tuesday of a month, 5 to 7 p.m. consistently for longer than we did last year.

This content is licensed under a Creative Commons Attribution 4.0 International license.

Service announcement: Image server unavailable today

Joshua Ramon Enslin — Wed, 04 Jan 2023 13:42:39 +0000

Our image server received a critical BIOS update and needs to be restarted urgently. The colleagues at digiS will do so starting at 2:30 p.m. The domain asset.museum-digital.org, from which images are commonly served in the frontend of museum-digital, is therefore not available for now.

We were fortunately notified early enough to mitigate the issue in instances of the public “frontend” (e.g. nat.museum-digital.de; bu.hu.museum-digital.org) by serving images from a fallback location.