Between Thursday last week (November 27th) and yesterday (December 6th), museum-digital has seen its most unstable week in about four years. Now that the dust has settled a bit, there’s finally some time to discuss what happened and how we managed to tackle the multiple issues leading to the (very noticeable) instability.
Background
Scrapers
There were (or are) two factors simultaneously pushing our servers to their limits and requiring changes. On the one hand, scraping of museum-digital has gotten even more aggressive. Where we usually had around 10-30 requests per second across all of museum-digital a year ago, we had around 300 two weeks ago. Right now it’s often between 500 and 700. This number excludes any access to static files.
As I’ve written elsewhere, the scrapers are mostly noticeable by coming from IP ranges in Asia or (to a lesser extent) the US. On the other hand, the IPs change constantly and user-agents etc. resemble those of regular users. Likely they simply use an actual Chrome browser for scraping. Which is to say, attempting to block them is futile. Worse yet, attempts to block scrapers would likely also impact some real users.
Fortunately museum-digital is run on dedicated servers paid by time rather than by compute. The onslaught of scrapers thus has no financial impact on us. But the scrapers still use resources, and as they try to scrape as many different pages as possible, it is much harder to optimize for them than it is to optimize for actual human users (see this article on a similar issue at Wikimedia).
Either way, AI scrapers can result in improvements. Viewed positively, they essentially act as a free stress test on a service and enforce efficiency in all aspects. If most pages are optimized for performance already, scrapers will find the unoptimized ones and bring down a service by overusing those. Which is to say, they help to identify yet unoptimized scripts/pages/classes and enforce that necessary changes are made. At museum-digital, there are three main weak spots that are hard to optimize: timelines, image manipulation (including the IIIF API), and PDF generation.
PHP
On November 20th PHP 8.5 was released. Thus far, museum-digital had been running on PHP 8.3 for web hosting and PHP 8.4 on the command line. When we attempted to update to 8.4 last year, the server fell over. This was mainly caused by the IIIF API (and thus, image manipulation via libvips).
Dependencies at museum-digital are (as is pretty much universal with PHP) handled using the package manager composer. When we set up a new instance of museum-digital, composer (managed on version 8.4) required PHP 8.4 or later to run. The new instance was thus unable to run, being stuck on version 8.3 for hosting.
That leaves two options: Either to set up composer using PHP 8.3 again, or to simply update everything to the current version. While PHP 8.3 will be supported until 2027, it is generally advisable to update when possible. So updating it was.
Importantly, PHP at museum-digital is run via PHP-FPM. Before the update, we had one socket running per subdomain. This means that if a PHP process serving the frontend stopped working for any reason, users in musdb were impacted as well.
Upgrading PHP to version 8.5
Once we upgraded the PHP version to 8.5 on Thursday, the same problems we faced with PHP 8.4 appeared again. The server would run rather smoothly for some hours, then more and more PHP processes would die and PHP-FPM would fall over for a given subdomain, and users would get a 504 gateway timeout error. Again, the IIIF API and image manipulation were the main causes of PHP-FPM getting stuck. Of course, the number of AI scrapers continuing to use the site did not help.
PHP-FPM settings
A natural first point to consider was the configuration of PHP-FPM. PHP-FPM knows three basic modes for running an application:
ondemand: You define a maximum number of processes the application may use. When a new request is made, idle processes get used. If there is no idle process, PHP-FPM starts a new one. After a specified number of requests or a given number of seconds, an old process is closed. This is primarily aimed at being able to scale way down: if there are no requests, there will be no processes (which is to say, fewer resources used). On the other hand, starting new processes takes time.
static: You define a number of processes that should always be running for the application. This means that there should always be processes already started and ready for use, but it also means that those processes take up resources even when they are little used. Which is to say, this is useful if one has a high and constant stream of users.
dynamic: You define a maximum number of processes, as well as how many processes should always be running for immediate use, and a (minimum and maximum) number of spare processes to keep running. PHP-FPM then decides whether more processes should be started or whether one of the already running ones should be used. This, in theory, is useful if one wants to reliably and quickly serve users, expects some use all the time, but wants the server to dynamically scale up and down as needed.
With museum-digital spread out over around 80 subdomains, we had thus far used the ondemand mode for most subdomains. Only the largest and most used instances / subdomains of museum-digital were run using dynamic mode. With the update to PHP 8.4 and then 8.5, the behavior of the ondemand mode seems to have changed. If one process dies, the whole subdomain seems to go down with it (I have not found any documentation on this, but it’s evident from the last two weeks).
We hence moved critical subdomains impacted by the errors (which is to say, any “regular” instance of museum-digital) to dynamic mode. As dynamic mode enforces stricter limits on how many processes can be run respective to the available hardware (which is to say, dynamic mode requires a better-written configuration), this also meant that we needed to adjust the specified numbers of processes per subdomain according to their use.
To actually grasp the real use of a subdomain including bots, we turned to the logs we keep for about a week (and then rotate out). In server logs, usually one line corresponds to a single request. With a small script, we loop over all the different subdomains and check how many requests were made. To be really sure that only requests to relevant PHP scripts are counted, we filter the lines by the presence of the substring “php” before counting. The result for today between 1 a.m. and 4 p.m. looks as follows:
| Instance | Total requests | musdb requests | PDF requests |
| ----- | ----- | ----- | ----- |
| agrargeschichte.museum-digital.de | 341508 | 1245 | 719 |
| bawue.museum-digital.de | 454228 | 12559 | 6819 |
| bayern.museum-digital.de | 176291 | 0 | 158 |
| berlin.museum-digital.de | 223280 | 14917 | 6814 |
| brandenburg.museum-digital.de | 63286 | 6927 | 3873 |
| bremen.museum-digital.de | 221208 | 0 | 2026 |
| bund.museum-digital.de | 261 | 167 | 5 |
| collectors.museum-digital.de | 108398 | 449 | 648 |
| hamburg.museum-digital.de | 35489 | 0 | 11 |
| hessen.museum-digital.de | 50932 | 7962 | 2486 |
| meckpomm.museum-digital.de | 94177 | 11 | 139 |
| nds.museum-digital.de | 137703 | 4105 | 4134 |
| owl.museum-digital.de | 427667 | 1258 | 2412 |
| rheinland.museum-digital.de | 64838 | 1753 | 1276 |
| rlp.museum-digital.de | 207944 | 7405 | 7532 |
| sachsen.museum-digital.de | 120931 | 16117 | 6034 |
| saarland.museum-digital.de | 210 | 0 | 1 |
| smb.museum-digital.de | 228542 | 0 | 11517 |
| sh.museum-digital.de | 21098 | 0 | 48 |
| st.museum-digital.de | 317913 | 6243 | 6217 |
| thue.museum-digital.de | 117893 | 0 | 495 |
| westfalen.museum-digital.de | 101584 | 2033 | 3310 |
| br.museum-digital.org | 43413 | 0 | 16 |
| jateng.id.museum-digital.org | 211 | 0 | 0 |
| jatim.id.museum-digital.org | 23410 | 0 | 159 |
| lazio.it.museum-digital.org | 295 | 0 | 0 |
| ma.pl.museum-digital.org | 385 | 0 | 0 |
| noe.at.museum-digital.org | 906386 | 0 | 369 |
| tirol.at.museum-digital.org | 537 | 0 | 7 |
| vbg.at.museum-digital.org | 96 | 0 | 0 |
| wien.at.museum-digital.org | 472305 | 586 | 3243 |
| ulster.ie.museum-digital.org | 28869 | 0 | 2 |
| connacht.ie.museum-digital.org | 392 | 0 | 0 |
| va.srb.museum-digital.org | 5599 | 0 | 22 |
| ko.rou.museum-digital.org | 9036 | 635 | 567 |
| mm.rou.museum-digital.org | 235 | 0 | 0 |
| ca.usa.museum-digital.org | 3946 | 0 | 0 |
| ma.usa.museum-digital.org | 357 | 0 | 0 |
| ny.usa.museum-digital.org | 19576 | 0 | 294 |
| syddanmark.dk.museum-digital.org | 675 | 0 | 9 |
| de.pt.museum-digital.org | 1241 | 0 | 29 |
| zh.ch.museum-digital.org | 233280 | 512 | 650 |
| ba.hu.museum-digital.org | 99927 | 1901 | 72 |
| be.hu.museum-digital.org | 100830 | 244 | 3005 |
| bk.hu.museum-digital.org | 489446 | 55 | 3985 |
| bu.hu.museum-digital.org | 213616 | 6206 | 5753 |
| bz.hu.museum-digital.org | 598550 | 680 | 1788 |
| cs.hu.museum-digital.org | 88585 | 0 | 1054 |
| fe.hu.museum-digital.org | 199812 | 7 | 215 |
| gs.hu.museum-digital.org | 216680 | 4215 | 912 |
| hb.hu.museum-digital.org | 61250 | 0 | 65 |
| he.hu.museum-digital.org | 26312 | 7 | 26 |
| jn.hu.museum-digital.org | 11970 | 0 | 131 |
| ke.hu.museum-digital.org | 370219 | 2959 | 1680 |
| no.hu.museum-digital.org | 119487 | 0 | 1545 |
| pe.hu.museum-digital.org | 603846 | 2957 | 1446 |
| so.hu.museum-digital.org | 308116 | 6151 | 6698 |
| sz.hu.museum-digital.org | 116 | 0 | 0 |
| to.hu.museum-digital.org | 52406 | 0 | 1229 |
| va.hu.museum-digital.org | 184231 | 2839 | 1666 |
| ve.hu.museum-digital.org | 1015509 | 3672 | 296 |
| za.hu.museum-digital.org | 199 | 0 | 6 |
| ce.cz.museum-digital.org | 3 | 0 | 0 |
| ccc.cz.museum-digital.org | 17 | 0 | 0 |
| academia.hu.museum-digital.org | 9158 | 0 | 13 |
| cherkasy.ua.museum-digital.org | 25567 | 0 | 26 |
| chernihiv.ua.museum-digital.org | 3258 | 99 | 156 |
| dnipro.ua.museum-digital.org | 26725 | 0 | 109 |
| donetsk.ua.museum-digital.org | 17 | 0 | 0 |
| ivfr.ua.museum-digital.org | 722 | 0 | 9 |
| kharkiv.ua.museum-digital.org | 12932 | 0 | 39 |
| kyiv.ua.museum-digital.org | 436482 | 5967 | 1351 |
| kyivska.ua.museum-digital.org | 2159 | 0 | 79 |
| lviv.ua.museum-digital.org | 163358 | 188 | 274 |
| poltava.ua.museum-digital.org | 7657 | 284 | 3 |
| odesa.ua.museum-digital.org | 93 | 0 | 1 |
| rivne.ua.museum-digital.org | 59510 | 65 | 156 |
| sumy.ua.museum-digital.org | 35890 | 303 | 3 |
| ternopil.ua.museum-digital.org | 150700 | 37 | 184 |
| zhytomyr.ua.museum-digital.org | 3 | 0 | 0 |
| vinnytsia.ua.museum-digital.org | 14229 | 0 | 0 |
| volyn.ua.museum-digital.org | 16705 | 0 | 485 |
| zakarpattia.ua.museum-digital.org | 2865 | 0 | 30 |
| zaporizhzhia.ua.museum-digital.org | 24348 | 338 | 56 |
| scotland.museum-digital.org | 0 | 0 | 0 |
| md.museum-digital.org | 0 | 0 | 0 |
| demo.museum-digital.org | 12 | 2 | 0 |
| goethehaus.museum-digital.de | 260072 | 0 | 85 |
| lmw.museum-digital.de | 326724 | 0 | 65 |
| gedenkstaetten.museum-digital.de | 3474 | 0 | 0 |
| turcica.museum-digital.de | 75533 | 0 | 1 |
| nat.museum-digital.de | 1238860 | 0 | 4657 |
| at.museum-digital.org | 631578 | 0 | 89 |
| cz.museum-digital.org | 2 | 0 | 0 |
| dk.museum-digital.org | 5415 | 0 | 4 |
| hu.museum-digital.org | 359619 | 0 | 2827 |
| id.museum-digital.org | 8030 | 0 | 0 |
| ie.museum-digital.org | 2073 | 0 | 0 |
| it.museum-digital.org | 78 | 0 | 0 |
| rou.museum-digital.org | 8277 | 0 | 466 |
| pl.museum-digital.org | 142 | 0 | 0 |
| pt.museum-digital.org | 0 | 0 | 0 |
| srb.museum-digital.org | 565 | 0 | 0 |
| ua.museum-digital.org | 232115 | 0 | 805 |
| usa.museum-digital.org | 3752 | 0 | 34 |
| ch.museum-digital.org | 53417 | 0 | 1 |
| global.museum-digital.org | 727690 | 0 | 2199 |
Note that the number of requests is obviously also affected by bots shifting their attention: once a scraper is done with one subdomain, it turns to the next. The elevated number of requests in ve.hu.museum-digital.org is normal, but still starkly exaggerated when compared to other days. The Germany-wide instance is persistently the most frequented one; the global one is usually second at around 80% of requests.
Now equipped with actual numbers, we could scale PHP-FPM to a much more suitable configuration than before (we had thus far never bothered counting actual requests, instead relying on the number of objects).
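The counting step described above can be sketched roughly as follows. This is a minimal illustration, not the script we actually run: the log layout (one request per line, with the subdomain as the first whitespace-separated field), the function name count_php_requests and the sample lines are all assumptions made for the example.

```python
from collections import Counter
from typing import Iterable


def count_php_requests(lines: Iterable[str]) -> Counter:
    """Count log lines per subdomain, keeping only lines that contain "php"."""
    counts: Counter = Counter()
    for line in lines:
        # Skip static files and anything else that does not hit a PHP script.
        if "php" not in line:
            continue
        # Assumption: the subdomain is the first whitespace-separated field.
        host = line.split(maxsplit=1)[0]
        counts[host] += 1
    return counts


# Hypothetical sample lines; real log formats will differ.
sample = [
    'nat.museum-digital.de 203.0.113.7 "GET /index.php?id=1 HTTP/1.1" 200',
    'nat.museum-digital.de 203.0.113.8 "GET /index.php?id=2 HTTP/1.1" 200',
    'nat.museum-digital.de 203.0.113.7 "GET /css/main.css HTTP/1.1" 200',
    'global.museum-digital.org 198.51.100.2 "GET /index.php?id=3 HTTP/1.1" 200',
]

counts = count_php_requests(sample)
print(counts["nat.museum-digital.de"])      # 2
print(counts["global.museum-digital.org"])  # 1
```

Filtering on a plain substring is crude (it would also match a static file that happens to have “php” in its name), but it is cheap and good enough for sizing decisions.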
A second step in the PHP-FPM configuration was to reduce the impact the problems had. Previously there was one shared configuration and socket per subdomain. On the one hand, this meant that stuck processes in the frontend impacted users in musdb (and vice versa). On the other hand, some constraints on resource usage cannot be set on a per-directory level but must be set per PHP-FPM socket / server (see the PHP documentation on user.ini and the list of php.ini directives). As the frontend and musdb have different requirements (frontend: low maximum memory use, short timeouts, no file uploads, generally strict settings; musdb: long timeouts for uploads, generally more lenient), being able to configure them independently of each other is useful in general.
We thus separated the configuration for the frontend, musdb, and PDF generation in the frontend; providing dedicated sockets for each. The frontend has a reduced priority on the system overall, strict constraints on how it may be used, etc. The settings are stricter than they were before. musdb has an elevated priority and more lenient settings (file uploads, longer timeouts), in fact more lenient than before. Finally, PDF generation is a special case as it offers no real benefit over the browser’s print tool (see MDN on print CSS), while being resource-intensive. As such, it has a far reduced priority and very strict settings.
With the separated configuration and sockets, we can now better tailor the configuration to each application’s needs and have the added benefit of problems in one application not impacting the other.
Code
As we had already prepared the codebase for PHP 8.4 awaiting an eventual upgrade, the upgrade to PHP 8.5 only required minimal changes. Aside from the deprecation of the functions finfo_close() and curl_close(), references to which were accordingly removed from the code, the update necessitated no further work.
Scaling in Software
Improving the PHP configuration was not enough to fix the issues, especially with the now increased number of requests from bots. To get some breathing room, we adjusted the most resource-intensive pages.
Frontend
In the frontend these are, again, the IIIF API, PDF generation, and timelines. In addition, we made changes to the pages for failed searches to better handle high-load situations.
Image pages
The IIIF API was used for the main image pages in the frontend. We used (and use) Mirador as a IIIF viewer. Simply opening an image page thus meant three requests to fetch different regions of an image. Zooming into the image triggered further requests to fetch the relevant parts of the image. Cropping the image to the requested region with IIIF happens on the server (which is no problem if there are few users, but is turning into a problem when you have hundreds of requests per second).
We thus changed the default of image pages: The new default image page is the old, non-IIIF one. Features that Mirador comes with, like zooming into images, are popular and useful, and the old image page did not support them, so we worked to improve the page. To do so, we rely on OpenLayers, a library we already use for maps. Besides including maps from tile servers, OpenLayers also supports loading simple image files, which is what we do here. The image is hence loaded once in full size, and zooming etc. happen entirely in the browser.
Taking the opportunity, we improved the page overall. An often noticed problem of image pages thus far was that users who opened image pages coming from external services (think Google Images) had problems identifying that the image was an object image and that there is further object data to be found on object pages. The updated image pages now come with a header stating the name of the image, the name of the object and the name of the institution. Note that many images do not feature a dedicated title. musdb uses the object name as a default image title in that case, which is why the object title will often appear twice in the header. Maybe this can be used as an encouragement for the colleagues working in musdb to more consistently set expressive image titles in the future.
Also new is a mini map at the bottom left, displaying where in the wider context of the image one has currently zoomed in, as well as the ability to link exactly the region one has currently zoomed into. To enable the latter, the URL updates as one zooms or navigates around the image. Somebody else opening the same URL will then open exactly the same image region the linking person was viewing when copying the URL. Finally, we set specific Content Security Policies relevant to the currently opened media. If the displayed media entry is an internally stored image, no external images need to be allowed to load. If the displayed media entry is an audio file stored on archive.org, archive.org needs to be whitelisted as a source for audio files, but only archive.org and no other site. Previously, embedding images from anywhere on the net was allowed, increasing the damage a potential attacker may cause.
Making the use of Mirador a secondary, non-default option significantly reduced the need for server-side image manipulation and the corresponding resource use. The IIIF API remains largely unchanged, but its use must now be requested explicitly.
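To illustrate the idea of a media-specific Content Security Policy, here is a minimal sketch. It is not our actual implementation: the function name media_csp, the DIRECTIVES mapping and the example URLs are hypothetical. Only the header directives themselves (default-src, img-src, media-src) are standard CSP.

```python
from urllib.parse import urlsplit

# Hypothetical mapping from media type to the CSP fetch directive covering it.
DIRECTIVES = {"image": "img-src", "audio": "media-src", "video": "media-src"}


def media_csp(media_type: str, media_url: str, own_host: str) -> str:
    """Build a Content-Security-Policy value that allows exactly one media source."""
    # Relative URLs have no hostname and are treated as internally stored media.
    host = urlsplit(media_url).hostname or own_host
    source = "'self'" if host == own_host else f"https://{host}"
    # Everything not explicitly listed falls back to default-src 'self'.
    return f"default-src 'self'; {DIRECTIVES[media_type]} {source}"


print(media_csp("image", "/data/images/1.jpg", "example.museum-digital.de"))
# default-src 'self'; img-src 'self'
print(media_csp("audio", "https://archive.org/download/x/x.mp3", "example.museum-digital.de"))
# default-src 'self'; media-src https://archive.org
```

A real policy would likely need further directives (scripts, styles, fonts) and possibly additional hosts if the external service delivers files from other subdomains. The sketch only shows how the allowed source can be narrowed to the one the current page actually needs.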
PDF generation
As stated above, PDF generation offers little advantage over the browser’s print functionality in combination with object pages. On the contrary, the PDFs generated using the frontend’s templates feature less information. But they come with the file ending “.pdf” and seem to be extremely popular with bots. On the other hand, PDF generation means, among other things, loading whatever images are to be embedded into the PDF and manipulating them to fit into the PDF. The resulting files are significantly larger than the corresponding HTML files and thus also use more of the available bandwidth.
The change that makes PDF generation depend on resource usage was already introduced in recent months: publicly linked PDFs are now only generated if overall load on the server is low, if a user has set their browser language to any language different from a museum-digital instance’s default language. As most scrapers do not bother to change their browser language (which means they come with either none, English or Chinese), this means they will mostly be unable to trigger the generation of PDFs. They see an error page instead.
Failed Search Pages
If a user tries to execute a search query without any results, they will get suggestions for similar search terms – similar to how Google will ask one searching for “Berrlin”, if they meant “Berlin”. Trying to identify suitable suggestions obviously costs resources and whether the suggestions are actually what a user wanted is by nature hit or miss – it’s suggestions after all. In the case of scrapers, suggesting alternative search queries offers them a never-ending stream of possible search queries to run and keep scraping the subdomain with – to nobody’s benefit (not even the scrapers’, as they likely got the same content with other search queries already).
We thus now use the same function used to identify whether PDFs should be generated for a user to check if search suggestions should be provided. If a user comes with a non-default browser language and resource use is high, no suggestions will be provided.
Timelines
Timeline pages as implemented in museum-digital’s frontend offer another source of endless links and search queries, as they link to further and further specifications of the time searched by. Again, an improvement already introduced months ago was to better parse queries by time: If a user searches for objects that are linked to times “after 1920” and “after 1930”, the latter already includes the former. “After 1920 and after 1930” means exactly the same as “after 1930”. Which is one join instead of two, and thus half the resource usage.
A minor improvement we noticed on the side was the impact of automatic redirects in the timelines. Say, a user searches objects by their link to a given tag and then generates a timeline for said objects. If all objects were created in the 20th century, the timeline will automatically redirect so as to “zoom” into a more appropriate time scale than from the big bang to now. Until last weekend, script execution was not stopped when that redirect happened, which means that all database queries for the time from the big bang to now were still executed even though the user never got to see them. That is now fixed.
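The simplification of redundant time filters can be sketched as follows. This is an illustrative reduction of the idea, not the frontend’s actual query parser: the function name merge_time_constraints and the representation of filters as ("after" / "before", year) pairs are assumptions.

```python
from typing import Iterable, Optional, Tuple


def merge_time_constraints(
    constraints: Iterable[Tuple[str, int]],
) -> Tuple[Optional[int], Optional[int]]:
    """Collapse ("after", year) / ("before", year) filters into a single range."""
    after: Optional[int] = None
    before: Optional[int] = None
    for kind, year in constraints:
        if kind == "after":
            # Of several lower bounds, only the latest one restricts the result.
            after = year if after is None else max(after, year)
        elif kind == "before":
            # Of several upper bounds, only the earliest one restricts the result.
            before = year if before is None else min(before, year)
        else:
            raise ValueError(f"unknown constraint kind: {kind}")
    return after, before


print(merge_time_constraints([("after", 1920), ("after", 1930)]))
# (1930, None)
```

However many “after” and “before” filters a query string contains, the database then only has to evaluate at most one lower and one upper bound.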
The Anticlimactic Solution
All of those changes got the frontend more or less stable. Problems with uploading images remained, however. In the end, the only thing that helped was uninstalling libvips (which we use for image manipulation) and reinstalling it. That seems to have fixed the issues.
Especially as the number of requests from scrapers continues to increase, the strategy outlined above seems to be fruitful. By reducing the use (and sometimes the availability altogether) of especially resource-intensive and, depending on the context, not very useful features, much stability can be gained.
The update seems to finally be largely completed (aside from maybe some further fine-tuning of the PHP-FPM configuration) and museum-digital is stable despite the bot problem, while we haven’t had to take more drastic or costly actions yet – such as blocking or adding additional servers.




