A blog on museum-digital and the broader digitization of museum work.

Since the last post (i.e. the update to PHP 8.5 amid an onslaught of AI scrapers) and the later introduction of much stricter per-IP rate limiting, the stability issues around museum-digital (md) have improved – but they are not yet completely resolved.

We have therefore expanded our efforts to rewrite and restructure key resource-intensive functionalities for increased stability. Unlike before, we have also started to fully remove or disable functionalities that are simply not tenable anymore under the current conditions.

PDF Generation

Until now, two basic types of PDFs were generated (on the server side) in museum-digital’s portals: PDF representations of object pages (“data sheets”) on the one hand, and PDFs bundling all images of an object into one document for easy printing on the other.

The latter was – simply by nature of its envisioned task – extremely resource-intensive. All image files had to be loaded from disk, embedded into the PDF, compressed and served. The option had thus been made available for fewer and fewer objects. Originally offered for any object with more than three images, it was later limited to objects with fewer than 40 images. As a result, its availability became increasingly hard to communicate clearly, while its usefulness declined with the introduction of a new download option for all images of an object. Its inherent resource intensity remained a problem, however, and as scrapers will click any link they can find, this type of PDF generation continued to be used quite regularly (every few seconds before the recent surge in bot activity). As of last week, the functionality has been entirely removed.

The “data sheet” PDF generation has been further limited as well. As stated in the previous blog post, its usefulness has been significantly reduced by the introduction of a print stylesheet (you will get better results by simply pressing CTRL + P on an object page and printing the page to PDF). Nevertheless, it remained rather popular and has not been removed entirely. To reduce its impact on server stability, we have, however, further limited its availability: If the server load is any higher than comfortable, the PDF will not be generated and an error message will appear. If the load is high (from around 70% of the comfortable level upwards) and the user’s browser language is not the default language of the given instance of museum-digital, the same error message will appear.
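The two rules can be sketched as a small decision function. This is only an illustration in Python, not museum-digital's actual (PHP) code: the function and parameter names are made up, the notion of a numeric "comfortable" load is an assumption, and the 0.7 factor merely stands in for the "around 70%" mentioned above.

```python
from enum import Enum


class Decision(Enum):
    GENERATE = "generate"
    REFUSE = "refuse"


# Assumption: "high" load starts at roughly 70% of the comfortable level.
HIGH_LOAD_RATIO = 0.7


def decide_pdf_generation(load: float, comfortable_load: float,
                          browser_lang: str, instance_lang: str) -> Decision:
    """Decide whether a data sheet PDF should be generated right now.

    All names here are hypothetical; only the two rules come from the post.
    """
    # Rule 1: above the comfortable level, nobody gets a PDF.
    if load > comfortable_load:
        return Decision.REFUSE
    # Rule 2: under high load, only users whose browser language matches
    # the instance's default language are still served.
    if load >= HIGH_LOAD_RATIO * comfortable_load and browser_lang != instance_lang:
        return Decision.REFUSE
    return Decision.GENERATE
```

With a comfortable load of 4.0, a current load of 3.0 would refuse the PDF to a user with an English browser on a German instance, but still serve a user with a German browser.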

Failed Search Pages

If a search query for objects fails, users are forwarded to a failed search page, on which suggestions for alternative search queries are made. This is essentially the same as Google automatically suggesting corrections when search terms contain typos. Identifying the alternatives and offering previews for each is not free. And as these are merely suggestions, their benefit and accuracy vary from case to case.

Looking at the logs, we found a large number of queries for non-existent entities – evidently from scrapers trying out different IDs after analyzing the URL scheme. Each of those queries was executed and then forwarded to the failed search page, triggering the loading of suggestions and previews and thus using further resources on the server for little benefit (besides giving the scrapers more links to scrape). We have now introduced logic similar to the limitations on the data sheet PDF generation. Suggestions and previews are only generated when server load is comparatively low, with non-primary-language users being slightly disadvantaged vis-a-vis primary-language users of an instance.

Timelines

Timelines remain popular – and a problem. A very common type of query we saw in our logs combined timelines with searches by start and end date. This was likely due to another possible loop of endless URL generation for scrapers: narrow down a timeline until it forwards to the search page for a given timespan, then open the timeline for that timespan. Exactly that behavior has now been made impossible. If a search by timespan (“start after”, “end before”) has been set, timelines will no longer be offered in the sidebar. Trying to generate them for such a search using URL manipulation or the API will return an error page.
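A guard of this kind could look roughly as follows. This is a hedged Python sketch rather than the actual implementation: the parameter names `start_after` and `end_before` are guesses based on the filter labels above, and the exception and function names are invented for illustration.

```python
# Hypothetical parameter names, derived from the labels "start after" / "end before".
TIMESPAN_PARAMS = frozenset({"start_after", "end_before"})


class TimelineNotAvailable(Exception):
    """Raised when a timeline is requested for a search already limited by timespan."""


def has_timespan_filter(search_params: dict) -> bool:
    """True if any timespan filter is set to a non-empty value."""
    return any(search_params.get(name) for name in TIMESPAN_PARAMS)


def sidebar_offers_timeline(search_params: dict) -> bool:
    """The sidebar link is only shown for searches without a timespan filter."""
    return not has_timespan_filter(search_params)


def build_timeline(search_params: dict) -> dict:
    """Refuse direct requests (URL manipulation, API) for the same case."""
    if has_timespan_filter(search_params):
        raise TimelineNotAvailable("Timelines cannot be combined with timespan searches")
    # Placeholder for the real (expensive) timeline generation.
    return {"type": "timeline", "search": dict(search_params)}
```

Checking in both places matters: hiding the sidebar link alone would not stop a scraper that constructs the URL itself.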

Search: Cleanup, Image Search & Checking Entity Existence Early

A messier set of optimizations touched the core of the object search. Around 2021, we introduced a new search logic. Almost all pages relying on the core search logic – search overview pages, maps for objects, timelines – were adjusted to work with the new logic. The only exception to this was the image search. Still, as the new search logic re-used some of the old search logic’s functions, we kept both as separate classes, which grew over time. Simply loading the new search logic took about one ms (without OPCache enabled, measured through PHPBench). This sounds like little, but it hints at a lack of modularization in the code, and it gains relevance when there are many unpredictable requests and servers are automatically spinning up and down.

And indeed, in writing the new search logic, we did not thoroughly separate HTML generation, query building and database querying. With last week’s updates, there are now separate classes for each of these, and functionalities relevant only to the old search functions have been moved to the class managing the image search logic. This reduces the startup time of the new / main search logic alone by about half (ca. 0.6 ms).

Second, we reduced the available search options for image searches. The remaining search parameters are either those actually relevant to the images or those linked to the controlled vocabularies. As a positive side effect, this also solves some issues in communication: It is hard to make clear to users how searching images by their own license differs from searching them by the license of the (unrelated) metadata of the objects the images are linked to.

Finally, as stated above, the logs revealed a lot of queries for objects linked to, for example, entirely non-existent places or places that are not linked to any object in the given instance of museum-digital at all. When a place or tag is queried, we therefore now check during query building whether any public mention of the entity exists in the current instance of museum-digital. If there is no link at all, it is clear early on that a more detailed (i.e. costly) query combining the search by that entity with other parameters will not return any results.
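The idea of checking early can be illustrated with a small, self-contained Python sketch using an in-memory SQLite database. The schema (`objects`, `object_place`, `is_public`) and all function names are hypothetical and chosen only for the example; they do not reflect museum-digital's real database layout.

```python
import sqlite3


def place_is_linked(db: sqlite3.Connection, place_id: int) -> bool:
    """Cheap check: is the place linked to at least one public object?"""
    row = db.execute(
        "SELECT EXISTS("
        " SELECT 1 FROM object_place op"
        " JOIN objects o ON o.id = op.object_id"
        " WHERE op.place_id = ? AND o.is_public = 1)",
        (place_id,),
    ).fetchone()
    return bool(row[0])


def search_objects(db: sqlite3.Connection, place_id: int, title_part: str) -> list:
    """Run the combined query only if the cheap check succeeds."""
    if not place_is_linked(db, place_id):
        return []  # No public link at all: the combined query cannot match.
    rows = db.execute(
        "SELECT o.id FROM objects o"
        " JOIN object_place op ON op.object_id = o.id"
        " WHERE op.place_id = ? AND o.is_public = 1 AND o.title LIKE ?"
        " ORDER BY o.id",
        (place_id, f"%{title_part}%"),
    ).fetchall()
    return [r[0] for r in rows]


def demo_db() -> sqlite3.Connection:
    """Build a tiny example database (hypothetical schema)."""
    db = sqlite3.connect(":memory:")
    db.executescript("""
        CREATE TABLE objects (id INTEGER PRIMARY KEY, title TEXT, is_public INTEGER);
        CREATE TABLE object_place (object_id INTEGER, place_id INTEGER);
        INSERT INTO objects VALUES (1, 'Vase', 1), (2, 'Hidden vase', 0);
        INSERT INTO object_place VALUES (1, 10), (2, 20);
    """)
    return db
```

In this toy data, place 10 passes the check, while place 20 (linked only to a non-public object) and the non-existent place 999 are rejected before the combined query is ever built.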

The Current Situation

All these improvements help, but a look at the current real-world numbers is warranted. On the one hand, the load on the database server now often drops to half of the expected level or even less. This is a positive sign for system stability outside of peak times.

On the other hand, there are noticeable spikes in the morning (around 10:20 in Germany) and in the afternoon (starting around 5 p.m.). The spike in the morning is likely related to the start of workdays and led to the server falling over multiple times last week. This can likely be fixed only with further tuning of the PHP-FPM settings. The spikes in the afternoon and early evening, on the other hand, remain hard to explain, but are altogether much less critical.

We’re on it.
