Landmark decision by the Hamburg Regional Court on the copyright admissibility of data scraping for training AI models
In addition to several major cases in the US, the first cases are now pending in Germany, raising interesting copyright issues on both the input and output sides of AI models. The Hamburg Regional Court is now the first German court to deal with the copyright permissibility of the automated collection and use of copyrighted works for the purpose of AI training. In this blog post, we discuss the judgment of 27 September 2024 (Case No. 310 O 227/23), assess its significance, and provide both authors and AI developers with practical tips.
1. Technical background
Artificial Intelligence (AI) has made considerable progress in recent years and is finding its way into more and more areas of everyday and professional life. Large language models (LLMs) such as ChatGPT or image generators such as Midjourney are particularly popular.
The performance of these AI systems depends largely on the quality and quantity of the data on which they are trained. A common method of collecting these large amounts of data is known as “data scraping”. In this automated process, software programs (called “bots” or “crawlers”) systematically crawl the web and extract information from websites, in particular text, images, videos, program code or other digital content. The process works as follows: A crawler visits a website and reads its content. It then follows the links on the website to other pages and repeats the process. In this way, large amounts of data can be automatically collected in a short period of time. The collected data is then stored on the AI provider’s data servers, used to train the AI models, and then deleted.
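The crawl loop described above (read a page, follow its links, repeat) can be illustrated with a minimal Python sketch. It is not the defendant's tooling or any particular provider's crawler: the function name `crawl`, the injected `fetch` callable and the toy `SITE` dictionary are assumptions made purely for illustration, and the example deliberately runs against in-memory pages rather than the live web.

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urldefrag, urljoin


class LinkExtractor(HTMLParser):
    """Collects the href targets of all <a> tags on a page."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def crawl(start_url, fetch, max_pages=100):
    """Breadth-first crawl: read a page, store it, queue its links, repeat.

    `fetch` is a hypothetical callable that maps a URL to its HTML text,
    or returns None if the page is unavailable.
    """
    collected = {}
    queue = deque([start_url])
    seen = {start_url}
    while queue and len(collected) < max_pages:
        url = queue.popleft()
        html = fetch(url)
        if html is None:
            continue
        collected[url] = html  # the "stored on the data servers" step
        parser = LinkExtractor()
        parser.feed(html)
        for href in parser.links:
            # Resolve relative links and drop "#fragment" suffixes.
            target, _fragment = urldefrag(urljoin(url, href))
            if target not in seen:
                seen.add(target)
                queue.append(target)
    return collected


# Toy "website" held in memory (illustrative only).
SITE = {
    "https://example.org/": '<a href="/a">A</a> <a href="/b">B</a>',
    "https://example.org/a": '<a href="/">home</a>',
    "https://example.org/b": "<p>no links</p>",
}
pages = crawl("https://example.org/", SITE.get)
```

Starting from a single URL, the sketch reaches every linked page exactly once. A real crawler would add network access, rate limiting and, importantly for the legal discussion, a check for reservations of use before downloading anything.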
Data scraping is particularly attractive because the free Internet provides an extremely diverse and extensive database that is growing by the second and is constantly updated – and yet is accessible for free. At the same time, the method raises many legal questions: Although much information is freely available on the Internet, extracting and using it to train AI models without consent or even a licence is in obvious conflict with copyright law – because the free accessibility of content does not mean it is not protected by copyright.
2. Decision of the Hamburg Regional Court
In a recent case, the Hamburg Regional Court became the first German court to address the copyright implications of such data collections for AI training. The ruling therefore concerns the input side of AI models, not the output side (such as the question of whether AI-generated content can be protected by copyright).
Facts of the case
The plaintiff is a producer and photographer of stock photos, which he distributes and licenses through various stock photo platforms. The defendant is a non-profit organisation with the self-proclaimed aim of providing open datasets, tools and models to promote research in the field of machine learning. One of these datasets contains approximately 5.8 billion text and image pairs collected by the defendant through automated data scraping from publicly available sources on the internet. The dataset was subsequently made available by the defendant for the training of AI models.
The plaintiff discovered that one of his images had been used in the dataset without his consent. Specifically, the image had originally been uploaded to a stock photo and video platform and was included in the dataset in low resolution and with a watermark. The platform’s terms of use prohibited automatic downloading and use of the content by bots or similar programs.
The plaintiff considered the use to be an infringement of his copyright and demanded that the defendant remove his image from the training set and provide information about the extent of the use of his work.
Judgement of the court
The court dismissed the case. After first, unsurprisingly, finding a reproduction within the meaning of Sec. 16 of the German Copyright Act (“UrhG”), which may only be made with the author’s consent, it turned to the core legal issue of the case: the examination of the “text and data mining” exceptions in Sec. 44b and 60d UrhG (“TDM exceptions”).
The court first comments in detail on Sec. 44b UrhG and states, with reference to the wording of the provision, that automated data scraping is generally to be qualified as “text and data mining” within the meaning of the provision, as the reproduction serves to obtain information about “correlations”. The Court rejected a teleological reduction of the provision, as sometimes suggested in the literature (see below).
The court then commented, in an obiter dictum, on the reverse exception provided for in Sec. 44b(3) UrhG, according to which text and data mining is not permitted if the right holder has declared a reservation of use, which must be in a “machine-readable format” in the case of data accessible on the internet. The plaintiff relied on the prohibition of automated downloading and use of content by bots or similar programs contained in the platform’s terms of use (see above). The court is inclined to accept that the plaintiff was entitled to rely on this third-party reservation and that it was sufficiently clear. Moreover, the court states that this reservation of use likely also meets the requirement of “machine readability”. It would be contradictory to allow AI developers to develop ever more powerful text-understanding AI models via the exception in Sec. 44b UrhG, but not to expect them to use existing AI models to detect reservations of use. The decisive factor is whether a technology was available at the time of the act of reproduction that could have captured the content of the reservation. Ultimately, however, the court leaves open the question of whether the exception applies.
Because of the particular circumstances of the case, the court was able to rely on the more specific TDM exception in Sec. 60d UrhG, which allows reproductions for text and data mining for the purposes of scientific research, provided they are made by non-commercial research organisations. The court held that the defendant was a research organisation within the meaning of the provision. According to Sec. 60d(2)(3) UrhG, a reverse exception applies if a private company has a decisive influence on the research organisation and has preferential access to the results of the scientific research. The plaintiff bore the burden of proof for the application of this reverse exception, which he failed to discharge in the present case.
The ruling is not yet final. Reportedly, both parties intend to pursue the disputed issues through the higher courts. As Sec. 44b UrhG is based on EU law, it is expected that this case (or a similar case) will eventually have to be decided by the European Court of Justice, although it will be some time before that happens. In the meantime, it will be interesting to see how other German and European courts decide such cases.
3. Comment
The decision is a landmark case, not least because of its comprehensive comments on Sec. 44b UrhG, and is important for tech companies and authors alike. Although the decision concerns the use of images, the position is no different for text, program code, video or music: the legal issues are the same. Given that Sec. 44b and 60d UrhG are based on EU law, the decision should also receive attention in other Member States.
The applicability of the TDM exceptions to automated data scraping has also been predominantly affirmed in the (German) legal literature to date, but has recently been denied in a highly regarded and readable study by Dornis/Stober on behalf of the Initiative Urheberrecht. While the wording of Sec. 44b UrhG does indeed speak in favour of applying the TDM exception to AI training, the result raises concerns in terms of legal policy. The provision is based on Art. 4 of Directive (EU) 2019/790: although AI training undoubtedly already took place when the Directive was adopted in 2019, it is unlikely that the legislator intended to regulate its permissibility under copyright law. Furthermore, the exemption from remuneration that was expressly intended when the provision was introduced is ultimately based on the fact that the knowledge gained through text and data mining does not compete with the data or products that are extracted. However, this is not the case with the training of generative AI: as the case in question shows, AI models are trained with (copyrighted) image material in order to ultimately generate images themselves, thus creating competition with the author of the extracted work. In other words, the copyright holder receives no remuneration for the use of their work and, in addition, has to accept that the AI trained on their work competes with them. Against this background, it is doubtful whether it is really desirable that copyrighted works can be used (free of charge) to train a generative AI that can replace these works. It will be interesting to follow further discussions on this issue, as well as legislative activities and the efforts of collecting societies to develop licensing models.
What is surprising, because it contradicts the prevailing view in the (German) literature, is the obiter dictum of the court on the question whether a reservation of use within the meaning of Sec. 44b(3) UrhG had been effectively declared. It is highly questionable whether a reservation of use formulated in natural language is actually “machine-readable” or only “human-readable”. This remains true even if one considers the undoubtedly impressive text recognition capabilities of large language models. If a reservation in natural language were to be considered sufficient, a host of follow-up questions would arise, ranging from the requirements for specific wording and the consequences of unclear wording, to the relevant languages and placement on the website. An instruction is really only “machine-readable” if the crawler can easily identify and clearly understand it as such. For the time being, therefore, the “gold standard” for excluding unwanted text and data mining is likely to be the already widely used Robots Exclusion Standard. Under this standard, a simple text file called robots.txt is placed in the root directory of the website and can prohibit certain bots and crawlers from reading all or part of the website. Of course, its effectiveness still depends on the “co-operation” of the bot.
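To show why a robots.txt file counts as “machine-readable” in the narrow sense, the following minimal sketch uses Python's standard-library `urllib.robotparser` to evaluate such a file before a page is fetched. The crawler name `ExampleAIBot`, the example URLs, the helper `may_fetch` and the file contents are hypothetical and chosen only for illustration. The sketch makes no statement on whether honouring (or ignoring) robots.txt satisfies Sec. 44b(3) UrhG.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt: one named AI crawler is excluded from the whole
# site, while all other bots are only excluded from /private/.
ROBOTS_TXT = """\
User-agent: ExampleAIBot
Disallow: /

User-agent: *
Disallow: /private/
"""


def may_fetch(robots_txt, user_agent, url):
    """Return True if the given robots.txt text allows user_agent to read url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())  # parse from text, no network access
    return parser.can_fetch(user_agent, url)


# A co-operative crawler would run this check before every download.
blocked = may_fetch(ROBOTS_TXT, "ExampleAIBot", "https://example.org/photos/1.jpg")
allowed = may_fetch(ROBOTS_TXT, "OtherBot", "https://example.org/photos/1.jpg")
```

Here `blocked` evaluates to `False` and `allowed` to `True`. The check is cheap and unambiguous for the crawler, but it only has an effect if the crawler's operator chooses to perform it.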
4. Practical advice
From a practical perspective, the ruling casts both light and shade. From the perspective of AI developers, it is pleasing that data scraping is generally covered by the TDM exception. Nevertheless, a considerable degree of legal uncertainty remains, as the decision hardly clarifies the requirements for the reservation of use, which is only contained in Sec. 44b(3) UrhG, and in particular the requirements for “machine readability” – on the contrary. This is all the more problematic if one considers the obligation of AI developers to design their systems in such a way that they reliably recognise and respect machine-readable reservations of use (cf. Art. 53(1)(c) of the AI Act).
Authors, on the other hand, are advised to take proactive measures themselves and to exercise particular caution when selecting the websites on which they publish their works, if they wish to protect their works shared online from data scraping and thus indirectly from being used for AI training. Of course, authors will face practical difficulties at the latest when they license their works to customers who use them on the internet themselves, as licensees will then be required to include an effective, machine-readable reservation of use on the relevant websites. If this is enforceable at all in practice, it is advisable to use robots.txt for the time being due to the uncertainties involved with natural language reservations.