Landmark decision by the Hamburg Regional Court on the copyright admissibility of data scraping for training AI models

Artificial intelligence is rightly on everyone's lips as it promises to change our everyday lives forever. However, it also poses major challenges to the law, which has traditionally lagged behind technological developments, particularly in the areas of data protection and copyright.

1. Technical background

Artificial Intelligence (AI) has made considerable progress in recent years and is finding its way into more and more areas of everyday and professional life. Large language models (LLMs) such as ChatGPT or image generators such as Midjourney are particularly popular.

The performance of these AI systems depends largely on the quality and quantity of the data on which they are trained. A common method of collecting these large amounts of data is known as “data scraping”. In this automated process, software programs (called “bots” or “crawlers”) systematically crawl the web and extract information from websites, in particular text, images, videos, program code or other digital content. The process works as follows: A crawler visits a website and reads its content. It then follows the links on the website to other pages and repeats the process. In this way, large amounts of data can be automatically collected in a short period of time. The collected data is then stored on the AI provider’s data servers, used to train the AI models, and then deleted.
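The crawling loop described above (read a page, follow its links, repeat) can be sketched in a few lines of Python. This is only an illustrative sketch, not the method of any particular AI provider: the function `crawl`, the class `LinkExtractor`, and the in-memory `FAKE_WEB` dictionary are assumptions made for this example, and the dictionary stands in for real HTTP requests so that the sketch runs without network access.

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkExtractor(HTMLParser):
    """Collects the href targets of all <a> tags on a page."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def crawl(start_url, fetch, max_pages=100):
    """Breadth-first crawl: read a page, store it, then follow its links.

    `fetch` is a hypothetical callable that returns the HTML of a URL,
    or None if the page cannot be retrieved.
    """
    collected = {}              # url -> raw page content
    queue = deque([start_url])
    seen = {start_url}          # avoids visiting the same page twice
    while queue and len(collected) < max_pages:
        url = queue.popleft()
        html = fetch(url)
        if html is None:
            continue
        collected[url] = html
        parser = LinkExtractor()
        parser.feed(html)
        for href in parser.links:
            target = urljoin(url, href)   # resolve relative links
            if target not in seen:
                seen.add(target)
                queue.append(target)
    return collected


# Toy in-memory "web" standing in for real HTTP requests (assumption).
FAKE_WEB = {
    "https://example.org/": '<a href="/a">A</a> <a href="/b">B</a>',
    "https://example.org/a": '<a href="/b">B</a>',
    "https://example.org/b": '<a href="/">home</a>',
}

pages = crawl("https://example.org/", FAKE_WEB.get)
print(sorted(pages))
```

The point relevant to copyright is the line `collected[url] = html`: every page the crawler visits is copied and stored, which is the act of reproduction the court had to assess.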

Data scraping is particularly attractive because the free Internet provides an extremely diverse and extensive database that is growing by the second and is constantly updated – and yet is accessible for free. At the same time, the method raises many legal questions: Although much information is freely available on the Internet, extracting and using it to train AI models without consent or even a licence is in obvious conflict with copyright law – because the free accessibility of content does not mean it is not protected by copyright.

2. Decision of the Hamburg Regional Court

In a recent case, the Hamburg Regional Court became the first German court to address the copyright implications of such data collections for AI training. The ruling therefore concerns the input side of AI models, not the output side (such as the question of whether AI-generated content can be protected by copyright).

Facts of the case

The plaintiff is a producer and photographer of stock photos, which he distributes and licenses through various stock photo platforms. The defendant is a non-profit organisation with the self-proclaimed aim of providing open datasets, tools and models to promote research in the field of machine learning. One of these datasets contains approximately 5.8 billion text and image pairs collected by the defendant through automated data scraping from publicly available sources on the internet. The dataset was subsequently made available by the defendant for the training of AI models.

The plaintiff discovered that one of his images had been used in the dataset without his consent. Specifically, the image had originally been uploaded to a stock photo and video platform and was included in the dataset in low resolution and with a watermark. The platform’s terms of use prohibited automatic downloading and use of the content by bots or similar programs.

The plaintiff considered the use to be an infringement of his copyright and demanded that the defendant remove his image from the training set and provide information about the extent of the use of his work.

Judgement of the court

The court dismissed the case. After first finding, unsurprisingly, a reproduction within the meaning of Sec. 16 of the German Copyright Act (“UrhG”), which may only be made with the author’s consent, it turned to the core legal issue of the case: the examination of the “text and data mining” exceptions in Sec. 44b and 60d UrhG (“TDM exceptions”).

The court first commented in detail on Sec. 44b UrhG and stated, with reference to the wording of the provision, that automated data scraping is generally to be qualified as “text and data mining” within the meaning of the provision, as the reproduction serves to obtain information about “correlations”. The court rejected a teleological reduction of the provision, as sometimes suggested in the literature (see below).

The court then commented in an obiter dictum on the reverse exception provided for in Sec. 44b(3) UrhG, according to which text and data mining is not permitted if the right holder has declared a reservation of use, which must be in a “machine-readable format” in the case of content accessible on the internet. Here, the plaintiff relied on the prohibition of automated downloading and use of content by bots or similar programs contained in the platform’s terms of use (see above). The court was inclined to accept that the plaintiff was entitled to rely on this third-party reservation and that it was sufficiently clear. Moreover, the court stated that this reservation of use likely also meets the requirement of “machine readability”: it would be contradictory to allow AI developers to build ever more powerful text-understanding AI models under the exception in Sec. 44b UrhG, but not to expect them to use existing AI models to detect reservations of use. The decisive factor, according to the court, is whether technology that could have captured the content of the reservation was available at the time of the act of reproduction. Ultimately, however, the court left open the question of whether the exception applies.

Because of the particular circumstances of the case, the court was able to rely on the more specific TDM exception in Sec. 60d UrhG, which allows reproductions for text and data mining for the purposes of scientific research, provided they are made by non-commercial research organisations. According to Sec. 60d(2)(3) UrhG, a reverse exception applies if a private company has a decisive influence on the research organisation and has preferential access to the results of the scientific research. The court held that the defendant was a research organisation within the meaning of the provision. The plaintiff bore the burden of proof for this reverse exception and failed to discharge it in the present case.

The ruling is not yet final. Reportedly, both parties intend to pursue the disputed issues through the appellate courts. As Sec. 44b UrhG is based on EU law, it is expected that this case (or a similar case) will eventually have to be decided by the European Court of Justice, but it will be some time before that happens. In the meantime, it will be interesting to see how other German and European courts decide such cases.

3. Comment

The decision is a landmark case, not least because of its comprehensive comments on Sec. 44b UrhG, and is important for tech companies and authors alike. Although the decision specifically concerns the use of images, the position is no different for text, program code, video or music: the legal issues are the same. Given that Sec. 44b and 60d UrhG are based on EU law, the decision should also receive attention in other Member States.

The applicability of the TDM exceptions to automated data scraping has so far been predominantly affirmed in the (German) legal literature, but was recently denied in a highly regarded and readable study by Dornis/Stober on behalf of the Initiative Urheberrecht. While the wording of Sec. 44b UrhG does indeed speak in favour of applying the TDM exception to AI training, the result raises concerns in terms of legal policy. The provision is based on Art. 4 of Directive (EU) 2019/790: although AI training undoubtedly already took place when the Directive was adopted in 2019, it is unlikely that the legislator intended to regulate its permissibility under copyright law. Furthermore, the exemption from remuneration that was expressly intended when the provision was introduced is ultimately based on the fact that the knowledge gained through text and data mining does not compete with the data or products that are extracted. However, this is not the case with the training of generative AI: as the case in question shows, AI models are trained with (copyrighted) image material in order to ultimately generate images themselves, thus creating competition with the author of the extracted work. In other words, the copyright holder receives no remuneration for the use of the work and, on top of that, has to accept that the AI trained on it competes with them. Against this background, it is doubtful whether it is really desirable that copyrighted works can be used (free of charge) to train a generative AI that can replace these works. It will be interesting to follow further discussions on this issue, as well as legislative activities and the efforts of collecting societies to develop licensing models.

What is surprising, because it contradicts the prevailing view in the (German) literature, is the court’s obiter dictum on the question of whether a reservation of use within the meaning of Sec. 44b(3) UrhG had been effectively declared. It is highly questionable whether a reservation of use formulated in natural language is actually “machine-readable” or only “human-readable”. This remains true even if one considers the undoubtedly impressive text recognition capabilities of large language models. If a reservation in natural language were to be considered sufficient, a host of follow-up questions would arise, ranging from the requirements for specific wording and the consequences of unclear wording to the relevant languages and the placement on the website. An instruction is really only “machine-readable” if the crawler can easily identify it and clearly understand it as such. For the time being, therefore, the “gold standard” for excluding unwanted text and data mining is likely to be the already widely used Robots Exclusion Standard: a simple text file called robots.txt is placed in the root directory of the website and can prohibit certain bots and crawlers from reading all or part of the website. Of course, this also depends on the “co-operation” of the bot.
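To illustrate what “machine-readable” means in the robots.txt sense, the following sketch shows how a co-operative crawler could evaluate such a file with Python’s standard-library module `urllib.robotparser`. The crawler names `ExampleAIBot` and `SomeSearchBot` and the file contents are hypothetical and chosen for this example only; real crawlers publish their own user-agent tokens.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt: one named bot is excluded from the whole site,
# all other bots are only excluded from /private/.
ROBOTS_TXT = """\
User-agent: ExampleAIBot
Disallow: /

User-agent: *
Disallow: /private/
"""

rp = RobotFileParser()
rp.parse(ROBOTS_TXT.splitlines())   # parse from memory, no network access

photo_url = "https://example.org/photos/1.jpg"
private_url = "https://example.org/private/draft.html"

# A co-operative crawler asks before every download.
print(rp.can_fetch("ExampleAIBot", photo_url))
print(rp.can_fetch("SomeSearchBot", photo_url))
print(rp.can_fetch("SomeSearchBot", private_url))
```

Under these rules the hypothetical `ExampleAIBot` is refused everywhere, while other agents are only kept out of `/private/`. The check is a fixed, unambiguous lookup, in contrast to a clause in terms of use that first has to be found and interpreted. It is also purely voluntary: nothing in the file technically prevents a crawler that skips the `can_fetch` call from downloading the content.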

4. Practical advice

From a practical perspective, the ruling casts both light and shade. From the perspective of AI developers, it is pleasing that data scraping is generally covered by the TDM exception. Nevertheless, a considerable degree of legal uncertainty remains, as the decision hardly clarifies the requirements for the reservation of use, which is only contained in Sec. 44b(3) UrhG, and in particular the requirements for “machine readability” – on the contrary. This is all the more problematic if one considers the obligation of AI developers to design their systems in such a way that they reliably recognise and respect machine-readable reservations of use (cf. Art. 53(1)(c) of the AI Act).

Authors, on the other hand, are advised to take proactive measures themselves and to exercise particular caution when selecting the websites on which they publish their works, if they wish to protect their works shared online from data scraping and thus indirectly from being used for AI training. Of course, authors will face practical difficulties at the latest when they license their works to customers who use them on the internet themselves, as licensees will then be required to include an effective, machine-readable reservation of use on the relevant websites. If this is enforceable at all in practice, it is advisable to use robots.txt for the time being due to the uncertainties involved with natural language reservations.
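For authors or licensees who want to follow the robots.txt route, the file itself is easy to produce. The sketch below is one possible way to generate it and to check the result by parsing it back; the helper `build_robots_txt` and the crawler tokens `ExampleAIBot` and `ExampleDatasetBot` are hypothetical, and the actual tokens would have to be taken from the documentation of the crawlers to be excluded.

```python
from urllib.robotparser import RobotFileParser


def build_robots_txt(blocked_agents, blocked_path="/"):
    """Render a robots.txt that bars the named crawlers from `blocked_path`
    and leaves the site open to all other user agents."""
    blocks = [
        f"User-agent: {agent}\nDisallow: {blocked_path}"
        for agent in blocked_agents
    ]
    # An empty Disallow line means "nothing is disallowed" for everyone else.
    blocks.append("User-agent: *\nDisallow:")
    return "\n\n".join(blocks) + "\n"


# Hypothetical crawler tokens (assumption for this example).
robots_txt = build_robots_txt(["ExampleAIBot", "ExampleDatasetBot"])
print(robots_txt)

# Round-trip check: parse the generated file the way a crawler would.
checker = RobotFileParser()
checker.parse(robots_txt.splitlines())
```

The generated text would be saved as `robots.txt` in the root directory of each website on which the works are published. Whether such a file satisfies Sec. 44b(3) UrhG in a given case remains a legal question that the sketch does not answer.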
