Not Legal Advice on AI Training Data in Japan

Community Article Published May 25, 2024

This article was originally written and will be kept up to date at the Shisa Wiki page on Training Data in Japan

As the title suggests, this is not legal advice, but it's my current best understanding of the present legal landscape for AI training data in Japan. My background is as a software developer with a long-standing (25y+) interest in IP law (free/open source software, GNU, Creative Commons, software patents, etc.). I've spelunked through the applicable sections of the Japanese Civil and Commercial Codes, but I am not an expert in Japanese law.

Copyright

Currently, Japanese copyright law (PDF), re-affirmed as current policy in April 2023 by Keiko Nagaoka, the Japanese Minister of Education, Culture, Sports, Science, and Technology, permits all works to be used for the purposes of AI training.

In March 2024, the Japan Agency for Cultural Affairs (ACA) published its latest draft document on AI and Copyright (see also this summary). METI has its own documents and working group as well. See also the notes of the Japanese AI Strategy Council.

Here's some more analysis and color on this:

2023-07-11 Legal Issues in Generative AI under Japanese Law - three lawyers from the Japanese law firm Nishimura & Asahi give an overview
2024-02-24 The US should look at Japan’s unique approach to generative AI copyright law - a policy editorial that also does a good job of covering the state of AI training in Japan (as an argument for the US to adopt a similar policy)
2024-03-12 Japan’s New Draft Guidelines on AI and Copyright: Is It Really OK to Train AI Using Pirated Materials? - on the latest guidelines published by the ACA. "The committee essentially embraced Article 30-4 allowing the ingestion and analysis of copyrighted materials for AI learning to promote creative innovations in AI. It removes the need of acquiring consent from copyright holders, as long as it would not have a “material impact on the relevant markets” and that the AI usage does not “violate the interests of the copyright holders.”"
2024-05-01 Report on AI and Copyright Issues by Japanese Government - a full English summary of the latest ACA report
UPDATE: 2024-05 General Understanding on AI and Copyright in Japan Overview (PDF) - this is a new EN presentation published by the Legal Subcommittee under the Copyright Subdivision of the Cultural Council of the Agency for Cultural Affairs that summarizes the current thinking. It re-affirms 30-4, but expressly warns about collecting data from piracy distribution sites, and also covers infringement at the usage stage (which understandably is more stringent). It also touches on the copyrightability of AI-generated material, which largely falls within the standard norms (AI-generated works are generally deemed non-creative and, to that extent, are not considered copyrighted works).

Terms of Service and Synthetic Data

On Japanese AI Twitter, I've noticed a lot of confusion and worry about using synthetic data generated by models due to Terms of Service violations (eg, OpenAI's Terms of Service and the like). It's important to understand that a Terms of Service (TOS) is a contract that binds the two agreeing parties (see privity of contract or the Japanese term 契約上の関係 (Keiyaku-jō no Kankei)), and a third party cannot be bound by (or break) a TOS they haven't agreed to. Note that a Terms of Service, as its name implies, specifically regulates "access and use" of the service (not the generated output itself).

While, as a matter of course, everyone should respect the TOS that they agreed to with their service provider (or suffer potential liability/consequences), any data generated by a third party, whether synthetic or not, simply falls under the same copyright laws/policies in your jurisdiction and does not have any additional licensing or legal terms automatically applied to it.

Notes:

There has been a recent trend of using synthetic data generated from completely open models (eg, Mistral or CALM2-7B models). While this allows a developer to train their own models without TOS worries, from a practical standpoint, current open models are much weaker and generate poorer synthetic data, without necessarily providing much other legal benefit.
As mentioned, due to the contractual nature of a TOS, the idea of TOS transitivity or any downstream "data contamination" doesn't apply. If it did, using open models wouldn't help, as they all contain large amounts of TOS-constrained data (as do OpenAI's models, of course). Note, if an argument is made that it's the responsibility (of either party) to police/control all downstream usage of generated content in perpetuity/ad infinitum, that would fall under Article 133 of the Japanese Civil Code: "第百三十三条 不能の停止条件を付した法律行為は、無効とする。" - "A juristic act that is subject to an impossible suspension condition shall be invalid."
