Not Legal Advice on AI Training Data in Japan

Community Article Published May 25, 2024

This article was originally written at, and will be kept up to date on, the Shisa Wiki page on Training Data in Japan.

As the title suggests, this is not legal advice, but it's my current best understanding of the present legal landscape for AI training data in Japan. My background is as a software developer with a long-standing (25y+) interest in IP law as it applies to free/open source software, GNU, Creative Commons, software patents, etc. I've spelunked through the applicable sections of the Japanese Civil and Commercial Codes, but I am not an expert in Japanese law.

Copyright

Currently, Japanese copyright law (PDF) permits all works to be used for the purposes of AI training. This was re-affirmed as current policy in April 2023 by Keiko Nagaoka, the Japanese Minister of Education, Culture, Sports, Science, and Technology.

In March 2024, the Japan Agency for Cultural Affairs (ACA) published its latest draft document on AI and Copyright (see also this summary; METI has its own documents and working group as well). See also the notes of the Japanese AI Strategy Council.

Here's some more analysis and color on this:

Terms of Service and Synthetic Data

On Japanese AI Twitter, I've noticed a lot of confusion and worry about using synthetic data generated by models due to Terms of Service violations (eg, OpenAI's Terms of Service and the like). It's important to understand that a Terms of Service (TOS) is a contract that binds the two agreeing parties (see privity of contract, or the Japanese term 契約上の関係 (Keiyaku-jō no Kankei)), and a third party cannot be bound by (or break) a TOS they haven't agreed to. Note that a Terms of Service, as its name implies, specifically regulates "access and use" of the service, not the generated output itself.

As a matter of course, everyone should respect the TOS that they agreed to with their service provider (or suffer potential liability/consequences). However, any data generated by a third party, whether synthetic or not, simply falls within the same copyright laws/policies in your jurisdiction and does not have any additional licensing or legal terms automatically applied to it.

Notes:

  • There has been a recent trend of using synthetic data generated from completely open models (eg, Mistral or CALM2-7B models). While this allows a developer to train their own models without TOS worries, from a practical standpoint, current open models are much weaker and generate poorer synthetic data, without necessarily providing much other legal benefit.

  • For example, as mentioned, due to the contractual nature of a TOS, the idea of TOS transitivity or any downstream "data contamination" doesn't apply. Even if it did, using open models wouldn't help, as they all contain large amounts of TOS-constrained data (as do OpenAI's models, of course). Note that if an argument is made that it's the responsibility (of either party) to police/control all downstream usage of generated content in perpetuity/ad infinitum, that would fall under Article 133 of the Japanese Civil Code: "第百三十三条 不能の停止条件を付した法律行為は、無効とする。" - "A juristic act that is subject to an impossible suspension condition shall be invalid."