Those claiming AI training on copyrighted works is “theft” misunderstand key aspects of copyright law and AI technology. Copyright protects specific expressions of ideas, not the ideas themselves. When AI systems ingest copyrighted works, they’re extracting general patterns and concepts - the “Bob Dylan-ness” or “Hemingway-ness” - not copying specific text or images.

This process is akin to how humans learn by reading widely and absorbing styles and techniques, rather than memorizing and reproducing exact passages. The AI discards the original text, keeping only abstract representations in “vector space”. When generating new content, the AI isn’t recreating copyrighted works, but producing new expressions inspired by the concepts it’s learned.
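
As a loose illustration of what those abstract representations are (a toy sketch, not how production models actually work; the hashing scheme here is purely illustrative):

```python
# Toy sketch: reducing a passage to an abstract numeric representation.
# Real models learn dense vector embeddings; this hashed bag-of-words
# stand-in just shows that what is kept is numbers, not text.
from collections import Counter

def toy_embedding(text: str, dims: int = 8) -> list[float]:
    counts = Counter(text.lower().split())
    vec = [0.0] * dims
    for word, n in counts.items():
        vec[hash(word) % dims] += n   # many words collide into each bucket
    total = sum(vec)
    return [v / total for v in vec]   # word order and identity are gone

print(toy_embedding("the quick brown fox jumps over the lazy dog"))
# A handful of floats; the original sentence cannot be read back out of them.
```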

This is fundamentally different from copying a book or song. It’s more like the long-standing artistic tradition of being influenced by others’ work. The law has always recognized that ideas themselves can’t be owned - only particular expressions of them.

Moreover, there’s precedent for this kind of use being considered “transformative” and thus fair use. The Google Books project, which scanned millions of books to create a searchable index, was ruled legal despite protests from authors and publishers. AI training is arguably even more transformative.

While it’s understandable that creators feel uneasy about this new technology, labeling it “theft” is both legally and technically inaccurate. We may need new ways to support and compensate creators in the AI age, but that doesn’t make the current use of copyrighted works for AI training illegal or unethical.

For those interested, this argument is nicely laid out by Damien Riehl in FLOSS Weekly episode 744. https://twit.tv/shows/floss-weekly/episodes/744

  • helenslunch@feddit.nl · 11 days ago

    Those claiming AI training on copyrighted works is “theft” misunderstand key aspects of copyright law and AI technology.

    Or maybe they’re not talking about copyright law. They’re talking about basic concepts. Maybe copyright law needs to be brought into the 21st century?

    • masterspace@lemmy.ca · 12 days ago

      How do you feel about Meta and Microsoft who do the same thing but publish their models open source for anyone to use?

        • masterspace@lemmy.ca · 12 days ago

          I mean, we’re having a discussion about what’s fair; my implied question is whether or not that would be a fair regulation to impose.

      • WalnutLum@lemmy.ml · 11 days ago

        Those aren’t open source, neither by the OSI’s Open Source Definition nor by the OSI’s Open Source AI Definition.

        The important part for the latter being a published listing of all the training data. (Trainers don’t have to provide the data, but they must provide at least a way to recreate the model given the same inputs).

        Data information: Sufficiently detailed information about the data used to train the system, so that a skilled person can recreate a substantially equivalent system using the same or similar data. Data information shall be made available with licenses that comply with the Open Source Definition.

        They are model-available if anything.

        • masterspace@lemmy.ca · 11 days ago

          For the purposes of this conversation, that’s pretty much just a pedantic difference. They are paying to train those models and then providing them to the public to use completely freely in any way they want.

          It would be like developing open source software and then not calling it open source because you didn’t publish the market research that guided your UX decisions.

          • WalnutLum@lemmy.ml · 11 days ago

            You said open source. Open source is a type of licensure.

            The entire point of licensure is legal pedantry.

            And as far as your metaphor is concerned, pre-trained models are closer to pre-compiled binaries, which are expressly not considered Open Source according to the OSD.

          • Arcka@midwest.social · 11 days ago

            Tell me you’ve never compiled software from open source without saying you’ve never compiled software from open source.

            The only differences between open source and freeware are pedantic, right guys?

    • LibertyLizard@slrpnk.net · 12 days ago

      Pirating isn’t stealing but yes the collective works of humanity should belong to humanity, not some slimy cabal of venture capitalists.

      • General_Effort@lemmy.world · 12 days ago

        Yes, that’s exactly the point. It should belong to humanity, which means that anyone can use it to improve themselves. Or to create something nice for themselves or others. That’s exactly what AI companies are doing. And because it is not stealing, it is all still there for anyone else. Unless, of course, the copyrightists get their way.

      • WaxedWookie@lemmy.world · 12 days ago

        Unlike regular piracy, accessing “their” product hosted on their servers using their power and compute is pretty clearly theft. Morally correct theft that I wholeheartedly support, but theft nonetheless.

        • LibertyLizard@slrpnk.net · 12 days ago

          Is that how this technology works? I’m not the most knowledgeable about tech stuff honestly (at least by Lemmy standards).

          • WaxedWookie@lemmy.world · 11 days ago

            There are self-hosted LLMs (e.g. Ollama), but for the purposes of this conversation, yeah - they’re centrally hosted, compute-intensive software services.
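
            For instance, here’s a minimal sketch of the self-hosted route (this assumes a local Ollama server is running on its default port and that a model such as llama3 has already been pulled):

            ```python
            # Minimal sketch: querying a self-hosted model via Ollama's local
            # HTTP API. Assumes `ollama serve` is running and the "llama3"
            # model has been pulled; both are assumptions, not a given.
            import json
            import urllib.request

            payload = json.dumps({
                "model": "llama3",
                "prompt": "Summarize the idea/expression distinction in copyright.",
                "stream": False,  # one JSON reply instead of a token stream
            }).encode()

            req = urllib.request.Request(
                "http://localhost:11434/api/generate",
                data=payload,
                headers={"Content-Type": "application/json"},
            )
            with urllib.request.urlopen(req) as resp:
                print(json.loads(resp.read())["response"])
            ```

            In that setup nothing leaves your machine, which is exactly the distinction being drawn with the centrally hosted services.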

        • ProstheticBrain@sh.itjust.works · edited · 12 days ago

          Ingredients in a recipe may well be subject to copyright, which is why food writers make sure their recipes are “unique” in some small way, enough to avoid accusations of direct plagiarism.

          E: removed unnecessary snark

          • oxomoxo@lemmy.world · 11 days ago

            I think there is some confusion here between copyright and patent, which are similar in concept but legally distinct. A person can copyright the order and selection of words used to express a recipe, but the recipe itself is not copyrightable. It can, however, fall under patent law if proven to be unique enough, which is difficult to prove.

            So you can technically own the patent to a recipe, keeping other companies from selling the product of that recipe; however, anyone who acquires the recipe can make it themselves, so long as they don’t sell the result. And that recipe can be expressed in many different ways, each having its own copyright.

          • General_Effort@lemmy.world · 12 days ago

            In what country is that?

            Under US law, you cannot copyright recipes. You can own a specific text in which you explain the recipe. But anyone can write down the same ingredients and instructions in a different way and own that text.

              • General_Effort@lemmy.world · 10 days ago

                No, you cannot patent an ingredient. What you can do - under Indian law - is get “protection” for a plant variety. In this case, a potato.

                That law is called the Protection of Plant Varieties and Farmers’ Rights Act, 2001. The “farmer” in this case being PepsiCo, which is how it successfully sued those 4 Indian farmers.

                Farmers’ Rights for PepsiCo against farmers. Does that seem odd?

                I’ve never met an intellectual property freak who didn’t lie through his teeth.

  • finley@lemm.ee · edited · 12 days ago

    “but how are we supposed to keep making billions of dollars without unscrupulous intellectual property theft?! line must keep going up!!”

  • gencha@lemm.ee · 11 days ago

    So if I watch all the Star Wars movies, get a crew together to make a couple of identical movies “inspired” by my earlier viewing, and then sell those movies, this is actually completely legal.

    It doesn’t matter if they stole the source material. They are selling a machine that can create copyright infringements at a click of a button, and that’s a problem.

    This is not the same as an artist looking at every single piece of art in the world and being able to replicate it to hang it in the living room. This is an army of artists enslaved by a single company so it can sell any copy of any artwork it wants. That army works as long as you feed it electricity and the free labor of actual artists.

    Theft actually seems like a great word for what these scammers are doing.

    If you run some open source model on your own machine, that’s a different story.

  • mm_maybe@sh.itjust.works · 11 days ago

    The problem with your argument is that it is 100% possible to get ChatGPT to produce verbatim extracts of copyrighted works. This has been suppressed by OpenAI in a rather brute-force kind of way, by prohibiting the prompts that have been found so far to do this (e.g. the infamous “poetry poetry poetry…” ad infinitum hack), but the possibility is still there, no matter how much they try to plaster over it.

    In fact, there are some people, much smarter than me, who see technical similarities between compression technology and the process of training an LLM, calling it a “blurry JPEG of the Internet”. The point being: you wouldn’t allow distribution of a copyrighted book just because you compressed it in a ZIP file first.
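
    To make the ZIP comparison concrete, here’s a trivial sketch (standard-library Python; the repeated sample sentence is just a stand-in for any copyrighted text):

    ```python
    # The ZIP-file point: compressed bytes look nothing like the source text,
    # yet the entire original work is still recoverable from them.
    import zlib

    original = ("It was the best of times, it was the worst of times, "
                "it was the age of wisdom, it was the age of foolishness. "
                ).encode() * 4
    compressed = zlib.compress(original)

    print(len(original), "bytes ->", len(compressed), "bytes")  # much smaller
    assert zlib.decompress(compressed) == original              # still the whole work
    ```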

  • mriormro@lemmy.world · 11 days ago

    You know, those obsessed with pushing AI would do a lot better if they dropped the patronizing tone in every single one of their comments defending it.

    It’s always fun reading “but you just don’t understand”.

  • spacesatan@lazysoci.al · 12 days ago

    Am I the only person who remembers that it was “you wouldn’t steal a car”, or has everyone just decided to pretend it was “you wouldn’t download a car” because that’s easier to dunk on?

  • kibiz0r@midwest.social · 12 days ago

    Not even stealing cheese to run a sandwich shop.

    Stealing cheese to melt it all together and run a cheese shop that undercuts the original cheese shops they stole from.

    • TheKMAP@lemmynsfw.com · 11 days ago

      Whatever happened to copying isn’t stealing?

      I think the crux of the conversation is whether or not the world is better with ChatGPT. I say yes. We can tackle the disinformation in another effort.

      • calcopiritus@lemmy.world · 11 days ago

        Copying for your own consumption is way different from copying to sell the copy at a lower price.

        • TheKMAP@lemmynsfw.com · 11 days ago

          They’re not selling the copy, bruh. They’re selling a technology that very few understand. Smart people pretend they get it, but they don’t. That’s how rare the math is.

          • calcopiritus@lemmy.world · 11 days ago

            So because you don’t understand it, everything it does should be legal?

            It’s not rare maths. There are tens of thousands of AI experts, and most CS graduates (millions of people) have a good understanding of how these models work, just not the specifics of the maths.

            Yeah, they’re not selling a copy, they are just selling a subscription to a copying machine loaded with the information needed to make a copy. Totally different.

            I should start a printer business and ship each printer with a USB stick containing a PNG of a dollar bill. And of course my printers won’t have any government-mandated firmware that disables printing fake money.

            I’m not printing fake money! It’s my clients! Totally legal.

  • LarmyOfLone@lemm.ee · 11 days ago

    The joke is of course that “paying for copyright” is impossible in this case. ONLY the large social media companies that own all the comments and content accumulated by their communities have enough data to train AI models. Or sites like stock photo libraries or DeviantArt, which own the distribution rights for the content. That means all copyright arguments practically argue that AI should be owned by big corporations and be inaccessible to normal people.

    Basically the “means of generation” will be owned by the capitalists, since they are the only ones with the economic power to license these things.

    That is basically the worst case scenario. Not only will the value of work diminish greatly, the advances in productivity will also be only accessible to big capitalists.

    Of course, that is basically inevitable anyway. Why wouldn’t they want this? It’s just sad seeing the stupid morons arguing for this as if they had anything to gain.

    • sunzu2@thebrainbin.org · 11 days ago

      It’s just sad seeing the stupid morons arguing for this as if they had anything to gain.

      The real money shot here… How did we get to a point where people will argue against the common good of working people?

      There is a pattern too… Iraq, Afghanistan, the Israeli genocide, bailouts. Any time there is money to be made for the regime, a solid 30% of the population works hard for the zealots.

      Then, two decades later, when the two wars had failed, you couldn’t find a single guy around who supported either war 🤡

      The same crowd is somehow now shilling that we “shouldn’t invade Ukraine, but Israel needs tools to defend itself”.

    • mm_maybe@sh.itjust.works · 11 days ago

      I’m getting really tired of saying this over and over on the Internet and getting either ignored or pounced on by pompous AI bros and boomers, but this “there isn’t enough free data” claim has never been tested.

      The experiments that have come close (look up the early Phi and Starcoder papers, or the CommonCanvas text-to-image model) suggested that the claim is false, by showing that a) models trained on small, well-curated datasets can match and outperform models trained on lazily curated large web scrapes, and b) models trained solely on permissively licensed data can perform on par with at least the earlier versions of models trained more lazily (e.g. StarCoder 1.5 performing on par with Code-Davinci).

      But yes, a social network or other organization that has access to a bunch of data that they own, or have licensed, could almost certainly fine-tune a base LLM trained solely on permissively licensed data to get a tremendously useful tool that would probably be safer and more helpful than ChatGPT for that organization’s specific business, at vastly lower risk of copyright claims or toxic generated content, for that matter.
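
      As a sketch of what that kind of fine-tune could look like (hypothetical file and model names; this assumes the Hugging Face transformers and datasets libraries, and any permissively trained base model could be substituted):

      ```python
      # Sketch: fine-tuning a permissively trained base model on an
      # organization's own documents. "org_docs.txt" and the model name
      # are illustrative placeholders, not recommendations.
      from datasets import load_dataset
      from transformers import (AutoModelForCausalLM, AutoTokenizer,
                                DataCollatorForLanguageModeling,
                                Trainer, TrainingArguments)

      base = "bigcode/starcoderbase"  # placeholder permissively trained base
      tok = AutoTokenizer.from_pretrained(base)
      tok.pad_token = tok.eos_token
      model = AutoModelForCausalLM.from_pretrained(base)

      # Tokenize the organization's own licensed/owned text corpus.
      data = load_dataset("text", data_files={"train": "org_docs.txt"})["train"]
      data = data.map(lambda ex: tok(ex["text"], truncation=True, max_length=512),
                      batched=True, remove_columns=["text"])

      trainer = Trainer(
          model=model,
          args=TrainingArguments(output_dir="ft-out", num_train_epochs=1,
                                 per_device_train_batch_size=2),
          train_dataset=data,
          data_collator=DataCollatorForLanguageModeling(tok, mlm=False),
      )
      trainer.train()
      ```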

      • LarmyOfLone@lemm.ee · edited · 11 days ago

        Thanks for the info. But let’s say you want to train a (future) AI to spot and tag disinformation and misinformation. You’d need to use and curate actual data from social media sites and articles.

        If copyright is extended to cover learning from and analyzing publicly available data, such an AI will only be possible by licensing that data. Which will be monetized to maximize profit: first some lump sum, then later “per GB”, and then later “per use”.

        I’m sure open source AI will make do, and for many applications there is enough free data, but I can imagine a lot of cases where there won’t be. Anything that requires “commercially successful” media: articles, newspapers, screenplays, movies, books, social media posts and comments, images, photos, video clips…

        We’re basically setting up a world where the intellectual wealth of our civilization is transformed into a commodity and then transferred into the hands of a few rich capitalists.

        And even if there is an acceptable amount of free data, if the principle is that data needs to be specifically licensed before you can learn from it, train on it, and derive AI works from it, that makes free data expensive too. It needs to be specifically vetted, and you are still vulnerable to being sued over mistakes or outrageous copyright claims. As with patents, the uncertainty requires higher capitalization for any startup to defend against lawsuits.

  • Nimo@lemmy.world · 11 days ago

    I hate to say this, but “let the market decide”: if AI is something the consumer wants or needs, they’ll pay for it; otherwise, let it die.

  • A1kmm@lemmy.amxl.com · 12 days ago

    The argument I see most commonly from people on the fediverse (and which I happen to agree with) is really not about what current copyright laws and treaties say or how they should be interpreted, but about how people think things should be (even if that requires changing the laws to make it that way).

    And it fundamentally comes down to economics - the study of how resources should be distributed. Apart from oligarchs and the wannabe oligarchs who serve as useful idiots for the real oligarchs, pretty much everyone wants a relatively fair and equal distribution of wealth amongst the people (differing between left and right in opinion on exactly how equal things should be, but there is still some common ground). Hardly anyone really wants serfdom or similar where all the wealth and power is concentrated in the hands of a few (obviously it’s a spectrum of how concentrated, but very few people want the extreme position to the right).

    Depending on how things go, AI technologies have the power to serve humanity and lift everyone up equally if they are widely distributed, removing barriers and breaking existing ‘moats’ that let a few oligarchs hoard a lot of resources. Or it could go the other way - oligarchs are the only ones that have access to the state of the art model weights, and use this to undercut whatever they want in the economy until they own everything and everyone else rents everything from them on their terms.

    The first scenario is a utopia scenario, and the second is a dystopia, and the way AI is regulated is the fork in the road between the two. So of course people are going to want to cheer for regulation that steers towards the utopia.

    That means things like:

    • Fighting back when the oligarchs try to talk about ‘AI Safety’ meaning that there should be no Open Source models, and that they should tightly control how and for what the models can be used. The biggest AI Safety issue is that we end up in a dystopian AI-fueled serfdom, and FLOSS models and freedom for the common people to use them actually helps to reduce the chances of this outcome.
    • Not allowing ‘AI washing’ where oligarchs can take humanity’s collective work, put it through an algorithm, and produce a competing thing that they control - unless everyone has equal access to it. One policy that would work for this would be that if you create a model based on other people’s work and want to use that model for a commercial purpose, then you must publicly release the model and model weights. That would be a fair trade-off for letting them use that information for training purposes.

    Fundamentally, all of this is just exacerbating cracks in the copyright system as a policy. I personally think that a better system would look like this:

    • Everyone gets a Universal Basic Income paid, and every organisation and individual making profit pays taxes in to fund the UBI (in proportion to their profits).
    • All forms of intellectual property rights (except trademarks) are abolished - copyright, patents, and trade secrets are no longer enforced by the law. The UBI replaces it as compensation to creators.
    • It is illegal to discriminate against someone for publicly disclosing a work they have access to, as long as they didn’t accept valuable consideration to make that disclosure. So for example, if an OpenAI employee publicly released the model weights for one of OpenAI’s models without permission from anyone, it would be illegal for OpenAI to demote / fire / refuse to promote / pay them differently on that basis, and for any other company to factor that into their hiring decision. There would be exceptions for personally identifiable information (e.g. you can’t release the client list or photos of real people without consequences), and disclosure would have to be public (i.e. not just to a competitor, it has to be to everyone) and uncompensated (i.e. you can’t take money from a competitor to release particular information).

    If we had that policy, I’d be okay for AI companies to be slurping up everything and training model weights.

    However, with the current policies, it is pushing us towards the dystopic path where AI companies take what they want and never give anything back.

      • A1kmm@lemmy.amxl.com · 11 days ago

        I agree that this is a major concern, especially if non-renewable energy is used, and until the production process for computer technology and solar panels is much more of a circular economy. More renewable energy and circular economies, and following the sun for AI training and inference (it isn’t going to be low latency anyway, so if you need AI inference in the northern hemisphere night, just do it on the other side of the world) could greatly decrease the impact.

  • Veneroso@lemmy.world · 12 days ago

    We have hundreds of years of out-of-copyright books and newspapers. I look forward to interacting with old-timey AI.

    “Fiddle sticks! These mechanical horses will never catch on! They’re far too loud and barely faster than a man can run!”

    “A Woman’s place is raising children and tending to the house! If they get the vote, what will they demand next!? To earn a Man’s wage!?”

    That last one is still relevant to today’s discourse somehow!?

    • Xatolos@reddthat.com · 11 days ago

      I don’t feel it is. They aren’t saying that their physical requirements should be free (computers, engineers, programmers, electricity, etc.), which is what the analogy’s cheese and other ingredients stand for.

      It would be better to claim: “I run a sandwich shop and couldn’t afford to run it if I had to pay for every recipe, idea, and technique I use in the business.”

      Now, it’s not as simple as this, and I’m not claiming it is. But the original example isn’t anywhere near correct. It’s like the old claim that pirating something is the same as stealing it: using a copy of something doesn’t cause the loss of anything physical.

      It’s one of the reasons why laws about this are difficult. Too strict, and no one could make fan works of anything, among many other issues (“if it uses AI” would rule out many digital tools, etc.); too loose, and you don’t really have laws at all.

    • Womble@lemmy.world · 11 days ago

      Yep, it’s definitely not possible that nice small businesses like Universal and Sony would sue without an actual case in order to try and crush competitors with legal costs.

    • soul@lemmy.world · 11 days ago

      In the same way that a person can learn the material and also use that knowledge to potentially plagiarize it, though. It’s no different in that sense. What is different is the speed of learning and both the speed and capacity of recall. However, it doesn’t change the fundamental truths of OP’s explanation.

      Also, when you’re talking specifically about music, you’re talking about a very limited subset of note combinations that will sound pleasing to human ears. Additionally, even human composers commonly struggle not to accidentally reproduce others’ work, which is partly why the music industry is filled with constant copyright litigation.