• Neshura@bookwormstory.social
      1 month ago

      Pretty much. AI models (LLMs specifically) are just fancy statistical models, so when they ingest data with no reasoning behind it (think of the many AI hallucinations our brains manage to catch and filter out), it corrupts the training process. The problem is that AI can no longer distinguish AI-generated text from human text, so it just ingests more and more “garbage”, which leads to worse results. There’s a reason why progress in AI models has almost completely stalled compared to when this craze first started: the companies have an increasingly hard time actually improving the models because there is more and more garbage in the training data.

      • oce 🐆@jlai.lu
        1 month ago

        There’s actually a lot of human intervention in the mix: data labelers for the source data, domain experts who correct answers after a first layer of training, and some layers of prompts to improve common answers. Without those domain experts, LLMs would never produce the nice-looking answers we are getting. I think human intervention is going to increase to counter the AI pollution in the data sources, but eventually it may no longer be economically viable.

        This is a nice deep dive into the different steps that go into making today’s LLMs: https://youtube.com/watch?v=7xTGNNLPyMI

        • themachinestops@lemmy.dbzer0.comOP
          1 month ago

          Make an account on Twitter and Reddit, and use ChatGPT to generate content. AI models will scrape that data and use it for training: basically an ouroboros, also known as model collapse.
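
For anyone curious what that feedback loop looks like in the simplest possible setting, here is a toy sketch. It is not how LLM training works: the "model" is just a Gaussian fitted to a handful of samples, and each generation trains only on samples drawn from the previous generation's fit. The function name `simulate_collapse`, the sample size, and the generation count are all made up for illustration.

```python
import random
import statistics


def simulate_collapse(generations=1000, samples_per_generation=10, seed=0):
    """Toy feedback loop: refit a Gaussian to its own samples, over and over.

    Returns the fitted standard deviation for generation 0 (fit to "human"
    data) followed by one value per later generation.
    """
    rng = random.Random(seed)

    # Generation 0: "human" data, assumed here to come from N(0, 1).
    data = [rng.gauss(0.0, 1.0) for _ in range(samples_per_generation)]
    mu, sigma = statistics.fmean(data), statistics.stdev(data)
    stds = [sigma]

    for _ in range(generations):
        # Each new generation sees only output sampled from the previous fit.
        data = [rng.gauss(mu, sigma) for _ in range(samples_per_generation)]
        mu, sigma = statistics.fmean(data), statistics.stdev(data)
        stds.append(sigma)

    return stds


if __name__ == "__main__":
    stds = simulate_collapse()
    for g in (0, 10, 100, 1000):
        print(f"generation {g:4d}: fitted std = {stds[g]:.3e}")
```

With small samples, the estimation error compounds from one generation to the next, and the fitted spread tends to drift toward zero: the tails of the original distribution are forgotten first. Mixing fresh "human" samples back in at each generation is the obvious thing to try if you want to see the drift slow down.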

          • CheeseNoodle@lemmy.world
            29 days ago

            No need to bother: Reddit is already full of entire threads of GPT posts. The megacorps are killing their own product for us.