LLMs are acing the MCAT, the bar exam, the SAT, etc. like they’re nothing. At this point their performance is superhuman. However, they’ll often trip on super simple common-sense questions and struggle with creative thinking.

Is this literally proof that standardized tests are not a good measure of intelligence?

  • GBU_28@lemm.ee · 8 months ago

    Everyone knew this.

    Obviously 1:1 mentoring, optional cohort/custom grouping, and experiential, self-paced, custom-versioned assignment learning is best, but that’s simply not practical for a massive system.

  • Carrolade@lemmy.world · 8 months ago

    Those tests are not for intelligence. They’re testing whether you’ve done the prerequisite work and acquired the skills necessary to continue advancing towards your desired career.

    Wouldn’t want a lawyer that didn’t know anything about how the law works, after all, maybe they just cheated through their classes or something.

  • dustyData@lemmy.world · 8 months ago

    We already knew that intelligence is a complex and multifaceted property, and that being very intelligent and being very good at taking tests are two distinct (albeit loosely correlated) skills. Test-taking is just too convenient a measurement to give up, despite its many flaws.

    This is only news if you’re an ignorant techbro who doesn’t pay attention to any other field except computer programming.

  • yesman@lemmy.world · 8 months ago

    Intelligence cannot be measured. It’s a reification fallacy. Intelligence is colloquial and subjective.

    If I told you that I had an instrument that could objectively measure beauty, you’d see the problem right away.

    • KevonLooney@lemm.ee · 8 months ago

      But intelligence is the capacity to solve problems. If you can solve problems quickly, you are by definition intelligent.

      the ability to apply knowledge to manipulate one’s environment or to think abstractly as measured by objective criteria (such as tests)

      https://www.merriam-webster.com/dictionary/intelligence

      It can be measured by objective tests. It’s not subjective like beauty or humor.

      The problem with AI doing these tests is that it has seen and memorized all the previous questions and answers. Many of the tests mentioned are not tests of reasoning, but recall: the bar exam, for example.

      If any random person studied every previous question and answer, they would do well too. No one would be amazed that an answer key knew all the answers.
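
      A toy sketch of that point in Python (entirely made-up questions, purely illustrative): a “model” that merely memorizes question/answer pairs aces anything it has seen and fails anything it hasn’t, with no reasoning involved at all.

        # A "model" that is nothing but an answer key.
        train_set = {
            "What is the rule against perpetuities?": "B",
            "Which amendment guarantees due process?": "C",
        }

        def memorizer(question):
            # Pure recall: look the question up; no reasoning happens.
            return train_set.get(question, "no idea")

        print(memorizer("What is the rule against perpetuities?"))  # "B": looks smart
        print(memorizer("The same concept, worded differently?"))   # "no idea"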

      • kromem@lemmy.world · 8 months ago

        This isn’t quite correct. There is the possibility of biasing the results with the training data, but models are performing well at things they haven’t seen before.

        For example, this guy took an IQ test, rewrote the visual questions as natural language questions, and gave the test to various LLMs:

        https://www.maximumtruth.org/p/ais-ranked-by-iq-ai-passes-100-iq

        These are questions with specific wording that the models won’t have been trained on, given he wrote them out fresh. Old models post very poor IQ results, but the SotA model right now scores 100.

        People who engage with the free version of ChatGPT and conclude “LLMs are dumb” are kind of like people who talk to a moron and conclude “humans are dumb.” Yes, the free version of ChatGPT scores around a 60 IQ on that test, but it also doesn’t represent the cream of the crop.

        • KevonLooney@lemm.ee · 8 months ago

          Maybe, but this is giving the AI a lot of help. No one rewrites visual questions for humans who take IQ tests. That spatial reasoning is part of the test.

          In reality, no AI would pass any test because the first part is writing your name on the paper. Just doing that is beyond most AIs because they literally don’t have to deal with the real world. They don’t actually understand anything.

          • kromem@lemmy.world · 8 months ago

            They don’t actually understand anything.

            This isn’t correct and has been shown not to be correct in research over and over and over in the past year.

            The investigation reveals that Othello-GPT encapsulates a linear representation of opposing pieces, a factor that causally steers its decision-making process. This paper further elucidates the interplay between the linear world representation and causal decision-making, and their dependence on layer depth and model complexity.

            https://arxiv.org/abs/2310.07582
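
            For anyone curious how that kind of claim is tested, the core technique is a linear probe: fit a linear classifier from the model’s hidden activations to a world-state label and check whether it decodes better than chance. A minimal sketch with random stand-in data (not the paper’s code; hidden_states and board_feature are invented here):

              import numpy as np
              from sklearn.linear_model import LogisticRegression

              rng = np.random.default_rng(0)
              # Stand-in for a layer's activations over 1000 board positions.
              hidden_states = rng.normal(size=(1000, 64))
              # Pretend label, e.g. "is this square held by the opposing side?",
              # constructed to be linearly encoded for the sake of the demo.
              board_feature = (hidden_states @ rng.normal(size=64) > 0).astype(int)

              probe = LogisticRegression(max_iter=1000).fit(hidden_states, board_feature)
              print(probe.score(hidden_states, board_feature))  # ~1.0: linearly decodable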

            Sizeable differences exist among model capabilities that are not captured by their ranking on popular LLM leaderboards (“cramming for the leaderboard”). Furthermore, simple probability calculations indicate that GPT-4’s reasonable performance on k=5 is suggestive of going beyond “stochastic parrot” behavior (Bender et al., 2021), i.e., it combines skills in ways that it had not seen during training.

            We introduce SELF-DISCOVER, a general framework for LLMs to self-discover the task-intrinsic reasoning structures to tackle complex reasoning problems that are challenging for typical prompting methods. Core to the framework is a self-discovery process where LLMs select multiple atomic reasoning modules such as critical thinking and step-by-step thinking, and compose them into an explicit reasoning structure for LLMs to follow during decoding. SELF-DISCOVER substantially improves GPT-4 and PaLM 2’s performance on challenging reasoning benchmarks such as BigBench-Hard, grounded agent reasoning, and MATH, by as much as 32% compared to Chain of Thought (CoT).

            Just a few of the relevant papers you might want to check out before stating things as facts.
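
            And purely as illustration of what “composing atomic reasoning modules into an explicit structure” might look like, here is a hypothetical prompt-scaffolding sketch (module names and wording invented; not the paper’s actual implementation):

              # Invented "atomic reasoning modules" in the SELF-DISCOVER spirit.
              modules = {
                  "critical_thinking": "Question every assumption in the problem.",
                  "decompose": "Split the task into independent subtasks first.",
                  "step_by_step": "Solve each subtask in numbered steps.",
              }

              def build_reasoning_prompt(task, selected):
                  # Compose the chosen modules into an explicit structure to follow.
                  structure = "\n".join(f"- {modules[m]}" for m in selected)
                  return f"Task: {task}\nFollow this reasoning structure:\n{structure}"

              print(build_reasoning_prompt("Prove the sum of two odd numbers is even.",
                                           ["decompose", "step_by_step"]))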

      • yesman@lemmy.world · 8 months ago

        This is a semantic argument.

        Have you never felt smarter or dumber depending on the situation? If so, did your ability to think abstractly, apply knowledge, or manipulate your environment change? Intelligence is subjective (and colloquial) like beauty and humor.

      • decerian@lemmy.world · 8 months ago

        But intelligence is the capacity to solve problems. If you can solve problems quickly, you are by definition intelligent

        To solve any problems? Because when I run a computer simulation from a random initial state, that’s technically the computer solving a problem it’s never seen before, and it is trillions of times faster than me. Does that mean the computer is trillions of times more intelligent than me?

        the ability to apply knowledge to manipulate one’s environment or to think abstractly as measured by objective criteria (such as tests)

        If we built a true super-genius AI but never let it leave a small container, is it not intelligent because WE never let it manipulate its environment? And regarding the tests in the Merriam Webster definition, I suspect it’s talking about “IQ tests”, which in practice are known to be at least partially not objective. Just as an example, it’s known that you can study for and improve your score on an IQ test. How does studying for a test increase your “ability to apply knowledge”? I can think of some potential pathways, but we’re basically back to it not being clearly defined.

        In essence, what I’m trying to say is that even though we can write down some definition for “intelligence”, it’s still not a concept that even humans have a fantastic understanding of, even for other humans. When we try to think of types of non-human intelligence, our current models for intelligence fall apart even more. Not that I think current LLMs are actually “intelligent” by however you would define the term.

        • Tar_Alcaran@sh.itjust.works · 8 months ago

          Does that mean the computer is trillions of times more intelligent than me?

          And in addition, is an encyclopedia intelligent because it holds many answers?

    • Feathercrown@lemmy.world · 8 months ago

      It shows that it’s well-read but not that it isn’t intelligent. It says relatively little about its intelligence (although the tests do require some).

  • hperrin@lemmy.world · 8 months ago

    Standard tests don’t measure intelligence. They measure things like knowledge and skill. And ChatGPT is very knowledgeable and highly skilled.

    IQ tests have the goal of measuring intelligence.

  • steventrouble@programming.dev · 8 months ago

    To say that ChatGPT is “not intelligent” is to ignore the hard work of all the stupid humans in the world.

    There are a lot of good points brought up in this thread, and I agree ChatGPT acts more intelligent than it really is. But I think we should also call out that many humans are quite stupid, some more so than LLMs. Many humans spread and believe false information more often than ChatGPT. Some humans can’t even string together coherent sentences, and other humans will happily listen to and parrot those humans as though they were speaking divine truths. Many humans can’t do basic math and logic even after 12+ years of being taught it, over and over. Intelligence is a spectrum, and ChatGPT is definitively more intelligent than a non-zero number of humans.

    • Tar_Alcaran@sh.itjust.works · 8 months ago

      LLMs don’t “think” at all. They string together words based on where those words generally appear in context with other words based on input from humans.

      Though I do agree that the output from a moron is often worth less than the output from an LLM
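
      A toy bigram “autocomplete” makes that stringing-together mechanism concrete (real LLMs are vastly more sophisticated, but the output is still next-word selection):

        from collections import Counter, defaultdict

        text = "the cat sat on the mat the cat ate the food".split()
        follows = defaultdict(Counter)
        for a, b in zip(text, text[1:]):
            follows[a][b] += 1  # count which word follows which

        word = "the"
        for _ in range(5):
            word = follows[word].most_common(1)[0][0]  # most likely next word
            print(word, end=" ")  # -> cat sat on the cat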

      • Grimy@lemmy.world · 8 months ago

        This is kind of how humans operate as well, though. We just string words along based on what input is given.

        We speak much too fast to be properly reflecting on it; we just regurgitate whatever comes to mind.

        To be clear, I’m not saying LLMs think, just that the difference between our thinking and their output isn’t the chasm it’s made out to be.

        • cynar@lemmy.world · 8 months ago

          The key difference is that your thinking feeds into your word choice. You also know when to shut up and allow your brain to actually process.

          LLMs are (very crudely) a lobotomised speech center. They can chatter and use words, but there is no support structure behind them. The only “knowledge” they have access to is embedded into their training data. Once that is done, they have no ability to “think” about it further. It’s a practical example of a “Chinese Room” and many of the same philosophical arguments apply.

          I fully agree that this is an important step for a true AI. It’s just a fragment, however. Just like 4 wheels and 2 axles don’t make a car.

        • starman2112@sh.itjust.works · 8 months ago

          Disagree. We’re very good at using words to convey ideas. There’s no reason to believe that we speak much too fast to be properly reflecting on what we say—the speed with which we speak speaks to our proficiency with language, not a lack thereof. Many people do speak without reflecting on what they say, but to reduce all human speech down to that? Downright silly. I frequently spend seconds at a time looking for a word that has the exact meaning that will help to convey the thought that I’m trying to communicate. Yesterday, for example, I spent a whole 15 seconds or so trying to remember the word exacerbate.

          An LLM is extremely good at stringing together stock words and phrases that make it sound like it’s conveying an idea, but it will never stop to think about the definition of a word that best conveys a real idea. This is the third draft of this comment. I’ve yet to see an LLM write, rewrite, then rewrite again its output.

          • agamemnonymous@sh.itjust.works · 8 months ago

            Kinda the same thing though. You spent time finding the right auto-complete in your head. You weighed the words that fit the sentence you’d constructed in order to find the point most frequently encountered in conversations or documents that include specific related words. We’re much more sophisticated at this process, but our whole linguistic paradigm isn’t fundamentally very different from good auto-complete.

          • steventrouble@programming.dev · 8 months ago

            I’ve yet to see an LLM write, rewrite, then rewrite again it’s output.

            It’s because we (ML peeps) literally prevent them from deleting their own output. It’d be like if we stuck you in a room and only let you interact with the outside world using a keyboard that has no backspace.

            Seriously, try it. Try writing your reply without using the delete button, or backspace, or the arrow keys, or the mouse. See how much better you do than an LLM.

            It’s hard! To say that an LLM is not capable of thought just because it makes mistakes sometimes is to ignore the immense difficulty of the problem we’re asking it to solve.
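
            For anyone unfamiliar, this is what standard autoregressive decoding looks like stripped down to a sketch (dummy_model is a stand-in, not a real model): the loop only ever appends the next token, and nothing already emitted can be revised.

              def decode(next_token_fn, prompt, max_tokens=3):
                  tokens = list(prompt)
                  for _ in range(max_tokens):
                      tokens.append(next_token_fn(tokens))  # append-only: no backspace
                  return tokens

              def dummy_model(tokens):
                  return "la"  # stand-in for a real model's next-token choice

              print(decode(dummy_model, ["tra"]))  # ['tra', 'la', 'la', 'la']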

            • starman2112@sh.itjust.works · 8 months ago

              To me it isn’t just the lack of an ability to delete it’s own inputs, I mean outputs, it’s the fact that they work by little more than pattern recognition. Contrast that with humans, who use pattern recognition as well as an understanding of their own ideas to find the words they want to use.

              Man, it is super hard writing without hitting backspace or rewriting anything. Autocorrect helped a ton, but I hate the way this comment looks lmao

              This isn’t to say that I don’t think a neural network can be conscious, or self aware, it’s just that I’m unconvinced that they can right now. That is, that they can be. I’m gonna start hitting backspace again after this paragraph

              • steventrouble@programming.dev · 8 months ago

                That was brilliant, thanks for actually giving it a try :D

                It’s easy for me to get pedantic about minor details, so I should shut up and mention that I see what you mean and agree with the big picture. It’s not there yet and may someday be.

                Thanks again, stranger! You made my day. Keep on being awesome

  • paddirn@lemmy.world · 8 months ago

    We use standardized tests because they’re cheap pieces of paper we can print out by the thousands, hand out to a school full of children, and use to get an approximation of their relative intelligence across a limited range of types of intelligence. If we wanted an actual reliable measure of each kid’s type of intelligence, they’d get one-on-one attention and go through a range of tests, but that would cost too much (in time and money), so we just approximate with the cheap paper thing instead. We could probably develop better tests that accounted for more kinds of intelligence, but I’m guessing those other types aren’t as useful to capitalism, so we ignore them.

  • originalfrozenbanana@lemm.ee · 8 months ago

    Citation needed that LLMs are passing these tests like they’re nothing.

    LLMs don’t have intelligence, they are sentence generators. Sometimes those sentences are correct, sometimes they’re gobbledygook.

    For instance, they fabricate real-looking but nevertheless totally fake citations in research papers https://www.nature.com/articles/s41598-023-41032-5

    To your point, we already know standardized tests are biased and poor tools for measuring intelligence. Partly that’s because they don’t actually measure intelligence; they often measure rote knowledge. We don’t need LLMs to make that determination; we already can.

    • EdibleFriend@lemmy.world · 8 months ago

      Talked about this a few times over the last few weeks but here we go again…

      I am teaching myself to write and had been using ChatGPT for super basic grammar assistance. Seemed like an ideal thing: toss a sentence I was iffy about into it and ask it what it thought. After all, I wasn’t going to be asking it some college-level shit. A few days ago I asked it about something I was unsure about. I honestly can’t remember the details, but it completely ignored the part of the sentence I wasn’t sure about and told me something else was wrong. What it said was wrong was just…not wrong. The ‘correction’ it gave me was some shit a third grader would look at and say ‘uhhhhh…I’m gonna ask someone else now…’

      • Ottomateeverything@lemmy.world · 8 months ago

        That’s because LLMs aren’t intelligent. They’re just parrots that repeat what they’ve heard before. This stuff being sold as an “AI” with any “intelligence” is extremely misleading and causing people to think it’s going to be able to do things it can’t.

        Case in point, you were using it and trusting it until it became very obvious it was wrong. How many people never get to that point? How much has it done wrong before then? Etc.

    • givesomefucks@lemmy.world · 8 months ago

      OP picked standardized tests that only require memorization because they have zero idea what a real IQ test like the WAIS is like.

      Also, that’s not how those IQ tests work. You kind of have to go in “blind” to get an accurate result, and an LLM can’t do anything “blind” because you have to train it.

      A chatbot can’t even take a real IQ test, and if we trained a chatbot to take a real IQ test, it would be a pointless test.

      • JackGreenEarth@lemm.ee · 8 months ago

        Nobody is a blank slate. Everyone has knowledge from their past experience, and instincts from their genetics. AIs are the same. They are trained on various things just as humans have experienced various things, but they can be just as blind as each other on the contents of the test.

        • Ottomateeverything@lemmy.world · 8 months ago

          You’re entirely missing the point.

          The requirement and basis of IQ tests is that they present problems you haven’t seen before. An LLM works by recognizing existing data and returning what came next in the training set.

          LLMs work directly in opposition to how an IQ test works.

          Things like past experience are all the shit IQ tests need to avoid in order to be accurate. And they’re exactly what LLMs work off of.

          By definition, LLMs have no IQ.

        • givesomefucks@lemmy.world · 8 months ago

          No, they wouldn’t.

          Because real IQ tests aren’t just multiple-choice exams.

          You would have to train it to handle the different tasks, and training it at the tasks would make it better at the tasks, raising their scores.

          I don’t know if the issue is that you don’t know how IQ tests work, or that you don’t know what LLMs can do.

          But it’s probably both instead of one or the other.

  • MrJameGumb@lemmy.world · 8 months ago

    There was plenty of proof that standardized testing doesn’t work long before ChatGPT ever existed. Institutions will keep using it, though, because that’s what they’ve always done, and change is hard.

    • underwire212@lemm.ee · 8 months ago

      Not disagreeing with you, but how do you suggest admissions reliably compare applicants with each other? A 3.5 at one school can mean something completely different than a 3.5 at another school.

      Something like the SAT is far from perfect, but it is one number that means the same thing across applicants.
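
      To illustrate with invented numbers: admissions could try to standardize GPAs within each school instead, but that requires knowing every school’s grade distribution, which is exactly the bookkeeping a common test sidesteps.

        def z_score(gpa, school_mean, school_sd):
            # How far above/below the school's own average an applicant sits.
            return (gpa - school_mean) / school_sd

        print(z_score(3.5, school_mean=3.4, school_sd=0.3))  # ~0.33: near average
        print(z_score(3.5, school_mean=2.8, school_sd=0.4))  # 1.75: well above average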

      • yetAnotherUser@feddit.de · 8 months ago

        There shouldn’t even be admission based on what you score in some random test. My (non-US) university accepted everyone who applied, at least for my field of study. Does that mean many people drop out after a semester or two? Absolutely, but there are countless people completing their studies who would have never gotten a chance to do so otherwise. Why shouldn’t they be allowed to prove themselves?

      • ArbiterXero@lemmy.world · 8 months ago

        I think this is the point, because Harvard got rid of the SAT requirement, and then just brought it back.

        It’s a really terrible measure.

        But it is an equal measure, even though what it measures is moderately meaningless.

        I don’t think we have a better answer yet, because everything else lacks any sort of comparable equivalency.

        And I say this as an ADHD sufferer who is at a huge disadvantage on standardised testing

    • Rhaedas@fedia.io · 8 months ago

      Long before. As early as 1930, Carl Brigham, the eugenics-motivated creator of the SAT, recanted the original conclusions that had led to its development, but by then the colleges had totally invested in a quick and easy way to score students, even if it was inaccurate. Change is hard, but I think the bigger influence here was money, since the test hadn’t been around that long at that point.

  • elint@programming.dev · 8 months ago

    No. It may be proof that standardized tests are not useful measures of LLM intelligence, but human brains operate differently from LLMs, so these tests may still be very useful measures of human intelligence.

  • Paragone@lemmy.world · 8 months ago

    Such tests are not standardized tests of intelligence; they are standardized tests of specific competencies.

    Thomas Armstrong’s got a book “7 Kinds of Smart, revised”, on 9 intelligences ( he kept the same title, but added 2 more ).

    Social/relational intelligence was not included in IQ because it is one that girls have, but us guys tend to not have, so the men who devised IQ … just never considered it to have any validity/significance.

    Just as it is much easier to make a ML that can operate a commuter-train fuel-efficiently, than it is to get a human, with general function, to compete at that super-specialized task, each specialized-competency-test is going to become owned by some AI.

    Full-self-driving being the possible exception, simply because there are waaaaay too many variables, & general competence seems to be required for that ( people deliberately driving into AI-managed vehicles, people throwing footballs at AI-managed vehicles, etc, it’s lunacy to think that AI’s going to get that kind of nonsense perfect.

    I’d settle for 25% better-than-us. )

    Just because an AI can do aviation-navigation more-perfectly than I can, doesn’t mean that the test should be taken off potential-pilots, though:

    Full-electrical-system-failures do happen in aviation.

    Carrington-event level of jamming is possible, in-flight.


    • Intelligence is “climbing the ladder efficiently”.

    • Wisdom is knowing when you’re climbing the wrong ladder, & figuring-out how to discover which ladder you’re supposed to be climbing.

    Would you remove competence-at-soccer tests for pro sports-teams?

    “Oh, James Windermere’s an excellent athlete to add to our soccer-club! Look at his triathlon ratings!”…

    … “but he doesn’t even understand soccer??”

    … “he doesn’t need to: we got rid of that requirement, because AI got better than humans, so we don’t need it anymore”.

    idiotic, right?

    It doesn’t matter if an AI is better than a human at a particular competency:

    if a kind-of-work requires that competency, then test the human for it.

  • halva@discuss.tchncs.de · 8 months ago

    LLMs have a good time with standardized tests like the SAT precisely because they’re standardized, i.e. there’s enough information on the internet for them to parrot.

    Try something more complex and free-form, where a human might have to work a little harder to break it down into actual little subtasks with their intelligence and then solve it. In the best-case scenario, LLMs will just say they don’t know how to do it, and in the worst-case scenario they’ll hallucinate some actual bullshit.