I’m looking at people running DeepSeek’s 671B R1 locally from NVMe drives at 2 tokens a second. So why not skip the flash entirely and give me a 1TB drive with an NVMe controller and only DRAM behind it? The ultimate transistor count is lower on the silicon side. It would be slower than typical system memory, but no slower than any NVMe drive with a DRAM cache. The controller logic for safe page writes in flash, plus the internal boost circuitry for pulsing each page, is roughly as complex as DRAM’s memory management, constant refresh, and power-stability requirements. Heck, DDR5 and later already put power regulation on the memory module itself, IIRC. Anyone know why this is not a thing, or where to get one?

  • CitricBase@lemmy.world · 19 days ago

    You are asking for 1TB of RAM. Keying it to M.2 wouldn’t make it any cheaper or better than keying it to regular DDR5. I don’t think that even just a tenth of that would physically fit onto an NVMe drive, even if someone wanted it to.

    Put in that context, do you begin to see now why that isn’t a thing that exists?

    • 𞋴𝛂𝛋𝛆@lemmy.worldOP · 19 days ago

      DRAM is one transistor (plus a capacitor) per bit. A flash cell is a single, larger transistor that can store up to 3 bits, and those cells are deposited in many stacked layers. The actual dies are not that different once the packaging is removed. Most decent-quality NVMe drives already carry a large DRAM cache onboard. It would be trivial to make the entire drive DRAM. This is a product that should exist for the AI-model niche, and it probably does, but I’m unable to find it on the modern dystopian internet.

    • 𞋴𝛂𝛋𝛆@lemmy.worldOP · 19 days ago

      I’m talking about volatile memory rather than persistent storage. I think we’re on different pages, no pun intended (lie).

      • UberKitten@lemmy.blahaj.zone · 19 days ago

        A RAM drive is volatile memory. You get higher performance from DRAM chips attached to the CPU’s memory controller than from DRAM sitting behind the PCIe bus as an NVMe device. For applications that only work with file systems, a RAM drive works around that limitation. So why bother?

        • 𞋴𝛂𝛋𝛆@lemmy.worldOP · 19 days ago

          Enormous model access. It is too slow for real time, but it is possible to load a 671-billion-parameter model in 2-bit quantization with 4 experts at barely over 2 tokens a second. That is the hyperbolic extreme. In practice it means I go from my present ceiling of around 70B and can double it.

          The more interesting aspect is what becomes possible with a larger mixture-of-experts model. The new Llama 4 models are exploring this concept. I already run an 8×7B model most of the time because it is so fast and nearly on par with a 70B. One area that is really opening up right now is an MoE built from something like 3B experts and run at full weights on a 16GB GPU. A lot becomes possible with a model like that, because fine-tuning an individual 3B model is accessible on that same 16GB GPU. You can then start saving your own questions and answers, use them for training in niche subjects, and stitch the results together into your own FrankenMoE. The main training set used to teach a model to think like DeepSeek R1 is only 600 questions long, and if you actually read most training datasets of this kind, half of them are junk. A person who prompts well and saves their own dataset will get much better results.

          A very large secondary store makes it much more reasonable to load and unload 200-400 GB a few dozen times a day. With an extensive agentic toolset, raise that by an order of magnitude. If the toolset is automated with several models being swapped out, raise it by another order of magnitude.
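
          For anyone curious, the load-and-swap workflow looks something like this with llama.cpp (the file names, quant, and layer count are illustrative assumptions, not a recipe):

              # stage the model from the big slow tier onto the fast scratch tier
              cp /mnt/bigstore/deepseek-r1-iq2_xxs.gguf /mnt/scratch/
              # offload only the few layers that fit on a 16GB GPU; the rest
              # of the weights stream from system memory / the scratch tier
              ./llama-cli -m /mnt/scratch/deepseek-r1-iq2_xxs.gguf -ngl 8 -c 4096 -p "test"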

      • e0qdk@reddthat.com · 19 days ago

        What they mean is you can just do something like mount -t tmpfs -o size=1G tmpfs /mnt/ramdisk to get a file system using regular RAM already.
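
        Scaled up for model files, that’s just (assuming the host actually has the RAM to back it; the model file name is hypothetical):

            sudo mkdir -p /mnt/ramdisk
            sudo mount -t tmpfs -o size=512G tmpfs /mnt/ramdisk   # tmpfs allocates pages lazily, on first write
            cp /models/big-moe.gguf /mnt/ramdisk/                 # now served straight from RAM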

        • solrize@lemmy.world · 19 days ago

          There are few motherboards with enough DRAM channels to handle a TB of DRAM. That’s basically why RAM drives existed back in the day, and they are still potentially relevant for the same reason. No, a TB of RAM wouldn’t fit on an M.2 card, but a box of RAM with a PCIe connector is still an interesting idea. Optane also aimed for this, but it never got enough traction to be viable, so it was scrapped.

          • e0qdk@reddthat.com · 19 days ago

            You can get motherboards with enough slots if you’re willing to pay enterprise prices. I have a system with 1TB of RAM at work that I use as a fast data cache. I just mount tmpfs on it, write hundreds of gigs of data into it (overwriting all of it every few hours), and it works great. Cost was somewhere in the $10-15K (US) range a few years ago, IIRC. Steep for an individual, sure, but not crazy for an organization.
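
            If it helps anyone replicate that: making the mount permanent is one fstab line (the mount point and size here are assumptions; the contents are still volatile, only the mount persists):

                echo 'tmpfs /mnt/datacache tmpfs size=900G,mode=0755 0 0' | sudo tee -a /etc/fstab
                sudo mount /mnt/datacache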

            • solrize@lemmy.world · 19 days ago

              Last I saw, you had to use super-premium, ultra-dense memory modules to get 1TB into a motherboard. Maybe that’s less so now, but the hope would be to use commodity RAM, CPUs, etc. $10K for a 1TB system is pretty good; last I looked it was a lot more.

              • e0qdk@reddthat.com · 19 days ago

                Pretty sure you can do much better than that now (plus or minus tariff insanity). A quick check on Amazon, NewEgg, etc. suggests a ballpark of $5K for 1TB of RAM (registered DDR4) plus a probably-compatible motherboard and CPU.

        • 4am@lemm.ee · 19 days ago

          I think OP is looking for fast, temporary storage, like a buffer or a ramdisk.

          But the PCIe bus is slower than the memory bus. Just use actual RAM for this.
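
          Rough numbers for scale (round figures, not benchmarks): PCIe 4.0 moves about 2 GB/s per lane, so an x4 NVMe link tops out near 8 GB/s, while dual-channel DDR5-5600 is roughly 2 × 5600 MT/s × 8 bytes ≈ 90 GB/s. The memory bus wins by an order of magnitude before latency even enters the picture.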

        • Dudewitbow@lemmy.zip · 19 days ago

          OP is basically looking for the hypothetical endgame Intel Optane that’s almost as fast as RAM. He doesn’t seem to know, however, that the Optane project is functionally dead.

    • tal@lemmy.today · 19 days ago

      “because ram drives are easy to accomplish the same thing in software for applications that need it”

      I don’t know if this is what OP is going for, but I’ve wanted to do something similar myself, to exceed the maximum amount of memory a motherboard supports. Basically, I wanted to stick more memory in a system, and I was fine with access to it being slower than the on-motherboard memory, so it could act as a very large read cache.

      A RAM drive will let you use memory your motherboard supports as a drive. But it won’t let you stick even more physical DRAM into a system, above and beyond what the motherboard can handle.
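
      For the read-cache use case within the RAM you do have, Linux already uses spare memory as page cache, and a tool like vmtouch can pin hot files into it explicitly (the paths here are hypothetical):

          vmtouch -t /models/*.gguf     # touch the files into the page cache
          vmtouch -l /models/hot.gguf   # lock one resident in RAM; needs privileges and enough memory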

  • Shadow@lemmy.ca · 19 days ago

    Modern flash is already faster than your PCIe bus, and it’s cheaper than DRAM. Using RAM doesn’t add anything.

    It used to be a thing before modern flash chips; you’d have battery-backed DRAM PCIe cards.

    • 𞋴𝛂𝛋𝛆@lemmy.worldOP · 19 days ago

      It adds an additional memory controller on a different bus, and effectively unlimited read/write cycling for loading much larger AI models.

      • Shadow@lemmy.ca · 19 days ago

        Memory connected to the CPU via the PCIe bus would be too slow for application use like that.

        Apple had to use soldered-in RAM for their unified memory because the trace lengths on the motherboard need to be so tightly controlled. PCIe is way too slow by comparison.

        • 𞋴𝛂𝛋𝛆@lemmy.worldOP · 19 days ago

          Not at all. An NVMe already works, as I stated in the post. Speed is mostly irrelevant with very large models: they are MoEs, so weights get loaded and moved around in large blocks once per inference. The only issue is the wear from cycling an NVMe. The drives would still work; it would just be nice not to worry about their limited write-cycle life. I am setting up agentic toolsets where models will get loaded and offloaded a lot. I already do this regularly with 40-50GB models, and I want to double or quadruple that amount.

        • MHLoppy@fedia.io · 19 days ago

          “Memory connected to the CPU via the PCIe bus would be too slow for application use like that.”

          https://www.intel.com/content/www/us/en/content-details/842211/optimizing-system-memory-bandwidth-with-micron-cxl-memory-expansion-modules-on-intel-xeon-6-processors.html

          “The experimental results presented in this paper demonstrate that Micron’s CZ122 CXL memory modules used in software level ratio based weighted interleave configuration significantly enhance memory bandwidth for HPC and AI workloads when used on systems with Intel’s 6th Generation Xeon processors.”

          Found via Wendell: YouTube
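
          On Linux, a CXL expander typically shows up as a CPU-less, memory-only NUMA node, so you can experiment from userspace with numactl (the node numbers and program name below are assumptions):

              numactl --hardware                     # memory-only node = the CXL expander
              numactl --interleave=0,2 ./infer_app   # spread allocations across local DRAM and CXL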

    • MHLoppy@fedia.io · 19 days ago

      “Using RAM doesn’t add anything.”

      It would improve access latency versus flash, though, even if the difference in raw bandwidth is smaller.

  • whaleross@lemmy.world · 19 days ago

    There are server motherboards that have slots for 1TB of RAM, if that would help? They are not cheap, but maybe you could find one secondhand.

    • Cocodapuf@lemmy.world · 19 days ago

      At this moment you can buy consumer motherboards with 8 RAM slots. Filling them with 64GB DIMMs gets you to half a TB. It’s definitely more affordable than server parts, but it’s still pricey; I’d guess you’re looking at a $1,500 system rather than a $20,000 one.

    • 𞋴𝛂𝛋𝛆@lemmy.worldOP · 19 days ago

      Back nearly 2 years ago, the cheapest option I found for a good GPU was a 2022 laptop with a 3080 Ti. That got me a 16GB GPU, because the mobile version has better specs than the desktop 3080. Unfortunately I max out at 64GB of addressable system memory. I do have dual NVMe drives, though.

      Most people here probably don’t know about DeepSpeed and ZeRO-3. I looked it up to try to share the reference, but the DeepSpeed package has expanded so much in scope that this functionality is obscured. I only know about it as an option in Oobabooga Textgen WebUI/llama.cpp. There, DeepSpeed is how I can offload larger models onto the NVMe when they do not fit in combined GPU and system memory.
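
      For reference, the relevant knob is ZeRO-Infinity’s NVMe offload. A minimal config sketch along these lines (the path is an assumption, and a real config needs batch size and other settings too):

          cat > ds_config.json <<'EOF'
          {
            "zero_optimization": {
              "stage": 3,
              "offload_param": {
                "device": "nvme",
                "nvme_path": "/mnt/nvme0/ds_offload"
              }
            }
          }
          EOF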

      • whaleross@lemmy.world · 19 days ago

        I’m not sure where your mind went while writing this, but my comment was a suggestion for a possible solution to your original question about massive amounts of fast volatile memory for storage. Maybe you need to consider changing platforms and adapting your project to what is accessible, if you want to make this happen, unless you can find that exotic device that may or may not exist in the first place, and afford it. I mean, is running it on your existing laptop really the show-stopper requirement?

        • 𞋴𝛂𝛋𝛆@lemmy.worldOP · 19 days ago

          Why shift from casual, neutral conversation to personal negativity? This isn’t tech support. I know more than 90% of the people here. I am sharing ideas because such information can be hard to find in search results. If casual conversation offends you, or you find it difficult to talk without making things personal, feel free to block me. I’m just a physically disabled guy in involuntary social isolation, where this is my only place for external human contact. I expect everyone to behave like they would in a public commons. I do not appreciate random negativity from strangers in casual conversation.

          • surewhynotlem@lemmy.world · 19 days ago

            You’re coming off like a wacko. I’m being objective, not mean, since I have no stake in this conversation. Just FYI

          • whaleross@lemmy.world · 19 days ago

            What on earth are you on about? What negativity? I’m trying to suggest solutions and point out that maybe the answer lies somewhere other than your initial idea?

            • 𞋴𝛂𝛋𝛆@lemmy.worldOP · 19 days ago

              “I’m not sure where your mind went while writing this”

              This is a negative and personal statement without any basis. It is rude to make any unsolicited personal inference about my thoughts.

              “Maybe you need to consider changing platforms”

              I explained my personal constraints and methodologies, along with the packages that enable the workflow. You then wholly dismissed that, without reason or understanding, to maintain your position as if it were the only path, and repeated your original take with a dismissive attitude.

              You have no idea what I am thinking, and I did not invite your opinion on that. I told you what is possible and what works, along with enough supporting detail for anyone reading this to find the information themselves. You responded with an unsolicited comment and dogma. I find that offensive.

  • Carobu@lemmy.world · 19 days ago

    I THINK you could accomplish something like this by making your swap file absurdly huge (like 1TB) and then pointing your application at /tmp or one of the other directories that live in RAM on Linux; see the sketch below. The only issue is I don’t know if you could tell it to use only the M.2, and it would obviously be somewhat random about where it puts data on the M.2 versus actual RAM. Maybe if you set your swap to a different device and ONLY told it to use that?
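
    Something like this would force swap onto one specific drive (the device paths are assumptions):

        sudo swapoff -a                            # drop existing swap first
        sudo fallocate -l 1T /mnt/nvme1/swapfile   # on the dedicated M.2 drive
        sudo chmod 600 /mnt/nvme1/swapfile
        sudo mkswap /mnt/nvme1/swapfile
        sudo swapon /mnt/nvme1/swapfile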

    I suspect that still probably wouldn’t be exactly what you want. What you actually want is Intel Optane.

  • notabot@lemm.ee · 19 days ago

    It’s not exactly what you’re talking about, but it looks like there is a way to do this over PCIe 5.0: CXL 1.1. An M.2 slot exposes up to 4 PCIe lanes, so it may be possible, with the right combination of a motherboard that supports CXL and a CXL module with an M.2 interface, to do what you want. I haven’t dug much deeper than that, but maybe it’ll be of some help.

  • moody@lemmings.world · 19 days ago

    Not as M.2, to my knowledge, but I know they make PCIe cards with RAM slots on them. LTT had a video on that some years ago; I imagine newer models have probably been made since then.