I’m looking at people running DeepSeek’s 671B R1 locally from NVMe drives at 2 tokens per second. So why not skip the flash entirely and give me a 1TB drive with an NVMe controller and only DRAM behind it? The total transistor count on the silicon side is lower. It would be slower than typical system memory, but the same as any NVMe with a DRAM cache. The controller architecture for safe page writes in flash, plus the internal boost circuitry for pulsing each page, is about as complex as DRAM’s memory management and constant refresh, with its power-stability requirements. Heck, DDR5 and up already puts power regulation on the memory module IIRC. Anyone know why this isn’t a thing, or where to get one?
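For what it’s worth, here’s the back-of-envelope arithmetic on why the interface, not the media, is the bottleneck. All numbers are rough public ballpark figures (PCIe 4.0 x4 NVMe link, dual-channel DDR5-4800), not measurements of any specific product:

```python
# Rough peak-bandwidth comparison; all figures are ballpark, not measured.
GiB = 2**30

pcie4_x4 = 7.9 * GiB    # PCIe 4.0 x4, the typical NVMe link, bytes/s
ddr5_dual = 76.8 * GiB  # dual-channel DDR5-4800 system memory, bytes/s
nvme_flash = 7.0 * GiB  # fast Gen4 flash SSD, sequential reads, bytes/s

# A DRAM-behind-NVMe device is still capped by its PCIe link, so the
# best case barely beats a fast flash SSD and is nowhere near DRAM:
print(f"DRAM-over-NVMe ceiling: {pcie4_x4 / GiB:.1f} GiB/s")
print(f"fast flash SSD:         {nvme_flash / GiB:.1f} GiB/s")
print(f"system DRAM:            {ddr5_dual / GiB:.1f} GiB/s")
```

So the DRAM version would buy endurance and latency, not bandwidth; the link itself is the ceiling.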

  • 𞋴𝛂𝛋𝛆@lemmy.worldOP
    2 months ago

    It would add an additional memory controller on a separate bus, with effectively unlimited read/write cycles, to load much larger AI models.

    • Shadow@lemmy.ca
      2 months ago

      Memory connected to the CPU over the PCIe bus would be too slow for that kind of application use.

      Apple had to use soldered-in RAM for their unified memory because the length of the traces on the motherboard needs to be so tightly controlled. PCIe is way too slow by comparison.

      • 𞋴𝛂𝛋𝛆@lemmy.worldOP
        2 months ago

        Not at all. An NVMe drive already works, as I clearly stated in the post. Speed is largely irrelevant with very large models: they are MoEs, so weights get loaded and moved around in large blocks once per inference. The only issue is write-cycling an NVMe. It would still work; it would just be nice not to worry about the limited cycle life. I am setting up agentic toolsets where models will get loaded and offloaded a lot. I already do this regularly with 40–50GB models and want to double or quadruple that amount.
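        A rough sketch of why the MoE structure makes NVMe viable at all. The 671B total / ~37B active parameter counts are DeepSeek-R1’s published figures; the 4-bit quantization and 7 GiB/s read bandwidth are assumed round numbers:

        ```python
        # Worst-case per-token I/O for a large MoE streamed from NVMe,
        # assuming every active expert is re-read from disk each token.
        GiB = 2**30
        active_params = 37e9     # R1 activates ~37B of 671B params/token
        bytes_per_param = 0.5    # ~4-bit quantization (assumed)
        nvme_bw = 7.0 * GiB      # fast Gen4 NVMe sequential reads (assumed)

        bytes_per_token = active_params * bytes_per_param
        tok_per_s = nvme_bw / bytes_per_token
        print(f"{bytes_per_token / GiB:.1f} GiB read per token, "
              f"~{tok_per_s:.1f} tok/s worst case")
        ```

        That worst case is well under 1 tok/s; the ~2 tok/s people actually report comes from keeping the shared/hot experts cached in RAM, so only a fraction of the active weights hit the drive each token.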

      • MHLoppy@fedia.io
        2 months ago

        Memory connected to the CPU over the PCIe bus would be too slow for that kind of application use.

        https://www.intel.com/content/www/us/en/content-details/842211/optimizing-system-memory-bandwidth-with-micron-cxl-memory-expansion-modules-on-intel-xeon-6-processors.html

        The experimental results presented in this paper demonstrate that Micron’s CZ122 CXL memory modules used in software level ratio based weighted interleave configuration significantly enhance memory bandwidth for HPC and AI workloads when used on systems with Intel’s 6th Generation Xeon processors.

        Found via Wendell: YouTube

        edit: typo