Hello,

Today my washing machine completely broke down. My parents desperately tried to get it working, but it resulted in the circuit breakers tripping and my server (an old Dell Wyse thin client) experiencing a hard power off.

When I tried to turn it back on, I received these errors on the screen.

I ran a memtest, and it completed without any issues. I also created a disk image backup just in case.

Is there any chance of getting this machine running again, or is it only fit for disposal?

  • catloaf@lemm.ee
    6 months ago

    That’s weird. It’s getting as far as Linux, so hopefully you have a backup you can restore and everything will be fine. If not, you can probably still pull your data off and reinstall.

    Also, usually thin clients have eMMC chips instead of SSDs. Those are designed for low write lifetimes. I would be very cautious about trusting any important data to them, especially if you’re not monitoring their health.

    • My Password Is 1234@lemmy.worldOP
      6 months ago

      Unfortunately, I didn’t have a backup, but I still managed to recover ALL the important files from this server, even with half of the file system sectors damaged. God, thank you. This is another lesson for the future to regularly make backups!

      • catloaf@lemm.ee
        6 months ago

        Really, so there was filesystem corruption? I’d definitely check the health of that eMMC chip if you can.

        • My Password Is 1234@lemmy.worldOP
          6 months ago

          So, the flash memory wasn’t built into the terminal, it was a 2.5-inch SSD drive that I yanked out of its plastic case to fit into the terminal’s SATA slot.

          Once I unplugged it, I dumped the disk image onto my computer using the dd command, and then I worked on that image to recover the data.

    • 8Bitz0@discuss.tchncs.de
      6 months ago

      First, you gotta check for ultraviolet, ghost writing, and freezing temps.

      (I really hope somebody gets that reference)

  • RestrictedAccount@lemmy.world
    6 months ago

    Not an expert, but I don’t think replacing the TPM chip is an option.

    How did you run a memory test? Do you get a command line?

      • Skull giver@popplesburger.hilciferous.nl
        6 months ago

        Most TPMs I’ve seen are part of the CPU. Dedicated TPMs were a thing with old hardware, but fTPMs have been built into the CPU for ages. Some laptops have dedicated TPMs soldered onto the motherboard, but I don’t know anyone who actually bought a physical TPM for a socketed board.

        Socketed TPMs generally don’t come with a motherboard or computer, they’re usually something you’d buy as an extension to the motherboard. If OP had one, I’m pretty sure they’d know.

  • Shadow@lemmy.ca
    6 months ago

    With the hw MCE errors, it’s probably toast.

    You could try reseating or swapping the RAM around, if it’s socketed.

    • remotelove@lemmy.ca
      6 months ago

      Yeah, I would think memory as well due to the screen artifacts in that low res mode. (That depends on how x86 memory is mapped these days, I suppose.)

  • Quetzalcutlass@lemmy.world
    6 months ago

    You’re able to run MemTest? That’d suggest it’s not actually fried if it can still run things.

    Check your BIOS/UEFI to see if Secure Boot was re-enabled. If your CMOS battery died and you didn’t notice, your machine config could have reset to its default values during the power loss.

  • mvirts@lemmy.world
    6 months ago

    I honestly thought your washing machine was throwing the MCE when I opened the post 😹

  • Kualk@lemm.ee
    6 months ago

    There’s so much incompetent advice here.

    CPU is fine.

    Linux is booting and tries to connect to TPM (trusted platform module).

    It has nothing to do with the graphics card. The fact that it is booting means the CPU is most likely unaffected.

    TPM is most likely fried.

    Linux can run without TPM. Plenty of old boards were shipped with TPM socket, but without TPM itself.

    Best option is get manual for your motherboard and pull out that TPM.

    Any passwords stored there are lost, if you used it.

    If TPM is fine, then board pathway to it may be damaged. If that’s the case and you really need it, then board replacement is your option. But that’s only after good TPM was tried.

    • phx@lemmy.ca
      6 months ago

      In some cases a wipe/reset of the TPM from the BIOS might do it as well, if it’s still functional but scrambled.

  • Djtecha@lemm.ee
    6 months ago

    My guess is this reset your BIOS and flipped a TPM setting on. Maybe see if you can disable all TPM/Secure Boot settings and see if it carries on.

  • Skull giver@popplesburger.hilciferous.nl
    6 months ago

    Most likely there was either a hard voltage surge or a hard voltage drop. This could damage the PSU irreparably, or any parts connected to it and the computer. Make sure to use this thing with a grounded electrical socket.

    Looks like two RAM addresses failed when the machine crashed, and so did TPM communication. If your machine has a dedicated TPM, you can try removing it and see if that resolves the issue.

    The red dots everywhere are not a good sign. They could be signs of an iGPU failure, or yet more damaged RAM.

    You’ve said your RAM passed a memtest, so I’d start thinking the problem is conditional. Perhaps the CPU cache is broken (resulting in bad RAM) or the motherboard itself only makes good contact once it heats up and expands from running for a while.

    It’s also possible that one of the capacitors blew. Check the motherboard for any leaking/burned/puffed up capacitors. If you see damaged capacitors on the motherboard, someone with soldering skills may be able to replace them to get the machine back up and running. Same with resistors and other physical components. If you see damage in surface mount components (the very small black squares on the motherboard, RAM, or CPU) you likely won’t be able to fix the issue. If you want to try anyway, see if you can find and order the exact spec component online (you may need an electrical diagram of the motherboard for this) and look up how to reflow solder with an oven.

    You could try another power supply as well, bad power supplies can cause all kinds of issues like these.

    There are also capacitors in the power supply. Do not touch those, even after powering down the computer and removing the plug. Just the slightest bad touch and you will get a severe electric shock risking instant death on the spot, unless you’ve been educated in dealing with high voltage soldering and have the appropriate equipment. Power supplies are designed to be completely safe as long as they’re grounded and you don’t go poking inside them. Leave that thing closed up and connected to ground, and you probably have no need to worry about that stuff while working on the computer.

  • I_Miss_Daniel@lemmy.world
    6 months ago

    All the red dots look like some kind of GPU failure. I think the TPM error is a symptom of a bigger hardware issue that is insurmountable.

    A live cd or usb might help as others have stated.

  • A Basil Plant@lemmy.world
    6 months ago

    You haven’t given us much information about the CPU. That is very important when dealing with Machine Check Errors (MCEs).

    I’ve done a bit of work with MCEs and AMD CPUs, so I’ll help with understanding what may be going wrong and what you probably can do.

    I’ve done a bit of searching based on the microcode and the Dell Wyse thin client that you mentioned. From what I can gather, you are using a Dell Wyse 5060 Thin Client with an AMD Steppe Eagle GX-424 [1], is that right? This is my assumption for the rest of this comment.

    Machine Check Errors (MCEs) are hard to decipher without the right documentation. As far as I can tell from AMD’s Data Sheet for the G-Series [2], this CPU belongs to family 16H.

    You have two MCEs in your image:

    • CPU Core 0, Bank 4: f600000000070f0f
    • CPU Core 1, Bank 1: b400000001020103

    Now, you can attempt to decipher these with a tool I used some time ago, MCE-Ryzen-Decoder [4]. As the name says, this tool only decodes MCEs of Ryzen architectures. MCE designs may not change much between families, but I wouldn’t bank (pun not intended) on it, because the G-Series is an embedded SoC while the Ryzen CPUs are not. Still, I gave it a shot and the tool spat out the following:

    $ python3 run.py 04 f600000000070f0f
    Bank: Read-As-Zero (RAZ)
    Error:  ( 0x7)
    
    $ python3 run.py 01 b400000001020103
    Bank: Instruction Fetch Unit (IF)
    Error: IC Full Tag Parity Error (TagParity 0x2)
    

    Wouldn’t bank (pun intended this time) on it though.

    What you can do is to go through the AMD Family 16H’s BIOS and Kernel Developer Guide [3] (Section 2.16.1.5 Error Code). From Section 2.16.1.1 Machine Check Registers, it looks like Bank 01 corresponds to the IC (Instruction Cache) and Bank 04 corresponds to the NB (Northbridge). This means that the CPU found issues in the NB in core 0 and the IC in core 1. You can go even further and check what those exact codes decipher to, but I wouldn’t put in that much effort - there’s not much you can do with that info (maybe the NB, but… too much effort). There are some MSRs that you can read out that correspond to errors of these banks (from Table 86: Registers Commonly Used for Diagnosis), but like I said, there’s not much you can do with this info anyway.
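To make the status values above a little less opaque, here is a minimal Python sketch that splits a 64-bit MCA status value into its generic flag bits and error-code fields. The flag positions (bits 63 to 57) are the generic x86 machine-check layout. Treating bits 20:16 as the extended error code is an assumption about family 16h that should be checked against the BKDG [3], although it does agree with the 0x7 and 0x2 values the decoder tool printed. The function name `decode_status` is made up for this example.

```python
# Generic x86 MCi_STATUS flag bits, highest bit first.
# Assumption: these positions apply unchanged to this CPU family.
FLAGS = {
    63: "VAL",    # register contains a valid error
    62: "OVER",   # an earlier error was overwritten (overflow)
    61: "UC",     # error was not corrected by hardware
    60: "EN",     # reporting was enabled for this error
    59: "MISCV",  # MISC register holds extra information
    58: "ADDRV",  # ADDR register holds a valid address
    57: "PCC",    # processor context may be corrupt
}


def decode_status(status_hex):
    """Split a hex MCi_STATUS string into flags and error-code fields."""
    value = int(status_hex, 16)
    return {
        "flags": [name for bit, name in FLAGS.items() if (value >> bit) & 1],
        # Bits 15:0: MCA error code.
        "error_code": value & 0xFFFF,
        # Bits 20:16: extended error code (assumed layout, see lead-in).
        "ext_error_code": (value >> 16) & 0x1F,
    }


# The two status values quoted in this thread.
for bank, status in [(4, "f600000000070f0f"), (1, "b400000001020103")]:
    fields = decode_status(status)
    print(
        f"bank {bank}: flags={','.join(fields['flags'])} "
        f"error_code={fields['error_code']:#06x} "
        f"ext={fields['ext_error_code']:#x}"
    )
```

Both values have UC set, meaning the hardware did not correct the error, and the Bank 4 value also has PCC and OVER set. What the error-code fields mean for each bank still has to be looked up in the BKDG tables.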

    Okay, now that the boring part is over (it was fun for me), what can you do? It looks like the CPU is a quad core CPU. I take it to mean that it’s 4 cores * 2 SMT threads. If you have access to the Linux command line parameters [5], say via GRUB for example, I would try to isolate the two faulty cores we see here: core 0 and core 1. Add isolcpus=0,1 to see if the kernel boots. We only see errors from two CPU cores here, but others may also be faulty without their errors having been printed. It’s worth a shot, but it may not work.

    Alternatively, you can tell the kernel to disable MCE checks entirely and continue executing; this can be done with the mce=off command line parameter [6]. Beware that this means that you’re now willingly running code on a CPU with two cores that have been shown to be faulty (so far). isolcpus will make sure that the kernel doesn’t execute any “user” code on those cores unless asked to (via taskset for example).
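If the machine does boot with one of these parameters, it may be worth confirming that the kernel actually picked it up. The sketch below is only an illustration: it reads `/proc/cmdline` and `/sys/devices/system/cpu/isolated`, and the helper names `parse_cmdline` and `parse_cpu_list` are made up for this example. It splits the command line on whitespace, so quoted parameters containing spaces are not handled.

```python
from pathlib import Path


def parse_cmdline(cmdline):
    """Turn a kernel command line into {name: value}; bare flags map to None."""
    params = {}
    for token in cmdline.split():
        key, sep, value = token.partition("=")
        params[key] = value if sep else None
    return params


def parse_cpu_list(text):
    """Parse a sysfs CPU list such as '0-1' or '0,2-3' into a set of ints."""
    cpus = set()
    for part in text.strip().split(","):
        if not part:
            continue  # an empty file means no CPUs are listed
        start, sep, end = part.partition("-")
        cpus.update(range(int(start), int(end if sep else start) + 1))
    return cpus


if __name__ == "__main__":
    cmdline_path = Path("/proc/cmdline")
    isolated_path = Path("/sys/devices/system/cpu/isolated")

    if cmdline_path.exists():
        params = parse_cmdline(cmdline_path.read_text())
        print("isolcpus parameter:", params.get("isolcpus"))
        print("mce parameter:", params.get("mce"))

    if isolated_path.exists():
        isolated = parse_cpu_list(isolated_path.read_text())
        print("CPUs the kernel reports as isolated:", sorted(isolated))
```

With isolcpus=0,1 applied, the last line would be expected to list CPUs 0 and 1. If it prints an empty list, the parameter probably did not reach the kernel.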

    Apart from this, like others have pointed out, the red dots on the screen aren’t a great sign. Maybe you can individually replace defective parts, or maybe you have to buy a new machine entirely. What I told you with this comment is to check whether your CPU still works with 2 SMT threads faulty.

    Good luck and I hope you fix your server 🤞.

    [1] https://www.dell.com/support/manuals/en-us/wyse-5060-thin-client/5060_wie10_ug/system-specifications?guid=guid-cbeecec5-25ac-4103-8b4b-7d3a975e91f0&lang=en-us

    [2] https://www.amd.com/content/dam/amd/en/documents/archived-tech-docs/datasheets/52259_KB_G-Series_Product_Data_Sheet.pdf

    [3] https://www.amd.com/content/dam/amd/en/documents/archived-tech-docs/programmer-references/52740_16h_Models_30h-3Fh_BKDG.pdf

    [4] https://github.com/DimitriFourny/MCE-Ryzen-Decoder

    [5] https://www.kernel.org/doc/html/latest/admin-guide/kernel-parameters.html

    [6] https://elixir.bootlin.com/linux/v6.9.2/source/Documentation/arch/x86/x86_64/boot-options.rst

    • wizzor@sopuli.xyz
      6 months ago

      Amazing. I’m not OP and have no use for this info, but it was fun to learn it still.

    • My Password Is 1234@lemmy.worldOP
      6 months ago

      Yes, this is exactly the Dell Wyse 5060 with an AMD GX-424CC processor. This thin client is already old, which is why I decided to purchase a newer one with a better processor.

      Anyway, thank you for your analysis! I learned a lot of new things. I will try to get it running with your advice and let you know how it goes.

      However, this server will probably no longer be needed, since half of its cores are damaged. Previously, its computing power was fully utilized (the load was almost always 4.0), and it handled my tasks very well with four cores. Therefore, I cannot imagine using it with only half of its power available 😁