I see a lot about source code being leaked, and I’m wondering why you can’t make something like an exact replica of Super Mario Bros without the source code, or why you can’t take the finished product and run it back through the compilation software?

  • fenynro@lemmy.world
    1 year ago

    The long answer involves a lot of technical jargon, but the short answer is that the compilation process turns high level source code into something that the machine can read, and that process usually drops a lot of unneeded data and does some low-level optimization to make things more efficient during actual processing.

    One can use a decompiler to take that machine code and attempt to turn it back into something human-readable, but the result will usually be missing data on variable names, function calls, comments, etc., and it will include compiler-added optimizations, which makes it nearly impossible to reconstruct the original code.

    It’s sort of the code equivalent of putting a sentence into Google Translate, translating it into another language, and then immediately translating it back to the original. You often end up with differences in word choice that give you a good general idea of intent, but it’s impossible to know exactly which words were in the original sentence.
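To make the round trip concrete, here is a minimal sketch using Python's standard `ast` module (`ast.unparse` needs Python 3.9 or later). It is only an analogy: a real compiler for a language like C or C++ discards far more than this, including names, which Python happens to keep. The function `move_player` and its contents are made up for illustration.

```python
import ast

original = '''
def move_player(position, velocity):
    # Advance the player by one frame of movement.
    new_position = (position + velocity)   # parentheses are only for readability
    return new_position
'''

# Source -> syntax tree -> source again. The tree keeps the structure,
# but comments and the author's formatting never make it into the tree.
round_tripped = ast.unparse(ast.parse(original))

print(round_tripped)
```

The printed version behaves the same as the original, but the comments and the redundant parentheses are gone, and nothing in the tree says they were ever there.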

  • ryathal@sh.itjust.works
    1 year ago

    Code can be decompiled, but generally the end result isn’t human-readable. Just having the decompiled version isn’t that valuable. Having the source code as written is more helpful because you get the context of what things were named and how it was organized.

    Decompiled code is a bit like reading a book with all the nouns being random letters and verbs being random numbers.

    • TheVillageGuy@kbin.social
      1 year ago

      Not completely random: every noun or verb would map consistently to a specific word or name. But the same goes for the characters. There’d be many characters whose names, intentions, goals, and relationships would be in the same unreadable state. The storyline would likely not be chronological either; actions and decisions by all kinds of actors would intertwine. It would be very hard to translate into a readable story, let alone one that makes sense.

  • howrar@lemmy.ca
    1 year ago

    The best and simplest explanation I’ve seen: The machine code tells the computer what to do while the source code tells the human why it’s doing it.

    Your computer doesn’t need all the “why” information to run the game, so the compilation process gets rid of it. What you’re left with are instructions on exactly what computations to do, and that’s all the computer needs.

    For example, you can see in the machine code that two numbers are being added together. What do those numbers mean and why are we adding them? The source code can tell you that this is code that controls movement, one of the numbers is a velocity, the other is the player’s current position.
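As a small illustration of the "what" versus the "why", the two Python functions below do exactly the same arithmetic. The names `update_position`, `player_position`, and `player_velocity` are invented for this sketch, and `f1(a, b)` stands in for what a reverse engineer might be left with.

```python
def update_position(player_position, player_velocity):
    """Move the player one frame: new position = old position + velocity."""
    return player_position + player_velocity


def f1(a, b):
    # Same computation with the context stripped away: two numbers are added.
    return a + b


print(update_position(100, 3), f1(100, 3))
```

Both calls produce the same number; only the first tells a reader that it is movement code.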

  • Dark Arc@social.packetloss.gg
    1 year ago

    I actually work on a C++ compiler… I think I should weigh in. The general consensus here that things are lossy is correct but perhaps non-obvious if you’re not familiar with the domain.

    When you compile a program, you’re taking the source, turning it into a graph that represents every aspect of the program, and then generating some kind of intermediate representation (IR) that then gets turned into machine code.

    Right off the bat, you lose things like code comments because the machine doesn’t care about them.

    Then you lose local variable and function parameter names because the machine doesn’t care about those things.

    Then you lose your class structure … because the machine really just cares about the total size of the thing it’s passing around. You can recover some of this information by looking at the functions, but it’s not always going to be straightforward because not every constructor initializes everything and things like unions add further complexity … and not every memory allocation uses a constructor. You won’t get any names of any data members/fields though because … again the machine doesn’t care.

    So what you’re left with is basically the mangled names of functions and what you can derive from how instructions access memory.

    The mangled names normally tell you a lot: the namespace, the class (if any), and the argument count and types. Of course, that’s not guaranteed either; it’s just how we come up with unique, stable names for the various things in your program. It could function with a bunch of UUIDs if you set up a table on the compiler’s side to associate everything.
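To show how much a mangled name can carry, here is a toy decoder in Python. It assumes the Itanium C++ ABI mangling scheme (the one GCC and Clang use on most non-Windows platforms; MSVC uses a different scheme) and handles only the simplest shape: a nested name followed by a few built-in parameter types. The symbol `_ZN4game6Player4moveEif` and the function `toy_demangle` are hypothetical examples, not taken from any real program.

```python
# Single-letter codes for a few built-in types in the Itanium C++ ABI scheme.
BUILTIN_TYPES = {
    "v": "void", "b": "bool", "c": "char", "i": "int",
    "j": "unsigned int", "l": "long", "f": "float", "d": "double",
}


def toy_demangle(symbol):
    """Decode names shaped like _ZN<len><name>...E<types>; nothing else is supported."""
    if not symbol.startswith("_ZN"):
        raise ValueError("only simple nested names are handled in this sketch")
    pos = 3
    parts = []
    # Each scope is stored as a decimal length followed by that many characters.
    while symbol[pos] != "E":
        digits = ""
        while symbol[pos].isdigit():
            digits += symbol[pos]
            pos += 1
        if not digits:
            raise ValueError("unsupported construct at offset %d" % pos)
        length = int(digits)
        parts.append(symbol[pos:pos + length])
        pos += length
    pos += 1  # skip the closing "E"
    # What follows is one type code per parameter.
    params = [BUILTIN_TYPES[code] for code in symbol[pos:]]
    if params == ["void"]:
        params = []  # a lone "v" means the function takes no arguments
    return "::".join(parts) + "(" + ", ".join(params) + ")"


print(toy_demangle("_ZN4game6Player4moveEif"))  # game::Player::move(int, float)
```

Note what is still missing even here: the parameter names and the body's local variable names are not encoded at all.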

    But wait! There’s more! The optimizer can do some really wild things in the name of speed… Including combining functions. Those constructors? Gone, now they’re just some more operations in the function bodies. That function you wrote to help improve readability of your code? Gone. That function you wrote to deduplicate code? Gone. That eloquent recursive logic you wrote? Gone, now it’s the moral equivalent of a giant mess of goto statements. That template code that makes use of dozens of instantiated functions? Those functions are gone now too; instead it’s all the instantiated logic puked out into one giant function. That piece of logic computing a value? Well the compiler figured out it’s always 27, so the logic to compute it? Gone.

    Now, all of that stuff doesn’t happen every time; in particular, not all of those optimizations are always possible or worthwhile … But you can see how incredibly difficult it is to reconstruct a program once it’s been compiled and gone through optimization. Even if you do reconstruct it, there’s a very low chance that it will look anything like what you started with.
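The "it's always 27" case can be sketched with a toy constant folder. This is not how any particular C++ compiler implements the optimization; it is a minimal Python illustration built on the standard `ast` module (Python 3.9 or later for `ast.unparse`), and the class name `ConstantFolder` and the variable `lives` are invented for the example.

```python
import ast
import operator

# The only operators this toy optimizer knows how to pre-compute.
FOLDABLE = {ast.Add: operator.add, ast.Sub: operator.sub, ast.Mult: operator.mul}


class ConstantFolder(ast.NodeTransformer):
    """Replace arithmetic on literal numbers with its result."""

    def visit_BinOp(self, node):
        self.generic_visit(node)  # fold the operands first (bottom-up)
        op = FOLDABLE.get(type(node.op))
        if (op is not None
                and isinstance(node.left, ast.Constant)
                and isinstance(node.right, ast.Constant)
                and isinstance(node.left.value, (int, float))
                and isinstance(node.right.value, (int, float))):
            result = op(node.left.value, node.right.value)
            folded = ast.Constant(value=result, kind=None)
            return ast.copy_location(folded, node)
        return node


source = "lives = (2 + 1) * 9"
tree = ConstantFolder().visit(ast.parse(source))
optimized = ast.unparse(ast.fix_missing_locations(tree))
print(optimized)  # lives = 27
```

After folding, nothing in the output says whether the author wrote `(2 + 1) * 9`, `30 - 3`, or just `27`.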

    • Treczoks@lemmy.world
      1 year ago

      Just wait until you see the crazy optimizers for embedded systems. They take the complete code of a system into consideration and, over a number of compile passes, reuse code snippets from the app, libraries, and OS layer to create one big tangled mess that is hard to follow even if you have the source code…

  • spudwart@spudwart.com
    1 year ago

    Well, actually it can be. It just takes a lot more work to decompile code than to compile it, and how much depends on how accurate the result needs to be.

    Example: the Super Mario 64 decompilation project. This project used various debug data that was left in the ROM to decompile the game back into source code that compiles to a byte-accurate copy of the ROM. This took about 3 years and a lot of skilled developers to accomplish.

    Side note: Super Mario Bros wasn’t built using a compiled language, but rather Assembly. So technically that would be a Disassembly not a Decompilation.
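Disassembly is mostly a table lookup, which is why it loses less than decompilation. The sketch below is a toy disassembler in Python for five instructions of the 6502 family (the CPU family used in the NES). The byte sequence is a made-up example, not code from the actual game, and a real disassembler would cover the full instruction set.

```python
# opcode byte -> (mnemonic, operand format, operand size in bytes)
OPCODES = {
    0xA9: ("LDA", "#${:02X}", 1),   # load accumulator, immediate value
    0x69: ("ADC", "#${:02X}", 1),   # add with carry, immediate value
    0x18: ("CLC", "", 0),           # clear carry flag
    0x8D: ("STA", "${:04X}", 2),    # store accumulator, absolute address
    0x60: ("RTS", "", 0),           # return from subroutine
}


def disassemble(code):
    """Turn raw bytes into one line of assembly text per instruction."""
    lines = []
    pos = 0
    while pos < len(code):
        mnemonic, fmt, size = OPCODES[code[pos]]
        operand_bytes = code[pos + 1:pos + 1 + size]
        # 6502 operands are little-endian: low byte first.
        operand = int.from_bytes(operand_bytes, "little")
        lines.append((mnemonic + " " + fmt.format(operand)).strip())
        pos += 1 + size
    return lines


machine_code = bytes([0xA9, 0x05, 0x18, 0x69, 0x03, 0x8D, 0x00, 0x02, 0x60])
for line in disassemble(machine_code):
    print(line)
```

The instructions come back exactly, but the labels and comments do not: the output says `STA $0200`, not what the programmer called that memory location or why the value is stored there.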

  • Moondance@sh.itjust.works
    1 year ago

    The compilation process discards information, which makes it a many-to-one mapping: many different source programs can compile to the same output. A good decompiler allows one to retrieve a program that is functionally equivalent to the source code but not exactly the source code.
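The many-to-one effect is easy to observe with Python's built-in `compile`, used here only as a stand-in for a native compiler. The three source strings are invented for the example; they differ in comments, spacing, and parentheses.

```python
variants = [
    "total = price * quantity  # price is per item, before tax",
    "total=price*quantity",
    "total = (price * quantity)",
]

# Compile each variant and keep only the raw instruction bytes.
bytecodes = {compile(src, "<example>", "exec").co_code for src in variants}

print(len(variants), "sources ->", len(bytecodes), "distinct instruction sequence(s)")
```

Because all three sources collapse to one instruction sequence, no tool can tell from the output alone which of them was the original.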

  • Rikudou_Sage@lemmings.world
    1 year ago

    As I’ve read somewhere once: it’s easy to make a burger out of a cow. Making a cow out of a burger is slightly harder.

    That means that compiling code is a lossy process: the original code is not contained in the compiled output, so it can never be recovered exactly from that output alone.

  • amio@kbin.social
    1 year ago

    The general difference is that you lose out on metadata: names, comments and organization that help the source code, in whatever programming language, make sense, but which are not needed to actually execute the desired behavior on your CPU. Usually that means stuff like sensible names for bits of your code: functions/reusable logic, storage locations for “health” or “armor” or “current powerup”, movement states, types of objects etc.

    However, most of these are just another kind of number to the computer itself, so a lot of compilation processes strip a lot of this information. You could still reverse engineer it, but you’re missing context (like all those names) from the original code and that makes the work potentially pretty difficult. Bear in mind that reading actual original source code is sometimes cryptic enough, then compare “if player is dead, show game over screen” to if (sdfdfgsdfg == jgdfg) { lkghku(); } because the “decompiler” has to invent some kind of name for everything that’s missing. Now you have to deal with thousands of jfdsghklgs, and figure out what it all means.
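The invented-names problem can be simulated by stripping the names out of readable code. The sketch below uses Python's standard `ast` module (Python 3.9 or later) and is a deliberately naive toy: it uses one global renaming table, ignores scoping rules, and leaves attributes alone. `NameStripper` and `check_game_over` are hypothetical names made up for the illustration.

```python
import ast
import builtins


class NameStripper(ast.NodeTransformer):
    """Rename every non-builtin identifier to a meaningless placeholder."""

    def __init__(self):
        self.mapping = {}

    def _rename(self, name):
        if hasattr(builtins, name):
            return name  # leave print, len, etc. alone
        if name not in self.mapping:
            self.mapping[name] = f"v{len(self.mapping)}"
        return self.mapping[name]

    def visit_FunctionDef(self, node):
        node.name = self._rename(node.name)
        self.generic_visit(node)  # then rename parameters and body
        return node

    def visit_arg(self, node):
        node.arg = self._rename(node.arg)
        return node

    def visit_Name(self, node):
        node.id = self._rename(node.id)
        return node


source = '''
def check_game_over(player_health):
    if player_health <= 0:
        return "game over"
    return "keep playing"
'''

stripped = ast.unparse(NameStripper().visit(ast.parse(source)))
print(stripped)
```

The stripped version still runs and gives the same answers, but the reader now has to work out from behavior alone that `v0(v1)` is a game-over check on the player's health.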

  • You can certainly take a compiled program apart and work back from its machine code, but there could be gaps and things lost in translation between the programming language used to create the program and the machine code that results from it.

    When you program, like actually write the code, you’re using one language. When you compile it, you’re handing it to a translator that converts it into another language. There could be even more layers of this depending on what you’re doing.

    Now think about what happens when you open a translator, enter some words, translate them to one language, and then another, and back to the original. It comes out all wrong; the same thing happens with code. There’s nuance and flavor imparted by the language itself that isn’t kept through the translation of that language into the language that the computer actually uses to do its tasks.