Deduplication tool

Agility0971@lemmy.world · 1 year ago

Deduplication tool

fungos@lemmy.eco.br · 1 year ago

This is the best: https://github.com/sahib/rmlint

utopiah@lemmy.ml · 1 year ago

Neat, wasn’t aware of it, thanks for sharing.

fartsparkles@sh.itjust.works · 1 year ago

I don’t know about deduping mid transfer but these two have been helpful over the years:

MalReynolds@slrpnk.net · 1 year ago

Be aware that halfway decent backup solutions dedupe. Which is not to say you shouldn’t clean your shit up. I vote https://github.com/qarmin/czkawka.

chtk@feddit.nl · 1 year ago

jdupes is my go-to solution for file deduplication. It should be able to remove duplicate files. I don’t know how much control it gives you over which duplicate to remove though.

lurch (he/him)@sh.itjust.works · 1 year ago

Make sure to make the first backup before you use deduplication, just in case it goes sideways.

utopiah@lemmy.ml · 1 year ago

I don’t actually know, but I bet that’s relatively costly, so I would at least try to be mindful of efficiency, e.g.

use find to start only with large files, e.g. > 1 GB (depends on your own threshold)
look for a “cheap” way to find duplicates, e.g. exact same size (far from perfect, yet I bet it is sufficient in most cases)

then, after trying a couple of times

find a “better” way to detect duplicates, e.g. SHA1 (quite expensive)
lower the threshold to include more files, e.g. > 0.1 GB

and possibly heuristics, e.g.

directories where all filenames are identical, maybe based on locate/updatedb, which is most likely already indexing your entire filesystem

Why do I suggest all this rather than a tool? Because I bet a lot of decisions have to be made manually.
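The staged approach above (size threshold first, then group by exact size, then hash only the candidates) could be sketched roughly like this in Python. This is a minimal sketch, not a replacement for the tools mentioned in this thread: the names find_duplicates and sha1_of and the default threshold are made up for illustration, and SHA-1 is used only because it is the hash mentioned above.

```python
import hashlib
import os
from collections import defaultdict


def sha1_of(path, block_size=1 << 20):
    """Hash a file block by block so large files never sit fully in memory."""
    digest = hashlib.sha1()
    with open(path, "rb") as handle:
        for block in iter(lambda: handle.read(block_size), b""):
            digest.update(block)
    return digest.hexdigest()


def find_duplicates(root, min_size=1_000_000_000):
    """Return lists of paths under root whose size and SHA-1 both match.

    min_size is the (hypothetical) threshold in bytes; smaller files are ignored.
    """
    # Cheap pass: group files by exact size, skipping anything below the threshold.
    by_size = defaultdict(list)
    for folder, _dirs, names in os.walk(root):
        for name in names:
            path = os.path.join(folder, name)
            if os.path.islink(path) or not os.path.isfile(path):
                continue  # skip symlinks and special files
            size = os.path.getsize(path)
            if size >= min_size:
                by_size[size].append(path)

    # Expensive pass: hash only files that share their size with another file.
    groups = []
    for paths in by_size.values():
        if len(paths) < 2:
            continue  # a unique size cannot have a duplicate
        by_hash = defaultdict(list)
        for path in paths:
            by_hash[sha1_of(path)].append(path)
        groups.extend(sorted(group) for group in by_hash.values() if len(group) > 1)
    return groups
```

Calling find_duplicates on a directory with a lower min_size on later runs mirrors the “lower the threshold” step. It only reports groups; deciding which copy to keep stays manual.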

utopiah@lemmy.ml · 1 year ago

If you use rmlint, as others suggested, here is how to list the paths of the duplicates:

jq -c '.[] | select(.type == "duplicate_file").path' rmlint.json
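If you would rather post-process the report in a script, the same filter can be written in Python. This is a sketch that relies only on what the jq command above already assumes: rmlint.json is a JSON array whose entries carry a type and a path field. The function names are made up for illustration.

```python
import json


def duplicate_paths(report):
    """Return the path of every entry whose type is "duplicate_file"."""
    return [
        entry["path"]
        for entry in report
        if isinstance(entry, dict) and entry.get("type") == "duplicate_file"
    ]


def load_duplicate_paths(json_file="rmlint.json"):
    """Read the report from disk and apply the same filter as the jq one-liner."""
    with open(json_file, encoding="utf-8") as handle:
        return duplicate_paths(json.load(handle))
```

Entries without a type field are simply skipped, so the filter does not depend on what else the report contains.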

utopiah@lemmy.ml · 1 year ago

fclones https://github.com/pkolaczk/fclones looks great, but I haven’t used it, so I can’t vouch for it.

paris@lemmy.blahaj.zone · edit-2 1 year ago

I was using Radarr/Sonarr to download files via qBittorrent and then hardlink them to an organized directory for Jellyfin, but I set up my container volume mappings incorrectly and it was only copying the files over, not hardlinking them. When I realized this, I fixed the volume mappings and ended up using fclones to deduplicate the existing files and it was amazing. It did exactly what I needed it to and it did it fast. Highly recommend fclones.

I’ve used it on Windows as well, but I’ve had much more trouble there since I like to write the output to a file first to double check it before catting the information back into fclones to actually deduplicate the files it found. I think running everything as admin works but I don’t remember.

HumanPerson@sh.itjust.works · 1 year ago

I believe ZFS has deduplication built in if you want a separate backup partition. Not sure about its reliability though. Personally I just have a script that keeps a backup and an oldbackup, and they are both fairly small. I keep a file in my home dir called excluded for things like Linux ISOs that don’t need to be backed up.

boredsquirrel@slrpnk.net · 1 year ago

btrbk

Nine@lemmy.world · 1 year ago

Restic

BCsven@lemmy.ca · 1 year ago

FSlint will do some of these things once you configure its actions.

Kualk@lemm.ee · 1 year ago

hardlink

Most underrated tool that is frequently installed on your system. It recognizes BTRFS. Be aware that there are multiple versions of it in the wild.

It is unattended.

https://www.man7.org/linux/man-pages/man1/hardlink.1.html

Tramort@programming.dev · 1 year ago

Is hardlink the same as ln without the -s switch?

I tried reading the page but it’s not clear

deadbeef79000@lemmy.nz · edit-2 1 year ago

ln creates a hard link, ln -s creates a symlink.

So, yes, the hardlink tool effectively replaces a file’s duplicates with hard links automatically, as if you’d used ln manually.
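To make that concrete, here is a small Python sketch of what “replacing a duplicate with a hard link” means at the filesystem level. It is not how the hardlink tool is implemented; replace_with_hardlink and the temporary file suffix are hypothetical, and it assumes both paths live on the same filesystem.

```python
import filecmp
import os


def replace_with_hardlink(original, duplicate):
    """Replace duplicate with a hard link to original (same filesystem only)."""
    # Never link files whose bytes differ; compare contents, not just metadata.
    if not filecmp.cmp(original, duplicate, shallow=False):
        raise ValueError("files differ, refusing to link")
    temporary = duplicate + ".hardlink-tmp"  # hypothetical temporary name
    os.link(original, temporary)       # same effect as `ln original temporary`
    os.replace(temporary, duplicate)   # swap the link into place in one step


def same_inode(a, b):
    """True when both paths are names for the same file on disk."""
    return os.path.samefile(a, b)
```

After the call both names point at the same data, so editing the file through one name changes it under the other as well. That is the main thing to keep in mind before hard-linking files you still intend to modify separately.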

Tramort@programming.dev · 1 year ago

Ahh! Cool! Thanks for the explanation.

biribiri11@lemmy.ml · 1 year ago

As said previously, Borg is a full deduplicating incremental archiver complete with compression. You can use relative paths temporarily to build up your backups and a full backup history, then use something like pika to browse the archives to ensure a complete history.

Agility0971@lemmy.world · 1 year ago

I did not ask for a backup solution, but for a deduplication tool

rotopenguin@infosec.pub · edit-2 1 year ago

Use rm with the redundant files option.

rm -rf /

biribiri11@lemmy.ml · edit-2 1 year ago

Tbf you did start your post with

“I’m in the process of starting a proper backup”

So you’re going to end up with at least a few people talking about how to onboard your existing backups into a proper backup solution (like borg). Your bullet points can certainly probably be organized into a shell script with sync, but why? A proper backup solution with a full backup history is going to be way more useful than dumping all your files into a directory and renaming in case something clobbers. I don’t see the point in doing anything other than tarring your old backups and using borg import-tar (docs). It feels like you’re trying to go from one half-baked, odd backup solution to another, instead of just going with a full, complete solution.

kylian0087@lemmy.dbzer0.com · 1 year ago

Take a look at Borg. It is a very well suited backup tool that has deduplication.

JetpackJackson@feddit.de · 1 year ago

Instead of trying to parse the old stuff, could you just run something like borg and then delete the old copypaste backup? Or are there other files there that you need to go through? I ask because I went through a similar thing switching my backups from rsync to borg

Agility0971@lemmy.world · 1 year ago

I had multiple systems which at some point were syncing with Syncthing, but over time I stopped using my desktop computer and the Syncthing service went unmaintained. I had to remove the SSD from the old desktop, so I yoinked the home directory and saved it onto my laptop. As you can probably tell, a lot of stuff got duplicated and a lot of stuff diverged over time. My idea is to merge everything into my laptop’s home directory and then look only at the diverged files manually, as that would be less work. I don’t think doing a backup with all my redundant files is a good idea, as the initial backup would include other backups and a lot of duplicated files.

lemmyvore@feddit.nl · 1 year ago

Use Borg Backup. It has built-in deduplication — it works with chunks not files and will recognize identical chunks and avoid storing them multiple times. It will deduplicate your files and will find duplicated chunks even in files you didn’t know had duplicates. You can continue to keep your files duplicated or clean them out, it doesn’t matter, the borg backups will be optimized either way.
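A toy illustration of the chunk idea, in Python. This is deliberately not Borg’s algorithm: Borg cuts files into variable-size chunks based on their content and also compresses them, while this sketch uses fixed-size chunks and a plain dictionary. ChunkStore and its methods are made-up names.

```python
import hashlib


class ChunkStore:
    """Toy content-addressed store: each distinct chunk is kept only once."""

    def __init__(self, chunk_size=4):
        self.chunk_size = chunk_size
        self.chunks = {}  # SHA-256 hex digest -> chunk bytes

    def add(self, data):
        """Store data and return the list of chunk ids needed to rebuild it."""
        recipe = []
        for start in range(0, len(data), self.chunk_size):
            chunk = data[start:start + self.chunk_size]
            chunk_id = hashlib.sha256(chunk).hexdigest()
            self.chunks.setdefault(chunk_id, chunk)  # no-op if already stored
            recipe.append(chunk_id)
        return recipe

    def restore(self, recipe):
        """Rebuild the original bytes from a list of chunk ids."""
        return b"".join(self.chunks[chunk_id] for chunk_id in recipe)

    def stored_bytes(self):
        """Total size of the unique chunks actually kept."""
        return sum(len(chunk) for chunk in self.chunks.values())
```

Adding b"AAAABBBBCCCC" and then b"AAAABBBBDDDD" to the same store keeps the shared AAAA and BBBB chunks once, which is why two files that are only partly identical still save space.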

FryAndBender@lemmy.world · 1 year ago

Here are the stats from a backup of 1 server with approx 600gig

                   Original size      Compressed size    Deduplicated size
This archive:          592.44 GB            553.58 GB             13.79 MB
All archives:           14.81 TB             13.94 TB            599.58 GB

                   Unique chunks         Total chunks
Chunk index:             2760965             19590945

13meg… nice
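For anyone who wants to turn those numbers into ratios, a quick Python sketch of the arithmetic. It assumes the units in the stats are decimal (1 GB = 10^9 bytes), and the helper names are made up.

```python
UNITS = {"MB": 10**6, "GB": 10**9, "TB": 10**12}  # assumption: decimal units


def to_bytes(value, unit):
    """Convert a (value, unit) pair from the stats into bytes."""
    return value * UNITS[unit]


def dedup_ratio(original, deduplicated):
    """How many times larger the original data is than what was stored."""
    return to_bytes(*original) / to_bytes(*deduplicated)


# Figures copied from the stats above.
all_archives = dedup_ratio((14.81, "TB"), (599.58, "GB"))
this_archive = dedup_ratio((592.44, "GB"), (13.79, "MB"))
chunk_reuse = 19590945 / 2760965  # total chunks per unique chunk

print(f"all archives: {all_archives:.1f}x")
print(f"this archive: {this_archive:.0f}x")
print(f"average references per unique chunk: {chunk_reuse:.1f}")
```

The per-archive ratio is huge because almost every chunk in this run was already stored by earlier archives; the all-archives ratio is the more meaningful figure for how much disk the whole history takes.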