
Dehydrate-fs (Beta)

dehydrate-fs is a family of tools for separating out files from disk images for the efficient storage of both. The project currently exists as a minimum viable product supporting only the ext2/3/4 filesystems.

Quickstart

# Generate the filesystem map
map "$file" mapfile.dat

# Dehydrate and compress the filesystem
dehydrate "$file" mapfile.dat | zip -1 "$file".dhd.zip -

# Rehydrate the filesystem
funzip "$file".dhd.zip | rehydrate mapfile.dat "$file".rhd

# Compare results
cmp "$file" "$file".rhd

Installation

The scripts may be run directly. Please ensure you have perl and e2fsprogs available.
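A quick way to check the prerequisites before running anything is sketched below. The helper name `missing_tools` is made up for this example, and `debugfs` is only assumed to be a representative e2fsprogs binary; the scripts may rely on different ones.

```shell
# Print the name of each given command that cannot be found on PATH.
# Hypothetical helper, not part of dehydrate-fs.
missing_tools() {
    for tool in "$@"; do
        command -v "$tool" >/dev/null 2>&1 || printf '%s\n' "$tool"
    done
}

# Assumption: debugfs stands in for "e2fsprogs is installed".
missing=$(missing_tools perl debugfs)
if [ -n "$missing" ]; then
    echo "missing prerequisites:" $missing >&2
fi
```

On some distributions the e2fsprogs binaries live in /sbin or /usr/sbin, so an unprivileged PATH may need adjusting before the check passes.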

Usage

map FILE [MAPFILE]

Create a mapping of files in the partition image and extract their contents. If MAPFILE is not specified, the output is written to STDOUT. Extracted files are placed in ./pool/, each named after its SHA-256 checksum.
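Because pool files are named after their checksum, the pool can be checked for corruption at any time. The following is a minimal sketch, assuming each file name is exactly the bare hex digest printed by `sha256sum` with no extension; `verify_pool` is a hypothetical helper, not one of the project's scripts.

```shell
# Check that every regular file in a pool directory is named after its own
# SHA-256 digest. Prints mismatching paths; returns non-zero if any exist.
verify_pool() {
    local pool status path actual
    pool=${1:-./pool}
    status=0
    for path in "$pool"/*; do
        [ -f "$path" ] || continue
        actual=$(sha256sum "$path" | cut -d ' ' -f 1)
        if [ "$actual" != "$(basename "$path")" ]; then
            printf 'mismatch: %s\n' "$path"
            status=1
        fi
    done
    return "$status"
}
```

Running `verify_pool ./pool` before rehydrating gives an early warning if a pooled file was altered or truncated.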

map accepts an environment variable THRESHOLD that sets the minimum file size, in bytes, for a file to be extracted. It defaults to 1048576 (1 MiB).
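The usual shell idiom for this kind of setting is sketched below. `effective_threshold` is a hypothetical stand-in written for illustration; how map.sh actually reads the variable is not shown here and may differ.

```shell
# Hypothetical stand-in: use THRESHOLD from the environment if it is set,
# otherwise fall back to the documented default of 1048576 bytes.
effective_threshold() {
    printf '%s\n' "${THRESHOLD:-1048576}"
}

# Value used when THRESHOLD is unset or empty
effective_threshold

# Override to 4 MiB; the subshell keeps the change from leaking out
( THRESHOLD=$((4 * 1024 * 1024)); effective_threshold )
```

With the real tool, a one-off override would look like `THRESHOLD=4194304 map "$file" mapfile.dat`.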

dehydrate FILE MAPFILE [OUTPUT]

Create a copy of FILE with zeros written to the locations specified by MAPFILE. If OUTPUT is not specified, the output is written to STDOUT. To prevent terminal corruption, the program will not run if STDOUT is a terminal.
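The core idea, overwriting a byte range with zeros without changing the file size, can be sketched with `dd`. This is an illustration only: `zero_range` is a hypothetical helper, the offset and length are made-up inputs rather than values parsed from a real mapfile, and it edits a file in place instead of producing a copy.

```shell
# Overwrite LENGTH bytes of FILE with zeros starting at byte OFFSET.
# conv=notrunc keeps the rest of the file (and its size) intact.
# bs=1 is slow but keeps the offset arithmetic obvious.
zero_range() {
    dd if=/dev/zero of="$1" bs=1 seek="$2" count="$3" conv=notrunc 2>/dev/null
}

# Demo on a scratch file: 16 bytes of 'A', then zero bytes 4 through 11.
demo=$(mktemp)
printf 'AAAAAAAAAAAAAAAA' > "$demo"
zero_range "$demo" 4 8
od -An -c "$demo"
rm -f "$demo"
```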

Because the dehydrated file is the same size as the input, it is recommended that you stream the output into a compressed archive. zip is recommended, but xz performs similarly enough. gzip does not appear to be appropriate unless higher-quality compression is desired.

dehydrate "$file" "$mapfile" | zip -1 "$file".dhd.zip -
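To compare compressors on your own data, a small helper that counts the bytes a compressor emits can be useful. The sketch below uses a hypothetical `compressed_size` function and one MiB of zeros as a stand-in for a dehydrated region; real images will behave differently.

```shell
# Report how many bytes a compressor writes for the data on stdin.
# Usage: producer | compressed_size COMPRESSOR [ARGS...]
compressed_size() {
    "$@" | wc -c
}

# Try whichever of these compressors happen to be installed.
for c in gzip xz; do
    if command -v "$c" >/dev/null 2>&1; then
        printf '%s: ' "$c"
        head -c 1048576 /dev/zero | compressed_size "$c" -1 -c
    fi
done
```

For a real measurement, replace the `head` producer with `dehydrate "$file" "$mapfile"`.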

rehydrate MAPFILE [OUTPUT]

Read from STDIN, replacing specified subsections with file data according to MAPFILE. rehydrate requires that the file contents are available under ./pool/. If OUTPUT is not specified, the output is written to STDOUT. To prevent terminal corruption, the program will not run if STDOUT is a terminal.
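Two pieces of this behaviour can be sketched in a few lines: the terminal check and the splice of file data back into an image. Both helpers below are hypothetical. The splice works in place on a regular file rather than on a stream as rehydrate does, and it assumes a single contiguous range, which may not hold for fragmented files.

```shell
# Fail if stdout is a terminal, so binary data is never dumped to the screen.
refuse_terminal_stdout() {
    if [ -t 1 ]; then
        echo "refusing to write binary data to a terminal" >&2
        return 1
    fi
}

# Write the full contents of SOURCE into IMAGE starting at byte OFFSET,
# leaving every other byte of IMAGE untouched.
# Usage: restore_range IMAGE SOURCE OFFSET
restore_range() {
    dd if="$2" of="$1" bs=1 seek="$3" conv=notrunc 2>/dev/null
}
```

A call such as `restore_range image.rhd pool/<digest> 4096` would put a pooled file back at byte 4096; the offset here is invented for the example.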

FAQ

Why is this necessary when chunk-based deduplicating storage systems exist?

To my knowledge, most chunk-based deduplicating storage systems operate at a very coarse level that isn't suitable for collections of average-sized or fragmented files.

What is the risk of data loss?

The tools are written entirely in bash with very little attention paid to error handling. It is assumed the user will verify data integrity before committing irreversible actions. That said, the pseudo-formats are developed specifically to be as simple as possible. Dehydrated ext2/3/4 filesystem images are mountable using native tools and the mapfile format is trivial to parse.
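One way to make the "verify before deleting" step hard to skip is to tie the deletion to a successful comparison. This is a minimal sketch with a hypothetical helper name, assuming the rehydrated copy already exists on disk.

```shell
# Delete ORIGINAL only when RESTORED is a separate file with identical bytes.
# Usage: remove_if_identical ORIGINAL RESTORED
remove_if_identical() {
    if [ "$1" -ef "$2" ]; then
        echo "both paths refer to the same file: keeping $1" >&2
        return 1
    fi
    if cmp -s "$1" "$2"; then
        rm -- "$1"
    else
        echo "mismatch: keeping $1" >&2
        return 1
    fi
}
```

For example, `remove_if_identical "$file" "$file".rhd` removes the original image only after a byte-for-byte match. The dehydrated archive, the mapfile and ./pool/ must all be kept, since together they are the only remaining copy of the data.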

Why the hell is it programmed entirely in bash?!

Because I could.

No seriously, why?

I am not a clever man. Even toy programs for interacting with ext2/3/4 make my head swim. Too many details, not enough visible intent. I prefer shell scripting for this reason.