2011-06-18 : I want a backup system that supports distributed, replicated backups, with a data store that avoids duplicating data.
- 2011-06-18 - Backup notes - API and data structures
- 2011-06-19 - Backup notes - API and data structures
- 2011-07-24 - Backup notes - state machine
- There are two types of tasks - PendingRecords and PendingFragments.
- FileUploadTracker → FcidUploadTracker → PendingFragment
- Threat Analysis
- I was thinking it would be easy to download a FileSet if it were broken into chunks and
stored in the FileStore. In that case (and in any case), it would be nice if
that data could be deduplicated.
(One estimate showed the backup metadata was 0.3% the size of the original data. My
current disk usage is about 245 GiB, so a complete backup will require about 735 MiB. Most of that
won't change on a daily basis, so deduplication would be very effective for daily backups.)
I can sort the records and use hash-based IDs, so the FileSet will tend to come out the same
from one backup to the next.
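As a rough sketch of what the hash-keyed, deduplicated chunk storage could look like (the
FileStore `contains`/`put` methods and the 1 MiB chunk size are assumptions for illustration,
not the actual API):

```python
# Minimal sketch of content-addressed chunk storage. A chunk's id is the
# SHA-256 of its bytes, so an unchanged chunk hashes to the same id and is
# skipped on upload -- that skip is the deduplication.
import hashlib

CHUNK_SIZE = 1 << 20  # 1 MiB; the real chunk size is an open choice

def store_fileset(fileset_bytes, file_store):
    """Split a serialized FileSet into hash-keyed chunks and upload only new ones."""
    chunk_ids = []
    for offset in range(0, len(fileset_bytes), CHUNK_SIZE):
        chunk = fileset_bytes[offset:offset + CHUNK_SIZE]
        chunk_id = hashlib.sha256(chunk).hexdigest()
        if not file_store.contains(chunk_id):   # assumed FileStore method
            file_store.put(chunk_id, chunk)     # assumed FileStore method
        chunk_ids.append(chunk_id)
    return chunk_ids  # this list is enough to reassemble the FileSet later
```

Of course, this only deduplicates if identical records also produce identical encrypted bytes,
which is where the IV question below comes in.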
However, security makes things challenging. Even if I always use the same password, random IVs
will prevent deduplication, yet the IV must not be based on the file contents, or that
opens us up to a precalculated password attack. The best solution
seems to be to look at the previous FileSet
and reuse the IV if the record is the same. This means we'll have to download the previous
FileSet even when doing a deep scan (which otherwise doesn't require any data from the server),
but I suppose in that case the additional overhead is not significant.
Reusing the IV is safe because we are actually just copying the identical record. We're not
reusing the same IV on different data, and the original IV was still generated randomly.
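To make the IV-reuse rule concrete, a minimal sketch (the StoredRecord layout, the
encrypt_record helper, and the SHA-256 comparison are all assumptions for illustration, not the
actual implementation):

```python
# Sketch of choosing the IV for a FileSet record. If the plaintext record is
# byte-for-byte identical to the one in the previous FileSet, copy the old
# ciphertext (and its IV) verbatim; otherwise generate a fresh random IV.
import hashlib
import os
from typing import Callable, Mapping, NamedTuple, Optional

class StoredRecord(NamedTuple):
    plaintext_hash: bytes  # lets us detect "same record" cheaply
    iv: bytes
    ciphertext: bytes

def encrypt_for_fileset(
    key: bytes,
    record_id: str,
    plaintext: bytes,
    previous_fileset: Mapping[str, StoredRecord],
    encrypt_record: Callable[[bytes, bytes, bytes], bytes],  # assumed helper
) -> StoredRecord:
    digest = hashlib.sha256(plaintext).digest()
    prev: Optional[StoredRecord] = previous_fileset.get(record_id)
    if prev is not None and prev.plaintext_hash == digest:
        # Unchanged record: copy the identical encrypted record, IV and all.
        # The IV is never paired with different data, so reuse is safe.
        return prev
    # New or changed record: fresh random IV, never derived from the contents.
    iv = os.urandom(16)
    return StoredRecord(digest, iv, encrypt_record(key, iv, plaintext))
```

Keeping a small plaintext hash per record is just one way to tell "same record"; comparing
against the decrypted previous FileSet directly would work as well.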
Chunk-based backup