This repository has been archived by the owner on May 18, 2020. It is now read-only.

Use plain files instead of zip for crypto blobs #26

Open
wants to merge 1 commit into
base: master
from

Conversation

@jfhs jfhs commented Jul 2, 2019

This PR reverts the changes made in #4, as that approach doesn't seem to scale to a larger number of blobs.
However, it makes one small change that should remove the problems with viewing and managing the crypto folder: files are now expected to live in folders named after the first bytes of the hash, i.e.:

crypto/AABBCC01234567890AABBCC01234567890.bin

will now be stored in

crypto/AA/BB/CC/01234567890AABBCC01234567890.bin

There is a helper script in the orbital repo to rearrange files from the flat structure into this one.
With the current dump and 3 "levels", there should be at most 2 files in any one folder.
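
For illustration, the mapping and the rearranging step can be sketched as below. This is not the helper script from the orbital repo; the function names `nested_path` and `rearrange` are hypothetical, and the sketch assumes every blob is named `<hex hash>.bin` with two hex characters consumed per level.

```python
import shutil
from pathlib import Path


def nested_path(root: Path, name: str, levels: int = 3) -> Path:
    """Map 'AABBCC0123.bin' to root/AA/BB/CC/0123.bin (two hex chars per level)."""
    stem, dot, suffix = name.partition(".")
    if len(stem) <= 2 * levels:
        raise ValueError(f"name too short for {levels} levels: {name!r}")
    parts = [stem[2 * i:2 * i + 2] for i in range(levels)]
    rest = stem[2 * levels:] + dot + suffix
    return root.joinpath(*parts, rest)


def rearrange(root: Path, levels: int = 3) -> int:
    """Move every flat *.bin file directly under root into its nested folder.

    Returns the largest number of files this run placed into a single folder
    (files that were already nested before the run are not counted).
    """
    counts = {}
    # sorted() materializes the listing before any file is moved
    for src in sorted(root.glob("*.bin")):
        dst = nested_path(root, src.name, levels)
        dst.parent.mkdir(parents=True, exist_ok=True)
        shutil.move(str(src), str(dst))
        counts[dst.parent] = counts.get(dst.parent, 0) + 1
    return max(counts.values(), default=0)
```

Calling `rearrange(Path("crypto"))` on a flat folder moves the files in place; the return value corresponds to the "files per folder" figure mentioned above.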

There seems to be some overhead (I guess it depends on the underlying fs) due to the many small files: with the latest dump, the zipped folder is 1.2G, while the unpacked files occupy 1.5G on ext4.
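
One way to see how much of that gap comes from per-file allocation (as opposed to whatever the zip itself saves) is to compare the apparent size of the files with the space actually allocated for them. The following is a minimal sketch, assuming a POSIX system where `st_blocks` is reported in 512-byte units; the name `disk_usage` is hypothetical, and directory entries themselves are not counted.

```python
import os
from typing import Tuple


def disk_usage(root: str) -> Tuple[int, int]:
    """Return (apparent_bytes, allocated_bytes) for all files under root."""
    apparent = 0
    allocated = 0
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            st = os.stat(os.path.join(dirpath, name), follow_symlinks=False)
            apparent += st.st_size
            # st_blocks counts 512-byte units on POSIX (not available on Windows)
            allocated += st.st_blocks * 512
    return apparent, allocated


if __name__ == "__main__":
    apparent, allocated = disk_usage("dump")  # "dump" is the folder name used in this thread
    if apparent:
        print(f"apparent={apparent} allocated={allocated} "
              f"overhead={(allocated - apparent) / apparent:.1%}")
```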

On the bright side, the code is trivial and there is no need for a special tool to manage blobs.

Note: this is untested, as I can't seem to run orbital on Linux properly (it gets stuck well before anything touches samu).

@AlexAltea
Owner

The issue with storing files as-is, even in a nested folder structure, is that it doesn't scale well either. Essentially, it replaces one set of issues (e.g. poor performance during modifications) with another set of issues (e.g. poor performance during copying).

I'm thankful that you decided to tackle this problem, but ultimately the solution is finding (or designing) a file format that is optimized for fast access/modification (unlike ZIP files) of many small buffers indexed by >=128-bit keys, and that can be easily copied around (unlike raw filesystem storage).

Also, I'm reopening issue https://github.com/AlexAltea/orbital/issues/19, since it's clear we are hitting bottlenecks again.

@jfhs
Author

jfhs commented Jul 3, 2019

@AlexAltea as far as I can tell from the discussion in https://github.com/AlexAltea/orbital/issues/19, such a format should be something like: raw content + a search structure (I proposed a BST).
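
For concreteness, a "raw content + search structure" container could look roughly like the sketch below. It swaps the BST for a sorted index searched with binary search, which is simpler to write down and gives the same logarithmic lookup. The layout (a count, then fixed-size index entries, then the concatenated blobs), the 16-byte key size, and the names `pack` and `lookup` are all assumptions made for illustration, not anything agreed on in the issue.

```python
import struct

# One index entry: 16-byte key, 8-byte offset into the content area, 4-byte size
ENTRY = struct.Struct("<16sQI")
COUNT = struct.Struct("<I")


def pack(blobs: dict) -> bytes:
    """Serialize {16-byte key: blob bytes} as count + sorted index + raw content."""
    keys = sorted(blobs)
    index = bytearray()
    offset = 0
    for key in keys:
        if len(key) != 16:
            raise ValueError("keys must be exactly 16 bytes")
        index += ENTRY.pack(key, offset, len(blobs[key]))
        offset += len(blobs[key])
    content = b"".join(blobs[key] for key in keys)
    return COUNT.pack(len(keys)) + bytes(index) + content


def lookup(data: bytes, key: bytes):
    """Binary-search the sorted index; return the blob, or None if the key is absent."""
    (count,) = COUNT.unpack_from(data, 0)
    index_start = COUNT.size
    content_start = index_start + count * ENTRY.size
    lo, hi = 0, count
    while lo < hi:
        mid = (lo + hi) // 2
        k, offset, size = ENTRY.unpack_from(data, index_start + mid * ENTRY.size)
        if k < key:
            lo = mid + 1
        elif k > key:
            hi = mid
        else:
            start = content_start + offset
            return data[start:start + size]
    return None
```

A format like this reads fast and copies as a single file, but adding blobs means rewriting the index, which is the kind of bookkeeping a file system already does.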

I almost went ahead and implemented it, but then realized that this is what most file systems implement underneath anyway, and they are usually ready to serve millions of files with ease (look at any node_modules, chances are it has that number of files inside). We just need to leverage the tree structure provided by the FS to improve search/processing speed. It is understandably slow if we just dump all files into one folder, since for most tools that turns the data into one huge array instead of an organized tree.

That said, I'm not sure why you think this is much slower for our use case:

  • Fast random-access reading times - this should be fine (though, as mentioned above, I haven't checked), as this is a normal expectation of any fs
  • Dynamic growth/shrinkage to avoid wasting space - there is a bit of waste (depending on the FS) due to the small file sizes, but it is not drastic (25% on ext4)
  • Compression for individual blobs (optionally) - can be added by giving the file name a different suffix (e.g. .xz) and unpacking it while reading; this way the blob is still accessible
  • Standardized or widely-used format - plain old files :)
  • Fast write/update times given a static batch of blobs (no need for random-access writes though) - this is just copying (or even simply moving the dumper's output) files, which should be fast (see below).
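
The suffix-based compression idea from the list above could be sketched as follows, assuming a blob is stored either raw under its normal name or xz-compressed under the same name with `.xz` appended. The helper name `read_blob` is hypothetical; `lzma` is the Python standard-library module for the xz format.

```python
import lzma
from pathlib import Path


def read_blob(path: Path) -> bytes:
    """Read a blob stored raw at `path` or xz-compressed at `path` + '.xz'."""
    if path.exists():
        return path.read_bytes()
    packed = path.with_name(path.name + ".xz")
    if packed.exists():
        # lzma.decompress auto-detects the .xz container by default
        return lzma.decompress(packed.read_bytes())
    raise FileNotFoundError(path)
```

Either variant can still be inspected with ordinary tools (`xz -d` for the compressed one), which is the accessibility point made in the bullet.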

Here are some results of tests I ran on my machine:

# moving NEW files from PUP decryption to existing dump (this one took you 10 minutes, I guess)
$ time python arrange_dump.py 
Maximum number of files in folder: 1

real	0m0.262s
user	0m0.104s
sys	0m0.157s

# copying latest dump (PUP+selfs)
$ time cp -r dump dump_copy

real	0m2.444s
user	0m0.174s
sys	0m2.223s

# copying latest zipped dump (PUP+selfs)
$ time cp ~/Downloads/blobs_pup.zip .

real	0m0.472s
user	0m0.001s
sys	0m0.471s

Yes, copying is significantly slower, but copy speed isn't listed as a requirement and I don't see why it would be needed often.

If you still feel that a custom format is worth it, I can implement that instead. But from my point of view, the simplicity of plain files outweighs any minor speed gains we would get on "maintenance" tasks. If it sped up the emu at runtime, that would be a different story, but for that we need someone to at least benchmark it against the zip implementation :)
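
As a starting point for such a benchmark, a rough harness like the one below could time the same set of blob reads against a folder of plain files and against a zip archive. It only compares the two storage layouts from Python, so it says nothing definitive about the emulator's own read path; the names `time_reads`, `plain_reader`, and `zip_reader` are hypothetical.

```python
import time
import zipfile
from pathlib import Path
from typing import Callable, Iterable


def time_reads(read: Callable[[str], bytes], names: Iterable[str]) -> float:
    """Return the seconds spent reading every blob in `names` through `read`."""
    start = time.perf_counter()
    for name in names:
        read(name)
    return time.perf_counter() - start


def plain_reader(root: Path) -> Callable[[str], bytes]:
    # Each blob is a separate file; `name` is its path relative to root
    return lambda name: (root / name).read_bytes()


def zip_reader(archive: zipfile.ZipFile) -> Callable[[str], bytes]:
    # The archive is opened once by the caller; each call reads one member by name
    return archive.read
```

Shuffling `names` before each run (for example with `random.shuffle`) approximates random access, and repeating the run helps separate cold-cache from warm-cache timings.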

2 participants