This is a demonstration of how I use rclone to expand the contents of a Backblaze snapshot on B2 into another B2 bucket.
Me! Seriously, I wrote this for my own recall/notes in the future, but I thought I'd share it.
To really answer the question, this is for people who want to do something similar and can use this as a guide. It is not a "tool" per se. It is not designed to be an easy or user-friendly process.
I use Python to do it on a VPS. Python is super readable, so it should be easy enough to (lightly) customize even if you don't know Python. I would say this demonstration is for people who are willing to play around and learn. It is not turn-key.
No idea! I am using a Debian VPS and my restore was from a macOS backup. I suspect any tool would work.
You need rclone and FUSE so you can rclone mount. This is not a guide on either of those.
I am also assuming you've already set up rclone with B2 and/or an additional remote.
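If you want a quick sanity check from Python that rclone is installed and your remotes are configured, something like this (not part of the original walkthrough) works:

import shutil, subprocess

assert shutil.which('rclone'), 'rclone not found on PATH'
print(subprocess.check_output(['rclone', 'listremotes']).decode())  # should include your b2: remote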
I also use the awesome tqdm library, but you can ignore that if you don't want it.
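If you would rather skip tqdm, a simple pattern (my suggestion, not something the original code does) is to fall back to a do-nothing wrapper so the loops below still run unchanged:

try:
    from tqdm import tqdm
except ImportError:
    def tqdm(iterable, **kwargs):  # stand-in: ignore the progress-bar options and just pass through
        return iterable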
Yes! You will be downloading from B2, so you pay egress. It is also very inefficient. I have no idea how bad, but I'd imagine it isn't great! So expect to pay more egress than you're actually using.
My test restore is small, but my main use is for a 200+ GB restore. I want to use my VPS's bandwidth, but my VPS is small (~10 GB free)! So while I am paying more for egress than, say, downloading the restore directly (and especially more than if I were to request a USB drive or download right from Backblaze Personal), it saves me the bandwidth.
Besides just putting it into its own B2 bucket, this process is useful for seeding a different backup tool (including rclone, but really any).
Yes! There are two places. The first and best is to filter which files you include. The second is with rclone filters, but I do not suggest that since you waste the time and expense of extracting the files.
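As a sketch of the first approach (a hypothetical filter; files and restore_prefix are built later in the walkthrough, and the folder name is just an example):

wanted = restore_prefix + 'PyFiSync/'  # made-up subfolder; restore only this
filtered = (f for f in files
            if f.filename.startswith(wanted)
            and not f.filename.endswith('.DS_Store'))  # and skip macOS junk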
I bet! Please share. I like learning new things. This is just what I worked out!
import os,sys
import shutil
import time
import subprocess
import operator
import signal
from pathlib import Path
from zipfile import ZipFile
from tqdm import tqdm # This is 3rd party. $ python -m pip install tqdm
print(subprocess.check_output(['rclone','version']).decode())
print(sys.version)
rclone v1.53.3
- os/arch: linux/amd64
- go version: go1.15.5

3.8.3 (default, Jul 2 2020, 16:21:59) [GCC 7.3.0]
Here we mount the restore bucket. Note: do not add any caching unless you have the scratch space. Since my restore is bigger than my free space, I do not! This is basically a super-vanilla rclone mount. In fact, when I tested with different advanced options, it failed.
There are two ways to do this. The first is to open a new terminal and create the mount there. That works fine, but I will instead do it all within Python and subprocess. With subprocess, the arguments are passed as a list. This is actually really great since you do not have to deal with escaping. And it's easier to comment! If you do run it in a separate terminal, screen is your friend.
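For reference, the standalone-terminal equivalent of the command built below would be roughly `rclone -vv mount b2:b2-snapshots-7f7799daad93/ ~/mount --read-only` (your snapshot bucket will differ).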
mountdir = Path('~/mount').expanduser()
mountdir.mkdir(exist_ok=True)
rclone_remote = 'b2:b2-snapshots-7f7799daad93/' # already set up B2. Found the bucket with `rclone lsf b2:`
restore_zip = 'bzsnapshot_2020-12-17-07-06-19.zip' # found with `rclone lsf b2:b2-snapshots-7f7799daad93/`
cmd = ['rclone',
'-vv', # Optional but may be useful later
'mount',rclone_remote,str(mountdir),
'--read-only',]
stdout,stderr = open('stdout','wb'),open('stderr','wb') # writable in bytes mode. I usually use context managers but I will need this to stay open
mount_proc = subprocess.Popen(cmd,stdout=stdout,stderr=stderr)
Make sure it mounted. This is optional.
print('Waiting for mount ',flush=True)
for ii in range(10):
    if os.path.ismount(mountdir):
        break
    if mount_proc.poll() is not None:
        raise ValueError('did not mount')
    time.sleep(1)
    print('.',end='',flush=True)
else:
    print('ERROR: Mount did not activate. Kill proc and exiting',file=sys.stderr,flush=True)
    mount_proc.kill()
    sys.exit(2)
print('mounted')
Waiting for mount ......mounted
Python's zipfile will not read the entire file in order to get a listing or even some random file inside. Don't believe me? See the bottom!
What we need to do now is get a list of the files and use manual inspection to decide what to cut. Backblaze uses the full path (from the root of the drive).
with ZipFile(mountdir/restore_zip) as zf:
    files = zf.infolist() # could also do namelist() but we will want the sizes later
len(files)
2149
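Not in the original run, but handy: each ZipInfo carries the uncompressed size, so you can total up how much data you are about to pull out (and get a feel for the egress involved).

total_bytes = sum(f.file_size for f in files)
print(f'{len(files)} files, {total_bytes/2**30:.2f} GiB uncompressed')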
Pick a random file to get the path. We will use this later.
files[1000].filename
'Macintosh HD/Users/jwinkMAC/PyFiSync/Papers/Sorted/2010/2018/melchers2018structural.pdf'
Identify and save the prefix, as you want it removed.
restore_prefix = 'Macintosh HD/Users/jwinkMAC/' # We will need this later to reupload
This is actually super easy! Just search through files to find the file you want. Let's assume it is still the 1000th file.
restore_file = files[1000]
restore_dir = Path('~/restore').expanduser()
with ZipFile(mountdir/restore_zip) as zf:
    zf.extract(restore_file,path=str(restore_dir))
Inside the zip file, each file has its full prefixed path (from root). I don't want that.
# Optional. Remove prefix
src = restore_dir / restore_file.filename
dst = restore_dir / os.path.relpath(src,restore_dir / restore_prefix)
dst.parent.mkdir(parents=True,exist_ok=True)
shutil.move(src,dst)
PosixPath('/home/jwink3101/restore/PyFiSync/Papers/Sorted/2010/2018/melchers2018structural.pdf')
Now, this could almost certainly use improvement. We will do the following:

- Gather the files into batches
- `copy` (not `sync`) to push those files
- Remove `restore_prefix` so we do not keep that junk

Note that we may be able to optimize this by better backfilling the batches, but I am not sure if there are any advantages with sequential reading, so I will go one file after the other. It may be moot.
# Tool to gather the files into batches
def group_to_size(seq,maxsize,key=None):
    """
    Group seq by size up to but not to exceed
    maxsize (unless a single item does)

    Example:
    >>> list(group_to_size([10,20,10,90,40,50,99,2,101,0,30,90,11],100))
    [(10, 20, 10), (90,), (40, 50), (99,), (2,), (101,), (0, 30), (90,), (11,)]
    """
    s = 0
    curr = []
    for item in seq:
        s0 = key(item) if callable(key) else item
        if s + s0 > maxsize: # Yield if will be pushed over
            yield tuple(curr)
            curr = []
            s = 0
        s += s0
        curr.append(item)
    if curr:
        yield tuple(curr) # Anything remaining
maxsize = 512 * 1024 * 1024 # 512 MiB or 536870912 bytes
# dest_remote = 'b2:mynewbuckets/whatever'
dest_remote = '/home/jwink3101/restore/tmp/'
scratch = Path('~/scratch').expanduser().absolute()
scratch.mkdir(parents=True,exist_ok=True)
# This is where you can filter stuff
# filtered = (f for f in files if ...)
filtered = files # No filter
batches = group_to_size(filtered,maxsize,key=operator.attrgetter('file_size'))
with ZipFile(mountdir/restore_zip) as zf:
    for ib,batchfiles in enumerate(batches):
        print('batch',ib,'# files',len(batchfiles))

        # Extract all of the files
        for file in tqdm(batchfiles):
            zf.extract(file,path=str(scratch))

        print('calling rclone')
        cmd = ['rclone',
               'move', # use move so they get deleted
               str(scratch / restore_prefix), dest_remote,
               '--transfers','20', # and/or other flags. all optional.
               ]
        subprocess.check_call(cmd)
0%| | 0/185 [00:00<?, ?it/s]
batch 0 # files 185
100%|██████████| 185/185 [00:30<00:00, 6.16it/s] 0%| | 0/146 [00:00<?, ?it/s]
calling rclone batch 1 # files 146
100%|██████████| 146/146 [00:32<00:00, 4.47it/s] 0%| | 0/97 [00:00<?, ?it/s]
calling rclone batch 2 # files 97
100%|██████████| 97/97 [00:29<00:00, 3.24it/s]
calling rclone
1%| | 2/187 [00:00<00:12, 15.09it/s]
batch 3 # files 187
100%|██████████| 187/187 [00:31<00:00, 5.87it/s] 0%| | 0/126 [00:00<?, ?it/s]
calling rclone batch 4 # files 126
100%|██████████| 126/126 [00:28<00:00, 4.39it/s] 0%| | 0/141 [00:00<?, ?it/s]
calling rclone batch 5 # files 141
100%|██████████| 141/141 [00:27<00:00, 5.05it/s]
calling rclone
0%| | 0/78 [00:00<?, ?it/s]
batch 6 # files 78
100%|██████████| 78/78 [00:15<00:00, 4.90it/s] 0%| | 0/54 [00:00<?, ?it/s]
calling rclone batch 7 # files 54
100%|██████████| 54/54 [00:29<00:00, 1.84it/s] 0%| | 0/138 [00:00<?, ?it/s]
calling rclone batch 8 # files 138
100%|██████████| 138/138 [00:29<00:00, 4.68it/s] 0%| | 0/730 [00:00<?, ?it/s]
calling rclone batch 9 # files 730
100%|██████████| 730/730 [00:29<00:00, 24.90it/s] 0%| | 0/101 [00:00<?, ?it/s]
calling rclone batch 10 # files 101
100%|██████████| 101/101 [00:30<00:00, 3.30it/s] 0%| | 0/137 [00:00<?, ?it/s]
calling rclone batch 11 # files 137
100%|██████████| 137/137 [00:28<00:00, 4.84it/s] 0%| | 0/29 [00:00<?, ?it/s]
calling rclone batch 12 # files 29
100%|██████████| 29/29 [00:11<00:00, 2.58it/s]
calling rclone
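One thing the batch loop does not do: rclone move deletes the files it uploads, but it can leave empty directories behind in the scratch area (unless you also pass --delete-empty-src-dirs to the move). A minimal cleanup, assuming nothing else lives in scratch:

shutil.rmtree(scratch, ignore_errors=True)  # remove any leftover (now empty) directory tree
scratch.mkdir(parents=True, exist_ok=True)  # recreate it if you plan to run more batches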
mount_proc.send_signal(signal.SIGINT)
mount_proc.wait() # Hopefully this works. Otherwise you may need to kill it manually
stdout.close()
stderr.close()
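If you are worried that the bare wait() above could hang when the mount does not stop cleanly, an alternative is standard subprocess escalation (a sketch; I did not need it in this run):

mount_proc.send_signal(signal.SIGINT)
try:
    mount_proc.wait(timeout=30)
except subprocess.TimeoutExpired:
    mount_proc.terminate()  # SIGTERM
    try:
        mount_proc.wait(timeout=10)
    except subprocess.TimeoutExpired:
        mount_proc.kill()   # SIGKILL as a last resort
# If the directory is still mounted, `fusermount -u ~/mount` from a shell should clear it.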
Python's ZipFile will read into a zip file without reading the entire file. It does need to "seek" in the file, hence the mount, but rclone handles that like a champ.
How do I know I'm not downloading the entire file? Well, you could look at the rclone logs. The other way is to make a file-object that will be verbose about what's going on. Note that ZipFile takes either a filename or a file-like object.
import io
class VerboseFile(io.FileIO):
    def read(self,*args,**kwargs):
        print('read',*args,**kwargs)
        r = super(VerboseFile,self).read(*args,**kwargs)
        print(' len:',len(r))
        return r
    def seek(self,*args,**kwargs):
        print('seek',*args,**kwargs)
        return super(VerboseFile,self).seek(*args,**kwargs)
    def close(self,*args,**kwargs):
        print('close')
        return super(VerboseFile,self).close(*args,**kwargs)
Then, instead of
with ZipFile(mountdir/restore_zip) as zf:
    ...
do
with ZipFile(VerboseFile(mountdir/restore_zip)) as zf:
    ...
and you'll be able to see everything.
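For example, listing the archive this way only triggers a few seeks and reads near the end of the file (where the zip central directory lives) rather than a read of the whole archive:

with ZipFile(VerboseFile(mountdir/restore_zip)) as zf:
    names = zf.namelist()
print(len(names), 'files listed without reading the whole archive')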