The central data structure provided by the library is the BlobPath type. It abstracts away the internals of how the file is stored, and works in a cloud-agnostic manner.
Note that you need to install the aws extra to work with S3 paths:
pip install 'blob-path[aws]'
from blob_path.backends.s3 import S3BlobPath
from pathlib import PurePath
bucket_name = "narang-public-s3"
object_key = PurePath("hello_world.txt")
region = "us-east-1"
blob_path = S3BlobPath(bucket_name, region, object_key)
A blob path is simply a path representation, like pathlib.Path; the file it points to is not required to exist. You can check for existence using exists:
blob_path.exists()
True
The main method that BlobPath provides is open, which mimics the built-in open function to some extent. This method is the central abstraction: many operations are handled in a generic way on top of it.
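To make that concrete, here is a minimal sketch showing how code written against open stays backend-agnostic (read_greeting is a hypothetical helper, not part of the library):
def read_greeting(path) -> str:
    # works for any backend (S3BlobPath, LocalRelativeBlobPath, ...)
    # because it relies only on the shared `open` interface
    with path.open("r") as f:
        return f.read()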
Let's write something to the object in our bucket:
with blob_path.open("w") as f:
f.write("hello world")
# the file now exists in S3; check your bucket to verify
blob_path.exists()
True
S3 and other cloud storage blob paths can be fully serialised and deserialised. You can pass these path objects across processes (and servers) and still locate the file.
# a single blob path can be serialised using the method `serialise`
blob_path.serialise()
{'kind': 'blob-path-aws', 'payload': {'bucket': 'narang-public-s3', 'region': 'us-east-1', 'object_key': ['hello_world.txt']}}
# let's deserialise it now
# deserialise is a standalone function: pass it any kind of serialised blob path and it will reconstruct the right backend
from blob_path.deserialise import deserialise
deserialised_s3_blob = deserialise(
{
"kind": "blob-path-aws",
"payload": {
"bucket": "narang-public-s3",
"region": "us-east-1",
"object_key": ["hello_world.txt"],
},
}
)
deserialised_s3_blob
kind=blob-path-aws bucket=narang-public-s3 region=us-east-1 object_key=hello_world.txt
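Because the serialised form is a plain dict, it can cross any JSON boundary: a task queue, an HTTP request, a database column. A small sketch of the round trip (the variable names are just for illustration):
import json

# serialise -> JSON string -> dict -> blob path; the revived path
# points at the same S3 object as the original
wire = json.dumps(blob_path.serialise())
revived = deserialise(json.loads(wire))
revived.exists()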
Let's try another path backend: LocalRelativeBlobPath. This path models a relative path on the local FS, always rooted at a single root directory. Say you store all your application's files inside a single directory, "/tmp/my-apps-files". In that case, instead of using pathlib.Path, you could use LocalRelativeBlobPath (this allows you to easily switch between cloud storage and local storage for your files).
from blob_path.backends.local_relative import LocalRelativeBlobPath
# PurePath is a simple path representation; it does not care whether the path actually exists in your FS
# It's useful for logically representing path-like data; for example, S3 object keys can be represented as `PurePath`s
from pathlib import PurePath
relpath = PurePath("local") / "storage.txt"
local_blob = LocalRelativeBlobPath(relpath)
local_blob.exists()
---------------------------------------------------------------------------
Exception                                 Traceback (most recent call last)
Cell In[8], line 1
----> 1 local_blob.exists()

File ~/Desktop/personal/blob-path/src/blob_path/backends/local_relative.py:74, in LocalRelativeBlobPath.exists(self)
     73 def exists(self) -> bool:
---> 74     return (self._p()).exists()

File ~/Desktop/personal/blob-path/src/blob_path/backends/local_relative.py:94, in LocalRelativeBlobPath._p(self)
     93 def _p(self) -> Path:
---> 94     return _get_implicit_base_path() / self._relpath

File ~/Desktop/personal/blob-path/src/blob_path/backends/local_relative.py:110, in _get_implicit_base_path()
    109 def _get_implicit_base_path() -> Path:
--> 110     base_path = Path(get_implicit_var(BASE_VAR))
    111     base_path.mkdir(exist_ok=True, parents=True)
    112     return base_path

File ~/Desktop/personal/blob-path/src/blob_path/implicit.py:30, in get_implicit_var(var)
     28 result = _PROVIDER(var)
     29 if result is None:
---> 30     raise Exception(
     31         "tried fetching implicit variable from environment "
     32         + f"but the var os.environ['{var}'] does not exist"
     33     )
     34 return result

Exception: tried fetching implicit variable from environment but the var os.environ['IMPLICIT_BLOB_PATH_LOCAL_RELATIVE_BASE_DIR'] does not exist
Uh oh, we got an error, and really early too ;_;
It says that we have not defined IMPLICIT_BLOB_PATH_LOCAL_RELATIVE_BASE_DIR in our environment. This environment variable stores the root directory for your relative paths.
from pathlib import Path
import os
os.environ["IMPLICIT_BLOB_PATH_LOCAL_RELATIVE_BASE_DIR"] = str(
Path.home() / "tmp" / "local_fs_root"
)
# it runs now, and reports that the file does not exist yet
local_blob.exists()
False
So why does LocalRelativeBlobPath take the root directory from an environment variable? Couldn't we pass it in __init__?
We could argue for that, but then the path would be pretty much the same as any absolute path. Even the serialised representation of LocalRelativeBlobPath leaves out the root directory (it's not part of the path representation).
Variables like this, which modify the behavior of a BlobPath, are called implicit variables. By default they are picked up from the environment. Fetching the root directory from the environment has multiple benefits: the serialised path stays portable across machines, and you can repoint every relative path by changing a single variable. Implicit variables change the behavior and location of your blobs implicitly (hah! perfect naming). Every implicit variable follows the naming convention IMPLICIT_BLOB_PATH_<BACKEND>_...
Currently, only LocalRelativeBlobPath has implicit variables.
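To illustrate the portability benefit, here is a sketch; it assumes LocalRelativeBlobPath serialises like the cloud backends (as noted above, its payload omits the root), and /srv/app-files is a made-up directory. Don't run this mid-notebook, since it repoints your root:
# the payload carries only the relative path, not the root directory
payload = local_blob.serialise()

# on another machine (or process) with a different root, the same payload
# resolves under that machine's own base directory
os.environ["IMPLICIT_BLOB_PATH_LOCAL_RELATIVE_BASE_DIR"] = "/srv/app-files"
relocated = deserialise(payload)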
Let's do a simple copy operation between an S3 path and a local path
import shutil
# the long way
with deserialised_s3_blob.open("r") as fr, local_blob.open("w") as fw:
    shutil.copyfileobj(fr, fw)
with local_blob.open("r") as f:
print(f.read())
hello world
Let's use a shortcut now. Whenever possible, prefer the library's shortcuts for these operations. Currently they only provide ease of use, but they let us optimise special cases later (for example, a copy between two S3 blobs could be performed as a remote copy with boto3, without routing the data through your local machine).
# delete the local file first, then copy again using the `cp` method on the path
local_blob.delete()
deserialised_s3_blob.cp(local_blob)
with local_blob.open("r") as f:
print("local blob content copied from s3:", f.read())
# using the `cp` shortcut from the library
# this shortcut is more flexible: either `src` or `dest` can be a plain `pathlib.Path` too
# this makes it easy to mix blob paths with normal paths in your FS (see the sketch after this example)
from blob_path.shortcuts import cp
local_blob.delete()
cp(deserialised_s3_blob, local_blob)
with local_blob.open("r") as f:
print("copied using shortcut:", f.read())
local blob content copied from s3: hello world
copied using shortcut: hello world
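As the comment above mentions, cp also accepts plain pathlib.Path objects on either side; a quick sketch (the destination file path here is hypothetical):
from pathlib import Path

# download the S3 object into an ordinary local file; `cp` treats the
# pathlib.Path destination like any other path
cp(deserialised_s3_blob, Path("/tmp/hello_world_download.txt"))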
Let's play a bit with an Azure path now. If you want, you can swap in any of the other backends; this simply shows that everything works the same with Azure paths. We will copy data from the S3 path to the Azure path. You will need to install the azure extra:
pip install 'blob-path[azure]'
from blob_path.backends.azure_blob_storage import AzureBlobPath
from pathlib import PurePath
destination = AzureBlobPath(
    "narang99blobstore",
    "testcontainer",
    PurePath("copied") / "from" / "s3.txt",
)
deserialised_s3_blob.cp(destination)
destination.exists()
True
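Since every backend exposes the same interface, the reverse direction needs no new code either; a sketch copying the Azure blob back to our local-relative path:
# copy the Azure blob back down to the local-relative path with the same shortcut
cp(destination, local_blob)
with local_blob.open("r") as f:
    print(f.read())  # hello world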