Using S3 Just Like a Local File System in Python

Because the s3fs module implements the same file operations Python uses locally.

So if you happen to currently run a Python app and write things to a local file via:

```python
with open(path, "w") as f:
    write_to(f)
```

you can write this to S3 simply by replacing it with:

```python
with s3.open(bucket + path, "w") as f:
    write_to(f)
```
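To make that concrete, here is a minimal runnable sketch; the bucket name and the write_to helper are assumptions for illustration, and s3 is an s3fs file system object (more on creating one below):

```python
import s3fs

# "Mount" S3 as a file-system-like object (uses your AWS credentials).
s3 = s3fs.S3FileSystem(anon=False)


def write_to(f):
    # Hypothetical helper standing in for whatever your app writes.
    f.write("hello from s3fs\n")


# "my-bucket/test.txt" is an assumed bucket and path.
with s3.open("my-bucket/test.txt", "w") as f:
    write_to(f)
```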

Of course S3 already has good Python integration through boto3, so why bother wrapping a POSIX-like module around it? For me, the question is what kind of mental model you want to have.

s3fs makes for a simpler transition, because you usually move from storing data locally to storing it in the cloud.

This switch doesn’t require you to change how you think about your app and your data.

If you happen to hit the limits of s3fs, it of course makes sense to either sync data to S3 in batches or use a different way of loading your persistent data.

My prime use case for this is machine learning.

Loading and saving models at evaluation stages, and loading and saving data at various stages of processing, are typically done first on a single machine and then in the cloud, possibly on more than one machine, with the need for some form of persistent storage.

The pros of the two concepts:

Pro s3fs:
- less space needed on the device, instance, host, or container
- a simpler mental model

Pro loading via boto3:
- speed of execution: data is stored locally, so no bandwidth is spent on every load into memory
- possibly more immutability: data only changes on sync, not on every read/write

Installing and Starting Out

s3fs is pip-installable, so just run pip install s3fs, import s3fs into your script, and you’re ready to go.

All actions require you to “mount” the S3 file system, which you can do via

```python
fs = s3fs.S3FileSystem(anon=False)  # access all buckets your credentials allow
```

or

```python
fs = s3fs.S3FileSystem(anon=True)  # access all public buckets
```

You can test things with a simple listing.

Remember, s3fs mimics a POSIX-style file system, so the usual Unix commands like ls, cat, and touch have equivalents:

```python
fs.ls("…")  # displays the contents of a bucket
```

works, as does

```python
fs.touch("…/test.txt")  # should put a 0-byte file into your bucket
```

Now let’s do something useful with this.

Example 1: A CLI to Upload a Local Folder

This CLI uses fire, a super slim CLI generator, and s3fs. It recursively syncs all data in a local directory tree to a bucket.
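The original gist isn’t reproduced here, so the following is a minimal sketch of what such a CLI could look like; the function name to_s3 matches the command used below, and fs.put with recursive=True does the actual upload:

```python
# filename.py -- a minimal sketch using fire and s3fs
import fire
import s3fs


def to_s3(local_folder, s3_path):
    """Recursively upload local_folder to s3_path."""
    fs = s3fs.S3FileSystem(anon=False)  # picks up your boto3 credentials
    fs.put(local_folder, s3_path, recursive=True)  # recursive copy to S3


if __name__ == "__main__":
    fire.Fire()  # exposes to_s3 (and any other function here) as a CLI command
```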

In the console you can now run

```
python filename.py to_s3 local_folder s3://bucket
```

to start the CLI.

Note that this assumes you have your credentials stored somewhere boto3 looks for them.

Boto3 resolves credentials in this order:

1. Things passed to S3FileSystem, e.g. via access_token, which are then passed on to boto3.client(), …
2. Environment variables, which I usually use (see the sketch below),
3./4. Shared credentials files, config files, etc. … (see the boto3 documentation for more information)
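For option 2, here is a quick sanity check that the variables are set; AWS_ACCESS_KEY_ID and AWS_SECRET_ACCESS_KEY are the standard names boto3 reads:

```python
import os

# Verify the standard AWS environment variables boto3 reads are present.
for var in ("AWS_ACCESS_KEY_ID", "AWS_SECRET_ACCESS_KEY"):
    print(var, "is set" if var in os.environ else "is MISSING")
```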

Example 2: Writing and Loading Machine Learning Models

Another good use case is to save and load machine learning models.

They are usually too big to store in version control, but you do want to save them regularly, because, well, sometimes things crash.

One thing you could do, for instance, is to save:

1. the model object, pickled,
2. the parameters in some dict used to create it, describe it, whatever you want to remember, possibly as JSON so you can read it in plain text,
3. results, scores, etc.

Here’s a small example.
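The original gist isn’t reproduced here, so this is a minimal sketch; the bucket path, the parameter dict, and the stand-in model object are assumptions:

```python
import json
import pickle

import s3fs

fs = s3fs.S3FileSystem(anon=False)

# Stand-ins for a real fitted model, its parameters, and its scores.
model = {"weights": [0.1, 0.2, 0.3]}        # hypothetical model object
params = {"n_estimators": 100, "seed": 42}  # hypothetical parameters
results = {"accuracy": 0.92}                # hypothetical results

base = "my-bucket/models/run-1"  # assumed bucket and path

# 1. the model object, pickled
with fs.open(base + "/model.pkl", "wb") as f:
    pickle.dump(model, f)

# 2. the parameters as JSON, readable in plain text
with fs.open(base + "/params.json", "w") as f:
    json.dump(params, f)

# 3. results, scores etc.
with fs.open(base + "/results.json", "w") as f:
    json.dump(results, f)
```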

Example 3: Writing a Pandas DataFrame to S3

Another common use case is to write data to S3 after preprocessing. Suppose we just did a bunch of word magic on a DataFrame with texts, like converting it to bag-of-words, tf-idf, etc. We would then want to save this DataFrame, and possibly the tokenizer, to S3 with the following code:

That’s it! Start playing around with this module. You can find the module and the three gists here:

s3fs: https://github.com/dask/s3fs
Gists: https://gist.github.com/sbalnojan
