
Pipe data from STDIN to Amazon S3

On Fri, 22 Nov 2013 22:00:29 +0100 by Falco Nordmann

When looking for a solution to store offsite backups during my travels, I wanted to upload incremental as well as encrypted backups to Amazon S3, since the API is simple to use and the pricing is affordable, especially when using Amazon's lifecycle management to transfer S3 objects to Amazon Glacier automatically. When looking deeper into this topic and evaluating the available tools, I discovered that most S3 clients are not able to stream data piped into them via STDOUT / STDIN directly (using only RAM) to an S3 bucket, but need to store the whole data set as a file on your hard drive before they can upload it. The main reason for this limitation is that Amazon's S3 API expects the size (Content-Length) of the data to be stored before receiving the data itself.

After further research into tools addressing this problem I found js3tream, which tries to work around it by splitting the input stream into chunks of fixed size and storing them as separate objects in the given S3 bucket. But when I tried to work with this tool, I got nothing but a bunch of exceptions, and even though my Java is acceptable I was not able to spot the root of the problem. However, I noticed that since 2010 Amazon has supported a way to upload files in multiple chunks that is more suitable than the one js3tream uses: Amazon's S3 API offers multipart uploads, which let you upload an object in multiple parts and expect the Content-Length of each part instead of the Content-Length of the whole object. Using this mechanism, it is possible to split the input stream into chunks and upload each chunk until the end of the input stream is reached.

When searching for tools supporting this feature, I only found clients that read from files instead of STDIN. While s3cmd is able to stream data stored on S3 to STDOUT, it is not able to stream data from STDIN to S3, even though the developers have announced plans to integrate this feature. Since I could not find any client suitable for my needs, I wrote a Python script that uses boto's S3 interface to upload data read from STDIN chunk by chunk, using Amazon's multipart upload API.
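To illustrate the mechanism, here is a minimal sketch of such a multipart upload with boto (the 2.x API). This is not the 2s3 script itself; the bucket name, key name and the 50 MB chunk size are placeholders chosen for the example.

#!/usr/bin/env python
# Sketch: read STDIN in fixed-size chunks and upload each chunk as one part
# of an S3 multipart upload. Bucket/key names and chunk size are placeholders.
import io
import sys
import boto

CHUNK_SIZE = 50 * 1024 * 1024                     # every part except the last must be at least 5 MB

stdin = getattr(sys.stdin, 'buffer', sys.stdin)   # binary STDIN on both Python 2 and 3

conn = boto.connect_s3()                          # credentials from the environment or ~/.boto
bucket = conn.get_bucket('com-example-backup-home')
mp = bucket.initiate_multipart_upload('backup.0')

try:
    part_num = 0
    while True:
        chunk = stdin.read(CHUNK_SIZE)            # buffer one part in RAM
        if not chunk:
            break
        part_num += 1
        mp.upload_part_from_file(io.BytesIO(chunk), part_num=part_num)
    if part_num:
        mp.complete_upload()                      # S3 assembles the parts into a single object
    else:
        mp.cancel_upload()                        # nothing was read from STDIN
except Exception:
    mp.cancel_upload()                            # avoid leaving orphaned parts behind
    raise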

The script can be found on GitHub. To use it for incremental backups, you can follow the tips given in the tar HowTo of the js3tream project. To encrypt your backups, you can pipe the data generated by tar through gpg before uploading it:

# tar -g /etc/backup/home/diff -C / -vcpjO /home | \
> gpg -r com-example-backup-home -e | \
> 2s3 -k /etc/backup/home/aws-key -b com-example-backup-home -o backup.0

This example assumes that the file keeping track of the changes in your filesystem, as well as the file containing the AWS credentials for access to your S3 bucket, are stored in /etc/backup/home/, and that you have generated a GPG keypair named com-example-backup-home in the user's GPG keyring (# gpg --gen-key).

By running this command and incrementing the object's name (backup.1, backup.2, etc.) on every run, you can build an incremental, encrypted backup of your users' home directories.
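If you want to automate the numbering, a small helper along the following lines can look up the highest existing backup.N key and print the next name to use. This is a hypothetical helper, not part of the 2s3 script; it again uses boto and the bucket name from the example above.

#!/usr/bin/env python
# Hypothetical helper: print the next unused backup.N key name in the bucket.
import boto

conn = boto.connect_s3()                          # credentials from the environment or ~/.boto
bucket = conn.get_bucket('com-example-backup-home')

# Collect the numeric suffixes of all existing backup.N objects.
indices = [int(key.name.rsplit('.', 1)[1])
           for key in bucket.list(prefix='backup.')
           if key.name.rsplit('.', 1)[1].isdigit()]

# Print the index after the highest existing one (backup.0 if none exist yet).
print('backup.%d' % (max(indices) + 1 if indices else 0))

Its output could then be passed to 2s3's -o option.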

To restore your backup, you can read the object(s) back from S3 using s3cmd and pipe the data through gpg and tar in the reverse order:

# s3cmd get s3://com-example-backup-home/backup.0 - | gpg -d | tar -g /dev/null -C / -xvj
# s3cmd get s3://com-example-backup-home/backup.1 - | gpg -d | tar -g /dev/null -C / -xvj
...

Before using s3cmd, you have to configure it to connect to your S3 bucket, typically by running s3cmd --configure. Have a look at the s3tools documentation for further information.
