Uploading a large number of files to Amazon S3

By Chris Lamb

I recently had to upload a large number (~1 million) of files to Amazon S3.

My first attempts revolved around s3cmd (and subsequently s4cmd), but both projects seem to be based around analysing all the files first rather than blindly uploading them. Not only does this require a large amount of memory; non-trivial experimentation, fiddling and patching is also needed to avoid unnecessary stat(2) calls. I even tried a simple find | xargs -P 5 s3cmd put [..], but I just didn't trust it to handle errors correctly.
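
For the curious, that kind of xargs pipeline looks roughly like the following when run from the source directory. The bucket name is a placeholder and the s3cmd options are elided above, so treat this as a sketch rather than the exact command I ran:

$ find . -type f | sed 's|^\./||' |
    xargs -P 5 -I{} s3cmd put {} s3://my-bucket/{}

The sed strips the leading ./ so the S3 key mirrors each file's relative path, and -I{} makes xargs run one s3cmd invocation per file, five at a time.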

I finally alighted on s3-parallel-put, which worked out well. Here's a brief rundown on how to use it:

  1. First, change to your source directory. This ensures that the filenames created in your S3 bucket are not prefixed with the directory structure of your local filesystem: whilst s3-parallel-put has a --prefix option, it is ignored if you pass a fully-qualified source, i.e. one starting with a /.
  2. Run with --dry-run --limit=1 and check that the resulting filenames are what you expect:
$ export AWS_ACCESS_KEY_ID=FIXME
$ export AWS_SECRET_ACCESS_KEY=FIXME
$ /path/to/bin/s3-parallel-put \
    --bucket=my-bucket \
    --host=s3.amazonaws.com \
    --put=stupid \
    --insecure \
    --dry-run --limit=1 \
    .
[..]
INFO:s3-parallel-put[putter-21714]:./yadt/profile.Profile/image/circle/807.jpeg -> yadt/profile.Profile/image/circle/807.jpeg
[..]
  3. Remove --dry-run --limit=1 and let it roll; the full invocation is shown below.
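
The full run is then simply the same invocation without the test flags, something along these lines (with the usual placeholder values):

$ cd /path/to/source   # step 1: relative paths become the S3 key names
$ export AWS_ACCESS_KEY_ID=FIXME
$ export AWS_SECRET_ACCESS_KEY=FIXME
$ /path/to/bin/s3-parallel-put \
    --bucket=my-bucket \
    --host=s3.amazonaws.com \
    --put=stupid \
    --insecure \
    .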

Friday 9th January 2015