I recently had to upload a large number (~1 million) of files to Amazon S3.
My first attempts revolved around s3cmd (and subsequently s4cmd), but both projects seem to be based around analysing all the files first rather than blindly uploading them. This not only requires a large amount of memory, it also calls for non-trivial experimentation, fiddling and patching to avoid unnecessary stat(2) calls. I even tried a simple find | xargs -P 5 s3cmd put [..], but I just didn't trust the error handling.
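For reference, the kind of pipeline I mean looked something like the following minimal sketch (the bucket name and the parallelism of 5 are placeholders):

# Each file is handed to its own s3cmd process, five at a time; any failure
# is only surfaced via that process's exit status (and, ultimately, xargs'),
# and quirks such as the leading "./" in the key names would also need
# handling.
$ find . -type f -print0 | \
      xargs -0 -P 5 -I{} s3cmd put {} s3://my-bucket/{}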
I finally alighted on s3-parallel-put, which worked out well. Here's a brief rundown on how to use it:
- First, change to your source directory. This ensures that the filenames created in your S3 bucket are not prefixed with the directory structure of your local filesystem: whilst s3-parallel-put has a --prefix option, it is ignored if you pass a fully-qualified source, i.e. one starting with a /.
- Next, run with --dry-run --limit=1 and check that the resulting filenames will be correct:
$ export AWS_ACCESS_KEY_ID=FIXME
$ export AWS_SECRET_ACCESS_KEY=FIXME
$ /path/to/bin/s3-parallel-put \
      --bucket=my-bucket \
      --host=s3.amazonaws.com \
      --put=stupid \
      --insecure \
      --dry-run --limit=1 \
      .
[..] INFO:s3-parallel-put[putter-21714]:./yadt/profile.Profile/image/circle/807.jpeg -> yadt/profile.Profile/image/circle/807.jpeg [..]
- Remove --dry-run --limit=1, and let it roll.
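For completeness, the final invocation should then look something like this (same placeholder bucket, host and credentials as in the dry run above):

# Same command as before, minus the dry-run safety flags.
$ /path/to/bin/s3-parallel-put \
      --bucket=my-bucket \
      --host=s3.amazonaws.com \
      --put=stupid \
      --insecure \
      .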