Uploading a large number of files to Amazon S3

I recently had to upload a large number (~1 million) of files to Amazon S3.

My first attempts revolved around s3cmd (and subsequently s4cmd), but both projects seem to be built around analysing all the files first rather than blindly uploading them. Not only does this require a large amount of memory, it also takes non-trivial experimentation, fiddling and patching to avoid unnecessary stat(2) calls. I even tried a simple find | xargs -P 5 s3cmd put [..], but I just didn't trust it to handle errors correctly.
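
Roughly, that attempt looked like this (my-bucket and the exact s3cmd arguments here are only placeholders for the [..] above):

$ find . -type f -print0 \
    | xargs -0 -P 5 -I {} s3cmd put {} "s3://my-bucket/{}"

If any single put fails, all you get back is an aggregate non-zero exit status from xargs, which is not something I wanted to rely on for a million files.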

I finally alighted on s3-parallel-put, which worked out well. Here's a brief rundown on how to use it:

  1. First, change to your source directory. This is to ensure that the filenames created in your S3 bucket are not prefixed with the directory structure of your local filesystem — whilst s3-parallel-put has a --prefix option, it is ignored if you pass a fully-qualified source, i.e. one starting with a /.
  2. Run with --dry-run --limit=1 and check that the resulting filenames will be correct:
$ export AWS_ACCESS_KEY_ID=FIXME
$ export AWS_SECRET_ACCESS_KEY=FIXME
$ /path/to/bin/s3-parallel-put \
    --bucket=my-bucket \
    --host=s3.amazonaws.com \
    --put=stupid \
    --insecure \
    --dry-run --limit=1 \
    .
[..]
INFO:s3-parallel-put[putter-21714]:./yadt/profile.Profile/image/circle/807.jpeg -> yadt/profile.Profile/image/circle/807.jpeg
[..]
  3. Remove --dry-run --limit=1 and let it roll (the final invocation is shown below for reference).
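
That is, the same command as before minus the two dry-run flags:

$ /path/to/bin/s3-parallel-put \
    --bucket=my-bucket \
    --host=s3.amazonaws.com \
    --put=stupid \
    --insecure \
    .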

Comments (6)

Anonymous

Why --insecure?

Jan. 9, 2015, 8:45 p.m. #
Good spot. It disables SSL, which I simply guessed would be faster. This worked for me as the assets were all public anyway.
Gabriel Dib

Hey C,

Part of a project I am doing consists of uploading around 4 million files to S3, let's say 6 TB of data. Although the uplink speed is not my problem, it looks like S3 clients are. I was wondering if you would mind chatting on the topic. Thanks,
GD

Oct. 14, 2015, 12:12 a.m. #
I don't mind, but does this post not address exactly that?
Adam

Hi Chris

I am in exactly the same situation and I believe your article above would help me. However, when I run the command it throws a boto exception on connection, as below.

Please note that in my situation the bucket is located in the AWS ap-southeast-2 region (Sydney) and I am executing this command in Australia.

Do you have any idea about the error below?

root@vmd001 [/path/to/folder/test]# /path/to/folder/s3-parallel-put --bucket=vmd001 --put=add --insecure --dry-run --limit=1 .
Traceback (most recent call last):
  File "/path/to/folder/s3-parallel-put", line 420, in <module>
    sys.exit(main(sys.argv))
  File "/path/to/folder/s3-parallel-put", line 391, in main
    bucket = connection.get_bucket(options.bucket)
  File "/usr/lib/python2.6/site-packages/boto/s3/connection.py", line 502, in get_bucket
    return self.head_bucket(bucket_name, headers=headers)
  File "/usr/lib/python2.6/site-packages/boto/s3/connection.py", line 549, in head_bucket
    response.status, response.reason, body)
boto.exception.S3ResponseError: S3ResponseError: 301 Moved Permanently

March 8, 2016, 9:37 p.m. #
How new is this bucket? Have you tried without --insecure?
Chris

Running:

# /path/to/s3-parallel-put --bucket=my-cdn --host=s3-eu-west-1.amazonaws.com --put=stupid --dry-run --limit=1 .

I'm getting:

Traceback (most recent call last):
  File "/path/to/s3-parallel-put/s3-parallel-put", line 421, in <module>
    sys.exit(main(sys.argv))
  File "/path/to/s3-parallel-put", line 392, in main
    bucket = connection.get_bucket(options.bucket)
  File "/usr/lib/python2.6/site-packages/boto/s3/connection.py", line 506, in get_bucket
    return self.head_bucket(bucket_name, headers=headers)
  File "/usr/lib/python2.6/site-packages/boto/s3/connection.py", line 553, in head_bucket
    response.status, response.reason, body)
boto.exception.S3ResponseError: S3ResponseError: 301 Moved Permanently

Looks like boto doesn't support buckets not in us-east-1

Nov. 1, 2016, 4:47 p.m. #
> boto doesn't support buckets not in us-east-1

Doubt it; probably the bucket is too new.
Chuck

You need to use the --bucket-region option on the command line, e.g. --bucket-region=eu-west-1.
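
For example, something along these lines, with the rest of the options as in the post above (I haven't tested this exact invocation):

$ /path/to/s3-parallel-put \
    --bucket=my-cdn \
    --bucket-region=eu-west-1 \
    --put=stupid \
    --dry-run --limit=1 \
    .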

Feb. 2, 2017, 9:21 p.m. #
Lior Zimmerman

Why not just use Linux's parallel command-line app?

This is how I use it:

parallel -i aws s3 cp "{}" ${bucket_name}/"{}" -- `find -name "*.tar.gz"`

March 9, 2017, 12:11 a.m. #