Thread for tips on collecting stuff from S3 buckets.
rclone has been recommended over the aws cli tool, so anyone who has used it is welcome to chime in with info on that.
aws cli tool
Downloading
Download a directory:
aws s3 sync --no-sign-request "s3://{bucket-name}/some/path/" ./output_directory
Download a single file (sync the directory, but exclude everything except that file):
aws s3 sync --no-sign-request "s3://{bucket-name}/some/path/" ./output_directory --exclude='*' --include='name-of-file.txt'
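For a single file, aws s3 cp may be a simpler alternative to the exclude/include trick. A minimal sketch, assuming the bucket is public; s3_get_one is a made-up helper name, not part of the aws cli. Unlike sync, cp does not skip files you already have, so it re-downloads every time.

```shell
# Hypothetical helper (not part of the aws cli): copy one object from a public bucket.
# Usage: s3_get_one BUCKET KEY OUTPUT_DIR
s3_get_one() {
  bucket="$1"
  key="$2"
  outdir="$3"
  # The trailing slash makes cp treat the destination as a directory
  aws s3 cp --no-sign-request "s3://${bucket}/${key}" "${outdir}/"
}

# Example (placeholders as in the commands above):
# s3_get_one "{bucket-name}" "some/path/name-of-file.txt" ./output_directory
```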
Caveats
- If you cancel an in-progress sync with ctrl+c it WILL delete the temporary files by default.
- The aws cli tool downloads files in parallel rather than sequentially, so if a dataset is removed before the sync has finished, the partial files can be left unusable. Recommend downloading files sequentially in a loop instead: make a list of the files in e.g. file_list.txt (one file per line, paths relative to some/path/), then run:
# IFS= and -r keep leading spaces and backslashes in file names intact
while IFS= read -r p; do
  aws s3 sync --no-sign-request "s3://some-bucket-name/some/path/" ./output_directory --exclude='*' --include="$p"
done < file_list.txt
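One way to build file_list.txt is to filter the output of aws s3 ls --recursive. This is only a sketch, assuming each listing line has the usual "date time size key" layout; list_to_relative_paths is a made-up helper name, not something the aws cli provides. It strips the prefix so each entry is relative to the path being synced, which is what --include expects.

```shell
# Hypothetical helper: read `aws s3 ls --recursive` output on stdin and
# print each key relative to the given prefix, one per line.
list_to_relative_paths() {
  prefix="$1"
  # Drop the date, time and size columns; the rest of the line is the key
  # (kept whole, so keys containing spaces survive).
  sed -E 's/^[^ ]+ +[^ ]+ +[0-9]+ //' | while IFS= read -r key; do
    case "$key" in
      # Keep keys under the prefix; skip the entry for the prefix itself
      "$prefix"?*) printf '%s\n' "${key#"$prefix"}" ;;
    esac
  done
}

# Example (same placeholder bucket as the loop above):
# aws s3 ls --no-sign-request --recursive "s3://some-bucket-name/some/path/" \
#   | list_to_relative_paths "some/path/" > file_list.txt
```

Note that --include treats *, ? and [ in a name as pattern characters, so names containing those may match more than one file.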
rclone
Download a bucket (or a path within it, here v3/)
rclone copy aws:nrel-pds-nsrdb/v3/ ./ \
  --multi-thread-write-buffer-size 64Mi \
  --multi-thread-streams 16 \
  --progress
Download a set of files in the bucket
rclone copy aws:nrel-pds-nsrdb/v3/ ./ \
  --include 'nsrdb_{2006,2007,2008,2009,2010,2011}.h5'
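The rclone commands above assume a remote named aws is already configured. A minimal sketch of one way to define it for public buckets, assuming anonymous access is enough (as far as I know, rclone makes anonymous requests when the remote has no credentials set). The file name rclone-public.conf is arbitrary, and some buckets may also need a region line.

```shell
# Write a small standalone rclone config defining an S3 remote called "aws"
# with no credentials, so requests to public buckets go out unsigned.
cat > ./rclone-public.conf <<'EOF'
[aws]
type = s3
provider = AWS
EOF

# Example: point rclone at this config instead of the default one
# rclone lsd --config ./rclone-public.conf aws:nrel-pds-nsrdb/v3/
```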