Thread for tips on collecting data from S3 buckets.

rclone has been recommended over the aws cli tool, so someone else can chime in with info on that.

aws cli tool

Downloading

Download a directory:

aws s3 sync --no-sign-request "s3://{bucket-name}/some/path/" ./output_directory

Download a single file:

aws s3 sync --no-sign-request "s3://{bucket-name}/some/path/" ./output_directory --exclude='*' --include='name-of-file.txt'

Caveats

  • If you cancel an in-progress sync with ctrl+c it WILL delete the temporary files by default.
  • The aws cli tool downloads files in parallel rather than sequentially, so if a dataset is removed before the sync finishes, the partially downloaded files can be left unusable. I recommend downloading files one at a time in a loop instead: put the files you want in a list, e.g. file_list.txt (one file per line), then run:
while IFS= read -r p; do
  aws s3 sync --no-sign-request "s3://{bucket-name}/some/path/" ./output_directory --exclude='*' --include="$p"
done <file_list.txt
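
If it helps, here's a rough sketch for building file_list.txt from a bucket listing instead of typing it by hand. It assumes the usual `aws s3 ls --recursive` output (date, time, size, then the full key). The function name list_to_relative_paths is something I made up for illustration, not part of the aws cli. It strips the listing columns and the path prefix so each line matches what --include expects relative to the sync source.

```shell
# Hypothetical helper: turn `aws s3 ls --recursive` output into relative paths.
# Each listing line is assumed to look like:
#   2023-01-01 12:00:00       1234 some/path/file.h5
list_to_relative_paths() {
  prefix="$1"   # key prefix to strip, e.g. "some/path/"
  # Drop the date, time, and size columns, keeping only the key.
  sed -E 's/^[^ ]+ +[^ ]+ +[0-9]+ //' | while IFS= read -r key; do
    # Remove the leading prefix so the path is relative to the sync source.
    printf '%s\n' "${key#"$prefix"}"
  done
}

# Usage (bucket name and path are placeholders):
# aws s3 ls --no-sign-request --recursive "s3://{bucket-name}/some/path/" \
#   | list_to_relative_paths "some/path/" > file_list.txt
```

Worth a quick look at file_list.txt before starting the loop, since folder marker keys (entries ending in /) can show up in the listing too.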

rclone

Download everything under a bucket path (this assumes an rclone remote named aws is already configured for S3)

rclone copy aws:nrel-pds-nsrdb/v3/ ./ \
  --multi-thread-write-buffer-size 64Mi \
  --multi-thread-streams 16 \
  --progress

Download a set of files in the bucket

rclone copy aws:nrel-pds-nsrdb/v3/ ./ \
  --include 'nsrdb_{2006,2007,2008,2009,2010,2011}.h5'
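
For longer year ranges, typing the brace list gets tedious. Here's a small sketch that builds the same kind of pattern from a start and end year. year_pattern is a hypothetical helper name, and it assumes the files follow the nsrdb_YEAR.h5 naming used in the command above.

```shell
# Hypothetical helper: build an rclone brace pattern for a range of years,
# assuming files are named nsrdb_<year>.h5 as in the example above.
year_pattern() {
  first="$1"
  last="$2"
  list=""
  y="$first"
  while [ "$y" -le "$last" ]; do
    # Append the year, adding a comma only between entries.
    list="${list:+$list,}$y"
    y=$((y + 1))
  done
  printf 'nsrdb_{%s}.h5\n' "$list"
}

# Usage (same remote and path as above):
# rclone copy aws:nrel-pds-nsrdb/v3/ ./ --include "$(year_pattern 2006 2011)"
```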