A Million Songs on AWS Redshift

➫ ls -lh track_metadata*.gz
-rw-rw-r-- 1 mark mark 1.8M Jan 23 08:48 track_metadata_aa.csv.gz
-rw-rw-r-- 1 mark mark 1.8M Jan 23 08:48 track_metadata_ab.csv.gz
-rw-rw-r-- 1 mark mark 1.8M Jan 23 08:48 track_metadata_ac.csv.gz
-rw-rw-r-- 1 mark mark 1.8M Jan 23 08:48 track_metadata_ad.csv.gz
-rw-rw-r-- 1 mark mark 1.8M Jan 23 08:48 track_metadata_ae.csv.gz
-rw-rw-r-- 1 mark mark 1.8M Jan 23 08:48 track_metadata_af.csv.gz
-rw-rw-r-- 1 mark mark 1.8M Jan 23 08:48 track_metadata_ag.csv.gz
-rw-rw-r-- 1 mark mark 1.8M Jan 23 08:48 track_metadata_ah.csv.gz
-rw-rw-r-- 1 mark mark 1.8M Jan 23 08:48 track_metadata_ai.csv.gz
-rw-rw-r-- 1 mark mark 1.8M Jan 23 08:48 track_metadata_aj.csv.gz
-rw-rw-r-- 1 mark mark 1.8M Jan 23 08:48 track_metadata_ak.csv.gz
-rw-rw-r-- 1 mark mark 1.8M Jan 23 08:48 track_metadata_al.csv.gz
-rw-rw-r-- 1 mark mark 1.8M Jan 23 08:48 track_metadata_am.csv.gz
-rw-rw-r-- 1 mark mark 1.8M Jan 23 08:48 track_metadata_an.csv.gz
-rw-rw-r-- 1 mark mark 1.8M Jan 23 08:48 track_metadata_ao.csv.gz
-rw-rw-r-- 1 mark mark 1.8M Jan 23 08:48 track_metadata_ap.csv.gz
-rw-rw-r-- 1 mark mark 1.8M Jan 23 08:48 track_metadata_aq.csv.gz
-rw-rw-r-- 1 mark mark 1.8M Jan 23 08:48 track_metadata_ar.csv.gz
-rw-rw-r-- 1 mark mark 1.8M Jan 23 08:48 track_metadata_as.csv.gz
-rw-rw-r-- 1 mark mark 1.8M Jan 23 08:48 track_metadata_at.csv.gz

Uploading in Parallel to S3

Make sure your S3 bucket is created in the same region you'll be launching your Redshift cluster in. Amazon's us-east-1 region has the cheapest price on Redshift clusters, so that's the region I'll create my S3 bucket in.

➫ s3cmd --configure
...
Default Region [US]: US
...

➫ s3cmd mb s3://track_metadata_example
Bucket 's3://track_metadata_example/' created

The following uses GNU Parallel to upload eight files at a time to Amazon S3 (-j8 runs eight jobs concurrently, and {/} is Parallel's basename placeholder).

➫ find track_metadata_*gz | parallel -j8 s3cmd put {/} s3://track_metadata_example/

Redshift needs a manifest of the files we'll be loading in.

➫ vi songs.manifest
{
    "entries": [
        {"url": "s3://track_metadata_example/track_metadata_aa.csv.gz", "mandatory": true},
        {"url": "s3://track_metadata_example/track_metadata_ab.csv.gz", "mandatory": true},
        {"url": "s3://track_metadata_example/track_metadata_ac.csv.gz", "mandatory": true},
        {"url": "s3://track_metadata_example/track_metadata_ad.csv.gz", "mandatory": true},
        {"url": "s3://track_metadata_example/track_metadata_ae.csv.gz", "mandatory": true},
        {"url": "s3://track_metadata_example/track_metadata_af.csv.gz", "mandatory": true},
        {"url": "s3://track_metadata_example/track_metadata_ag.csv.gz", "mandatory": true},
        {"url": "s3://track_metadata_example/track_metadata_ah.csv.gz", "mandatory": true},
        {"url": "s3://track_metadata_example/track_metadata_ai.csv.gz", "mandatory": true},
        {"url": "s3://track_metadata_example/track_metadata_aj.csv.gz", "mandatory": true},
        {"url": "s3://track_metadata_example/track_metadata_ak.csv.gz", "mandatory": true},
        {"url": "s3://track_metadata_example/track_metadata_al.csv.gz", "mandatory": true},
        {"url": "s3://track_metadata_example/track_metadata_am.csv.gz", "mandatory": true},
        {"url": "s3://track_metadata_example/track_metadata_an.csv.gz", "mandatory": true},
        {"url": "s3://track_metadata_example/track_metadata_ao.csv.gz", "mandatory": true},
        {"url": "s3://track_metadata_example/track_metadata_ap.csv.gz", "mandatory": true},
        {"url": "s3://track_metadata_example/track_metadata_aq.csv.gz", "mandatory": true},
        {"url": "s3://track_metadata_example/track_metadata_ar.csv.gz", "mandatory": true},
        {"url": "s3://track_metadata_example/track_metadata_as.csv.gz", "mandatory": true},
        {"url": "s3://track_metadata_example/track_metadata_at.csv.gz", "mandatory": true}
    ]
}

The manifest itself will also need to live on S3.

➫ s3cmd put songs.manifest s3://track_metadata_example/
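
The twenty entries above don't have to be typed out by hand. A short shell loop can generate the same file; this is a sketch of mine (the make_manifest.sh filename is hypothetical, and it assumes the split files sit in the current directory with the aa..at naming shown earlier).

➫ vi make_manifest.sh
echo '{"entries": ['
for f in track_metadata_*.csv.gz; do
    echo "    {\"url\": \"s3://track_metadata_example/$f\", \"mandatory\": true},"
done | sed '$ s/,$//'   # strip the trailing comma from the last entry
echo ']}'

➫ bash make_manifest.sh > songs.manifest

If you regenerate the manifest this way, re-run the s3cmd put command above to replace the copy in the bucket.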
"s3://track_metadata_example/track_metadata_as.csv.gz", "mandatory": true}, {"url": "s3://track_metadata_example/track_metadata_at.csv.gz", "mandatory": true} ] } The manifest itself will also need to live on S3..➫ s3cmd put songs.manifest s3://track_metadata_example/ Launching the Redshift Cluster The cheapest Redshift cluster I could find on Amazons pricing page was a dc1.large in us-east-1 at $0.25 / hour..It comes with 2 vCPUs, 15 GB of RAM, 0.16 TB of SSD-backed storage and throughput support of 0.20GB / second..Below Ill set the environment variables for my AWS access credentials and set the master username and password for my Redshift instance..➫ read AWS_ACCESS_KEY_ID ➫ read AWS_SECRET_ACCESS_KEY ➫ export AWS_ACCESS_KEY_ID ➫ export AWS_SECRET_ACCESS_KEY ➫ read MASTER_USERNAME ➫ read MASTER_PASSWORD ➫ export MASTER_USERNAME ➫ export MASTER_PASSWORD Before creating the Redshift cluster make sure the AWS CLI tools default region is the same region your S3 bucket is located in..You can run the configure command to update any settings and ensure theyre as you expect.. More details
