Hadoop Up and Running

The output from the above job will look something like the following:

creating new scratch bucket mrjob-3568101f09d5f75c
using s3://mrjob-3568101f09d5f75c/tmp/ as our scratch dir on S3
creating tmp directory /tmp/calc_pi_job.mark.20151229.160731.659742
writing master bootstrap script to /tmp/calc_pi_job.mark.20151229.160731.659742/b.py
creating S3 bucket 'mrjob-3568101f09d5f75c' to use as scratch space
Copying non-input files into s3://mrjob-3568101f09d5f75c/tmp/calc_pi_job.mark.20151229.160731.659742/files/
Waiting 5.0s for S3 eventual consistency
Creating Elastic MapReduce job flow
Auto-created instance profile mrjob-5cc7d6cec347fa81
Auto-created service role mrjob-84fd09862fa415d0
Job flow created with ID: j-3B6TE1AW5RVS5
Created new job flow j-3B6TE1AW5RVS5
Job launched 31.2s ago, status STARTING: Provisioning Amazon EC2 capacity
Job launched 62.5s ago, status STARTING: Provisioning Amazon EC2 capacity
Job launched 93.7s ago, status STARTING: Provisioning Amazon EC2 capacity
Job launched 125.0s ago, status STARTING: Provisioning Amazon EC2 capacity
Job launched 156.2s ago, status STARTING: Provisioning Amazon EC2 capacity
Job launched 187.5s ago, status STARTING: Provisioning Amazon EC2 capacity
Job launched 218.7s ago, status STARTING: Provisioning Amazon EC2 capacity
Job launched 250.0s ago, status STARTING: Provisioning Amazon EC2 capacity
Job launched 281.2s ago, status STARTING: Provisioning Amazon EC2 capacity
Job launched 312.4s ago, status STARTING: Provisioning Amazon EC2 capacity
Job launched 343.7s ago, status STARTING: Configuring cluster software
Job launched 374.9s ago, status BOOTSTRAPPING: Running bootstrap actions
Job launched 406.2s ago, status BOOTSTRAPPING: Running bootstrap actions
Job launched 437.5s ago, status RUNNING: Running step (calc_pi_job.mark.20151229.160731.659742: Step 1 of 1)
Job launched 468.8s ago, status RUNNING: Running step (calc_pi_job.mark.20151229.160731.659742: Step 1 of 1)
Job launched 500.0s ago, status RUNNING: Running step (calc_pi_job.mark.20151229.160731.659742: Step 1 of 1)
Job completed.
Running time was 74.0s (not counting time spent waiting for the EC2 instances)
ec2_key_pair_file not specified, going to S3
Fetching counters from S3...
Waiting 5.0s for S3 eventual consistency
Counters from step 1: (no counters found)
Streaming final output from s3://<throw away s3 bucket>/test1/
removing tmp directory /tmp/calc_pi_job.mark.20151229.160731.659742
Removing all files in s3://mrjob-3568101f09d5f75c/tmp/calc_pi_job.mark.20151229.160731.659742/
Removing all files in s3://mrjob-3568101f09d5f75c/tmp/logs/j-3B6TE1AW5RVS5/
Terminating job flow: j-3B6TE1AW5RVS5

Normally you could then download the log files and grep out the estimated value of Pi:

$ s3cmd get --recursive s3://mrjob-3568101f09d5f75c/tmp/logs/
$ cd j-3B6TE1AW5RVS5
$ find . -type f -name '*.gz' -exec gunzip "{}" \;
$ grep -r 'Estimated value of Pi is' * | wc -l
0

While the estimated value of Pi was printed to the stdout log file on the master node, and that file should have been shipped to S3, it wasn't for some reason. If the cluster is not set to auto-terminate, you can SSH in and see the value sitting in the stdout log file on the master node itself:

$ aws emr ssh --key-pair-file emr.pem --region us-east-1 --cluster-id j-3B6TE1AW5RVS5
$ cat /mnt/var/log/hadoop/steps/s-3QSFO02HYEA8/stdout
Number of Maps  = 10
Samples per Map = 100000
Wrote input for Map #0
Wrote input for Map #1
Wrote input for Map #2
Wrote input for Map #3
Wrote input for Map #4
Wrote input for Map #5
Wrote input for Map #6
Wrote input for Map #7
Wrote input for Map #8
Wrote input for Map #9
Starting Job
Job Finished in 186.566 seconds
Estimated value of Pi is 3.14155200000000000000

I've discussed ways of trying to fix or work around this issue in a ticket I raised with the mrjob community on GitHub. Our efforts are ongoing.
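Incidentally, the sampling behind that final number is easy to reproduce locally. Below is a minimal plain-Python sketch of the same map/reduce scheme: each of the 10 "map" tasks draws 100,000 points in the unit square, counts how many land inside the quarter circle of radius 1, and the "reduce" step sums the counts and multiplies by 4. The function name and seed here are my own, and Hadoop's bundled PiEstimator actually uses a quasi-random Halton sequence rather than a pseudo-random generator, so the digits won't match the output above exactly.

```python
import random

def estimate_pi(num_maps=10, samples_per_map=100_000, seed=42):
    """Locally simulate the map/reduce structure of the Pi estimator."""
    rng = random.Random(seed)  # fixed seed so the run is repeatable
    inside = 0
    total = num_maps * samples_per_map
    for _ in range(num_maps):            # one iteration per "map" task
        for _ in range(samples_per_map):
            x, y = rng.random(), rng.random()
            if x * x + y * y <= 1.0:     # point falls inside the quarter circle
                inside += 1
    # "reduce": combine the per-map counts and scale by 4
    return 4.0 * inside / total

print(f"Estimated value of Pi is {estimate_pi():.6f}")
```

With 1,000,000 total samples the estimate typically lands within a few thousandths of Pi, which is about the accuracy the cluster run above achieved.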
