Distributed Data Pre-processing using Dask, Amazon ECS and Python (Part 1)

You can verify this by switching to the ECS Console -> Clusters -> Fargate-Dask-Cluster; on the Tasks tab there should be 2 running tasks.

Now that the Dask cluster is ready, I will create a SageMaker notebook so I can start using the cluster. To do this, I switch to the SageMaker Console -> Notebook Instances -> Create Notebook Instance, then select the same VPC and subnets that were selected earlier in the CloudFormation template.

NOTE: You can select any other subnet and security group as long as you enable access between the SageMaker notebook and the Dask cluster.

Then I create a new Python 3 notebook by clicking New -> conda_python3.

Dask packages are installed by default on the SageMaker notebook, but it is important to make sure the package is updated to the latest version. To verify that, I run the conda update command on the notebook.

NOTE: If the client version is lower than the scheduler and worker versions, you will encounter errors when initiating the client.

The next step is creating the client and connecting to the Dask cluster. Notice that I used the DNS name of the scheduler, which was automatically assigned by ECS Service Discovery; this feature uses the Route 53 auto naming API actions to manage Route 53 DNS entries.

Now let's do some operations on the data using the cluster. Before that, I will scale the cluster up to 7 workers, which takes one command in the notebook. After a few seconds, the worker tasks' status in the Fargate console will be RUNNING. I then restart the Dask client to make sure we utilize the parallel nature of the cluster.

We now have a cluster of 14 CPU cores and 12 GB of memory (2 CPU cores and 2 GB of memory for each of the 7 workers). Let's do some compute- and memory-intensive operations and generate some insights. I load a Dask dataframe with the data, compute the trip distance, and group by the number of passengers. The results start to show up after about 2.5 minutes, after parallelizing the task across 7 workers and loading more than 10 GB of data in parallel.

Visualization: screenshots from the Bokeh server on the scheduler task show the operations being threaded across the workers. The dashboard can be accessed in the browser from the scheduler IP address on port 8787. A further screenshot shows the resource (CPU and memory) utilization for each worker.

Now you should be ready to do some pre-processing magic! In part 2, I will show some code for running analytics and pre-processing/feature engineering using the Dask cluster we created.

Thanks for reading the post. Feedback and constructive criticism are always welcome. I will be reading all of your comments.

-Will
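To make the connection step above concrete, here is a minimal sketch. The scheduler's DNS name follows the `<service>.<namespace>` pattern that ECS Service Discovery registers; the names `Dask-Scheduler` and `local-dask` are assumptions for illustration, so substitute the service and private namespace your CloudFormation stack created.

```python
# Build the scheduler endpoint from its ECS Service Discovery DNS entry.
# "Dask-Scheduler" and "local-dask" are hypothetical names -- substitute
# the service and private namespace defined in your CloudFormation stack.
def scheduler_address(service: str = "Dask-Scheduler",
                      namespace: str = "local-dask",
                      port: int = 8786) -> str:
    """Return the scheduler endpoint; 8786 is Dask's default scheduler port."""
    return f"tcp://{service}.{namespace}:{port}"

# In the notebook, once dask/distributed are up to date:
#   from dask.distributed import Client
#   client = Client(scheduler_address())
```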
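Scaling the workers, as described above, is an ECS operation rather than a Dask one: you raise the desired task count on the worker service. A sketch of that single notebook command follows; `Fargate-Dask-Cluster` and `Dask-Workers` are assumed names from my template, so match them to your own stack.

```python
# Build the AWS CLI call that scales the ECS worker service. The default
# cluster and service names are assumptions -- match them to your stack.
def scale_workers_command(desired_count: int,
                          cluster: str = "Fargate-Dask-Cluster",
                          service: str = "Dask-Workers") -> str:
    """Return the `aws ecs update-service` call that sets the task count."""
    return (f"aws ecs update-service --cluster {cluster} "
            f"--service {service} --desired-count {desired_count}")

# In a notebook cell (then restart the client so it picks up the workers):
#   !{scale_workers_command(7)}
#   client.restart()
```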
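The aggregation described above reads the same on 10 GB of taxi data as on a toy frame, so here is a minimal sketch of the group-by on an in-memory stand-in. The column names `passenger_count` and `trip_distance` are assumptions based on the NYC taxi schema, and the S3 path in the comment is illustrative only.

```python
import pandas as pd
import dask.dataframe as dd

# Toy stand-in for the real dataset; on the cluster you would instead load
# the trip records with something like dd.read_csv("s3://.../*.csv").
pdf = pd.DataFrame({
    "passenger_count": [1, 1, 2, 2, 3],
    "trip_distance": [1.0, 3.0, 2.0, 4.0, 5.0],
})
df = dd.from_pandas(pdf, npartitions=2)

# Mean trip distance per passenger count; .compute() ships the task graph
# to the connected scheduler, which spreads partitions across the workers.
result = df.groupby("passenger_count")["trip_distance"].mean().compute()
print(result)
```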
