Serverless Web Scraping Using AWS Lambda and S3 — Python

Get the data with the scrape function, add the date to the file name (this will run every day, so I need a way to identify each file), and put it in S3 using boto3; a rough sketch of the scraper and handler is included at the end of this post. Seems like it should work fine, right? Wrong. This project already has plenty of dependencies: urlopen, datetime, boto3, BeautifulSoup, pandas, and re, and all of them need to be installed along with the function.

There are a few ways to do this, but the simplest is to package everything together with the Serverless Framework. I won't go into the details of Serverless here; the post by Michael Lavers is a great resource and better than anything I could write on the topic.

Once Serverless is set up, add this new function to the project created in that post: put web_scrap() and handler() in a single file, point the .yaml file at that file's handler (the functions block is shown at the end as well), and rerun sls deploy. This packages all of the dependencies without us doing a thing. It's great, honestly, and saves so much time.

Now that the function is up on Lambda, all that's left is to add a cron trigger from CloudWatch and test that the file gets added to S3. If the .csv was successfully added to the bucket, you're good to go. You now have a serverless function that will scrape a webpage however often you'd like, and since Amazon's free tier includes 1 million requests per month, collecting this data hasn't cost me anything.

Easy, right? Happy data gathering!
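For reference, here is a minimal sketch of what the scraper and handler could look like. The target URL, the table it parses, and the bucket name are hypothetical placeholders (the post doesn't show them), and the re usage is omitted.

```python
# Minimal sketch: scrape a table, date-stamp the file name, upload to S3.
# The URL, table layout, and bucket name below are placeholders.
from datetime import date
from io import StringIO
from urllib.request import urlopen

import boto3
import pandas as pd
from bs4 import BeautifulSoup


def web_scrap():
    # Fetch the page and pull the first table's rows into a DataFrame.
    html = urlopen("https://example.com/data").read()  # placeholder URL
    soup = BeautifulSoup(html, "html.parser")
    rows = [
        [cell.get_text(strip=True) for cell in row.find_all(["th", "td"])]
        for row in soup.find("table").find_all("tr")
    ]
    return pd.DataFrame(rows[1:], columns=rows[0])


def handler(event, context):
    # Date-stamp the key so each daily run writes a new object
    # instead of overwriting yesterday's file.
    df = web_scrap()
    key = f"scrape-{date.today().isoformat()}.csv"

    buffer = StringIO()
    df.to_csv(buffer, index=False)

    boto3.client("s3").put_object(
        Bucket="my-scraped-data",  # placeholder bucket name
        Key=key,
        Body=buffer.getvalue().encode("utf-8"),
    )
    return {"statusCode": 200, "body": f"Wrote {key}"}
```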
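The flattened configuration snippet mentioned above maps to a functions block in serverless.yml, with scraper.py as the file holding both functions:

```yaml
functions:
  scraper:
    handler: scraper.handler
```

After editing the file, rerunning sls deploy rebuilds and uploads the package, dependencies included.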
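The post wires up the daily run with a cron trigger in the CloudWatch console. As an alternative sketch, the Serverless Framework can declare the same schedule directly in serverless.yml; the cron expression below (noon UTC every day) is only an example, not the schedule used in the post:

```yaml
functions:
  scraper:
    handler: scraper.handler
    events:
      - schedule: cron(0 12 * * ? *)  # example: run daily at 12:00 UTC
```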
