HTML to PDF using a Chrome puppet in the cloud

Keith Coughtrey, Apr 30

I’m going to take you through the process of setting up a headless Chrome browser that you can run on AWS and control through an API to do most of the things a browser can do.

Our target for today is to have Chrome navigate to a URL, wait for the page to load fully and then create a PDF.

The Chromium team have released Puppeteer, the headless Chrome Node API (GoogleChrome/puppeteer on github.com).

Puppeteer is a Node library which provides a high-level API to control Chrome or Chromium over the DevTools Protocol.

Puppeteer runs headless by default, but can be configured to run full (non-headless) Chrome or Chromium.

There is also a really useful site where you can go and try Puppeteer: https://try-puppeteer.appspot.com/

The sample code they provide to create a PDF looks like this:

    const browser = await puppeteer.launch();
    const page = await browser.newPage();
    await page.goto('https://news.ycombinator.com', {waitUntil: 'networkidle2'});
    await page.pdf({ path: 'hn.pdf', format: 'letter' });
    await browser.close();

The API being used above is very well documented in the Puppeteer API reference.

Looking at page.pdf, we see that the function takes an options object and returns a promise which resolves with a PDF buffer.

The options give you a good deal of control.

You can set a path to save the PDF to disk if you don’t want to consume the buffer, and you can control headers, footers and page formatting, among other things.
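To make the options concrete, here is a minimal sketch of an options object for page.pdf. The PdfOptionsSubset type and the buildPdfOptions helper are hypothetical names used only for illustration. The fields themselves (path, format, printBackground, margin) are a subset of Puppeteer's documented PDF options, declared locally so the snippet compiles without Puppeteer installed. The specific values are assumptions.

```typescript
// Local, partial copy of the shape of Puppeteer's PDF options (illustrative subset only).
interface PdfOptionsSubset {
  path?: string; // when set, the PDF is also written to this file
  format?: string; // paper format, e.g. "letter" or "A4"
  printBackground?: boolean; // include CSS backgrounds in the output
  margin?: { top?: string; right?: string; bottom?: string; left?: string };
}

// Hypothetical helper: builds the options, adding `path` only when a file is wanted.
function buildPdfOptions(outputPath?: string): PdfOptionsSubset {
  const options: PdfOptionsSubset = {
    format: "A4", // assumed value; the sample above uses "letter"
    printBackground: true,
    margin: { top: "15mm", right: "10mm", bottom: "15mm", left: "10mm" },
  };
  if (outputPath) {
    options.path = outputPath;
  }
  return options;
}

// Buffer only:     const pdf = await page.pdf(buildPdfOptions());
// Buffer and file: const pdf = await page.pdf(buildPdfOptions("hn.pdf"));
```

Leaving path unset suits Lambda, where the buffer is returned to the caller rather than written to disk.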

Building and deploying to AWS

Before we get started you will need Node 8.10 and npm installed on your machine, and you will need an AWS account to deploy your code to.

AWS Lambda has a reasonably generous free tier (see AWS Lambda Pricing).

Serverless

I’m going to use the Serverless Framework, which I find to be the easiest way to deploy to AWS.

If you haven’t used Serverless before, start by installing the CLI:

    npm install -g serverless

You then need to set up your AWS credentials (see How to create AWS Access Keys). Once you’ve finished the setup, create your project:

    serverless create --template aws-nodejs --path ./lambda-puppeteer

This will create the lambda-puppeteer folder containing a basic JavaScript Lambda deployment project.

My preference is to use TypeScript rather than plain JavaScript, so we will convert the project to TypeScript below.

The serverless template aws-nodejs-typescript could be used above but it creates a project that misses out a number of useful comments and it includes webpack, which we don’t need.

    cd lambda-puppeteer

The serverless.yml file contains all the configuration necessary to deploy your project, and the template creates a project that can be deployed and tested straight away:

    serverless deploy -v

Now test your function and look at the logs with these commands:

    serverless invoke -f hello -l
    serverless logs -f hello -t

Chromium and puppeteer core

Lambda has a 50 MB deployment limit (unless using layers), but the community has provided an easy way to deploy everything needed in a package of about 35 MB.

We will use this library to get the Chromium dependencies we need: alixaxel/chrome-aws-lambda (Chromium Binary for AWS Lambda) on github.com.

Initialise node package manager:

    npm init

Just accept the defaults for the project setup.

Add Chromium:

    npm i chrome-aws-lambda --save

and puppeteer-core, which is a version of Puppeteer that doesn’t download Chromium by default:

    npm i puppeteer-core --save

Using TypeScript

There are a number of ways to configure your project for TypeScript, such as using the serverless-plugin-typescript plugin.

In this case we’re going to manually convert the project in five steps:

1. Install TypeScript:

    npm i --save-dev typescript

2. Rename handler.js to handler.ts.

3. Install the Node types:

    npm i @types/node

4. Add a tsconfig.json file that sets outDir to lib.

5. Add these two scripts to package.json:

    "scripts": {
      "build": "tsc",
      "deploy": "npm run build && serverless deploy",
      ...
    },

Here we’ve added a deploy command that will compile the TypeScript and do a serverless deploy.

You could also run tests as part of the deploy by defining a test script and changing deploy to npm run build && npm run test && serverless deploy.
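For step 4, a minimal tsconfig.json sketch is shown below. Only outDir being lib is taken from this post (the handler path later relies on it). The remaining compiler options are assumptions chosen to suit Node 8.10 and are not necessarily the author's settings.

```json
{
  // Assumption: every option except outDir is illustrative, not the author's original.
  "compilerOptions": {
    "target": "es2017",
    "module": "commonjs",
    "outDir": "lib",
    "strict": true,
    "esModuleInterop": true
  },
  "include": ["*.ts"]
}
```

With this layout, tsc compiles handler.ts to lib/handler.js, which is the file the deployed function would load.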

Implementing the service

Our PDF service will have the following interface:

    export interface PdfService {
      getPdf(url: string): Promise<Buffer>;
    }

We expose a single function that accepts a URL parameter and returns a promise of a Buffer containing the PDF of the content of the URL.

Create a file named pdf-service.ts and add the interface code above to it.

Add the implementation of the interface to pdf-service.ts as well, so that the file contains both the interface and the implementation.
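As a rough idea of what such an implementation could look like, here is a minimal sketch. It is not the author's code: the PuppeteerPdfService class, the PageLike and BrowserLike types and the injected launch function are hypothetical names, and the PDF options are assumed values. The browser is injected so the snippet compiles and can be exercised without Chromium. The page calls used (newPage, goto, pdf, close) are the same ones as in the sample near the start of the post.

```typescript
// Minimal structural types covering only the Puppeteer calls used below,
// so the sketch compiles without puppeteer-core installed.
interface PageLike {
  goto(url: string, options: { waitUntil: string[] }): Promise<unknown>;
  pdf(options: { format: string; printBackground: boolean }): Promise<Buffer>;
}

interface BrowserLike {
  newPage(): Promise<PageLike>;
  close(): Promise<void>;
}

export interface PdfService {
  getPdf(url: string): Promise<Buffer>;
}

// `launch` is injected (a hypothetical design choice). On Lambda it might be,
// assuming chrome-aws-lambda's documented usage:
//   async () => chromium.puppeteer.launch({
//     args: chromium.args,
//     executablePath: await chromium.executablePath,
//     headless: chromium.headless,
//   })
export class PuppeteerPdfService implements PdfService {
  constructor(private readonly launch: () => Promise<BrowserLike>) {}

  async getPdf(url: string): Promise<Buffer> {
    const browser = await this.launch();
    try {
      const page = await browser.newPage();
      // Wait for all three events before capturing the page.
      await page.goto(url, {
        waitUntil: ["load", "domcontentloaded", "networkidle0"],
      });
      // No `path` option: we only want the buffer. Format is an assumed value.
      return await page.pdf({ format: "A4", printBackground: true });
    } finally {
      // Always release the browser, even if navigation or rendering fails.
      await browser.close();
    }
  }
}
```

Closing the browser in a finally block matters on Lambda, where a leaked Chromium process would keep consuming memory in a reused container.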

This code expands on the simple example near the beginning of this post.

One thing to note is the waitUntil option I have included. This setting determines when navigation is considered to have succeeded, and it defaults to load. When you specify an array of event strings, navigation is considered to be successful after all events have been fired.

load – consider navigation to be finished when the load event is fired.

domcontentloaded – consider navigation to be finished when the DOMContentLoaded event is fired.

networkidle0 – consider navigation to be finished when there are no more than 0 network connections for at least 500 ms.

So, with all three listed, capturing the PDF does not proceed until the last of these three has fired.
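The "all events must have fired" rule can be modelled in a few lines. This is only an illustration of the semantics described above, not Puppeteer's internal implementation. The navigationFinished function and the LifecycleEvent type are hypothetical names.

```typescript
type LifecycleEvent = "load" | "domcontentloaded" | "networkidle0" | "networkidle2";

// Returns true once every event listed in `waitUntil` has been observed.
function navigationFinished(
  waitUntil: LifecycleEvent[],
  fired: Set<LifecycleEvent>
): boolean {
  return waitUntil.every((event) => fired.has(event));
}

const waitUntil: LifecycleEvent[] = ["load", "domcontentloaded", "networkidle0"];
const firedSoFar = new Set<LifecycleEvent>(["domcontentloaded", "load"]);

console.log(navigationFinished(waitUntil, firedSoFar)); // false: network is not idle yet
firedSoFar.add("networkidle0");
console.log(navigationFinished(waitUntil, firedSoFar)); // true: all three have fired
```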

Wiring up to an https endpoint

To make our service callable, we change the handler code so that it calls our PdfService and converts the returned buffer to a base64 string.
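A handler along those lines might look like the sketch below. It is not the author's handler: createPdfReportHandler and PdfRequestEvent are hypothetical names, and the event shape (query string parameters under event.query) is an assumption based on the Serverless Framework's default request template for the lambda integration, so check it against your own deployment.

```typescript
interface PdfService {
  getPdf(url: string): Promise<Buffer>;
}

// Assumed event shape for `integration: lambda`; only the part we read is declared.
interface PdfRequestEvent {
  query?: { url?: string };
}

// Hypothetical factory: takes the service so the handler can be tested with a fake.
function createPdfReportHandler(service: PdfService) {
  return async (event: PdfRequestEvent): Promise<string> => {
    const url = event.query && event.query.url;
    if (!url) {
      throw new Error("Missing required query parameter: url");
    }
    const pdf = await service.getPdf(url);
    // API Gateway returns text here, so the binary PDF is encoded as base64.
    return pdf.toString("base64");
  };
}

// In handler.ts this could be exported under the name serverless.yml refers to, e.g.
//   export const pdfReport = createPdfReportHandler(myPdfService);
```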

Finally, we add an https endpoint /pdf to call our function by replacing the functions section of serverless.yml with:

    functions:
      pdfReport:
        handler: lib/handler.pdfReport
        events:
          - http:
              path: pdf
              method: get
              integration: lambda

Note that the handler path of lib matches the outDir specified in tsconfig.json above.

Deploy your service using the deploy script we defined in package.json:

    npm run deploy

After the deployment has finished we can call our PDF service by going to the URL allocated by the serverless deploy, for example:

    https://<your project id and region>.amazonaws.com/dev/pdf?url=https://example.com

If all is well, this should return a long base64 text response. If we use an online base64-to-PDF converter (e.g. base64.guru) to convert the text of the response to a PDF, we can see the result.
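If you would rather not paste the response into a website, the same conversion can be done locally with Node. The sketch below is an illustration only: decodePdfResponse is a hypothetical helper, and the handling of surrounding JSON quotes is an assumption about how the string may arrive from API Gateway.

```typescript
// Accepts either a bare base64 string or the same string wrapped in JSON quotes
// (an assumption about the response format), and returns the decoded PDF bytes.
function decodePdfResponse(responseBody: string): Buffer {
  const trimmed = responseBody.trim();
  const base64 = trimmed.startsWith('"')
    ? (JSON.parse(trimmed) as string)
    : trimmed;
  const pdf = Buffer.from(base64, "base64");
  // Every PDF file begins with the marker "%PDF-"; use it as a sanity check.
  if (pdf.subarray(0, 5).toString("latin1") !== "%PDF-") {
    throw new Error("Decoded data does not start with a PDF header");
  }
  return pdf;
}

// Usage, after saving the response text to a variable `body`:
//   require("fs").writeFileSync("example.pdf", decodePdfResponse(body));
```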

Returning application/pdf

By changing some settings in API Gateway you can have your endpoint return the correct Content-Type to be displayed as a PDF. There is a Serverless plugin that is meant to automate these settings: serverless-plugin-custom-binary (Enable binary support for API Gateway, www.npmjs.com).

I wasn’t able to get it to work but it may work for you. I was able to make the change manually in API Gateway, although it’s not ideal to have configuration outside of your serverless deployment.

Adding header and footer

You can add your own HTML markup to create custom page headers and footers. One thing to note is that none of the stylesheets from the page are available, so any styling needs to be done inline.

The header and footer markup can contain the following classes used to inject printing values into them:

date – formatted print date
title – document title
url – document location
pageNumber – current page number
totalPages – total pages in the document

For example, a footer containing page numbers uses the pageNumber class.

Of course, using page.pdf is just one example of the many things you can do with Chrome using the Puppeteer API.
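Returning to the page-number footer for a moment, here is a minimal sketch of options that could be passed to page.pdf. The pageNumberFooterOptions helper and the FooterPdfOptions type are hypothetical names, and the sizes and margins are assumed values. displayHeaderFooter, headerTemplate, footerTemplate and margin are Puppeteer PDF options, declared locally so the snippet stands alone.

```typescript
// Local declaration of the few PDF options used here (illustrative subset only).
interface FooterPdfOptions {
  displayHeaderFooter: boolean;
  headerTemplate: string;
  footerTemplate: string;
  margin: { top: string; bottom: string };
}

// Hypothetical helper returning options for a centred "Page X of Y" footer.
function pageNumberFooterOptions(): FooterPdfOptions {
  return {
    displayHeaderFooter: true,
    // An empty header, so only the footer is visible.
    headerTemplate: "<span></span>",
    // Styling is inline because the page's stylesheets are not available here.
    footerTemplate:
      '<div style="font-size: 10px; width: 100%; text-align: center;">' +
      'Page <span class="pageNumber"></span> of <span class="totalPages"></span>' +
      "</div>",
    // Leave room for the footer; without a bottom margin it may not be visible.
    margin: { top: "20mm", bottom: "20mm" },
  };
}

// Usage: const pdf = await page.pdf(pageNumberFooterOptions());
```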

That completes today’s post.

Remember to delete your AWS resources when you’ve finished, using serverless remove.

In my next post we’ll add PDF password protection using a command-line tool and in the third post of the series I will cover calling the PDF service from an AWS step function.
