Running Puppeteer under Docker

Puppeteer is an API enabling browser automation via Chrome/Chromium. Quoting its homepage:

“Puppeteer is a Node.js library which provides a high-level API to control Chrome/Chromium over the DevTools Protocol. Puppeteer runs in headless mode by default, but can be configured to run in full (non-headless) Chrome/Chromium.”

I showed it in action a couple of years ago in my article “Gathering Net Salary Data with Puppeteer”, where I used it for web scraping. Browser automation is also very common for automated testing of web applications, among many other uses.

As with any other piece of software, it is sometimes convenient to package a Puppeteer script in a Docker container. However, since deploying a browser is fundamentally more complicated than your average API, Puppeteer and Docker are a little tricky to get working together. In this article, we’ll see why this combination is problematic and how to solve it.

A Minimal Puppeteer Example

Before we embark upon our Docker journey, we need a simple Puppeteer program we can test with. The easiest thing we can do is use Puppeteer to open a webpage and take a screenshot of it. We can do this quite easily as follows.

First, we need to create a folder, install prerequisites, and create a file in which to put our JavaScript code. The following bash script takes care of all this, assuming you already have Node.js installed. Please note that I am working on Linux Kubuntu 22.04, so if you’re using a radically different operating system, the steps may vary a little.

mkdir pupdock
cd pupdock
npm install puppeteer
touch main.js

Next, open main.js with your favourite text editor or IDE and use the following code (adapted from “Gathering Net Salary Data with Puppeteer”):

const puppeteer = require('puppeteer');
 
(async () => {
  const browser = await puppeteer.launch();
  const page = await browser.newPage();
  await page.goto('https://www.programmersranch.com');
  await page.screenshot({ path: 'screenshot.png' });
  await browser.close();
})();

To see that it works, run this script using the following command:

node main.js

A file called screenshot.png should be saved in the same folder:

A screenshot of my older tech blog, Programmer’s Ranch, complete with Blogger’s stupid cookie banner.

An Initial Attempt with Docker

Now that we know that the script works, let’s try and make a Docker image out of it. Add a Dockerfile with the following contents:

FROM node:19-alpine3.16
WORKDIR /puppeteer
COPY main.js package.json package-lock.json ./
RUN npm install
CMD ["node", "main.js"]

What we’re doing here is starting from a recent (at the time of writing this article) Node Docker image. Then we set the current working directory to a folder called /puppeteer. We copy the script along with the list of package dependencies (Puppeteer, basically) into the image, install those dependencies, and set up the image to execute node main.js when it is run.

We can build the Docker image using the following command, giving it the tag “pupdock” so that we can easily find it later. The sudo prefix is typically only needed on Linux, and only if your user isn’t allowed to talk to the Docker daemon directly (e.g. it isn’t in the docker group).

sudo docker build -t pupdock .

Once the image is built, we can run a container based on this image using the following command:

sudo docker run pupdock

Unfortunately, we hit a brick wall right away:

/puppeteer/node_modules/puppeteer-core/lib/cjs/puppeteer/node/BrowserRunner.js:300
            reject(new Error([
                   ^

Error: Failed to launch the browser process! spawn /root/.cache/puppeteer/chrome/linux-1108766/chrome-linux/chrome ENOENT


TROUBLESHOOTING: https://pptr.dev/troubleshooting

    at onClose (/puppeteer/node_modules/puppeteer-core/lib/cjs/puppeteer/node/BrowserRunner.js:300:20)
    at ChildProcess.<anonymous> (/puppeteer/node_modules/puppeteer-core/lib/cjs/puppeteer/node/BrowserRunner.js:294:24)
    at ChildProcess.emit (node:events:512:28)
    at ChildProcess._handle.onexit (node:internal/child_process:291:12)
    at onErrorNT (node:internal/child_process:483:16)
    at process.processTicksAndRejections (node:internal/process/task_queues:82:21)

Node.js v19.8.1

The error and stack trace are both pretty cryptic, and it’s not clear why this failed. There isn’t much you can do other than beg Stack Overflow for help.

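Fortunately, I’ve done this already myself. The ENOENT in the message means that the Chrome binary Puppeteer tried to spawn could not be found or executed inside this image, so Chrome never even started. As we’ll see, getting a working Chrome into the image is only half the battle: Chrome (which Puppeteer is trying to launch) also needs more security permissions than Docker provides by default, so we’ll need to learn a bit more about Docker security to make this work.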

Dealing with Security Restrictions

So how do we give Chrome the permissions it needs, without compromising the security of the system it is running on? We have a few options we could try.

  • Disable the sandbox. Chrome uses a sandbox to isolate potentially harmful web content and prevent it from gaining access to the underlying operating system (see Sandbox and Linux Sandboxing docs to learn more). Many Stack Overflow answers suggest getting around errors by disabling this entirely. Unless you know what you’re doing, this is probably a terrible idea. It’s far better to relax security a little to allow exactly the permissions you need than to disable it entirely.
  • Use the Puppeteer Docker image. Puppeteer’s documentation on Docker explains how to use Puppeteer’s own Docker images (available on the GitHub Container Registry) to run arbitrary Puppeteer scripts. While there’s not much info on how to work with these (e.g. which folder to mount as a volume in order to grab the generated screenshot), what stands out is that this approach requires the SYS_ADMIN capability, which exposes more permissions than we need.
  • Build your own Docker image. Puppeteer’s Troubleshooting documentation (also available under Chrome Developers docs and Puppeteer docs) has a section on running Puppeteer in Docker. We’ll follow this method, as it is the one that worked best for me.

A Second Attempt Based on the Docs

Our initial attempt with a simple Dockerfile didn’t go very well, but now we have a number of other Dockerfiles we could start with, including the Puppeteer Docker image’s Dockerfile, a couple in the aforementioned doc section on running Puppeteer in Docker (one built on a Debian-based Node image, and another based on an Alpine image), and several other random ones scattered across Stack Overflow answers.

My preference is the Alpine one, not only because Alpine images tend to be smaller than their Debian counterparts, but also because I had more luck getting it to work across Linux, Windows Subsystem for Linux and an M1 Mac than I did with the Debian one. So let’s replace our Dockerfile with the one in the Running on Alpine section, with a few additions at the end:

FROM alpine

# Installs latest Chromium (100) package.
RUN apk add --no-cache \
      chromium \
      nss \
      freetype \
      harfbuzz \
      ca-certificates \
      ttf-freefont \
      nodejs \
      yarn

# Tell Puppeteer to skip installing Chrome. We'll be using the installed package.
ENV PUPPETEER_SKIP_CHROMIUM_DOWNLOAD=true \
    PUPPETEER_EXECUTABLE_PATH=/usr/bin/chromium-browser

# Puppeteer v13.5.0 works with Chromium 100.
RUN yarn add puppeteer@13.5.0

# Add user so we don't need --no-sandbox.
RUN addgroup -S pptruser && adduser -S -G pptruser pptruser \
    && mkdir -p /home/pptruser/Downloads /app \
    && chown -R pptruser:pptruser /home/pptruser \
    && chown -R pptruser:pptruser /app

# Run everything after as non-privileged user.
USER pptruser

WORKDIR /puppeteer
COPY main.js ./
CMD ["node", "main.js"]

Because the given Dockerfile already installs the puppeteer dependency and that’s all we need here, I didn’t bother with the usual npm install, although a more complex script might have additional dependencies to install.

At this point, we can build the Dockerfile and run the resulting image as before:

sudo docker build -t pupdock .
sudo docker run pupdock

The result is that it still doesn’t work, but the error is more easily Googled than the one we had before:

/node_modules/puppeteer/lib/cjs/puppeteer/node/BrowserRunner.js:237
            reject(new Error([
                   ^

Error: Failed to launch the browser process!
Failed to move to new namespace: PID namespaces supported, Network namespace supported, but failed: errno = Operation not permitted


TROUBLESHOOTING: https://github.com/puppeteer/puppeteer/blob/main/docs/troubleshooting.md

    at onClose (/node_modules/puppeteer/lib/cjs/puppeteer/node/BrowserRunner.js:237:20)
    at ChildProcess.<anonymous> (/node_modules/puppeteer/lib/cjs/puppeteer/node/BrowserRunner.js:228:79)
    at ChildProcess.emit (node:events:525:35)
    at ChildProcess._handle.onexit (node:internal/child_process:291:12)

Node.js v18.14.2
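
Both attempts so far end in the same generic “Failed to launch the browser process!” error, with the useful detail buried in the message. The following is a minimal sketch of a helper that pulls that detail out. explainLaunchError is a hypothetical name, it is not part of Puppeteer, and it only recognises the two message fragments quoted in this article:

```javascript
// Hypothetical helper (not a Puppeteer API): maps the two launch failures
// seen in this article to a short hint. Anything else gets a generic reply.
function explainLaunchError(error) {
  const message = String(error && error.message ? error.message : error);

  if (message.includes('ENOENT')) {
    return 'The browser executable could not be found or started at the path Puppeteer tried.';
  }
  if (message.includes('Failed to move to new namespace')) {
    return 'The Chrome sandbox was not permitted to create a new namespace inside the container.';
  }
  return 'Unrecognised launch failure; see the Puppeteer troubleshooting page.';
}

// Example using the message from the second attempt:
const hint = explainLaunchError(new Error(
  'Failed to launch the browser process!\n' +
  'Failed to move to new namespace: PID namespaces supported, ' +
  'Network namespace supported, but failed: errno = Operation not permitted'
));
console.log(hint);
```

In main.js, this could be called from a catch block wrapped around puppeteer.launch(), so that the container logs a readable hint before the stack trace.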

A Little Security Lesson

Googling that error about PID namespaces is what led me to the solution, but it still took a while, because I had to piece together many clues scattered in several places, including:

  • Answer by usethe4ce suggests downloading some chrome.json file and passing it to Docker, but it’s not immediately clear what this is/does.
  • Answer by Riccardo Manzan, based on the one by usethe4ce, provides an example Dockerfile based on Node (not Alpine) and also shows how to pass the chrome.json file both in docker run and docker-compose.
  • GitHub Issue by WhisperingChaos explains what that chrome.json is about.
  • Answer by hidev lists the five system calls that Chrome needs.

To understand all this confusion, we first need to take a step back and understand something about Docker security. Like the Chrome sandbox, Docker has its ways of restricting the extent to which a running Docker container can interact with the host.

Like any Linux process, a container requests whatever it needs from the operating system kernel using system calls. However, a container could be used to wreak havoc on the host if it is allowed to run whatever system calls it wants and then is successfully breached by an attacker. Fortunately, Linux provides a feature called seccomp that can restrict system calls to only the ones that are required, minimising the attack surface.

In Docker, this restriction is applied by means of a seccomp profile, basically a JSON file whitelisting the system calls to be allowed. Docker’s default seccomp profile restricts access to system calls enough that it prevents many known exploits, but this also prevents more complex applications that need additional system calls – such as Chrome – from working under Docker.

That chrome.json I mentioned earlier is a custom seccomp profile painstakingly created by one Jess Frazelle, intended to allow the system calls that Chrome needs but no more than necessary. This should be more secure than disabling the sandbox or running Chrome with the SYS_ADMIN capability.
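
To make the idea of a whitelist more concrete, here is a rough sketch of how such a profile can be inspected. The field names (defaultAction, syscalls, names or name, action) follow Docker’s seccomp JSON format as I understand it. The profile itself is a toy written for this example and is not the content of Docker’s default profile or of chrome.json; unshare appears in it only because it is one of the system calls involved in creating namespaces. isSyscallAllowed is a hypothetical helper that ignores argument filters.

```javascript
// Toy profile for illustration only; NOT Docker's default profile and NOT
// chrome.json. Anything that is not listed falls back to defaultAction.
const toyProfile = {
  defaultAction: 'SCMP_ACT_ERRNO',
  syscalls: [
    { names: ['read', 'write', 'close'], action: 'SCMP_ACT_ALLOW' },
    { name: 'unshare', action: 'SCMP_ACT_ALLOW' } // older single-name entry style
  ]
};

// Returns true if the profile allows the named system call.
// Argument filters and other per-entry conditions are deliberately ignored.
function isSyscallAllowed(profile, syscall) {
  for (const entry of profile.syscalls || []) {
    const names = entry.names || (entry.name ? [entry.name] : []);
    if (names.includes(syscall)) {
      return entry.action === 'SCMP_ACT_ALLOW';
    }
  }
  return profile.defaultAction === 'SCMP_ACT_ALLOW';
}

console.log(isSyscallAllowed(toyProfile, 'unshare')); // true
console.log(isSyscallAllowed(toyProfile, 'mount'));   // false
```

To look at a real profile instead, the toy object could be swapped for JSON.parse(fs.readFileSync('chrome.json', 'utf8')).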

A Third Attempt with chrome.json

Let’s give it a try. Download chrome.json and place it in the same folder as main.js, the Dockerfile and everything else. Then, run the container as follows:

sudo docker run --security-opt seccomp=path/to/chrome.json pupdock

This time, there’s no output at all – which is good, because it means the errors are gone. To ensure that it really worked, we’ll grab the screenshot from inside the stopped container. First, get the container ID by running:

sudo docker container ls -a

Then, copy the screenshot from the container to the current working directory as follows, taking care to replace 7af0d705a751 with the actual ID of the container:

sudo docker cp 7af0d705a751:/puppeteer/screenshot.png ./screenshot.png

Puppeteer worked: the screenshot contains the full length of the page.

Additional Notes

I omitted a few things to avoid breaking the flow of this article, so I’ll mention them here briefly:

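  • That SYS_ADMIN capability we saw mentioned earlier belongs to another Linux security feature: capabilities, which split up the privileges of the root user into smaller units that can be granted individually, and which Docker restricts alongside seccomp profiles.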
  • If you need to pass a custom seccomp profile in a docker-compose.yaml file, see Riccardo Manzan’s Stack Overflow answer for an example.
  • If you run into other issues, use the dumpio launch option to get more verbose output. It’s a little hard to separate the real errors from the noise, but it does help. Be sure to run in headless mode (it doesn’t make sense to run the GUI browser under Docker), and disable the GPU (--disable-gpu) if you see related errors.
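
As a small sketch of that last point, the troubleshooting settings can be gathered into the options object passed to puppeteer.launch(). headless, dumpio and args are Puppeteer launch options; buildLaunchOptions and its debug switch are hypothetical names made up for this example.

```javascript
// Builds an options object for puppeteer.launch().
// `debug` is a hypothetical switch of this sketch, not a Puppeteer setting.
function buildLaunchOptions({ debug = false } = {}) {
  return {
    headless: true,          // no GUI browser inside the container
    dumpio: debug,           // pipe the browser's stdout/stderr into the Node process
    args: ['--disable-gpu']  // only needed if GPU-related errors show up
  };
}

console.log(buildLaunchOptions({ debug: true }));

// In main.js, this would be used as:
//   const browser = await puppeteer.launch(buildLaunchOptions({ debug: true }));
```

Keeping dumpio behind a switch makes it easy to turn the noisy browser output off again once the container works.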

Conclusion

Running Puppeteer under Docker might sound like an unusual requirement, perhaps overkill, but it does open up a window onto the interesting world of Docker security. Chrome’s complexity requires that it be granted more permissions than the typical Docker container.

It is unfortunate that the intricacies around getting Puppeteer to work under Docker are so poorly documented. However, once we learn a little about Docker security – and the Linux security features that it builds on – the solution of using a custom seccomp profile begins to make sense.

Gathering Net Salary Data with Puppeteer

Tax is one of those things that makes moving to a different country difficult, because it varies wildly between countries. How much do you need to earn in that country to maintain the same standard of living?

You can, of course, use an online salary calculator to understand how much net salary you’re left with after deducting tax and social security contributions, but this only lets you sample specific salaries and doesn’t really give you enough information to assess how the impact of tax changes as you earn more. Importantly, you can’t use these tools to draw a graph for each country and compare.

Malta Salary Calculator by Darren Scerri

Fortunately, however, these tools have already done the heavy lifting by taking care of the complex calculations. To build a graph, all we really need to do is to take samples at regular intervals, say, every 1,000 Euros. Since that is very tedious to do by hand, we’ll use a browser automation tool to do this for us.

Enter Puppeteer

Puppeteer, as the homepage says, “is a Node library which provides a high-level API to control Chrome or Chromium”, which is pretty much what we need for this job. It also gives us what we need to get started. In a new folder, run the following to install the puppeteer dependency:

npm i puppeteer

Then, create a new file (e.g. netsalary.js) and add the starter code from the Puppeteer homepage. We’ll use this as a starting point:

const puppeteer = require('puppeteer');

(async () => {
  const browser = await puppeteer.launch();
  const page = await browser.newPage();
  await page.goto('https://example.com');
  await page.screenshot({ path: 'example.png' });

  await browser.close();
})();

Getting Salary Data for Malta

In this particular exercise, we’ll get the salary data for Malta using Darren Scerri’s Malta Salary Calculator, which is relatively easy to work with.

Before we write any code, we need to understand the dynamics of the calculator. We do this via the browser’s developer tools.

Whenever you change the value of the gross salary input field (which has the “salary” id in the HTML), a bunch of numbers get updated, including the yearly net salary (which has the “net-yearly-result” class), which is what we’re interested in.

Just by knowing how we can reach the relevant elements, we can write our first code to retrieve the input (gross salary) and output (yearly net salary) values to make sure we know what we’re doing:

const puppeteer = require('puppeteer');

(async () => {
  const browser = await puppeteer.launch();
  const page = await browser.newPage();
  await page.goto('http://maltasalary.com/');
  
  // Gross salary
  const grossSalaryInput = await page.$("#salary");
  const grossSalary = await page.evaluate(element => element.value, grossSalaryInput);
  console.log('Gross salary: ', grossSalary);
  
  // Net salary
  const netSalaryElement = await page.$('.net-yearly-result');
  const netSalary = await page.evaluate(element => element.textContent, netSalaryElement);
  console.log('Net salary: ', netSalary);

  await browser.close();
})(); 

Here, we’re using the page.$() function to locate an element the same way we would using jQuery. Then we use the page.evaluate() function to get something from that element (in this case, the value of the input field). We do the same for the net salary, with the notable difference that in the page.evaluate() function, we get the textContent property of the element instead.
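
The locate-then-evaluate pattern above can also be folded into a single call with page.$eval(), which finds the first element matching a selector and runs a function on it in the page. The following is just a sketch: readProperty is a hypothetical helper name, and page is assumed to be a Puppeteer Page object like the one created above.

```javascript
// Reads a property from the first element matching `selector`.
// `page` is expected to be a Puppeteer Page. page.$eval() rejects if no
// element matches, so a wrong selector fails loudly.
async function readProperty(page, selector, property = 'textContent') {
  return page.$eval(selector, (element, prop) => element[prop], property);
}

// Usage inside the async function of netsalary.js:
//   const grossSalary = await readProperty(page, '#salary', 'value');
//   const netSalary = await readProperty(page, '.net-yearly-result');
```

The property name is passed as an extra argument because the callback runs inside the browser, where variables from the Node.js script are not visible.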

If we run this (node netsalary.js), we should get the same default values we see in the online salary calculator:

We managed to retrieve the gross and net salaries from the online calculator.

Text Entry

That was easy enough, but it used the default values that are present when the page is loaded. How do we manipulate the input field so that we can enter arbitrary gross salary values and later pick up the computed net salary?

The simplest way to do this is by simulating keyboard input as follows:

const puppeteer = require('puppeteer');

(async () => {
  const browser = await puppeteer.launch();
  const page = await browser.newPage();
  await page.goto('http://maltasalary.com/');
  
  const grossSalary = 30000;
  
  // Gross salary - keyboard input
  await page.focus("#salary");
  
  for (var i = 0; i < 6; i++)
    await page.keyboard.press('Backspace');
  
  await page.keyboard.type(grossSalary.toString());
  
  // Net salary
  const netSalaryElement = await page.$('.net-yearly-result');
  const netSalary = await page.evaluate(element => element.textContent, netSalaryElement);
  console.log('Net salary: ', netSalary);

  await browser.close();
})(); 

Here, we:

  1. Focus the input field, so that whatever we type goes in there.
  2. Press backspace six times to erase any existing gross salary in the field (if you check the online calculator, you’ll see it can take up to six digits).
  3. Type in the string version of our gross salary, which is a hardcoded constant with a value of 30,000.
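
Pressing backspace a fixed number of times ties the script to the field’s maximum length. One possible alternative, sketched below, is to select the existing text with a triple click and type over it. This assumes that a triple click selects the whole value of the field (the usual browser behaviour for a single-line input) and uses the clickCount option of page.click() as found in the Puppeteer versions used in these articles. replaceInputValue is a hypothetical helper name.

```javascript
// Replaces the contents of a text input without assuming its length.
// Assumption: a triple click selects the field's entire current value.
async function replaceInputValue(page, selector, value) {
  await page.click(selector, { clickCount: 3 }); // focus the field and select its text
  await page.keyboard.type(String(value));       // typing overwrites the selection
}

// Usage inside the async function of netsalary.js:
//   await replaceInputValue(page, '#salary', 30000);
```

I haven’t tried this against the calculator itself, so treat it as a starting point rather than a drop-in replacement for the backspace loop.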

The result I get when I run this matches what the online calculator gives me. I guess I must be doing something right for once in my life.

Net salary:  22,805.44

Pulling Net Salary Data in a Range

So now we know how to enter a gross salary and read out the corresponding net salary. How do we do this at regular intervals within a range (e.g. every 1,000 Euros between 15,000 and 140,000)? Easy. We write a loop.

In practice, there’s a little timing issue between iterations, so I also needed to nick a very handy sleep function off Stack Overflow and put a very short delay after doing the keyboard input, to give it time to update the output values.

const puppeteer = require('puppeteer');

function sleep(ms) {
  return new Promise(resolve => setTimeout(resolve, ms));
}

(async () => {
  const browser = await puppeteer.launch();
  const page = await browser.newPage();
  await page.goto('http://maltasalary.com/');
  
  console.log('Gross Net');
  
  for (var grossSalary = 15000; grossSalary <= 140000; grossSalary += 1000) {
    // Gross salary - keyboard input
    await page.focus("#salary");
  
    for (var i = 0; i < 6; i++)
      await page.keyboard.press('Backspace');
  
    await page.keyboard.type(grossSalary.toString());
    await sleep(10);
  
    // Net salary
    const netSalaryElement = await page.$('.net-yearly-result');
    const netSalary = await page.evaluate(element => element.textContent, netSalaryElement);

    console.log(grossSalary, netSalary);
  }

  await browser.close();
})(); 

This has the effect of outputting a pair of headings (“Gross Net”) followed by gross and net salary pairs:

Outputting the gross and net salaries in steps of 1,000 Euros (gross) at a time.

Making a Graph

Now that we have a program that spits out pairs of gross and net salaries, we can make a graph out of this data. First, we dump all this into a file.

node netsalary.js > malta.csv

Although this is technically not really CSV data, it’s still very easy to open in spreadsheet software. For instance, when you open this file using LibreOffice Calc, you get the Text Import screen where you can choose to use space as the separator. This makes things easier given that the net salaries contain commas.

Choose Space as the separator to load the data correctly.
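
If you would rather produce real CSV, one option is to strip the thousands separators from the net salary before printing it. The sketch below assumes the net salary text contains only digits, commas as thousands separators and a decimal point, as in the “22,805.44” output seen earlier. toCsvLine is a hypothetical helper name.

```javascript
// Converts a gross salary and the calculator's formatted net salary text
// (e.g. "22,805.44") into one comma-separated line.
function toCsvLine(grossSalary, netSalaryText) {
  const cleaned = netSalaryText.replace(/,/g, '').trim();
  const net = Number(cleaned);
  if (cleaned === '' || Number.isNaN(net)) {
    throw new Error(`Cannot parse net salary: "${netSalaryText}"`);
  }
  return `${grossSalary},${net.toFixed(2)}`;
}

console.log('Gross,Net');
console.log(toCsvLine(30000, '22,805.44')); // 30000,22805.44
```

In netsalary.js, the two console.log calls in the script would then print 'Gross,Net' and toCsvLine(grossSalary, netSalary) instead of the space-separated values.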

Once the data is in a spreadsheet, producing a chart is a relatively simple matter:

Graph showing how net salary changes with gross salary in Malta.

Now, this graph might look a little lonely, but you can already gather interesting insight by noticing its gradient and the fact that it isn’t entirely straight.

After doing this exercise for multiple countries, it’s fascinating to see how their lines compare when plotted on the same chart.

Aside from the allure of data analysis, I hope this article served to show how easy it is to use Puppeteer to perform simple browser automation, beyond the obvious UI automation testing.