Running Puppeteer under Docker

Puppeteer is a library that enables browser automation via Chrome/Chromium. Quoting its homepage:

“Puppeteer is a Node.js library which provides a high-level API to control Chrome/Chromium over the DevTools Protocol. Puppeteer runs in headless mode by default, but can be configured to run in full (non-headless) Chrome/Chromium.”

I showed it in action a couple of years ago in my article, “Gathering Net Salary Data with Puppeteer”, where I used it for web scraping. Browser automation is also very common for automated testing of web applications, and has plenty of other uses besides.

As with any other piece of software, it is sometimes convenient to package a Puppeteer script in a Docker container. However, since deploying a browser is fundamentally more complicated than your average API, Puppeteer and Docker are a little tricky to get working together. In this article, we’ll see why this combination is problematic and how to solve it.

A Minimal Puppeteer Example

Before we embark upon our Docker journey, we need a simple Puppeteer program we can test with. The easiest thing we can do is use Puppeteer to open a webpage and take a screenshot of it. We can do this quite easily as follows.

First, we need to create a folder, install prerequisites, and create a file in which to put our JavaScript code. The following bash script takes care of all this, assuming you already have Node.js installed. Please note that I am working on Linux Kubuntu 22.04, so if you’re using a radically different operating system, the steps may vary a little.

mkdir pupdock
cd pupdock
npm install puppeteer
touch main.js

Next, open main.js with your favourite text editor or IDE and use the following code (adapted from “Gathering Net Salary Data with Puppeteer”):

const puppeteer = require('puppeteer');
 
(async () => {
  const browser = await puppeteer.launch();
  const page = await browser.newPage();
  await page.goto('https://www.programmersranch.com');
  await page.screenshot({ path: 'screenshot.png' });
  await browser.close();
})();

To see that it works, run this script using the following command:

node main.js

A file called screenshot.png should be saved in the same folder:

A screenshot of my older tech blog, Programmer’s Ranch, complete with Blogger’s stupid cookie banner.

An Initial Attempt with Docker

Now that we know that the script works, let’s try and make a Docker image out of it. Add a Dockerfile with the following contents:

FROM node:19-alpine3.16
WORKDIR /puppeteer
COPY main.js package.json package-lock.json ./
RUN npm install
CMD ["node", "main.js"]

What we’re doing here is starting from a recent (at the time of writing this article) Node Docker image. Then we set the current working directory to a folder called /puppeteer. We copy the script along with the list of package dependencies (Puppeteer, basically) into the image, install those dependencies, and set up the image to execute node main.js when it is run.

We can build the Docker image using the following command, giving it the tag of “pupdock” so that we can easily find it later. The sudo command is only necessary if you’re running on Linux.

sudo docker build -t pupdock .

Once the image is built, we can run a container based on this image using the following command:

sudo docker run pupdock

Unfortunately, we hit a brick wall right away:

/puppeteer/node_modules/puppeteer-core/lib/cjs/puppeteer/node/BrowserRunner.js:300
            reject(new Error([
                   ^

Error: Failed to launch the browser process! spawn /root/.cache/puppeteer/chrome/linux-1108766/chrome-linux/chrome ENOENT


TROUBLESHOOTING: https://pptr.dev/troubleshooting

    at onClose (/puppeteer/node_modules/puppeteer-core/lib/cjs/puppeteer/node/BrowserRunner.js:300:20)
    at ChildProcess.<anonymous> (/puppeteer/node_modules/puppeteer-core/lib/cjs/puppeteer/node/BrowserRunner.js:294:24)
    at ChildProcess.emit (node:events:512:28)
    at ChildProcess._handle.onexit (node:internal/child_process:291:12)
    at onErrorNT (node:internal/child_process:483:16)
    at process.processTicksAndRejections (node:internal/process/task_queues:82:21)

Node.js v19.8.1

The error and stack trace are both pretty cryptic, and it’s not clear why this failed. There isn’t much you can do other than beg Stack Overflow for help.

Fortunately, I’ve done this already myself, and I can tell you that it fails because Chrome (which Puppeteer is trying to launch) needs more security permissions than Docker provides by default. We’ll need to learn a bit more about Docker security to make this work.

Dealing with Security Restrictions

So how do we give Chrome the permissions it needs, without compromising the security of the system it is running on? We have a few options we could try.

  • Disable the sandbox. Chrome uses a sandbox to isolate potentially harmful web content and prevent it from gaining access to the underlying operating system (see Sandbox and Linux Sandboxing docs to learn more). Many Stack Overflow answers suggest getting around errors by disabling this entirely. Unless you know what you’re doing, this is probably a terrible idea. It’s far better to relax security a little to allow exactly the permissions you need than to disable it entirely.
  • Use the Puppeteer Docker image. Puppeteer’s documentation on Docker explains how to use Puppeteer’s own Docker images (available on the GitHub Container Registry) to run arbitrary Puppeteer scripts (see the sketch after this list). While there’s not much info on how to work with these (e.g. which folder to mount as a volume in order to grab the generated screenshot), what stands out is that this approach requires the SYS_ADMIN capability, which grants more permissions than we need.
  • Build your own Docker image. Puppeteer’s Troubleshooting documentation (also available under Chrome Developers docs and Puppeteer docs) has a section on running Puppeteer in Docker. We’ll follow this method, as it is the one that worked best for me.
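
For reference, running a script against Puppeteer’s own image (the second option above) typically looks something like this. Treat it as a hedged sketch based on my reading of the Puppeteer Docker docs rather than a definitive invocation; double-check the docs for the exact image tag and flags before relying on it:

sudo docker run -i --init --rm --cap-add=SYS_ADMIN \
  ghcr.io/puppeteer/puppeteer:latest \
  node -e "$(cat main.js)"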

A Second Attempt Based on the Docs

Our initial attempt with a simple Dockerfile didn’t go very well, but now we have a number of other Dockerfiles we could start with, including the Puppeteer Docker image’s Dockerfile, a couple in the aforementioned doc section on running Puppeteer in Docker (one built on a Debian-based Node image, and another based on an Alpine image), and several other random ones scattered across Stack Overflow answers.

My preference is the Alpine one, not only because Alpine images tend to be smaller than their Debian counterparts, but also because I had more luck getting it to work across Linux, Windows Subsystem for Linux and an M1 Mac than I did with the Debian one. So let’s replace our Dockerfile with the one in the Running on Alpine section, with a few additions at the end:

FROM alpine

# Installs latest Chromium (100) package.
RUN apk add --no-cache \
      chromium \
      nss \
      freetype \
      harfbuzz \
      ca-certificates \
      ttf-freefont \
      nodejs \
      yarn

# Tell Puppeteer to skip installing Chrome. We'll be using the installed package.
ENV PUPPETEER_SKIP_CHROMIUM_DOWNLOAD=true \
    PUPPETEER_EXECUTABLE_PATH=/usr/bin/chromium-browser

# Puppeteer v13.5.0 works with Chromium 100.
RUN yarn add puppeteer@13.5.0

# Add user so we don't need --no-sandbox.
RUN addgroup -S pptruser && adduser -S -G pptruser pptruser \
    && mkdir -p /home/pptruser/Downloads /app \
    && chown -R pptruser:pptruser /home/pptruser \
    && chown -R pptruser:pptruser /app

# Run everything after as non-privileged user.
USER pptruser

WORKDIR /puppeteer
COPY main.js ./
CMD ["node", "main.js"]

Because the given Dockerfile already installs the puppeteer dependency, and that’s all we need here, I didn’t even bother with the usual npm install step, although a more complex script might well have additional dependencies of its own to install.
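
If your script does have its own dependencies, one way to handle them (a rough sketch that I haven’t tested as part of this article) is to copy the package manifests and run npm install before the USER pptruser line, while the build still has root permissions, and then hand the folder over to pptruser:

# Hypothetical additions, placed just before the USER pptruser line
WORKDIR /puppeteer
COPY package.json package-lock.json ./
RUN npm install \
    && chown -R pptruser:pptruser /puppeteer

The existing COPY main.js ./ and CMD lines at the end then remain exactly as they are.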

At this point, we can build the Dockerfile and run the resulting image as before:

sudo docker build -t pupdock .
sudo docker run pupdock

The result is that it still doesn’t work, but the error is something that is more easily Googled than the one we got before:

/node_modules/puppeteer/lib/cjs/puppeteer/node/BrowserRunner.js:237
            reject(new Error([
                   ^

Error: Failed to launch the browser process!
Failed to move to new namespace: PID namespaces supported, Network namespace supported, but failed: errno = Operation not permitted


TROUBLESHOOTING: https://github.com/puppeteer/puppeteer/blob/main/docs/troubleshooting.md

    at onClose (/node_modules/puppeteer/lib/cjs/puppeteer/node/BrowserRunner.js:237:20)
    at ChildProcess.<anonymous> (/node_modules/puppeteer/lib/cjs/puppeteer/node/BrowserRunner.js:228:79)
    at ChildProcess.emit (node:events:525:35)
    at ChildProcess._handle.onexit (node:internal/child_process:291:12)

Node.js v18.14.2

A Little Security Lesson

Googling that error about PID namespaces is what led me to the solution, but it still took a while, because I had to piece together many clues scattered in several places, including:

  • Answer by usethe4ce suggests downloading some chrome.json file and passing it to Docker, but it’s not immediately clear what this is/does.
  • Answer by Riccardo Manzan, based on the one by usethe4ce, provides an example Dockerfile based on Node (not Alpine) and also shows how to pass the chrome.json file both in docker run and docker-compose.
  • GitHub Issue by WhisperingChaos explains what that chrome.json is about.
  • Answer by hidev lists the five system calls that Chrome needs.

To make sense of all this, we first need to take a step back and understand something about Docker security. Like the Chrome sandbox, Docker has its ways of restricting the extent to which a running Docker container can interact with the host.

Like any Linux process, a container requests whatever it needs from the operating system kernel using system calls. However, a container that is allowed to run whatever system calls it wants could be used to wreak havoc on the host if it is ever breached by an attacker. Fortunately, Linux provides a feature called seccomp that can restrict the available system calls to only the ones that are actually required, minimising the attack surface.

In Docker, this restriction is applied by means of a seccomp profile, basically a JSON file whitelisting the system calls to be allowed. Docker’s default seccomp profile restricts access to system calls enough that it prevents many known exploits, but this also prevents more complex applications that need additional system calls – such as Chrome – from working under Docker.

That chrome.json I mentioned earlier is a custom seccomp profile painstakingly created by one Jess Frazelle, intended to allow the system calls that Chrome needs but no more than necessary. This should be more secure than disabling the sandbox or running Chrome with the SYS_ADMIN capability.
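
To give an idea of what such a profile looks like, here is a heavily simplified illustration (not the actual contents of chrome.json, which whitelists hundreds of system calls): a default action that denies everything, plus a whitelist of allowed calls.

{
  "defaultAction": "SCMP_ACT_ERRNO",
  "architectures": ["SCMP_ARCH_X86_64"],
  "syscalls": [
    {
      "names": ["clone", "setns", "unshare"],
      "action": "SCMP_ACT_ALLOW"
    }
  ]
}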

A Third Attempt with chrome.json

Let’s give it a try. Download chrome.json and place it in the same folder as main.js, the Dockerfile and everything else. Then, run the container as follows:

sudo docker run --security-opt seccomp=path/to/chrome.json pupdock

This time, there’s no output at all – which is good, because it means the errors are gone. To ensure that it really worked, we’ll grab the screenshot from inside the stopped container. First, get the container ID by running:

sudo docker container ls -a

Then, copy the screenshot from the container to the current working directory as follows, taking care to replace 7af0d705a751 with the actual ID of the container:

sudo docker cp 7af0d705a751:/puppeteer/screenshot.png ./screenshot.png

Puppeteer worked: the screenshot contains the full length of the page.

Additional Notes

I omitted a few things to avoid breaking the flow of this article, so I’ll mention them here briefly:

  • That SYS_ADMIN capability we saw mentioned earlier belongs to another Linux security feature: capabilities, which are also related to seccomp profiles.
  • If you need to pass a custom seccomp profile in a docker-compose.yaml file, see Riccardo Manzan’s Stack Overflow answer for an example.
  • If you run into other issues, use the dumpio setting to get more verbose output (see the sketch below). It’s a little hard to separate the real errors from the noise, but it does help. Be sure to run in headless mode (it doesn’t make sense to run the GUI browser under Docker), and disable the GPU (--disable-gpu) if you see related errors.

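To illustrate that last point, here’s a hedged sketch of how the launch call in main.js could be adjusted while debugging. The headless, dumpio and args options are standard puppeteer.launch() options; treat the specific flags as things to tweak for your own situation:

const browser = await puppeteer.launch({
  headless: true,          // no GUI under Docker
  dumpio: true,            // pipe Chrome's stdout/stderr through to the Node.js process
  args: ['--disable-gpu'], // only add this if you actually see GPU-related errors
});
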
Conclusion

Running Puppeteer under Docker might sound like an unusual requirement, perhaps overkill, but it does open up a window onto the interesting world of Docker security. Chrome’s complexity requires that it be granted more permissions than the typical Docker container.

It is unfortunate that the intricacies around getting Puppeteer to work under Docker are so poorly documented. However, once we learn a little about Docker security – and the Linux security features that it builds on – the solution of using a custom seccomp profile begins to make sense.

Filebeat, Elasticsearch and Kibana with Docker Compose

Docker is one of those tools I wish I had learned to use a long time ago. I still remember how painful it always was to set up Elasticsearch on Linux, or to set up both Elasticsearch and Kibana on Windows, and then occasionally having to repeat the process to upgrade or recreate the Elastic stack.

Fortunately, Docker images now exist for all Elastic stack components including Elasticsearch, Kibana and Filebeat, so it’s easy to spin up a container, or to recreate the stack entirely in a matter of seconds.

Getting them to work together, however, is not trivial. Security is enabled by default from Elasticsearch 8.0 onwards, so you’ll need SSL certificates, and the examples you’ll find on the internet using docker-compose from the Elasticsearch 7.x era won’t work. Although the Elasticsearch docs provide an example docker-compose.yml that includes Elasticsearch and Kibana with certificates, this doesn’t include Filebeat.

In this article, I’ll show you how to tweak this docker-compose.yml to run Filebeat alongside Elasticsearch and Kibana.

  • I’ll be doing this with Elastic stack 8.4 on Linux, so if you’re on Windows or Mac, drop the sudo from in front of the commands.
  • You can find the relevant files for this article in the FekDockerCompose folder at the Gigi Labs BitBucket Repository.
  • This is merely a starting point and by no means production-ready.
  • A lot of things can go wrong along the way, so I’ve included a lot of troubleshooting steps.

The Doc Samples

The “Install Elasticsearch with Docker” page at the official Elasticsearch documentation is a great starting point to run Elasticsearch with Docker. The section “Start a multi-node cluster with Docker Compose” provides what you need to run a three-node Elasticsearch cluster with Kibana in Docker using docker-compose.

The first step is to copy the sample .env file and fill in any values you like for the ELASTIC_PASSWORD and KIBANA_PASSWORD settings, such as the following (don’t use these values in production):

# Password for the 'elastic' user (at least 6 characters)
ELASTIC_PASSWORD=elastic

# Password for the 'kibana_system' user (at least 6 characters)
KIBANA_PASSWORD=kibana

# Version of Elastic products
STACK_VERSION=8.4.0

# Set the cluster name
CLUSTER_NAME=docker-cluster

# Set to 'basic' or 'trial' to automatically start the 30-day trial
LICENSE=basic
#LICENSE=trial

# Port to expose Elasticsearch HTTP API to the host
ES_PORT=9200
#ES_PORT=127.0.0.1:9200

# Port to expose Kibana to the host
KIBANA_PORT=5601
#KIBANA_PORT=80

# Increase or decrease based on the available host memory (in bytes)
MEM_LIMIT=1073741824

# Project namespace (defaults to the current folder name if not set)
#COMPOSE_PROJECT_NAME=myproject

Next, copy the sample docker-compose.yml. This is a large file so I won’t include it here, but in case the documentation changes, you can find an exact copy at the time of writing as docker-compose-original.yml in the aforementioned BitBucket repo.

Once you have both the .env and docker-compose.yml files, you can run the following command to spin up a three-node Elasticsearch cluster and Kibana:

sudo docker-compose up

You’ll see a lot of output and, after a while, if you access http://localhost:5601/, you should be able to see the Kibana login screen:

The Kibana login screen.

Troubleshooting tip: Unhealthy containers

It can happen that some of the containers fail to start up and claim to be “unhealthy”, without offering a reason. You can find out more by taking the container ID (provided in the error in the output) and running:

sudo docker logs <containerId>

Chances are that the error you’ll see in the logs will be this:

bootstrap check failure [1] of [1]: max virtual memory areas vm.max_map_count [65530] is too low, increase to at least [262144]

This is in fact explained in the same documentation page and elaborated in another one. Run the following command to fix it on Linux, or refer to the documentation for other OSes:

sudo sysctl -w vm.max_map_count=262144
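
Note that sysctl -w only applies the setting until the next reboot. If you want it to persist (an optional extra step; the exact file location may vary across distributions), you can append it to /etc/sysctl.conf and reload:

echo "vm.max_map_count=262144" | sudo tee -a /etc/sysctl.conf
sudo sysctl -p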

Adding Filebeat to docker-compose.yml

The sample docker-compose.yml consists of five services: setup, es01, es02, es03 and kibana. While the documentation already explains how to Run Filebeat on Docker, what we need here is to run it alongside Elasticsearch and Kibana. The first step to do that is to add a service for it in the docker-compose.yml, after kibana:

  filebeat:
    depends_on:
      es01:
        condition: service_healthy
      es02:
        condition: service_healthy
      es03:
        condition: service_healthy
    image: docker.elastic.co/beats/filebeat:${STACK_VERSION}
    container_name: filebeat
    volumes:
      - ./filebeat.yml:/usr/share/filebeat/filebeat.yml
      - ./test.log:/var/log/app_logs/test.log
      - certs:/usr/share/elasticsearch/config/certs
    environment:
      - ELASTICSEARCH_HOSTS=https://es01:9200
      - ELASTICSEARCH_USERNAME=elastic
      - ELASTICSEARCH_PASSWORD=${ELASTIC_PASSWORD}
      - ELASTICSEARCH_SSL_CERTIFICATEAUTHORITIES=config/certs/ca/ca.crt

The most interesting part of this is the volumes:

  • filebeat.yml: this is how we’ll soon be passing Filebeat its configuration.
  • test.log: we’re including this example file just to see that Filebeat actually works.
  • certs: this is the same as in all the other services and is part of what allows them to communicate securely using SSL certificates.

Generating a Certificate for Filebeat

The setup service in docker-compose.yml has a script that generates the certificates used by all the Elastic stack services defined there. It creates a file at config/certs/instances.yml specifying what certificates are needed, and passes that to the bin/elasticsearch-certutil command to create them. We can follow the same pattern as the other services in instances.yml to create a certificate for Filebeat:

          "  - name: es03\n"\
          "    dns:\n"\
          "      - es03\n"\
          "      - localhost\n"\
          "    ip:\n"\
          "      - 127.0.0.1\n"\
          "  - name: filebeat\n"\
          "    dns:\n"\
          "      - es03\n"\
          "      - localhost\n"\
          "    ip:\n"\
          "      - 127.0.0.1\n"\
          > config/certs/instances.yml;

Configure Filebeat

Create a file called filebeat.yml, and configure the input section as follows:

filebeat.inputs:
- type: filestream
  id: my-application-logs
  enabled: true
  paths:
    - /var/log/app_logs/*.log

Here, we’re using a filestream input to pick up any files ending in .log from the /var/log/app_logs/ folder. This path is arbitrary (as is the id), but it’s important that it corresponds to the location where we’re mounting the test.log file in docker-compose.yml:

    volumes:
      - ./filebeat.yml:/usr/share/filebeat/filebeat.yml
      - ./test.log:/var/log/app_logs/test.log
      - certs:/usr/share/elasticsearch/config/certs

While you’re at it, create the test.log file with any lines of text, such as the following:

Log line 1
Air Malta sucks
Log line 3

Back in filebeat.yml, we also need to configure the output so that Filebeat connects to Elasticsearch using not only the Elasticsearch username and password, but also the certificates generated as a result of the previous section:

output.elasticsearch:
  hosts: '${ELASTICSEARCH_HOSTS:elasticsearch:9200}'
  username: '${ELASTICSEARCH_USERNAME:}'
  password: '${ELASTICSEARCH_PASSWORD:}'
  ssl:
    certificate_authorities: "/usr/share/elasticsearch/config/certs/ca/ca.crt"
    certificate: "/usr/share/elasticsearch/config/certs/filebeat/filebeat.crt"
    key: "/usr/share/elasticsearch/config/certs/filebeat/filebeat.key"

Troubleshooting tip: Peeking inside a container

In case you’re wondering where I got those certificate paths from, I originally looked inside the container to see where the certificates were being generated for the other services. You can get a container ID with docker ps, and then access the container as follows:

sudo docker exec -it <containerId> /bin/bash
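
Once inside, a quick directory listing is usually enough to see what was generated. For instance, based on the certs volume mount used in this docker-compose.yml:

ls /usr/share/elasticsearch/config/certs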

More Advanced Filebeat Configurations

Although we’re using a basic filestream input in this example to keep things simple, Filebeat can be configured to gather logs from a wide variety of data sources, ranging from web servers to cloud providers, thanks to its modules.

A good way to explore the possibilities is to download a copy of Filebeat and sift through all the different YAML configuration files that are provided as reference material.
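
Another way to explore is via Filebeat’s modules subcommand. For example (shown here as a general illustration; inside this particular container setup you may need to point it at the right configuration file):

filebeat modules list
filebeat modules enable nginx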

Running It All

It’s now time to run docker-compose with Filebeat running alongside Kibana and the three-node Elasticsearch cluster:

sudo docker-compose up

Troubleshooting tip: Recreating certificates

The setup service has a check that won’t create the certificates again if it has already been run (it looks for the config/certs/certs.zip file). So if you’ve already run docker-compose up before, you’ll need to recreate these certificates in order to get the one for Filebeat. The easiest way to do that is to clear out the volumes associated with this docker-compose setup:

sudo docker-compose down --volumes

Troubleshooting tip: filebeat.yml permissions

It’s also possible to get the following error:

filebeat | Exiting: error loading config file: config file ("filebeat.yml") can only be writable by the owner but the permissions are "-rw-rw-r--" (to fix the permissions use: 'chmod go-w /usr/share/filebeat/filebeat.yml')

The solution is, of course, to heed the error’s advice and run the following command (on your host machine, not in the container):

chmod go-w filebeat.yml

Troubleshooting tip: Checking individual container logs

The logs coming from all the different services can be overwhelming, and the verbose JSON structure doesn’t help. If you suspect there’s a problem with a specific container (e.g. Filebeat), you can see the logs for that specific service as follows:

sudo docker-compose logs -f filebeat

You can of course still use sudo docker logs <containerId> if you want, but this alternative puts the name of the service before each log line, and some terminals colour it. This at least helps to visually distinguish one line from another.

Output of sudo docker-compose logs -f filebeat.

Verifying Log Data in Kibana

You only know Filebeat really worked if you see the data in Kibana. Fire up http://localhost:5601/ in a browser and log in using “elastic” as the username, and whatever password you set up in the .env file (in this example it’s also “elastic” for simplicity).

The first test I usually do is to check whether an index has actually been created at all, because if it hasn’t, you can search all you want in Discover and you’re not going to find anything.

Click the hamburger menu in the top-left, scroll down a bit, and click on “Dev Tools”. There, enter the following query and run it (by clicking the Play button or hitting Ctrl+Enter):

GET _cat/indices

If you see an index whose name contains “filebeat” in the results panel on the right, then that’s encouraging.

GET _cat/indices shows that we have a Filebeat index.
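
If you want to peek at the actual documents from Dev Tools before moving on to Discover, a simple search against the Filebeat index works too (the filebeat-* pattern here is an assumption based on Filebeat’s default index naming):

GET filebeat-*/_search
{
  "size": 3
}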

Now that we know that some data exists, click the hamburger menu at the top-left corner again and go to “Discover” (the first item). There, you’ll be prompted to create a “data view” (if you don’t have any data, you’ll be shown a different prompt offering integrations instead). If I understand correctly, a “data view” is what used to be called an “index pattern”.

At Discover, you’re asked to create a data view.

Click on the “Create data view” button.

Creating the data view, whatever it is.

You can give the data view a name and an index pattern. I suppose the name is arbitrary. For the index pattern, I still use filebeat-* (you’ll see the index name on the right turn bold as you type, indicating that it’s matching), although I’m not sure whether the wildcard actually makes a difference now that the index is some new thing called a data stream.

The timestamp field gets chosen automatically for you, so no need to change anything there. Just click on the “Save data view to Kibana” button. You should now be able to enjoy your lovely data.

Viewing data ingested via Filebeat in the Discover section of Kibana.

Troubleshooting tip: Time range

If you don’t see any data in Discover, it doesn’t necessarily mean something went wrong. The default time range of “last 15 minutes” means you might not see any data if there wasn’t any indexed recently. Simply adjust it to a longer period (e.g. last 2 hours).

Conclusion

The Elastic stack is a wonderful set of tools, but its power comes with a lot of complexity. Docker makes it easier to run the stack, but it’s often difficult to find guidance even on simple scenarios like this. I’m hoping that this article makes things a little easier for other people wanting to run Filebeat alongside Elasticsearch and Kibana in Docker.