Jupyter notebooks for Spark with customised Docker containers

When we work with Spark, we usually want to prototype first and check that everything works as expected before spinning up big machines.
I spent an afternoon googling and repeatedly starting and stopping the Docker container before I finally had a few lines of configuration working.
So I want to share my basic local setup here; maybe it will save someone some time.

When looking for a Docker image with Spark and Jupyter, we quickly find the jupyter/pyspark-notebook image.
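
To take the stock image for a quick spin before customising anything, a single docker run is enough (the access token is printed in the container logs):

docker run --rm -p 8888:8888 jupyter/pyspark-notebook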

In my case I need to access AWS, so the Docker image needs some additional libraries.
To add them, I created a new Dockerfile based on pyspark-notebook.
The additional libraries are boto3 for AWS and python-dotenv for reading environment variables.
I decided to install boto3 with apt-get, so it is installed at the operating-system level. Make sure to add the -y flag, which answers yes to anything the operating system asks during the install process.
python-dotenv is added via a requirements.txt, so it is installed with pip, the Python package manager.
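
For this setup the requirements.txt only needs a single entry (pin a version if you want reproducible builds):

# Python packages installed with pip inside the image
python-dotenv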

Normally the notebooks require a token, but when we develop locally, we want to open Jupyter quickly and stay on the same site, without having to look up a new token every time we change something.
So we need a custom configuration for that (only do this for local development, since it disables authentication):

{
    "NotebookApp": {
        "allow_root": true,
        "token": ""
    }
}

In the Dockerfile we copy everything we need into the /home/jovyan/ directory. After some more googling I found out that jovyan is the default user in the Jupyter Docker images; the name is a play on "Jovian", meaning "related to Jupiter". Just in case you were also wondering. If the token prompt still shows up, try copying the config into /home/jovyan/.jupyter/ instead, since that is Jupyter's default config directory.

The final Dockerfile looks like this:

FROM jupyter/pyspark-notebook
USER root

# add needed packages
RUN apt-get update && apt-get install -y python3-boto3

# Install Python requirements
COPY requirements.txt /home/jovyan/
RUN pip install -r /home/jovyan/requirements.txt

COPY jupyter_lab_config.json /home/jovyan/

In the docker-compose.yaml we

  • need to map the ports,
  • map the volumes to save the notebooks locally, otherwise everything would be lost once we shut down the container,
  • tell Docker where the .env file is located,
  • tell Docker to build the Dockerfile in the same folder, instead of using an image.

The final docker-compose.yaml looks like this:

version: "3.7"

services:
  # jupyterlab with pyspark
  pyspark:
    #image: jupyter/pyspark-notebook
    build: .
    env_file: 
      - .env
    environment:
      JUPYTER_ENABLE_LAB: "yes"
    ports:
      - "8888:8888"
    volumes:
      - ./data:/home/jovyan/work

# docker run --rm -p 10000:8888 -e JUPYTER_ENABLE_LAB=yes -v "$PWD":/home/jovyan/work jupyter/pyspark-notebook
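
The .env file itself should never be committed to the repo, since it holds your credentials. A minimal sketch, assuming you authenticate with a plain access key pair (the values below are placeholders):

# AWS credentials picked up by boto3 inside the container
AWS_ACCESS_KEY_ID=your-access-key-id
AWS_SECRET_ACCESS_KEY=your-secret-access-key
AWS_DEFAULT_REGION=eu-west-1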

To start the container use docker-compose up. If you changed something in the config, use docker-compose up --force-recreate --build to make sure the changes are built.
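
Once the container is up, a first notebook cell could look something like the sketch below; the MY_BUCKET variable is a hypothetical extra entry in the .env file, everything else is standard boto3 and PySpark:

import os

import boto3
from dotenv import load_dotenv
from pyspark.sql import SparkSession

# load_dotenv() is effectively a no-op inside the container, because
# docker-compose already injected the variables from the .env file,
# but it keeps the notebook runnable outside the container too
load_dotenv()

# boto3 reads AWS_ACCESS_KEY_ID / AWS_SECRET_ACCESS_KEY from the environment
s3 = boto3.client("s3")
response = s3.list_objects_v2(Bucket=os.environ["MY_BUCKET"])
for obj in response.get("Contents", []):
    print(obj["Key"])

# small local SparkSession for prototyping
spark = SparkSession.builder.master("local[*]").appName("prototype").getOrCreate()
df = spark.createDataFrame([(1, "a"), (2, "b")], ["id", "value"])
df.show()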

Have fun.

You can also find the code here.
