Paperless NGX

Paper is a nightmare. Environmental issues aside, there’s no excuse for it in the 21^st century. It takes up space, collects dust, doesn’t support any form of a search feature, indexing is tedious, it’s heavy and prone to damage & loss. ¹ Paperless NGX will OCR, index, and store data about your documents so they are easy to search and view, unlike that hulking metal file cabinet you have in your office.

Paperless-ngx Screenshot

What's this fork 🍴 thing about, and is it Paperless, Paperless-NG, or Paperless-NGX?

It's now.. Paperless-NGX. Paperless-ngx is a fork of paperless-ng, which itself was a fork of paperless. As I understand it, the original "forker" of paperless to paperless-ng has "gone dark", and stopped communicating, so while all are hopeful that he's OK and just busy/distracted, the community formed paperless-ngx to carry on development work under a shared responsibility model. To save some typing though, we'll just call it "Paperless", although you'll note belowe that we're using the linuxserver paperless-ngx image. (Also, if you use the automated tooling in the Premix Repo, Ansible really doesn't like the hypen!)

Paperless NGX Requirements

Ingredients

Already deployed:

Docker swarm cluster with persistent shared storage
Traefik configured per design
DNS entry for the hostname you intend to use (or a wildcard), pointed to your keepalived IP

Traefik Forward Auth or Authelia to secure your Traefik-exposed services with an additional layer of authentication

Preparation

Setup data locations

We'll need a folder to store a docker-compose configuration file and an associated environment file. If you're following my filesystem layout, create /var/data/config/paperless (for the config). We'll also need to create /var/data/paperless and a few subdirectories (for the metadata). Lastly, we need a directory for the database backups to reside in as well.

mkdir /var/data/config/paperless
mkdir /var/data/paperless
mkdir /var/data/paperless/consume
mkdir /var/data/paperless/data
mkdir /var/data/paperless/export
mkdir /var/data/paperless/media
mkdir /var/data/runtime/paperless/pgdata
mkdir /var/data/paperless/database-dump

Create environment

To stay consistent with the other recipes, we'll create a file to store environment variables in. There's more than 1 service in this stack, but we'll only create one one environment file that will be used by the web server (more on this later).

cat << EOF > /var/data/config/paperless/paperless.env
PAPERLESS_TIME_ZONE:<timezone>
PAPERLESS_ADMIN_USER=<admin_user>
PAPERLESS_ADMIN_PASSWORD=<admin_password>
PAPERLESS_ADMIN_MAIL=<admin_email>
PAPERLESS_REDIS=redis://broker:6379
PAPERLESS_DBHOST=db
PAPERLESS_TIKA_ENABLED=1
PAPERLESS_TIKA_GOTENBERG_ENDPOINT=http://gotenberg:3000
PAPERLESS_TIKA_ENDPOINT=http://tika:9998
EOF

You'll need to replace some of the text in the snippet above:

<timezone> - Replace with an entry from the timezone database (eg: America/New_York)
<admin_user> - Username of the superuser account that will be created on first run. Without this and the <admin_password> you won't be able to log into Paperless
<admin_password> - Password of the superuser account above.
<admin_email> - Email address of the superuser account above.

Paperless NGX Docker Swarm config

Create a docker swarm config file in docker-compose syntax (v3), something like the following example:

Fast-track with premix! 🚀

"Premix" is a git repository which includes necessary docker-compose and env files for all published recipes. This means that you can launch any recipe with just a git pull and a docker stack deploy 👍.

🚀 Update: Premix now includes an ansible playbook, enabling you to deploy an entire stack + recipes, with a single ansible command! (more here)

version: "3.2"
services:

  broker:
    image: redis:6.0
    networks:
      - internal

  webserver:
    image: linuxserver/paperless-ngx
    env_file: paperless.env
    volumes:
      - /var/data/paperless/data:/usr/src/paperless/data
      - /var/data/paperless/media:/usr/src/paperless/media
      - /var/data/paperless/export:/usr/src/paperless/export
      - /var/data/paperless/consume:/usr/src/paperless/consume
    deploy:
      replicas: 1
      labels:
        # traefik
        - traefik.enable=true
        - traefik.docker.network=traefik_public

        # traefikv1
        - traefik.frontend.rule=Host:paperless.example.com
        - traefik.port=8000    
        - traefik.frontend.auth.forward.address=http://traefik-forward-auth:4181
        - traefik.frontend.auth.forward.authResponseHeaders=X-Forwarded-User
        - traefik.frontend.auth.forward.trustForwardHeader=true        

        # traefikv2
        - "traefik.http.routers.paperless.rule=Host(`paperless.example.com`)"
        - "traefik.http.routers.paperless.entrypoints=https"
        - "traefik.http.services.paperless.loadbalancer.server.port=8000"
        - "traefik.http.routers.paperless.middlewares=forward-auth"
    networks:
      - internal
      - traefik_public

  gotenberg:
    image: thecodingmachine/gotenberg
    environment:
      DISABLE_GOOGLE_CHROME: 1
    networks:
      - internal

  tika:
    image: apache/tika
    networks:
      - internal

  db:
    image: postgres:13
    volumes:
      - /var/data/runtime/paperless/pgdata:/var/lib/postgresql/data
    environment:
      POSTGRES_DB: paperless
      POSTGRES_USER: paperless
      POSTGRES_PASSWORD: paperless
    networks:
      - internal

  db-backup:
    image: postgres:13
    volumes:
      - /var/data/paperless/database-dump:/dump
      - /etc/localtime:/etc/localtime:ro
    environment:
      PGHOST: db
      PGDATABASE: paperless
      PGUSER: paperless
      PGPASSWORD: paperless
      BACKUP_NUM_KEEP: 7
      BACKUP_FREQUENCY: 1d
    entrypoint: |
      bash -c 'bash -s <<EOF
      trap "break;exit" SIGHUP SIGINT SIGTERM
      sleep 2m
      while /bin/true; do
        pg_dump -Fc > /dump/dump_\`date +%d-%m-%Y"_"%H_%M_%S\`.psql
        (ls -t /dump/dump*.psql|head -n $$BACKUP_NUM_KEEP;ls /dump/dump*.psql)|sort|uniq -u|xargs rm -- {}
        sleep $$BACKUP_FREQUENCY
      done
      EOF'
    networks:
    - internal

networks:
  traefik_public:
    external: true
  internal:
    driver: overlay
    ipam:
      config:
        - subnet: 172.16.58.0/24

You'll notice that there are several items under "services" in this stack. Let's take a look at what each one does:

broker - Redis server that other services use to share data
webserver - The UI that you will use to add and view documents, edit document metadata, and configure the application settings.
gotenburg - Tool that facilitates converting MS Office documents, HTML, Markdown and other document types to PDF
tika - The OCR engine that extracts text from image-only documents
db - PostgreSQL database engine to store metadata for all the documents. ²
db-backup - Service to dump the PostgreSQL database to a backup file on disk once per day

Serving

Launch the paperless stack by running docker stack deploy paperless -c <path -to-docker-compose.yml>. You can then log in with the username and password that you specified in the environment variables file above.

Head over to the Paperless documentation to see how to configure and use the application then revel in the fact you can now search all your scanned documents to to your heart's content.

Chef's notes 📓

Taken directly from Paperless documentation ↩
This particular stack configuration was chosen because it includes a "real" database in PostgreSQL versus the more lightweight SQLite database. After all, if you go to the trouble of scanning and importing a pile of documents, you want to know the database is robust enough to keep your data safe. ↩

Did you receive excellent service? Want to compliment the chef? (..and support development of current and future recipes!) Sponsor me on Github / Ko-Fi / Patreon, or see the contribute page for more (free or paid) ways to say thank you! 👏

Employ your chef (engage) 🤝

Is this too much of a geeky PITA? Do you just want results, stat? I do this for a living - I'm a full-time Kubernetes contractor, providing consulting and engineering expertise to businesses needing short-term, short-notice support in the cloud-native space, including AWS/Azure/GKE, Kubernetes, CI/CD and automation.

Learn more about working with me here.

Want to know now when this recipe gets updated, or when future recipes are added? Subscribe to the RSS feed, or leave your email address below, and we'll keep you updated.