---
author: mikeconrad
categories:
- Ansible
- Automation
- Docker
- Software Engineering
- Traefik
date: "2024-05-11T09:44:01Z"
tags:
- Blog Post
title: Traefik 3.0 service discovery in Docker Swarm mode
---
I recently decided to set up a Docker Swarm cluster for a project I was working on. If you aren't familiar with Swarm mode, it is similar in some ways to k8s but with much less complexity, and it is built into Docker. If you are looking for a fairly straightforward way to deploy containers across a number of nodes without all the overhead of k8s, it can be a good choice; however, it isn't a very popular or widespread solution these days.

Anyway, I set up a VM scale set in Azure with 10 Ubuntu 22.04 VMs and wrote some Ansible scripts to automate installing Docker on each machine, as well as setting up 3 of them as swarm managers and the other 7 as worker nodes. I SSH'd into the primary manager node and created a Docker Compose file for launching an observability stack.
Here is what that `docker-compose.yml` looks like:
```yaml
---
services:
  otel-collector:
    image: otel/opentelemetry-collector-contrib:0.88.0
    volumes:
      - /home/user/repo/common/devops/observability/otel-config.yaml:/etc/otel/config.yaml
      - /home/user/repo/log:/log/otel
    command: --config /etc/otel/config.yaml
    environment:
      JAEGER_ENDPOINT: 'tempo:4317'
      LOKI_ENDPOINT: 'http://loki:3100/loki/api/v1/push'
    ports:
      - '8889:8889' # Prometheus metrics exporter (scrape endpoint)
      - '13133:13133' # health_check extension
      - '55679:55679' # ZPages extension
    deploy:
      placement:
        constraints:
          - node.hostname==dockerswa2V8BY4
    networks:
      - traefik
  prometheus:
    container_name: prometheus
    image: prom/prometheus:v2.42.0
    volumes:
      - /home/user/repo/common/devops/observability/prometheus.yml:/etc/prometheus/prometheus.yml
    ports:
      - '9090:9090'
    deploy:
      placement:
        constraints:
          - node.hostname==dockerswa2V8BY4
    networks:
      - traefik
  loki:
    container_name: loki
    image: grafana/loki:2.7.4
    ports:
      - '3100:3100'
    networks:
      - traefik
  grafana:
    container_name: grafana
    image: grafana/grafana:9.4.3
    volumes:
      - /home/user/repo/common/devops/observability/grafana-datasources.yml:/etc/grafana/provisioning/datasources/datasources.yml
    environment:
      GF_AUTH_ANONYMOUS_ENABLED: 'false'
      GF_AUTH_ANONYMOUS_ORG_ROLE: 'Admin'
    expose:
      - '3000'
    labels:
      - traefik.constraint-label=traefik
      - traefik.http.middlewares.https-redirect.redirectscheme.scheme=https
      - traefik.http.middlewares.https-redirect.redirectscheme.permanent=true
      # grafana-http only redirects plain HTTP requests to HTTPS
      - traefik.http.routers.grafana-http.rule=Host(`swarm-grafana.mydomain.com`)
      - traefik.http.routers.grafana-http.entrypoints=http
      - traefik.http.routers.grafana-http.middlewares=https-redirect
      # grafana-https is the actual router using HTTPS
      - traefik.http.routers.grafana-https.rule=Host(`swarm-grafana.mydomain.com`)
      - traefik.http.routers.grafana-https.entrypoints=https
      - traefik.http.routers.grafana-https.tls=true
      # Send matching requests to the grafana service defined below
      - traefik.http.routers.grafana-https.service=grafana
      # Use the "le" (Let's Encrypt) certificate resolver
      - traefik.http.routers.grafana-https.tls.certresolver=le
      # Port that Traefik forwards to inside the container
      - traefik.http.services.grafana.loadbalancer.server.port=3000
    deploy:
      placement:
        constraints:
          - node.hostname==dockerswa2V8BY4
    networks:
      - traefik
  # Tempo runs as user 10001, and docker compose creates the volume as root.
  # As such, we need to chown the volume in order for Tempo to start correctly.
  init:
    image: &tempoImage grafana/tempo:latest
    user: root
    entrypoint:
      - 'chown'
      - '10001:10001'
      - '/var/tempo'
    volumes:
      - /home/user/repo/tempo-data:/var/tempo
    deploy:
      placement:
        constraints:
          - node.hostname==dockerswa2V8BY4
  tempo:
    image: *tempoImage
    container_name: tempo
    command: ['-config.file=/etc/tempo.yaml']
    volumes:
      - /home/user/repo/common/devops/observability/tempo.yaml:/etc/tempo.yaml
      - /home/user/repo/tempo-data:/var/tempo
    deploy:
      placement:
        constraints:
          - node.hostname==dockerswa2V8BY4
    ports:
      - '14268' # jaeger ingest
      - '3200' # tempo
      - '4317' # otlp grpc
      - '4318' # otlp http
      - '9411' # zipkin
    depends_on:
      - init
    networks:
      - traefik
networks:
  traefik:
    external: true
```
Pretty straightforward, so I proceeded to deploy it into the swarm:
```shell
docker stack deploy -c docker-compose.yml observability
```
Everything deployed properly, but when I viewed the Traefik logs there was an issue with every service except grafana. I got errors like this:
```shell
traefik_traefik.1.tm5iqb9x59on@dockerswa2V8BY4 | 2024-05-11T13:14:16Z ERR error="service \"observability-prometheus\" error: port is missing" container=observability-prometheus-37i852h4o36c23lzwuu9pvee9 providerName=swarm
```
It drove me crazy for about half a day. I couldn't find any reason why the grafana service worked as expected but none of the others did. Part of my love/hate relationship with Traefik stems from the fact that configuration issues like this can be hard to track down and debug. Ultimately, after lots of searching and banging my head against a wall, I found the answer in the Traefik docs and thought I would share it here for anyone else who might run into this issue. Again, this solution is specific to Docker Swarm mode.
<https://doc.traefik.io/traefik/providers/swarm/#configuration-examples>
Expand that first section and you will see the solution:
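![](https:///wp-content/uploads/2024/05/image.png)

It turns out I just needed to update my `docker-compose.yml` and nest the labels under a `deploy` section. In Swarm mode, Traefik reads labels from the service rather than from individual containers, and labels declared directly under a service in a compose file are applied to the containers. After I redeployed, everything was working as expected.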
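
As a minimal sketch of that change, here is the `grafana` service from the file above with its Traefik labels moved under `deploy`. The label values are copied from the original file; only their position changes. The other services and the rest of the grafana definition (volumes, environment) are left out for brevity, so treat this as an illustration rather than a complete stack file.

```yaml
---
services:
  grafana:
    image: grafana/grafana:9.4.3
    networks:
      - traefik
    deploy:
      placement:
        constraints:
          - node.hostname==dockerswa2V8BY4
      # Labels under deploy are set on the Swarm service,
      # which is where Traefik's Swarm provider looks for them.
      labels:
        - traefik.constraint-label=traefik
        - traefik.http.middlewares.https-redirect.redirectscheme.scheme=https
        - traefik.http.middlewares.https-redirect.redirectscheme.permanent=true
        - traefik.http.routers.grafana-http.rule=Host(`swarm-grafana.mydomain.com`)
        - traefik.http.routers.grafana-http.entrypoints=http
        - traefik.http.routers.grafana-http.middlewares=https-redirect
        - traefik.http.routers.grafana-https.rule=Host(`swarm-grafana.mydomain.com`)
        - traefik.http.routers.grafana-https.entrypoints=https
        - traefik.http.routers.grafana-https.tls=true
        - traefik.http.routers.grafana-https.service=grafana
        - traefik.http.routers.grafana-https.tls.certresolver=le
        # Swarm services need the target port stated explicitly
        - traefik.http.services.grafana.loadbalancer.server.port=3000
networks:
  traefik:
    external: true
```

Running the same `docker stack deploy -c docker-compose.yml observability` command again should update the service in place with the new labels.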