Docker Configuration Evolution: From Scattered Containers to Managed Stacks
Previously, each service lived in a separate repository with a minimal `docker-compose.yml`:

    # Old approach: garrysmod-server/
    services:
      garrysmod-server:
        image: ceifa/garrysmod:latest
        ports:
          - "27015:27015"
        volumes:
          - ./garrysmod:/data
        restart: unless-stopped

Problems:
- ❌ No resource control - one service could consume all memory
- ❌ No health checks - crashed services went undetected
- ❌ No log rotation - logs grew until disk full
- ❌ No network isolation - all services in the `default` network
- ❌ Manual per-service updates - high risk of human error
New Approach: Stacks with Explicit Contracts
Now related services are grouped into a single repository with unified configuration:

    grafana-stack/
    ├── compose.yaml     # Unified stack: grafana, loki, oncall-*
    ├── stack.env        # Shared variables (extracted from code)
    ├── loki/
    │   ├── Dockerfile   # Custom build if needed
    │   └── loki.yml     # Application config
    └── README.md        # Docs: ports, dependencies, deployment

Key Changes
1. Resource Limits (Preventing “Starvation”)

    services:
      minecraft-server:
        cpus: "2.0"       # Hard CPU limit
        mem_limit: 3g     # Hard memory limit
        pids_limit: 300   # Protection against fork bombs

Why:
- Prevents one service from blocking others
- Enables precise load planning on the host
- Simplifies diagnostics: if a service hits a limit, it’s immediately visible
2. Security by Default

    services:
      grafana:
        security_opt:
          - no-new-privileges:true   # Prevent privilege escalation
          - seccomp:unconfined       # Disables syscall filtering; only where truly needed (e.g., systemd inside)

Why `no-new-privileges`:
- Container cannot gain more privileges than it started with
- Protects against vulnerabilities exploiting setuid binaries
- Near-zero overhead
3. Observability: Logging and Health Checks

    services:
      grafana:
        logging:
          driver: json-file
          options:
            max-file: "3"   # Keep only 3 rotated files
            max-size: 10m   # Each file max 10 MB
        healthcheck:
          test: ["CMD-SHELL", "curl -f http://localhost:3000/api/health || exit 1"]
          interval: 15s       # Check every 15 seconds
          timeout: 5s         # Response timeout
          retries: 3          # 3 failures = unhealthy
          start_period: 30s   # Ignore failures during startup

Result:
- Logs don’t fill disk (rotation + limits)
- Orchestrator sees service state (`docker ps` shows `(healthy)`)
- Can alert on `unhealthy` status
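
Health status can also be used inside the stack itself. The fragment below is a minimal sketch, not part of the original configuration: it assumes a hypothetical `oncall-engine` service that should start only after `grafana` reports healthy, using the `depends_on` long form with `condition: service_healthy` from the Compose specification.

```yaml
services:
  grafana:
    healthcheck:
      test: ["CMD-SHELL", "curl -f http://localhost:3000/api/health || exit 1"]
      interval: 15s
      timeout: 5s
      retries: 3
      start_period: 30s
  oncall-engine:                      # hypothetical service name, for illustration only
    depends_on:
      grafana:
        condition: service_healthy    # start only after grafana's health check passes
```

Without a `healthcheck` on the dependency, `service_healthy` has nothing to wait for, so the two settings belong together.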
4. Network Isolation and Service Discovery

    networks:
      prometheus:
        external: true   # Shared network for all metric-exporting services
        name: prometheus
      traefik:
        external: true   # Shared network for public services behind proxy
        name: traefik
    services:
      grafana:
        networks: [prometheus, traefik]   # Sees both metrics and web traffic
      loki:
        networks: [prometheus]            # Metrics only, no public access

Benefits:
- Services discover each other by name (`http://loki:3100`)
- Public access only through services explicitly connected to `traefik`
- Easy to add a new service to monitoring: connect it to the `prometheus` network and it’s already visible
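
As a rough illustration of the last point, a separate stack could join the shared network like this. The `node-exporter` service and its image are assumptions chosen for the example, not something taken from the stacks described here.

```yaml
# compose.yaml of another stack (illustrative sketch)
networks:
  prometheus:
    external: true     # Compose does not create external networks;
    name: prometheus   # it must already exist, e.g. via `docker network create prometheus`

services:
  node-exporter:                 # hypothetical service added to monitoring
    image: prom/node-exporter    # assumed image, pin a tag in real use
    networks: [prometheus]       # reachable by other members as http://node-exporter:9100
```

Because the service is attached only to `prometheus`, it stays unreachable from the `traefik` side unless that network is added explicitly.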
5. Graceful Shutdown and Signals

    services:
      minecraft-server:
        stop_signal: SIGTERM      # First, polite request to stop
        stop_grace_period: 60s    # Wait up to 60s before SIGKILL
        stdin_open: true          # For interactive consoles
        tty: true

Why:
- Prevents data loss on restart (server saves world)
- Allows apps to close connections, flush caches properly
- `SIGTERM` + grace period is the production deployment standard
6. Configuration via .env, Not Hardcoding

    # compose.yaml
    services:
      grafana:
        env_file: [stack.env]   # Extract sensitive data

    # stack.env (in .gitignore)
    GF_SECURITY_ADMIN_USER=admin
    GF_SECURITY_ADMIN_PASSWORD=${GRAFANA_PASS}
    DOMAIN=potatoenergy.ru

Benefits:
- One secrets file - easier to rotate, easier to audit
- No passwords in repository
- Easy to deploy to different environments (dev/stage/prod) with different `.env` files
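
One possible way to switch between environment files is variable interpolation in the `env_file` path. This is a sketch under assumptions: the `STACK_ENV` variable and the `stack.<env>.env` file names are hypothetical and do not appear in the original stacks.

```yaml
# compose.yaml
services:
  grafana:
    # STACK_ENV is a hypothetical variable; it falls back to "dev" when unset,
    # so the path resolves to stack.dev.env, stack.stage.env, or stack.prod.env.
    env_file:
      - stack.${STACK_ENV:-dev}.env
```

With this layout, running `STACK_ENV=prod docker compose up -d` would load `stack.prod.env`, while a plain `docker compose up -d` would use the dev file.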
Comparison Table
| Criterion | Old Approach | New Approach |
|---|---|---|
| Grouping | 1 service = 1 repo | Related services = 1 stack |
| Resources | No limits | cpus, mem_limit, pids_limit |
| Security | Docker defaults | no-new-privileges, explicit seccomp |
| Logging | Grow until disk full | Rotation: max-file, max-size |
| Health | None | Health check with interval/timeout |
| Networks | All in default | Explicit external networks (prometheus, traefik) |
| Shutdown | Default 10 s grace period, then SIGKILL | Explicit SIGTERM + tuned stop_grace_period |
| Config | Hardcoded in compose | .env files, excluded from repo |
Evolution in Numbers
| Metric | Before | After |
|---|---|---|
| Average stack deployment time | ~15 min (manual update of 5 services) | ~3 min (single docker compose up -d) |
| Host memory consumption | Unpredictable, frequent OOM | Stable, limits guarantee isolation |
| Recovery time after failure | Depends on manual intervention | Auto-restart + health check detects issue |
| Configuration audit | Need to check 10+ repos | One compose.yaml per stack |
Practical Recommendations
When Migrating from Old Approach
- Start with one stack (e.g., monitoring: grafana + loki + prometheus)
- Add limits gradually: first `mem_limit`, then `cpus`, then `pids_limit`
- Test health checks locally before deploy: `docker compose up --abort-on-container-exit`
- Extract secrets to `.env` and add it to `.gitignore` before the first commit
- Document external networks: what each network is for, which services connect
Checklist for New Service
- Resource limits: `cpus`, `mem_limit`, `pids_limit`
- Security: `no-new-privileges:true`, explicit `seccomp` where needed
- Logging: `json-file` driver with `max-file`/`max-size`
- Health check: meaningful `test`, reasonable `interval`/`timeout`
- Networks: explicit connection to external networks (`prometheus`, `traefik`)
- Shutdown: `stop_signal: SIGTERM`, `stop_grace_period` for long-running services
- Config: sensitive data in `env_file`, not in code
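
Put together, the checklist might translate into a service definition along these lines. This is only a starting template: the service name, image, port, health endpoint, and limit values are placeholders, and the health check assumes `curl` is available inside the image.

```yaml
networks:
  prometheus:
    external: true
    name: prometheus

services:
  my-service:                          # hypothetical service name
    image: example/my-service:1.0      # placeholder image
    restart: unless-stopped
    cpus: "1.0"                        # resource limits (placeholder values)
    mem_limit: 512m
    pids_limit: 200
    security_opt:
      - no-new-privileges:true         # security
    logging:                           # log rotation
      driver: json-file
      options:
        max-file: "3"
        max-size: 10m
    healthcheck:                       # assumes curl exists and /health is served on 8080
      test: ["CMD-SHELL", "curl -f http://localhost:8080/health || exit 1"]
      interval: 15s
      timeout: 5s
      retries: 3
      start_period: 30s
    networks: [prometheus]             # explicit external network
    stop_signal: SIGTERM               # graceful shutdown
    stop_grace_period: 30s
    env_file: [stack.env]              # sensitive data kept out of compose.yaml
```

The limits and timings should be tuned per service; a game server that saves world state on exit, for example, likely needs a longer `stop_grace_period` than a stateless exporter.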
Links
- 🐳 Docker Compose Reference
- 🔐 Docker Security Best Practices
- 📊 Healthcheck Documentation
- 🧰 itzg Docker Images - examples of well-built images
- 🐙 Stack Repositories