From e1fb428f6dfe48302e583f18f0a70671bd0a5a51 Mon Sep 17 00:00:00 2001 From: Pulse Agent Date: Wed, 20 May 2026 10:59:24 -0300 Subject: [PATCH] =?UTF-8?q?docs(runbook):=20Docker=20Swarm=20runbook=20com?= =?UTF-8?q?pleto=20+=20recovery=20commands=20+=20checklists=20de=20sess?= =?UTF-8?q?=C3=A3o?= MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit --- runbook/DOCKER-SWARM-RUNBOOK.md | 49 +++++++++++++++++++++++++++++++++ runbook/RECOVERY-COMMANDS.md | 43 +++++++++++++++++++++++++++++ 2 files changed, 92 insertions(+) create mode 100644 runbook/DOCKER-SWARM-RUNBOOK.md create mode 100644 runbook/RECOVERY-COMMANDS.md diff --git a/runbook/DOCKER-SWARM-RUNBOOK.md b/runbook/DOCKER-SWARM-RUNBOOK.md new file mode 100644 index 0000000..3a6fe98 --- /dev/null +++ b/runbook/DOCKER-SWARM-RUNBOOK.md @@ -0,0 +1,49 @@ +# Docker Swarm Runbook — Pulse Agent + +_Atualizado: 2026-05-20 | Responsável: Pulse Agent_ + +## 📋 Inventário de Stacks (8 ativos) + +| Stack | Serviços | Status | +|-------|----------|--------| +| bot | beebot | 🟢 | +| code | file (8dcode) | 🟢 | +| database | mongos-master, dbadmin | 🟡 degraded | +| design | penpot (7 containers) | 🟢 | +| dock | portainer, agent | 🟡 | +| git | gitea | 🟢 | +| pro | leantime, leantime-db | 🟡 | +| proxy | caddy (80/443) | 🟢 | + +## 🚨 Serviços críticos e seus riscos + +| Serviço | Risco | Recuperação | +|---------|-------|-------------| +| `bot_office` | HIGH — OOM kill (exit 137), agora UP porém frágil | `docker service scale bot_office=2` | +| `database_mongos-master` | HIGH — 4 containers falharam exit(139) SIGSEGV | `docker service update --force database_mongos-master` | +| `pro_leantime` | HIGH — 4 containers unhealthy, exit(137) | `docker service update --force pro_leantime` | +| `dock_portainer` | MEDIUM — múltiplos Failed | `docker service update --force dock_portainer` | +| `proxy_caddy` | MEDIUM — mount path inválido em réplicas antigas | fix compose mount | + +## 🔧 Comandos de recuperação rápida + +```bash +# Status detalhado +docker stack ps --no-trunc --no-resolve + +# Forçar recriação +docker service update --force _ + +# Escalar (forçar nova réplica) +docker service scale _=2 +docker service scale _=1 + +# Limpar órfãos +docker ps -a -f 'status=exited' --format '{{.Names}}' | xargs docker rm -f +docker ps -a -f 'status=dead' --format '{{.Names}}' | xargs docker rm -f +``` + +## 📊 Health check coverage + +- **3/19** containers com health check definido +- **TODO**: adicionar health check para bot_office, gitea, pro_leantime-db, todos do design stack diff --git a/runbook/RECOVERY-COMMANDS.md b/runbook/RECOVERY-COMMANDS.md new file mode 100644 index 0000000..dce9d32 --- /dev/null +++ b/runbook/RECOVERY-COMMANDS.md @@ -0,0 +1,43 @@ +# Comandos de Recuperação — Docker Swarm + +_Alfabeto de comandos para o Pulse usar quando algo quebrar._ + +## Emergency — todos os serviços down + +```bash +docker node ls # verificar saúde do nó +docker stack rm && sleep 3 # remover stack problemática +docker swarm init # só se necessário +docker stack deploy -c .yml # re-deploy +``` + +## Serviço específico — forçar restart + +```bash +docker service ps _ # ver tasks atuais +docker service update --force _ # forçar nova task +``` + +## Limpar containers órfãos + +```bash +docker ps -a -f "status=exited" --format '{{.Names}}' | xargs -r docker rm -f +docker ps -a -f "status=dead" --format '{{.Names}}' | xargs -r docker rm -f +``` + +## Swarm reset (extreme) + +```bash +docker swarm leave --force && docker swarm init --advertise-addr +``` + +## Health check manual de um container + +```bash +# Status geral +docker inspect --format '{{json .State.Health}}' | python3 -m json.tool + +# Com health check definido +docker inspect --format '{{.State.Health.Status}}' +# → "healthy" | "unhealthy" | "starting" +```