Node Failures In Docker Swarm
Last week at the Docker meetup in Richmond, VA, I demonstrated how to create a Docker Swarm cluster in Docker 1.12. I showed how swarm handles node failures, global services, and scheduling services with resource constraints.
Setup
To create your swarm cluster, follow this tutorial in a previous post.
Node Failures
Commands related to services in Docker Engine 1.12 are declarative. For example, if you tell Docker "I want 3 replicas of this service," the cluster will maintain that state.
If a node with running containers fails, swarm will detect that desired state ≠ actual state, and it will reconcile automatically by rescheduling the missing containers on other available nodes.
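The reconciliation idea can be pictured with a toy shell sketch (the counts here are made up for illustration; Swarm's real scheduler is far more involved):

```shell
# Toy illustration of reconciliation -- NOT Swarm's actual implementation
desired=3   # replicas requested when the service was created
actual=2    # replicas still running after a node failure
if [ "$actual" -lt "$desired" ]; then
  missing=$((desired - actual))
  echo "rescheduling $missing task(s) on healthy nodes"
fi
```

Swarm runs this comparison continuously, which is why you never have to re-run a command after a failure.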
To demo, let's declare a new service with 3 replicas.
docker service create --name web --replicas 3 francois/apache-hostname
Check the service tasks by running
docker service ps web
As of now, we have one container on each of our nodes. Let's bring down node3 to see swarm reconciliation in action.
# Run this on node3
docker swarm leave
At this point, desired state ≠ actual state: only two containers are running, even though we asked for 3 when we created the service.
Using docker service ls
you can watch the number of replicas drop from 3 to 2, then return to 3 as swarm reschedules the missing task.
docker service ps web
will show you the new container scheduled on a different node in your cluster.
This example only covers swarm reconciliation of containers on worker nodes. A different process kicks in when a manager goes down, particularly if that manager is the leader of the Raft consensus group. For more information on what happens then, check out this post from a fellow Docker Captain.
Global Services
Global services are useful when you want a container on every node in your cluster. Think logging or monitoring.
docker service create --mode=global --name prometheus prom/prometheus
Remember when I said that services are declarative? When you add a new node to your cluster, swarm detects desired state ≠ actual state and starts an instance of the global container on that node.
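For example, joining a fresh worker should be enough to get a prometheus task running there, with no extra scheduling command (the token and address below are placeholders, not real values):

```shell
# On the new node (placeholders, not real values):
docker swarm join --token [worker-token] [manager ip]:[manager port]

# Back on a manager, the global service should now show a task on the new node:
docker service ps prometheus
```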
Constraints
To demo constraints, I used docker-machine to spin up a new machine with an engine label.
docker-machine create -d virtualbox --engine-label com.example.storage="ssd" sw3
Then I added it to the swarm.
docker swarm join --token [token] [manager ip]:[manager port]
Then I created a service referencing that label as a constraint.
docker service create --name web2 --replicas 3 --constraint 'engine.labels.com.example.storage == ssd' francois/apache-hostname
Remember when I said that services are declarative? ;) This means that when we scale this service, swarm remembers our constraint and only schedules replicas on nodes that satisfy it.