Node Failures In Docker Swarm

Last week at the Docker meetup in Richmond, VA, I demonstrated how to create a Docker Swarm cluster with Docker 1.12. I showed how swarm handles node failures, global services, and scheduling services with resource constraints.


To create your swarm cluster, follow this tutorial in a previous post.

Node Failures

Commands related to services in Docker Engine 1.12 are declarative. For example, if you tell Docker "I want 3 replicas of this service", then the cluster will maintain that state.

If a node with running containers fails, swarm detects that the desired state ≠ actual state, and it reconciles automatically by rescheduling the missing containers on other available nodes.

To demo this, let's declare a new service with 3 replicas.

docker service create --name web --replicas 3 francois/apache-hostname  

Check the service tasks by running

docker service ps web  
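The output looks something like this (the IDs, node names, and timings below are illustrative, not from a real cluster):

```shell
docker service ps web
# ID        NAME   IMAGE                     NODE   DESIRED STATE  CURRENT STATE
# 7q9p...   web.1  francois/apache-hostname  node1  Running        Running 20 seconds ago
# 0l2b...   web.2  francois/apache-hostname  node2  Running        Running 20 seconds ago
# 9c4f...   web.3  francois/apache-hostname  node3  Running        Running 20 seconds ago
```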

As of now, we have one container on each of our nodes. Let's bring down node3 to see swarm reconciliation in action.

# Run this on node3
docker swarm leave  

At this point, desired state ≠ actual state: only two containers are running, but we asked for 3 when we created the service.

Using docker service ls you can see the number of replicas drop down from 3 to 2, then back up to 3.
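To watch the reconciliation happen live, you can poll the service list in a simple loop (a sketch; press Ctrl+C to stop):

```shell
# Poll the service list once a second to watch REPLICAS go 3/3 -> 2/3 -> 3/3
while true; do
  docker service ls
  sleep 1
done
```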

docker service ps web will show you the new container scheduled on a different node in your cluster.

This example only covers swarm reconciliation of containers on worker nodes. A different process kicks in when a manager goes down, particularly if that manager is the leader of the Raft consensus group. For more information on what happens then, check out this post from a fellow Docker Captain.

Global Services

Global services are useful when you want a container on every node in your cluster. Think logging or monitoring.

docker service create --mode=global --name prometheus prom/prometheus  

Remember when I said that services are declarative? When you add a new node to your cluster, swarm detects desired state ≠ actual state, and starts an instance of the global container on that node.
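You can see this by joining a fresh node and then checking the task list (the token and manager address below are placeholders, as in the join command later in this post):

```shell
# On a manager: print the join command, including the worker token
docker swarm join-token worker

# On the new node: join the swarm using the printed token
docker swarm join --token [token] [manager ip]:2377

# Back on a manager: a new prometheus task should appear on the new node
docker service ps prometheus
```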


Constraints

To demo constraints, I used docker-machine to spin up a new machine with an engine label.

docker-machine create -d virtualbox --engine-label storage=ssd sw3  

Then I added it to the swarm.

docker swarm join --token [token] [manager ip]:[manager port]  

Then I created a service referencing that constraint.

docker service create --name web2 --replicas 3 --constraint 'engine.labels.storage == ssd' francois/apache-hostname  

Remember when I said that services are declarative? ;) This means that when we scale this service, swarm remembers our constraint and schedules new replicas only on nodes that satisfy it.
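For example, scaling the constrained service up and checking where the new tasks land (a sketch using the service and node names from above):

```shell
# Scale the constrained service from 3 to 5 replicas
docker service scale web2=5

# All five tasks should be scheduled on sw3, the only node with the ssd label
docker service ps web2
```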