authorClayton Coleman <ccoleman@redhat.com>2017-02-08 16:14:01 -0500
committerGitHub <noreply@github.com>2017-02-08 16:14:01 -0500
commit3763d34d14d59c59b7ed3b7f0de2ebd3f9b95a8c (patch)
tree8d695024f5db5fe5342c5038d50fcc138ab086a4
parent8e714e8072f17d9b24512b38a9920f7645651e25 (diff)
parent16f88595883a7461010b6708fb0e0bf1b046cf33 (diff)
Merge pull request #124 from smarterclayton/podsafety
Proposal: Pod Safety Guarantees
-rw-r--r--  contributors/design-proposals/pod-safety.md | 407
1 files changed, 407 insertions, 0 deletions
diff --git a/contributors/design-proposals/pod-safety.md b/contributors/design-proposals/pod-safety.md
new file mode 100644
index 00000000..10f7589b
--- /dev/null
+++ b/contributors/design-proposals/pod-safety.md
@@ -0,0 +1,407 @@
+# Pod Safety, Consistency Guarantees, and Storage Implications
+
+@smarterclayton @bprashanth
+
+October 2016
+
+## Proposal and Motivation
+
+A pod represents the finite execution of one or more related processes on the
+cluster. In order to ensure higher level consistent controllers can safely
+build on top of pods, the exact guarantees around its lifecycle on the cluster
+must be clarified, and it must be possible for higher order controllers
+and application authors to correctly reason about the lifetime of those
+processes and their access to cluster resources in a distributed computing
+environment.
+
+To run most clustered software on Kubernetes, it must be possible to guarantee
+**at most once** execution of a particular pet pod at any time on the cluster.
+This allows the controller to prevent multiple processes having access to
+shared cluster resources believing they are the same entity. When a node
+containing a pet is partitioned, the Pet Set must remain consistent (no new
+entity will be spawned) but may become unavailable (cluster no longer has
+a sufficient number of members). The Pet Set guarantee must be strong enough
+for an administrator to reason about the state of the cluster by observing
+the Kubernetes API.
+
+In order to reconcile partitions, an actor (human or automated) must decide
+when the partition is unrecoverable. The actor may be informed of the failure
+in an unambiguous way (e.g. the node was destroyed by a meteor) allowing for
+certainty that the processes on that node are terminated, and thus may
+resolve the partition by deleting the node and the pods on the node.
+Alternatively, the actor may take steps to ensure the partitioned node
+cannot return to the cluster or access shared resources - this is known
+as **fencing** and is a well understood domain.
+
+This proposal covers the changes necessary to ensure:
+
+* Pet Sets can ensure **at most one** semantics for each individual pet
+* Other system components such as the node and namespace controller can
+ safely perform their responsibilities without violating that guarantee
+* An administrator or higher level controller can signal that a node
+ partition is permanent, allowing the Pet Set controller to proceed.
+* A fencing controller can take corrective action automatically to heal
+ partitions
+
+We will accomplish this by:
+
+* Clarifying which components are allowed to force delete pods (as opposed
+ to merely requesting termination)
+* Ensuring system components can observe partitioned pods and nodes
+ correctly
+* Defining how a fencing controller could safely interoperate with
+ partitioned nodes and pods to safely heal partitions
+* Describing how shared storage components without innate safety
+ guarantees can be safely shared on the cluster.
+
+
+### Current Guarantees for Pod lifecycle
+
+The existing pod model provides the following guarantees:
+
+* A pod is executed on exactly one node
+* A pod has the following lifecycle phases:
+ * Creation
+ * Scheduling
+ * Execution
+ * Init containers
+ * Application containers
+ * Termination
+ * Deletion
+* A pod can only move through its phases in order, and may not return
+ to an earlier phase.
+* A user may specify an interval on the pod called the **termination
+ grace period** that defines the minimum amount of time the pod will
+ have to complete the termination phase, and all components will honor
+ this interval.
+* Once a pod begins termination, its termination grace period can only
+ be shortened, not lengthened.
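As a sketch only (a hypothetical helper, not Kubernetes code), the forward-only ordering of the top-level phases above can be modeled like this:

```python
# Top-level lifecycle phases in the only order a pod may move through them.
PHASES = ["Creation", "Scheduling", "Execution", "Termination", "Deletion"]

def may_transition(current, target):
    """A pod may only move forward through its phases, never back."""
    return PHASES.index(target) > PHASES.index(current)
```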
+
+Pod termination is divided into the following steps:
+
+* A component requests the termination of the pod by issuing a DELETE
+ to the pod resource with an optional **grace period**
+  * If no grace period is provided, the pod's default grace period is used
+* When the kubelet observes the deletion, it starts a timer equal to the
+ grace period and performs the following actions:
+ * Executes the pre-stop hook, if specified, waiting up to **grace period**
+ seconds before continuing
+ * Sends the termination signal to the container runtime (SIGTERM or the
+ container image's STOPSIGNAL on Docker)
+ * Waits 2 seconds, or the remaining grace period, whichever is longer
+ * Sends the force termination signal to the container runtime (SIGKILL)
+* Once the kubelet observes the container is fully terminated, it issues
+ a status update to the REST API for the pod indicating termination, then
+ issues a DELETE with grace period = 0.
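The grace-period rules and the kubelet's escalation schedule above can be sketched as follows (hypothetical helpers illustrating the described semantics, not the real kubelet code):

```python
def resolve_grace(requested, pod_default):
    # If no grace period accompanies the DELETE, the pod's default is used.
    return pod_default if requested is None else requested

def shorten_grace(current, requested):
    # A re-issued DELETE may only shorten the remaining grace period.
    return min(current, requested)

def termination_plan(grace, pre_stop_seconds):
    # Escalation schedule: pre-stop hook, SIGTERM, then SIGKILL after
    # 2 seconds or the remaining grace period, whichever is longer.
    events = []
    elapsed = min(pre_stop_seconds, grace)  # hook may consume up to the full grace
    events.append(("pre-stop-hook", elapsed))
    events.append(("SIGTERM", elapsed))
    elapsed += max(2, grace - elapsed)
    events.append(("SIGKILL", elapsed))
    return events
```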
+
+If the kubelet crashes during the termination process, it will restart the
+termination process from the beginning (grace period is reset). This ensures
+that a process is always given **at least** grace period to terminate cleanly.
+
+A user may re-issue a DELETE to the pod resource specifying a shorter grace
+period, but never a longer one.
+
+Deleting a pod with grace period 0 is called **force deletion** and will
+update the pod with a `deletionGracePeriodSeconds` of 0, and then immediately
+remove the pod from etcd. Because all communication is asynchronous,
+force deleting a pod means that the pod processes may continue
+to run for an arbitrary amount of time. If a higher level component like the
+StatefulSet controller treats the existence of the pod API object as a strongly
+consistent entity, deleting the pod in this fashion will violate the
+at-most-one guarantee we wish to offer for pet sets.
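A minimal simulation of that failure mode (all names hypothetical): if a controller treats the absence of the API record as proof the processes are gone, force deletion lets a replacement start while the original still runs.

```python
def force_delete(api_pods, name):
    # Force deletion removes the API record immediately; the node learns of it
    # asynchronously, so the pod's processes may keep running for a while.
    api_pods.discard(name)

def controller_reconcile(api_pods, running, name):
    # A controller that trusts the API record as ground truth recreates the
    # pod as soon as the record disappears.
    if name not in api_pods:
        running.append(name)  # replacement starts while the old copy may still run

api_pods, running = {"pet-2"}, ["pet-2"]  # pet-2 is live on a partitioned node
force_delete(api_pods, "pet-2")
controller_reconcile(api_pods, running, "pet-2")
# Two processes now believe they are pet-2: at-most-one is violated.
```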
+
+
+### Guarantees provided by replica sets and replication controllers
+
+ReplicaSets and ReplicationControllers both favor **preserving availability**
+of their constituent pods over ensuring at-most-one semantics for any given
+pod. A replica set at scale 1 will immediately create a new pod when it
+observes an old pod has begun graceful deletion, so at many points in the
+lifetime of a replica set there will be two copies of a pod's processes
+running concurrently. Only access to exclusive resources like storage can
+prevent that simultaneous execution.
+
+Deployments, being based on replica sets, can offer no stronger guarantee.
+
+
+### Concurrent access guarantees for shared storage
+
+A persistent volume that references a strongly consistent storage backend
+like AWS EBS, GCE PD, OpenStack Cinder, or Ceph RBD can rely on the storage
+API to prevent corruption of the data due to simultaneous access by multiple
+clients. However, many commonly deployed storage technologies in the
+enterprise offer no such consistency guarantee, or much weaker variants, and
+rely on complex systems to control which clients may access the storage.
+
+If a PV is assigned an iSCSI, Fibre Channel, or NFS mount point and that PV
+is used by two pods on different nodes simultaneously, concurrent access may
+result in corruption, even if the PV or PVC is identified as "read write once".
+PVC consumers must ensure these volume types are *never* referenced from
+multiple pods without some external synchronization. As described above, it
+is not safe to use persistent volumes whose backends cannot enforce RWO
+guarantees with a replica set or deployment, even at scale 1.
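The invariant consumers must maintain for such volumes can be stated as a simple check (an illustrative sketch, not an existing Kubernetes API):

```python
def rwo_violations(volume_users):
    """volume_users maps a volume name to the set of nodes mounting it.
    For RWO volumes on backends with no access control (NFS, iSCSI, FC),
    more than one node mounting the volume risks data corruption."""
    return {vol for vol, nodes in volume_users.items() if len(nodes) > 1}
```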
+
+
+## Proposed changes
+
+### Avoid multiple instances of pods
+
+To ensure that the Pet Set controller can safely use pods and ensure at most
+one pod instance is running on the cluster at any time for a given pod name,
+it must be possible to make pod deletion strongly consistent.
+
+To do that, we will:
+
+* Give the Kubelet sole responsibility for normal deletion of pods -
+ only the Kubelet in the course of normal operation should ever remove a
+ pod from etcd (only the Kubelet should force delete)
+ * The kubelet must not delete the pod until all processes are confirmed
+ terminated.
+ * The kubelet SHOULD ensure all consumed resources on the node are freed
+ before deleting the pod.
+* Application owners must be free to force delete pods, but they *must*
+ understand the implications of doing so, and all client UI must be able
+ to communicate those implications.
+ * Force deleting a pod may cause data loss (two instances of the same
+ pod process may be running at the same time)
+* All existing controllers in the system must be limited to signaling pod
+ termination (starting graceful deletion), and are not allowed to force
+ delete a pod.
+ * The node controller will no longer be allowed to force delete pods -
+ it may only signal deletion by beginning (but not completing) a
+ graceful deletion.
+ * The GC controller may not force delete pods
+ * The namespace controller used to force delete pods, but no longer
+ does so. This means a node partition can block namespace deletion
+ indefinitely.
+ * The pod GC controller may continue to force delete pods on nodes that
+ no longer exist if we treat node deletion as confirming permanent
+ partition. If we do not, the pod GC controller must not force delete
+ pods.
+* It must be possible for an administrator to effectively resolve partitions
+ manually to allow namespace deletion.
+* Deleting a node from etcd should be seen as a signal to the cluster that
+ the node is permanently partitioned. We must audit existing components
+ to verify this is the case.
+ * The PodGC controller has primary responsibility for this - it already
+ owns the responsibility to delete pods on nodes that do not exist, and
+ so is allowed to force delete pods on nodes that do not exist.
+ * The PodGC controller must therefore always be running and will be
+ changed to always be running for this responsibility in a >=1.5
+ cluster.
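The rules above can be summarized in one predicate (a sketch of the proposed policy, with hypothetical actor names):

```python
def may_force_delete(actor, node_exists=True):
    """Who may force delete a pod under this proposal."""
    if actor == "kubelet":
        return True              # sole normal deleter, after processes terminate
    if actor == "user":
        return True              # allowed, but must understand the implications
    if actor == "podgc":
        return not node_exists   # only when node deletion confirms the partition
    # node controller, GC controller, namespace controller: may only signal
    # graceful deletion, never force delete.
    return False
```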
+
+In the above scheme, force deleting a pod releases the lock on that pod and
+allows higher level components to proceed to create a replacement.
+
+It has been requested that force deletion be restricted to privileged users.
+That would prevent application owners who understand the consequences of
+force deletion from resolving partitions themselves, and not all application
+owners will be privileged users. For example, a user may be running a 3 node etcd cluster in a
+pet set. If pet 2 becomes partitioned, the user can instruct etcd to remove
+pet 2 from the cluster (via direct etcd membership calls), and because a quorum
+exists pets 0 and 1 can safely accept that action. The user can then force
+delete pet 2 and the pet set controller will be able to recreate that pet on
+another node and have it join the cluster safely (pets 0 and 1 constitute a
+quorum for membership change).
+
+This proposal does not alter the behavior of finalizers - instead, it makes
+finalizers unnecessary for common application cases (because the cluster only
+deletes pods when safe).
+
+### Fencing
+
+The changes above allow Pet Sets to ensure at-most-one pod, but provide no
+recourse for the automatic resolution of cluster partitions during normal
+operation. For that, we propose a **fencing controller** which exists above
+the current controller plane and is capable of detecting and automatically
+resolving partitions. The fencing controller is an agent empowered to make
+similar decisions as a human administrator would make to resolve partitions,
+and to take corresponding steps to prevent a dead machine from coming back
+to life automatically.
+
+Fencing controllers most benefit services that are not innately replicated:
+they reduce the time it takes to detect the failure of a node or process,
+isolate that node or process so it can neither initiate nor receive
+communication from clients, and then spawn a replacement process. It is
+expected that many StatefulSets of size 1 would prefer to be fenced, since
+most real-world applications of size 1 have no alternative for HA other
+than reducing mean-time-to-recovery.
+
+While the methods and algorithms may vary, the basic pattern would be:
+
+1. Detect a partitioned pod or node via the Kubernetes API or via external
+ means.
+2. Decide whether the partition justifies fencing based on priority, policy, or
+ service availability requirements.
+3. Fence the node or any connected storage using appropriate mechanisms.
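The three steps above can be sketched as one reconciliation pass of a hypothetical fencing controller (names and shapes illustrative only):

```python
def fencing_pass(pods, policy, fence):
    """One pass: detect partitioned pods, decide per policy, then fence."""
    fenced = []
    for pod in pods:
        if not pod.get("partitioned"):  # 1. detect (API or external means)
            continue
        if not policy(pod):             # 2. decide (priority, policy, availability)
            continue
        fence(pod["node"])              # 3. fence the node or its storage
        fenced.append(pod["name"])
    return fenced
```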
+
+For this proposal we only describe the general shape of detection and how existing
+Kubernetes components can be leveraged for policy, while the exact implementation
+and mechanisms for fencing are left to a future proposal. A future fencing controller
+would be able to leverage a number of systems including but not limited to:
+
+* Cloud control plane APIs such as machine force shutdown
+* Additional agents running on each host to force kill processes or trigger reboots
+* Agents integrated with or communicating with hypervisors running hosts to stop VMs
+* Hardware IPMI interfaces to reboot a host
+* Rack level power units to power cycle a blade
+* Network routers, backplane switches, software defined networks, or system firewalls
+* Storage server APIs to block client access
+
+to appropriately limit the ability of the partitioned system to impact the cluster.
+Fencing agents today use many of these mechanisms to allow the system to make
+progress in the event of failure. The key contribution of Kubernetes is to define
+a strongly consistent pattern whereby fencing agents can be plugged in.
+
+To allow users, clients, and automated systems like the fencing controllers to
+observe partitions, we propose an additional responsibility for the node controller
+(or any future controller that attempts to detect partitions): when a pod is
+terminated because its node failed to heartbeat, the controller should add a
+condition to the pod indicating that the cause of the termination was node partition.
+
+It may be desirable for users to be able to request fencing when they suspect a
+component is malfunctioning. It is outside the scope of this proposal but would
+allow administrators to take an action that is safer than force deletion, and
+decide at the end whether to force delete.
+
+How the fencing controller decides to fence is left undefined, but it is likely
+it could use a combination of pod forgiveness (as a signal of how much disruption
+a pod author is likely to accept) and pod disruption budget (as a measurement of
+the amount of disruption already undergone) to measure how much latency between
+failure and fencing the app is willing to tolerate. Likewise, it can use its own
+understanding of the latency of the various failure detectors - the node controller,
+any hypothetical information it gathers from service proxies or node peers, any
+heartbeat agents in the system - to describe an upper bound on reaction.
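One illustrative way to combine those signals (purely a sketch; the proposal leaves the actual policy undefined):

```python
def fence_after(forgiveness_s, budget_allows, detector_latency_s):
    """Fence once the pod's tolerated disruption window has passed, provided
    the disruption budget still allows it. The earliest possible reaction is
    bounded below by the latency of the failure detectors in use."""
    if not budget_allows:
        return None
    return max(forgiveness_s, detector_latency_s)
```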
+
+
+### Storage Consistency
+
+To ensure that shared storage without implicit locking can be safe for RWO access, the
+Kubernetes storage subsystem should leverage the strong consistency available through
+the API server and prevent concurrent execution for some types of persistent volumes.
+By leveraging existing concepts, we can allow the scheduler and the kubelet to enforce
+a guarantee that an RWO volume can be used on at-most-one node at a time.
+
+In order to properly support region and zone specific storage, Kubernetes adds node
+selector restrictions to pods derived from the persistent volume. Expanding this
+concept to volume types that have no external metadata to read (NFS, iSCSI) may
+result in adding a label selector to PVs that defines the allowed nodes the storage
+can run on (this is a common requirement for iSCSI, FibreChannel, or NFS clusters).
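As a sketch of that propagation (the field name `allowedNodeSelector` is hypothetical, not an existing API field), a pod using such a PV would inherit its node restriction:

```python
def pod_node_selector(pv):
    # A PV reachable only from one iSCSI rack might carry
    # {"storagecluster": "iscsi-1"}; pods using it inherit that restriction.
    return dict(pv.get("allowedNodeSelector", {}))

pv = {"name": "pv-iscsi-1", "allowedNodeSelector": {"storagecluster": "iscsi-1"}}
```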
+
+Because all nodes in a Kubernetes cluster possess a special node name label, it would
+be possible for a controller to observe the scheduling decision of a pod using an
+unsafe volume and "attach" that volume to the node, and also observe the deletion of
+the pod and "detach" the volume from the node. The node would then require that these
+unsafe volumes be "attached" before allowing pod execution. Attach and detach may
+be recorded on the PVC or PV as a new field or materialized via the selection labels.
+
+Possible sequence of operations:
+
+1. Cluster administrator creates a RWO iSCSI persistent volume, available only to
+ nodes with the label selector `storagecluster=iscsi-1`
+2. User requests an RWO volume and is bound to the iSCSI volume
+3. The user creates a pod referencing the PVC
+4. The scheduler observes the pod must schedule on nodes with `storagecluster=iscsi-1`
+ (alternatively this could be enforced in admission) and binds to node `A`
+5. The kubelet on node `A` observes the pod references a PVC that specifies RWO which
+ requires "attach" to be successful
+6. The attach/detach controller observes that a pod has been bound with a PVC that
+ requires "attach", and attempts to execute a compare and swap update on the PVC/PV
+ attaching it to node `A` and pod 1
+7. The kubelet observes the attach of the PVC/PV and executes the pod
+8. The user terminates the pod
+9. The user creates a new pod that references the PVC
+10. The scheduler binds this new pod to node `B`, which also has `storagecluster=iscsi-1`
+11. The kubelet on node `B` observes the new pod, but sees that the PVC/PV is bound
+ to node `A` and so must wait for detach
+12. The kubelet on node `A` completes the deletion of pod 1
+13. The attach/detach controller observes the first pod has been deleted and that the
+ previous attach of the volume to pod 1 is no longer valid - it performs a CAS
+ update on the PVC/PV clearing its attach state.
+14. The attach/detach controller observes the second pod has been scheduled and
+ attaches it to node `B` and pod 2
+15. The kubelet on node `B` observes the attach and allows the pod to execute.
+
+If a partition occurred after step 11, the attach controller would block waiting
+for the pod to be deleted, and prevent node `B` from launching the second pod.
+The fencing controller, upon observing the partition, could signal the iSCSI servers
+to firewall node `A`. Once that firewall is in place, the fencing controller could
+break the PVC/PV attach to node `A`, allowing steps 13 onwards to continue.
+
+
+### User interface changes
+
+Clients today may assume that force deletions are safe. We must appropriately
+audit clients to identify this behavior and improve the messages. For instance,
+`kubectl delete --grace-period=0` could print a warning and require `--confirm`:
+
+```
+$ kubectl delete pod foo --grace-period=0
+warning: Force deleting a pod does not wait for the pod to terminate, meaning
+ your containers will be stopped asynchronously. Pass --confirm to
+ continue
+```
+
+Likewise, attached volumes would require new semantics to allow the attachment
+to be broken.
+
+Clients should communicate partitioned state more clearly - changing the status
+column of a pod list to contain the condition indicating NodeDown would help
+users understand what actions they could take.
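A client-side sketch of that status column (condition and field names illustrative, assuming the NodeDown condition proposed above exists):

```python
def status_column(pod):
    # Prefer a NodeDown condition over the raw phase so users see the partition.
    for cond in pod.get("conditions", []):
        if cond.get("type") == "NodeDown" and cond.get("status") == "True":
            return "NodeDown"
    return pod.get("phase", "Unknown")
```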
+
+
+## Backwards compatibility
+
+On an upgrade, pet sets would not be "safe" until the above behavior is implemented.
+All other behaviors should remain as-is.
+
+
+## Testing
+
+All of the changes above are intended to ensure pods can be treated as components
+of a strongly consistent cluster. Since formal proofs of correctness are unlikely in
+the foreseeable future, Kubernetes must empirically demonstrate the correctness of
+the proposed systems. Automated testing of the mentioned components should be
+designed to expose ordering and consistency flaws in the presence of
+
+* Master-node partitions
+* Node-node partitions
+* Master-etcd partitions
+* Concurrent controller execution
+* Kubelet failures
+* Controller failures
+
+A test suite that can perform these tests in combination with real world pet sets
+would be desirable, although possibly non-blocking for this proposal.
+
+
+## Documentation
+
+We should document the lifecycle guarantees provided by the cluster in a clear
+and unambiguous way to end users.
+
+
+## Deferred issues
+
+* Live migration continues to be unsupported on Kubernetes for the foreseeable
+ future, and no additional changes will be made to this proposal to account for
+ that feature.
+
+
+## Open Questions
+
+* Should node deletion be treated as "node was down and all processes terminated"
+ * Pro: it's a convenient signal that we use in other places today
+ * Con: the kubelet recreates its Node object, so if a node is partitioned and
+ the admin deletes the node, when the partition is healed the node would be
+ recreated, and the processes are *definitely* not terminated
+ * Implies we must alter the pod GC controller to only signal graceful deletion,
+ and only to flag pods on nodes that don't exist as partitioned, rather than
+ force deleting them.
+ * Decision: YES - captured above.
+
+
+