author    Clayton Coleman <ccoleman@redhat.com>    2017-05-27 09:29:39 -0400
committer GitHub <noreply@github.com>              2017-05-27 09:29:39 -0400
commit    41346e4f3785c17311bf883cb7a8863100cd7b64 (patch)
tree      3bc2ec9b76138d3233a52c21b07f29e3d869bf3e
parent    fdfa14fe69ec9665f755b4358613c98281cb6411 (diff)
parent    ac00afd78d363075f925b487356be0c0e55c5d75 (diff)
Merge pull request #503 from kow3ns/ss-updates
initial StatefulSet updates proposal
-rw-r--r--    contributors/design-proposals/statefulset-update.md    828
1 file changed, 828 insertions, 0 deletions
diff --git a/contributors/design-proposals/statefulset-update.md b/contributors/design-proposals/statefulset-update.md
new file mode 100644
index 00000000..c8801861
--- /dev/null
+++ b/contributors/design-proposals/statefulset-update.md
@@ -0,0 +1,828 @@
+# StatefulSet Updates
+
+**Author**: kow3ns@
+
+**Status**: Proposal
+
+## Abstract
+Currently (as of Kubernetes 1.6), `.Spec.Replicas` and
+`.Spec.Template.Containers` are the only mutable fields of the
+StatefulSet API object. Updating `.Spec.Replicas` will scale the number of Pods
+in the StatefulSet. Updating `.Spec.Template.Containers` causes all subsequently
+created Pods to have the specified containers. In order to cause the
+StatefulSet controller to apply its updated `.Spec`, users must manually delete
+each Pod. This manual method of applying updates is error prone. The
+implementation of this proposal will add the capability to perform ordered,
+automated, sequential updates.
+
+## Affected Components
+1. API Server
+1. Kubectl
+1. StatefulSet Controller
+1. StatefulSetSpec API object
+1. StatefulSetStatus API object
+
+## Use Cases
+Upon implementation, this design will support the following in-scope use cases,
+and it will not rule out the future implementation of the out-of-scope use
+cases.
+
+### In Scope
+- As the administrator of a stateful application, in order to vertically scale
+my application, I want to update resource limits or requested resources.
+- As the administrator of a stateful application, in order to deploy critical
+security updates, break-fix patches, and feature releases, I want to update
+container images.
+- As the administrator of a stateful application, in order to update my
+application's configuration, I want to update environment variables, container
+entry point commands or parameters, or configuration files.
+- As the administrator of the logging and monitoring infrastructure for my
+organization, in order to add logging and monitoring sidecars, I want to patch
+a Pod's containers to add images.
+
+### Out of Scope
+- As the administrator of a stateful application, in order to increase the
+application's storage capacity, I want to update PersistentVolumes.
+- As the administrator of a stateful application, in order to update the
+network configuration of the application, I want to update Services and
+container ports in a consistent way.
+- As the administrator of a stateful application, when I scale my application
+horizontally, I want associated PodDisruptionBudgets to be adjusted to
+compensate for the application's scaling.
+
+## Assumptions
+ - StatefulSet update must support singleton StatefulSets. However, an update in
+ this case will cause a temporary outage. This is acceptable, as a
+ single-process application is, by definition, not highly available.
+ - Disruption in Kubernetes is controlled by PodDisruptionBudgets. As
+ StatefulSet updates progress one Pod at a time, and only occur when all
+ other Pods have a Status of Running and a Ready Condition, they cannot
+ violate reasonable PodDisruptionBudgets.
+ - Without priority and preemption, there is no guarantee that an update will
+ not block due to a loss of capacity or due to the scheduling of another Pod
+ between Pod termination and Pod creation. This is mitigated by blocking the
+ update when a Pod fails to schedule. Remediation will require operator
+ intervention. This implementation is no worse than the current behavior with
+ respect to eviction.
+ - We will eventually implement a signal that is delivered to Pods to indicate
+ the
+ [reason for termination](https://github.com/kubernetes/community/pull/541).
+ - StatefulSet updates will use the methodology outlined in the
+ [controller history](https://github.com/kubernetes/community/pull/594) proposal
+ for version tracking, update detection, and rollback detection.
+ This will be a general implementation, usable for any Pod in a Kubernetes
+ cluster. It is, therefore, out of scope to design such a mechanism here.
+ - Kubelet does not support resizing a container's resources without terminating
+ the Pod. In-place resource reallocation is out of scope for this design.
+ Vertical scaling must be performed destructively.
+ - The primary means of configuration update will be configuration files,
+ command line flags, environment variables, or ConfigMaps consumed as one of
+ the former.
+ - In-place configuration update via SIGHUP is not universally supported, and
+ Kubelet currently provides no mechanism to perform it. Pod reconfiguration
+ will be performed destructively.
+ - Stateful applications are likely to evolve wire protocols and storage formats
+ between versions. In most cases, when updating the containers of the
+ application's Pods, it will not be safe to roll back or forward to an arbitrary
+ version. Controller-based Pod update should work well when rolling out an
+ update, or performing a rollback, between two specific revisions of the
+ controlled API object. This is how Deployment functions, and this property is,
+ perhaps, even more critical for stateful applications.
+
+## Requirements
+This design is based on the following requirements.
+- Users must be able to update the containers of a StatefulSet's Pods.
+ - Updates to container commands, images, resources, and configuration must be
+ supported.
+- The update must progress in a sequential, deterministic order and respect the
+ StatefulSet
+ [identity](https://kubernetes.io/docs/concepts/abstractions/controllers/statefulsets/#pod-identity),
+ [deployment, and scaling](https://kubernetes.io/docs/concepts/abstractions/controllers/statefulsets/#deployment-and-scaling-guarantee)
+ guarantees.
+- A failed update must halt.
+- Users must be able to roll back an update.
+- Users must be able to roll forward to fix a failing/failed update.
+- Users must be able to view the status of an update.
+- Users should be able to view a bounded history of the updates that have been
+applied to the StatefulSet.
+
+## API Objects
+
+The following modifications will be made to the StatefulSetSpec API object.
+
+```go
+// StatefulSetUpdateStrategy indicates the strategy that the StatefulSet
+// controller will use to perform updates. It includes any additional parameters
+// necessary to perform the update for the indicated strategy.
+type StatefulSetUpdateStrategy struct {
+    // Type indicates the type of the StatefulSetUpdateStrategy.
+    Type StatefulSetUpdateStrategyType
+    // Partition is used to communicate the ordinal at which to partition
+    // the StatefulSet when Type is PartitionStatefulSetStrategyType. This
+    // value must be set when Type is PartitionStatefulSetStrategyType,
+    // and it must be nil otherwise.
+    Partition *PartitionStatefulSetStrategy
+}
+
+// StatefulSetUpdateStrategyType is a string enumeration type that enumerates
+// all possible update strategies for the StatefulSet controller.
+type StatefulSetUpdateStrategyType string
+
+const (
+    // PartitionStatefulSetStrategyType indicates that updates will only be
+    // applied to a partition of the StatefulSet. This is useful for canaries
+    // and phased roll outs. When a scale operation is performed with this
+    // strategy, new Pods will be created from the updated specification.
+    PartitionStatefulSetStrategyType StatefulSetUpdateStrategyType = "Partition"
+    // RollingUpdateStatefulSetStrategyType indicates that updates will be
+    // applied to all Pods in the StatefulSet with respect to the StatefulSet
+    // ordering constraints. When a scale operation is performed with this
+    // strategy, new Pods will be created from the updated specification.
+    RollingUpdateStatefulSetStrategyType StatefulSetUpdateStrategyType = "RollingUpdate"
+    // OnDeleteStatefulSetStrategyType triggers the legacy behavior. Version
+    // tracking and ordered rolling restarts are disabled. Pods are recreated
+    // from the StatefulSetSpec when they are manually deleted. When a scale
+    // operation is performed with this strategy, new Pods will be created
+    // from the current specification.
+    OnDeleteStatefulSetStrategyType StatefulSetUpdateStrategyType = "OnDelete"
+)
+
+// PartitionStatefulSetStrategy contains the parameters used with the
+// PartitionStatefulSetStrategyType.
+type PartitionStatefulSetStrategy struct {
+    // Ordinal indicates the ordinal at which the StatefulSet should be
+    // partitioned.
+    Ordinal int32
+}
+
+type StatefulSetSpec struct {
+    // Replicas, Selector, Template, VolumeClaimsTemplate, and ServiceName
+    // omitted for brevity.
+
+    // UpdateStrategy indicates the StatefulSetUpdateStrategy that will be
+    // employed to update Pods in the StatefulSet when a revision is made to
+    // Template or VolumeClaimsTemplate.
+    UpdateStrategy StatefulSetUpdateStrategy `json:"updateStrategy,omitempty"`
+
+    // RevisionHistoryLimit is the maximum number of revisions that will
+    // be maintained in the StatefulSet's revision history. The revision history
+    // consists of all revisions not represented by a currently applied
+    // StatefulSetSpec version. The default value is 2.
+    RevisionHistoryLimit *int32 `json:"revisionHistoryLimit,omitempty"`
+}
+```
+
+The following modifications will be made to the StatefulSetStatus API object.
+
+```go
+type StatefulSetStatus struct {
+    // ObservedGeneration and Replicas fields are omitted for brevity.
+
+    // CurrentRevision, if not empty, indicates the version of the
+    // PodTemplateSpec, VolumeClaimsTemplate tuple used to generate Pods in
+    // the sequence [0,CurrentReplicas).
+    CurrentRevision string `json:"currentRevision,omitempty"`
+
+    // UpdateRevision, if not empty, indicates the version of the
+    // PodTemplateSpec, VolumeClaimsTemplate tuple used to generate Pods in
+    // the sequence [Replicas-UpdatedReplicas,Replicas).
+    UpdateRevision string `json:"updateRevision,omitempty"`
+
+    // ReadyReplicas is the current number of Pods, created by the StatefulSet
+    // controller, that have a Status of Running and a Ready Condition.
+    ReadyReplicas int32 `json:"readyReplicas,omitempty"`
+
+    // CurrentReplicas is the number of Pods created by the StatefulSet
+    // controller from the PodTemplateSpec, VolumeClaimsTemplate tuple
+    // indicated by CurrentRevision.
+    CurrentReplicas int32 `json:"currentReplicas,omitempty"`
+
+    // UpdatedReplicas is the number of Pods created by the StatefulSet
+    // controller from the PodTemplateSpec, VolumeClaimsTemplate tuple
+    // indicated by UpdateRevision.
+    UpdatedReplicas int32 `json:"updatedReplicas,omitempty"`
+}
+```
+
+Additionally, we introduce the following constant.
+
+```go
+// StatefulSetRevisionLabel is the label used by the StatefulSet controller to
+// track which version of the StatefulSet's StatefulSetSpec was used to
+// generate a Pod.
+const StatefulSetRevisionLabel = "statefulset.kubernetes.io/revision"
+```
+
+## StatefulSet Controller
+The StatefulSet controller will watch for modifications to StatefulSet and Pod
+API objects. When a StatefulSet is created or updated, or when one
+of the Pods in a StatefulSet is updated or deleted, the StatefulSet
+controller will attempt to create, update, or delete Pods to conform the
+current state of the system to the user-declared [target state](#target-state).
+
+### Revised Controller Algorithm
+The StatefulSet controller will use the following algorithm to continue to
+make progress toward the user-declared [target state](#target-state) while
+respecting the controller's
+[identity](https://kubernetes.io/docs/concepts/abstractions/controllers/statefulsets/#pod-identity),
+[deployment, and scaling](https://kubernetes.io/docs/concepts/abstractions/controllers/statefulsets/#deployment-and-scaling-guarantee)
+guarantees. The StatefulSet controller will use the technique proposed in
+[Controller History](https://github.com/kubernetes/community/pull/594) to
+snapshot and version its [target Object state](#target-pod-state). A condensed
+sketch of this control loop appears after the list.
+
+1. The controller will reconstruct the
+[revision history](#history-reconstruction) of the StatefulSet.
+1. The controller will
+[process any updates to its StatefulSetSpec](#specification-updates) to
+ensure that the StatefulSet's revision history is consistent with the user
+declared desired state.
+1. The controller will select all Pods in the StatefulSet, filter any Pods not
+owned by the StatefulSet, and sort the remaining Pods in ordinal order.
+1. For all created Pods, the controller will perform any necessary
+[non-destructive state reconciliation](#pod-state-reconciliation).
+1. If any Pods with ordinals in the sequence `[0,.Spec.Replicas)` have not been
+created, the controller will create the Pod corresponding to the lowest such
+ordinal with the declared [target Pod state](#target-pod-state).
+1. If all Pods in the sequence `[0,.Spec.Replicas)` have been created, but if any
+do not have a Ready Condition, the StatefulSet controller will wait for these
+Pods to either become Ready, or to be completely deleted.
+1. If all Pods in the sequence `[0,.Spec.Replicas)` have a Ready Condition, and
+if `.Spec.Replicas` is less than `.Status.Replicas`, the controller will delete
+the Pod corresponding to the largest ordinal. This implies that scaling takes
+precedence over Pod updates.
+1. If all Pods in the sequence `[0,.Spec.Replicas)` have a Status of Running and
+a Ready Condition, if `.Spec.Replicas` is equal to `.Status.Replicas`, and if
+there are Pods that do not match their [target Pod state](#target-pod-state),
+the Pod with the largest ordinal in that set will be deleted.
+1. If the StatefulSet controller has achieved the
+[declared target state](#target-state), the StatefulSet controller will
+[complete any in progress updates](#update-completion).
+1. The controller will [report its status](#status-reporting).
+1. The controller will perform any necessary
+[maintenance of its revision history](#history-maintenance).
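+
+The following is a minimal, self-contained sketch of steps 5 through 9 of this
+loop, condensed into a single decision function. It is illustrative only: the
+`pod` type and the `nextAction` helper are hypothetical stand-ins for the
+controller's internal state, not part of the proposed API.
+
+```go
+package main
+
+import "fmt"
+
+// pod is a minimal stand-in for the controller's view of a Pod.
+type pod struct {
+    ready   bool // has a Status of Running and a Ready Condition
+    current bool // matches its target Pod state
+}
+
+// nextAction condenses steps 5-9: create, then wait, then scale down, then
+// update, then complete, in that order of precedence. It assumes pods holds
+// the owned Pods at contiguous ordinals [0,len(pods)), sorted by ordinal.
+func nextAction(specReplicas int, pods []pod) string {
+    if len(pods) < specReplicas {
+        return "create the lowest missing ordinal" // step 5
+    }
+    for _, p := range pods[:specReplicas] {
+        if !p.ready {
+            return "wait for Pods to become Ready" // step 6
+        }
+    }
+    if specReplicas < len(pods) {
+        return "delete the largest ordinal (scale down)" // step 7
+    }
+    for i := len(pods) - 1; i >= 0; i-- {
+        if !pods[i].current {
+            return "delete the largest stale ordinal (update)" // step 8
+        }
+    }
+    return "complete the update and report status" // steps 9-10
+}
+
+func main() {
+    // web-0 is current; web-1 and web-2 still need the update.
+    pods := []pod{{true, true}, {true, false}, {true, false}}
+    fmt.Println(nextAction(3, pods)) // delete the largest stale ordinal (update)
+}
+```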
+
+### Target State
+The target state of the StatefulSet controller with respect to an individual
+StatefulSet is defined as follows.
+
+1. The StatefulSet contains exactly the Pods with ordinals in the sequence
+`[0,.Spec.Replicas)`.
+1. All Pods in the StatefulSet have the correct
+[target Pod state](#target-pod-state).
+
+### Target Pod State
+As in the [Controller History](https://github.com/kubernetes/community/pull/594)
+proposal, we define the target Object state of a StatefulSet to be its
+`.Template` and `.VolumeClaimsTemplate`. The latter is currently immutable, but
+we will version it, as one day this constraint may be lifted. This state
+provides enough information to generate a Pod and its associated
+PersistentVolumeClaims. The target Pod state for a Pod in a StatefulSet is as
+follows (a sketch of the resulting revision selection appears after the list).
+1. The Pod's PersistentVolumeClaims have been created.
+    - Note that we do not currently delete PersistentVolumeClaims.
+1. If the Pod's ordinal is in the sequence `[0,.Spec.Replicas)`, the Pod should
+have a Ready Condition. This implies the Pod is Running.
+1. If the Pod's ordinal is greater than or equal to `.Spec.Replicas`, the Pod
+should be completely terminated and deleted.
+1. If the StatefulSet's `.Spec.UpdateStrategy.Type` is equal to
+`OnDeleteStatefulSetStrategyType`, no version tracking is performed, Pods
+can be at an arbitrary version, and they will be recreated from the current
+`.Spec.Template` and `.Spec.VolumeClaimsTemplate` when they are deleted.
+1. If the StatefulSet's `.Spec.UpdateStrategy.Type` is equal to
+`RollingUpdateStatefulSetStrategyType`, then the version of the Pod should be
+as follows.
+    1. If the Pod's ordinal is in the sequence `[0,.Status.CurrentReplicas)`,
+    the Pod should be consistent with the version indicated by
+    `.Status.CurrentRevision`.
+    1. If the Pod's ordinal is in the sequence
+    `[.Status.Replicas - .Status.UpdatedReplicas, .Status.Replicas)`,
+    the Pod should be consistent with the version indicated by
+    `.Status.UpdateRevision`.
+1. If the StatefulSet's `.Spec.UpdateStrategy.Type` is equal to
+`PartitionStatefulSetStrategyType`, then the version of the Pod should be
+as follows.
+    1. If the Pod's ordinal is in the sequence `[0,.Status.CurrentReplicas)`,
+    the Pod should be consistent with the version indicated by
+    `.Status.CurrentRevision`.
+    1. If the Pod's ordinal is in the sequence
+    `[.Status.Replicas - .Status.UpdatedReplicas, .Status.Replicas)`, the Pod
+    should be consistent with the version indicated by `.Status.UpdateRevision`.
+    1. If the Pod does not meet either of the prior two conditions, and if its
+    ordinal is in the sequence `[0, .Spec.UpdateStrategy.Partition.Ordinal)`,
+    it should be consistent with the version indicated by
+    `.Status.CurrentRevision`.
+    1. Otherwise, the Pod should be consistent with the version indicated
+    by `.Status.UpdateRevision`.
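+
+A minimal sketch of this revision selection, assuming the stand-in `set` type
+below mirrors the relevant Spec and Status fields (names are illustrative, not
+part of the proposed API):
+
+```go
+package main
+
+import "fmt"
+
+// set is a stand-in for the relevant StatefulSet Spec and Status fields.
+type set struct {
+    strategy        string // "OnDelete", "RollingUpdate", or "Partition"
+    partition       int32  // .Spec.UpdateStrategy.Partition.Ordinal
+    replicas        int32  // .Status.Replicas
+    currentReplicas int32  // .Status.CurrentReplicas
+    updatedReplicas int32  // .Status.UpdatedReplicas
+    currentRevision string // .Status.CurrentRevision
+    updateRevision  string // .Status.UpdateRevision
+}
+
+// targetRevision returns the revision a Pod with the given ordinal should be
+// consistent with. An empty result means any version is acceptable (OnDelete).
+func targetRevision(s set, ordinal int32) string {
+    switch s.strategy {
+    case "OnDelete":
+        return "" // no version tracking is performed
+    case "RollingUpdate":
+        if ordinal < s.currentReplicas {
+            return s.currentRevision
+        }
+        return s.updateRevision
+    case "Partition":
+        switch {
+        case ordinal < s.currentReplicas:
+            return s.currentRevision
+        case ordinal >= s.replicas-s.updatedReplicas:
+            return s.updateRevision
+        case ordinal < s.partition:
+            return s.currentRevision
+        default:
+            return s.updateRevision
+        }
+    }
+    return ""
+}
+
+func main() {
+    s := set{strategy: "Partition", partition: 2, replicas: 3,
+        currentReplicas: 2, currentRevision: "web-1", updateRevision: "web-2"}
+    fmt.Println(targetRevision(s, 2)) // web-2: at or above the partition
+}
+```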
+
+### Pod State Reconciliation
+In order to reconcile a Pod with its declared
+[target Pod state](#target-pod-state), the StatefulSet controller will do the
+following (see the sketch after this list).
+
+1. If the Pod is already consistent with its target state, the controller will
+do nothing.
+1. If the Pod is labeled with a `StatefulSetRevisionLabel` that indicates
+the Pod was generated from a version of the StatefulSetSpec that is semantically
+equivalent to, but not equal to, the [target version](#target-pod-state), the
+StatefulSet controller will update the Pod with a `StatefulSetRevisionLabel`
+indicating the new semantically equivalent version. This form of reconciliation
+is non-destructive.
+1. If the Pod was not created from the target version, the Pod will be deleted
+and recreated from that version. This form of reconciliation is destructive.
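+
+A sketch of this three-way decision, under the assumption that each Pod's
+current revision is read from its `StatefulSetRevisionLabel` and that
+`equivalent` is a hypothetical hook for semantic equality of revisions:
+
+```go
+package main
+
+import "fmt"
+
+const revisionLabel = "statefulset.kubernetes.io/revision"
+
+// pod is a minimal stand-in; revision mirrors the Pod's revision label value.
+type pod struct {
+    name     string
+    revision string
+}
+
+// reconcile returns the action for one Pod: a no-op, a non-destructive
+// relabel, or a destructive delete-and-recreate.
+func reconcile(p pod, target string, equivalent func(a, b string) bool) string {
+    switch {
+    case p.revision == target:
+        return "no-op" // already consistent with the target state
+    case equivalent(p.revision, target):
+        // Non-destructive: only rewrite the revision label.
+        return "set " + revisionLabel + "=" + target + " on " + p.name
+    default:
+        // Destructive: delete the Pod and recreate it from the target version.
+        return "delete and recreate " + p.name + " at " + target
+    }
+}
+
+func main() {
+    neverEqual := func(a, b string) bool { return false }
+    fmt.Println(reconcile(pod{"web-0", "web-1"}, "web-2", neverEqual))
+}
+```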
+
+### Specification Updates
+The StatefulSet controller will [snapshot](#snapshot-creation) its target
+Object state when mutations are made to its `.Spec.Template` or
+`.Spec.VolumeClaimsTemplate` (note that the latter is currently immutable).
+A sketch of this update-detection logic follows the list.
+
+1. When the StatefulSet controller observes a mutation to a StatefulSet's
+`.Spec.Template`, it will snapshot its target Object state and compare
+the snapshot with the version indicated by its `.Status.UpdateRevision`.
+1. If the current state is equivalent to the version indicated by
+`.Status.UpdateRevision`, no update has occurred.
+1. If the `Status.CurrentRevision` field is empty, then the StatefulSet has no
+revision history. To initialize its revision history, the StatefulSet controller
+will set both `.Status.CurrentRevision` and `.Status.UpdateRevision` to the
+version of the current snapshot.
+1. If the `.Status.CurrentRevision` is not empty, and if the
+`.Status.UpdateRevision` is not equal to the version of the current snapshot,
+the StatefulSet controller will set the `.Status.UpdateRevision` to the version
+indicated by the current snapshot.
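+
+A sketch of this bookkeeping, assuming `snapshot` is the version name of the
+snapshot just taken (the `status` type and `processSpecUpdate` helper are
+illustrative):
+
+```go
+package main
+
+import "fmt"
+
+// status is a stand-in for the revision fields of StatefulSetStatus.
+type status struct {
+    currentRevision string
+    updateRevision  string
+}
+
+// processSpecUpdate applies the rules above after a snapshot has been taken.
+func processSpecUpdate(st *status, snapshot string) {
+    switch {
+    case st.updateRevision == snapshot:
+        // Equivalent to the version already being rolled out: no update.
+    case st.currentRevision == "":
+        // No revision history: initialize both revision pointers.
+        st.currentRevision, st.updateRevision = snapshot, snapshot
+    default:
+        // A new version: begin rolling it out.
+        st.updateRevision = snapshot
+    }
+}
+
+func main() {
+    st := &status{}
+    processSpecUpdate(st, "web-1") // initializes the history
+    processSpecUpdate(st, "web-2") // begins an update
+    fmt.Println(st.currentRevision, st.updateRevision) // web-1 web-2
+}
+```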
+
+### StatefulSet Revision History
+The StatefulSet controller will use the technique proposed in
+[Controller History](https://github.com/kubernetes/community/pull/594) to
+snapshot and version its target Object state.
+
+#### Snapshot Creation
+In order to snapshot a version of its target Object state, the StatefulSet
+controller will serialize and store the `.Spec.Template` and
+`.Spec.VolumeClaimsTemplate` along with the `.Generation` in each snapshot.
+Each snapshot will be labeled with the StatefulSet's `.Spec.Selector`.
+
+#### History Reconstruction
+As proposed in
+[Controller History](https://github.com/kubernetes/community/pull/594), in
+order to reconstruct the revision history of a StatefulSet, the StatefulSet
+controller will select all snapshots based on its `.Spec.Selector` and sort them
+by the contained `.Generation`. This will produce an ordered set of
+revisions to the StatefulSet's target Object state.
+
+#### History Maintenance
+In order to prevent the revision history of the StatefulSet from exceeding
+memory or storage limits, the StatefulSet controller will periodically prune
+its revision history so that no more than `.Spec.RevisionHistoryLimit` non-live
+versions of target Object state are preserved. A sketch of the history
+mechanism follows.
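+
+A sketch of the three history operations (creation, reconstruction, and
+pruning), with a stand-in `snapshot` type holding the serialized state and the
+observed `.Generation`; all names are illustrative:
+
+```go
+package main
+
+import (
+    "fmt"
+    "sort"
+)
+
+// snapshot is a stand-in for one stored revision of target Object state.
+type snapshot struct {
+    generation int64  // the StatefulSet's .Generation when taken
+    name       string // the version identifier
+}
+
+// reconstructHistory orders snapshots by .Generation.
+func reconstructHistory(snaps []snapshot) []snapshot {
+    sort.Slice(snaps, func(i, j int) bool {
+        return snaps[i].generation < snaps[j].generation
+    })
+    return snaps
+}
+
+// prune drops the oldest non-live snapshots until at most limit remain. Live
+// revisions (CurrentRevision and UpdateRevision) are never pruned.
+func prune(snaps []snapshot, limit int, live map[string]bool) []snapshot {
+    nonLive := 0
+    for _, s := range snaps {
+        if !live[s.name] {
+            nonLive++
+        }
+    }
+    kept := snaps[:0]
+    for _, s := range snaps {
+        if !live[s.name] && nonLive > limit {
+            nonLive-- // oldest non-live revisions are dropped first
+            continue
+        }
+        kept = append(kept, s)
+    }
+    return kept
+}
+
+func main() {
+    h := reconstructHistory([]snapshot{{3, "web-3"}, {1, "web-1"}, {2, "web-2"}})
+    live := map[string]bool{"web-2": true, "web-3": true}
+    fmt.Println(prune(h, 0, live)) // [{2 web-2} {3 web-3}]
+}
+```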
+
+### Update Completion
+The criteria for update completion are as follows (a sketch appears after the
+list).
+
+1. If the StatefulSet's `.Spec.UpdateStrategy.Type` is equal to
+`OnDeleteStatefulSetStrategyType` then no version tracking is performed. In
+this case, an update can never be in progress.
+1. If the StatefulSet's `.Spec.UpdateStrategy.Type` is equal to
+`PartitionStatefulSetStrategyType`, updates cannot complete. The version
+indicated by `.Status.UpdateRevision` will only be applied to Pods with ordinals
+in the sequence `[.Spec.UpdateStrategy.Partition.Ordinal,.Spec.Replicas)`.
+1. If the StatefulSet's `.Spec.UpdateStrategy.Type` is equal to
+`RollingUpdateStatefulSetStrategyType`, then an update is complete when the
+StatefulSet is at its [target state](#target-state). The StatefulSet controller
+will signal update completion as follows.
+ 1. The controller will set `.Status.CurrentRevision` to the value of
+ `.Status.UpdateRevision`.
+ 1. The controller will set `.Status.CurrentReplicas` to
+ `.Status.UpdatedReplicas`. Note that this value will be equal to
+ `.Status.Replicas`.
+ 1. The controller will set `.Status.UpdatedReplicas` to 0.
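+
+A sketch of the completion signaling for the `RollingUpdate` strategy (the
+`status` type and helper are illustrative):
+
+```go
+package main
+
+import "fmt"
+
+// status is a stand-in for the relevant StatefulSetStatus fields.
+type status struct {
+    replicas        int32
+    currentReplicas int32
+    updatedReplicas int32
+    currentRevision string
+    updateRevision  string
+}
+
+// maybeCompleteUpdate signals completion once every Pod has been recreated
+// from UpdateRevision.
+func maybeCompleteUpdate(strategy string, st *status) {
+    if strategy != "RollingUpdate" {
+        return // OnDelete performs no tracking; Partition never completes
+    }
+    if st.updatedReplicas != st.replicas {
+        return // the roll out is still in progress
+    }
+    st.currentRevision = st.updateRevision
+    st.currentReplicas = st.updatedReplicas // equal to .Status.Replicas here
+    st.updatedReplicas = 0
+}
+
+func main() {
+    st := &status{replicas: 3, updatedReplicas: 3,
+        currentRevision: "web-1", updateRevision: "web-2"}
+    maybeCompleteUpdate("RollingUpdate", st)
+    fmt.Println(st.currentRevision, st.currentReplicas, st.updatedReplicas) // web-2 3 0
+}
+```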
+
+### Status Reporting
+After processing the creation, update, or deletion of a StatefulSet or Pod,
+the StatefulSet controller will record its status by persisting a
+StatefulSetStatus object. This has two purposes.
+
+1. It allows the StatefulSet controller to recreate the exact StatefulSet
+membership in the event of a hard restart of the entire system.
+1. It communicates the current state of the StatefulSet to clients. Using the
+`.Status.ObservedGeneration`, clients can construct a linearizable view of
+the operations performed by the controller.
+
+When the StatefulSet controller records the status of a StatefulSet, it will
+do the following (sketched, in part, after the list).
+
+1. The controller will set the `.Status.ObservedGeneration` to the
+`.Generation` of the StatefulSet object that was observed.
+1. The controller will set the `.Status.Replicas` to the current number of
+created Pods.
+1. The controller will set the `.Status.ReadyReplicas` to the current number of
+Pods that have a Ready Condition.
+1. The controller will set the `.Status.CurrentRevision` and
+`.Status.UpdateRevision` in accordance with StatefulSet's
+[revision history](#statefulset-revision-history) and
+any [complete updates](#update-completion).
+1. The controller will set the `.Status.CurrentReplicas` to the number of
+Pods that it has created from the version indicated by
+`.Status.CurrentRevision`.
+1. The controller will set the `.Status.UpdatedReplicas` to the number of Pods
+that it has created from the version indicated by `.Status.UpdateRevision`.
+1. The controller will then persist the StatefulSetStatus to make it durable
+and to communicate it to observers.
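+
+A sketch of how the replica counters above can be derived from the observed
+Pods and the two tracked revisions (the `pod` type and `counts` helper are
+illustrative):
+
+```go
+package main
+
+import "fmt"
+
+// pod is a stand-in: the Pod's revision label value and its readiness.
+type pod struct {
+    revision string
+    ready    bool
+}
+
+// counts derives Replicas, ReadyReplicas, CurrentReplicas, and UpdatedReplicas.
+func counts(pods []pod, currentRev, updateRev string) (replicas, ready, current, updated int32) {
+    for _, p := range pods {
+        replicas++
+        if p.ready {
+            ready++
+        }
+        switch p.revision {
+        case currentRev:
+            current++
+        case updateRev:
+            updated++
+        }
+    }
+    return
+}
+
+func main() {
+    pods := []pod{{"web-1", true}, {"web-1", true}, {"web-2", false}}
+    fmt.Println(counts(pods, "web-1", "web-2")) // 3 2 2 1
+}
+```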
+
+## API Server
+The API Server will perform validation for StatefulSet creation and updates.
+
+### StatefulSet Validation
+As is currently implemented, the API Server will not allow mutation to any
+fields of the StatefulSet object other than `.Spec.Replicas` and
+`.Spec.Template.Containers`. This design imposes the following additional
+constraints (a sketch of the validation follows the list).
+
+1. If the `.Spec.UpdateStrategy.Type` is equal to
+`PartitionStatefulSetStrategyType`, the API Server should fail validation
+if any of the following conditions are true.
+ 1. `.Spec.UpdateStrategy.Partition` is nil.
+ 1. `.Spec.UpdateStrategy.Partition` is not nil, and
+ `.Spec.UpdateStrategy.Partition.Ordinal` is not in the sequence
+ `(0,.Spec.Replicas)`.
+1. The API Server will fail validation on any update to a StatefulSetStatus
+object if any of the following conditions are true.
+ 1. `.Status.Replicas` is negative.
+ 1. `.Status.ReadyReplicas` is negative or greater than `.Status.Replicas`.
+ 1. `.Status.CurrentReplicas` is negative or greater than `.Status.Replicas`.
+ 1. `.Status.UpdatedReplicas` is negative or greater than `.Status.Replicas`.
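+
+A sketch of these validation rules (types and helpers are illustrative, not
+the API Server's actual validation code):
+
+```go
+package main
+
+import (
+    "errors"
+    "fmt"
+)
+
+// spec is a stand-in for the validated StatefulSetSpec fields.
+type spec struct {
+    replicas     int32
+    strategyType string // e.g. "Partition"
+    partition    *int32 // .Spec.UpdateStrategy.Partition.Ordinal, if set
+}
+
+// validateSpec enforces the Partition constraints above.
+func validateSpec(s spec) error {
+    if s.strategyType != "Partition" {
+        return nil
+    }
+    if s.partition == nil {
+        return errors.New("partition strategy requires .Spec.UpdateStrategy.Partition")
+    }
+    if o := *s.partition; o <= 0 || o >= s.replicas {
+        return fmt.Errorf("partition ordinal %d must be in (0,%d)", o, s.replicas)
+    }
+    return nil
+}
+
+// validateStatus enforces the StatefulSetStatus constraints above.
+func validateStatus(replicas, ready, current, updated int32) error {
+    if replicas < 0 {
+        return errors.New(".Status.Replicas is negative")
+    }
+    for name, v := range map[string]int32{
+        ".Status.ReadyReplicas":   ready,
+        ".Status.CurrentReplicas": current,
+        ".Status.UpdatedReplicas": updated,
+    } {
+        if v < 0 || v > replicas {
+            return fmt.Errorf("%s must be in [0,%d]", name, replicas)
+        }
+    }
+    return nil
+}
+
+func main() {
+    fmt.Println(validateSpec(spec{replicas: 3, strategyType: "Partition"}))
+    fmt.Println(validateStatus(3, 2, 2, 1))
+}
+```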
+
+## Kubectl
+Kubectl will use the `rollout` command to control and provide the status of
+StatefulSet updates.
+
+ - `kubectl rollout status statefulset <StatefulSet-Name>`: displays the status
+ of a StatefulSet update.
+ - `kubectl rollout undo statefulset <StatefulSet-Name>`: triggers a rollback
+ of the current update.
+ - `kubectl rollout history statefulset <StatefulSet-Name>`: displays the
+ StatefulSet's revision history.
+
+## Usage
+This section demonstrates how the design functions in typical usage scenarios.
+
+### Initial Deployment
+Users can create a StatefulSet using `kubectl apply`.
+
+Given the following manifest `web.yaml`
+
+```yaml
+apiVersion: apps/v1beta1
+kind: StatefulSet
+metadata:
+  name: web
+spec:
+  serviceName: "nginx"
+  replicas: 3
+  template:
+    metadata:
+      labels:
+        app: nginx
+    spec:
+      containers:
+      - name: nginx
+        image: gcr.io/google_containers/nginx-slim:0.8
+        ports:
+        - containerPort: 80
+          name: web
+        volumeMounts:
+        - name: www
+          mountPath: /usr/share/nginx/html
+  volumeClaimTemplates:
+  - metadata:
+      name: www
+      annotations:
+        volume.alpha.kubernetes.io/storage-class: anything
+    spec:
+      accessModes: [ "ReadWriteOnce" ]
+      resources:
+        requests:
+          storage: 1Gi
+```
+
+Users can use the following command to create the StatefulSet.
+
+```shell
+kubectl apply -f web.yaml
+```
+
+The only difference between the proposed and current implementation is that
+the proposed implementation will initialize the StatefulSet's revision history
+upon initial creation.
+
+### Rolling out an Update
+Users can create a rolling update using `kubectl apply`. If a user creates a
+StatefulSet [as above](#initial-deployment), the user can trigger a rolling
+update by setting the update strategy to `RollingUpdate` and updating the image
+(as in the manifest below).
+
+```yaml
+apiVersion: apps/v1beta1
+kind: StatefulSet
+metadata:
+  name: web
+spec:
+  serviceName: "nginx"
+  replicas: 3
+  updateStrategy:
+    type: RollingUpdate
+  template:
+    metadata:
+      labels:
+        app: nginx
+    spec:
+      containers:
+      - name: nginx
+        image: gcr.io/google_containers/nginx-slim:0.9
+        ports:
+        - containerPort: 80
+          name: web
+        volumeMounts:
+        - name: www
+          mountPath: /usr/share/nginx/html
+  volumeClaimTemplates:
+  - metadata:
+      name: www
+      annotations:
+        volume.alpha.kubernetes.io/storage-class: anything
+    spec:
+      accessModes: [ "ReadWriteOnce" ]
+      resources:
+        requests:
+          storage: 1Gi
+```
+
+Users can use the following command to trigger a rolling update.
+
+```shell
+kubectl apply -f web.yaml
+```
+
+### Canaries
+Users can create a canary using `kubectl apply`. The only difference between a
+[rolling update](#rolling-out-an-update) and a canary is that the
+`.Spec.UpdateStrategy.Type` is set to `PartitionStatefulSetStrategyType` and
+the `.Spec.UpdateStrategy.Partition.Ordinal` is set to `.Spec.Replicas-1`.
+
+```yaml
+apiVersion: apps/v1beta1
+kind: StatefulSet
+metadata:
+  name: web
+spec:
+  serviceName: "nginx"
+  replicas: 3
+  updateStrategy:
+    type: Partition
+    partition:
+      ordinal: 2
+  template:
+    metadata:
+      labels:
+        app: nginx
+    spec:
+      containers:
+      - name: nginx
+        image: gcr.io/google_containers/nginx-slim:0.9
+        ports:
+        - containerPort: 80
+          name: web
+        volumeMounts:
+        - name: www
+          mountPath: /usr/share/nginx/html
+  volumeClaimTemplates:
+  - metadata:
+      name: www
+      annotations:
+        volume.alpha.kubernetes.io/storage-class: anything
+    spec:
+      accessModes: [ "ReadWriteOnce" ]
+      resources:
+        requests:
+          storage: 1Gi
+```
+
+Users can also simultaneously scale up and add a canary. This reduces risk
+for some deployment scenarios by adding additional capacity for the canary.
+For example, in the manifest below, `.Spec.Replicas` is increased to `4` while
+`.Spec.UpdateStrategy.Partition.Ordinal` is set to `.Spec.Replicas-1`.
+
+```yaml
+apiVersion: apps/v1beta1
+kind: StatefulSet
+metadata:
+  name: web
+spec:
+  serviceName: "nginx"
+  replicas: 4
+  updateStrategy:
+    type: Partition
+    partition:
+      ordinal: 3
+  template:
+    metadata:
+      labels:
+        app: nginx
+    spec:
+      containers:
+      - name: nginx
+        image: gcr.io/google_containers/nginx-slim:0.9
+        ports:
+        - containerPort: 80
+          name: web
+        volumeMounts:
+        - name: www
+          mountPath: /usr/share/nginx/html
+  volumeClaimTemplates:
+  - metadata:
+      name: www
+      annotations:
+        volume.alpha.kubernetes.io/storage-class: anything
+    spec:
+      accessModes: [ "ReadWriteOnce" ]
+      resources:
+        requests:
+          storage: 1Gi
+```
+
+### Phased Roll Outs
+Users can create a phased roll out using `kubectl apply`. The only difference
+between a [canary](#canaries) and a phased roll out is that the
+`.Spec.UpdateStrategy.Partition.Ordinal` is set to a value less than
+`.Spec.Replicas-1`.
+
+```yaml
+apiVersion: apps/v1beta1
+kind: StatefulSet
+metadata:
+  name: web
+spec:
+  serviceName: "nginx"
+  replicas: 4
+  updateStrategy:
+    type: Partition
+    partition:
+      ordinal: 2
+  template:
+    metadata:
+      labels:
+        app: nginx
+    spec:
+      containers:
+      - name: nginx
+        image: gcr.io/google_containers/nginx-slim:0.9
+        ports:
+        - containerPort: 80
+          name: web
+        volumeMounts:
+        - name: www
+          mountPath: /usr/share/nginx/html
+  volumeClaimTemplates:
+  - metadata:
+      name: www
+      annotations:
+        volume.alpha.kubernetes.io/storage-class: anything
+    spec:
+      accessModes: [ "ReadWriteOnce" ]
+      resources:
+        requests:
+          storage: 1Gi
+```
+
+Phased roll outs can be used to roll out a configuration, image, or resource
+update to some portion of the fleet maintained by the StatefulSet prior to
+updating the entire fleet. This is useful for supporting linear, geometric,
+and exponential roll out of an update. Users can modify the
+`.Spec.UpdateStrategy.Partition.Ordinal` to allow the roll out to progress, as
+in the manifest below and the scheduling sketch that follows it.
+
+```yaml
+apiVersion: apps/v1beta1
+kind: StatefulSet
+metadata:
+  name: web
+spec:
+  serviceName: "nginx"
+  replicas: 3
+  updateStrategy:
+    type: Partition
+    partition:
+      ordinal: 1
+  template:
+    metadata:
+      labels:
+        app: nginx
+    spec:
+      containers:
+      - name: nginx
+        image: gcr.io/google_containers/nginx-slim:0.9
+        ports:
+        - containerPort: 80
+          name: web
+        volumeMounts:
+        - name: www
+          mountPath: /usr/share/nginx/html
+  volumeClaimTemplates:
+  - metadata:
+      name: www
+      annotations:
+        volume.alpha.kubernetes.io/storage-class: anything
+    spec:
+      accessModes: [ "ReadWriteOnce" ]
+      resources:
+        requests:
+          storage: 1Gi
+```
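+
+One way to schedule such a progression is to precompute the sequence of
+partition ordinals. The sketch below (a hypothetical helper, not part of the
+proposal) produces a geometric roll out in which each phase doubles the total
+number of updated Pods; the roll out is finished by switching the strategy to
+`RollingUpdate`, since a partitioned update never completes on its own.
+
+```go
+package main
+
+import "fmt"
+
+// phases returns a decreasing sequence of partition ordinals implementing a
+// geometric roll out. Ordinals stay within the valid range (0,replicas).
+func phases(replicas int32) []int32 {
+    var out []int32
+    for updated := int32(1); updated < replicas; updated *= 2 {
+        out = append(out, replicas-updated)
+    }
+    return out
+}
+
+func main() {
+    // For 8 replicas: ordinals 7, 6, and 4 update 1, 2, and then 4 Pods; a
+    // final switch to RollingUpdate updates the remaining 4 and completes.
+    fmt.Println(phases(8)) // [7 6 4]
+}
+```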
+
+### Rollbacks
+To roll back an update, users can use the `kubectl rollout` command.
+
+The command below will roll back the `web` StatefulSet to the previous revision in
+its history. If a roll out is in progress, it will stop deploying the target
+revision, and roll back to the current revision.
+
+```shell
+kubectl rollout undo statefulset web
+```
+
+### Rolling Forward
+Rolling back is usually the safest, and often the fastest, strategy to mitigate
+deployment failure, but rolling forward is sometimes the only practical solution
+for stateful applications (e.g. a user has a minor configuration error but has
+already modified the storage format of the application). Users can use
+sequential `kubectl apply` invocations to update the StatefulSet's current
+[target state](#target-state). The StatefulSet's `.Spec.UpdateStrategy.Partition`
+will be respected, and rolling forward therefore interacts well with canaries
+and phased roll outs.
+
+## Tests
+- Updating a StatefulSet's containers will trigger updates to the StatefulSet's
+Pods respecting the
+[identity](https://kubernetes.io/docs/concepts/abstractions/controllers/statefulsets/#pod-identity)
+and [deployment, and scaling](https://kubernetes.io/docs/concepts/abstractions/controllers/statefulsets/#deployment-and-scaling-guarantee)
+guarantees.
+- A StatefulSet update will block on failure.
+- A StatefulSet update can be rolled back.
+- A StatefulSet update can be rolled forward by applying another update.
+- A StatefulSet update's status can be retrieved.
+- A StatefulSet's revision history contains all updates with respect to the
+configured revision history limit.
+- A StatefulSet update can create a canary.
+- A StatefulSet update can be performed in stages.
+
+## Future Work
+In the future, we may implement the following features to enhance StatefulSet
+updates.
+
+### Termination Reason
+Without communicating a signal indicating the reason for termination to a Pod in
+a StatefulSet, as proposed [here](https://github.com/kubernetes/community/pull/541),
+the tenant application has no way to determine if it is being terminated due to
+a scale down operation or due to an update.
+
+Consider a BASE distributed storage application like Cassandra, where 2 TiB of
+persistent data is not atypical, and the data distribution is not identical on
+every server. We want to enable two distinct behaviors based on the reason for
+termination.
+
+- If the termination is due to scale down, during the configured termination
+grace period, the entry point of the Pod should cause the application to drain
+its client connections, replicate its persisted data (so that the cluster is
+not left under-replicated), and decommission the application to remove it from
+the cluster.
+- If the termination is due to a temporary capacity loss (e.g. an update or an
+image upgrade), the application should drain all of its client connections,
+flush any in memory data structures to the file system, and synchronize the
+file system with storage media. It should not redistribute its data.
+
+If the application implements the strategy of always redistributing its data,
+we unnecessarily increase recovery time during an update and incur the
+additional network and storage cost of two full data redistributions for every
+updated node. It should be noted that this is already an issue for Node cordon
+and Pod eviction (due to drain or taints), and applications can apply the same
+mitigations to a StatefulSet update as they would to these events.
+
+### VolumeClaimsTemplate Updates
+While this proposal does not address
+[VolumeClaimsTemplate updates](https://github.com/kubernetes/kubernetes/issues/41015),
+this would be a valuable feature for production users of storage systems that use
+intermittent compaction as a form of garbage collection. Applications that use
+log-structured merge trees with size-tiered compaction (e.g. Cassandra) or
+append-only B(+/*) trees (e.g. Couchbase) can temporarily double their storage
+requirement during compaction. If there is insufficient space for compaction
+to progress, these applications will either fail or degrade until additional
+capacity is added. While there are valid manual workarounds to expand the size
+of a PD for users of AWS EBS or GCE PD, it would be useful to automate the
+resize via updates to the StatefulSet's VolumeClaimsTemplate.
+
+### In Place Updates
+Currently, configuration, image, and resource request/limit updates are all
+performed destructively. Without a [termination reason](https://github.com/kubernetes/community/pull/541)
+implementation, there is little value in implementing in-place image updates,
+and in-place configuration and resource request/limit updates are not possible.
+When [termination reason](https://github.com/kubernetes/kubernetes/issues/1462)
+is implemented, we may modify the behavior of StatefulSet update to only update,
+rather than delete and create, Pods when the only mutated value is the container
+image, and, if resizable resource request/limits is implemented, we may extend
+the above to allow for updates to Pod resources.