author: Kenneth Owens <kowens0826@gmail.com> 2017-05-24 09:00:13 -0700
committer: GitHub <noreply@github.com> 2017-05-24 09:00:13 -0700
commit: d8a25a81dbb43888b710ed395f3bf7884ee539d4
tree: b278195d2f213e9b93fa459c75df0746f3e9e6bc
parent: f929dac3046c3a75652b3731c4ab88ce1912e39b
parent: 3bc2e74ec3e94e06571d26349c15f50b26fe2092
Merge pull request #594 from kow3ns/controller-history
Merging controller history proposal
-rw-r--r-- contributors/design-proposals/controller_history.md | 462
1 files changed, 462 insertions, 0 deletions
diff --git a/contributors/design-proposals/controller_history.md b/contributors/design-proposals/controller_history.md
new file mode 100644
index 00000000..fbf89b70
--- /dev/null
+++ b/contributors/design-proposals/controller_history.md
@@ -0,0 +1,462 @@
+# Controller History
+
+**Author**: kow3ns@
+
+**Status**: Proposal
+
+## Abstract
+In Kubernetes, in order to update and rollback the configuration and binary
+images of controller managed Pods, users mutate DaemonSet, StatefulSet,
+and Deployment Objects, and the corresponding controllers attempt to transition
+the current state of the system to the new declared target state.
+
+To facilitate update and rollback for these controllers, and to provide a
+primitive that third party controllers can build on, we propose a mechanism
+that allows controllers to manage a bounded history of revisions to the declared
+target state of their generated Objects.
+
+## Affected Components
+
+1. API Machinery
+1. API Server
+1. Kubectl
+1. Controllers that utilize the feature
+
+## Requirements
+
+1. History is a collection of points in time, and each point in time must be
+represented by its own Object. While it is tempting to aggregate all of an
+Object's history into a single container Object, experience with Borg and Mesos
+has taught us that this inevitably leads to exhausting the single Object size
+limit of the system's storage backend.
+1. We must be able to select the Objects that contain point in time snapshots
+of versions of an Object to reconstruct the Object's history.
+1. History respects causality. The Object type used to store point in time
+snapshots must be strictly ordered with respect to creation. CreationTimestamp
+should not be used, as this is susceptible to clock skew.
+1. History must not be revisionist. Once an Object corresponding to a version
+of a controller's target state is created, it cannot be mutated.
+1. Controller history requires only current events. Storing an exhaustive
+history of all revisions to all controllers is out of scope for our purposes,
+and it can be solved by applying a version control system to manifests. Internal
+revision history must only store revisions to the controller's target state that
+correspond to live Objects and (potentially) a small, configurable number of
+prior revisions.
+1. History is scale invariant. A revision to a controller is a modification
+that changes the specification of the Objects it generates. Changing the
+cardinality of those Objects is a scaling operation and does not constitute a
+revision.
+
+## Terminology
+The following terminology is used throughout the rest of this proposal. We
+make its meaning explicit here.
+- The specification type of a controller is the type that contains the
+specification for the Objects generated by the controller.
+ - For example, the specification types for the ReplicaSet, DaemonSet,
+ and StatefulSet controllers are ReplicaSetSpec, DaemonSetSpec,
+ and StatefulSetSpec respectively.
+- The generated type(s) for a controller is/are the type of the Object(s)
+generated by the controller.
+ - Pod is a generated type for the ReplicaSet, DaemonSet, and StatefulSet
+ controllers.
+ - PersistentVolumeClaim is also a generated type for the StatefulSet
+ controller.
+- The current state of a controller is the union of the states of its generated
+Objects along with its status.
+ - For ReplicaSet, DaemonSet, and StatefulSet, the current state of the
+ corresponding controllers can be derived from Pods they contain and the
+ ReplicaSetStatus, DaemonSetStatus, and StatefulSetStatus objects
+ respectively.
+- For all specification type Objects for controllers, the target state is the
+set of fields in the Object that determine the state to which the controller
+attempts to evolve the system.
+ - This may not necessarily be all fields of the Object.
+ - For example, for the StatefulSet controller `.Spec.Template`,
+ `.Spec.Replicas`, and `.Spec.VolumeClaims` determine the target state. The
+ controller "wants" to create `.Spec.Replicas` Pods generated from
+ `.Spec.Template` and `.Spec.VolumeClaims`.
+- The target Object state is the subset of the target state necessary to create
+Objects of the generated type(s).
+ - To make this concrete, for the StatefulSet controller `.Spec.Template`
+ and `.Spec.VolumeClaims` are the target Object state. This is enough
+ information for the controller to generate Pods and corresponding PVCs.
+- If a version of the target Object state was used to generate an Object that
+has not yet been deleted, we refer to the version, and any snapshots of the
+version, as live.
+
+## API Objects
+
+Kubernetes controllers already persist their current and target states to the
+API Server. In order to maintain a history of revisions to specification type
+Objects, we only need to persist snapshots of the target Object states
+contained in the specification type when they are revised.
+
+One approach would be to, for every specification type, have a
+corresponding History type. For example, we could introduce a StatefulSetHistory
+object that aggregates a PodTemplateSpec and a slice of PersistentVolumeClaims.
+The StatefulSet controller could use this object to store point in time
+snapshots of versions of StatefulSetSpecs. However, this requires that we
+introduce a new History Kind for all current and future controllers. It has the
+benefit of type safety, but, for this benefit, we trade generality.
+
+Another approach would be to use PodTemplate objects. This mechanism provides
+the desired generality, but it only provides for the recording of versions of
+PodTemplateSpecs (e.g., for StatefulSet, we cannot use PodTemplates to
+record revisions to PersistentVolumeClaims). Also, it introduces the potential
+for overlapping histories for two Objects of different Kinds, with the same
+`.Name` in the same Namespace. Lastly, it constrains the PodTemplate Kind from
+evolving to fulfill its original intention.
+
+We propose an approach that has analogs with the approach taken by the
+[Mesos](http://mesos.apache.org/) community. Mesos frameworks, which are in some
+ways like Kubernetes controllers, are responsible for checkpointing,
+persisting, and recovering their own state. This problem is so common that
+Mesos provides a ["State Abstraction"](https://github.com/apache/mesos/blob/master/include/mesos/state/state.hpp)
+that allows frameworks to persist their state in either ZooKeeper or the
+Mesos Replicated Log (a Multi-Paxos based state machine used by the Mesos
+Masters). This State Abstraction is a mutable, durable dictionary where keys
+and values are opaque strings. As controllers only need the capability to
+persist an immutable point in time snapshot of target Object states to
+implement a revision history, we propose to use the ControllerRevision object
+for this purpose.
+
+``` golang
+// ControllerRevision implements an immutable snapshot of state data. Clients
+// are responsible for serializing and deserializing the objects that contain
+// their internal state.
+// Once a ControllerRevision has been successfully created, it can not be updated.
+// The API Server will fail validation of all requests that attempt to mutate
+// the Data field. ControllerRevisions may, however, be deleted.
+type ControllerRevision struct {
+ metav1.TypeMeta
+ // +optional
+ metav1.ObjectMeta
+ // Data contains the serialized state.
+ Data runtime.RawExtension
+ // Revision indicates the revision of the state represented by Data.
+ Revision int64
+}
+```
+
+## API Server
+The API Server must support the creation and deletion of ControllerRevision
+objects. As we have no mechanism for declarative immutability, the API server
+must fail any update request that updates the `.Data` field of a
+ControllerRevision Object.
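
The immutability check described above can be sketched as a validation function. This is a minimal illustration, not the API Server's actual validation code: the `ControllerRevision` struct here is a simplified stand-in that stores `.Data` as raw bytes, and `ValidateUpdate` and `ErrDataImmutable` are hypothetical names.

```go
package main

import (
	"bytes"
	"errors"
)

// ControllerRevision is a simplified, hypothetical stand-in for the proposed
// API type; Data holds the serialized state as raw bytes.
type ControllerRevision struct {
	Name     string
	Data     []byte
	Revision int64
}

// ErrDataImmutable is returned when an update attempts to change Data.
var ErrDataImmutable = errors.New("ControllerRevision.Data is immutable")

// ValidateUpdate rejects any update whose Data differs from the stored Data.
// Changes to other fields are not examined by this sketch.
func ValidateUpdate(old, updated ControllerRevision) error {
	if !bytes.Equal(old.Data, updated.Data) {
		return ErrDataImmutable
	}
	return nil
}
```

Comparing raw bytes is the strictest possible rule: any byte-level change to `.Data` is treated as a mutation.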
+
+## Controllers
+This section is presented as a generalization of how an arbitrary controller
+can use ControllerRevision to persist a history of revisions to its
+specification type Objects. The technique is applicable, without loss of
+generality, to the existing Kubernetes controllers that have Pod as a generated
+type.
+
+When a controller detects a revision to the target Object state of a
+specification type Object it will do the following.
+
+1. The controller will [create a snapshot](#version-snapshot-creation) of the
+current target Object state.
+1. The controller will [reconstruct the history](#history-reconstruction) of
+revisions to the Object's target Object state.
+1. The controller will test the current target Object state for
+[equivalence](#version-equivalence) with all other versions in the Object's
+revision history.
+ - If the current version is semantically equivalent to its immediate
+ predecessor, no update to the Object's target state has been performed.
+ - If the current version is equivalent to a version prior to its immediate
+ predecessor, this indicates a rollback.
+ - If the current version is not equivalent to any prior version, this
+ indicates an update or a roll forward.
+ - Controllers should use their status objects for bookkeeping with respect
+ to current and prior revisions.
+1. The controller will
+[reconcile its generated Objects](#target-object-state-reconciliation)
+with the new target Object state.
+1. The controller will [maintain the length of its history](#history-maintenance)
+so that it does not exceed the configured limit.
+
+### Version Snapshot Creation
+To take a snapshot of the target Object state contained in a specification type
+Object, a controller will do the following.
+
+1. The controller will serialize the Object's target Object state and store
+the serialized representation in the ControllerRevision's `.Data`.
+1. The controller will store a unique, monotonically increasing
+[revision number](#revision-number-selection) in the Revision field.
+1. The controller will compute the [hash](#hashing) of the
+ControllerRevision's `.Data`.
+1. The controller will attach a label to the ControllerRevision so that it is
+selectable with a low probability of overlap.
+ - ControllerRefs will be used as the authoritative test for ownership.
+ - The specification type Object's `.Selector` should be used where
+ applicable.
+ - Alternatively, a Kind unique label may be set to the `.Name` of the
+ specification type Object.
+1. The controller will add a ControllerRef indicating the specification type
+Object as the owner of the ControllerRevision in the ControllerRevision's
+`.OwnerReferences`.
+1. The controller will use the hash from above, along with a user identifiable
+prefix, to [generate a unique `.Name`](#unique-name-generation) for the
+ControllerRevision.
+ - The controller should, where possible, use the `.Name` of the
+ specification type Object.
+1. The controller will persist the ControllerRevision via the API Server.
+ - Note that, in practice, creation occurs concurrently with
+ [collision resolution](#collision-resolution).
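
The snapshot steps above can be sketched as follows. This is a minimal illustration under several assumptions: JSON is used for serialization, 32-bit FNV-1a for hashing, the types are simplified stand-ins for the real API types, and the label key (for example `statefulset-name`) and the `newSnapshot` helper are hypothetical. Collision resolution and persistence via the API Server are omitted.

```go
package main

import (
	"encoding/hex"
	"encoding/json"
	"hash/fnv"
	"strings"
)

// OwnerReference and ControllerRevision are simplified, hypothetical
// stand-ins for the real API types.
type OwnerReference struct {
	Kind       string
	Name       string
	Controller bool
}

type ControllerRevision struct {
	Name            string
	Labels          map[string]string
	OwnerReferences []OwnerReference
	Data            []byte
	Revision        int64
}

// newSnapshot serializes targetState and wraps it in a ControllerRevision
// owned by the specification type Object identified by ownerKind/ownerName.
func newSnapshot(ownerKind, ownerName string, targetState interface{}, revision int64) (ControllerRevision, error) {
	data, err := json.Marshal(targetState)
	if err != nil {
		return ControllerRevision{}, err
	}
	h := fnv.New32a()
	h.Write(data) // Write on a hash.Hash never returns an error.
	return ControllerRevision{
		// Prefix the hash with the owner's name so users can recognize it.
		Name: ownerName + "-" + hex.EncodeToString(h.Sum(nil)),
		// Hypothetical Kind-unique label key set to the owner's name.
		Labels:          map[string]string{strings.ToLower(ownerKind) + "-name": ownerName},
		OwnerReferences: []OwnerReference{{Kind: ownerKind, Name: ownerName, Controller: true}},
		Data:            data,
		Revision:        revision,
	}, nil
}
```

Calling `newSnapshot("StatefulSet", "web", spec, 3)` twice with the same `spec` yields the same `.Name`, which is what lets a controller recognize that a snapshot of this state may already exist.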
+
+### Revision Number Selection
+We propose two methods for selecting the `.Revision` used to order a
+specification type Object's revision history.
+
+1. Set the `.Revision` field to the `.Generation` field.
+ - This approach has the benefit of leveraging the existing monotonically
+ increasing sequence generated by `.Generation` field.
+ - The downside of this approach is that history will not survive the
+ destruction of an Object.
+1. Use an approach analogous to Deployment.
+ 1. Reconstruct the Object's revision history.
+ 1. If the history is empty, use a `.Revision` of `0`.
+ 1. If the history is not empty, set the `.Revision` to a value greater than
+ the maximum value of all previous `.Revisions`.
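
The second method can be sketched in a few lines. The helper name `nextRevision` is hypothetical, and choosing exactly one more than the maximum is an assumption; the proposal only requires a value greater than every previous `.Revision`.

```go
package main

// nextRevision returns the revision number for a new snapshot given the
// revision numbers already present in the history. An empty history yields 0;
// otherwise the result is one greater than the current maximum.
func nextRevision(existing []int64) int64 {
	if len(existing) == 0 {
		return 0
	}
	highest := existing[0]
	for _, r := range existing[1:] {
		if r > highest {
			highest = r
		}
	}
	return highest + 1
}
```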
+
+### History Reconstruction
+To reconstruct the history of a specification type Object, a controller will do
+the following.
+
+1. Select all ControllerRevision Objects labeled as described
+[above](#version-snapshot-creation).
+1. Filter out any ControllerRevisions that do not have a ControllerRef in their
+`.OwnerReferences` indicating ownership by the Object.
+1. Sort the ControllerRevisions by the `.Revision` field.
+1. This produces a strictly ordered set of ControllerRevisions that comprises
+the ordered revision history of the specification type Object.
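
A minimal sketch of the filter and sort steps, assuming label selection has already produced the candidate list. The struct is a simplified stand-in, and `OwnerUID` is a hypothetical field holding the UID from the ControllerRef.

```go
package main

import "sort"

// ControllerRevision is a simplified, hypothetical stand-in for the API type.
type ControllerRevision struct {
	Name     string
	OwnerUID string // UID of the controlling owner, taken from the ControllerRef
	Revision int64
}

// reconstructHistory keeps only the revisions owned by ownerUID and returns
// them sorted by ascending Revision. The input slice is not modified.
func reconstructHistory(selected []ControllerRevision, ownerUID string) []ControllerRevision {
	history := make([]ControllerRevision, 0, len(selected))
	for _, rev := range selected {
		if rev.OwnerUID == ownerUID {
			history = append(history, rev)
		}
	}
	sort.Slice(history, func(i, j int) bool {
		return history[i].Revision < history[j].Revision
	})
	return history
}
```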
+
+### History Maintenance
+Controllers should be configured, either globally or on a per specification type
+Object basis, to have a `RevisionHistoryLimit`. This field will indicate the
+number of non-live revisions the controller should maintain in its history
+for each specification type Object. Every time a controller observes a
+specification type Object it will do the following.
+
+1. The controller will
+[reconstruct the Object's revision history](#history-reconstruction).
+ - Note that the process of reconstructing the Object's history filters any
+ ControllerRevisions not owned by the Object.
+1. The controller will filter out any ControllerRevisions that represent a live
+version.
+1. If the number of remaining ControllerRevisions is greater than the configured
+`RevisionHistoryLimit`, the controller will delete the excess revisions, in
+order of their `.Revision` (oldest first), until the number
+of remaining ControllerRevisions is equal to the `RevisionHistoryLimit`.
+
+This ensures that the number of recorded, non-live revisions is less than or
+equal to the configured `RevisionHistoryLimit`.
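
The truncation step can be sketched as below. It assumes deletion proceeds from the lowest `.Revision` upward, and that liveness has already been determined by one of the version tracking methods; `revisionsToDelete` and the `live` set keyed by `.Name` are hypothetical.

```go
package main

import "sort"

// ControllerRevision is a simplified, hypothetical stand-in for the API type.
type ControllerRevision struct {
	Name     string
	Revision int64
}

// revisionsToDelete returns the names of the non-live revisions that must be
// deleted, oldest first, so that at most limit non-live revisions remain.
func revisionsToDelete(history []ControllerRevision, live map[string]bool, limit int) []string {
	if limit < 0 {
		limit = 0
	}
	var nonLive []ControllerRevision
	for _, rev := range history {
		if !live[rev.Name] {
			nonLive = append(nonLive, rev)
		}
	}
	sort.Slice(nonLive, func(i, j int) bool {
		return nonLive[i].Revision < nonLive[j].Revision
	})
	excess := len(nonLive) - limit
	if excess <= 0 {
		return nil
	}
	names := make([]string, 0, excess)
	for _, rev := range nonLive[:excess] {
		names = append(names, rev.Name)
	}
	return names
}
```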
+
+### Version Tracking
+Controllers must track the version of the target Object state that corresponds
+to their generated Objects. This information is necessary to determine which
+versions are live, and to track which Objects need to be updated during a
+target state update or rollback. We propose two methods that controllers may
+use to track live versions and their association with generated Objects.
+
+1. The most straightforward method is labeling. In this method the generated
+Objects are labeled with the `.Name` of the ControllerRevision object that
+corresponds to the version of the target Object state that was used to generate
+them. As we have taken care to ensure the uniqueness of the `.Names` of the
+ControllerRevisions, this approach is reasonable.
+ - A revision is considered to be live while any generated Object labeled
+ with its `.Name` is live.
+ - This method has the benefit of providing visibility, via the label, to
+ users with respect to the historical provenance of a generated Object.
+ - The primary drawback is the lack of support for using garbage collection
+ to ensure that only non-live version snapshots are collected.
+1. Controllers may also use the `OwnerReferences` field of the
+ControllerRevision to record all Objects that are generated from the target
+Object state version represented by the ControllerRevision as its owners.
+ - A revision is considered to be live while any generated Object that owns
+ it is live.
+ - This method allows for the implementation of generic garbage collection.
+ - The primary drawback with this method is that the bookkeeping is complex,
+ and deciding if a generated Object corresponds to a particular revision
+ will require testing each Object for membership in the `OwnerReferences`
+ of all ControllerRevisions.
+
+Note that, since we are labeling the generated Objects to indicate their
+provenance with respect to the version of the controller's target Object state,
+we are susceptible to downstream mutations by other controllers changing the
+controller's product. The best we can do is guarantee that our product meets
+the specification at the time of creation. If a third party mutates the product
+downstream (as long as it does so in a consistent and intentional way), we
+don't want to recall it and make it conform to the original specification. This
+would cause the controllers to "fight" indefinitely.
+
+At the cost of the complexity of implementing both labeling and ownership,
+controllers may use a combination of both approaches to mitigate the
+deficiencies of each.
+
+### Version Equivalence
+When the target Object state of a specification type Object is revised, we wish
+to minimize the number of mutations to generated Objects as the controller seeks
+to conform the system to its target state. That is, if a generated Object
+already conforms to the revised target Object state, it is imperative that we
+do not mutate it.
+
+Failure to implement this correctly could result in the simultaneous rolling
+restart of every Pod in every StatefulSet and DaemonSet in the system when
+additions are made to PodTemplateSpec during a master upgrade. It is therefore
+necessary to determine if the current target Object state is equivalent to a
+prior version.
+
+Since we [track the version](#version-tracking) of generated Objects, this
+reduces to deciding if the version of the target Object state associated with
+the generated Object is equivalent to the current target Object state.
+Even though [hashing](#hashing) is used to generate the `.Name` of the
+ControllerRevisions used to encapsulate versions of the target Object state, as
+we do not require cryptographically strong collision resistance, and given we
+use a [collision resolution](#collision-resolution) technique, we can't use the
+[generated names](#unique-name-generation) of ControllerRevisions to decide
+equality.
+
+We propose that two ControllerRevisions can be considered equal if their
+`.Data` is equivalent, but that it is not sufficient to compare the serialized
+representation of their `.Data`. Consider that the addition of new fields
+to the Objects that represent the target Object state may cause the serialized
+representation of those Objects to be unequal even when they are semantically
+equivalent.
+
+The controller should deserialize the values of the ControllerRevisions
+representing their target Object state and perform a deep, semantic equality
+test. Here all differences that do not constitute a mutation to the target
+Object state are disregarded during the equivalence test.
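
The difference between byte equality and semantic equality can be illustrated with a small sketch. It assumes JSON serialization and a hypothetical `targetState` type to which a `gracePeriod` field was added in a later release; a real controller would deserialize into its own types and might use a more permissive comparison than `reflect.DeepEqual`.

```go
package main

import (
	"encoding/json"
	"reflect"
)

// targetState is a hypothetical target Object state. Assume GracePeriod was
// added in a newer release, so older snapshots do not contain it.
type targetState struct {
	Image       string `json:"image"`
	GracePeriod int64  `json:"gracePeriod"`
}

// equivalent deserializes both snapshots into the current type and compares
// the resulting values rather than the serialized bytes.
func equivalent(a, b []byte) (bool, error) {
	var sa, sb targetState
	if err := json.Unmarshal(a, &sa); err != nil {
		return false, err
	}
	if err := json.Unmarshal(b, &sb); err != nil {
		return false, err
	}
	return reflect.DeepEqual(sa, sb), nil
}
```

Here `{"image":"nginx:1.13"}` and `{"image":"nginx:1.13","gracePeriod":0}` differ as byte strings but deserialize to the same value, so they are treated as the same version.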
+
+### Target Object State Reconciliation
+There are three ways for a controller to reconcile a generated Object with the
+declared target Object state.
+
+1. If the target Object state is [equivalent](#version-equivalence) to the
+target Object state associated with the generated Object, the controller will
+update the associated [version tracking information](#version-tracking).
+1. If the Object can be updated in place to reconcile its state with the
+current target Object state, a controller may update the Object in place
+provided that the associated version tracking information is updated as well.
+1. Otherwise, the controller must destroy the Object and recreate it from the
+current target Object state.
+
+### Kubernetes Upgrades
+During the upgrade process from a version of Kubernetes that does not support
+controller history to a version that does, controllers that implement history
+based update mechanisms may find that they have specification type Objects with
+no history and with generated Objects. For instance, a StatefulSet may exist
+with several Pods and no history. We defer requirements for handling history
+initialization to the individual proposals pertaining to those controllers'
+update mechanisms. However, implementors should take note of the following.
+
+1. If the history of an Object is not initialized, controllers should
+continue to (re)create generated Objects based on the current target Object
+state.
+1. The history should be initialized on the first mutation to the specification
+type Object for which the history will be generated.
+1. After the history has been initialized, any generated Objects that have no
+indication of the revision from which they were generated may be treated as if
+they have a nil revision. That is, without respect to the method of
+[version tracking](#version-tracking) used, the generated Objects may be
+treated as if they have a version that corresponds to no revision, and the
+controller may proceed to
+[reconcile their state](#target-object-state-reconciliation) as appropriate to
+the internal implementation.
+
+## Kubectl
+
+Modifications to kubectl to leverage controller history are an optional
+extension. Users can trigger rolling updates and rollbacks by modifying their
+manifests and using `kubectl apply`. Controllers will be able to detect
+revisions to their target Object state and perform
+[reconciliation](#target-object-state-reconciliation) as necessary.
+
+### Viewing History
+
+Users can view a controller's revision history with the following command.
+
+```bash
+> kubectl rollout history
+```
+
+To view the details of the revision indicated by `<revision>`, users can use
+the following command.
+
+```bash
+> kubectl rollout history --revision <revision>
+```
+
+### Rollback
+
+For future work, `kubectl rollout undo` can be implemented in the general case
+as an extension of the [above](#viewing-history).
+
+```bash
+> kubectl rollout undo
+```
+
+Here `kubectl rollout undo` simply uses strategic merge patch to apply the state
+contained at a particular revision.
+
+## Tests
+
+1. Controllers can create a ControllerRevision containing a revision of their
+target Object state.
+1. Controllers can reconstruct their revision history.
+1. Controllers can't update a ControllerRevision's `.Data`.
+1. Controllers can delete a ControllerRevision to maintain their history with
+respect to the configured `RevisionHistoryLimit`.
+
+## Appendix
+
+### Hashing
+We will require a CRHF (collision resistant hash function), but, as we expect
+no adversaries, such a function need not be resistant to pre-image and
+second pre-image attacks.
+As the property of interest is primarily collision resistance, and as we
+provide a method of [collision resolution](#collision-resolution), both
+cryptographically strong functions, such as Secure Hash Algorithm 2 (SHA-2),
+and non-cryptographic functions, such as Fowler-Noll-Vo (FNV), are applicable.
+
+### Collision Resolution
+As the function selected for hashing may not be cryptographically strong and may
+produce collisions, we need a method for collision resolution. To demonstrate
+its feasibility, we construct such a scheme here. However, this proposal does
+not mandate its use.
+
+Given a hash function with output size `HashSize` defined
+as `func H(s string) [HashSize]byte`, in order to resolve collisions we
+define a new function `func H'(s string, n int) [HashSize]byte` where `H'`
+returns the result of invoking `H` on the concatenation of `s` with the string
+value of `n`. We define a third function
+`func H''(s string, exists func (string) bool)(int,[HashSize]byte)`. `H''`
+will start with `n := 0` and compute `s' := H'(s,n)`, incrementing `n` when
+`exists(s')` returns true, until `exists(s')` returns false. After this it will
+return `n,s'`.
+
+For our purposes, the implementation of the `exists` function will attempt to
+create a ControllerRevision via the API Server, using a `.Name` produced by
+[unique name generation](#unique-name-generation). If creation fails due to a
+conflict, the function returns true; if creation succeeds, it returns false.
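
The scheme can be sketched in Go as follows. This is an illustration under stated assumptions, not a mandated implementation: `H` is instantiated with 32-bit FNV-1a (so `HashSize` is 4), `H'` and `H''` are spelled `HPrime` and `HDoublePrime`, and the hash is passed to `exists` as a hex string so that the callback matches the `func(string) bool` signature given above.

```go
package main

import (
	"encoding/hex"
	"hash/fnv"
	"strconv"
)

// HashSize is the output size of H in bytes (4 for 32-bit FNV-1a).
const HashSize = 4

// H hashes s with 32-bit FNV-1a.
func H(s string) [HashSize]byte {
	h := fnv.New32a()
	h.Write([]byte(s)) // Write on a hash.Hash never returns an error.
	var out [HashSize]byte
	copy(out[:], h.Sum(nil))
	return out
}

// HPrime corresponds to H': it hashes s concatenated with the decimal
// string value of n.
func HPrime(s string, n int) [HashSize]byte {
	return H(s + strconv.Itoa(n))
}

// HDoublePrime corresponds to H'': starting at n = 0, it increments n until
// exists reports that the hex-encoded hash is not already taken, and then
// returns n together with the hash.
func HDoublePrime(s string, exists func(string) bool) (int, [HashSize]byte) {
	for n := 0; ; n++ {
		sum := HPrime(s, n)
		if !exists(hex.EncodeToString(sum[:])) {
			return n, sum
		}
	}
}
```

In a controller, `exists` would wrap the create call against the API Server; in a unit test it can simply consult an in-memory set of names.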
+
+### Unique Name Generation
+We can use our [hash function](#hashing) and
+[collision resolution](#collision-resolution) scheme to generate a system
+wide unique identifier for an Object based on a deterministic non-unique prefix
+and a serialized representation of the Object. Kubernetes Objects' `.Name`
+fields must conform to a DNS subdomain. Therefore, the total length of the
+unique identifier must not exceed 255, and in practice 253, characters. We can
+generate a unique identifier that meets this constraint by selecting a hash
+function such that the output length is equal to `253-len(prefix)` and applying
+our [hash](#hashing) function and [collision-resolution](#collision-resolution)
+scheme to the serialized representation of the Object's data. The unique hash
+and integer can be combined to produce a unique suffix for the Object's `.Name`.
+
+1. We must also ensure that the unique name does not contain any bad words.
+1. We may also wish to spend additional characters to prettify the generated
+name for readability.
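
A minimal sketch of name generation under the 253 character limit. It departs from the text in one respect, which is an assumption of this example: rather than choosing a hash whose output length is `253-len(prefix)`, it uses a fixed-size hash (32-bit FNV-1a) and truncates the prefix when necessary. The `generateName` helper and the suffix layout are hypothetical, and the sketch does not check for bad words or validate DNS subdomain syntax.

```go
package main

import (
	"fmt"
	"hash/fnv"
)

// maxNameLength is the practical limit on the length of a `.Name`.
const maxNameLength = 253

// generateName combines prefix, the hex-encoded FNV-1a hash of data, and the
// collision counter n into a name of at most maxNameLength characters.
func generateName(prefix string, data []byte, n int) string {
	h := fnv.New32a()
	h.Write(data) // Write on a hash.Hash never returns an error.
	suffix := fmt.Sprintf("-%08x-%d", h.Sum32(), n)
	// Truncate the prefix, never the suffix, so uniqueness is preserved.
	if room := maxNameLength - len(suffix); len(prefix) > room {
		prefix = prefix[:room]
	}
	return prefix + suffix
}
```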
+
+
+