| field | value | date |
|---|---|---|
| author | Josh Berkus <jberkus@redhat.com> | 2019-11-18 16:43:04 -0800 |
| committer | Tim Pepper <tpepper@vmware.com> | 2019-11-25 11:51:53 -0800 |
| commit | d674ef4a7ad2c538b1030b099fab280369cdae4c (patch) | |
| tree | 7ad4ff7b181fa103621ac99044ba883beceeda2d /events | |
| parent | 1ad8a3776d60967eaea096e127423619cbd61715 (diff) | |
Add notes for LTS and End Users sessions at the Contributor Summit.
Diffstat (limited to 'events')
| mode | file | lines |
|---|---|---|
| -rw-r--r-- | events/2019/11-contributor-summit/session-notes/LTS-notes.md | 59 |
| -rw-r--r-- | events/2019/11-contributor-summit/session-notes/end-user-panel.md | 106 |
2 files changed, 165 insertions, 0 deletions
diff --git a/events/2019/11-contributor-summit/session-notes/LTS-notes.md b/events/2019/11-contributor-summit/session-notes/LTS-notes.md
new file mode 100644
index 00000000..8270b5f4
--- /dev/null
+++ b/events/2019/11-contributor-summit/session-notes/LTS-notes.md
@@ -0,0 +1,59 @@

LTS Session

13 attendees

We've shied away from talking about long term support because we don't want to predefine the mission, but we had to call the WG something.

This is a summary of what happened in 2019.

We took a survey in Q1 of 2019. 600 people started filling it out, 324 completed it. The survey was very long, with a lot of detail. People in the cloud are upgrading continually, but things are moving more slowly in infrastructure. A lot of people are falling behind. 45% of respondents were users, so we're not just talking to each other.

Put your Q1 hat on. At that time, 1.13, 1.12, and 1.11 were under support, and even 1.11 would be out of support in 2 months. 1.9 and 1.10 specifically had a big chunk of people who were just out of support.

Why are they falling behind? Well, some don't care. Many want to be up to date, but there are lots of business reasons not to upgrade.

The other thing we discussed is what "support" really means: critical bug fixes, an upgrade path so that users can get to newer versions, plus user stability and API compatibility. We're relatively "normal" as a project compared to the general industry.

Patch releases: we maintain 3 branches, and each branch gets 9+ months of support; around the EOL edge there's a fuzzy period where we don't necessarily stop patching, depending. Lots of people asked "why 9 months?", which is a weird timespan. Also we only support upgrading to the next version, but that's standard for the industry.

API discussion: REST APIs, config data, provider abstractions. We have a robust promotion process, better than the industry norm.

Proposals: some suggested faster releases, like monthly. Or maybe slower releases (6 months?). Or do a kernel.org and pick a single release per year and have it be an "LTS".

We need to separate out patch releases, upgrades, and stability. Distinct, although related.

API stability options: this is all the SIG Architecture stuff. KEPs, conformance, pushing for more key APIs to hit stable. Only 1 or 2 APIs out of ~60 are still not v1. What about stale APIs? Should we do a cleanup?

Upgrades: this is hard. Maybe preflight checks?
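For reference (not discussed in the session), one existing preflight-style check of the kind mentioned above is kubeadm's upgrade planner, which inspects the current cluster and lists the versions it could be upgraded to without changing anything. A minimal sketch, assuming a kubeadm-managed control plane:

```sh
# Dry-run check: validates the current cluster state and prints the
# available target versions before any upgrade is applied.
kubeadm upgrade plan
```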
Patch release: we have a draft KEP for 4 branches of patch release support, which is 1 year of support. We can do something impactful -- 30% of the user base is in that window of 3 months out of support. The cost is known. Google runs most things, but k8s-infra can quantify it. Because of the reorganization of the patch release team it's not as much effort. We could stand to streamline how we generate patches, though.

The WG should continue for one more year. Maybe another survey, more concrete action items, and getting contributors around those.

Brian Grant: every version of Kubernetes we have has schema changes. We don't have a way to unbundle those from API changes, which would be required to skip releases. Releases a year old are just stabilizing now because they've been used. We don't want to support 1.13 for 2 years, so we need to make releases more stable faster. So more test coverage. The reason we're patching the same thing into 4 different branches is that we find problems very late. If we can get people using newer releases sooner, we'll find problems sooner.
How do we fix this? Better test coverage. Not letting features go in until they're more mature, but that could mean finding issues more slowly for those. Maybe we could not merge things without good test coverage. We experimented with feature branches. And with multiple repos, we should decide maybe we shouldn't integrate a repo. We have better test coverage for compatibility, but it still happens with one thing every release.
Anyway, that's the whole philosophy of faster releases.

Jago: we need to look at support programs for external repos. This becomes a combinatorial explosion. Lots of good work over the last year. Tim: we can also make ourselves more consumable. We need more distro experience. Where's the Debian of Kubernetes? Jago: I support the 1-year extension. Nick: the 9-month window means you'll have to upgrade at a bad time. Jago: make sure we know it is 4 branches this year and not 5 next year; 12 months is the important part.

Tim: no one in the conversation has wanted to go to extremely long support. Just asking for a little bit more.

... missed some discussion here ...

Josh: even if stability is perfect, people have other business reasons not to upgrade. Regulation, certification, management approval, time required to do the upgrade. Nick: the regulatory environment works with yearly upgrades.

Everyone good with 12 months of patching? Everyone was good.

... more discussion about server upgrades and stability, not captured ...

Quinton: even if people only upgrade once a year, people will upgrade at different times, so we'll get feedback around the year.

???: we're getting more beta feedback now, since we're making beta releases again.

Tim: one user said that they actually do skip-version updates; not sure how they do it. For nodes you can do it, but for control planes it's known to be unsafe. Creating new clusters and migrating is one of the things that people do. SIG Multicluster says that the idea of multi-cluster for migration is popular.

Jago: upgrade tests and tooling need work. They're not covering everything and they break all the time. Josh: our upgrade tests don't test an upgrade path that anyone actually uses.

Tim: what about more common tools instead of kube-up and 12 different installers?

Noah: time for another survey? Tim has started designing a new survey. Josh says we need to digest the existing data first and figure out what we want to ask. Quinn: can we consolidate surveys across several groups, so have a multi-SIG survey? Maybe do something like Linux Registry?

diff --git a/events/2019/11-contributor-summit/session-notes/end-user-panel.md b/events/2019/11-contributor-summit/session-notes/end-user-panel.md
new file mode 100644
index 00000000..714e3140
--- /dev/null
+++ b/events/2019/11-contributor-summit/session-notes/end-user-panel.md
@@ -0,0 +1,106 @@

End User Panel

Josh Mickelson, Conde Nast
Federico Hernandez, Meltwater, provides Kubernetes-as-a-service to their developers.
Andy Snowden, EquityZen

Brian Grant
Peter Norquist & Kevin Fox, PNW National Lab
Josh Berkus
Ryan Bohnham, Granular

How many developers & clusters?

JoshM: 380 developers across all clusters now, 18 clusters.
 8 clusters international, 10 clusters in the US.
 Separate production and dev clusters.

Federico: dev, test, and production clusters. We might have special-purpose clusters in the future, like ML, but we don't need them yet.

Andy: just getting started, have 8 devs, 1 cluster, very small.

Peter: none of our clusters has launched yet. Going to have 2 clusters: one for dev that is purely internal; the prod cluster will be managed by the security group because it's exposed.

Brian: about 200 engineers, have 15 clusters today, spread across 4 regions. Run dev/test/prod in each region.

Paris talked about navigating the project when you don't have contributors on staff.

How are people deploying and upgrading clusters?

JoshM: CoreOS Tectonic; we are still maintaining a fork of that, and investigating EKS and Cluster API. On the US side, rolled everything manually, mainly CloudFormation and scripts. They don't upgrade, mostly.

Federico: Run on AWS, use Kops with the Amazon CNI plugin; love/hate relationship with Kops. Discovered a lot of bugs in the CNI plugin and in Kops, and submitted PRs for Kops. Wrote our own custom upgrade script because Kops was not zero-downtime -- specifically, zero downtime for applications.

Andy: we use Kops too, and are just encountering the same problems. Right now we are allowed downtime because we run 9 to 5; that will change, and Kops doesn't support that. The lack of rerouting service requests while upgrading is an issue. We stay around 2 releases behind.
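For context (not part of the panel discussion), the stock Kops flow that these custom scripts are being compared against looks roughly like the sketch below; the cluster name is illustrative, not from the session:

```sh
# Standard Kops upgrade flow (cluster name is made up).
kops upgrade cluster --name example.k8s.local --yes         # bump the Kubernetes version in the cluster spec
kops update cluster --name example.k8s.local --yes          # push the new config to the cloud provider
kops rolling-update cluster --name example.k8s.local --yes  # replace instances, one node at a time
```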
Kevin: using kubeadm; have a container-based repo tied to CentOS kickstarts with custom code. Custom stuff for upgrading the nodes using labels. MetalKube looks interesting once that's mature.

Ryan: Kops again.

JoshM: the Tectonic parts don't do the full upgrade, so we have custom scripts. We run CoreOS and Docker.

Kevin: some CRI-O versions change the image storage, so you have to drain a node. Nothing in the Kube API lets you completely drain a node; daemonsets are a problem.

Federico: we also double all the nodes in the cluster, cordon the old ones, and migrate over. That's worked well so far, better than node-at-a-time via Kops.
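A rough sketch of the cordon-and-migrate step described above (the node name is illustrative, not from the session). Note that DaemonSet pods are not evicted by drain, which is the gap Kevin mentions:

```sh
# Illustrative only: stop scheduling onto an old node, evict its workloads,
# then remove it once pods have landed on the replacement nodes.
kubectl cordon ip-10-0-1-23.ec2.internal
kubectl drain ip-10-0-1-23.ec2.internal --ignore-daemonsets   # DaemonSet pods stay behind
kubectl delete node ip-10-0-1-23.ec2.internal
```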
Andy: we've had a few times when Istio gets restarted where we lose a request.

JoshM: our cluster serves stuff that can be cached, so having nodes shut down is not as much of a problem.

Andy: Any manual control of the load balancers? JoshM: No. We have an ELB in front of everything.

JoshB: is upgrading everyone's biggest problem with Kubernetes?

Federico: yes. Especially when you run into dependencies and you have to upgrade all of them; it's a puzzle to solve. We have add-ons to give users a better experience, like DNS, and we have the cluster autoscaler for cost management. Those need to be maintained when you upgrade the cluster. We install them ourselves, not using Kops add-ons; we wrote our own installer.

Andy: we had the same experience with Helm. For the cluster to be useful you have to install a lot of add-ons. And for resourcing, you need to account for all your add-ons; we only have about 70% of our cluster available because of the add-ons. Ryan: are you tracking the cost? Andy: manually, we don't have tools.

... missed some discussion on cost accounting ...

Federico: Zalando has something for cloud cost.

JoshM: we don't have to track in detail. I'm lucky.

Federico: we need to forecast cost; it's not about chargebacks. We need to know what it'll cost next year. We have a couple of teams who are really cost-aware.

Andy: one of our reasons to bring Kubernetes in was cost efficiency, so we want to know how much we saved. We were able to downsize using Kubernetes compared with our previous AWS solution. We compute cost-per-click.

JoshM: cluster upgrades are not the most difficult thing for us. They're still difficult, but we've worked around a lot of the problems. Right now our most difficult thing is getting our US infra to parity, so I guess that cluster upgrades are still a problem.

Federico: the other most difficult thing for us is finding out which changes affect our cluster users -- which things in the release log will affect my users. Finding that in the release notes is a challenge. It has become a lot better, but it's still a big effort. I'd like to have "these are neutral changes, these are things you need to know about, these are things users need to do something about". Like when the kubectl API removed duplicates.

Andy: yeah, reading through that is also a big effort for me.

Kevin: two things: multitenancy, and after that security things, like certificates etc. We end up deploying ingresses for users in namespaces they can't manage, and we need workarounds for that.

Brian: do you expose Kubernetes directly to your app devs, or do you hide it behind a UI?

Federico: they have direct access to the command line. Most teams have some kind of CI/CD, but we don't hide anything. They're still responsible for writing their own YAML. A few teams use Terraform, a few use Helm with no Tiller.

JoshM: we optimized for app delivery. We do expose it, but we put a lot of effort into base Helm charts etc. so that users use those templates (all apps are deployed through the same chart). They use parameters to change things. They can't deploy whatever they want; they have to go through CI/CD. There are several options, but they have to go through that.

Kevin: we try to do a "batteries included" configuration, so that our devs have a common template where they can just deploy their applications.

Paris: do you feel supported by the Kubernetes community/ecosystem?

Andy: I haven't actually had to try yet.

JoshM: from the support side we haven't had any problems. More recently I've started wanting to contribute back to the project, so most gripes have to do with the initial contributor experience: PRs not getting reviewed, there are so many PRs in the queue. My biggest question is how to jump in and get started.

Paris talked about doing a contributor orientation at companies.

Federico: we have some devs who are interested in contributing, but they're nervous about doing the contributor summit stuff; something at our office would be really nice.

Brian: is the structure of the project clear? Like routing the PRs to the right place?

Paris went over the contributor structure.

Kevin: I contributed to OpenStack; you got influence by showing up at meetings, but there was no way to get visibility across the whole project.

Kevin: the problems I run into are typically WG problems; those really help me.

JoshM: one of the things I've pushed for is hackdays; that's easier than getting my company to pay for full-time contributors. Are there features we can just knock out? Folks suggested Prow or Docs.

Kevin: a lot of the docs assume that the developer and the cluster admin are the same person, so we need separate personas.

Federico: I copied code comments into documentation for Kops, and stuff got noticed much faster.
