author    Kubernetes Prow Robot <k8s-ci-robot@users.noreply.github.com>  2019-12-02 14:02:58 -0800
committer GitHub <noreply@github.com>  2019-12-02 14:02:58 -0800
commit    339b134cd2cd8768535b561b0b9a298a813513ba (patch)
tree      9caf9dba79741c64b9516d965aee68a450058f8b
parent    49749f5136c2f437d24a68bd5c20d9a9f7e0c158 (diff)
parent    6b1f88ad4784915bc208c8601fe8148e86500f44 (diff)
Merge pull request #4261 from tpepper/kcsna19-notes
Contrib Summit Nov. 2019 notes
-rw-r--r--  events/2019/11-contributor-summit/session-notes/LTS-notes.md       77
-rw-r--r--  events/2019/11-contributor-summit/session-notes/end-user-panel.md  106
2 files changed, 183 insertions, 0 deletions
diff --git a/events/2019/11-contributor-summit/session-notes/LTS-notes.md b/events/2019/11-contributor-summit/session-notes/LTS-notes.md
new file mode 100644
index 00000000..8e6d5cdc
--- /dev/null
+++ b/events/2019/11-contributor-summit/session-notes/LTS-notes.md
@@ -0,0 +1,77 @@
+LTS Session
+
+13 attendees
+
+Opening slides shared:
+
+ * https://docs.google.com/presentation/d/1Q0ZkKP_6jAZezWRF3aiDESflVm-1oGXz8lX9LD2FbFQ/edit?usp=sharing
+ * https://static.sched.com/hosted_files/kcsna2019/d6/2019%20Contrib%20Summit%20-%20WG%20LTS.pdf
+
+We've shied away from talking about long-term support because we don't want to predefine the mission, but we had to call the WG something.
+
+This is a summary of what's happened in 2019.
+
+We took a survey in Q1 of 2019. 600 people started filling it out; 324 completed it. The survey was very long, with a lot of detail. People on cloud are upgrading continually, but things are moving more slowly on self-managed infrastructure, and a lot of people are falling behind. 45% of respondents were users, so we're not just talking to each other.
+
+Put your Q1 hat on. At that time, 1.13, 1.12, and 1.11 were under support, and even 1.11 will be out of support in 2 months. Specifically, 1.9 and 1.10 held a big chunk of people who are just out of support.
+
+Why are they falling behind? Well, some don't care. Many want to be up to date, but there are lots of business reasons not to upgrade.
+
+The other thing we discussed is what "support" really means: critical bug fixes, and an upgrade path so that users can get to newer versions. Also user stability & API compatibility. We're relatively "normal" as a project compared to the general industry.
+
+Patch releases: we maintain 3 branches, and each branch gets 9+ months of support; around the EOL edge there's a fuzzy period where we don't necessarily stop patching, depending. Lots of people asked "why 9 months?", which is a weird timespan. Also, we only support upgrading to the next minor version, but that's standard for the industry.
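+
+As a minimal sketch of what the one-minor-at-a-time policy implies for a cluster that has fallen behind (the version numbers below are illustrative; this is not an official tool):
+
+```python
+# Sketch only: compute the upgrade path implied by the one-minor-at-a-time
+# support policy. Version numbers are illustrative.
+def upgrade_path(current, target):
+    major, cur_minor = (int(x) for x in current.split("."))
+    tgt_major, tgt_minor = (int(x) for x in target.split("."))
+    assert major == tgt_major, "major-version jumps are out of scope here"
+    # Each hop may only advance one minor version.
+    return ["%d.%d" % (major, m) for m in range(cur_minor + 1, tgt_minor + 1)]
+
+# A cluster stuck on 1.10 needs four sequential upgrades to reach 1.14:
+print(upgrade_path("1.10", "1.14"))  # ['1.11', '1.12', '1.13', '1.14']
+```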
+
+API discussion: REST APIs, config data, provider abstractions. We have a robust promotion process, better than industry norms.
+
+Proposals: some suggested faster releases, like monthly. Others suggested slower releases (6 months?). Or do what kernel.org does: pick a single release per year and have it be an "LTS".
+
+We need to separate out patch releases, upgrades, and stability. Distinct, although related.
+
+API stability options: this is all the SIG-Arch stuff. KEPs, conformance, pushing for more key APIs to hit stable. Only 1 or 2 APIs out of 60 are still not v1. What about stale APIs? Should we do a cleanup?
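+
+As a hedged sketch, one could survey how many served API groups have reached v1 with the official Python client (the "GA" heuristic below is an assumption for illustration, not the SIG-Arch promotion process):
+
+```python
+from kubernetes import client, config
+
+config.load_kube_config()  # uses the current kubeconfig context
+
+# List every named API group the server advertises and flag groups whose
+# preferred version is still alpha or beta.
+for group in client.ApisApi().get_api_versions().groups:
+    preferred = group.preferred_version.version
+    if "alpha" in preferred or "beta" in preferred:
+        print("not yet GA: %s (%s)" % (group.name, preferred))
+```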
+
+Upgrades: this is hard. Maybe preflight checks?
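+
+For example, a minimal preflight check could verify that no node's kubelet is more than one minor version behind the intended target before the control plane moves (a sketch with the official Python client; the skew rule encoded here is deliberately simplified, and the target version is an example):
+
+```python
+from kubernetes import client, config
+
+config.load_kube_config()
+v1 = client.CoreV1Api()
+
+TARGET_MINOR = 17  # example: we intend to upgrade to 1.17
+
+def minor(version):
+    # "v1.16.3" -> 16; ignores pre-release suffixes for simplicity
+    return int(version.lstrip("v").split(".")[1])
+
+for node in v1.list_node().items:
+    kubelet = node.status.node_info.kubelet_version
+    if TARGET_MINOR - minor(kubelet) > 1:
+        print("preflight FAIL: %s runs %s, too far behind 1.%d"
+              % (node.metadata.name, kubelet, TARGET_MINOR))
+```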
+
+Patch Release: we have a draft KEP for 4 branches of patch release support, which is 1 year of support. We can do something impactful -- 30% of the userbase is in that window of being 3 months out of support. The cost is known: Google runs most things, but k8s-infra can quantify it. Because of the reorganization of the patch release team it's not as much effort. We could stand to streamline how we generate patches, though.
+
+The WG should continue for one more year. Maybe another survey, more concrete action items, and getting contributors organized around those.
+
+Brian Grant: every version of Kubernetes we have has schema changes. We don't have a way to unbundle those from API changes, which would be required to skip releases. Releases a year old are just stabilizing now because they've been used. We don't want to support 1.13 for 2 years, so we need to make releases more stable faster. So more test coverage. The reason we're patching the same thing into 4 different branches is that we find problems very late. If we can get people using newer releases sooner we'll find problems sooner.
+How do we fix this? Better test coverage. Not letting features go in until they're more mature, though that could mean finding issues more slowly for those features. Maybe we could refuse to merge things without good test coverage. We experimented with feature branches, and with multiple repos; we should sometimes decide not to integrate a repo. We have better test coverage for compatibility, but a break still happens with one thing every release.
+Anyway, that's the whole philosophy of faster releases.
+People are on the versions they're on because that's where they've found stability. Just doing things the same way today won't lead to things being more stable.
+
+Jago: The three letters "LTS" are easy to say and made people nervous last year, but the WG has been very admirable in keeping discussion open-minded toward any possibility. We need to also look at support programs for external repos; this becomes a combinatorial explosion. Lots of good work over the last year.
+
+Tim: we can also make ourselves more consumable in the aggregate. We need more distro experience. Where's the Debian of Kubernetes?
+
+Jago: I support the 1-year extension. Nick: the 9-month window means you'll have to upgrade at a bad time. Jago: make sure we know it is 4 branches this year and not 5 next year; 12 months is the important part. Don't enable a slippery slope.
+
+Tim: no one in this conversation has wanted to go to extremely long support. We're just asking for a little bit more.
+
+Josh: vendors don't have all the answers either.
+
+Josh: even if stability is perfect, people have other business reasons not to upgrade. Regulation, certification, management approval, time required to do the upgrade. Nick: regulatory environment works with yearly upgrades.
+
+Quinton: companies batch upgrades. Tim: What's an example of software with great compatibility? Even if there is one, businesses build risk-management processes which create upgrade friction.
+
+Josh: Kops stalled during the survey and their users weren't upgrading. Could happen on any component.
+
+Everyone good with 12 months of patching? Everyone was good.
+
+Noah: Likes the 12-month idea, but also sees no technically viable path toward a "traditional LTS" of picking 1 release and supporting it for years.
+
+BrianG: can technical changes make it cheaper/faster to move to new releases? Improve tools and auditability. Nick: even machine-parseable changelogs would be a notable improvement.
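+
+To illustrate the machine-parseable idea, a hypothetical JSON changelog (the schema below is invented for this sketch, not an existing Kubernetes artifact) would let operators filter mechanically for changes that demand action:
+
+```python
+import json
+
+# Invented schema for illustration only.
+changelog = json.loads("""
+[
+  {"kind": "deprecation", "action_required": true,
+   "text": "extensions/v1beta1 Deployment is removed; use apps/v1"},
+  {"kind": "bugfix", "action_required": false,
+   "text": "fix kubelet memory leak"}
+]
+""")
+
+for entry in changelog:
+    if entry["action_required"]:
+        print("ACTION REQUIRED:", entry["text"])
+```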
+
+Mark Johnson: Q4 can actually be a period where ops has free cycles to do beta experimentation. We could be about to get more beta feedback now, since we're making beta releases again.
+
+Quinton: even if people only upgrade once a year, people will upgrade at different times, so we'll get feedback around the year.
+
+Tim: one user said that they actually do skip-version updates; not sure how they do it. For nodes you can do it, but for control planes it's known to be unsafe. Creating new clusters and migrating is one of the things that people do. SIG-Multicluster says that the idea of multi-cluster for migration is popular. These users may be missing subtle compat issues in their clusters today. Things are likely better the more stateless you are.
+
+Jago: upgrade tests and tooling need work; too few people are contributing. They don't cover everything and they break all the time. Josh: our upgrade tests don't test an upgrade path that anyone actually uses (GCP & kube-up.sh versus other providers and upgrade tools).
+
+Tim: what about more common tools, instead of kube-up and 12 different installers?
+
+Noah: time for another survey? Tim has started discussing combining forces on a new survey with the "Production Readiness" folks, who are considering a survey soon. Josh says we need to digest the existing data first and figure out what we want to ask. Quinn: can we consolidate surveys across several groups, i.e. have a multi-SIG survey? Maybe do something like the Linux Registry? Give more thought to role focus; maybe do multiple small surveys, still coordinated across SIGs/WGs, but try to avoid one long survey.
+
+BrianG: overarching theme here is shortening the release feedback loop.
diff --git a/events/2019/11-contributor-summit/session-notes/end-user-panel.md b/events/2019/11-contributor-summit/session-notes/end-user-panel.md
new file mode 100644
index 00000000..cd507a19
--- /dev/null
+++ b/events/2019/11-contributor-summit/session-notes/end-user-panel.md
@@ -0,0 +1,106 @@
+End User Panel
+
+Josh Mickelson, Conde Nast
+Federico Hernandez, Meltwater; provides Kubernetes-as-a-service to their developers.
+Andy Snowden, EquityZen
+
+Brian Grant
+Peter Norquist & Kevin Fox, PNW National Lab
+Josh Berkus
+Ryan Bohnham, Granular.
+
+How many developers & clusters?
+
+JoshM: 380 developers across all clusters now, 18 clusters.
+ 8 clusters international, 10 clusters in the US.
+ Separate production and dev clusters.
+
+Federico: dev, test, and production clusters. We might have special-purpose clusters in the future, like ML, but we don't need them yet.
+
+Andy: just getting started, have 8 devs, 1 cluster, very small.
+
+Peter: none of our clusters have launched yet. Going to have 2 clusters: one for dev, purely internal; the prod cluster will be managed by the security group because it's exposed.
+
+Brian: about 200 engineers, have 15 clusters today. Spread across 4 regions. Run dev/test/prod in each region.
+
+Paris talked about navigating the project, when you don't have contributors on staff.
+
+How are people deploying and upgrading clusters?
+
+JoshM: CoreOS Tectonic; we're still maintaining a fork of that, and investigating EKS and Cluster API. On the US side, we rolled everything manually, mainly CloudFormation and scripts. They don't upgrade, mostly.
+
+Federico: Run on AWS, use Kops with the Amazon CNI plugin; love/hate relationship with Kops. Discovered a lot of bugs with the CNI plugin, and discovered a lot of bugs in Kops and submitted PRs for them. Wrote our own custom upgrade script because Kops was not zero-downtime -- specifically, zero downtime for applications.
+
+Andy: we use Kops too, and are just now encountering the same problems. Right now we are allowed downtime because we're a 9-to-5 business; that will change, and Kops doesn't support that. The lack of rerouting service requests while upgrading is an issue. We stay around 2 releases behind.
+
+Kevin: using kubeadm; we have a container-based repo tied to CentOS kickstarts with custom code. Custom stuff for upgrading the nodes using labels. Metal³ ("metal kubed") looks interesting once it's mature.
+
+Ryan: Kops again.
+
+JoshM: the Tectonic parts don't do the full upgrade, so we have custom scripts. We run CoreOS and Docker.
+
+Kevin: some CRI-O versions change the image storage, so you have to drain the node. Nothing in the Kube API lets you completely drain a node; DaemonSets are a problem.
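+
+A hedged sketch of the usual workaround with the official Python client: cordon the node, then evict every pod not owned by a DaemonSet (DaemonSet pods would just be recreated, which is exactly the gap being described; the node name is an example):
+
+```python
+from kubernetes import client, config
+
+config.load_kube_config()
+v1 = client.CoreV1Api()
+
+NODE = "worker-01"  # example node name
+
+# Cordon: mark the node unschedulable so nothing new lands on it.
+v1.patch_node(NODE, {"spec": {"unschedulable": True}})
+
+pods = v1.list_pod_for_all_namespaces(
+    field_selector="spec.nodeName=" + NODE).items
+for pod in pods:
+    owners = pod.metadata.owner_references or []
+    if any(o.kind == "DaemonSet" for o in owners):
+        continue  # DaemonSet pods cannot be meaningfully evicted
+    # V1Eviction in recent clients (V1beta1Eviction in older releases).
+    v1.create_namespaced_pod_eviction(
+        name=pod.metadata.name,
+        namespace=pod.metadata.namespace,
+        body=client.V1Eviction(metadata=client.V1ObjectMeta(
+            name=pod.metadata.name, namespace=pod.metadata.namespace)))
+```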
+
+Federico: we also double all the nodes in the cluster, cordon the old ones, and migrate over. That's worked well so far, better than node-at-a-time via Kops.
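+
+A minimal sketch of the cordon step in that blue/green node swap, assuming the old instance group is identifiable by a label (the label key and value here are made up):
+
+```python
+from kubernetes import client, config
+
+config.load_kube_config()
+v1 = client.CoreV1Api()
+
+# Hypothetical label distinguishing the old node group from its replacement.
+OLD_GROUP = "node-group=blue"
+
+for node in v1.list_node(label_selector=OLD_GROUP).items:
+    v1.patch_node(node.metadata.name, {"spec": {"unschedulable": True}})
+    print("cordoned", node.metadata.name)
+# Workloads then migrate to the new, uncordoned nodes before the old
+# group is torn down.
+```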
+
+Andy: we've had a few times when Istio gets restarted where we lose a request.
+
+JoshM: our cluster serves stuff that can be cached, so having nodes shut down is not as much of a problem.
+
+Andy: Any manual control of the load balancers? JoshM: No. We have an ELB in front of everything.
+
+JoshB: is upgrading everyone's biggest problem with Kubernetes?
+
+Federico: yes. Especially since you run into dependencies and have to upgrade all of them; it's a puzzle to solve. We have add-ons to give users a better experience, like DNS, and we have the cluster autoscaler for cost management. Those need to be maintained when you upgrade the cluster. We install them ourselves, not using Kops add-ons -- we wrote our own installer.
+
+Andy: we had the same experience with Helm. For the cluster to be useful you have to install a lot of add-ons. For resourcing, you need to account for all your add-ons; we only have 70% of our cluster available to workloads because of all the add-ons. Ryan: are you tracking the cost? Andy: manually, we don't have tools.
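+
+As a rough sketch of how one might quantify that add-on overhead with the official Python client (the namespace list and the simplified quantity parsing are assumptions for illustration):
+
+```python
+from kubernetes import client, config
+
+config.load_kube_config()
+v1 = client.CoreV1Api()
+
+ADDON_NAMESPACES = {"kube-system", "monitoring", "istio-system"}  # examples
+
+def cpu_millis(q):
+    # Simplified quantity parsing: handles "250m" and "2"; ignores edge cases.
+    return int(q[:-1]) if q.endswith("m") else int(float(q) * 1000)
+
+allocatable = sum(cpu_millis(n.status.allocatable["cpu"])
+                  for n in v1.list_node().items)
+
+addon = 0
+for pod in v1.list_pod_for_all_namespaces().items:
+    if pod.metadata.namespace in ADDON_NAMESPACES:
+        for c in pod.spec.containers:
+            requests = (c.resources and c.resources.requests) or {}
+            addon += cpu_millis(requests.get("cpu", "0"))
+
+print("add-ons request %.0f%% of cluster CPU" % (100.0 * addon / allocatable))
+```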
+
+... missed some discussions on cost accounting ...
+
+Federico: Zalando has something for cloudcost.
+
+JoshM: we don't have to track in detail. I'm lucky.
+
+Federico: we need to forecast cost; it's not about chargebacks. We need to know what it'll cost next year. We have a couple of teams who are really cost-aware.
+
+Andy: one of our reasons to bring Kubernetes in was cost efficiency, so we want to know how much we saved. We were able to downsize using Kubernetes compared with our previous AWS solution. We compute cost-per-click.
+
+JoshM: cluster upgrades are not the most difficult thing for us. They're still difficult, but we've worked around a lot of the problems. Right now our most difficult thing is getting our US infra to parity, so I guess cluster upgrades are still a problem.
+
+Federico: the other most difficult thing for us is finding out about changes that affect cluster users -- which things in the release log will affect my users. Finding that in the release notes is a challenge. It has become a lot better, but it's still a big effort. I'd like to have "these are neutral changes, these are things you need to know about, these are things users need to do something about". Like when the kubectl API removed duplicates.
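+
+Absent that triage, a crude sketch of what many operators do today: scan a downloaded release changelog for the markers that tend to signal user-facing impact (both the marker strings and the file name are assumptions):
+
+```python
+import re
+
+MARKERS = re.compile(r"action required|deprecat|remov", re.IGNORECASE)
+
+with open("CHANGELOG-1.17.md") as f:  # example file name
+    for lineno, line in enumerate(f, 1):
+        if MARKERS.search(line):
+            print("%d: %s" % (lineno, line.strip()))
+```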
+
+Andy: yeah, that's also a big effort for me, reading through.
+
+Kevin: two things: multitenancy. After that, security things, like certificates etc. We end up deploying ingresses for users in namespaces they can't manage, and we need workarounds for that.
+
+Brian: do you expose Kubernetes directly to your app devs, or do you hide it behind a UI?
+
+Federico: they have direct access to the command line. Most teams have some kind of CI/CD, but we don't hide anything. They're still responsible for writing their own YAML. A few teams use Terraform; a few use Helm with no Tiller.
+
+JoshM: we optimized for app delivery. We do expose it, but we put a lot of effort into base Helm charts etc. so that users use those templates (all apps are deployed through the same chart). They use parameters to change things. They can't deploy whatever they want; they have to go through CI/CD. There are several options, but they have to go through that.
+
+Kevin: we try to do a "batteries included" configuration, so that our devs can have a common template where they can just deploy their applications.
+
+Paris: do you feel supported from the Kubernetes community/ecosystem?
+
+Andy: I haven't actually had to try yet.
+
+JoshM: from the support side we haven't had any problems. More recently I've started wanting to contribute back to the project, so most gripes have to do with the initial contributor experience: PRs not getting reviewed, so many PRs in the queue. My biggest question is how to jump in and get started.
+
+Paris talked about doing a contributor orientation at companies.
+
+Federico: we have some devs who are interested in contributing, but they're nervous about doing the contributor summit stuff; something at our office would be really nice.
+
+Brian: is the structure of the project clear? Like routing the PRs to the right place?
+
+Paris went over the contributor structure.
+
+Kevin: I contributed to OpenStack; you got influence by showing up at meetings, but there was no way to get visibility across the whole project.
+
+Kevin: the problems I run into are typically WG problems, those really help me.
+
+JoshM: one of the things I've pushed for is hack days; that's easier than getting my company to pay for full-time contributors. Are there features we can just knock out? Folks suggested Prow or Docs.
+
+Kevin: a lot of the docs make the assumption that the developer and the cluster admin are the same person, so we need to separate personas.
+
+Federico: I copied code comments into the documentation for Kops, and stuff got noticed much faster.