GKE checklist for production

Stephane Karagulmez
Jun 19, 2020

In this article we are going to review the checklist I have used during multiple interviews regarding GKE readiness for production. This checklist focuses on Google Kubernetes Engine, and some advice might not be relevant if you are running containers in another environment. For every point I will attach pointers to the relevant GCP documentation.

This checklist is a generic overview and is subject to change depending on your company and your needs.

If you think of any points you would like to include, I will be happy to discuss them with you.

Cluster Configuration

In this first section we assess the main configuration of your cluster. Some of this setup is already handled by GKE, but I will include all the points in case you need them.

A good read to start with is: https://cloud.google.com/kubernetes-engine/docs/concepts/scalability

Cluster upgrades and release channels

You have to decide how you want your GKE environment to evolve and upgrade, but this requires understanding how it works under the hood.

First: cluster upgrades and node pool upgrades are not the same thing.

Cluster upgrades (control plane/master) are automatic but can also be initiated manually. You can configure a maintenance window, and it will be honored if possible.

Node pool upgrades can be set to manual or auto-upgrade mode.

Node pool auto-upgrade is the recommended way to go for production.

In order to be ready for production you have to test a full upgrade. Do a full dry run by manually upgrading your node pool and see how it impacts your production. This will trigger other discussions about Pod Disruption Budgets (PDB), probes, etc.

This test and your level of confidence will impact your release channel choice.

If you want to test alpha features, GKE offers alpha clusters. To help you choose between the types of clusters, here is a good read.

RBAC and access control

The simple rule is: do not use ServiceAccounts for end-user authentication. RBAC is designed for this.

When creating RBAC Roles and ClusterRoles, always apply the principle of least privilege.
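
As an illustration, here is a minimal least-privilege sketch (the role name, namespace and user are hypothetical): a Role and RoleBinding granting read-only access to Pods in a single namespace, instead of a broad ClusterRole.

```yaml
# Hypothetical least-privilege example: read-only access to Pods in one namespace.
apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  name: pod-reader
  namespace: prod                    # scoped to a single namespace
rules:
- apiGroups: [""]                    # "" is the core API group
  resources: ["pods"]
  verbs: ["get", "list", "watch"]    # read-only: no create, update or delete
---
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  name: read-pods
  namespace: prod
subjects:
- kind: User
  name: jane@example.com             # hypothetical Google identity
  apiGroup: rbac.authorization.k8s.io
roleRef:
  kind: Role
  name: pod-reader
  apiGroup: rbac.authorization.k8s.io
```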

Another rule: use Workload Identity to map Kubernetes Service Accounts (KSA) to Google Service Accounts (GSA). This is the best way to access Google Cloud services from within GKE.

Be careful when using Workload Identity with multiple namespaces, because:

All KSAs that share a name, Namespace name, and workload identity pool resolve to the same member name, and therefore share access to GSAs
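
To make the mapping concrete, here is a sketch of the KSA side of a Workload Identity binding (project, namespace and account names are hypothetical):

```yaml
# A KSA annotated with the GSA it impersonates when calling Google Cloud APIs.
apiVersion: v1
kind: ServiceAccount
metadata:
  name: my-ksa
  namespace: prod
  annotations:
    iam.gke.io/gcp-service-account: my-gsa@my-project.iam.gserviceaccount.com
```

On the IAM side, the GSA must grant roles/iam.workloadIdentityUser to the member serviceAccount:my-project.svc.id.goog[prod/my-ksa]. That bracketed namespace/name pair is exactly why KSAs sharing a name and namespace in the same workload identity pool resolve to the same member.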

Logging and monitoring

You want to collect logs from nodes, the control plane, and audit logs. Since GKE 1.14, Kubernetes Engine Monitoring (KEM) is the default option. It comes with everything you need and many dashboards. I recommend not using Legacy Logging anymore, as it lacks features compared to KEM.

You can also use your own logging and monitoring system, but be sure it provides the same level of metrics and monitoring as KEM.

Private or Public cluster

You don’t have to use a private cluster in order to be production ready. Many GKE users run production workloads on non-private clusters.

However, if you don’t need to expose a public API, note that in a private cluster the nodes only have internal IP addresses, which isolates their workloads from the public internet.

Single-zone, Multi-zone and Regional Cluster

Most of the time, regional clusters are the way to go because they are better suited for high availability. The main difference is the number of control plane replicas: zonal clusters have a single control plane in a single compute zone, while regional clusters have several replicas spread across a region. But regional clusters come with some trade-offs; for example, cluster configuration changes take longer. Due to these trade-offs, zonal and regional clusters have different use cases:

  • Use zonal clusters to create or upgrade clusters rapidly when availability is less of a concern.
  • Use regional clusters when availability is more important than flexibility.

DNS

For large-scale clusters or high DNS request load, enable NodeLocal DNSCache for more distributed DNS serving.

Configuring nodes for better performance

GKE nodes are regular Google Cloud virtual machines. Some of their parameters, for example the number of cores or the size of the disk, can influence how GKE clusters perform. So it is a good idea to check the GCE best practices.

As an example, if you use SSD boot disks, you will save time when pulling new images onto your nodes.

Because your nodes are regular Compute Engine VMs, you should also check your Compute Engine API quotas before a large release.

Application Stack

Readiness and Liveness probes

There is a full article available on the GCP website : https://cloud.google.com/blog/products/gcp/kubernetes-best-practices-setting-up-health-checks-with-readiness-and-liveness-probes

Read this article… and then read it again. I won’t explain how probes work here, but be sure to set readiness and “passive” liveness probes.

We use the word “passive” for liveness because many users use liveness probes to handle fatal errors in their app, asking Kubernetes to restart it, and this is not the way to go.

Instead, you should let the app crash.

The liveness probe should be used as a recovery mechanism only when the process is not responsive.

For the same reason, you should not use the liveness probe to handle fatal errors. Fatal errors should result in a container crash.
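
Here is a sketch of what this looks like in a pod template (image, paths, port and timings are hypothetical and should be tuned to your app):

```yaml
# Container fragment inside the pod template of a Deployment.
containers:
- name: web
  image: gcr.io/my-project/web:1.0
  readinessProbe:            # gates traffic: the pod only receives requests when ready
    httpGet:
      path: /ready
      port: 8080
    periodSeconds: 5
  livenessProbe:             # "passive": only checks that the process still responds
    httpGet:
      path: /healthz
      port: 8080
    initialDelaySeconds: 15
    periodSeconds: 20
    failureThreshold: 3      # restart only after sustained unresponsiveness
```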

Application independence

This is linked to our previous point regarding probes.

Another bad habit is starting containers and pods in a specific order to satisfy strong dependencies. A good example is a web app depending on a database.

Your application shouldn’t crash if the database pods are not running. Instead, the application should try to reconnect until it succeeds.

Graceful shutdown

There is a short window between the SIGTERM signal and the pod being killed. This means you can, and should, close network connections, save data, disconnect sockets (long-lived connections, etc.). To do this, check the preStop handler configuration, and capture SIGTERM inside the application itself.
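
A minimal sketch of the Kubernetes side (the grace period and sleep duration are hypothetical values):

```yaml
# Pod spec fragment: give the app time to drain before it is killed.
spec:
  terminationGracePeriodSeconds: 60    # total budget between SIGTERM and SIGKILL
  containers:
  - name: web
    image: gcr.io/my-project/web:1.0   # hypothetical image
    lifecycle:
      preStop:
        exec:
          # short pause so the endpoint is removed before the app stops serving
          command: ["sh", "-c", "sleep 10"]
```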

Scaling and fault tolerance

There are a lot of factors related to fault tolerance. This won’t be an exhaustive list, but it will get you started with the most important points.

First, never run a Pod without a Deployment, DaemonSet, ReplicaSet or StatefulSet. Running a single, unmanaged Pod will cause downtime.
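
For reference, a minimal Deployment sketch (name, labels and image are hypothetical) that keeps three replicas alive instead of one unmanaged Pod:

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: web
spec:
  replicas: 3                # survive the loss of any single pod
  selector:
    matchLabels:
      app: web
  template:
    metadata:
      labels:
        app: web
    spec:
      containers:
      - name: web
        image: gcr.io/my-project/web:1.0
```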

Use Pod Disruption Budgets (PDB) to prevent all your replica pods from being drained at the same time. This event can occur quite often, for example during a node upgrade.
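
For the Deployment sketched above, a PDB could look like this (the threshold is a hypothetical choice):

```yaml
# Keep at least 2 of the 3 replicas running during voluntary disruptions.
apiVersion: policy/v1beta1   # policy/v1 on Kubernetes 1.21 and later
kind: PodDisruptionBudget
metadata:
  name: web-pdb
spec:
  minAvailable: 2
  selector:
    matchLabels:
      app: web
```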

Use inter-pod anti-affinity to spread your pod replicas across multiple nodes. Having one hundred pods on one single node will also cause downtime if you lose this node.
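
Sketched against the same Deployment, a soft anti-affinity rule in the pod template spreads replicas across nodes when possible:

```yaml
# Added under the pod template's spec.
affinity:
  podAntiAffinity:
    preferredDuringSchedulingIgnoredDuringExecution:   # prefer, don't require
    - weight: 100
      podAffinityTerm:
        labelSelector:
          matchLabels:
            app: web
        topologyKey: kubernetes.io/hostname            # spread across nodes
```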

You have to use the Horizontal Pod Autoscaler (HPA) and the Cluster Autoscaler (CA) if your application has to scale.
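
As a sketch, an HPA targeting the Deployment above on CPU (the thresholds are hypothetical):

```yaml
apiVersion: autoscaling/v2beta2   # autoscaling/v2 on newer clusters
kind: HorizontalPodAutoscaler
metadata:
  name: web-hpa
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: web
  minReplicas: 3
  maxReplicas: 10
  metrics:
  - type: Resource
    resource:
      name: cpu
      target:
        type: Utilization
        averageUtilization: 70    # scale out above 70% average CPU
```

The Cluster Autoscaler itself is enabled at the node pool level in GKE, not through a manifest.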

During interviews, I often hear the argument that the CA is “too complicated and can be configured at a later stage”.

Most of the time, this is because you are not in control of what is happening and you need to become more confident. That is OK.

But you do have to build that confidence: run load tests, auto-upgrades and full surge upgrades, and understand how everything responds. If everything is on point… activate the CA.

Conclusion

First of all, I want to thank you for reading this article.
I will be more than happy to include any feedback, so feel free to DM or add comments.

Once again, this list is not complete or exhaustive. It is simply a set of best practices I gathered during interviews. Most of this advice can be applied directly to your cluster, while some of it may not…


Stephane Karagulmez

I write for pleasure. I am passionate about the cloud and Kubernetes. If you would like to discuss these topics: DM me or join me live on Twitch.