
Planning for failure – Avoiding the biggest mistakes made by Kubernetes developers


As our industry continues to hurtle towards more container-based applications orchestrated via Kubernetes, many developers find themselves needing a crash course in best practices for this transformational approach. That’s not unusual: despite the hype, this is still a relatively new technology, and many best practices are still being established.

As a result, developer teams have hit some snags along the way with this powerful but complex technology. When working with Kubernetes, you need to be prepared for obstacles and setbacks as you build your project. If you approach your project without an open mind and a plan for failure, the work can be highly frustrating and, sometimes, counterproductive for a few key reasons:

  • Downtime: If pods or nodes fail and your application isn’t architected to handle it gracefully, you can experience prolonged downtime and service disruptions. A lack of redundancy built into your application means no failover when things go wrong. (see this note from Kubernetes – https://kubernetes.io/docs/concepts/configuration/manage-resources-containers/)
  • Data loss: Stateful workloads like databases need data persistence planning. Ephemeral storage can lead to permanent data loss if pods get restarted or rescheduled.
  • Cascading failures: Problems can compound quickly. If one overloaded pod starts failing, it can take down dependent services, causing a cascade. Without limits, a failure can spread.
  • Difficult debugging: When things go wrong, a lack of visibility into systems and log data makes diagnosing issues time-consuming and difficult. You want insight into failures.
  • No failsafe: Without failure planning, it is hard to know whether your application will recover safely from scenarios like hardware loss. You lose the failsafe and possibly the application.
  • Unmet Service Level Objectives (SLOs): Downtime and slow recovery from failures mean you can miss SLOs, which means unhappy customers, including team leads and other internal stakeholders. A good customer experience depends on redundancy.

Not putting in the design work early to build resiliency and failure handling into Kubernetes applications results in brittle systems that can fail catastrophically. The complexity gets pushed to incident response instead of being built into the system design.

Common Kubernetes Mistakes to Avoid

So, what exactly do you need to avoid when using Kubernetes? There are a few common mistakes when working with Kubernetes that contribute to a brittle system. The key is taking advantage of Kubernetes’ capabilities without over-relying on them. Kubernetes makes managing failures easier, but there is still a need to architect for things like pod/node failures.

First, resource limits and requests for containers need to be defined. Not doing so can lead to resource contention and unpredictable performance in the system. Resources should be explicitly allocated to maintain performance.
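As a rough illustration of the point above, the Python sketch below flags containers in a pod spec that are missing CPU or memory requests or limits. It is only a sketch: the function name `containers_missing_resources`, the image names, and the resource values are hypothetical. Only the `resources.requests` and `resources.limits` field names come from the Kubernetes pod spec.

```python
# Resource names this sketch expects under both "requests" and "limits".
REQUIRED_RESOURCES = ("cpu", "memory")


def containers_missing_resources(pod_spec):
    """Return (container name, section) pairs where cpu or memory is not set."""
    missing = []
    for container in pod_spec.get("containers", []):
        resources = container.get("resources", {})
        for section in ("requests", "limits"):
            values = resources.get(section, {})
            if any(name not in values for name in REQUIRED_RESOURCES):
                missing.append((container["name"], section))
    return missing


# Hypothetical pod spec: "web" is fully specified, "sidecar" is not.
example_pod_spec = {
    "containers": [
        {
            "name": "web",
            "image": "example/web:1.4.2",
            "resources": {
                "requests": {"cpu": "250m", "memory": "256Mi"},
                "limits": {"cpu": "500m", "memory": "512Mi"},
            },
        },
        {"name": "sidecar", "image": "example/sidecar:0.9.0"},
    ]
}

for name, section in containers_missing_resources(example_pod_spec):
    print(f"container {name!r} has incomplete resource {section}")
```

A check like this could run in a CI step before manifests are applied. Whether a missing CPU limit should count as an error is a judgment call for your own workloads.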

Make sure specific image versions are referenced. Too often, the “latest” tag is used for container images, which can cause issues when the image behind that tag changes unexpectedly between deployments.
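One way to enforce this is a small check on image references before they reach the cluster. The Python sketch below treats a reference as pinned when it carries a digest or an explicit tag other than “latest”. The function name `is_pinned` and the example registry and image names are hypothetical.

```python
def is_pinned(image):
    """Return True if the image reference has a digest or a non-latest tag."""
    if "@sha256:" in image:
        return True  # digest references always point at one exact image
    # A registry host may contain a port (host:5000/app), so only look for
    # the tag separator in the part after the last slash.
    last_segment = image.rsplit("/", 1)[-1]
    if ":" not in last_segment:
        return False  # no tag given, so the runtime falls back to "latest"
    tag = last_segment.rsplit(":", 1)[1]
    return tag not in ("", "latest")


# Hypothetical references, checked one by one.
for reference in (
    "nginx",
    "nginx:latest",
    "nginx:1.25.3",
    "registry.example.com:5000/team/app",
    "registry.example.com:5000/team/app:2.1.0",
):
    status = "pinned" if is_pinned(reference) else "NOT pinned"
    print(f"{reference}: {status}")
```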

Another misstep is creating overly complex configurations. As a result, it’s easy to end up with YAML sprawl. You can avoid this by simplifying and modularizing configurations where possible.

Finally, observability is a must. Insufficient logging, monitoring, and alerting make issues harder to detect and debug. To avoid this, build in metrics export, log collection, and alerting so that the inevitable issues can be found and fixed.
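On the logging side, one common starting point is to emit one JSON object per line on standard output so that a cluster-level log collector can parse it. The Python sketch below shows one way to do that with the standard `logging` module. The class name `JsonFormatter`, the helper `build_logger`, and the logger name `orders` are hypothetical, and the choice of fields is an assumption rather than a required schema.

```python
import json
import logging
import sys


class JsonFormatter(logging.Formatter):
    """Render each log record as a single JSON line."""

    def format(self, record):
        payload = {
            "level": record.levelname,
            "logger": record.name,
            "message": record.getMessage(),
        }
        if record.exc_info:
            # Include the traceback text so failures can be diagnosed later.
            payload["exception"] = self.formatException(record.exc_info)
        return json.dumps(payload)


def build_logger(name, stream=sys.stdout):
    """Create a logger that writes JSON lines to the given stream."""
    logger = logging.getLogger(name)
    logger.setLevel(logging.INFO)
    logger.handlers.clear()  # avoid duplicate handlers on repeated calls
    handler = logging.StreamHandler(stream)
    handler.setFormatter(JsonFormatter())
    logger.addHandler(handler)
    logger.propagate = False
    return logger


log = build_logger("orders")
log.info("payment retried %d times", 3)
```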

Plan for failure, secure configurations, simplify deployments, and invest in observability to avoid common pitfalls.

A Better Approach to Leveraging Kubernetes

There are also a number of approaches teams can use to avoid these pitfalls. First and foremost, be sure to design for failure from the start. That means building in redundancy, graceful degradation, failure recovery, and resiliency up front rather than bolting them on later.

Also, make sure to implement pod health checks. This means configuring readiness and liveness probes to enable Kubernetes to handle unhealthy pods automatically. This will help catch issues early.
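Probes need something to call on the application side. The Python sketch below exposes two HTTP endpoints that liveness and readiness probes could be pointed at. The paths `/healthz` and `/readyz`, the names `probe_response`, `ProbeHandler`, and `start_probe_server`, and the use of a simple event to signal readiness are all assumptions made for illustration.

```python
import threading
from http.server import BaseHTTPRequestHandler, ThreadingHTTPServer

# Set this once startup work (cache warm-up, connections) has finished.
ready = threading.Event()


def probe_response(path, is_ready):
    """Map a probe path to an HTTP status code and body."""
    if path == "/healthz":
        return 200, b"ok"  # liveness: the process is up and can answer
    if path == "/readyz":
        # readiness: only accept traffic once startup has completed
        return (200, b"ready") if is_ready else (503, b"not ready")
    return 404, b"not found"


class ProbeHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        status, body = probe_response(self.path, ready.is_set())
        self.send_response(status)
        self.send_header("Content-Type", "text/plain")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, format, *args):
        pass  # keep frequent probe requests out of the application log


def start_probe_server(host="127.0.0.1", port=8080):
    """Serve probes on a background thread; bind 0.0.0.0 inside a container."""
    server = ThreadingHTTPServer((host, port), ProbeHandler)
    threading.Thread(target=server.serve_forever, daemon=True).start()
    return server
```

Keeping liveness and readiness separate matters: a pod that is still warming up should be kept out of rotation, not restarted.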

Use ReplicaSets properly. Run multiple replicas of each pod and let Kubernetes reschedule them when a pod or node fails. And don’t forget to provide a capacity buffer so the remaining replicas can absorb the load while replacements start.
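The capacity buffer can be estimated with simple arithmetic. The Python sketch below sizes a replica count from peak load and per-pod capacity, then adds headroom. The function name `replicas_needed`, the 25% default buffer, and the minimum of two replicas are illustrative assumptions, not recommendations from Kubernetes.

```python
import math


def replicas_needed(peak_rps, rps_per_pod, buffer_fraction=0.25, min_replicas=2):
    """Estimate replicas for a peak request rate, with spare capacity on top."""
    if rps_per_pod <= 0:
        raise ValueError("rps_per_pod must be positive")
    # Replicas required to serve the peak with no headroom at all.
    base = math.ceil(peak_rps / rps_per_pod)
    # Add the buffer so losing a pod does not overload the survivors.
    with_buffer = math.ceil(base * (1 + buffer_fraction))
    # Never go below the minimum needed for basic redundancy.
    return max(with_buffer, min_replicas)


# 900 requests per second at 100 per pod: 9 pods plus 25% headroom.
print(replicas_needed(900, 100))  # prints 12
```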

Avoid putting all your eggs in one basket. Leverage affinity/anti-affinity rules and spread workloads across nodes and failure domains. In addition, automate recovery: use the pod restartPolicy and controllers such as ReplicaSets (the successor to the older ReplicationControllers) to restart and replace failed pods automatically.
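To make the anti-affinity idea concrete, the Python sketch below builds the `affinity` fragment of a pod spec that keeps replicas of the same app apart. The function name `pod_anti_affinity` and the `app` label value are hypothetical. The field names and the `kubernetes.io/hostname` and `topology.kubernetes.io/zone` keys are standard Kubernetes ones.

```python
import json


def pod_anti_affinity(app_label, topology_key="kubernetes.io/hostname", required=False):
    """Build a podAntiAffinity rule that separates pods sharing an app label."""
    term = {
        "labelSelector": {"matchLabels": {"app": app_label}},
        "topologyKey": topology_key,
    }
    if required:
        # Hard rule: the scheduler refuses to co-locate matching pods.
        rule = {"requiredDuringSchedulingIgnoredDuringExecution": [term]}
    else:
        # Soft rule: the scheduler tries to separate pods but may give up.
        rule = {
            "preferredDuringSchedulingIgnoredDuringExecution": [
                {"weight": 100, "podAffinityTerm": term}
            ]
        }
    return {"affinity": {"podAntiAffinity": rule}}


# Prefer spreading a hypothetical "checkout" app across zones.
fragment = pod_anti_affinity("checkout", "topology.kubernetes.io/zone")
print(json.dumps(fragment, indent=2))
```

The soft form is the safer default in small clusters, since a hard rule can leave pods unschedulable when there are fewer nodes or zones than replicas.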

Some other critical tips include:

  • Select pinned image versions – Avoid “latest” tags to prevent unexpected changes on upgrades. Lock in image versions.
  • Set pod resource limits – Don’t rely on unbounded resources. Be sure to restrict CPU/memory usage to create resiliency. That said, setting these limits is tricky (especially CPU limits), so make sure you do your research to understand what will work for your system. (see Optimizing Kubernetes for Java Developers – Pretius – https://pretius.com/blog/jvm-kubernetes/)
  • Configure pod disruption budgets – Define the minimum availability a service needs so that voluntary disruptions, such as node drains or autoscaler scale-downs, are handled smoothly.
  • Implement rolling deployments – Upgrade a little at a time instead of all at once. Each increment limits the risk of a bad release.
  • Practice failure injection – Test failure scenarios in lower environments. Verify fault tolerance capabilities.
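For the rolling deployment tip, it helps to know how many pods stay available while an upgrade is in progress. The Python sketch below works that out from a Deployment's `maxUnavailable` and `maxSurge` settings, assuming the documented rounding (percentages round down for `maxUnavailable` and up for `maxSurge`) and the 25% defaults. The function names `resolve` and `rollout_bounds` are hypothetical.

```python
def resolve(value, replicas, round_up):
    """Turn an absolute number or a percentage string into a pod count."""
    if isinstance(value, str) and value.endswith("%"):
        percent = int(value[:-1])
        if round_up:
            return -(-percent * replicas // 100)  # integer ceiling
        return percent * replicas // 100  # integer floor
    return int(value)


def rollout_bounds(replicas, max_unavailable="25%", max_surge="25%"):
    """Return the pod-count limits that hold during a rolling update."""
    unavailable = resolve(max_unavailable, replicas, round_up=False)
    surge = resolve(max_surge, replicas, round_up=True)
    if unavailable == 0 and surge == 0:
        raise ValueError("maxUnavailable and maxSurge cannot both be zero")
    return {
        "min_available": replicas - unavailable,
        "max_total": replicas + surge,
    }


# With 10 replicas and the defaults, at least 8 pods stay available
# and at most 13 pods exist at any point during the rollout.
print(rollout_bounds(10))
```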

Organizational Approaches to Building Great Kubernetes Systems

Beyond these granular technical tips, there are multiple tenets that teams can implement that will ensure their current and future projects go more smoothly. The best advice for teams is to keep track of mistakes and learn from them. Keep track of outages, understand the root causes of those issues, and factor those learnings into future designs. Team post-mortems can offer valuable lessons for future projects.

Additionally, follow basic industry principles and best practices, including recommended standards for container image management, RBAC policies, pod and node design, etc. This includes adopting SRE principles: make reliability a key goal, set error budgets, and balance innovation against system stability.
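An error budget is simply the downtime an SLO leaves room for. The Python sketch below converts an availability target into minutes of allowed downtime over a window and tracks what remains. The function names and the 30-day window are assumptions chosen for illustration.

```python
def error_budget_minutes(slo_percent, window_days=30):
    """Minutes of downtime an availability SLO allows within the window."""
    total_minutes = window_days * 24 * 60
    return total_minutes * (100 - slo_percent) / 100


def budget_remaining(slo_percent, downtime_minutes, window_days=30):
    """Budget left after the downtime already recorded; negative means overspent."""
    return error_budget_minutes(slo_percent, window_days) - downtime_minutes


# A 99.9% target over 30 days allows roughly 43.2 minutes of downtime.
print(round(error_budget_minutes(99.9), 1))
# After a 30-minute outage, roughly 13.2 minutes remain.
print(round(budget_remaining(99.9, 30), 1))
```

When the remaining budget approaches zero, that is the signal to slow feature rollouts and spend time on stability instead.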

Finally, keep learning. Stay up to date on Kubernetes evolutions and deprecations to avoid surprises. Be proactive about upgrades and migrations. Get training on Kubernetes best practices and study the evolving landscape. Join communities of practice. Stay curious.

While many organizations manage and run Kubernetes themselves, another option is delegating infrastructure management to a specialized platform-as-a-service (PaaS). Providers like Lightbend, with their Kalix offering, deliver a fully managed Kubernetes environment purpose-built for running cloud-native applications.

Conclusion

Though Kubernetes comes with challenges, it is still a fantastic technology whose benefits far outweigh the cost of learning the ropes. Kubernetes enables reliability and scalability when used correctly. Treat resilience, fault tolerance, and automated recovery as core parts of your Kubernetes design instead of an afterthought. Plan for failure as the default!

Lean on the community’s knowledge, start small, and keep an open mindset to improve continuously. Learning Kubernetes takes time but pays big dividends.

This is a contribution by Lightbend.