Kubernetes Production Best Practices: Building Enterprise-Grade Container Orchestration
Running Kubernetes in production requires careful planning, robust security measures, and comprehensive operational practices. This guide covers the essential aspects of building and maintaining enterprise-grade Kubernetes clusters that can handle mission-critical workloads with high availability, security, and performance.
Production Cluster Architecture
Multi-Master High Availability Setup
# kubeadm-ha-config.yaml
apiVersion: kubeadm.k8s.io/v1beta3
kind: InitConfiguration
localAPIEndpoint:
advertiseAddress: 10.0.1.10
bindPort: 6443
nodeRegistration:
criSocket: unix:///var/run/containerd/containerd.sock
kubeletExtraArgs:
cloud-provider: external
---
apiVersion: kubeadm.k8s.io/v1beta3
kind: ClusterConfiguration
kubernetesVersion: v1.28.2
clusterName: production-cluster
controlPlaneEndpoint: k8s-api-lb.company.com:6443
apiServer:
certSANs:
- k8s-api-lb.company.com
- k8s-master-1.company.com
- k8s-master-2.company.com
- k8s-master-3.company.com
- 10.0.1.10
- 10.0.1.11
- 10.0.1.12
- 127.0.0.1
extraArgs:
audit-log-maxage: "30"
audit-log-maxbackup: "10"
audit-log-maxsize: "100"
audit-log-path: /var/log/kubernetes/audit.log
audit-policy-file: /etc/kubernetes/audit-policy.yaml
enable-admission-plugins: NodeRestriction,ResourceQuota,LimitRanger,ServiceAccount,DefaultStorageClass,DefaultTolerationSeconds,MutatingAdmissionWebhook,ValidatingAdmissionWebhook
encryption-provider-config: /etc/kubernetes/encryption-config.yaml
feature-gates: "RotateKubeletServerCertificate=true"
service-account-lookup: "true"
service-account-key-file: /etc/kubernetes/pki/sa.pub
service-account-signing-key-file: /etc/kubernetes/pki/sa.key
extraVolumes:
- name: audit-policy
hostPath: /etc/kubernetes/audit-policy.yaml
mountPath: /etc/kubernetes/audit-policy.yaml
readOnly: true
pathType: File
- name: audit-logs
hostPath: /var/log/kubernetes
mountPath: /var/log/kubernetes
pathType: DirectoryOrCreate
- name: encryption-config
hostPath: /etc/kubernetes/encryption-config.yaml
mountPath: /etc/kubernetes/encryption-config.yaml
readOnly: true
pathType: File
etcd:
local:
dataDir: /var/lib/etcd
extraArgs:
listen-metrics-urls: http://0.0.0.0:2381
auto-compaction-mode: periodic
auto-compaction-retention: "1"
max-request-bytes: "33554432"
quota-backend-bytes: "6442450944"
heartbeat-interval: "250"
election-timeout: "1250"
snapshot-count: "10000"
serverCertSANs:
- k8s-master-1.company.com
- k8s-master-2.company.com
- k8s-master-3.company.com
- 10.0.1.10
- 10.0.1.11
- 10.0.1.12
peerCertSANs:
- k8s-master-1.company.com
- k8s-master-2.company.com
- k8s-master-3.company.com
- 10.0.1.10
- 10.0.1.11
- 10.0.1.12
networking:
serviceSubnet: 10.96.0.0/12
podSubnet: 10.244.0.0/16
dnsDomain: cluster.local
controllerManager:
extraArgs:
bind-address: 0.0.0.0
secure-port: "10257"
cluster-signing-duration: "8760h"
feature-gates: "RotateKubeletServerCertificate=true"
terminated-pod-gc-threshold: "50"
scheduler:
extraArgs:
bind-address: 0.0.0.0
secure-port: "10259"
---
apiVersion: kubelet.config.k8s.io/v1beta1
kind: KubeletConfiguration
cgroupDriver: systemd
containerRuntimeEndpoint: unix:///var/run/containerd/containerd.sock
resolvConf: /run/systemd/resolve/resolv.conf
runtimeRequestTimeout: "15m"
tlsCipherSuites:
- TLS_ECDHE_ECDSA_WITH_AES_128_GCM_SHA256
- TLS_ECDHE_RSA_WITH_AES_128_GCM_SHA256
- TLS_ECDHE_ECDSA_WITH_CHACHA20_POLY1305
- TLS_ECDHE_RSA_WITH_AES_256_GCM_SHA384
- TLS_ECDHE_RSA_WITH_CHACHA20_POLY1305
- TLS_ECDHE_ECDSA_WITH_AES_256_GCM_SHA384
protectKernelDefaults: true
makeIPTablesUtilChains: true
eventRecordQPS: 0
shutdownGracePeriod: 60s
shutdownGracePeriodCriticalPods: 20s
featureGates:
RotateKubeletServerCertificate: true
serverTLSBootstrap: true
rotateCertificates: true
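The API server configuration above mounts /etc/kubernetes/encryption-config.yaml and /etc/kubernetes/audit-policy.yaml, but neither file is shown; if they are missing, kube-apiserver will not start. The sketch below creates minimal versions of both on a control-plane node before kubeadm init runs. The AES key placeholder and the very coarse audit rules are assumptions to be replaced with your own key material and audit policy.
# create-apiserver-configs.sh (run on every control-plane node before kubeadm init)
sudo tee /etc/kubernetes/encryption-config.yaml >/dev/null <<'EOF'
apiVersion: apiserver.config.k8s.io/v1
kind: EncryptionConfiguration
resources:
  - resources:
      - secrets
    providers:
      - aescbc:
          keys:
            - name: key1
              secret: REPLACE_WITH_BASE64_32_BYTE_KEY   # e.g. head -c 32 /dev/urandom | base64
      - identity: {}
EOF
sudo chmod 600 /etc/kubernetes/encryption-config.yaml
sudo tee /etc/kubernetes/audit-policy.yaml >/dev/null <<'EOF'
apiVersion: audit.k8s.io/v1
kind: Policy
rules:
  # Skip read-only requests to keep log volume manageable
  - level: None
    verbs: ["get", "list", "watch"]
  # Record everything else at metadata level
  - level: Metadata
EOF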
Load Balancer Configuration (HAProxy)
# /etc/haproxy/haproxy.cfg
global
log stdout local0
chroot /var/lib/haproxy
stats socket /run/haproxy/admin.sock mode 660 level admin
stats timeout 30s
user haproxy
group haproxy
daemon
defaults
mode http
log global
option httplog
option dontlognull
option log-health-checks
option forwardfor except 127.0.0.0/8
option redispatch
retries 3
timeout http-request 10s
timeout queue 20s
timeout connect 10s
timeout client 1m
timeout server 1m
timeout http-keep-alive 10s
timeout check 10s
# Kubernetes API Server
frontend k8s-api-frontend
bind *:6443
mode tcp
option tcplog
default_backend k8s-api-backend
backend k8s-api-backend
mode tcp
option tcp-check
balance roundrobin
default-server inter 10s downinter 5s rise 2 fall 2 slowstart 60s maxconn 250 maxqueue 256 weight 100
server k8s-master-1 10.0.1.10:6443 check
server k8s-master-2 10.0.1.11:6443 check
server k8s-master-3 10.0.1.12:6443 check
# HAProxy Stats
frontend stats
bind *:8404
stats enable
stats uri /stats
stats refresh 30s
stats admin if TRUE
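With HAProxy answering on k8s-api-lb.company.com:6443, the first control-plane node can be initialized from the kubeadm configuration above and the remaining nodes joined through the load balancer. A rough sketch follows; the token, CA hash, and certificate key are placeholders printed by kubeadm init, not real values.
# bootstrap-control-plane.sh
# On the first control-plane node
sudo kubeadm init --config kubeadm-ha-config.yaml --upload-certs
# Set up kubectl access
mkdir -p $HOME/.kube
sudo cp /etc/kubernetes/admin.conf $HOME/.kube/config
sudo chown $(id -u):$(id -g) $HOME/.kube/config
# On each additional control-plane node, run the join command printed by kubeadm init, e.g.:
sudo kubeadm join k8s-api-lb.company.com:6443 \
  --token <token> \
  --discovery-token-ca-cert-hash sha256:<hash> \
  --control-plane \
  --certificate-key <certificate-key>
# Verify all three control-plane nodes registered and the API is reachable through the load balancer
kubectl get nodes -o wide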
Security Hardening
Pod Security Standards
# pod-security-standards.yaml
apiVersion: v1
kind: Namespace
metadata:
name: production
labels:
pod-security.kubernetes.io/enforce: restricted
pod-security.kubernetes.io/audit: restricted
pod-security.kubernetes.io/warn: restricted
---
apiVersion: v1
kind: Namespace
metadata:
name: staging
labels:
pod-security.kubernetes.io/enforce: baseline
pod-security.kubernetes.io/audit: restricted
pod-security.kubernetes.io/warn: restricted
---
# Pod Security Policy for legacy clusters only: PodSecurityPolicy was removed in Kubernetes v1.25, so this manifest applies only to clusters running v1.24 or earlier
apiVersion: policy/v1beta1
kind: PodSecurityPolicy
metadata:
name: restricted-psp
spec:
privileged: false
allowPrivilegeEscalation: false
requiredDropCapabilities:
- ALL
volumes:
- 'configMap'
- 'emptyDir'
- 'projected'
- 'secret'
- 'downwardAPI'
- 'persistentVolumeClaim'
- 'csi'
hostNetwork: false
hostIPC: false
hostPID: false
runAsUser:
rule: 'MustRunAsNonRoot'
supplementalGroups:
rule: 'MustRunAs'
ranges:
- min: 1
max: 65535
fsGroup:
rule: 'MustRunAs'
ranges:
- min: 1
max: 65535
readOnlyRootFilesystem: true
seLinux:
rule: 'RunAsAny'
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
name: restricted-psp-user
rules:
- apiGroups: ['policy']
resources: ['podsecuritypolicies']
verbs: ['use']
resourceNames:
- restricted-psp
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
name: restricted-psp-all-serviceaccounts
roleRef:
kind: ClusterRole
name: restricted-psp-user
apiGroup: rbac.authorization.k8s.io
subjects:
- kind: Group
name: system:serviceaccounts
apiGroup: rbac.authorization.k8s.io
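A quick way to confirm that the restricted profile is actually enforced in the production namespace is a server-side dry run: a deliberately privileged pod should be rejected by Pod Security admission, while a compliant one is accepted. This is only a sketch; the nginx image and pod names are arbitrary examples.
# Expect this to be rejected by the "restricted" level
kubectl run psa-test --image=nginx:1.25 -n production --dry-run=server \
  --overrides='{"spec":{"containers":[{"name":"psa-test","image":"nginx:1.25","securityContext":{"privileged":true}}]}}'
# Expect this compliant pod to be admitted
kubectl run psa-ok --image=nginx:1.25 -n production --dry-run=server \
  --overrides='{"spec":{"containers":[{"name":"psa-ok","image":"nginx:1.25","securityContext":{"allowPrivilegeEscalation":false,"runAsNonRoot":true,"capabilities":{"drop":["ALL"]},"seccompProfile":{"type":"RuntimeDefault"}}}]}}'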
Network Security Policies
# network-security-policies.yaml
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
name: default-deny-all
namespace: production
spec:
podSelector: {}
policyTypes:
- Ingress
- Egress
---
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
name: allow-dns
namespace: production
spec:
podSelector: {}
policyTypes:
- Egress
egress:
- to:
- namespaceSelector:
matchLabels:
kubernetes.io/metadata.name: kube-system
podSelector:
matchLabels:
k8s-app: kube-dns
ports:
- protocol: UDP
port: 53
- protocol: TCP
port: 53
---
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
name: web-app-netpol
namespace: production
spec:
podSelector:
matchLabels:
app: web-app
policyTypes:
- Ingress
- Egress
ingress:
- from:
- namespaceSelector:
matchLabels:
kubernetes.io/metadata.name: ingress-nginx
podSelector:
matchLabels:
app.kubernetes.io/name: ingress-nginx
ports:
- protocol: TCP
port: 8080
egress:
- to:
- podSelector:
matchLabels:
app: database
ports:
- protocol: TCP
port: 5432
- to:
- podSelector:
matchLabels:
app: redis
ports:
- protocol: TCP
port: 6379
- to: []
ports:
- protocol: UDP
port: 53
- protocol: TCP
port: 53
---
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
name: database-netpol
namespace: production
spec:
podSelector:
matchLabels:
app: database
policyTypes:
- Ingress
ingress:
- from:
- podSelector:
matchLabels:
app: web-app
- podSelector:
matchLabels:
app: api-server
ports:
- protocol: TCP
port: 5432
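Network policies are only enforced when the CNI plugin supports them (Calico, Cilium, and similar; plain flannel does not), so it is worth probing connectivity after applying them. Because the production namespace enforces the restricted Pod Security level, a bare busybox pod would be rejected there, so this sketch probes from staging instead; the Service name database is an assumption.
# Confirm the policies were created
kubectl get networkpolicy -n production
# From staging (baseline Pod Security, no egress restrictions), the production database should be
# unreachable, since database-netpol only admits ingress from web-app and api-server pods in production
kubectl run netpol-test --rm -it --restart=Never --image=busybox:1.36 -n staging -- \
  sh -c 'nc -zv -w 3 database.production.svc.cluster.local 5432 && echo "UNEXPECTED: reachable" || echo "blocked as expected"'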
RBAC Configuration
# rbac-configuration.yaml
apiVersion: v1
kind: ServiceAccount
metadata:
name: app-service-account
namespace: production
---
apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
namespace: production
name: app-role
rules:
- apiGroups: [""]
resources: ["pods", "services", "endpoints"]
verbs: ["get", "list", "watch"]
- apiGroups: [""]
resources: ["configmaps", "secrets"]
verbs: ["get", "list", "watch"]
- apiGroups: ["apps"]
resources: ["deployments", "replicasets"]
verbs: ["get", "list", "watch"]
---
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
name: app-role-binding
namespace: production
subjects:
- kind: ServiceAccount
name: app-service-account
namespace: production
roleRef:
kind: Role
name: app-role
apiGroup: rbac.authorization.k8s.io
---
# Cluster-level roles for monitoring
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
name: monitoring-reader
rules:
- apiGroups: [""]
resources: ["nodes", "nodes/metrics", "services", "endpoints", "pods"]
verbs: ["get", "list", "watch"]
- apiGroups: ["apps"]
resources: ["deployments", "daemonsets", "replicasets", "statefulsets"]
verbs: ["get", "list", "watch"]
- apiGroups: ["networking.k8s.io"]
resources: ["ingresses"]
verbs: ["get", "list", "watch"]
- nonResourceURLs: ["/metrics"]
verbs: ["get"]
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
name: monitoring-reader-binding
roleRef:
apiGroup: rbac.authorization.k8s.io
kind: ClusterRole
name: monitoring-reader
subjects:
- kind: ServiceAccount
name: prometheus
namespace: monitoring
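kubectl auth can-i makes it easy to confirm that these bindings grant exactly what you expect by impersonating the service accounts defined above.
# The app service account can read pods in production...
kubectl auth can-i list pods -n production \
  --as=system:serviceaccount:production:app-service-account    # expect: yes
# ...but cannot create or delete workloads
kubectl auth can-i create deployments -n production \
  --as=system:serviceaccount:production:app-service-account    # expect: no
# Prometheus can read nodes cluster-wide and scrape the /metrics non-resource URL
kubectl auth can-i get nodes --as=system:serviceaccount:monitoring:prometheus      # expect: yes
kubectl auth can-i get /metrics --as=system:serviceaccount:monitoring:prometheus   # expect: yes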
Resource Management and Optimization
Resource Quotas and Limits
# resource-management.yaml
apiVersion: v1
kind: ResourceQuota
metadata:
name: production-quota
namespace: production
spec:
hard:
requests.cpu: "100"
requests.memory: 200Gi
limits.cpu: "200"
limits.memory: 400Gi
persistentvolumeclaims: "50"
requests.storage: "1Ti"
count/deployments.apps: "50"
count/services: "25"
count/secrets: "100"
count/configmaps: "100"
---
apiVersion: v1
kind: LimitRange
metadata:
name: production-limits
namespace: production
spec:
limits:
- default:
cpu: "500m"
memory: "512Mi"
defaultRequest:
cpu: "100m"
memory: "128Mi"
max:
cpu: "2"
memory: "4Gi"
min:
cpu: "50m"
memory: "64Mi"
type: Container
- max:
storage: "100Gi"
min:
storage: "1Gi"
type: PersistentVolumeClaim
---
# Priority Classes
apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
name: high-priority
value: 1000
globalDefault: false
description: "High priority class for critical applications"
---
apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
name: medium-priority
value: 500
globalDefault: true
description: "Medium priority class for standard applications"
---
apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
name: low-priority
value: 100
globalDefault: false
description: "Low priority class for batch jobs"
Horizontal Pod Autoscaler (HPA) Configuration
# hpa-configuration.yaml
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
name: web-app-hpa
namespace: production
spec:
scaleTargetRef:
apiVersion: apps/v1
kind: Deployment
name: web-app
minReplicas: 3
maxReplicas: 50
metrics:
- type: Resource
resource:
name: cpu
target:
type: Utilization
averageUtilization: 70
- type: Resource
resource:
name: memory
target:
type: Utilization
averageUtilization: 80
- type: Pods
pods:
metric:
name: http_requests_per_second
target:
type: AverageValue
averageValue: "100"
behavior:
scaleDown:
stabilizationWindowSeconds: 300
policies:
- type: Percent
value: 10
periodSeconds: 60
- type: Pods
value: 2
periodSeconds: 60
selectPolicy: Min
scaleUp:
stabilizationWindowSeconds: 60
policies:
- type: Percent
value: 50
periodSeconds: 60
- type: Pods
value: 5
periodSeconds: 60
selectPolicy: Max
---
# Vertical Pod Autoscaler (VPA)
apiVersion: autoscaling.k8s.io/v1
kind: VerticalPodAutoscaler
metadata:
name: web-app-vpa
namespace: production
spec:
targetRef:
apiVersion: apps/v1
kind: Deployment
name: web-app
updatePolicy:
updateMode: "Auto"
resourcePolicy:
containerPolicies:
- containerName: web-app
maxAllowed:
cpu: 2
memory: 4Gi
minAllowed:
cpu: 100m
memory: 128Mi
controlledResources: ["cpu", "memory"]
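Both autoscalers rely on components that do not ship with a bare cluster: resource metrics come from metrics-server, the http_requests_per_second pod metric needs a custom-metrics adapter such as prometheus-adapter, and VPA is a separate install. Also note that running VPA in "Auto" mode against the same deployment an HPA scales on CPU/memory is generally discouraged. The commands below are a sketch for checking that the metrics pipeline is actually feeding the autoscalers.
# Resource metrics must resolve before the HPA can act
kubectl top pods -n production
# TARGETS showing "<unknown>" usually means metrics-server or the custom-metrics adapter is missing
kubectl get hpa web-app-hpa -n production
# VPA recommendations appear in the object's status once the recommender has data
kubectl describe vpa web-app-vpa -n production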
Storage Management
Storage Classes and Persistent Volumes
# storage-configuration.yaml
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
name: fast-ssd
annotations:
storageclass.kubernetes.io/is-default-class: "true"
provisioner: ebs.csi.aws.com
parameters:
type: gp3
iops: "3000"
throughput: "125"
encrypted: "true"
kmsKeyId: arn:aws:kms:us-west-2:123456789012:key/12345678-1234-1234-1234-123456789012
volumeBindingMode: WaitForFirstConsumer
allowVolumeExpansion: true
reclaimPolicy: Delete
---
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
name: standard-hdd
provisioner: ebs.csi.aws.com
parameters:
type: st1 # throughput-optimized HDD (gp2/gp3 are SSD types); note st1 volumes have a 125Gi minimum size
encrypted: "true"
volumeBindingMode: WaitForFirstConsumer
allowVolumeExpansion: true
reclaimPolicy: Delete
---
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
name: high-iops-ssd
provisioner: ebs.csi.aws.com
parameters:
type: io2
iops: "10000"
encrypted: "true"
volumeBindingMode: WaitForFirstConsumer
allowVolumeExpansion: true
reclaimPolicy: Retain
---
# Volume Snapshot Class
apiVersion: snapshot.storage.k8s.io/v1
kind: VolumeSnapshotClass
metadata:
name: ebs-snapshot-class
driver: ebs.csi.aws.com
deletionPolicy: Delete
parameters:
tagSpecification_1: "Name=*"
tagSpecification_2: "Environment=production"
---
# Persistent Volume Claim Template
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
name: database-pvc
namespace: production
spec:
accessModes:
- ReadWriteOnce
storageClassName: fast-ssd
resources:
requests:
storage: 100Gi
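The snapshot class above pays off when a volume has to be rolled back: a new PVC can be provisioned directly from a VolumeSnapshot by referencing it as a dataSource. A sketch, assuming a snapshot named database-pvc-snapshot-20240101 already exists (the name is illustrative):
cat <<'EOF' | kubectl apply -f -
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: database-pvc-restored
  namespace: production
spec:
  accessModes:
    - ReadWriteOnce
  storageClassName: fast-ssd
  dataSource:
    name: database-pvc-snapshot-20240101   # illustrative snapshot name
    kind: VolumeSnapshot
    apiGroup: snapshot.storage.k8s.io
  resources:
    requests:
      storage: 100Gi
EOF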
Backup and Disaster Recovery
# backup-configuration.yaml
apiVersion: v1
kind: ConfigMap
metadata:
name: backup-scripts
namespace: production
data:
backup.sh: |
#!/bin/bash
set -e
NAMESPACE=${NAMESPACE:-production}
BACKUP_BUCKET=${BACKUP_BUCKET:-k8s-backups}
DATE=$(date +%Y%m%d-%H%M%S)
# Backup etcd
echo "Backing up etcd..."
kubectl exec -n kube-system etcd-master-1 -- \
etcdctl snapshot save /tmp/etcd-backup-${DATE}.db \
--endpoints=https://127.0.0.1:2379 \
--cacert=/etc/kubernetes/pki/etcd/ca.crt \
--cert=/etc/kubernetes/pki/etcd/server.crt \
--key=/etc/kubernetes/pki/etcd/server.key
# Copy etcd backup to S3 (kubectl cp needs tar inside the target container; if the etcd image lacks it, save the snapshot to a hostPath and copy it from the node instead)
kubectl cp kube-system/etcd-master-1:/tmp/etcd-backup-${DATE}.db ./etcd-backup-${DATE}.db
aws s3 cp ./etcd-backup-${DATE}.db s3://${BACKUP_BUCKET}/etcd/
# Backup application data
echo "Backing up application data..."
kubectl get all,pvc,secrets,configmaps -n ${NAMESPACE} -o yaml > app-backup-${DATE}.yaml
aws s3 cp app-backup-${DATE}.yaml s3://${BACKUP_BUCKET}/applications/
# Backup persistent volumes
echo "Creating volume snapshots..."
kubectl get pvc -n ${NAMESPACE} -o json | jq -r '.items[].metadata.name' | while read pvc; do
cat <<EOF | kubectl apply -f -
apiVersion: snapshot.storage.k8s.io/v1
kind: VolumeSnapshot
metadata:
name: ${pvc}-snapshot-${DATE}
namespace: ${NAMESPACE}
spec:
volumeSnapshotClassName: ebs-snapshot-class
source:
persistentVolumeClaimName: ${pvc}
EOF
done
echo "Backup completed successfully"
restore.sh: |
#!/bin/bash
set -e
BACKUP_DATE=${1:-latest}
NAMESPACE=${NAMESPACE:-production}
BACKUP_BUCKET=${BACKUP_BUCKET:-k8s-backups}
if [ "$BACKUP_DATE" = "latest" ]; then
BACKUP_DATE=$(aws s3 ls s3://${BACKUP_BUCKET}/etcd/ | sort | tail -n 1 | awk '{print $4}' | sed 's/etcd-backup-\(.*\)\.db/\1/')
fi
echo "Restoring from backup: $BACKUP_DATE"
# Restore etcd (requires cluster shutdown)
echo "Restoring etcd..."
aws s3 cp s3://${BACKUP_BUCKET}/etcd/etcd-backup-${BACKUP_DATE}.db ./etcd-backup.db
# Stop etcd on all masters
sudo systemctl stop kubelet
sudo crictl stop $(sudo crictl ps -q --name etcd) # containerd runtime: use crictl, not docker
# Restore etcd data
sudo etcdctl snapshot restore ./etcd-backup.db \
--data-dir=/var/lib/etcd-restore \
--name=master-1 \
--initial-cluster=master-1=https://10.0.1.10:2380,master-2=https://10.0.1.11:2380,master-3=https://10.0.1.12:2380 \
--initial-advertise-peer-urls=https://10.0.1.10:2380
# Replace etcd data directory
sudo rm -rf /var/lib/etcd
sudo mv /var/lib/etcd-restore /var/lib/etcd
sudo chown -R root:root /var/lib/etcd # kubeadm runs etcd as a static pod under root; adjust ownership if your etcd runs as a dedicated user
# Start kubelet
sudo systemctl start kubelet
# Restore application resources
echo "Restoring application resources..."
aws s3 cp s3://${BACKUP_BUCKET}/applications/app-backup-${BACKUP_DATE}.yaml ./app-backup.yaml
kubectl apply -f app-backup.yaml
echo "Restore completed successfully"
---
apiVersion: batch/v1
kind: CronJob
metadata:
name: cluster-backup
namespace: production
spec:
schedule: "0 2 * * *" # Daily at 2 AM
jobTemplate:
spec:
template:
spec:
serviceAccountName: backup-service-account
containers:
- name: backup
image: amazon/aws-cli:latest # the scripts also need kubectl and jq, so in practice build a small custom image bundling aws-cli, kubectl, and jq
command: ["/bin/bash"]
args: ["/scripts/backup.sh"]
env:
- name: NAMESPACE
value: "production"
- name: BACKUP_BUCKET
value: "k8s-backups"
volumeMounts:
- name: backup-scripts
mountPath: /scripts
- name: kubectl-config
mountPath: /root/.kube
volumes:
- name: backup-scripts
configMap:
name: backup-scripts
defaultMode: 0755
- name: kubectl-config
secret:
secretName: kubectl-config
restartPolicy: OnFailure
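The CronJob references a backup-service-account that is not defined above; the backup script execs into etcd, reads objects across the namespace, and creates VolumeSnapshots, so the account needs read access plus exec and snapshot-create rights. A minimal sketch of what that could look like (scope it down further for your environment):
cat <<'EOF' | kubectl apply -f -
apiVersion: v1
kind: ServiceAccount
metadata:
  name: backup-service-account
  namespace: production
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: backup-operator
rules:
  - apiGroups: [""]
    resources: ["pods", "services", "endpoints", "persistentvolumeclaims", "secrets", "configmaps"]
    verbs: ["get", "list"]
  - apiGroups: [""]
    resources: ["pods/exec"]
    verbs: ["create"]          # needed for "kubectl exec" into the etcd pod
  - apiGroups: ["apps", "batch"]
    resources: ["deployments", "replicasets", "statefulsets", "daemonsets", "jobs", "cronjobs"]
    verbs: ["get", "list"]
  - apiGroups: ["snapshot.storage.k8s.io"]
    resources: ["volumesnapshots"]
    verbs: ["get", "list", "create"]
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
  name: backup-operator-binding
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: ClusterRole
  name: backup-operator
subjects:
  - kind: ServiceAccount
    name: backup-service-account
    namespace: production
EOF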
Monitoring and Observability
Prometheus Configuration
# prometheus-config.yaml
apiVersion: v1
kind: ConfigMap
metadata:
name: prometheus-config
namespace: monitoring
data:
prometheus.yml: |
global:
scrape_interval: 15s
evaluation_interval: 15s
external_labels:
cluster: 'production'
region: 'us-west-2'
rule_files:
- "/etc/prometheus/rules/*.yml"
alerting:
alertmanagers:
- static_configs:
- targets:
- alertmanager:9093
scrape_configs:
# Kubernetes API Server
- job_name: 'kubernetes-apiservers'
kubernetes_sd_configs:
- role: endpoints
scheme: https
tls_config:
ca_file: /var/run/secrets/kubernetes.io/serviceaccount/ca.crt
bearer_token_file: /var/run/secrets/kubernetes.io/serviceaccount/token
relabel_configs:
- source_labels: [__meta_kubernetes_namespace, __meta_kubernetes_service_name, __meta_kubernetes_endpoint_port_name]
action: keep
regex: default;kubernetes;https
# Kubernetes Nodes
- job_name: 'kubernetes-nodes'
kubernetes_sd_configs:
- role: node
scheme: https
tls_config:
ca_file: /var/run/secrets/kubernetes.io/serviceaccount/ca.crt
bearer_token_file: /var/run/secrets/kubernetes.io/serviceaccount/token
relabel_configs:
- action: labelmap
regex: __meta_kubernetes_node_label_(.+)
- target_label: __address__
replacement: kubernetes.default.svc:443
- source_labels: [__meta_kubernetes_node_name]
regex: (.+)
target_label: __metrics_path__
replacement: /api/v1/nodes/${1}/proxy/metrics
# Kubernetes Pods
- job_name: 'kubernetes-pods'
kubernetes_sd_configs:
- role: pod
relabel_configs:
- source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_scrape]
action: keep
regex: true
- source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_path]
action: replace
target_label: __metrics_path__
regex: (.+)
- source_labels: [__address__, __meta_kubernetes_pod_annotation_prometheus_io_port]
action: replace
regex: ([^:]+)(?::\d+)?;(\d+)
replacement: $1:$2
target_label: __address__
- action: labelmap
regex: __meta_kubernetes_pod_label_(.+)
- source_labels: [__meta_kubernetes_namespace]
action: replace
target_label: kubernetes_namespace
- source_labels: [__meta_kubernetes_pod_name]
action: replace
target_label: kubernetes_pod_name
# cAdvisor
- job_name: 'kubernetes-cadvisor'
kubernetes_sd_configs:
- role: node
scheme: https
tls_config:
ca_file: /var/run/secrets/kubernetes.io/serviceaccount/ca.crt
bearer_token_file: /var/run/secrets/kubernetes.io/serviceaccount/token
relabel_configs:
- action: labelmap
regex: __meta_kubernetes_node_label_(.+)
- target_label: __address__
replacement: kubernetes.default.svc:443
- source_labels: [__meta_kubernetes_node_name]
regex: (.+)
target_label: __metrics_path__
replacement: /api/v1/nodes/${1}/proxy/metrics/cadvisor
alerts.yml: |
groups:
- name: kubernetes-alerts
rules:
- alert: KubernetesNodeNotReady
expr: kube_node_status_condition{condition="Ready",status="true"} == 0
for: 10m
labels:
severity: critical
annotations:
summary: Kubernetes Node not ready (instance {{ $labels.instance }})
description: "Node {{ $labels.node }} has been unready for a long time"
- alert: KubernetesMemoryPressure
expr: kube_node_status_condition{condition="MemoryPressure",status="true"} == 1
for: 2m
labels:
severity: critical
annotations:
summary: Kubernetes memory pressure (instance {{ $labels.instance }})
description: "Node {{ $labels.node }} has MemoryPressure condition"
- alert: KubernetesDiskPressure
expr: kube_node_status_condition{condition="DiskPressure",status="true"} == 1
for: 2m
labels:
severity: critical
annotations:
summary: Kubernetes disk pressure (instance {{ $labels.instance }})
description: "Node {{ $labels.node }} has DiskPressure condition"
- alert: KubernetesNodePidPressure
expr: kube_node_status_condition{condition="PIDPressure",status="true"} == 1
for: 2m
labels:
severity: critical
annotations:
summary: Kubernetes PID pressure (instance {{ $labels.instance }})
description: "Node {{ $labels.node }} has PIDPressure condition"
- alert: KubernetesOutOfCapacity
expr: sum by (node) ((kube_pod_status_phase{phase="Running"} == 1) + on(uid) group_left(node) (0 * kube_pod_info{pod_template_hash=""})) / sum by (node) (kube_node_status_allocatable{resource="pods"}) * 100 > 90
for: 2m
labels:
severity: warning
annotations:
summary: Kubernetes out of capacity (instance {{ $labels.instance }})
description: "Node {{ $labels.node }} is out of capacity"
- alert: KubernetesContainerOomKiller
expr: (kube_pod_container_status_restarts_total - kube_pod_container_status_restarts_total offset 10m >= 1) and ignoring (reason) min_over_time(kube_pod_container_status_last_terminated_reason{reason="OOMKilled"}[10m]) == 1
for: 0m
labels:
severity: warning
annotations:
summary: Kubernetes container oom killer (instance {{ $labels.instance }})
description: "Container {{ $labels.container }} in pod {{ $labels.namespace }}/{{ $labels.pod }} has been OOMKilled"
- alert: KubernetesPodCrashLooping
expr: increase(kube_pod_container_status_restarts_total[1m]) > 3
for: 2m
labels:
severity: warning
annotations:
summary: Kubernetes pod crash looping (instance {{ $labels.instance }})
description: "Pod {{ $labels.namespace }}/{{ $labels.pod }} is crash looping"
- alert: KubernetesReplicaSetMismatch
expr: kube_replicaset_spec_replicas != kube_replicaset_status_ready_replicas
for: 10m
labels:
severity: warning
annotations:
summary: Kubernetes ReplicaSet mismatch (instance {{ $labels.instance }})
description: "ReplicaSet {{ $labels.namespace }}/{{ $labels.replicaset }} has not matched the expected number of ready replicas"
- alert: KubernetesDeploymentReplicasMismatch
expr: kube_deployment_spec_replicas != kube_deployment_status_available_replicas
for: 10m
labels:
severity: warning
annotations:
summary: Kubernetes Deployment replicas mismatch (instance {{ $labels.instance }})
description: "Deployment {{ $labels.namespace }}/{{ $labels.deployment }} has not matched the expected number of replicas for longer than 10 minutes."
- alert: KubernetesStatefulsetReplicasMismatch
expr: kube_statefulset_status_replicas_ready != kube_statefulset_status_replicas
for: 10m
labels:
severity: warning
annotations:
summary: Kubernetes StatefulSet replicas mismatch (instance {{ $labels.instance }})
description: "A StatefulSet does not match the expected number of replicas."
- alert: KubernetesHpaScalingAbility
expr: kube_horizontalpodautoscaler_status_condition{status="false", condition="AbleToScale"} == 1
for: 2m
labels:
severity: warning
annotations:
summary: Kubernetes HPA scaling ability (instance {{ $labels.instance }})
description: "Pod is unable to scale"
- alert: KubernetesHpaMetricAvailability
expr: kube_horizontalpodautoscaler_status_condition{status="false", condition="ScalingActive"} == 1
for: 2m
labels:
severity: warning
annotations:
summary: Kubernetes HPA metric availability (instance {{ $labels.instance }})
description: "HPA is not able to collect metrics"
- alert: KubernetesHpaScaleCapability
expr: kube_horizontalpodautoscaler_status_desired_replicas >= kube_horizontalpodautoscaler_spec_max_replicas
for: 2m
labels:
severity: info
annotations:
summary: Kubernetes HPA scale capability (instance {{ $labels.instance }})
description: "The maximum number of desired Pods has been hit"
- alert: KubernetesPodNotHealthy
expr: min_over_time(sum by (namespace, pod) (kube_pod_status_phase{phase=~"Pending|Unknown|Failed"})[15m:1m]) > 0
for: 0m
labels:
severity: critical
annotations:
summary: Kubernetes Pod not healthy (instance {{ $labels.instance }})
description: "Pod has been in a non-ready state for longer than 15 minutes."
- alert: KubernetesVolumeOutOfDiskSpace
expr: kubelet_volume_stats_available_bytes / kubelet_volume_stats_capacity_bytes * 100 < 10
for: 2m
labels:
severity: warning
annotations:
summary: Kubernetes Volume out of disk space (instance {{ $labels.instance }})
description: "Volume is almost full (< 10% left)"
- alert: KubernetesVolumeFullInFourDays
expr: predict_linear(kubelet_volume_stats_available_bytes[6h], 4 * 24 * 3600) < 0
for: 0m
labels:
severity: critical
annotations:
summary: Kubernetes Volume full in four days (instance {{ $labels.instance }})
description: "{{ $labels.namespace }}/{{ $labels.persistentvolumeclaim }} is expected to fill up within four days. Currently {{ $value | humanize }}% is available."
- alert: KubernetesPersistentVolumeError
expr: kube_persistentvolume_status_phase{phase=~"Failed|Pending", job="kube-state-metrics"} > 0
for: 0m
labels:
severity: critical
annotations:
summary: Kubernetes PersistentVolume error (instance {{ $labels.instance }})
description: "Persistent volume is in bad state"
- alert: KubernetesStatefulsetDown
expr: (kube_statefulset_status_replicas_ready / kube_statefulset_status_replicas_current) != 1
for: 1m
labels:
severity: critical
annotations:
summary: Kubernetes StatefulSet down (instance {{ $labels.instance }})
description: "A StatefulSet went down"
- alert: KubernetesHpaReplicasMismatch
expr: (kube_horizontalpodautoscaler_status_desired_replicas != kube_horizontalpodautoscaler_status_current_replicas) and (kube_horizontalpodautoscaler_status_current_replicas > kube_horizontalpodautoscaler_spec_min_replicas)
for: 15m
labels:
severity: warning
annotations:
summary: Kubernetes HPA replicas mismatch (instance {{ $labels.instance }})
description: "HPA {{ $labels.namespace }}/{{ $labels.horizontalpodautoscaler }} has not matched the desired number of replicas for longer than 15 minutes."
Operational Best Practices
Cluster Maintenance Procedures
#!/bin/bash
# cluster-maintenance.sh
set -e
CLUSTER_NAME="production-cluster"
BACKUP_BUCKET="k8s-backups"
DATE=$(date +%Y%m%d-%H%M%S)
# Pre-maintenance checks
echo "=== Pre-maintenance checks ==="
# Check cluster health
kubectl get nodes
kubectl get pods --all-namespaces | grep -v Running | grep -v Completed || true # grep exits non-zero when nothing is unhealthy, which would abort the script under set -e
# Check resource usage
kubectl top nodes
kubectl top pods --all-namespaces --sort-by=cpu
# Create backup
echo "=== Creating backup ==="
./backup.sh
# Drain nodes for maintenance (one by one)
drain_node() {
local node=$1
echo "Draining node: $node"
# Cordon the node
kubectl cordon $node
# Drain the node
kubectl drain $node --ignore-daemonsets --delete-emptydir-data --force --grace-period=300
echo "Node $node drained successfully"
}
# Uncordon node after maintenance
uncordon_node() {
local node=$1
echo "Uncordoning node: $node"
kubectl uncordon $node
# Wait for node to be ready
kubectl wait --for=condition=Ready node/$node --timeout=300s
echo "Node $node is ready"
}
# Rolling update procedure
rolling_update() {
local deployment=$1
local namespace=${2:-default}
local image=$3
echo "Updating deployment $deployment in namespace $namespace"
# Update the deployment ("container" below must be replaced with the actual container name from the pod spec)
kubectl set image deployment/$deployment container=$image -n $namespace
# Wait for rollout to complete
kubectl rollout status deployment/$deployment -n $namespace --timeout=600s
# Verify the update
kubectl get pods -n $namespace -l app=$deployment
echo "Deployment $deployment updated successfully"
}
# Certificate rotation
rotate_certificates() {
echo "=== Rotating certificates ==="
# Check certificate expiration
kubeadm certs check-expiration
# Renew certificates
kubeadm certs renew all
# Restart control plane components: they run as static pods, so deleting the mirror Pod objects does not
# restart them; stop the containers via the runtime and kubelet will recreate them from the manifests
for component in kube-apiserver kube-controller-manager kube-scheduler etcd; do
sudo crictl ps -q --name "$component" | xargs -r sudo crictl stop
done
# Update kubeconfig
sudo cp /etc/kubernetes/admin.conf ~/.kube/config
sudo chown $(id -u):$(id -g) ~/.kube/config
echo "Certificates rotated successfully"
}
# Cleanup old resources
cleanup_resources() {
echo "=== Cleaning up old resources ==="
# Remove completed jobs older than 7 days
kubectl get jobs --all-namespaces -o json | jq -r '.items[] | select(.status.conditions[]?.type == "Complete") | select(.metadata.creationTimestamp | fromdateiso8601 < (now - 604800)) | "\(.metadata.namespace) \(.metadata.name)"' | while read namespace job; do
kubectl delete job $job -n $namespace
done
# Remove old replica sets
kubectl get rs --all-namespaces -o json | jq -r '.items[] | select(.spec.replicas == 0) | select(.metadata.creationTimestamp | fromdateiso8601 < (now - 604800)) | "\(.metadata.namespace) \(.metadata.name)"' | while read namespace rs; do
kubectl delete rs $rs -n $namespace
done
# Remove old pods in Succeeded state
kubectl get pods --all-namespaces --field-selector=status.phase=Succeeded -o json | jq -r '.items[] | select(.metadata.creationTimestamp | fromdateiso8601 < (now - 86400)) | "\(.metadata.namespace) \(.metadata.name)"' | while read namespace pod; do
kubectl delete pod $pod -n $namespace
done
echo "Cleanup completed"
}
# Performance optimization
optimize_performance() {
echo "=== Performance optimization ==="
# Compact etcd
kubectl -n kube-system exec etcd-master-1 -- etcdctl \
--endpoints=https://127.0.0.1:2379 \
--cacert=/etc/kubernetes/pki/etcd/ca.crt \
--cert=/etc/kubernetes/pki/etcd/server.crt \
--key=/etc/kubernetes/pki/etcd/server.key \
compact $(kubectl -n kube-system exec etcd-master-1 -- etcdctl \
--endpoints=https://127.0.0.1:2379 \
--cacert=/etc/kubernetes/pki/etcd/ca.crt \
--cert=/etc/kubernetes/pki/etcd/server.crt \
--key=/etc/kubernetes/pki/etcd/server.key \
endpoint status --write-out="json" | jq -r '.[0].Status.header.revision')
# Defragment etcd
kubectl -n kube-system exec etcd-master-1 -- etcdctl \
--endpoints=https://127.0.0.1:2379 \
--cacert=/etc/kubernetes/pki/etcd/ca.crt \
--cert=/etc/kubernetes/pki/etcd/server.crt \
--key=/etc/kubernetes/pki/etcd/server.key \
defrag
echo "Performance optimization completed"
}
# Main maintenance function
main() {
case "$1" in
"drain")
drain_node $2
;;
"uncordon")
uncordon_node $2
;;
"update")
rolling_update $2 $3 $4
;;
"certs")
rotate_certificates
;;
"cleanup")
cleanup_resources
;;
"optimize")
optimize_performance
;;
"full")
echo "Starting full maintenance procedure..."
cleanup_resources
optimize_performance
rotate_certificates
echo "Full maintenance completed"
;;
*)
echo "Usage: $0 {drain|uncordon|update|certs|cleanup|optimize|full} [args...]"
echo " drain <node> - Drain a node for maintenance"
echo " uncordon <node> - Uncordon a node after maintenance"
echo " update <deployment> <ns> <img> - Rolling update deployment"
echo " certs - Rotate certificates"
echo " cleanup - Cleanup old resources"
echo " optimize - Optimize cluster performance"
echo " full - Run full maintenance procedure"
exit 1
;;
esac
}
main "$@"
Conclusion
Running Kubernetes in production requires a comprehensive approach covering architecture design, security hardening, resource management, monitoring, and operational procedures. The configurations and practices outlined in this guide provide a solid foundation for enterprise-grade Kubernetes deployments.
Key takeaways for production Kubernetes:
- High Availability: Design for failure with multi-master setups and proper load balancing
- Security First: Implement defense in depth with RBAC, network policies, and pod security standards
- Resource Management: Use quotas, limits, and autoscaling to optimize resource utilization
- Monitoring: Implement comprehensive observability with metrics, logs, and distributed tracing
- Backup and Recovery: Regular backups and tested disaster recovery procedures are essential
- Operational Excellence: Automate maintenance tasks and establish clear operational procedures
Remember that Kubernetes is a complex system that requires ongoing attention and optimization. Start with these best practices and continuously refine your approach based on your specific requirements and operational experience.