Kubernetes Production Best Practices: Building Enterprise-Grade Container Orchestration

Running Kubernetes in production requires careful planning, robust security measures, and comprehensive operational practices. This guide covers the essential aspects of building and maintaining enterprise-grade Kubernetes clusters that can handle mission-critical workloads with high availability, security, and performance.

Production Cluster Architecture

Multi-Master High Availability Setup

# kubeadm-ha-config.yaml
apiVersion: kubeadm.k8s.io/v1beta3
kind: InitConfiguration
localAPIEndpoint:
  advertiseAddress: 10.0.1.10
  bindPort: 6443
nodeRegistration:
  criSocket: unix:///var/run/containerd/containerd.sock
  kubeletExtraArgs:
    cloud-provider: external
    container-runtime-endpoint: unix:///var/run/containerd/containerd.sock

---
apiVersion: kubeadm.k8s.io/v1beta3
kind: ClusterConfiguration
kubernetesVersion: v1.28.2
clusterName: production-cluster
controlPlaneEndpoint: k8s-api-lb.company.com:6443
apiServer:
  certSANs:
    - k8s-api-lb.company.com
    - k8s-master-1.company.com
    - k8s-master-2.company.com
    - k8s-master-3.company.com
    - 10.0.1.10
    - 10.0.1.11
    - 10.0.1.12
    - 127.0.0.1
  extraArgs:
    audit-log-maxage: "30"
    audit-log-maxbackup: "10"
    audit-log-maxsize: "100"
    audit-log-path: /var/log/kubernetes/audit.log
    audit-policy-file: /etc/kubernetes/audit-policy.yaml
    enable-admission-plugins: NodeRestriction,ResourceQuota,LimitRanger,ServiceAccount,DefaultStorageClass,DefaultTolerationSeconds,MutatingAdmissionWebhook,ValidatingAdmissionWebhook
    encryption-provider-config: /etc/kubernetes/encryption-config.yaml
    feature-gates: "RotateKubeletServerCertificate=true"
    service-account-lookup: "true"
    service-account-key-file: /etc/kubernetes/pki/sa.pub
    service-account-signing-key-file: /etc/kubernetes/pki/sa.key
  extraVolumes:
    - name: audit-policy
      hostPath: /etc/kubernetes/audit-policy.yaml
      mountPath: /etc/kubernetes/audit-policy.yaml
      readOnly: true
      pathType: File
    - name: audit-logs
      hostPath: /var/log/kubernetes
      mountPath: /var/log/kubernetes
      pathType: DirectoryOrCreate
    - name: encryption-config
      hostPath: /etc/kubernetes/encryption-config.yaml
      mountPath: /etc/kubernetes/encryption-config.yaml
      readOnly: true
      pathType: File

etcd:
  local:
    dataDir: /var/lib/etcd
    extraArgs:
      listen-metrics-urls: http://0.0.0.0:2381
      auto-compaction-mode: periodic
      auto-compaction-retention: "1"
      max-request-bytes: "33554432"
      quota-backend-bytes: "6442450944"
      heartbeat-interval: "250"
      election-timeout: "1250"
      snapshot-count: "10000"
    serverCertSANs:
      - k8s-master-1.company.com
      - k8s-master-2.company.com
      - k8s-master-3.company.com
      - 10.0.1.10
      - 10.0.1.11
      - 10.0.1.12
    peerCertSANs:
      - k8s-master-1.company.com
      - k8s-master-2.company.com
      - k8s-master-3.company.com
      - 10.0.1.10
      - 10.0.1.11
      - 10.0.1.12

networking:
  serviceSubnet: 10.96.0.0/12
  podSubnet: 10.244.0.0/16
  dnsDomain: cluster.local

controllerManager:
  extraArgs:
    bind-address: 0.0.0.0
    secure-port: "10257"
    cluster-signing-duration: "8760h"
    feature-gates: "RotateKubeletServerCertificate=true"
    terminated-pod-gc-threshold: "50"

scheduler:
  extraArgs:
    bind-address: 0.0.0.0
    secure-port: "10259"

---
apiVersion: kubelet.config.k8s.io/v1beta1
kind: KubeletConfiguration
cgroupDriver: systemd
containerRuntimeEndpoint: unix:///var/run/containerd/containerd.sock
resolvConf: /run/systemd/resolve/resolv.conf
runtimeRequestTimeout: "15m"
tlsCipherSuites:
  - TLS_ECDHE_ECDSA_WITH_AES_128_GCM_SHA256
  - TLS_ECDHE_RSA_WITH_AES_128_GCM_SHA256
  - TLS_ECDHE_ECDSA_WITH_CHACHA20_POLY1305
  - TLS_ECDHE_RSA_WITH_AES_256_GCM_SHA384
  - TLS_ECDHE_RSA_WITH_CHACHA20_POLY1305
  - TLS_ECDHE_ECDSA_WITH_AES_256_GCM_SHA384
protectKernelDefaults: true
makeIPTablesUtilChains: true
eventRecordQPS: 0
shutdownGracePeriod: 60s
shutdownGracePeriodCriticalPods: 20s
featureGates:
  RotateKubeletServerCertificate: true
serverTLSBootstrap: true
rotateCertificates: true
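
Before running kubeadm, the audit policy and encryption config referenced above must already exist under /etc/kubernetes on every control-plane node, and the load balancer must forward to the first node. The bootstrap flow then looks roughly like the sketch below; the token, CA hash, and certificate key are placeholders that kubeadm init prints for you.

# bootstrap-control-plane.sh (illustrative)
# On the first control-plane node:
sudo kubeadm init --config kubeadm-ha-config.yaml --upload-certs

# On the remaining control-plane nodes, using the values printed by kubeadm init:
sudo kubeadm join k8s-api-lb.company.com:6443 \
  --token <token> \
  --discovery-token-ca-cert-hash sha256:<hash> \
  --control-plane \
  --certificate-key <certificate-key>

# Worker nodes run the same join command without --control-plane and --certificate-key.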

Load Balancer Configuration (HAProxy)

# /etc/haproxy/haproxy.cfg
global
    log stdout local0
    chroot /var/lib/haproxy
    stats socket /run/haproxy/admin.sock mode 660 level admin
    stats timeout 30s
    user haproxy
    group haproxy
    daemon

defaults
    mode http
    log global
    option httplog
    option dontlognull
    option log-health-checks
    option forwardfor except 127.0.0.0/8
    option redispatch
    retries 3
    timeout http-request 10s
    timeout queue 20s
    timeout connect 10s
    timeout client 1m
    timeout server 1m
    timeout http-keep-alive 10s
    timeout check 10s

# Kubernetes API Server
frontend k8s-api-frontend
    bind *:6443
    mode tcp
    option tcplog
    default_backend k8s-api-backend

backend k8s-api-backend
    mode tcp
    option tcp-check
    balance roundrobin
    default-server inter 10s downinter 5s rise 2 fall 2 slowstart 60s maxconn 250 maxqueue 256 weight 100
    
    server k8s-master-1 10.0.1.10:6443 check
    server k8s-master-2 10.0.1.11:6443 check
    server k8s-master-3 10.0.1.12:6443 check

# HAProxy Stats
frontend stats
    bind *:8404
    stats enable
    stats uri /stats
    stats refresh 30s
    stats admin if TRUE
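
A couple of quick checks confirm that the load balancer is actually fronting a healthy control plane. This sketch assumes the hostname from the configuration above, that anonymous access to the API health endpoints is left at its Kubernetes default, and that socat is available on the load balancer host.

# Health of the API through the load-balanced endpoint
curl -k https://k8s-api-lb.company.com:6443/healthz
curl -k 'https://k8s-api-lb.company.com:6443/readyz?verbose'

# HAProxy's own view of the backend servers via the admin socket
echo "show stat" | socat stdio /run/haproxy/admin.sock | grep k8s-api-backend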

Security Hardening

Pod Security Standards

# pod-security-standards.yaml
apiVersion: v1
kind: Namespace
metadata:
  name: production
  labels:
    pod-security.kubernetes.io/enforce: restricted
    pod-security.kubernetes.io/audit: restricted
    pod-security.kubernetes.io/warn: restricted

---
apiVersion: v1
kind: Namespace
metadata:
  name: staging
  labels:
    pod-security.kubernetes.io/enforce: baseline
    pod-security.kubernetes.io/audit: restricted
    pod-security.kubernetes.io/warn: restricted

---
# Legacy PodSecurityPolicy: the PSP API was removed in Kubernetes v1.25, so keep this
# only for older clusters that still rely on it; newer clusters should use the
# Pod Security Standards labels above instead
apiVersion: policy/v1beta1
kind: PodSecurityPolicy
metadata:
  name: restricted-psp
spec:
  privileged: false
  allowPrivilegeEscalation: false
  requiredDropCapabilities:
    - ALL
  volumes:
    - 'configMap'
    - 'emptyDir'
    - 'projected'
    - 'secret'
    - 'downwardAPI'
    - 'persistentVolumeClaim'
    - 'csi'
  hostNetwork: false
  hostIPC: false
  hostPID: false
  runAsUser:
    rule: 'MustRunAsNonRoot'
  supplementalGroups:
    rule: 'MustRunAs'
    ranges:
      - min: 1
        max: 65535
  fsGroup:
    rule: 'MustRunAs'
    ranges:
      - min: 1
        max: 65535
  readOnlyRootFilesystem: true
  seLinux:
    rule: 'RunAsAny'

---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: restricted-psp-user
rules:
- apiGroups: ['policy']
  resources: ['podsecuritypolicies']
  verbs: ['use']
  resourceNames:
  - restricted-psp

---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
  name: restricted-psp-all-serviceaccounts
roleRef:
  kind: ClusterRole
  name: restricted-psp-user
  apiGroup: rbac.authorization.k8s.io
subjects:
- kind: Group
  name: system:serviceaccounts
  apiGroup: rbac.authorization.k8s.io
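
For reference, a pod spec that clears the restricted level enforced on the production namespace looks roughly like the following sketch; the image and names are illustrative, and the resource requests are included because the namespace also carries a ResourceQuota later in this guide.

# restricted-demo.sh (illustrative)
cat <<EOF | kubectl apply -f -
apiVersion: v1
kind: Pod
metadata:
  name: restricted-demo
  namespace: production
spec:
  securityContext:
    runAsNonRoot: true
    seccompProfile:
      type: RuntimeDefault
  containers:
  - name: app
    image: nginxinc/nginx-unprivileged:1.25   # any non-root image works here
    securityContext:
      allowPrivilegeEscalation: false
      capabilities:
        drop: ["ALL"]
    resources:
      requests:
        cpu: 100m
        memory: 128Mi
      limits:
        cpu: 500m
        memory: 512Mi
EOF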

Network Security Policies

# network-security-policies.yaml
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: default-deny-all
  namespace: production
spec:
  podSelector: {}
  policyTypes:
  - Ingress
  - Egress

---
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: allow-dns
  namespace: production
spec:
  podSelector: {}
  policyTypes:
  - Egress
  egress:
  - to:
    - namespaceSelector:
        matchLabels:
          name: kube-system
    - podSelector:
        matchLabels:
          k8s-app: kube-dns
    ports:
    - protocol: UDP
      port: 53
    - protocol: TCP
      port: 53

---
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: web-app-netpol
  namespace: production
spec:
  podSelector:
    matchLabels:
      app: web-app
  policyTypes:
  - Ingress
  - Egress
  ingress:
  - from:
    - namespaceSelector:
        matchLabels:
          name: ingress-nginx
    - podSelector:
        matchLabels:
          app.kubernetes.io/name: ingress-nginx
    ports:
    - protocol: TCP
      port: 8080
  egress:
  - to:
    - podSelector:
        matchLabels:
          app: database
    ports:
    - protocol: TCP
      port: 5432
  - to:
    - podSelector:
        matchLabels:
          app: redis
    ports:
    - protocol: TCP
      port: 6379
  - to: []
    ports:
    - protocol: UDP
      port: 53
    - protocol: TCP
      port: 53

---
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: database-netpol
  namespace: production
spec:
  podSelector:
    matchLabels:
      app: database
  policyTypes:
  - Ingress
  ingress:
  - from:
    - podSelector:
        matchLabels:
          app: web-app
    - podSelector:
        matchLabels:
          app: api-server
    ports:
    - protocol: TCP
      port: 5432
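
A simple way to spot-check these policies is to probe web-app from a pod that is not among the allowed ingress sources. The sketch below assumes a Service named web-app exposes the pods on port 8080 (the article does not define one) and that the default namespace has no stricter admission settings; the request should time out rather than succeed.

# netpol-probe.sh (illustrative)
kubectl -n default run netpol-probe --rm -it --restart=Never --image=busybox:1.36 -- \
  sh -c 'wget -qO- -T 3 http://web-app.production.svc:8080 || echo "connection blocked"'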

RBAC Configuration

# rbac-configuration.yaml
apiVersion: v1
kind: ServiceAccount
metadata:
  name: app-service-account
  namespace: production

---
apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  namespace: production
  name: app-role
rules:
- apiGroups: [""]
  resources: ["pods", "services", "endpoints"]
  verbs: ["get", "list", "watch"]
- apiGroups: [""]
  resources: ["configmaps", "secrets"]
  verbs: ["get", "list", "watch"]
- apiGroups: ["apps"]
  resources: ["deployments", "replicasets"]
  verbs: ["get", "list", "watch"]

---
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  name: app-role-binding
  namespace: production
subjects:
- kind: ServiceAccount
  name: app-service-account
  namespace: production
roleRef:
  kind: Role
  name: app-role
  apiGroup: rbac.authorization.k8s.io

---
# Cluster-level roles for monitoring
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: monitoring-reader
rules:
- apiGroups: [""]
  resources: ["nodes", "nodes/metrics", "services", "endpoints", "pods"]
  verbs: ["get", "list", "watch"]
- apiGroups: ["apps"]
  resources: ["deployments", "daemonsets", "replicasets", "statefulsets"]
  verbs: ["get", "list", "watch"]
- apiGroups: ["networking.k8s.io"]
  resources: ["ingresses"]
  verbs: ["get", "list", "watch"]
- nonResourceURLs: ["/metrics"]
  verbs: ["get"]

---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
  name: monitoring-reader-binding
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: ClusterRole
  name: monitoring-reader
subjects:
- kind: ServiceAccount
  name: prometheus
  namespace: monitoring
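
With the bindings in place, kubectl auth can-i makes it easy to confirm they behave as intended (run as a user with impersonation rights, such as cluster-admin):

# Expected: yes
kubectl auth can-i list pods -n production --as=system:serviceaccount:production:app-service-account
# Expected: no (the Role only grants read verbs)
kubectl auth can-i delete deployments -n production --as=system:serviceaccount:production:app-service-account
# Expected: yes (via the monitoring-reader ClusterRole)
kubectl auth can-i get nodes --as=system:serviceaccount:monitoring:prometheus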

Resource Management and Optimization

Resource Quotas and Limits

# resource-management.yaml
apiVersion: v1
kind: ResourceQuota
metadata:
  name: production-quota
  namespace: production
spec:
  hard:
    requests.cpu: "100"
    requests.memory: 200Gi
    limits.cpu: "200"
    limits.memory: 400Gi
    persistentvolumeclaims: "50"
    requests.storage: "1Ti"
    count/deployments.apps: "50"
    count/services: "25"
    count/secrets: "100"
    count/configmaps: "100"

---
apiVersion: v1
kind: LimitRange
metadata:
  name: production-limits
  namespace: production
spec:
  limits:
  - default:
      cpu: "500m"
      memory: "512Mi"
    defaultRequest:
      cpu: "100m"
      memory: "128Mi"
    max:
      cpu: "2"
      memory: "4Gi"
    min:
      cpu: "50m"
      memory: "64Mi"
    type: Container
  - default:
      storage: "10Gi"
    max:
      storage: "100Gi"
    min:
      storage: "1Gi"
    type: PersistentVolumeClaim

---
# Priority Classes
apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
  name: high-priority
value: 1000
globalDefault: false
description: "High priority class for critical applications"

---
apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
  name: medium-priority
value: 500
globalDefault: true
description: "Medium priority class for standard applications"

---
apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
  name: low-priority
value: 100
globalDefault: false
description: "Low priority class for batch jobs"

Horizontal Pod Autoscaler (HPA) Configuration

# hpa-configuration.yaml
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: web-app-hpa
  namespace: production
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: web-app
  minReplicas: 3
  maxReplicas: 50
  metrics:
  - type: Resource
    resource:
      name: cpu
      target:
        type: Utilization
        averageUtilization: 70
  - type: Resource
    resource:
      name: memory
      target:
        type: Utilization
        averageUtilization: 80
  - type: Pods
    pods:
      metric:
        name: http_requests_per_second
      target:
        type: AverageValue
        averageValue: "100"
  behavior:
    scaleDown:
      stabilizationWindowSeconds: 300
      policies:
      - type: Percent
        value: 10
        periodSeconds: 60
      - type: Pods
        value: 2
        periodSeconds: 60
      selectPolicy: Min
    scaleUp:
      stabilizationWindowSeconds: 60
      policies:
      - type: Percent
        value: 50
        periodSeconds: 60
      - type: Pods
        value: 5
        periodSeconds: 60
      selectPolicy: Max

---
# Vertical Pod Autoscaler (VPA)
apiVersion: autoscaling.k8s.io/v1
kind: VerticalPodAutoscaler
metadata:
  name: web-app-vpa
  namespace: production
spec:
  targetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: web-app
  updatePolicy:
    updateMode: "Auto"
  resourcePolicy:
    containerPolicies:
    - containerName: web-app
      maxAllowed:
        cpu: 2
        memory: 4Gi
      minAllowed:
        cpu: 100m
        memory: 128Mi
      controlledResources: ["cpu", "memory"]
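
Two practical notes on the autoscalers above: the CPU and memory targets require metrics-server, and the http_requests_per_second Pods metric requires a custom metrics adapter such as prometheus-adapter, which this guide does not install. Also, running the VPA in Auto mode on the same CPU and memory dimensions the HPA scales on can cause the two to work against each other; a common compromise is to run the VPA in recommendation-only ("Off") mode for such workloads. A few quick checks, assuming the objects above:

# Resource metrics available?
kubectl top pods -n production
# Custom metrics API registered (needed for the Pods metric)?
kubectl get --raw /apis/custom.metrics.k8s.io/v1beta1
# Current targets, replica counts, and scaling events
kubectl -n production describe hpa web-app-hpa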

Storage Management

Storage Classes and Persistent Volumes

# storage-configuration.yaml
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: fast-ssd
  annotations:
    storageclass.kubernetes.io/is-default-class: "true"
provisioner: ebs.csi.aws.com
parameters:
  type: gp3
  iops: "3000"
  throughput: "125"
  encrypted: "true"
  kmsKeyId: arn:aws:kms:us-west-2:123456789012:key/12345678-1234-1234-1234-123456789012
volumeBindingMode: WaitForFirstConsumer
allowVolumeExpansion: true
reclaimPolicy: Delete

---
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: standard-hdd
provisioner: ebs.csi.aws.com
parameters:
  type: gp2
  encrypted: "true"
volumeBindingMode: WaitForFirstConsumer
allowVolumeExpansion: true
reclaimPolicy: Delete

---
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: high-iops-ssd
provisioner: ebs.csi.aws.com
parameters:
  type: io2
  iops: "10000"
  encrypted: "true"
volumeBindingMode: WaitForFirstConsumer
allowVolumeExpansion: true
reclaimPolicy: Retain

---
# Volume Snapshot Class
apiVersion: snapshot.storage.k8s.io/v1
kind: VolumeSnapshotClass
metadata:
  name: ebs-snapshot-class
driver: ebs.csi.aws.com
deletionPolicy: Delete
parameters:
  tagSpecification_1: "Name=*"
  tagSpecification_2: "Environment=production"

---
# Persistent Volume Claim Template
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: database-pvc
  namespace: production
spec:
  accessModes:
    - ReadWriteOnce
  storageClassName: fast-ssd
  resources:
    requests:
      storage: 100Gi
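
Snapshots taken with the ebs-snapshot-class can be restored by creating a new PVC that points at the VolumeSnapshot as its data source. The snapshot name below is illustrative and follows the naming used by the backup script in the next section.

# restore-from-snapshot.sh (illustrative)
cat <<EOF | kubectl apply -f -
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: database-pvc-restored
  namespace: production
spec:
  accessModes:
    - ReadWriteOnce
  storageClassName: fast-ssd
  dataSource:
    name: database-pvc-snapshot-20240101-020000
    kind: VolumeSnapshot
    apiGroup: snapshot.storage.k8s.io
  resources:
    requests:
      storage: 100Gi
EOF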

Backup and Disaster Recovery

# backup-configuration.yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: backup-scripts
  namespace: production
data:
  backup.sh: |
    #!/bin/bash
    set -e
    
    NAMESPACE=${NAMESPACE:-production}
    BACKUP_BUCKET=${BACKUP_BUCKET:-k8s-backups}
    DATE=$(date +%Y%m%d-%H%M%S)
    
    # Backup etcd
    echo "Backing up etcd..."
    kubectl exec -n kube-system etcd-master-1 -- \
      etcdctl snapshot save /tmp/etcd-backup-${DATE}.db \
      --endpoints=https://127.0.0.1:2379 \
      --cacert=/etc/kubernetes/pki/etcd/ca.crt \
      --cert=/etc/kubernetes/pki/etcd/server.crt \
      --key=/etc/kubernetes/pki/etcd/server.key
    
    # Copy etcd backup to S3
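    # NOTE: kubectl cp needs tar inside the etcd image; if it is missing, run etcdctl
    # directly on the control-plane host instead of copying out of the pod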
    kubectl cp kube-system/etcd-master-1:/tmp/etcd-backup-${DATE}.db ./etcd-backup-${DATE}.db
    aws s3 cp ./etcd-backup-${DATE}.db s3://${BACKUP_BUCKET}/etcd/
    
    # Backup application data
    echo "Backing up application data..."
    kubectl get all,pvc,secrets,configmaps -n ${NAMESPACE} -o yaml > app-backup-${DATE}.yaml
    aws s3 cp app-backup-${DATE}.yaml s3://${BACKUP_BUCKET}/applications/
    
    # Backup persistent volumes
    echo "Creating volume snapshots..."
    kubectl get pvc -n ${NAMESPACE} -o json | jq -r '.items[].metadata.name' | while read pvc; do
      cat <<EOF | kubectl apply -f -
    apiVersion: snapshot.storage.k8s.io/v1
    kind: VolumeSnapshot
    metadata:
      name: ${pvc}-snapshot-${DATE}
      namespace: ${NAMESPACE}
    spec:
      volumeSnapshotClassName: ebs-snapshot-class
      source:
        persistentVolumeClaimName: ${pvc}
    EOF
    done
    
    echo "Backup completed successfully"

  restore.sh: |
    #!/bin/bash
    set -e
    
    BACKUP_DATE=${1:-latest}
    NAMESPACE=${NAMESPACE:-production}
    BACKUP_BUCKET=${BACKUP_BUCKET:-k8s-backups}
    
    if [ "$BACKUP_DATE" = "latest" ]; then
      BACKUP_DATE=$(aws s3 ls s3://${BACKUP_BUCKET}/etcd/ | sort | tail -n 1 | awk '{print $4}' | sed 's/etcd-backup-\(.*\)\.db/\1/')
    fi
    
    echo "Restoring from backup: $BACKUP_DATE"
    
    # Restore etcd (requires cluster shutdown)
    echo "Restoring etcd..."
    aws s3 cp s3://${BACKUP_BUCKET}/etcd/etcd-backup-${BACKUP_DATE}.db ./etcd-backup.db
    
    # Stop the kubelet so the static etcd pod is not restarted, then stop etcd via the CRI
    sudo systemctl stop kubelet
    sudo crictl stop $(sudo crictl ps -q --name etcd)
    
    # Restore etcd data
    sudo etcdctl snapshot restore ./etcd-backup.db \
      --data-dir=/var/lib/etcd-restore \
      --name=master-1 \
      --initial-cluster=master-1=https://10.0.1.10:2380,master-2=https://10.0.1.11:2380,master-3=https://10.0.1.12:2380 \
      --initial-advertise-peer-urls=https://10.0.1.10:2380
    
    # Replace etcd data directory
    sudo rm -rf /var/lib/etcd
    sudo mv /var/lib/etcd-restore /var/lib/etcd
    sudo chown -R etcd:etcd /var/lib/etcd
    
    # Start kubelet
    sudo systemctl start kubelet
    
    # Restore application resources
    echo "Restoring application resources..."
    aws s3 cp s3://${BACKUP_BUCKET}/applications/app-backup-${BACKUP_DATE}.yaml ./app-backup.yaml
    kubectl apply -f app-backup.yaml
    
    echo "Restore completed successfully"

---
apiVersion: batch/v1
kind: CronJob
metadata:
  name: cluster-backup
  namespace: production
spec:
  schedule: "0 2 * * *"  # Daily at 2 AM
  jobTemplate:
    spec:
      template:
        spec:
          serviceAccountName: backup-service-account
          containers:
          - name: backup
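            # NOTE: the backup script also needs kubectl and jq; in practice use a
            # custom image that bundles them alongside the AWS CLI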
            image: amazon/aws-cli:latest
            command: ["/bin/bash"]
            args: ["/scripts/backup.sh"]
            env:
            - name: NAMESPACE
              value: "production"
            - name: BACKUP_BUCKET
              value: "k8s-backups"
            volumeMounts:
            - name: backup-scripts
              mountPath: /scripts
            - name: kubectl-config
              mountPath: /root/.kube
          volumes:
          - name: backup-scripts
            configMap:
              name: backup-scripts
              defaultMode: 0755
          - name: kubectl-config
            secret:
              secretName: kubectl-config
          restartPolicy: OnFailure
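
The CronJob references a backup-service-account that this guide does not define. A minimal sketch of what it would need is shown below: read access for the resource dump, exec into the etcd pod, and permission to create VolumeSnapshots. Treat the exact rules as an assumption to be tightened for your environment.

# backup-rbac.sh (illustrative)
cat <<EOF | kubectl apply -f -
apiVersion: v1
kind: ServiceAccount
metadata:
  name: backup-service-account
  namespace: production
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: cluster-backup
rules:
- apiGroups: [""]
  resources: ["pods", "services", "endpoints", "persistentvolumeclaims", "secrets", "configmaps"]
  verbs: ["get", "list"]
- apiGroups: [""]
  resources: ["pods/exec"]
  verbs: ["create"]
- apiGroups: ["apps", "batch"]
  resources: ["deployments", "replicasets", "statefulsets", "daemonsets", "jobs", "cronjobs"]
  verbs: ["get", "list"]
- apiGroups: ["snapshot.storage.k8s.io"]
  resources: ["volumesnapshots"]
  verbs: ["get", "list", "create"]
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
  name: cluster-backup-binding
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: ClusterRole
  name: cluster-backup
subjects:
- kind: ServiceAccount
  name: backup-service-account
  namespace: production
EOF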

Monitoring and Observability

Prometheus Configuration

# prometheus-config.yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: prometheus-config
  namespace: monitoring
data:
  prometheus.yml: |
    global:
      scrape_interval: 15s
      evaluation_interval: 15s
      external_labels:
        cluster: 'production'
        region: 'us-west-2'

    rule_files:
      - "/etc/prometheus/rules/*.yml"

    alerting:
      alertmanagers:
        - static_configs:
            - targets:
              - alertmanager:9093

    scrape_configs:
      # Kubernetes API Server
      - job_name: 'kubernetes-apiservers'
        kubernetes_sd_configs:
        - role: endpoints
        scheme: https
        tls_config:
          ca_file: /var/run/secrets/kubernetes.io/serviceaccount/ca.crt
        bearer_token_file: /var/run/secrets/kubernetes.io/serviceaccount/token
        relabel_configs:
        - source_labels: [__meta_kubernetes_namespace, __meta_kubernetes_service_name, __meta_kubernetes_endpoint_port_name]
          action: keep
          regex: default;kubernetes;https

      # Kubernetes Nodes
      - job_name: 'kubernetes-nodes'
        kubernetes_sd_configs:
        - role: node
        scheme: https
        tls_config:
          ca_file: /var/run/secrets/kubernetes.io/serviceaccount/ca.crt
        bearer_token_file: /var/run/secrets/kubernetes.io/serviceaccount/token
        relabel_configs:
        - action: labelmap
          regex: __meta_kubernetes_node_label_(.+)
        - target_label: __address__
          replacement: kubernetes.default.svc:443
        - source_labels: [__meta_kubernetes_node_name]
          regex: (.+)
          target_label: __metrics_path__
          replacement: /api/v1/nodes/${1}/proxy/metrics

      # Kubernetes Pods
      - job_name: 'kubernetes-pods'
        kubernetes_sd_configs:
        - role: pod
        relabel_configs:
        - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_scrape]
          action: keep
          regex: true
        - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_path]
          action: replace
          target_label: __metrics_path__
          regex: (.+)
        - source_labels: [__address__, __meta_kubernetes_pod_annotation_prometheus_io_port]
          action: replace
          regex: ([^:]+)(?::\d+)?;(\d+)
          replacement: $1:$2
          target_label: __address__
        - action: labelmap
          regex: __meta_kubernetes_pod_label_(.+)
        - source_labels: [__meta_kubernetes_namespace]
          action: replace
          target_label: kubernetes_namespace
        - source_labels: [__meta_kubernetes_pod_name]
          action: replace
          target_label: kubernetes_pod_name

      # cAdvisor
      - job_name: 'kubernetes-cadvisor'
        kubernetes_sd_configs:
        - role: node
        scheme: https
        tls_config:
          ca_file: /var/run/secrets/kubernetes.io/serviceaccount/ca.crt
        bearer_token_file: /var/run/secrets/kubernetes.io/serviceaccount/token
        relabel_configs:
        - action: labelmap
          regex: __meta_kubernetes_node_label_(.+)
        - target_label: __address__
          replacement: kubernetes.default.svc:443
        - source_labels: [__meta_kubernetes_node_name]
          regex: (.+)
          target_label: __metrics_path__
          replacement: /api/v1/nodes/${1}/proxy/metrics/cadvisor

  alerts.yml: |
    groups:
    - name: kubernetes-alerts
      rules:
      - alert: KubernetesNodeNotReady
        expr: kube_node_status_condition{condition="Ready",status="true"} == 0
        for: 10m
        labels:
          severity: critical
        annotations:
          summary: Kubernetes node not ready (instance {{ $labels.instance }})
          description: "Node {{ $labels.node }} has been unready for more than 10 minutes"

      - alert: KubernetesMemoryPressure
        expr: kube_node_status_condition{condition="MemoryPressure",status="true"} == 1
        for: 2m
        labels:
          severity: critical
        annotations:
          summary: Kubernetes memory pressure (instance {{ $labels.instance }})
          description: "Node {{ $labels.node }} has MemoryPressure condition"

      - alert: KubernetesDiskPressure
        expr: kube_node_status_condition{condition="DiskPressure",status="true"} == 1
        for: 2m
        labels:
          severity: critical
        annotations:
          summary: Kubernetes disk pressure (instance {{ $labels.instance }})
          description: "Node {{ $labels.node }} has DiskPressure condition"

      - alert: KubernetesOutOfDisk
        expr: kube_node_status_condition{condition="OutOfDisk",status="true"} == 1
        for: 2m
        labels:
          severity: critical
        annotations:
          summary: Kubernetes out of disk (instance {{ $labels.instance }})
          description: "Node {{ $labels.node }} has OutOfDisk condition"

      - alert: KubernetesOutOfCapacity
        expr: sum by (node) ((kube_pod_status_phase{phase="Running"} == 1) + on(uid) group_left(node) (0 * kube_pod_info{pod_template_hash=""})) / sum by (node) (kube_node_status_allocatable{resource="pods"}) * 100 > 90
        for: 2m
        labels:
          severity: warning
        annotations:
          summary: Kubernetes out of capacity (instance {{ $labels.instance }})
          description: "Node {{ $labels.node }} is out of capacity"

      - alert: KubernetesContainerOomKiller
        expr: (kube_pod_container_status_restarts_total - kube_pod_container_status_restarts_total offset 10m >= 1) and ignoring (reason) min_over_time(kube_pod_container_status_last_terminated_reason{reason="OOMKilled"}[10m]) == 1
        for: 0m
        labels:
          severity: warning
        annotations:
          summary: Kubernetes container oom killer (instance {{ $labels.instance }})
          description: "Container {{ $labels.container }} in pod {{ $labels.namespace }}/{{ $labels.pod }} has been OOMKilled"

      - alert: KubernetesPodCrashLooping
        expr: increase(kube_pod_container_status_restarts_total[15m]) > 3
        for: 2m
        labels:
          severity: warning
        annotations:
          summary: Kubernetes pod crash looping (instance {{ $labels.instance }})
          description: "Pod {{ $labels.namespace }}/{{ $labels.pod }} is crash looping"

      - alert: KubernetesReplicasetMismatch
        expr: kube_replicaset_spec_replicas != kube_replicaset_status_ready_replicas
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: Kubernetes ReplicaSet mismatch (instance {{ $labels.instance }})
          description: "ReplicaSet {{ $labels.namespace }}/{{ $labels.replicaset }} has not matched the expected number of ready replicas for longer than 10 minutes."

      - alert: KubernetesDeploymentReplicasMismatch
        expr: kube_deployment_spec_replicas != kube_deployment_status_available_replicas
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: Kubernetes Deployment replicas mismatch (instance {{ $labels.instance }})
          description: "Deployment {{ $labels.namespace }}/{{ $labels.deployment }} has not matched the expected number of replicas for longer than 10 minutes."

      - alert: KubernetesStatefulsetReplicasMismatch
        expr: kube_statefulset_status_replicas_ready != kube_statefulset_status_replicas
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: Kubernetes StatefulSet replicas mismatch (instance {{ $labels.instance }})
          description: "A StatefulSet does not match the expected number of replicas."

      - alert: KubernetesHpaScalingAbility
        expr: kube_horizontalpodautoscaler_status_condition{status="false", condition="AbleToScale"} == 1
        for: 2m
        labels:
          severity: warning
        annotations:
          summary: Kubernetes HPA scaling ability (instance {{ $labels.instance }})
          description: "Pod is unable to scale"

      - alert: KubernetesHpaMetricAvailability
        expr: kube_horizontalpodautoscaler_status_condition{status="false", condition="ScalingActive"} == 1
        for: 2m
        labels:
          severity: warning
        annotations:
          summary: Kubernetes HPA metric availability (instance {{ $labels.instance }})
          description: "HPA is not able to collect metrics"

      - alert: KubernetesHpaScaleCapability
        expr: kube_horizontalpodautoscaler_status_desired_replicas >= kube_horizontalpodautoscaler_spec_max_replicas
        for: 2m
        labels:
          severity: info
        annotations:
          summary: Kubernetes HPA scale capability (instance {{ $labels.instance }})
          description: "The maximum number of desired Pods has been hit"

      - alert: KubernetesPodNotHealthy
        expr: min_over_time(sum by (namespace, pod) (kube_pod_status_phase{phase=~"Pending|Unknown|Failed"})[15m:1m]) > 0
        for: 0m
        labels:
          severity: critical
        annotations:
          summary: Kubernetes Pod not healthy (instance {{ $labels.instance }})
          description: "Pod has been in a non-ready state for longer than 15 minutes."

      - alert: KubernetesVolumeOutOfDiskSpace
        expr: kubelet_volume_stats_available_bytes / kubelet_volume_stats_capacity_bytes * 100 < 10
        for: 2m
        labels:
          severity: warning
        annotations:
          summary: Kubernetes Volume out of disk space (instance {{ $labels.instance }})
          description: "Volume is almost full (< 10% left)"

      - alert: KubernetesVolumeFullInFourDays
        expr: predict_linear(kubelet_volume_stats_available_bytes[6h], 4 * 24 * 3600) < 0
        for: 0m
        labels:
          severity: critical
        annotations:
          summary: Kubernetes Volume full in four days (instance {{ $labels.instance }})
          description: "{{ $labels.namespace }}/{{ $labels.persistentvolumeclaim }} is expected to fill up within four days. Currently {{ $value | humanize }}% is available."

      - alert: KubernetesPersistentVolumeError
        expr: kube_persistentvolume_status_phase{phase=~"Failed|Pending", job="kube-state-metrics"} > 0
        for: 0m
        labels:
          severity: critical
        annotations:
          summary: Kubernetes PersistentVolume error (instance {{ $labels.instance }})
          description: "Persistent volume is in bad state"

      - alert: KubernetesStatefulsetDown
        expr: (kube_statefulset_status_replicas_ready / kube_statefulset_status_replicas_current) != 1
        for: 1m
        labels:
          severity: critical
        annotations:
          summary: Kubernetes StatefulSet down (instance {{ $labels.instance }})
          description: "A StatefulSet went down"

      - alert: KubernetesHpaReplicasMismatch
        expr: (kube_horizontalpodautoscaler_status_desired_replicas != kube_horizontalpodautoscaler_status_current_replicas) and (kube_horizontalpodautoscaler_status_current_replicas > kube_horizontalpodautoscaler_spec_min_replicas)
        for: 15m
        labels:
          severity: warning
        annotations:
          summary: Kubernetes HPA replicas mismatch (instance {{ $labels.instance }})
          description: "HPA {{ $labels.namespace }}/{{ $labels.horizontalpodautoscaler }} has not matched the desired number of replicas for longer than 15 minutes."

Operational Best Practices

Cluster Maintenance Procedures

#!/bin/bash
# cluster-maintenance.sh

set -e

CLUSTER_NAME="production-cluster"
BACKUP_BUCKET="k8s-backups"
DATE=$(date +%Y%m%d-%H%M%S)

# Pre-maintenance checks
echo "=== Pre-maintenance checks ==="

# Check cluster health
kubectl get nodes
kubectl get pods --all-namespaces | grep -v Running | grep -v Completed || true

# Check resource usage
kubectl top nodes
kubectl top pods --all-namespaces --sort-by=cpu

# Create backup
echo "=== Creating backup ==="
./backup.sh

# Drain nodes for maintenance (one by one)
drain_node() {
    local node=$1
    echo "Draining node: $node"
    
    # Cordon the node
    kubectl cordon $node
    
    # Drain the node
    kubectl drain $node --ignore-daemonsets --delete-emptydir-data --force --grace-period=300
    
    echo "Node $node drained successfully"
}

# Uncordon node after maintenance
uncordon_node() {
    local node=$1
    echo "Uncordoning node: $node"
    kubectl uncordon $node
    
    # Wait for node to be ready
    kubectl wait --for=condition=Ready node/$node --timeout=300s
    
    echo "Node $node is ready"
}

# Rolling update procedure
rolling_update() {
    local deployment=$1
    local namespace=${2:-default}
    local image=$3
    
    echo "Updating deployment $deployment in namespace $namespace"
    
    # Update the deployment
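    # (assumes the container in the pod template is literally named "container";
    #  adjust the name to match your deployment)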
    kubectl set image deployment/$deployment container=$image -n $namespace
    
    # Wait for rollout to complete
    kubectl rollout status deployment/$deployment -n $namespace --timeout=600s
    
    # Verify the update
    kubectl get pods -n $namespace -l app=$deployment
    
    echo "Deployment $deployment updated successfully"
}

# Certificate rotation
rotate_certificates() {
    echo "=== Rotating certificates ==="
    
    # Check certificate expiration
    kubeadm certs check-expiration
    
    # Renew certificates
    kubeadm certs renew all
    
    # Restart control plane components
    kubectl -n kube-system delete pod -l component=kube-apiserver
    kubectl -n kube-system delete pod -l component=kube-controller-manager
    kubectl -n kube-system delete pod -l component=kube-scheduler
    
    # Update kubeconfig
    sudo cp /etc/kubernetes/admin.conf ~/.kube/config
    sudo chown $(id -u):$(id -g) ~/.kube/config
    
    echo "Certificates rotated successfully"
}

# Cleanup old resources
cleanup_resources() {
    echo "=== Cleaning up old resources ==="
    
    # Remove completed jobs older than 7 days
    kubectl get jobs --all-namespaces -o json | jq -r '.items[] | select(.status.conditions[]?.type == "Complete") | select(.metadata.creationTimestamp | fromdateiso8601 < (now - 604800)) | "\(.metadata.namespace) \(.metadata.name)"' | while read namespace job; do
        kubectl delete job $job -n $namespace
    done
    
    # Remove old replica sets
    kubectl get rs --all-namespaces -o json | jq -r '.items[] | select(.spec.replicas == 0) | select(.metadata.creationTimestamp | fromdateiso8601 < (now - 604800)) | "\(.metadata.namespace) \(.metadata.name)"' | while read namespace rs; do
        kubectl delete rs $rs -n $namespace
    done
    
    # Remove old pods in Succeeded state
    kubectl get pods --all-namespaces --field-selector=status.phase=Succeeded -o json | jq -r '.items[] | select(.metadata.creationTimestamp | fromdateiso8601 < (now - 86400)) | "\(.metadata.namespace) \(.metadata.name)"' | while read namespace pod; do
        kubectl delete pod $pod -n $namespace
    done
    
    echo "Cleanup completed"
}

# Performance optimization
optimize_performance() {
    echo "=== Performance optimization ==="
    
    # Compact etcd
    kubectl -n kube-system exec etcd-master-1 -- etcdctl \
        --endpoints=https://127.0.0.1:2379 \
        --cacert=/etc/kubernetes/pki/etcd/ca.crt \
        --cert=/etc/kubernetes/pki/etcd/server.crt \
        --key=/etc/kubernetes/pki/etcd/server.key \
        compact $(kubectl -n kube-system exec etcd-master-1 -- etcdctl \
            --endpoints=https://127.0.0.1:2379 \
            --cacert=/etc/kubernetes/pki/etcd/ca.crt \
            --cert=/etc/kubernetes/pki/etcd/server.crt \
            --key=/etc/kubernetes/pki/etcd/server.key \
            endpoint status --write-out="json" | jq -r '.[0].Status.header.revision')
    
    # Defragment etcd
    kubectl -n kube-system exec etcd-master-1 -- etcdctl \
        --endpoints=https://127.0.0.1:2379 \
        --cacert=/etc/kubernetes/pki/etcd/ca.crt \
        --cert=/etc/kubernetes/pki/etcd/server.crt \
        --key=/etc/kubernetes/pki/etcd/server.key \
        defrag
    
    echo "Performance optimization completed"
}

# Main maintenance function
main() {
    case "$1" in
        "drain")
            drain_node $2
            ;;
        "uncordon")
            uncordon_node $2
            ;;
        "update")
            rolling_update $2 $3 $4
            ;;
        "certs")
            rotate_certificates
            ;;
        "cleanup")
            cleanup_resources
            ;;
        "optimize")
            optimize_performance
            ;;
        "full")
            echo "Starting full maintenance procedure..."
            cleanup_resources
            optimize_performance
            rotate_certificates
            echo "Full maintenance completed"
            ;;
        *)
            echo "Usage: $0 {drain|uncordon|update|certs|cleanup|optimize|full} [args...]"
            echo "  drain <node>                    - Drain a node for maintenance"
            echo "  uncordon <node>                 - Uncordon a node after maintenance"
            echo "  update <deployment> <ns> <img> - Rolling update deployment"
            echo "  certs                           - Rotate certificates"
            echo "  cleanup                         - Cleanup old resources"
            echo "  optimize                        - Optimize cluster performance"
            echo "  full                            - Run full maintenance procedure"
            exit 1
            ;;
    esac
}

main "$@"

Conclusion

Running Kubernetes in production requires a comprehensive approach covering architecture design, security hardening, resource management, monitoring, and operational procedures. The configurations and practices outlined in this guide provide a solid foundation for enterprise-grade Kubernetes deployments.

Key takeaways for production Kubernetes:

  1. High Availability: Design for failure with multi-master setups and proper load balancing
  2. Security First: Implement defense in depth with RBAC, network policies, and pod security standards
  3. Resource Management: Use quotas, limits, and autoscaling to optimize resource utilization
  4. Monitoring: Implement comprehensive observability with metrics, logs, and distributed tracing
  5. Backup and Recovery: Regular backups and tested disaster recovery procedures are essential
  6. Operational Excellence: Automate maintenance tasks and establish clear operational procedures

Remember that Kubernetes is a complex system that requires ongoing attention and optimization. Start with these best practices and continuously refine your approach based on your specific requirements and operational experience.
