Management & Operations¶
Day-to-day management and operational procedures for the Nexus Kubernetes Operator.
Operator Management¶
Check Status¶
Run ./operator-nexus-dev.sh status to check the operator. Expected healthy output:
✓ Operator is installed
Status: installed
Version: latest
Namespace: nexus-system
Ready replicas: 1/1
✓ Operator is healthy and running
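For scripted checks (for example in CI), the ready-replica line in the output above can be tested directly. This is a minimal sketch: `operator_is_healthy` is a hypothetical helper, and it assumes the status output keeps the `Ready replicas: N/M` format shown above.

```shell
# Hypothetical helper: reads status text on stdin and succeeds only when
# the "Ready replicas: N/M" line reports N == M and N > 0.
operator_is_healthy() {
  ready=$(sed -n 's/.*Ready replicas: *\([0-9][0-9]*\)\/\([0-9][0-9]*\).*/\1 \2/p' | head -n 1)
  [ -n "$ready" ] || return 1          # line not found: treat as unhealthy
  set -- $ready                        # $1 = ready count, $2 = desired count
  [ "$1" -gt 0 ] && [ "$1" -eq "$2" ]
}
```

Usage: `./operator-nexus-dev.sh status | operator_is_healthy && echo "operator healthy"`.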
View Logs¶
# Stream logs
kubectl -n nexus-system logs -f deployment/nexus-operator
# Last 100 lines
kubectl -n nexus-system logs deployment/nexus-operator --tail=100
# Logs with timestamps
kubectl -n nexus-system logs deployment/nexus-operator --timestamps
Restart Operator¶
kubectl -n nexus-system rollout restart deployment/nexus-operator
Scale Operator¶
# The operator should run as a single replica; scaling above 1 is not recommended
kubectl -n nexus-system scale deployment/nexus-operator --replicas=1
Capability Management¶
List All Capabilities¶
kubectl get tc -A
Get Capability Details¶
kubectl describe tc <name>
Watch Capability Status¶
kubectl get tc -A -w
Update Capability¶
Edit the CR and apply:
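A minimal sketch of the apply step, assuming the edited CR is saved in a local manifest file. `apply_capability` is a hypothetical helper, and the manifest path is whatever file you edited; the preview with `kubectl diff` is an optional extra, not something the operator requires.

```shell
# Hypothetical helper: show what would change, then apply the edited CR.
# $1 = path to the capability manifest.
apply_capability() {
  manifest=$1
  # kubectl diff exits non-zero when differences exist, so its status is ignored here.
  kubectl diff -f "$manifest" || true
  kubectl apply -f "$manifest"
}
```

Usage: `apply_capability <path-to-capability>.yaml`.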
Delete Capability¶
kubectl delete tc <name>
Deletion Policy
The deletionPolicy in the CR spec determines whether AWS resources are deleted or retained.
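Because the policy decides whether AWS resources survive, it can be worth reading it back before deleting. The sketch below assumes the field sits at `.spec.deletionPolicy`, which is how the note above describes it; `show_deletion_policy` is a hypothetical helper name.

```shell
# Hypothetical helper: print a capability's deletion policy.
# $1 = capability name, $2 = namespace.
# Assumes the field is .spec.deletionPolicy, as the note above describes.
show_deletion_policy() {
  kubectl get tc "$1" -n "$2" -o jsonpath='{.spec.deletionPolicy}'
}
```

Usage: `show_deletion_policy <name> <namespace>` should print either `Delete` or `Retain`.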
Monitoring¶
Health Endpoints¶
The operator exposes health endpoints on port 8080:
| Endpoint | Purpose |
|---|---|
| /healthz | Liveness probe |
| /readyz | Readiness probe |
| /metrics | Prometheus metrics |
Check Health Manually¶
# From within cluster
kubectl -n nexus-system exec -it deploy/nexus-operator -- curl localhost:8080/healthz
kubectl -n nexus-system exec -it deploy/nexus-operator -- curl localhost:8080/readyz
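The two probes above can be wrapped in one check that reports each endpoint and returns a non-zero status if either fails. This is a sketch only: `check_operator_health` is a hypothetical helper, and like the commands above it assumes `curl` is available in the operator image.

```shell
# Hypothetical helper: probe both health endpoints and report each result.
check_operator_health() {
  health_rc=0
  for endpoint in healthz readyz; do
    # curl -f turns HTTP error responses into a non-zero exit status.
    if kubectl -n nexus-system exec deploy/nexus-operator -- \
         curl -fsS "localhost:8080/$endpoint" >/dev/null; then
      echo "$endpoint: ok"
    else
      echo "$endpoint: FAILED"
      health_rc=1
    fi
  done
  return "$health_rc"
}
```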
Prometheus Metrics¶
Available metrics:
# HELP nexus_operator_reconciliations_total Total number of reconciliations
# TYPE nexus_operator_reconciliations_total counter
nexus_operator_reconciliations_total 0
# HELP nexus_operator_ready Operator readiness status
# TYPE nexus_operator_ready gauge
nexus_operator_ready 1
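To pull a single value out of that output in a script, a small filter is enough. The sketch below is a hypothetical helper (`metric_value`); it only handles metrics without labels, which matches the two metrics listed above.

```shell
# Hypothetical helper: print the value of one unlabelled metric.
# Reads Prometheus text format on stdin; $1 is the metric name.
metric_value() {
  # Comment lines start with "#", so their first field never equals the name.
  awk -v name="$1" '$1 == name { print $2; exit }'
}
```

Usage, reusing the exec-and-curl pattern from the health checks: `kubectl -n nexus-system exec deploy/nexus-operator -- curl -s localhost:8080/metrics | metric_value nexus_operator_ready`.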
Application Updates (Quick Deploy)¶
For code-only changes (no operator or infrastructure changes), use deploy-app-changes.sh:
This script:
- Builds ai-job-engine (backend) and nexus-ui (frontend) Docker images
- Pushes both to ECR
- Deletes existing pods to trigger a rollout with the new images
- Waits for all pods to be Ready
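The final step in the list, waiting for readiness, can be approximated by hand with `kubectl wait`. This is a sketch and not the script's actual implementation: `wait_for_pods` is a hypothetical helper, and the namespace argument and the 300s default timeout are assumptions.

```shell
# Hypothetical helper: block until every pod in a namespace is Ready.
# $1 = namespace, $2 = optional timeout (defaults to 300s, an arbitrary choice).
wait_for_pods() {
  kubectl wait --for=condition=Ready pods --all -n "$1" --timeout="${2:-300s}"
}
```

Note that pods belonging to completed Jobs never report Ready, so a namespace that contains them would need a label selector instead of `--all`.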
Application Lifecycle¶
Deploy¶
Update¶
Edit the YAML (e.g., change image tag or replica count) and reapply:
Scale¶
Delete¶
- deletionPolicy: Delete - Operator removes all AWS resources (DynamoDB, S3, Glue, SSM, Secrets, IAM)
- deletionPolicy: Retain - Operator keeps AWS resources, only removes K8s workloads
Real-Time Monitoring¶
# Watch deployment step-by-step
./operator-nexus-dev.sh monitor
# Watch a specific capability
./operator-nexus-dev.sh monitor nexus-ai-prod/nexus-ai-prod
The monitor shows a live-updating table with each step's status, duration, and resources created.
Post-Deployment Verification¶
Post-deployment verification runs four sections:
- AWS Resources - DynamoDB tables (ACTIVE), S3 buckets (accessible), SSM parameters, Secrets, Glue database, IAM role (IRSA check)
- Kubernetes Resources - Pods (Running/Ready), Services, Ingress (ALB address), ServiceAccount (IRSA annotation)
- Application Health - Frontend via ALB (HTTP 200), Backend via ALB, Internal health check from pod
- Connectivity - IRSA token mounted, AWS identity from pod
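The connectivity section can be spot-checked by hand. On EKS, IRSA injects the `AWS_ROLE_ARN` and `AWS_WEB_IDENTITY_TOKEN_FILE` environment variables into pods that use an annotated ServiceAccount, so their presence is a quick signal. The sketch below is a hypothetical helper (`has_irsa_env`), not part of the verification tooling.

```shell
# Hypothetical helper: reads `env` output on stdin and succeeds only when both
# variables that EKS injects for IRSA are present.
has_irsa_env() {
  env_dump=$(cat)
  printf '%s\n' "$env_dump" | grep -q '^AWS_ROLE_ARN=' &&
    printf '%s\n' "$env_dump" | grep -q '^AWS_WEB_IDENTITY_TOKEN_FILE='
}
```

Usage, assuming the container image ships an `env` binary: `kubectl exec -n <capability-namespace> <pod> -- env | has_irsa_env && echo "IRSA variables present"`.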
Common Operations¶
Verify CRD¶
kubectl get crd nexuscapabilities.nexus.ai
Check RBAC¶
kubectl get clusterrole nexus-operator -o yaml
kubectl get clusterrolebinding nexus-operator -o yaml
kubectl get serviceaccount nexus-operator -n nexus-system -o yaml
View Events¶
# Operator events
kubectl -n nexus-system get events --sort-by='.lastTimestamp' | tail -20
# Capability events
kubectl get events -n <capability-namespace> --sort-by='.lastTimestamp'
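When a namespace is noisy, narrowing the list to Warning events is often the faster route to a failure. A minimal sketch, with `warning_events` as a hypothetical helper name:

```shell
# Hypothetical helper: list only Warning events in a namespace, sorted by last timestamp.
# $1 = namespace.
warning_events() {
  kubectl get events -n "$1" --field-selector type=Warning --sort-by='.lastTimestamp'
}
```

Usage: `warning_events nexus-system` or `warning_events <capability-namespace>`.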
Check Resource Status¶
Kubernetes Resources¶
AWS Resources¶
# DynamoDB tables
aws dynamodb list-tables | grep <capability>
# S3 buckets
aws s3 ls | grep <capability>
# IAM roles
aws iam list-roles | grep <capability>
# SSM parameters
aws ssm get-parameters-by-path --path /<capability>/<env>/ --recursive
Backup & Recovery¶
Export Capability CR¶
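A minimal sketch of exporting one CR, assuming a plain YAML dump is an acceptable backup format. `export_capability` and the `<name>.yaml` output convention are hypothetical.

```shell
# Hypothetical helper: save one capability CR to <name>.yaml in the current directory.
# $1 = capability name, $2 = namespace.
export_capability() {
  kubectl get tc "$1" -n "$2" -o yaml > "$1.yaml"
}
```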
Export All CRs¶
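The same approach extends to every namespace with `-A`. This is a sketch: `export_all_capabilities` is a hypothetical helper, and the date-stamped default file name is an arbitrary convention.

```shell
# Hypothetical helper: save every capability CR, across all namespaces, to one file.
# $1 = optional output path; prints the path that was written.
export_all_capabilities() {
  out_file=${1:-capabilities-$(date +%Y%m%d).yaml}
  kubectl get tc -A -o yaml > "$out_file"
  echo "$out_file"
}
```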
Restore Capability¶
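Restoring is, in the simplest case, a matter of applying a previously exported file. The sketch below uses a hypothetical helper, `restore_capability`. One caveat: a raw `-o yaml` export includes server-populated fields such as `metadata.resourceVersion` and `metadata.uid`, and the API server may reject them when the object has to be created again, so they may need to be removed from the file first.

```shell
# Hypothetical helper: re-create capabilities from a previously exported file.
# $1 = path to the exported YAML.
restore_capability() {
  [ -f "$1" ] || { echo "backup file not found: $1" >&2; return 1; }
  kubectl apply -f "$1"
}
```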
Maintenance Tasks¶
Update Operator Image¶
# Build new image
./build-and-push.sh
# Update deployment
kubectl -n nexus-system set image deployment/nexus-operator \
nexus-operator=764119721991.dkr.ecr.ap-southeast-1.amazonaws.com/nexus-operator:latest
# Or use management script
./operator-nexus-dev.sh update
Clean Up Failed Deployments¶
# List failed capabilities
kubectl get tc -A | grep Failed
# Force delete stuck capability
kubectl delete nexuscapability <name> -n <namespace> --force --grace-period=0
# Remove finalizer if stuck (this skips the operator's cleanup, so AWS resources may be left behind)
kubectl patch nexuscapability <name> -n <namespace> \
-p '{"metadata":{"finalizers":null}}' --type=merge
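When several capabilities have failed, it helps to reduce the listing to namespace/name pairs that can be fed to the commands above. The sketch assumes the usual `kubectl get -A` layout, where the first two columns are NAMESPACE and NAME; `failed_capabilities` is a hypothetical helper.

```shell
# Hypothetical helper: turn `kubectl get tc -A` output into namespace/name pairs
# for rows that contain "Failed". The header row is skipped.
failed_capabilities() {
  awk 'NR > 1 && /Failed/ { print $1 "/" $2 }'
}
```

Usage: `kubectl get tc -A | failed_capabilities`.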
Rotate IAM Credentials¶
IRSA handles credential rotation automatically, so no manual rotation is needed.
Check IRSA Configuration¶
# Verify ServiceAccount annotation
kubectl get serviceaccount nexus-operator -n nexus-system -o yaml | grep eks.amazonaws.com
# Verify IAM role trust policy
aws iam get-role --role-name NexusAIOperatorRole --query 'Role.AssumeRolePolicyDocument'
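To read just the role ARN instead of grepping the YAML, a jsonpath query on the `eks.amazonaws.com/role-arn` annotation works. A minimal sketch, with `irsa_role_arn` as a hypothetical helper whose defaults match the operator ServiceAccount used above:

```shell
# Hypothetical helper: print the IAM role ARN annotated on a ServiceAccount.
# $1 = ServiceAccount name, $2 = namespace (both optional).
irsa_role_arn() {
  # Dots inside the annotation key must be escaped in kubectl jsonpath.
  kubectl get serviceaccount "${1:-nexus-operator}" -n "${2:-nexus-system}" \
    -o jsonpath='{.metadata.annotations.eks\.amazonaws\.com/role-arn}'
}
```

An empty result means the annotation is missing, in which case pods using that ServiceAccount will not receive IRSA credentials.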
Scaling Capabilities¶
Scale Workloads¶
Update the capability CR:
Apply:
Enable HPA¶
Security Operations¶
Audit Operator Permissions¶
# Check what the operator can do
kubectl auth can-i --list --as=system:serviceaccount:nexus-system:nexus-operator
Review IAM Role Policy¶
aws iam list-attached-role-policies --role-name NexusAIOperatorRole
aws iam get-role-policy --role-name NexusAIOperatorRole --policy-name NexusAIOperatorPolicy
Check Pod Security¶
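One way to inspect pod security settings is to print each pod's pod-level `securityContext`. This is a sketch under the assumption that this is the kind of check the section intends; `pod_security_contexts` is a hypothetical helper.

```shell
# Hypothetical helper: print each pod's name and pod-level securityContext.
# $1 = namespace (defaults to the operator namespace).
pod_security_contexts() {
  kubectl get pods -n "${1:-nexus-system}" \
    -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.spec.securityContext}{"\n"}{end}'
}
```

Container-level settings live under `.spec.containers[*].securityContext` and would need a separate query.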
Performance Tuning¶
Operator Resources¶
Adjust operator resource limits in manifests/deployment.yaml:
Capability Resources¶
Adjust in the capability CR:
Useful Commands Reference¶
| Task | Command |
|---|---|
| Check operator status | ./operator-nexus-dev.sh status |
| View operator logs | kubectl -n nexus-system logs -f deployment/nexus-operator |
| List capabilities | kubectl get tc -A |
| Describe capability | kubectl describe tc <name> |
| Watch capabilities | kubectl get tc -A -w |
| Delete capability | kubectl delete tc <name> |
| View events | kubectl -n nexus-system get events |
| Restart operator | kubectl -n nexus-system rollout restart deployment/nexus-operator |
| Check CRD | kubectl get crd nexuscapabilities.nexus.ai |