Troubleshooting
Common issues and solutions for the Nexus Kubernetes Operator.
Quick Diagnostics
Run these commands to gather diagnostic information:
# Operator status
./operator-nexus-dev.sh status
# Operator pods
kubectl -n nexus-system get pods
# Operator logs
kubectl -n nexus-system logs deployment/nexus-operator --tail=100
# Events
kubectl -n nexus-system get events --sort-by='.lastTimestamp' | tail -20
# CRD status
kubectl get crd nexuscapabilities.nexus.ai
Common Issues
Issue: kubectl Access Denied (401 Unauthorized)
Symptoms:
Cause: The IAM identity is not mapped in the aws-auth ConfigMap, or no EKS Access Entry is configured for it.
Solution: The deploy script automatically fixes this. Run:
Or manually add to aws-auth:
# Caution: a merge patch replaces the entire mapRoles value, including existing node role mappings
kubectl patch configmap aws-auth -n kube-system --type merge -p '{
  "data": {
    "mapRoles": "- rolearn: YOUR_ROLE_ARN\n  username: admin:{{SessionName}}\n  groups:\n  - system:masters\n"
  }
}'
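A gentler alternative to the manual patch is to append the new entry to whatever `mapRoles` already contains. The following is a minimal sketch of that idea, not the deploy script's actual logic: `append_map_role` is a hypothetical helper that reads the current `mapRoles` text on stdin and prints the merged text, leaving it unchanged if the role ARN is already mapped.

```shell
# Hypothetical helper: merge one admin role mapping into existing mapRoles text.
# Reads the current mapRoles document on stdin, prints the merged document.
append_map_role() {
  local role_arn existing
  role_arn="$1"
  existing="$(cat)"

  # If the ARN is already mapped, print the input unchanged (idempotent).
  if printf '%s\n' "$existing" |
     awk -v arn="$role_arn" '/rolearn:/ && $NF == arn { found = 1 } END { exit !found }'
  then
    printf '%s\n' "$existing"
    return 0
  fi

  # Keep every existing entry, then append the new one with valid YAML indentation.
  if [ -n "$existing" ]; then
    printf '%s\n' "$existing"
  fi
  printf '%s rolearn: %s\n' '-' "$role_arn"
  printf '  username: admin:{{SessionName}}\n'
  printf '  groups:\n'
  printf '  %s system:masters\n' '-'
}
```

One way to use it: pipe `kubectl get configmap aws-auth -n kube-system -o jsonpath='{.data.mapRoles}'` into `append_map_role YOUR_ROLE_ARN`, review the output, and paste it back with `kubectl edit configmap aws-auth -n kube-system`.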
Issue: ImagePullBackOff
Symptoms:
Cause: Operator image doesn't exist in ECR.
Solution: Build and push the image:
Issue: Operator Pod Not Starting
Symptoms:
- Pod stuck in Pending or CrashLoopBackOff
- Status shows 0/1 Ready
Diagnosis:
kubectl -n nexus-system describe pod -l app.kubernetes.io/name=nexus-operator
kubectl -n nexus-system logs deployment/nexus-operator
Common causes and solutions:
| Cause | Solution |
|---|---|
| Image pull errors | Check ECR permissions, run ./build-and-push.sh |
| IRSA misconfiguration | Verify ServiceAccount annotation |
| Resource constraints | Check node capacity, adjust resource limits |
| Missing CRD | Run kubectl apply -f manifests/crd.yaml |
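To narrow down which row of the table applies, it can help to read the container's waiting reason directly instead of scanning the describe output. The sketch below assumes the operator pod has a single container and carries the label used in the diagnosis commands. The mapping from reason to hint is an illustrative guess, and `hint_for_reason` and `operator_pod_hints` are hypothetical helper names.

```shell
# Map a container waiting reason to a likely row of the causes table.
# The mapping is an assumption for illustration, not operator behavior.
hint_for_reason() {
  case "$1" in
    ImagePullBackOff|ErrImagePull)
      echo "image pull error: check ECR permissions, run ./build-and-push.sh" ;;
    CrashLoopBackOff)
      echo "container is crashing: check logs; IRSA or a missing CRD may be involved" ;;
    "")
      echo "no waiting reason: if Pending, check node capacity in the describe output" ;;
    *)
      echo "unrecognized reason ($1): see the describe output" ;;
  esac
}

# Print one hint per operator pod, based on the first container's waiting reason.
operator_pod_hints() {
  kubectl -n nexus-system get pods -l app.kubernetes.io/name=nexus-operator \
    -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.status.containerStatuses[0].state.waiting.reason}{"\n"}{end}' |
  while read -r pod reason; do
    [ -n "$pod" ] || continue
    printf '%s: %s\n' "$pod" "$(hint_for_reason "$reason")"
  done
}
```

Calling `operator_pod_hints` prints a line per pod. An empty reason usually means the pod is either running or not yet scheduled, so the describe output remains the authoritative source.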
Issue: Token Expiry Errors
Symptoms:
Status: ✅ FIXED in current version
The operator uses a kubeconfig that calls aws eks get-token, so tokens refresh automatically.
If still occurring:
# Refresh credentials
aws sso login --profile external-access
# or check whether the current credentials are still valid
aws sts get-caller-identity --profile external-access
Issue: Capabilities Not Deploying
Symptoms:
- Capability stuck in Pending or Provisioning
- No AWS resources created
Diagnosis:
# Check operator logs
kubectl -n nexus-system logs -f deployment/nexus-operator
# Check capability status
kubectl describe nexuscapability <name>
# Check events
kubectl get events -n <capability-namespace>
Common causes:
| Cause | Solution |
|---|---|
| AWS permissions | Check IAM role has required policies |
| Invalid spec | Validate capability YAML against CRD |
| Resource conflicts | Check for existing resources with same names |
| Region mismatch | Ensure region in spec matches cluster region |
Issue: AWS Resource Creation Failed
Symptoms:
Solution:
- Check IAM role policy:
- Verify required permissions:
  - dynamodb:* for DynamoDB operations
  - s3:* for S3 operations
  - glue:* for Glue operations
  - iam:* for IAM role creation
  - ssm:* for SSM parameters
  - secretsmanager:* for secrets
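One way to check these permissions without creating anything is the IAM policy simulator. The sketch below samples one representative action per service. The specific actions are an assumption chosen for illustration, since the operator may call others, and `denied_actions` is a hypothetical helper name.

```shell
# Print each sampled action that the given role is NOT allowed to perform.
# The action list is a representative sample, not the operator's full set.
denied_actions() {
  local role_arn
  role_arn="$1"
  aws iam simulate-principal-policy \
    --policy-source-arn "$role_arn" \
    --action-names dynamodb:CreateTable s3:CreateBucket glue:CreateDatabase \
                   iam:CreateRole ssm:PutParameter secretsmanager:CreateSecret \
    --query 'EvaluationResults[].[EvalActionName,EvalDecision]' \
    --output text |
  # EvalDecision is "allowed", "implicitDeny", or "explicitDeny".
  awk '$2 != "allowed" { print $1 }'
}
```

Running `denied_actions <role-arn>` with the operator role's ARN prints nothing when every sampled action is allowed. The caller needs `iam:SimulatePrincipalPolicy`, and the simulation does not account for every condition that applies at request time.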
Issue: IRSA Not Working
Symptoms:
Diagnosis:
# Check ServiceAccount annotation
kubectl get sa nexus-operator -n nexus-system -o yaml | grep eks.amazonaws.com
# Verify OIDC provider
aws iam list-open-id-connect-providers
# Check IAM role trust policy
aws iam get-role --role-name NexusAIOperatorRole --query 'Role.AssumeRolePolicyDocument'
Solution:
- Ensure OIDC provider is associated:
- Verify ServiceAccount annotation matches IAM role ARN
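The annotation check can be automated by comparing the two values directly. This is a small sketch that reuses the ServiceAccount, namespace, and role name from the diagnosis commands above. `check_irsa_annotation` is a hypothetical helper name.

```shell
# Compare the ServiceAccount's IRSA annotation with the IAM role's real ARN.
check_irsa_annotation() {
  local annotated actual
  # Dots inside the annotation key must be escaped in kubectl JSONPath.
  annotated="$(kubectl get sa nexus-operator -n nexus-system \
    -o jsonpath='{.metadata.annotations.eks\.amazonaws\.com/role-arn}')"
  actual="$(aws iam get-role --role-name NexusAIOperatorRole \
    --query 'Role.Arn' --output text)"

  if [ -z "$annotated" ]; then
    echo "MISSING: ServiceAccount has no eks.amazonaws.com/role-arn annotation"
    return 1
  elif [ "$annotated" != "$actual" ]; then
    echo "MISMATCH: annotation=$annotated role=$actual"
    return 1
  fi
  echo "OK: $annotated"
}
```

A matching annotation is necessary but not sufficient: the role's trust policy must also reference the cluster's OIDC provider and the `nexus-system:nexus-operator` ServiceAccount.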
Issue: Capability Stuck in Deleting
Symptoms:
- Capability CR won't delete
- Stuck with finalizers
Solution:
# Remove finalizer
kubectl patch nexuscapability <name> -n <namespace> \
-p '{"metadata":{"finalizers":null}}' --type=merge
# Force delete
kubectl delete nexuscapability <name> -n <namespace> --force --grace-period=0
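Clearing finalizers presumably bypasses whatever cleanup the operator would have run, so AWS resources may be left behind. Before doing so, it may be worth confirming that the object is actually marked for deletion and seeing which finalizers are holding it. The helpers below are a hypothetical sketch of that check.

```shell
# Succeeds only when the capability already has a deletionTimestamp.
is_marked_for_deletion() {
  local ts
  ts="$(kubectl get nexuscapability "$1" -n "$2" \
    -o jsonpath='{.metadata.deletionTimestamp}')"
  [ -n "$ts" ]
}

# Print the finalizers currently set on the capability (JSON array, or empty).
show_finalizers() {
  kubectl get nexuscapability "$1" -n "$2" -o jsonpath='{.metadata.finalizers}'
  echo
}
```

For example, `is_marked_for_deletion <name> <namespace> && show_finalizers <name> <namespace>` prints the blocking finalizers only for an object that is already being deleted.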
Issue: LoadBalancer Not Getting External IP
Symptoms:
- Service stuck in <pending> for external IP
- No load balancer created
Diagnosis:
kubectl describe service <service-name> -n <namespace>
kubectl get events -n <namespace> | grep LoadBalancer
Common causes:
| Cause | Solution |
|---|---|
| Missing AWS LB Controller | Install AWS Load Balancer Controller |
| Subnet tags missing | Add required Kubernetes tags to subnets |
| IAM permissions | Check LB controller role permissions |
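For the subnet tag row, the AWS Load Balancer Controller discovers subnets through the `kubernetes.io/role/elb` tag (internet-facing) and the `kubernetes.io/role/internal-elb` tag (internal). The sketch below lists subnets in a VPC that carry neither tag. `subnets_missing_lb_tags` is a hypothetical helper, and the VPC ID is supplied by the caller.

```shell
# List subnets in the given VPC that have neither load balancer role tag.
subnets_missing_lb_tags() {
  aws ec2 describe-subnets \
    --filters "Name=vpc-id,Values=$1" \
    --query "Subnets[].[SubnetId, Tags[?Key=='kubernetes.io/role/elb']|[0].Value, Tags[?Key=='kubernetes.io/role/internal-elb']|[0].Value]" \
    --output text |
  # In text output a missing tag is rendered as "None".
  awk '$2 == "None" && $3 == "None" { print $1 }'
}
```

An empty result from `subnets_missing_lb_tags <vpc-id>` means every subnet has at least one of the two tags. Whether the right subnets are tagged for the intended load balancer type still needs a manual look.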
Advanced Troubleshooting
Enable Debug Logging
# Set debug level
kubectl -n nexus-system set env deployment/nexus-operator LOG_LEVEL=DEBUG
# View debug logs
kubectl -n nexus-system logs -f deployment/nexus-operator
Check AWS API Calls
Verify Network Connectivity
# From operator pod
kubectl -n nexus-system exec -it deploy/nexus-operator -- python -c "
import boto3
sts = boto3.client('sts')
print(sts.get_caller_identity())
"
Check Resource Quotas
Recovery Procedures
Recover from Failed Deployment
# 1. Check what failed
kubectl describe nexuscapability <name>
# 2. Delete the failed capability
kubectl delete nexuscapability <name>
# 3. Clean up any partial AWS resources
aws dynamodb delete-table --table-name <capability>-<env>-transformation-system
aws s3 rb s3://<capability>-<env>-call-data --force
# 4. Redeploy
kubectl apply -f capability.yaml
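Before redeploying, it can be useful to confirm that the partial resources from step 3 are really gone. The sketch below checks only the two resource names that appear in that step. `leftover_resources` is a hypothetical helper, and a real capability may create other resources that it does not cover.

```shell
# Report which of the step 3 resources still exist for a capability/environment pair.
leftover_resources() {
  local table bucket
  table="$1-$2-transformation-system"
  bucket="$1-$2-call-data"

  # describe-table exits non-zero when the table does not exist.
  if aws dynamodb describe-table --table-name "$table" >/dev/null 2>&1; then
    echo "dynamodb table still exists: $table"
  fi
  # head-bucket exits non-zero when the bucket is missing or inaccessible.
  if aws s3api head-bucket --bucket "$bucket" >/dev/null 2>&1; then
    echo "s3 bucket still exists: $bucket"
  fi
}
```

`leftover_resources <capability> <env>` prints nothing when both are gone. Note that a permissions error is indistinguishable from "not found" here, so an empty result is only trustworthy with working credentials.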
Recover from Operator Crash
# 1. Check crash reason
kubectl -n nexus-system logs deployment/nexus-operator --previous
# 2. Restart operator
kubectl -n nexus-system rollout restart deployment/nexus-operator
# 3. If persistent, redeploy
./operator-nexus-dev.sh delete
./build-and-deploy.sh
Reset Operator State
# 1. Delete operator
./operator-nexus-dev.sh delete
# 2. Clean up CRs manually
kubectl delete nexuscapabilities --all -A
# 3. Redeploy
./build-and-deploy.sh
Getting Help
Gather Information
Before reporting an issue, gather:
- Operator status: ./operator-nexus-dev.sh status
- Operator logs: kubectl -n nexus-system logs deployment/nexus-operator
- Capability description: kubectl describe nexuscapability <name>
- Events: kubectl get events -A --sort-by='.lastTimestamp'
- AWS credentials: aws sts get-caller-identity
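The items above can be collected into a single directory for attaching to a report. This is a minimal sketch: `collect_diagnostics` and the output file names are hypothetical, and the commands are the ones listed above, with the capability description widened to all namespaces.

```shell
# Collect the diagnostics listed above into one directory and print its path.
collect_diagnostics() {
  local out
  out="${1:-nexus-diagnostics}"
  mkdir -p "$out" || return 1

  # The status script is optional: skip it when not run from the repo directory.
  if [ -x ./operator-nexus-dev.sh ]; then
    ./operator-nexus-dev.sh status > "$out/operator-status.txt" 2>&1
  fi
  kubectl -n nexus-system get pods -o wide > "$out/pods.txt" 2>&1
  kubectl -n nexus-system logs deployment/nexus-operator > "$out/operator.log" 2>&1
  kubectl describe nexuscapabilities -A > "$out/capabilities.txt" 2>&1
  kubectl get events -A --sort-by='.lastTimestamp' > "$out/events.txt" 2>&1
  aws sts get-caller-identity > "$out/aws-identity.txt" 2>&1

  echo "$out"
}
```

Errors are captured in the files rather than aborting the run, so a partial bundle is still produced when one command fails. Review the contents before sharing, since logs and descriptions may include account IDs or other sensitive values.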
Support Contacts
For internal support, contact the platform team with:
- Cluster name and region
- Error messages
- Steps to reproduce
- Diagnostic information gathered above