Troubleshooting¶

Common issues and solutions for the Nexus Kubernetes Operator.

Quick Diagnostics¶

Run these commands to gather diagnostic information:

# Operator status
./operator-nexus-dev.sh status

# Operator pods
kubectl -n nexus-system get pods

# Operator logs
kubectl -n nexus-system logs deployment/nexus-operator --tail=100

# Events
kubectl -n nexus-system get events --sort-by='.lastTimestamp' | tail -20

# CRD status
kubectl get crd nexuscapabilities.nexus.ai

Common Issues¶

Issue: kubectl Access Denied (401 Unauthorized)¶

Symptoms:

error: You must be logged in to the server (Unauthorized)

Cause: Your IAM principal is not mapped in the aws-auth ConfigMap and has no EKS access entry configured.

Solution: The deploy script automatically fixes this. Run:

./operator-nexus-dev.sh deploy

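Or add the mapping to aws-auth manually. Note that this merge patch replaces the entire mapRoles value, so include any existing entries (such as node roles) in the string, or append with kubectl edit configmap aws-auth -n kube-system instead: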

kubectl patch configmap aws-auth -n kube-system --type merge -p '{
  "data": {
    "mapRoles": "- rolearn: YOUR_ROLE_ARN\n  username: admin:{{SessionName}}\n  groups:\n    - system:masters\n"
  }
}'
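Because a merge patch overwrites the whole mapRoles string, it is safer to append to the current value. The following is a minimal Python sketch, assuming the existing text was read with kubectl get configmap aws-auth -n kube-system -o jsonpath='{.data.mapRoles}' and that its entries start at column 0. The function names are hypothetical helpers, not part of the operator tooling.

```python
import json


def append_map_role(existing, role_arn, username, groups):
    """Return mapRoles YAML text with one extra entry appended.

    `existing` is the current data.mapRoles string (may be empty).
    The text is returned unchanged if the role ARN is already mapped.
    """
    if f"rolearn: {role_arn}\n" in existing + "\n":
        return existing
    entry = f"- rolearn: {role_arn}\n  username: {username}\n  groups:\n"
    entry += "".join(f"    - {group}\n" for group in groups)
    # Make sure the existing text ends with a newline before appending.
    if existing and not existing.endswith("\n"):
        existing += "\n"
    return existing + entry


def build_patch(map_roles):
    """Build the JSON body for `kubectl patch ... --type merge -p`."""
    return json.dumps({"data": {"mapRoles": map_roles}})
```

The string returned by build_patch can be passed to kubectl patch configmap aws-auth -n kube-system --type merge -p in place of the hand-written JSON above.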

Issue: ImagePullBackOff¶

Symptoms:

NAME                              READY   STATUS             RESTARTS   AGE
nexus-operator-xxx-xxx           0/1     ImagePullBackOff   0          5m

Cause: The operator image does not exist in ECR, or the nodes lack permission to pull it.

Solution: Build and push the image:

./build-and-push.sh
./operator-nexus-dev.sh deploy --force

Issue: Operator Pod Not Starting¶

Symptoms:
- Pod stuck in Pending or CrashLoopBackOff
- Status shows 0/1 Ready

Diagnosis:

kubectl -n nexus-system describe pod -l app.kubernetes.io/name=nexus-operator
kubectl -n nexus-system logs deployment/nexus-operator

Common causes and solutions:

Cause	Solution
Image pull errors	Check ECR permissions, run `./build-and-push.sh`
IRSA misconfiguration	Verify ServiceAccount annotation
Resource constraints	Check node capacity, adjust resource limits
Missing CRD	Run `kubectl apply -f manifests/crd.yaml`
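The triage in the table above can be partly automated from the pod's status. This is a rough sketch, assuming the pod object comes from kubectl -n nexus-system get pod <pod-name> -o json. The function name diagnose_pod is hypothetical. IRSA and missing-CRD problems usually surface only in the logs, so a CrashLoopBackOff result simply points there.

```python
def diagnose_pod(pod):
    """Map a pod object (parsed JSON) to likely causes from the table."""
    findings = []
    status = pod.get("status", {})
    # An unschedulable pod reports the reason on its PodScheduled condition.
    for cond in status.get("conditions", []):
        if (cond.get("type") == "PodScheduled"
                and cond.get("status") == "False"
                and cond.get("reason") == "Unschedulable"):
            findings.append(
                "Resource constraints: "
                + cond.get("message", "pod is unschedulable")
            )
    # Containers that cannot start report a waiting reason.
    for cs in status.get("containerStatuses", []):
        waiting = cs.get("state", {}).get("waiting") or {}
        reason = waiting.get("reason", "")
        if reason in ("ImagePullBackOff", "ErrImagePull"):
            findings.append(
                "Image pull errors: check ECR permissions and that the image exists"
            )
        elif reason == "CrashLoopBackOff":
            findings.append(
                "Container is crashing: inspect logs with --previous "
                "for IRSA or CRD errors"
            )
    return findings
```

An empty list means the status alone shows no known cause, and the describe and logs output remain the next place to look.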

Issue: Token Expiry Errors¶

Symptoms:

Unauthorized: token expired

Status: ✅ FIXED in current version

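The operator's kubeconfig runs aws eks get-token as an exec credential plugin, so tokens refresh automatically.

If the error still occurs, refresh your AWS credentials and confirm the active identity:

# Refresh credentials
aws sso login --profile external-access
# Verify the active identity
aws sts get-caller-identity --profile external-access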

Issue: Capabilities Not Deploying¶

Symptoms:
- Capability stuck in Pending or Provisioning
- No AWS resources created

Diagnosis:

# Check operator logs
kubectl -n nexus-system logs -f deployment/nexus-operator

# Check capability status
kubectl describe nexuscapability <name> -n <capability-namespace>

# Check events
kubectl get events -n <capability-namespace>

Common causes:

Cause	Solution
AWS permissions	Check IAM role has required policies
Invalid spec	Validate capability YAML against CRD
Resource conflicts	Check for existing resources with same names
Region mismatch	Ensure region in spec matches cluster region

Issue: AWS Resource Creation Failed¶

Symptoms:

ERROR | Failed to create DynamoDB table: AccessDeniedException

Solution:

Check IAM role policy:

aws iam get-role-policy --role-name NexusAIOperatorRole --policy-name NexusAIOperatorPolicy

Verify required permissions:
- dynamodb:* for DynamoDB operations
- s3:* for S3 operations
- glue:* for Glue operations
- iam:* for IAM role creation
- ssm:* for SSM parameters
- secretsmanager:* for secrets
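The permission list above can be checked against the policy document returned by the get-role-policy command. This is a minimal sketch, assuming the document was saved with --query PolicyDocument --output json and parsed into a dict. The function name missing_services is hypothetical. It only recognizes service-wide wildcards such as dynamodb:*, and it ignores Resource, Condition, NotAction, and explicit Deny statements.

```python
REQUIRED_SERVICES = ("dynamodb", "s3", "glue", "iam", "ssm", "secretsmanager")


def _as_list(value):
    """IAM allows a single value or a list in most fields; normalize to a list."""
    if value is None:
        return []
    return value if isinstance(value, list) else [value]


def missing_services(policy_document, required=REQUIRED_SERVICES):
    """Return the required services with no `service:*` Allow statement."""
    allowed = set()
    for stmt in _as_list(policy_document.get("Statement")):
        if stmt.get("Effect") != "Allow":
            continue
        for action in _as_list(stmt.get("Action")):
            action = action.lower()
            if action == "*":
                return []  # full wildcard covers every service
            service, _, name = action.partition(":")
            if name == "*":
                allowed.add(service)
    return [service for service in required if service not in allowed]
```

A service reported as missing may still be partly covered by narrower actions, so treat the result as a prompt to read the policy rather than a verdict.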

Issue: IRSA Not Working¶

Symptoms:

Unable to locate credentials
NoCredentialProviders: no valid providers in chain

Diagnosis:

# Check ServiceAccount annotation
kubectl get sa nexus-operator -n nexus-system -o yaml | grep eks.amazonaws.com

# Verify OIDC provider
aws iam list-open-id-connect-providers

# Check IAM role trust policy
aws iam get-role --role-name NexusAIOperatorRole --query 'Role.AssumeRolePolicyDocument'

Solution:

Ensure OIDC provider is associated:

eksctl utils associate-iam-oidc-provider --cluster <cluster-name> --approve

Verify that the eks.amazonaws.com/role-arn annotation on the ServiceAccount matches the IAM role ARN.
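The trust policy returned by the get-role command in the diagnosis step can also be checked mechanically. This is a sketch under the assumption that the policy follows the usual IRSA shape: an Allow statement for sts:AssumeRoleWithWebIdentity with a condition on the OIDC provider's :sub key. The function name is hypothetical, and StringLike matching is approximated with fnmatch.

```python
from fnmatch import fnmatchcase


def _as_list(value):
    """Normalize a single value or a list to a list."""
    return value if isinstance(value, list) else [value]


def trust_allows_service_account(trust_policy, namespace="nexus-system",
                                 name="nexus-operator"):
    """Return True if a statement explicitly trusts the given ServiceAccount."""
    subject = f"system:serviceaccount:{namespace}:{name}"
    for stmt in _as_list(trust_policy.get("Statement", [])):
        if stmt.get("Effect") != "Allow":
            continue
        if "sts:AssumeRoleWithWebIdentity" not in _as_list(stmt.get("Action", [])):
            continue
        conditions = stmt.get("Condition", {})
        for operator in ("StringEquals", "StringLike"):
            for key, patterns in conditions.get(operator, {}).items():
                # The subject claim key looks like "<oidc-provider-host>/id/<id>:sub".
                if not key.endswith(":sub"):
                    continue
                for pattern in _as_list(patterns):
                    if operator == "StringEquals":
                        matched = pattern == subject
                    else:
                        matched = fnmatchcase(subject, pattern)
                    if matched:
                        return True
    return False
```

A False result usually means the namespace or ServiceAccount name in the trust policy differs from nexus-system/nexus-operator, or that the policy has no :sub condition at all.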

Issue: Capability Stuck in Deleting¶

Symptoms:
- Capability CR won't delete
- Stuck with finalizers

Solution:

# Remove finalizers (this skips the operator's cleanup, so AWS resources may be left behind)
kubectl patch nexuscapability <name> -n <namespace> \
  -p '{"metadata":{"finalizers":null}}' --type=merge

# Force delete if the resource still exists
kubectl delete nexuscapability <name> -n <namespace> --force --grace-period=0
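Before stripping finalizers, it can help to confirm that deletion is actually pending and which finalizers remain. This is a small sketch, assuming the resource was fetched with kubectl get nexuscapability <name> -n <namespace> -o json. The function name deletion_blockers is hypothetical.

```python
def deletion_blockers(resource):
    """Return the finalizers holding up deletion, or None if no deletion is pending.

    Kubernetes sets metadata.deletionTimestamp once a delete has been requested;
    the object is removed only after metadata.finalizers becomes empty.
    """
    metadata = resource.get("metadata", {})
    if "deletionTimestamp" not in metadata:
        return None
    return list(metadata.get("finalizers", []))
```

A non-empty list while the operator is running suggests its cleanup is failing, and the operator logs are worth checking before the finalizers are removed by hand.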

Issue: LoadBalancer Not Getting External IP¶

Symptoms:
- Service stuck in <pending> for external IP
- No load balancer created

Diagnosis:

kubectl describe service <service-name> -n <namespace>
kubectl get events -n <namespace> | grep LoadBalancer

Common causes:

Cause	Solution
Missing AWS LB Controller	Install AWS Load Balancer Controller
Subnet tags missing	Add required Kubernetes tags to subnets
IAM permissions	Check LB controller role permissions
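For the subnet tag cause, the AWS Load Balancer Controller discovers subnets through the kubernetes.io/role/elb tag (internet-facing) or the kubernetes.io/role/internal-elb tag (internal). This is a minimal sketch for spotting untagged subnets, assuming the input is the Subnets list from aws ec2 describe-subnets --output json. The function name is hypothetical.

```python
def subnets_missing_lb_tag(subnets, internal=False):
    """Return the IDs of subnets that lack the load balancer role tag."""
    key = "kubernetes.io/role/internal-elb" if internal else "kubernetes.io/role/elb"
    missing = []
    for subnet in subnets:
        tags = {tag["Key"]: tag["Value"] for tag in subnet.get("Tags", [])}
        # The controller accepts a value of "1" or an empty value.
        if tags.get(key) not in ("1", ""):
            missing.append(subnet["SubnetId"])
    return missing
```

Pass internal=True when the Service requests an internal load balancer, so that private subnets are checked for the internal-elb tag instead.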

Advanced Troubleshooting¶

Enable Debug Logging¶

# Set debug level
kubectl -n nexus-system set env deployment/nexus-operator LOG_LEVEL=DEBUG

# View debug logs
kubectl -n nexus-system logs -f deployment/nexus-operator

Check AWS API Calls¶

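# CloudTrail event history keeps the last 90 days of management API calls
# View recent DynamoDB API calls for troubleshooting
aws cloudtrail lookup-events --max-results 20 \
  --lookup-attributes AttributeKey=EventSource,AttributeValue=dynamodb.amazonaws.com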
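When reviewing CloudTrail output, the failed calls are usually the interesting ones. This is a sketch that filters them, assuming the input is the Events list from aws cloudtrail lookup-events --output json, where each entry carries the full record as a JSON string in CloudTrailEvent. The function name failed_api_calls is hypothetical.

```python
import json


def failed_api_calls(events):
    """Extract the calls that returned an error from CloudTrail lookup results."""
    failures = []
    for event in events:
        # The full record is embedded as a JSON string.
        detail = json.loads(event.get("CloudTrailEvent", "{}"))
        if "errorCode" in detail:
            failures.append({
                "time": detail.get("eventTime"),
                "source": detail.get("eventSource"),
                "name": detail.get("eventName"),
                "error": detail["errorCode"],
                "message": detail.get("errorMessage", ""),
            })
    return failures
```

An AccessDenied error on a call such as CreateTable points back to the IAM role policy check described under AWS Resource Creation Failed.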

Verify Network Connectivity¶

# From operator pod
kubectl -n nexus-system exec -it deploy/nexus-operator -- python -c "
import boto3
sts = boto3.client('sts')
print(sts.get_caller_identity())
"

Check Resource Quotas¶

kubectl describe resourcequota -n nexus-system
kubectl describe limitrange -n nexus-system

Recovery Procedures¶

Recover from Failed Deployment¶

# 1. Check what failed
kubectl describe nexuscapability <name>

# 2. Delete the failed capability
kubectl delete nexuscapability <name>

# 3. Clean up any partial AWS resources
aws dynamodb delete-table --table-name <capability>-<env>-transformation-system
aws s3 rb s3://<capability>-<env>-call-data --force

# 4. Redeploy
kubectl apply -f capability.yaml

Recover from Operator Crash¶

# 1. Check crash reason
kubectl -n nexus-system logs deployment/nexus-operator --previous

# 2. Restart operator
kubectl -n nexus-system rollout restart deployment/nexus-operator

# 3. If persistent, redeploy
./operator-nexus-dev.sh delete
./build-and-deploy.sh

Reset Operator State¶

# 1. Delete operator
./operator-nexus-dev.sh delete

# 2. Clean up CRs manually
kubectl delete nexuscapabilities --all -A

# 3. Redeploy
./build-and-deploy.sh

Getting Help¶

Gather Information¶

Before reporting an issue, gather:

- Operator status: ./operator-nexus-dev.sh status
- Operator logs: kubectl -n nexus-system logs deployment/nexus-operator
- Capability description: kubectl describe nexuscapability <name>
- Events: kubectl get events -A --sort-by='.lastTimestamp'
- AWS credentials: aws sts get-caller-identity
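The items above can be collected into a single report to attach to an issue. This is a minimal sketch, assuming it runs from the directory containing operator-nexus-dev.sh with kubectl and aws on the PATH. The function names are hypothetical, and the capability description is left out because it needs a resource name.

```python
import subprocess

DIAGNOSTIC_COMMANDS = [
    ["./operator-nexus-dev.sh", "status"],
    ["kubectl", "-n", "nexus-system", "logs", "deployment/nexus-operator"],
    ["kubectl", "get", "events", "-A", "--sort-by=.lastTimestamp"],
    ["aws", "sts", "get-caller-identity"],
]


def run_command(argv):
    """Run one command and return its combined output, or a failure note."""
    try:
        proc = subprocess.run(argv, capture_output=True, text=True, timeout=60)
    except (OSError, subprocess.TimeoutExpired) as exc:
        return f"failed to run: {exc}\n"
    return proc.stdout + proc.stderr


def build_report(commands=DIAGNOSTIC_COMMANDS, runner=run_command):
    """Return one text report with each command line followed by its output."""
    sections = []
    for argv in commands:
        sections.append("$ " + " ".join(argv) + "\n" + runner(argv))
    return "\n".join(sections)
```

Calling build_report() and writing the result to a file gives one attachment. Review it for secrets before sharing, since operator logs may contain resource names or account IDs.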

Support Contacts¶

For internal support, contact the platform team with:
- Cluster name and region
- Error messages
- Steps to reproduce
- Diagnostic information gathered above
