Troubleshooting

Common issues and solutions for the Nexus Kubernetes Operator.

Quick Diagnostics

Run these commands to gather diagnostic information:

# Operator status
./operator-nexus-dev.sh status

# Operator pods
kubectl -n nexus-system get pods

# Operator logs
kubectl -n nexus-system logs deployment/nexus-operator --tail=100

# Events
kubectl -n nexus-system get events --sort-by='.lastTimestamp' | tail -20

# CRD status
kubectl get crd nexuscapabilities.nexus.ai

Common Issues

Issue: kubectl Access Denied (401 Unauthorized)

Symptoms:

error: You must be logged in to the server (Unauthorized)

Cause: Your IAM role or user is not mapped in the aws-auth ConfigMap, and no EKS access entry is configured for it.

Solution: The deploy script automatically fixes this. Run:

./operator-nexus-dev.sh deploy

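Or add the role to aws-auth manually. This merge patch replaces the entire mapRoles value, so include every existing entry (such as the node instance role) in the string, or those mappings are lost: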

kubectl patch configmap aws-auth -n kube-system --type merge -p '{
  "data": {
    "mapRoles": "- rolearn: YOUR_ROLE_ARN\n  username: admin:{{SessionName}}\n  groups:\n    - system:masters\n"
  }
}'
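Because the patch overwrites mapRoles, it can help to build the new value from the current one. The following is a minimal sketch, not part of the deploy script: the helper names append_map_role and build_patch are hypothetical, and the current value is assumed to come from kubectl get configmap aws-auth -n kube-system -o jsonpath='{.data.mapRoles}'.

```python
import json
from typing import Sequence


def append_map_role(existing: str, role_arn: str,
                    username: str = "admin:{{SessionName}}",
                    groups: Sequence[str] = ("system:masters",)) -> str:
    """Return the mapRoles text with one entry appended, keeping existing entries."""
    # Collect the ARNs that are already mapped so the operation is idempotent.
    mapped = {
        line.split("rolearn:", 1)[1].strip()
        for line in existing.splitlines()
        if "rolearn:" in line
    }
    if role_arn in mapped:
        return existing

    entry = [f"- rolearn: {role_arn}", f"  username: {username}", "  groups:"]
    entry += [f"    - {group}" for group in groups]

    base = existing
    if base and not base.endswith("\n"):
        base += "\n"
    return base + "\n".join(entry) + "\n"


def build_patch(map_roles: str) -> str:
    """Return the JSON body for kubectl patch --type merge -p."""
    return json.dumps({"data": {"mapRoles": map_roles}})
```

The output of build_patch can be passed as the -p argument of the kubectl patch command shown above. The helper only does string handling; it does not validate the YAML or the ARN.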

Issue: ImagePullBackOff

Symptoms:

NAME                              READY   STATUS             RESTARTS   AGE
nexus-operator-xxx-xxx           0/1     ImagePullBackOff   0          5m

Cause: Operator image doesn't exist in ECR.

Solution: Build and push the image:

./build-and-push.sh
./operator-nexus-dev.sh deploy --force

Issue: Operator Pod Not Starting

Symptoms:

  • Pod stuck in Pending or CrashLoopBackOff
  • Status shows 0/1 Ready

Diagnosis:

kubectl -n nexus-system describe pod -l app.kubernetes.io/name=nexus-operator
kubectl -n nexus-system logs deployment/nexus-operator

Common causes and solutions:

Cause                   Solution
Image pull errors       Check ECR permissions, run ./build-and-push.sh
IRSA misconfiguration   Verify ServiceAccount annotation
Resource constraints    Check node capacity, adjust resource limits
Missing CRD             Run kubectl apply -f manifests/crd.yaml
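The causes listed for this issue can be matched against the pod's reported state. The sketch below is illustrative only: triage_pod is a hypothetical helper that takes one pod object from kubectl -n nexus-system get pods -o json, and the hint attached to CrashLoopBackOff is an assumption rather than something the operator reports.

```python
IMAGE_HINT = "Image pull errors: check ECR permissions, run ./build-and-push.sh"

HINTS = {
    "ImagePullBackOff": IMAGE_HINT,
    "ErrImagePull": IMAGE_HINT,
    # Assumption: crashes are most often credential or CRD problems.
    "CrashLoopBackOff": "Container keeps crashing: read the logs, then check IRSA and the CRD",
    "Unschedulable": "Resource constraints: check node capacity, adjust resource limits",
}


def triage_pod(pod: dict) -> list:
    """Return one hint per problem reason found in a pod's status."""
    status = pod.get("status", {})
    reasons = []

    # Containers that are waiting carry a reason such as ImagePullBackOff.
    for container in status.get("containerStatuses", []):
        waiting = container.get("state", {}).get("waiting")
        if waiting and waiting.get("reason"):
            reasons.append(waiting["reason"])

    # A pod that cannot be placed has PodScheduled=False.
    for condition in status.get("conditions", []):
        if condition.get("type") == "PodScheduled" and condition.get("status") == "False":
            reasons.append(condition.get("reason", "Unschedulable"))

    return [HINTS.get(reason, f"Unrecognized reason: {reason}") for reason in reasons]
```

An empty result means the status shows no waiting container and no scheduling failure, in which case the describe output and logs above are the next place to look.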

Issue: Token Expiry Errors

Symptoms:

Unauthorized: token expired

Status: ✅ FIXED in current version

The operator uses a kubeconfig that calls aws eks get-token, so tokens refresh automatically.

If the error still occurs, the underlying AWS credentials have probably expired:

# Refresh SSO credentials
aws sso login --profile external-access
# Then confirm that the credentials are valid
aws sts get-caller-identity --profile external-access

Issue: Capabilities Not Deploying

Symptoms:

  • Capability stuck in Pending or Provisioning
  • No AWS resources created

Diagnosis:

# Check operator logs
kubectl -n nexus-system logs -f deployment/nexus-operator

# Check capability status
kubectl describe nexuscapability <name>

# Check events
kubectl get events -n <capability-namespace>

Common causes:

Cause                Solution
AWS permissions      Check IAM role has required policies
Invalid spec         Validate capability YAML against CRD
Resource conflicts   Check for existing resources with same names
Region mismatch      Ensure region in spec matches cluster region

Issue: AWS Resource Creation Failed

Symptoms:

ERROR | Failed to create DynamoDB table: AccessDeniedException

Solution:

  1. Check IAM role policy:

    aws iam get-role-policy --role-name NexusAIOperatorRole --policy-name NexusAIOperatorPolicy
    

  2. Verify required permissions:

    • dynamodb:* for DynamoDB operations
    • s3:* for S3 operations
    • glue:* for Glue operations
    • iam:* for IAM role creation
    • ssm:* for SSM parameters
    • secretsmanager:* for secrets
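The permission list above can be compared against the policy document returned by the get-role-policy command. This is a rough sketch under stated assumptions: services_without_allow is a hypothetical helper, it takes the PolicyDocument object from that command's JSON output, and it only checks whether any Allow statement mentions each service prefix. It ignores Deny statements, NotAction, resources and conditions.

```python
REQUIRED_SERVICES = ["dynamodb", "s3", "glue", "iam", "ssm", "secretsmanager"]


def services_without_allow(policy_document: dict, services=REQUIRED_SERVICES) -> list:
    """Return the service prefixes that no Allow statement mentions."""
    statements = policy_document.get("Statement", [])
    if isinstance(statements, dict):  # a single statement may not be wrapped in a list
        statements = [statements]

    allowed = set()
    for statement in statements:
        if statement.get("Effect") != "Allow":
            continue
        actions = statement.get("Action", [])
        if isinstance(actions, str):
            actions = [actions]
        for action in actions:
            if action == "*":
                return []  # full wildcard covers every service
            allowed.add(action.split(":", 1)[0].lower())

    return [service for service in services if service not in allowed]
```

A non-empty result points at services to review first. An empty result does not prove the operator has enough access, since a statement may allow only a few actions of a service.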

Issue: IRSA Not Working

Symptoms:

Unable to locate credentials
NoCredentialProviders: no valid providers in chain

Diagnosis:

# Check ServiceAccount annotation
kubectl get sa nexus-operator -n nexus-system -o yaml | grep eks.amazonaws.com

# Verify OIDC provider
aws iam list-open-id-connect-providers

# Check IAM role trust policy
aws iam get-role --role-name NexusAIOperatorRole --query 'Role.AssumeRolePolicyDocument'

Solution:

  1. Ensure OIDC provider is associated:

    eksctl utils associate-iam-oidc-provider --cluster <cluster-name> --approve
    

  2. Verify ServiceAccount annotation matches IAM role ARN
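The trust policy printed by the get-role command above can also be checked for the operator's ServiceAccount. The sketch below assumes the standard IRSA subject format system:serviceaccount:<namespace>:<name>; trust_policy_allows is a hypothetical helper, and its StringLike matching via fnmatch only approximates IAM wildcard rules.

```python
import fnmatch


def trust_policy_allows(trust_policy: dict,
                        namespace: str = "nexus-system",
                        service_account: str = "nexus-operator") -> bool:
    """Return True if a statement lets the ServiceAccount assume the role via OIDC."""
    expected = f"system:serviceaccount:{namespace}:{service_account}"
    statements = trust_policy.get("Statement", [])
    if isinstance(statements, dict):
        statements = [statements]

    for statement in statements:
        if statement.get("Effect") != "Allow":
            continue
        actions = statement.get("Action", [])
        if isinstance(actions, str):
            actions = [actions]
        if "sts:AssumeRoleWithWebIdentity" not in actions:
            continue

        conditions = statement.get("Condition", {})
        for operator in ("StringEquals", "StringLike"):
            for key, value in conditions.get(operator, {}).items():
                if not key.endswith(":sub"):  # key looks like <oidc-provider>:sub
                    continue
                values = [value] if isinstance(value, str) else value
                for candidate in values:
                    if operator == "StringEquals" and candidate == expected:
                        return True
                    if operator == "StringLike" and fnmatch.fnmatchcase(expected, candidate):
                        return True
    return False
```

A False result usually means the namespace or ServiceAccount name in the trust policy differs from nexus-system and nexus-operator. The helper does not check that the OIDC provider in the policy belongs to this cluster.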

Issue: Capability Stuck in Deleting

Symptoms:

  • Capability CR won't delete
  • Stuck with finalizers

Solution:

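# Remove the finalizers. This skips the operator's cleanup, so AWS resources
# created for the capability may be left behind and need manual deletion.
kubectl patch nexuscapability <name> -n <namespace> \
  -p '{"metadata":{"finalizers":null}}' --type=merge

# Force delete if the CR is still present
kubectl delete nexuscapability <name> -n <namespace> --force --grace-period=0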
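Before removing finalizers, it can be worth confirming that a finalizer is what is actually blocking deletion. A small sketch, assuming the input is the object printed by kubectl get nexuscapability <name> -n <namespace> -o json; blocking_finalizers is a hypothetical helper name.

```python
def blocking_finalizers(resource: dict) -> list:
    """Return the finalizers holding back a resource that is marked for deletion."""
    metadata = resource.get("metadata", {})
    # Without a deletionTimestamp, no delete has been requested yet,
    # so finalizers are not blocking anything.
    if "deletionTimestamp" not in metadata:
        return []
    return list(metadata.get("finalizers", []))
```

If the list is empty while the CR still exists, the deletion is not being held by a finalizer, and the patch above will not help.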

Issue: LoadBalancer Not Getting External IP

Symptoms:

  • Service stuck in <pending> for external IP
  • No load balancer created

Diagnosis:

kubectl describe service <service-name> -n <namespace>
kubectl get events -n <namespace> | grep LoadBalancer

Common causes:

Cause                       Solution
Missing AWS LB Controller   Install AWS Load Balancer Controller
Subnet tags missing         Add required Kubernetes tags to subnets
IAM permissions             Check LB controller role permissions
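For the subnet tag cause, the AWS Load Balancer Controller discovers subnets through the kubernetes.io/role/elb tag (internet-facing) or kubernetes.io/role/internal-elb tag (internal), with a value of 1 or an empty value. The sketch below checks for them; subnets_missing_role_tag is a hypothetical helper, and its input is assumed to be the Subnets list from aws ec2 describe-subnets --output json.

```python
def subnets_missing_role_tag(subnets: list, internal: bool = False) -> list:
    """Return the IDs of subnets that lack the load balancer role tag."""
    tag_key = "kubernetes.io/role/internal-elb" if internal else "kubernetes.io/role/elb"
    missing = []
    for subnet in subnets:
        # describe-subnets reports tags as a list of {"Key": ..., "Value": ...}.
        tags = {tag["Key"]: tag["Value"] for tag in subnet.get("Tags", [])}
        if tags.get(tag_key) not in ("1", ""):
            missing.append(subnet["SubnetId"])
    return missing
```

Pass internal=True when the Service requests an internal load balancer. Subnets reported here are candidates for tagging, provided they are the ones the load balancer should use.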

Advanced Troubleshooting

Enable Debug Logging

# Set debug level
kubectl -n nexus-system set env deployment/nexus-operator LOG_LEVEL=DEBUG

# View debug logs
kubectl -n nexus-system logs -f deployment/nexus-operator

Check AWS API Calls

CloudTrail records the AWS API calls made by the operator role. With CloudTrail enabled, review recent events for denied or failed calls.
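One way to review recent calls is to filter the output of aws cloudtrail lookup-events --output json for events that carry an error code. The sketch below assumes that output shape, where each event has a CloudTrailEvent field holding a JSON string; failed_calls is a hypothetical helper.

```python
import json


def failed_calls(lookup_output: dict) -> list:
    """Return (eventSource, eventName, errorCode) for each failed API call."""
    failures = []
    for event in lookup_output.get("Events", []):
        # The full record is embedded as a JSON string.
        detail = json.loads(event.get("CloudTrailEvent") or "{}")
        if "errorCode" in detail:
            failures.append(
                (detail.get("eventSource"), detail.get("eventName"), detail["errorCode"])
            )
    return failures
```

The lookup-events command covers management events only, so data-plane calls such as S3 object reads will not appear unless a trail is configured to record them.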

Verify Network Connectivity

# From the operator pod: a successful STS call confirms both network egress and AWS credentials
kubectl -n nexus-system exec -it deploy/nexus-operator -- python -c "
import boto3
sts = boto3.client('sts')
print(sts.get_caller_identity())
"

Check Resource Quotas

kubectl describe resourcequota -n nexus-system
kubectl describe limitrange -n nexus-system

Recovery Procedures

Recover from Failed Deployment

# 1. Check what failed
kubectl describe nexuscapability <name>

# 2. Delete the failed capability
kubectl delete nexuscapability <name>

# 3. Clean up any partial AWS resources (rb --force also deletes every object in the bucket)
aws dynamodb delete-table --table-name <capability>-<env>-transformation-system
aws s3 rb s3://<capability>-<env>-call-data --force

# 4. Redeploy
kubectl apply -f capability.yaml

Recover from Operator Crash

# 1. Check crash reason
kubectl -n nexus-system logs deployment/nexus-operator --previous

# 2. Restart operator
kubectl -n nexus-system rollout restart deployment/nexus-operator

# 3. If persistent, redeploy
./operator-nexus-dev.sh delete
./build-and-deploy.sh

Reset Operator State

# 1. Delete operator
./operator-nexus-dev.sh delete

# 2. Clean up CRs manually
kubectl delete nexuscapabilities --all -A

# 3. Redeploy
./build-and-deploy.sh

Getting Help

Gather Information

Before reporting an issue, gather:

  1. Operator status: ./operator-nexus-dev.sh status
  2. Operator logs: kubectl -n nexus-system logs deployment/nexus-operator
  3. Capability description: kubectl describe nexuscapability <name>
  4. Events: kubectl get events -A --sort-by='.lastTimestamp'
  5. AWS credentials: aws sts get-caller-identity
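The items above can be collected into a single report to attach to a support request. This is a minimal sketch, not an official tool: the names COMMANDS, run_command and gather are hypothetical, and the capability description is left out because it needs a resource name.

```python
import subprocess

# Commands taken from the list above, keyed by a short section title.
COMMANDS = {
    "operator-status": ["./operator-nexus-dev.sh", "status"],
    "operator-logs": ["kubectl", "-n", "nexus-system", "logs", "deployment/nexus-operator"],
    "events": ["kubectl", "get", "events", "-A", "--sort-by=.lastTimestamp"],
    "aws-identity": ["aws", "sts", "get-caller-identity"],
}


def run_command(argv: list) -> str:
    """Run one command and return its combined output, or a note if it cannot run."""
    try:
        result = subprocess.run(argv, capture_output=True, text=True, timeout=60)
    except (OSError, subprocess.TimeoutExpired) as exc:
        return f"[could not run: {exc}]"
    return result.stdout + result.stderr


def gather(commands: dict = COMMANDS, runner=run_command) -> str:
    """Return one text report with a titled section per command."""
    sections = []
    for title, argv in commands.items():
        sections.append(f"===== {title}: {' '.join(argv)} =====\n{runner(argv)}")
    return "\n".join(sections)
```

Calling gather() and writing the result to a file gives one attachment. Review it before sharing, since logs and events can contain account IDs and resource names.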

Support Contacts

For internal support, contact the platform team with:

  • Cluster name and region
  • Error messages
  • Steps to reproduce
  • Diagnostic information gathered above

