Troubleshooting

Common issues and solutions for container deployment and management on the CLI App Platform.

Deployment Issues

Image Pull Errors

Symptoms: Container stuck in "Pending" or "ImagePullBackOff" state

Common Causes:

Incorrect image name or tag
Image doesn't exist in the specified registry
Private registry authentication issues
Network connectivity problems

Solutions:

Verify Image Name: Check image name and tag spelling
Test Image Access: Verify image exists and is accessible
Check Registry Credentials: Ensure authentication is properly configured
Network Connectivity: Verify cluster can reach the registry

Example Debug Commands:

# Test image pull locally
docker pull your-image:tag

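# Check that the image and tag exist in the registry
# (docker search only matches repository names on Docker Hub)
docker manifest inspect your-image:tag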
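
A missing tag is easy to overlook when checking spelling, because Docker assumes the tag latest when none is given. The sketch below shows one way to print the tag a reference would resolve to. image_tag is a hypothetical helper, not a platform command, and it assumes plain name:tag references; digest references are only flagged, not parsed.

```shell
# Hypothetical helper: print the tag part of an image reference.
# Assumes plain "registry/name:tag" references; digest references
# (name@sha256:...) are reported as "digest" rather than parsed.
image_tag() {
    ref=$1
    case "$ref" in
        *@*) echo "digest"; return 0 ;;
    esac
    # Look only at the last path component so that a registry port
    # (registry:5000/app) is not mistaken for a tag.
    last=${ref##*/}
    case "$last" in
        *:*) echo "${last##*:}" ;;
        *)   echo "latest" ;;
    esac
}
```

For example, image_tag your-image prints latest, which is a hint to check whether that tag was actually pushed.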

Resource Constraint Errors

Symptoms: Pod stuck in "Pending" state with scheduling failures

Common Causes:

Insufficient CPU or memory in the cluster
Resource requests exceed node capacity
Node taints preventing scheduling
Storage provisioning failures

Solutions:

Check Cluster Capacity: Review available resources in dashboard
Reduce Resource Requests: Lower CPU/memory requirements
Review Node Taints: Ensure tolerations match node taints
Scale Cluster: Add more nodes if consistently resource-constrained

Diagnostic Steps:

Check pod events in the Overview section
Review cluster resource utilization
Verify node availability and capacity
Check for node taints and scheduling constraints
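
Comparing a resource request with node capacity is simple arithmetic once both values use the same unit. The following is a minimal sketch, assuming Kubernetes-style memory quantities with Mi or Gi suffixes. The function names and sample values are illustrative and not part of the platform.

```shell
# Convert a memory quantity such as "512Mi" or "2Gi" to whole MiB.
# Only the Mi and Gi suffixes are handled in this sketch.
to_mib() {
    case "$1" in
        *Gi) echo $(( ${1%Gi} * 1024 )) ;;
        *Mi) echo "${1%Mi}" ;;
        *)   echo "unsupported quantity: $1" >&2; return 1 ;;
    esac
}

# Succeed when the request (first argument) fits within the
# available capacity (second argument).
fits_on_node() {
    request=$(to_mib "$1") || return 2
    available=$(to_mib "$2") || return 2
    [ "$request" -le "$available" ]
}
```

A call such as fits_on_node 2Gi 1536Mi || echo "request exceeds capacity" makes the mismatch explicit before you redeploy.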

Configuration Errors

Symptoms: Container fails to start or crashes immediately

Common Causes:

Invalid environment variables
Missing required configuration
Incorrect volume mount paths
Application configuration errors

Solutions:

Review Container Logs: Check startup logs for error messages
Validate Configuration: Verify all environment variables and mounts
Test Locally: Test configuration with local container runtime
Minimal Configuration: Start with minimal config and add incrementally
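
Missing or empty environment variables can be caught before the application starts. The sketch below could run at the top of an entrypoint script. The variable names in the usage note are hypothetical; substitute the ones your application requires.

```shell
# Report any required environment variables that are unset or empty.
# Variable names are passed as arguments and must be valid shell
# identifiers, because they are expanded with eval.
check_required_env() {
    missing=0
    for name in "$@"; do
        # Indirect lookup of the variable called $name (POSIX-compatible).
        eval "value=\${$name-}"
        if [ -z "$value" ]; then
            echo "missing: $name"
            missing=1
        fi
    done
    return "$missing"
}
```

For instance, check_required_env APP_PORT APP_DB_URL || exit 1 stops the container with a clear message instead of a later, harder-to-read application error.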

Runtime Issues

Container Crashes and Restarts

Symptoms: High restart count, container status cycling

Common Causes:

Application bugs or exceptions
Memory limits exceeded (OOMKilled)
Health check failures
External dependency failures

Solutions:

Analyze Logs: Review container logs for error patterns
Check Resource Usage: Monitor CPU and memory consumption
Review Health Checks: Ensure health check endpoints are responsive
Test Dependencies: Verify external services are accessible
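
When reviewing logs for error patterns, a quick count of failure lines can show whether a restart loop has a consistent cause. This is a rough sketch that works on a saved log file. The keyword list is an assumption and should be adapted to your application's log format.

```shell
# Count lines in a saved log file that match common failure keywords.
# The keyword list is an assumption; adjust it to your log format.
count_log_errors() {
    grep -ciE 'error|exception|fatal|panic' "$1"
}
```

Running it against logs captured before and after a restart gives a simple way to compare how noisy each run was.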

Memory-Specific Issues:

Symptoms: Container terminated with reason OOMKilled in its status or events
Solutions: Increase memory limits or optimize application memory usage
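
To judge how close a container is to its limit, it helps to express usage as a percentage of that limit. The sketch below does only the arithmetic, with both values given in bytes. mem_percent is an illustrative name, and where the two numbers come from depends on your environment.

```shell
# Print memory usage as a whole-number percentage of the limit.
# Both arguments are byte counts; a non-numeric or zero limit
# (for example "max", meaning no limit) is rejected.
mem_percent() {
    used=$1
    limit=$2
    case "$limit" in
        ''|*[!0-9]*|0) echo "no numeric limit: $limit" >&2; return 1 ;;
    esac
    echo $(( $used * 100 / $limit ))
}
```

On hosts using cgroup v2, the values can usually be read inside the container from /sys/fs/cgroup/memory.current and /sys/fs/cgroup/memory.max; treat those paths as an assumption about your runtime.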

Performance Issues

Symptoms: Slow response times, high resource utilization

Common Causes:

Insufficient resource allocation
Inefficient application code
Database or external service bottlenecks
Network latency issues

Solutions:

Monitor Metrics: Use performance metrics to identify bottlenecks
Scale Resources: Increase CPU/memory allocation as needed
Profile Application: Use application profiling tools
Optimize Queries: Review and optimize database queries

Network Connectivity Issues

Symptoms: Unable to access application, connection timeouts

Common Causes:

Incorrect port configuration
Network policy restrictions
Load balancer configuration issues
DNS resolution problems

Solutions:

Verify Port Configuration: Check exposed ports match application
Test Internal Connectivity: Use console to test from within cluster
Check Network Policies: Review network security policies
Validate DNS: Ensure service discovery is working correctly

Storage Issues

Volume Mount Failures

Symptoms: Container fails to start with mount errors

Common Causes:

Invalid mount paths
Volume doesn't exist
Permission issues
Storage class problems

Solutions:

Check Mount Paths: Verify mount paths are valid and don't conflict
Verify Volumes: Ensure referenced volumes exist
Review Permissions: Check file system permissions
Storage Class: Verify storage class is available and accessible
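
The path and permission checks above can be run from the container console. A minimal sketch follows; check_mount_path is a hypothetical helper, and the path you pass should be the mount path from your own configuration.

```shell
# Report whether a mount path exists as a directory and is writable
# by the current user.
check_mount_path() {
    path=$1
    if [ ! -d "$path" ]; then
        echo "not a directory: $path"
        return 1
    fi
    if [ ! -w "$path" ]; then
        echo "not writable: $path"
        return 1
    fi
    echo "ok: $path"
}
```

Note that the writable check reflects the user the shell runs as, so run it as the same user as the application where possible.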

Persistent Volume Issues

Symptoms: Data loss, volume not available

Common Causes:

Volume not properly configured as persistent
Storage backend failures
Incorrect volume size
Backup/restore issues

Solutions:

Verify PVC Configuration: Check persistent volume claim settings
Monitor Storage Usage: Check available storage space
Test Backup/Restore: Verify backup procedures work correctly
Check Storage Backend: Ensure storage infrastructure is healthy

Security Issues

Secret Access Problems

Symptoms: Application can't access required secrets

Common Causes:

Secret doesn't exist
Incorrect secret name or key
Permission issues
Secret not mounted correctly

Solutions:

Verify Secret Exists: Check secret is created and accessible
Check Mount Configuration: Verify secret mount configuration
Review Permissions: Ensure service account has access
Test Secret Access: Use console to verify secret is available

Permission Denied Errors

Symptoms: Application fails with permission errors

Common Causes:

Security context restrictions
File system permissions
Service account permissions
Network policy restrictions

Solutions:

Review Security Context: Check security context configuration
Check File Permissions: Verify file and directory permissions
Service Account: Ensure service account has required permissions
Network Policies: Review network access policies

Monitoring and Logging Issues

Missing Logs

Symptoms: No logs appearing in dashboard

Common Causes:

Application not writing to stdout/stderr
Log level configuration issues
Logging system problems
Container not running

Solutions:

Check Application Logging: Ensure app writes to stdout/stderr
Verify Log Levels: Check log level configuration
Container Status: Ensure container is running
Test Log Output: Use console to verify log generation

Metrics Not Available

Symptoms: Performance metrics not showing

Common Causes:

Metrics collection not enabled
Application not exposing metrics
Network connectivity issues
Monitoring system problems

Solutions:

Enable Metrics Collection: Ensure metrics are enabled
Check Metrics Endpoints: Verify application exposes metrics
Network Connectivity: Test connectivity to metrics endpoints
Refresh Dashboard: Use refresh button to update metrics

Platform-Specific Issues

Dashboard Access Problems

Symptoms: Cannot access container management interface

Common Causes:

Authentication issues
Browser compatibility
Network connectivity
Platform maintenance

Solutions:

Refresh Session: Log out and log back in
Clear Browser Cache: Clear browser cache and cookies
Try Different Browser: Test with different web browser
Check Platform Status: Verify platform is operational

Console Connection Issues

Symptoms: Cannot connect to container console

Common Causes:

Container not running
Shell not available in container
Network connectivity issues
Browser WebSocket support

Solutions:

Check Container Status: Ensure container is running
Verify Shell: Ensure bash/sh is available in container
Browser Compatibility: Check WebSocket support
Network Configuration: Review firewall and proxy settings

Diagnostic Tools and Commands

Container Inspection

# Check container processes
ps aux

# Check disk usage
df -h

# Check memory usage
free -h

# Check network configuration
ip addr show

# Test network connectivity
ping -c 4 google.com
nslookup service-name

Application Debugging

# Check application configuration
env | grep APP_

# Test application endpoints
curl http://localhost:8080/health

# Check application logs
tail -f /var/log/app.log

# Check file permissions
ls -la /app/

Resource Monitoring

# Monitor CPU usage
top

# Monitor memory usage
watch -n 1 free -h

# Monitor disk I/O
iostat 1

# Monitor network traffic
netstat -i

When to Contact Support

Contact platform support when:

Platform Issues: Dashboard or platform functionality problems
Persistent Problems: Issues that persist despite troubleshooting
Security Concerns: Suspected security issues or breaches
Performance Degradation: Unexplained performance problems
Data Loss: Issues with data persistence or backups

Information to Provide

When contacting support, include:

Container Details: Name, ID, and deployment configuration
Error Messages: Exact error messages and logs
Timeline: When the issue started and any recent changes
Steps Taken: Troubleshooting steps already attempted
Environment: Details about your deployment environment
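
Gathering this information is quicker with a small script run from the container console. The sketch below writes the output of a few standard commands to one file. collect_diagnostics is a hypothetical helper, and environment variables are deliberately left out because they may contain secrets.

```shell
# Collect basic diagnostics into one file to attach to a support request.
# Commands that are not installed in the container are noted and skipped.
collect_diagnostics() {
    out=$1
    : > "$out"
    for cmd in "date" "ps aux" "df -h" "free -h" "ip addr show"; do
        echo "=== $cmd ===" >> "$out"
        # The first word is the program name; check that it is installed.
        if command -v "${cmd%% *}" >/dev/null 2>&1; then
            $cmd >> "$out" 2>&1 || echo "(command failed)" >> "$out"
        else
            echo "(not available)" >> "$out"
        fi
    done
}
```

Review the resulting file before sharing it, since process listings can also reveal sensitive command-line arguments.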

Emergency Procedures

For critical issues:

Immediate Response: Stop affected containers if necessary
Isolate Impact: Prevent issue from spreading
Document Everything: Capture logs and system state
Contact Support: Escalate to platform support immediately
Communication: Notify stakeholders of impact and status
