Error Handling and Recovery
Dagu provides comprehensive error handling capabilities to ensure your DAGs are resilient and can recover from failures gracefully. This guide covers all aspects of error handling, including retry policies, continue conditions, lifecycle handlers, and email notifications.
Table of Contents
- Retry Policies
- Continue On Conditions
- Lifecycle Handlers
- Email Notifications
- Cleanup and Timeouts
- Signal Handling
- Error Handling Patterns
- Complete Example
Retry Policies
Automatically retry failed steps with configurable limits and intervals.
Basic Retry Configuration
```yaml
steps:
  - name: flaky-api-call
    command: curl https://api.example.com/data
    retryPolicy:
      limit: 3        # Retry up to 3 times
      intervalSec: 5  # Wait 5 seconds between retries
```
Retry with Specific Exit Codes
Only retry when specific exit codes are returned:
```yaml
steps:
  - name: api-call
    command: make-api-request
    retryPolicy:
      limit: 3
      intervalSec: 30
      exitCodes: [429, 503]  # Only retry on rate limit (429) or service unavailable (503)
```
If `exitCodes` is not specified, any non-zero exit code triggers a retry. When specified, only the listed exit codes trigger retries.
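For contrast, here is a minimal sketch of both behaviors side by side; `flaky-job.sh` and exit code 75 are placeholders, not part of any real API:

```yaml
steps:
  - name: retry-on-any-failure
    command: flaky-job.sh   # hypothetical script
    retryPolicy:
      limit: 2
      intervalSec: 5        # no exitCodes: any non-zero exit triggers a retry

  - name: retry-on-specific-codes
    command: flaky-job.sh   # hypothetical script
    retryPolicy:
      limit: 2
      intervalSec: 5
      exitCodes: [75]       # only exit code 75 retries; any other failure fails immediately
```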
Advanced Retry Example
```yaml
steps:
  - name: database-operation
    command: psql -c "INSERT INTO table VALUES (...)"
    retryPolicy:
      limit: 5
      intervalSec: 10
      exitCodes: [1, 2]  # Retry only on connection errors
    continueOn:
      exitCode: [3]      # Continue if data already exists
```
Continue On Conditions
Control workflow execution flow when steps fail or meet certain conditions.
Continue on Failure
```yaml
steps:
  - name: optional-cleanup
    command: rm -f /tmp/cache/*
    continueOn:
      failure: true  # Continue even if cleanup fails
```
Continue on Skipped Steps
```yaml
steps:
  - name: conditional-task
    command: process-data.sh
    preconditions:
      - condition: "`date +%u`"
        expected: "1"  # Only run on Mondays
    continueOn:
      skipped: true    # Continue if precondition not met
```
Continue Based on Exit Codes
```yaml
steps:
  - name: check-status
    command: check-system-status.sh
    continueOn:
      exitCode: [0, 1, 2]  # Continue on success (0) or known warnings (1, 2)
```
Continue Based on Output
```yaml
steps:
  - name: data-processing
    command: process.py
    continueOn:
      output: "warning"  # Continue if output contains "warning"
```
Continue with Regular Expressions
```yaml
steps:
  - name: validation
    command: validate-data.sh
    continueOn:
      output: "re:WARN.*|INFO.*"  # Continue if output matches warning or info patterns
```
Multiple Output Conditions
```yaml
steps:
  - name: complex-task
    command: complex-operation.sh
    continueOn:
      output:
        - "partially complete"
        - "re:processed [0-9]+ of [0-9]+ items"
        - "skipping optional step"
```
Mark as Success
Override the step status to success even if it fails:
```yaml
steps:
  - name: best-effort-task
    command: optional-optimization.sh
    continueOn:
      failure: true
      markSuccess: true  # Mark as successful even if it fails
```
Lifecycle Handlers
React to DAG state changes with custom handlers.
DAG-Level Handlers
```yaml
handlerOn:
  success:
    command: |
      curl -X POST https://api.example.com/notify \
        -d "status=success&dag=${DAG_NAME}"
  failure:
    command: |
      echo "DAG failed at $(date)" >> /var/log/failures.log
      ./alert-oncall.sh "${DAG_NAME} failed"
  cancel:
    command: cleanup-resources.sh
  exit:
    command: |
      # Always runs, regardless of success/failure
      rm -rf /tmp/dag-${DAG_RUN_ID}
      echo "Finished ${DAG_NAME}"

steps:
  - name: main-task
    command: process-data.sh
```
Handler with Email Notification
```yaml
handlerOn:
  failure:
    executor:
      type: mail
      config:
        to: [email protected]
        from: [email protected]
        subject: "DAG Failed: ${DAG_NAME}"
        message: |
          The DAG ${DAG_NAME} failed (run ID: ${DAG_RUN_ID}).
          Please check the logs at: ${DAG_RUN_LOG_FILE}
```
Handler Execution Order
- Step execution completes
- Step-level handlers run (if any)
- DAG determines final status
- DAG-level handlers run in this order: `success`, `failure`, or `cancel` (based on the final status), then `exit` (always runs last)
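A minimal sketch (using plain `echo` commands) that makes this order visible in the logs:

```yaml
handlerOn:
  success:
    command: echo "2. runs because the DAG succeeded"
  failure:
    command: echo "2. runs because the DAG failed"
  exit:
    command: echo "3. exit handler always runs last"

steps:
  - name: only-step
    command: echo "1. step runs first"
```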
Email Notifications
Configure email alerts for workflow events.
Basic Email Configuration
```yaml
# SMTP configuration
smtp:
  host: "smtp.gmail.com"
  port: "587"
  username: "[email protected]"
  password: "${SMTP_PASSWORD}"

# Email notification triggers
mailOn:
  failure: true
  success: false

# Error email settings
errorMail:
  from: "[email protected]"
  to: "[email protected]"
  prefix: "[CRITICAL]"
  attachLogs: true

# Success email settings
infoMail:
  from: "[email protected]"
  to: "[email protected]"
  prefix: "[SUCCESS]"
  attachLogs: false
```
Step-Level Email Notifications
```yaml
smtp:
  host: "smtp.company.com"
  port: "587"
  username: "notifications"
  password: "${SMTP_PASS}"

steps:
  - name: critical-operation
    command: backup-database.sh
    mailOn:
      failure: true
      success: true

  - name: optional-task
    command: optimize-indexes.sh
    # No email notifications for this step
```
Email with Attachments
```yaml
steps:
  - name: generate-report
    command: create-report.py
    stdout: /tmp/report.txt

  - name: email-report
    executor:
      type: mail
      config:
        to: "[email protected]"
        from: "[email protected]"
        subject: "Daily Report - $(date +%Y-%m-%d)"
        message: |
          Please find attached the daily report.
          Generated at: $(date)
        attachments:
          - /tmp/report.txt
          - /tmp/summary.csv  # assumed to be produced by an earlier process
```
Dynamic Email Recipients
```yaml
params:
  - ONCALL_EMAIL: "[email protected]"

steps:
  - name: get-oncall
    # Fall back to the default address from params if the lookup fails
    command: ./get-current-oncall.sh || echo "${ONCALL_EMAIL}"
    output: CURRENT_ONCALL

  - name: critical-task
    command: critical-operation.sh
    continueOn:
      failure: true

  - name: notify-on-failure
    command: echo "Task failed"
    depends: critical-task
    preconditions:
      - condition: "${critical-task.exit_code}"
        expected: "re:[1-9][0-9]*"
    executor:
      type: mail
      config:
        to: "${CURRENT_ONCALL}"
        from: "[email protected]"
        subject: "Critical Task Failed"
        message: "Please investigate immediately"
```
Cleanup and Timeouts
Maximum Cleanup Time
Control how long Dagu waits for cleanup operations:
```yaml
maxCleanUpTimeSec: 300  # Wait up to 5 minutes for cleanup

steps:
  - name: long-running-task
    command: process-large-dataset.sh
    signalOnStop: SIGTERM  # Send SIGTERM on stop
```
DAG Timeout
Set a global timeout for the entire DAG:
```yaml
timeoutSec: 3600  # 1 hour timeout for the entire DAG

steps:
  - name: data-processing
    command: process.sh
```
Step-Level Cleanup
```yaml
steps:
  - name: create-temp-resources
    command: setup-environment.sh

  - name: main-process
    command: run-analysis.sh
    continueOn:
      failure: true

  - name: cleanup
    command: cleanup-environment.sh
    depends:
      - main-process
    # This step always runs due to continueOn in the previous step
```
Signal Handling
Custom Stop Signals
```yaml
steps:
  - name: graceful-shutdown
    command: long-running-service.sh
    signalOnStop: SIGTERM  # Default is SIGTERM

  - name: immediate-stop
    command: batch-processor.sh
    signalOnStop: SIGKILL

  - name: custom-signal
    command: |
      trap 'echo "Caught SIGUSR1"; cleanup' USR1
      # Long-running process
      while true; do
        process_item
        sleep 1
      done
    signalOnStop: SIGUSR1
```
Error Handling Patterns
Pattern 1: Retry with Exponential Backoff
While Dagu doesn't have built-in exponential backoff, you can approximate it with chained steps. Each retry step declares a precondition on the previous step's exit code; the pattern `re:[1-9][0-9]*` matches any non-zero exit code, so the step runs only after a failure:
```yaml
steps:
  - name: try-1
    command: api-call.sh
    continueOn:
      failure: true

  - name: try-2
    command: |
      sleep 2  # 2-second delay
      api-call.sh
    depends: try-1
    preconditions:
      - condition: "${try-1.exit_code}"
        expected: "re:[1-9][0-9]*"
    continueOn:
      failure: true

  - name: try-3
    command: |
      sleep 4  # 4-second delay
      api-call.sh
    depends: try-2
    preconditions:
      - condition: "${try-2.exit_code}"
        expected: "re:[1-9][0-9]*"
```
Pattern 2: Circuit Breaker
```yaml
env:
  - FAILURE_THRESHOLD: 3
  - FAILURE_COUNT_FILE: /tmp/circuit-breaker-${DAG_NAME}

steps:
  - name: check-circuit
    command: |
      if [ -f "${FAILURE_COUNT_FILE}" ]; then
        count=$(cat "${FAILURE_COUNT_FILE}")
        if [ "$count" -ge "${FAILURE_THRESHOLD}" ]; then
          echo "Circuit breaker OPEN - too many failures"
          exit 99
        fi
      fi
    continueOn:
      exitCode: [0, 99]

  - name: main-operation
    command: risky-operation.sh
    depends: check-circuit
    preconditions:
      - condition: "${check-circuit.exit_code}"
        expected: "0"
    continueOn:
      failure: true

  - name: update-circuit
    command: |
      if [ "${main-operation.exit_code}" != "0" ]; then
        # Increment the failure count
        count=$(cat "${FAILURE_COUNT_FILE}" 2>/dev/null || echo 0)
        echo $((count + 1)) > "${FAILURE_COUNT_FILE}"
      else
        # Reset on success
        rm -f "${FAILURE_COUNT_FILE}"
      fi
    depends: main-operation
```
Pattern 3: Graceful Degradation
```yaml
steps:
  - name: try-primary
    command: connect-primary-db.sh
    output: DB_RESULT
    continueOn:
      failure: true

  - name: try-secondary
    command: connect-secondary-db.sh
    depends: try-primary
    preconditions:
      - condition: "${try-primary.exit_code}"
        expected: "re:[1-9][0-9]*"
    output: DB_RESULT
    continueOn:
      failure: true

  - name: use-cache
    command: read-from-cache.sh
    depends: try-secondary
    preconditions:
      - condition: "${try-secondary.exit_code}"
        expected: "re:[1-9][0-9]*"
    output: DB_RESULT
```
Pattern 4: Compensating Transactions
```yaml
steps:
  - name: create-resources
    command: terraform apply -auto-approve
    continueOn:
      failure: true

  - name: run-tests
    command: integration-tests.sh
    depends: create-resources
    continueOn:
      failure: true

  - name: rollback-on-failure
    command: terraform destroy -auto-approve
    depends: run-tests
    preconditions:
      - condition: "${run-tests.exit_code}"
        expected: "re:[1-9][0-9]*"
```
Pattern 5: Health Check with Retry
```yaml
steps:
  - name: deploy-service
    command: kubectl apply -f service.yaml

  - name: wait-for-healthy
    command: |
      kubectl wait --for=condition=ready pod -l app=myapp --timeout=300s
    depends: deploy-service
    retryPolicy:
      limit: 5
      intervalSec: 30
      exitCodes: [1]  # Retry if pods are not ready yet
```
Complete Example
Here's a comprehensive example combining multiple error handling features:
```yaml
name: robust-data-pipeline
schedule: "0 2 * * *"
maxCleanUpTimeSec: 600
timeoutSec: 7200

smtp:
  host: "${SMTP_HOST}"
  port: "587"
  username: "${SMTP_USER}"
  password: "${SMTP_PASS}"

mailOn:
  failure: true
  success: false

errorMail:
  from: "[email protected]"
  to: "[email protected]"
  prefix: "[DATA-PIPELINE-FAILURE]"
  attachLogs: true

handlerOn:
  failure:
    command: |
      # Log the failure to the monitoring system
      curl -X POST https://monitoring.company.com/api/v1/events \
        -H "Content-Type: application/json" \
        -d '{
          "severity": "error",
          "service": "data-pipeline",
          "dag": "${DAG_NAME}",
          "run_id": "${DAG_RUN_ID}",
          "timestamp": "'$(date -u +%Y-%m-%dT%H:%M:%SZ)'"
        }'
  exit:
    command: |
      # Clean up temporary files
      rm -rf /tmp/pipeline-${DAG_RUN_ID}
      # Release the distributed lock if held
      ./release-lock.sh pipeline-lock

steps:
  - name: acquire-lock
    command: ./acquire-lock.sh pipeline-lock
    retryPolicy:
      limit: 5
      intervalSec: 60
      exitCodes: [1]  # Retry if the lock is busy

  - name: extract-data
    command: extract-from-source.py
    output: EXTRACT_RESULT
    retryPolicy:
      limit: 3
      intervalSec: 30
      exitCodes: [111, 112]  # Network errors
    signalOnStop: SIGTERM

  - name: validate-data
    command: validate.py ${EXTRACT_RESULT}
    depends: extract-data
    continueOn:
      exitCode: [2]  # Continue on validation warnings
      markSuccess: true

  - name: transform-data
    command: transform.py
    depends: validate-data
    retryPolicy:
      limit: 2
      intervalSec: 10

  - name: load-primary
    command: load-to-primary-db.py
    depends: transform-data
    continueOn:
      failure: true

  - name: load-backup
    command: load-to-backup-db.py
    depends: load-primary  # must run after load-primary so its exit code is available
    preconditions:
      - condition: "${load-primary.exit_code}"
        expected: "re:[1-9][0-9]*"

  - name: verify-load
    command: verify-data-loaded.py
    depends:
      - load-primary
      - load-backup
    mailOn:
      failure: true
```
This example demonstrates:
- Distributed locking with retries
- Network-aware retry policies
- Validation with warning tolerance
- Primary/backup loading pattern
- Comprehensive cleanup in handlers
- Step and DAG-level email notifications
- Proper signal handling for graceful shutdown