Sub-Agents

Agents

Six specialist sub-agents, each handling a distinct layer of the stack. Pilot orchestrates them so you don't have to know which one to call.

HolmesGPT

Python

AI-powered incident investigation

HolmesGPT correlates logs, metrics, and distributed traces to surface root causes automatically during an incident. Instead of manually pivoting between Grafana, Loki, and your alerting tool, you ask HolmesGPT a question in plain English and it returns a structured diagnosis.

It integrates natively with PagerDuty, Opsgenie, Prometheus, and Loki. When an alert fires, the SRE Guard daemon can invoke HolmesGPT automatically, so the investigation runs before a human is even paged.

All findings are written in a structured, runbook-compatible format so that OpenSRE can pick up remediation automatically.

Key Capabilities

Natural-language incident queries against live observability data
Automatic root-cause hypothesis ranking with confidence scores
PagerDuty / Opsgenie alert enrichment
Prometheus + Loki + Jaeger integration
Structured JSON output compatible with OpenSRE runbooks
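
To make the confidence ranking and JSON output concrete, here is a minimal sketch of how a consumer might filter and order hypotheses. The payload shape (`hypotheses`, `cause`, `confidence`) and the `rank_hypotheses` helper are assumptions made for illustration; the actual HolmesGPT output schema is not specified on this page.

```python
import json

# Hypothetical diagnosis payload; the real HolmesGPT schema may differ.
SAMPLE = """
{
  "question": "Why are pods in the payment-api namespace crash-looping?",
  "hypotheses": [
    {"cause": "OOMKilled: memory limit too low", "confidence": 0.82},
    {"cause": "Failing readiness probe", "confidence": 0.35},
    {"cause": "Bad config rollout", "confidence": 0.61}
  ]
}
"""


def rank_hypotheses(payload: str, min_confidence: float = 0.5) -> list:
    """Return hypotheses at or above min_confidence, highest confidence first."""
    doc = json.loads(payload)
    kept = [h for h in doc["hypotheses"] if h["confidence"] >= min_confidence]
    return sorted(kept, key=lambda h: h["confidence"], reverse=True)


if __name__ == "__main__":
    for hypothesis in rank_hypotheses(SAMPLE):
        print(f'{hypothesis["confidence"]:.2f}  {hypothesis["cause"]}')
```

A threshold like `min_confidence` is one plausible way to keep low-confidence guesses out of an automated remediation path; the value 0.5 is arbitrary.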

Installation

bash

cd holmesgpt
pip install -r requirements.txt
export HOLMES_API_KEY="<your-key>"   # replace with your API key

Usage

bash

python3 holmes.py ask \
  "Why are pods in the payment-api namespace crash-looping?"

Invoke from Claude Code: /holmesgpt "your incident question"

kagent

Kubernetes-native AI agent for fleet operations

kagent is a Kubernetes operator that embeds an AI planner directly into your cluster. It watches deployments, monitors for anomalies, and can execute corrective actions — rolling restarts, HPA adjustments, node cordon/drain — without leaving the cluster.

Unlike kubectl-based scripts, kagent works from intent: you describe the desired outcome (e.g. 'drain node ip-10-0-1-5 with zero downtime'), and it generates a safe action sequence and asks for confirmation before executing in production.

kagent exposes a CRD-based API so all operations are GitOps-friendly and auditable.

Key Capabilities

CRD-based intent API (KAgentTask, KAgentPolicy)
Safe drain/cordon with PDB awareness
Automated rollout progression and canary promotion
Multi-cluster fleet operations via kubeconfig federation
Dry-run mode with diff output before any mutation
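
As an illustration of what "PDB awareness" involves, the sketch below computes how many pods a PodDisruptionBudget currently allows to be evicted, following the usual Kubernetes rule (healthy pods minus the required healthy count, floored at zero). The `Budget` class and both function names are hypothetical, only integer budgets are handled (not percentages), and nothing here is taken from kagent's actual implementation.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Budget:
    """Simplified view of a PodDisruptionBudget and the pods it selects."""
    expected_pods: int
    current_healthy: int
    min_available: Optional[int] = None
    max_unavailable: Optional[int] = None


def disruptions_allowed(budget: Budget) -> int:
    """Number of pods that may be evicted right now without violating the budget."""
    if (budget.min_available is None) == (budget.max_unavailable is None):
        raise ValueError("set exactly one of min_available or max_unavailable")
    if budget.min_available is not None:
        desired_healthy = budget.min_available
    else:
        desired_healthy = budget.expected_pods - budget.max_unavailable
    return max(0, budget.current_healthy - desired_healthy)


def safe_to_evict_together(budget: Budget, pods_on_node: int) -> bool:
    """True if every covered pod on the node can be evicted in one pass."""
    return pods_on_node <= disruptions_allowed(budget)


if __name__ == "__main__":
    payment_api = Budget(expected_pods=3, current_healthy=3, min_available=2)
    print(disruptions_allowed(payment_api))
    print(safe_to_evict_together(payment_api, pods_on_node=2))
```

When the check fails, a drain would typically evict one pod, wait for its replacement to become healthy, and then re-evaluate the budget.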

Installation

bash

helm repo add kagent https://kagent.dev/charts
helm install kagent kagent/kagent \
  -n kagent-system --create-namespace

Usage

bash

kubectl apply -f - <<EOF
apiVersion: kagent.dev/v1
kind: KAgentTask
metadata:
  name: drain-node
spec:
  intent: "Drain ip-10-0-1-5 with zero disruption to the payment-api deployment"
EOF

Invoke from Claude Code: /kagent (runs fleet operation tasks interactively)

Nightshift

Cost optimization scheduler

Nightshift scales down non-production Kubernetes workloads on a cron schedule and restores them before business hours. A typical dev/staging cluster running 8×5 instead of 24×7 can save roughly 60–70% of compute cost, depending on how much of the cluster actually scales down, with no manual effort once the schedule is in place.
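
The arithmetic behind the 60–70% figure above can be checked with a short calculation. This is a sketch that counts running hours only; it assumes cost is proportional to uptime, which overstates real savings because storage, control-plane fees, and any always-on nodes keep running. The function name is illustrative.

```python
HOURS_PER_WEEK = 24 * 7  # 168


def savings_fraction(hours_per_day: float, days_per_week: int) -> float:
    """Fraction of weekly running hours removed compared with 24x7."""
    on_hours = hours_per_day * days_per_week
    return 1 - on_hours / HOURS_PER_WEEK


# 8 hours x 5 days: upper bound on savings by hours alone
print(f"{savings_fraction(8, 5):.1%}")   # 76.2%

# 12 hours x 5 days, e.g. up at 08:00 and down at 20:00 on weekdays
print(f"{savings_fraction(12, 5):.1%}")  # 64.3%
```

An 8×5 schedule removes about 76% of running hours, so a realised saving in the 60–70% range is consistent once fixed costs and longer daily windows are taken into account.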

It integrates with Pilot's KEDA scale-to-zero pattern for maximum savings: Nightshift handles overnight shutdown while KEDA handles intra-day idle scale-to-zero. The two are additive.

Nightshift stores intended replica counts as annotations before scaling down, so it always restores to the exact pre-shutdown state — including HPA min/max values.
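
The save-then-restore pattern described above can be sketched with plain dictionaries standing in for Kubernetes workloads. The annotation key `nightshift.io/original-replicas` and both function names are hypothetical; this page does not document the key Nightshift actually uses, and HPA values are left out for brevity.

```python
# Hypothetical annotation key, chosen for illustration only.
SAVED_REPLICAS_KEY = "nightshift.io/original-replicas"


def scale_down(workload: dict) -> dict:
    """Record the current replica count in an annotation, then scale to zero."""
    annotations = dict(workload.get("annotations", {}))
    # Record only once, so a repeated scale-down cannot overwrite the saved value with 0.
    annotations.setdefault(SAVED_REPLICAS_KEY, str(workload["replicas"]))
    return {**workload, "annotations": annotations, "replicas": 0}


def scale_up(workload: dict) -> dict:
    """Restore the recorded replica count and remove the annotation."""
    annotations = dict(workload.get("annotations", {}))
    saved = annotations.pop(SAVED_REPLICAS_KEY, None)
    if saved is None:
        return workload  # nothing was recorded, so leave the workload untouched
    return {**workload, "annotations": annotations, "replicas": int(saved)}


if __name__ == "__main__":
    deployment = {"name": "payment-api", "replicas": 3, "annotations": {}}
    asleep = scale_down(deployment)
    print(asleep)
    print(scale_up(asleep))
```

Writing the annotation only when it is absent is the detail that makes the operation idempotent: running the shutdown twice still restores to 3 replicas, not 0.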

Key Capabilities

Cron-based scale-down / scale-up schedules per namespace or label selector
Stores intended replica counts as annotations before shutdown
Respects PodDisruptionBudgets during shutdown sequencing
Slack / PagerDuty notification on each schedule event
Cost report: estimated monthly savings per workload

Installation

bash

helm repo add nightshift https://nightshift.dev/charts
helm install nightshift nightshift/nightshift \
  -n nightshift-system --create-namespace \
  --set schedule.timezone=Asia/Kolkata

Usage

bash

kubectl annotate namespace staging \
  nightshift.io/schedule="0 20 * * 1-5|0 8 * * 1-5"

Invoke from Claude Code: /nightshift (configures schedules interactively)

OpenSRE

Python + Node

Runbook automation engine

OpenSRE converts Markdown runbooks into executable incident workflows. Each runbook is a YAML-annotated Markdown file: human-readable for engineers, machine-executable for the automation layer.

When a runbook is triggered (by HolmesGPT, a PagerDuty webhook, or manually), OpenSRE selects the matching runbook, executes each step, and reports status back to your incident channel. Steps can include kubectl commands, API calls, database queries, and escalation logic.

All runbooks live in Git. Changes go through normal PR review. No proprietary runbook DSL to learn.
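
To show how a Markdown runbook can be both readable and machine-parseable, the sketch below separates the frontmatter block from the Markdown body. The field names (`name`, `severity`) are assumptions, and the parser handles only flat `key: value` lines; a real implementation would more likely use a YAML library. Nothing here reflects OpenSRE's actual loader.

```python
# Hypothetical runbook; the frontmatter fields are illustrative.
RUNBOOK = """\
---
name: pod-crashloop
severity: critical
---

# Pod crash loop

1. Check recent events in the namespace.
2. Inspect the previous container logs.
"""


def split_frontmatter(text: str) -> tuple:
    """Return (metadata, body). Metadata is empty if no frontmatter is found."""
    lines = text.splitlines()
    stripped = [line.strip() for line in lines]
    if not stripped or stripped[0] != "---":
        return {}, text
    try:
        end = stripped.index("---", 1)  # closing delimiter
    except ValueError:
        return {}, text
    metadata = {}
    for line in lines[1:end]:
        if ":" in line:
            key, _, value = line.partition(":")
            metadata[key.strip()] = value.strip()
    return metadata, "\n".join(lines[end + 1:])


if __name__ == "__main__":
    meta, body = split_frontmatter(RUNBOOK)
    print(meta)
    print(body.strip().splitlines()[0])
```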

Key Capabilities

Markdown-native runbook format with YAML frontmatter for metadata
Step execution: kubectl, bash, HTTP, SQL, PagerDuty escalate
Alert-to-runbook routing via label matchers
Dry-run mode: simulate runbook execution without side effects
Audit log: every step, its output, and the triggering alert are recorded
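
Alert-to-runbook routing via label matchers can be pictured as a first-match lookup: a route applies when every label it names has the required value on the alert. The route table, the second runbook path, and the function names below are hypothetical; only `runbooks/pod-crashloop.md` appears elsewhere on this page.

```python
from typing import Optional

# Ordered route table: more specific matchers first. Contents are illustrative.
ROUTES = [
    ({"alertname": "KubePodCrashLooping", "namespace": "production"},
     "runbooks/pod-crashloop.md"),
    ({"alertname": "KubePodCrashLooping"},
     "runbooks/pod-crashloop-nonprod.md"),  # hypothetical path
]


def matches(matchers: dict, labels: dict) -> bool:
    """True if every matcher key is present on the alert with the same value."""
    return all(labels.get(key) == value for key, value in matchers.items())


def select_runbook(labels: dict, routes: list = ROUTES) -> Optional[str]:
    """Return the runbook for the first matching route, or None."""
    for matchers, runbook in routes:
        if matches(matchers, labels):
            return runbook
    return None


if __name__ == "__main__":
    alert = {"alertname": "KubePodCrashLooping", "namespace": "production"}
    print(select_runbook(alert))
```

Because the first match wins, ordering routes from most to least specific keeps a broad fallback from shadowing a production-only runbook.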

Installation

bash

cd opensre
pip install -r requirements.txt
npm install   # for the Node.js webhook listener
cp config.example.yaml config.yaml

Usage

bash

# Start the webhook listener
python3 opensre/server.py --config config.yaml

# Manually trigger a runbook
python3 opensre/cli.py run runbooks/pod-crashloop.md \
  --vars service=payment-api,namespace=production

Invoke from Claude Code: /opensre (runs a runbook or lists available ones)

Plural

Elixir

GitOps multi-cloud deployment platform

Plural is a GitOps-native platform for managing Helm releases across multiple clusters and cloud providers from a single control plane. It abstracts the differences between EKS, AKS, and GKE so you can promote a release from dev to staging to prod with a single Git merge.

Pilot uses Plural as the deployment layer when you need multi-cloud or multi-region consistency. The /plural slash command scaffolds your Plural configuration from Pilot's generated Helm charts.

Plural's marketplace also provides pre-configured stacks (PostgreSQL, Redis, monitoring) that slot into Pilot's Terraform output.

Key Capabilities

Single control plane for EKS, AKS, GKE, and bare-metal clusters
Promotion pipelines: dev → staging → prod with automated gate checks
Drift detection: alerts when cluster state diverges from Git
Built-in marketplace of pre-configured infrastructure stacks
OIDC-native: all cloud credentials are obtained via Workload Identity
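
Drift detection amounts to comparing the state declared in Git with the state observed in the cluster. The sketch below does this for flat key/value settings; the field names and the `detect_drift` helper are assumptions for illustration and do not describe how Plural computes drift internally.

```python
def detect_drift(desired: dict, live: dict) -> dict:
    """Map each differing key to a (desired, live) pair. Missing keys show as None."""
    drift = {}
    for key in sorted(desired.keys() | live.keys()):
        want, have = desired.get(key), live.get(key)
        if want != have:
            drift[key] = (want, have)
    return drift


if __name__ == "__main__":
    # Illustrative values only.
    in_git = {"image": "payment-api:1.4.2", "replicas": 3}
    in_cluster = {"image": "payment-api:1.4.1", "replicas": 3, "debug": "true"}
    for field, (want, have) in detect_drift(in_git, in_cluster).items():
        print(f"{field}: git={want!r} cluster={have!r}")
```

An empty result means the cluster matches Git; any entry is a candidate for an alert or for reconciliation.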

Installation

bash

# Install Plural CLI
brew install pluralsh/plural/plural

# Bootstrap your workspace
plural init
plural build --only "<your-app>"   # replace with your app name
plural deploy --commit "initial deploy"

Usage

bash

# Promote from staging to prod
plural deploy --context prod --commit "promote v1.4.2"

Invoke from Claude Code: /plural (manages Plural deployments and promotions)

SRE Guard

Python

Monitoring daemon with auto-remediation

SRE Guard is a lightweight Python daemon that watches your deployments for SLO breaches and triggers the OpenSRE runbook engine automatically. It bridges the gap between alerting (Prometheus/Alertmanager) and remediation (OpenSRE) without requiring a human to be the relay.

Configuration is a single YAML file listing which alerts map to which runbooks, with optional approval gates for destructive actions. In production, destructive steps always require human confirmation via Slack. In dev/staging, SRE Guard can run fully autonomously.
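
The approval-gate rule described above reduces to a small decision function. This is a sketch under stated assumptions: the `Step` class, the environment names, and the choice to treat any unrecognised environment like production are all illustrative, not SRE Guard's documented behaviour.

```python
from dataclasses import dataclass

# Environments where destructive steps may run without confirmation (assumed names).
AUTONOMOUS_ENVIRONMENTS = {"dev", "staging"}


@dataclass(frozen=True)
class Step:
    name: str
    destructive: bool


def requires_approval(step: Step, environment: str) -> bool:
    """Destructive steps need human confirmation outside dev/staging."""
    return step.destructive and environment not in AUTONOMOUS_ENVIRONMENTS


if __name__ == "__main__":
    restart = Step(name="rollout restart payment-api", destructive=True)
    describe = Step(name="describe pods", destructive=False)
    for env in ("staging", "production"):
        print(env, requires_approval(restart, env), requires_approval(describe, env))
```

Defaulting unknown environments to "approval required" is the conservative choice: a typo in an environment name then blocks a destructive step instead of silently allowing it.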

SRE Guard ships as a GitHub Actions workflow (ci-sre-guard.yml) that runs on every deployment, and as a standalone Kubernetes deployment for continuous monitoring.

Key Capabilities

Prometheus Alertmanager webhook receiver
Alert-to-runbook routing with label-based matching
Approval gates for destructive runbook steps (Slack DM confirmation)
GitHub Actions integration: runs post-deploy health checks automatically
Structured incident report written to GitHub Step Summary
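
For the webhook receiver, the incoming body follows Alertmanager's webhook format: a JSON object with an `alerts` array whose entries carry `status` and `labels`. The sketch below extracts the labels of firing alerts so they can be passed to routing. The sample payload is trimmed to the fields used here, and the `firing_alert_labels` helper is hypothetical, not part of SRE Guard.

```python
import json

# Trimmed example of an Alertmanager webhook body (only the fields used below).
PAYLOAD = """
{
  "version": "4",
  "status": "firing",
  "alerts": [
    {"status": "firing",
     "labels": {"alertname": "KubePodCrashLooping", "namespace": "production"}},
    {"status": "resolved",
     "labels": {"alertname": "HighLatency", "namespace": "production"}}
  ]
}
"""


def firing_alert_labels(payload: str) -> list:
    """Return the label sets of all alerts that are currently firing."""
    body = json.loads(payload)
    return [
        alert.get("labels", {})
        for alert in body.get("alerts", [])
        if alert.get("status") == "firing"
    ]


if __name__ == "__main__":
    for labels in firing_alert_labels(PAYLOAD):
        print(labels)
```

Filtering on the per-alert `status` matters because a single notification can mix firing and resolved alerts from the same group.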

Installation

bash

pip install -r requirements.txt

# Copy and edit the config
cp sre-guard/config.example.yaml sre-guard/config.yaml

Usage

bash

# Run as a daemon
python3 sre-guard/daemon.py --config sre-guard/config.yaml

# Or run a one-shot health check (used in CI)
python3 sre-guard/check.py \
  --service payment-api \
  --namespace production

Invoke from Claude Code: /sre-guard (runs a health check or configures the daemon)