Sub-Agents

Agents

Six specialist sub-agents, each handling a distinct layer of the stack. Pilot orchestrates them so you don't have to know which one to call.


HolmesGPT

Python

AI-powered incident investigation

HolmesGPT correlates logs, metrics, and distributed traces to surface root causes automatically during an incident. Instead of manually pivoting between Grafana, Loki, and your alerting tool, you ask HolmesGPT a question in plain English and it returns a structured diagnosis.

It integrates natively with PagerDuty, OpsGenie, Prometheus, Loki, and Jaeger. When an alert fires, the SRE Guard daemon can invoke HolmesGPT automatically, so the investigation runs before a human is even paged.

All findings are written to a structured runbook-compatible output format so OpenSRE can pick up remediation automatically.
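As a rough illustration of how a downstream consumer such as OpenSRE might read these findings, the sketch below parses a JSON document and picks the highest-confidence hypothesis. The field names (`hypotheses`, `cause`, `confidence`) and the sample payload are assumptions made for illustration; the actual HolmesGPT output schema is not specified on this page.

```python
import json

# Hypothetical findings payload; the real schema may differ.
SAMPLE_FINDINGS = """
{
  "question": "Why are pods in the payment-api namespace crash-looping?",
  "hypotheses": [
    {"cause": "OOMKilled: memory limit too low", "confidence": 0.82},
    {"cause": "Failing readiness probe", "confidence": 0.41},
    {"cause": "Bad config rollout", "confidence": 0.13}
  ]
}
"""


def top_hypothesis(raw: str) -> dict:
    """Return the hypothesis with the highest confidence score."""
    findings = json.loads(raw)
    hypotheses = findings.get("hypotheses", [])
    if not hypotheses:
        raise ValueError("findings contain no hypotheses")
    return max(hypotheses, key=lambda h: h["confidence"])


if __name__ == "__main__":
    best = top_hypothesis(SAMPLE_FINDINGS)
    print(f"{best['cause']} (confidence {best['confidence']:.2f})")
```

Raising on an empty list keeps a consumer from acting on an investigation that produced no hypotheses.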

Key Capabilities
  • Natural-language incident queries against live observability data
  • Automatic root-cause hypothesis ranking with confidence scores
  • PagerDuty / OpsGenie alert enrichment
  • Prometheus + Loki + Jaeger integration
  • Structured JSON output compatible with OpenSRE runbooks
Installation
bash
cd holmesgpt
pip install -r requirements.txt
export HOLMES_API_KEY=<your-key>
Usage
bash
python3 holmes.py ask \
  "Why are pods in the payment-api namespace crash-looping?"
Invoke from Claude Code: /holmesgpt "your incident question"

kagent

Go

Kubernetes-native AI agent for fleet operations

kagent is a Kubernetes operator that embeds an AI planner directly into your cluster. It watches deployments, monitors for anomalies, and can execute corrective actions — rolling restarts, HPA adjustments, node cordon/drain — without leaving the cluster.

Unlike kubectl-based scripts, kagent understands intent. You describe the desired outcome (e.g. 'drain node ip-10-0-1-5 with zero downtime') and it generates the safe action sequence, confirming before executing in production.

kagent exposes a CRD-based API so all operations are GitOps-friendly and auditable.

Key Capabilities
  • CRD-based intent API (KAgentTask, KAgentPolicy)
  • Safe drain/cordon with PDB awareness
  • Automated rollout progression and canary promotion
  • Multi-cluster fleet operations via kubeconfig federation
  • Dry-run mode with diff output before any mutation
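To make "PDB awareness" concrete, here is a minimal sketch of the arithmetic behind a PodDisruptionBudget that uses an absolute `minAvailable`: the number of pods that may be evicted is the number currently healthy minus the number that must stay healthy, floored at zero. The `PdbStatus` type and `can_evict` helper are hypothetical names for illustration and are not part of kagent's API.

```python
from dataclasses import dataclass


@dataclass
class PdbStatus:
    """Minimal view of a PodDisruptionBudget (hypothetical helper type)."""
    min_available: int    # pods that must stay healthy
    current_healthy: int  # pods that are healthy right now


def disruptions_allowed(pdb: PdbStatus) -> int:
    """How many pods may be evicted now without violating the budget."""
    return max(0, pdb.current_healthy - pdb.min_available)


def can_evict(pdb: PdbStatus, pods_on_node: int) -> bool:
    """True if all matching pods on the node can be evicted in one pass."""
    return pods_on_node <= disruptions_allowed(pdb)


if __name__ == "__main__":
    pdb = PdbStatus(min_available=3, current_healthy=5)
    print(disruptions_allowed(pdb))            # budget left for evictions
    print(can_evict(pdb, pods_on_node=2))      # fits within the budget
    print(can_evict(pdb, pods_on_node=3))      # would breach the budget
```

When the check fails, a drain would have to evict pods in smaller batches and wait for replacements to become healthy before continuing.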
Installation
bash
helm repo add kagent https://kagent.dev/charts
helm install kagent kagent/kagent \
  -n kagent-system --create-namespace
Usage
bash
kubectl apply -f - <<EOF
apiVersion: kagent.dev/v1
kind: KAgentTask
metadata:
  name: drain-node
spec:
  intent: "Drain ip-10-0-1-5 with zero disruption to the payment-api deployment"
EOF
Invoke from Claude Code: /kagent (runs fleet operation tasks interactively)

Nightshift

Go

Cost optimization scheduler

Nightshift scales down non-production Kubernetes workloads on a cron schedule and restores them before business hours. A typical dev/staging cluster running 8×5 (40 of the week's 168 hours) instead of 24×7 saves 60–70% of compute cost with no manual effort.
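The savings figure can be sanity-checked with simple arithmetic on running hours. The sketch below computes the fraction of weekly hours a part-time schedule removes; the function name is illustrative. This fraction is only an upper bound on cost savings, since realized savings depend on whether nodes are actually released and on costs that keep running while workloads are scaled down.

```python
HOURS_PER_WEEK = 24 * 7  # 168


def running_hours_saved(hours_per_day: float, days_per_week: int) -> float:
    """Fraction of weekly running hours removed by a part-time schedule."""
    on_hours = hours_per_day * days_per_week
    return 1 - on_hours / HOURS_PER_WEEK


if __name__ == "__main__":
    # 8 hours a day, 5 days a week
    print(f"8x5:  {running_hours_saved(8, 5):.1%}")
    # 12 hours a day, 5 days a week
    print(f"12x5: {running_hours_saved(12, 5):.1%}")
```

An 8×5 schedule removes about 76.2% of running hours, and a 12×5 schedule about 64.3%.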

It integrates with Pilot's KEDA scale-to-zero pattern for maximum savings: Nightshift handles overnight shutdown while KEDA handles intra-day idle scale-to-zero. The two are additive.

Nightshift stores intended replica counts as annotations before scaling down, so it always restores to the exact pre-shutdown state — including HPA min/max values.
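The save-then-restore pattern described above can be sketched with plain dictionaries standing in for Kubernetes objects. The annotation key `nightshift.io/original-replicas` is a hypothetical name chosen for this example; the key Nightshift actually writes is not given on this page, and HPA bounds are left out for brevity.

```python
# Hypothetical annotation key, chosen for illustration only.
REPLICAS_ANNOTATION = "nightshift.io/original-replicas"


def scale_down(workload: dict) -> dict:
    """Record the current replica count in an annotation, then scale to zero."""
    annotations = dict(workload.get("annotations", {}))
    # Record only once, so a repeated scale-down cannot overwrite the saved value with 0.
    annotations.setdefault(REPLICAS_ANNOTATION, str(workload["replicas"]))
    return {**workload, "annotations": annotations, "replicas": 0}


def scale_up(workload: dict) -> dict:
    """Restore the recorded replica count and remove the annotation."""
    annotations = dict(workload.get("annotations", {}))
    saved = annotations.pop(REPLICAS_ANNOTATION, None)
    if saved is None:
        return workload  # nothing was recorded, so leave the workload untouched
    return {**workload, "annotations": annotations, "replicas": int(saved)}


if __name__ == "__main__":
    api = {"name": "payment-api", "replicas": 3, "annotations": {}}
    asleep = scale_down(api)
    print(asleep["replicas"], asleep["annotations"])
    print(scale_up(asleep)["replicas"])
```

Writing the annotation only when it is absent makes the scale-down step idempotent, which matters if a schedule fires twice.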

Key Capabilities
  • Cron-based scale-down / scale-up schedules per namespace or label selector
  • Stores intended replica counts as annotations before shutdown
  • Respects PodDisruptionBudgets during shutdown sequencing
  • Slack / PagerDuty notification on each schedule event
  • Cost report: estimated monthly savings per workload
Installation
bash
helm repo add nightshift https://nightshift.dev/charts
helm install nightshift nightshift/nightshift \
  -n nightshift-system --create-namespace \
  --set schedule.timezone=Asia/Kolkata
Usage
bash
kubectl annotate namespace staging \
  nightshift.io/schedule="0 20 * * 1-5|0 8 * * 1-5"
Invoke from Claude Code: /nightshift (configures schedules interactively)
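The schedule annotation in the Usage example packs two cron expressions into one value separated by `|`. The sketch below splits and sanity-checks such a value. It assumes the first expression is the scale-down time and the second the scale-up time, which matches the 20:00 and 08:00 weekday values in the example but is not stated explicitly on this page; `parse_schedule` is an illustrative helper, not part of Nightshift.

```python
from typing import Tuple


def parse_schedule(value: str) -> Tuple[str, str]:
    """Split a 'down|up' annotation value into two five-field cron expressions."""
    parts = [part.strip() for part in value.split("|")]
    if len(parts) != 2:
        raise ValueError("expected exactly two cron expressions separated by '|'")
    for expr in parts:
        if len(expr.split()) != 5:
            raise ValueError(f"not a five-field cron expression: {expr!r}")
    return parts[0], parts[1]


if __name__ == "__main__":
    down, up = parse_schedule("0 20 * * 1-5|0 8 * * 1-5")
    print("scale down:", down)
    print("scale up:  ", up)
```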

OpenSRE

Python + Node

Runbook automation engine

OpenSRE converts Markdown runbooks into executable incident workflows. Each runbook is a YAML-annotated Markdown file: human-readable for engineers, machine-executable for the automation layer.

When a run is triggered (by HolmesGPT, a PagerDuty webhook, or manually), OpenSRE selects the matching runbook, executes each step, and reports status back to your incident channel. Steps can include kubectl commands, API calls, database queries, and escalation logic.

All runbooks live in Git. Changes go through normal PR review. No proprietary runbook DSL to learn.
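To show what a "YAML-annotated Markdown file" can look like to a program, the sketch below separates a frontmatter block from the Markdown body. It handles only flat `key: value` lines using the standard library; a real implementation would use a proper YAML parser. The metadata keys (`name`, `severity`) and the sample runbook text are assumptions for illustration, not OpenSRE's documented schema.

```python
from typing import Dict, Tuple

# Hypothetical runbook; the metadata keys are illustrative.
SAMPLE_RUNBOOK = """---
name: pod-crashloop
severity: high
---
# Pod crash loop

1. Check recent events in the namespace.
"""


def split_frontmatter(text: str) -> Tuple[Dict[str, str], str]:
    """Split a runbook into (metadata, markdown body)."""
    lines = text.splitlines()
    if not lines or lines[0].strip() != "---":
        return {}, text  # no frontmatter at all
    end = next(
        (i for i, line in enumerate(lines[1:], start=1) if line.strip() == "---"),
        None,
    )
    if end is None:
        return {}, text  # unterminated frontmatter, so treat as plain Markdown
    meta = {}
    for line in lines[1:end]:
        if ":" in line:
            key, _, value = line.partition(":")
            meta[key.strip()] = value.strip()
    body = "\n".join(lines[end + 1:]).lstrip("\n")
    return meta, body


if __name__ == "__main__":
    meta, body = split_frontmatter(SAMPLE_RUNBOOK)
    print(meta)
    print(body.splitlines()[0])
```

Because the body stays ordinary Markdown, the same file renders normally in a Git hosting UI during PR review.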

Key Capabilities
  • Markdown-native runbook format with YAML frontmatter for metadata
  • Step execution: kubectl, bash, HTTP, SQL, PagerDuty escalate
  • Alert-to-runbook routing via label matchers
  • Dry-run mode: simulate runbook execution without side effects
  • Audit log: every step, its output, and the triggering alert are recorded
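Alert-to-runbook routing via label matchers can be sketched as a first-match lookup: each route lists labels that must all equal the alert's labels. The routing table layout, the alert name, and the second runbook path below are hypothetical; only `runbooks/pod-crashloop.md` appears elsewhere on this page.

```python
from typing import Dict, List, Optional

# Hypothetical routing table: each route pairs label matchers with a runbook path.
ROUTES = [
    {
        "match": {"alertname": "KubePodCrashLooping", "namespace": "production"},
        "runbook": "runbooks/pod-crashloop.md",
    },
    {
        "match": {"alertname": "KubePodCrashLooping"},
        "runbook": "runbooks/pod-crashloop-nonprod.md",
    },
]


def select_runbook(alert_labels: Dict[str, str], routes: List[dict]) -> Optional[str]:
    """Return the runbook of the first route whose matchers all equal the alert's labels."""
    for route in routes:
        if all(alert_labels.get(key) == value for key, value in route["match"].items()):
            return route["runbook"]
    return None  # no route matched; a caller might escalate to a human instead


if __name__ == "__main__":
    labels = {"alertname": "KubePodCrashLooping", "namespace": "production"}
    print(select_runbook(labels, ROUTES))
```

With first-match semantics, more specific routes need to appear before broader ones, as in the table above.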
Installation
bash
cd opensre
pip install -r requirements.txt
npm install   # for the Node.js webhook listener
cp config.example.yaml config.yaml
Usage
bash
# Start the webhook listener
python3 opensre/server.py --config config.yaml

# Manually trigger a runbook
python3 opensre/cli.py run runbooks/pod-crashloop.md \
  --vars service=payment-api,namespace=production
Invoke from Claude Code: /opensre (runs a runbook or lists available ones)

Plural

Elixir

GitOps multi-cloud deployment platform

Plural is a GitOps-native platform for managing Helm releases across multiple clusters and cloud providers from a single control plane. It abstracts the differences between EKS, AKS, and GKE so you can promote a release from dev to staging to prod with a single Git merge.

Pilot uses Plural as the deployment layer when you need multi-cloud or multi-region consistency. The /plural slash command scaffolds your Plural configuration from Pilot's generated Helm charts.

Plural's marketplace also provides pre-configured stacks (PostgreSQL, Redis, monitoring) that slot into Pilot's Terraform output.

Key Capabilities
  • Single control plane for EKS, AKS, GKE, and bare-metal clusters
  • Promotion pipelines: dev → staging → prod with automated gate checks
  • Drift detection: alerts when cluster state diverges from Git
  • Built-in marketplace of pre-configured infrastructure stacks
  • OIDC-native: all cloud credentials via Workload Identity
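Drift detection, as listed above, amounts to comparing the state declared in Git with the state observed in the cluster. The sketch below does this for flat dictionaries; it is a simplified illustration with hypothetical field names and does not reflect how Plural implements the comparison.

```python
from typing import Any, Dict, Tuple


def detect_drift(desired: Dict[str, Any], live: Dict[str, Any]) -> Dict[str, Tuple[Any, Any]]:
    """Return {key: (desired_value, live_value)} for every key that differs."""
    drift = {}
    for key in desired.keys() | live.keys():
        if desired.get(key) != live.get(key):
            drift[key] = (desired.get(key), live.get(key))
    return drift


if __name__ == "__main__":
    # Hypothetical release settings, flattened for the example.
    desired = {"image": "payment-api:1.4.2", "replicas": 3}
    live = {"image": "payment-api:1.4.2", "replicas": 5}
    print(detect_drift(desired, live))
```

An empty result means the cluster matches Git; any entry is a candidate for an alert or a re-sync.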
Installation
bash
# Install Plural CLI
brew install pluralsh/plural/plural

# Bootstrap your workspace
plural init
plural build --only <your-app>
plural deploy --commit "initial deploy"
Usage
bash
# Promote from staging to prod
plural deploy --context prod --commit "promote v1.4.2"
Invoke from Claude Code: /plural (manages Plural deployments and promotions)

SRE Guard

Python

Monitoring daemon with auto-remediation

SRE Guard is a lightweight Python daemon that watches your deployments for SLO breaches and triggers the OpenSRE runbook engine automatically. It bridges the gap between alerting (Prometheus/Alertmanager) and remediation (OpenSRE) without requiring a human to be the relay.

Configuration is a single YAML file listing which alerts map to which runbooks, with optional approval gates for destructive actions. In production, destructive steps always require human confirmation via Slack. In dev/staging, SRE Guard can run fully autonomously.
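The approval rule in this paragraph reduces to a small predicate: a step waits for a human only when it is destructive and the environment is production. The sketch below encodes that rule; the `Step` type, the environment string `"production"`, and the step names are illustrative assumptions rather than SRE Guard's actual configuration format.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class Step:
    """A runbook step (hypothetical type for this example)."""
    name: str
    destructive: bool = False


def needs_approval(step: Step, environment: str) -> bool:
    """Destructive steps in production wait for human confirmation."""
    return step.destructive and environment == "production"


def plan(steps: List[Step], environment: str) -> List[Tuple[str, str]]:
    """Label each step as 'auto' or 'await-approval' for the given environment."""
    return [
        (step.name, "await-approval" if needs_approval(step, environment) else "auto")
        for step in steps
    ]


if __name__ == "__main__":
    steps = [Step("collect-logs"), Step("delete-pod", destructive=True)]
    print(plan(steps, "production"))
    print(plan(steps, "staging"))
```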

SRE Guard ships as a GitHub Actions workflow (ci-sre-guard.yml) that runs on every deployment, and as a standalone Kubernetes deployment for continuous monitoring.

Key Capabilities
  • Prometheus Alertmanager webhook receiver
  • Alert-to-runbook routing with label-based matching
  • Approval gates for destructive runbook steps (Slack DM confirmation)
  • GitHub Actions integration: runs post-deploy health checks automatically
  • Structured incident report written to GitHub Step Summary
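A webhook receiver's first job is to pull the firing alerts out of the JSON body that Alertmanager posts. The sketch below does only that parsing step, relying on the `alerts`, `status`, and `labels` fields of the Alertmanager webhook payload; the sample body is abbreviated and the alert names are made up. It is not SRE Guard's actual receiver code.

```python
import json
from typing import Dict, List

# Abbreviated sample body with made-up alerts.
SAMPLE_PAYLOAD = json.dumps({
    "version": "4",
    "status": "firing",
    "alerts": [
        {"status": "firing",
         "labels": {"alertname": "HighErrorRate", "namespace": "production"}},
        {"status": "resolved",
         "labels": {"alertname": "HighLatency", "namespace": "production"}},
    ],
}).encode()


def firing_alerts(payload: bytes) -> List[Dict[str, str]]:
    """Return the label sets of all alerts in the payload whose status is 'firing'."""
    body = json.loads(payload)
    return [
        alert.get("labels", {})
        for alert in body.get("alerts", [])
        if alert.get("status") == "firing"
    ]


if __name__ == "__main__":
    for labels in firing_alerts(SAMPLE_PAYLOAD):
        print(labels)
```

The returned label sets are what a label-based router would then match against its runbook routes.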
Installation
bash
pip install -r requirements.txt

# Copy and edit the config
cp sre-guard/config.example.yaml sre-guard/config.yaml
Usage
bash
# Run as a daemon
python3 sre-guard/daemon.py --config sre-guard/config.yaml

# Or run a one-shot health check (used in CI)
python3 sre-guard/check.py \
  --service payment-api \
  --namespace production
Invoke from Claude Code: /sre-guard (runs a health check or configures the daemon)