Agents
Six specialist sub-agents, each handling a distinct layer of the stack. Pilot orchestrates them so you don't have to know which one to call.
HolmesGPT
Python · AI-powered incident investigation
HolmesGPT correlates logs, metrics, and distributed traces to surface root causes automatically during an incident. Instead of manually pivoting between Grafana, Loki, and your alerting tool, you ask HolmesGPT a question in plain English and it returns a structured diagnosis.
It integrates natively with PagerDuty, OpsGenie, Prometheus, Loki, and Jaeger. When an alert fires, the SRE Guard daemon can invoke HolmesGPT automatically, so the investigation runs before a human is even paged.
All findings are written in a structured, runbook-compatible JSON format so OpenSRE can pick up remediation automatically.
- Natural-language incident queries against live observability data
- Automatic root-cause hypothesis ranking with confidence scores
- PagerDuty / OpsGenie alert enrichment
- Prometheus + Loki + Jaeger integration
- Structured JSON output compatible with OpenSRE runbooks
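To make the structured output concrete, here is a minimal Python sketch of how a consumer might pick the top-ranked hypothesis from a findings document. The field names (`alert`, `hypotheses`, `cause`, `confidence`, `evidence`) and the sample values are hypothetical; the actual HolmesGPT schema is not specified here and may differ.

```python
import json

# Hypothetical findings document; the real HolmesGPT schema may differ.
FINDINGS = """
{
  "alert": "HighErrorRate",
  "hypotheses": [
    {"cause": "bad deploy of checkout service", "confidence": 0.21, "evidence": ["jaeger"]},
    {"cause": "database connection pool exhausted", "confidence": 0.72, "evidence": ["loki", "prometheus"]},
    {"cause": "node memory pressure", "confidence": 0.07, "evidence": ["prometheus"]}
  ]
}
"""


def rank_hypotheses(raw):
    """Return the hypotheses sorted from most to least confident."""
    doc = json.loads(raw)
    return sorted(doc["hypotheses"], key=lambda h: h["confidence"], reverse=True)


ranked = rank_hypotheses(FINDINGS)
top = ranked[0]
print(f"{top['cause']} ({top['confidence']:.0%})")
# prints: database connection pool exhausted (72%)
```

A downstream tool such as a runbook engine could use the top entry to choose a remediation, falling back to a human when no hypothesis clears a confidence threshold.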
Invoke from Claude Code: /holmesgpt "your incident question"
kagent
Go · Kubernetes-native AI agent for fleet operations
kagent is a Kubernetes operator that embeds an AI planner directly into your cluster. It watches deployments, monitors for anomalies, and can execute corrective actions — rolling restarts, HPA adjustments, node cordon/drain — without leaving the cluster.
Unlike kubectl-based scripts, kagent understands intent. You describe the desired outcome (e.g. 'drain node ip-10-0-1-5 with zero downtime') and it generates the safe action sequence, confirming before executing in production.
kagent exposes a CRD-based API so all operations are GitOps-friendly and auditable.
- CRD-based intent API (KAgentTask, KAgentPolicy)
- Safe drain/cordon with PDB awareness
- Automated rollout progression and canary promotion
- Multi-cluster fleet operations via kubeconfig federation
- Dry-run mode with diff output before any mutation
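As an illustration of the CRD-based intent API, the following Python sketch builds a `KAgentTask` manifest for the drain example above. Only the kind name comes from this page; the API group `kagent.example/v1alpha1` and the `intent` and `dryRun` fields are assumptions made for the example, not the documented schema.

```python
import json


def build_task(name, intent, dry_run=True):
    """Build a KAgentTask manifest as a plain dict.

    The API group/version and the spec fields are hypothetical.
    """
    return {
        "apiVersion": "kagent.example/v1alpha1",  # hypothetical group/version
        "kind": "KAgentTask",
        "metadata": {"name": name},
        "spec": {
            "intent": intent,   # hypothetical: desired outcome in plain English
            "dryRun": dry_run,  # hypothetical: preview the plan without mutating
        },
    }


task = build_task("drain-node", "drain node ip-10-0-1-5 with zero downtime")
print(json.dumps(task, indent=2))
```

Because the task is just a manifest, it can be committed to Git and reviewed like any other Kubernetes resource; defaulting `dry_run` to `True` in this sketch mirrors the confirm-before-mutating behaviour described above.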
Invoke from Claude Code: /kagent (runs fleet operation tasks interactively)
Nightshift
Go · Cost optimization scheduler
Nightshift scales down non-production Kubernetes workloads on a cron schedule and restores them before business hours. A typical dev/staging cluster running 8×5 instead of 24×7 saves 60–70% of compute cost with zero manual effort.
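The 8×5 figure can be checked with simple arithmetic. The sketch below assumes compute cost scales linearly with running hours, which is a simplification. It gives a ceiling of about 76%, so the 60–70% quoted above leaves room for costs that do not scale down with the workloads (those are not itemized here).

```python
HOURS_PER_WEEK = 24 * 7   # 168 hours for an always-on cluster
BUSINESS_HOURS = 8 * 5    # 40 hours for an 8x5 schedule


def runtime_reduction(on_hours, total_hours=HOURS_PER_WEEK):
    """Fraction of running hours eliminated by the schedule."""
    return 1 - on_hours / total_hours


reduction = runtime_reduction(BUSINESS_HOURS)
print(f"{reduction:.1%}")
# prints: 76.2%
```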
It integrates with Pilot's KEDA scale-to-zero pattern for maximum savings: Nightshift handles overnight shutdown while KEDA handles intra-day idle scale-to-zero. The savings from the two are additive.
Nightshift stores intended replica counts as annotations before scaling down, so it always restores to the exact pre-shutdown state — including HPA min/max values.
- Cron-based scale-down / scale-up schedules per namespace or label selector
- Stores intended replica counts as annotations before shutdown
- Respects PodDisruptionBudgets during shutdown sequencing
- Slack / PagerDuty notification on each schedule event
- Cost report: estimated monthly savings per workload
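The save-then-restore behaviour can be sketched in a few lines of Python. This is an in-memory model, not Nightshift's implementation: the workload is a plain dict, the annotation key `nightshift.example/pre-shutdown-state` is hypothetical, and HPA handling is reduced to saving and restoring a `min`/`max` pair.

```python
import copy
import json

STATE_KEY = "nightshift.example/pre-shutdown-state"  # hypothetical annotation key


def scale_down(workload):
    """Record the current state in an annotation, then scale to zero."""
    w = copy.deepcopy(workload)
    # Only save once, so a repeated run cannot overwrite the saved state with zeros.
    if STATE_KEY not in w["annotations"]:
        state = {"replicas": w["replicas"], "hpa": w["hpa"]}
        w["annotations"][STATE_KEY] = json.dumps(state)  # annotation values are strings
    w["replicas"] = 0
    w["hpa"] = None  # simplified: autoscaling is suspended while the workload is down
    return w


def scale_up(workload):
    """Restore the saved state and remove the annotation."""
    w = copy.deepcopy(workload)
    saved = w["annotations"].pop(STATE_KEY, None)
    if saved is not None:
        state = json.loads(saved)
        w["replicas"] = state["replicas"]
        w["hpa"] = state["hpa"]
    return w


before = {"name": "api", "replicas": 3, "hpa": {"min": 2, "max": 6}, "annotations": {}}
down = scale_down(before)
restored = scale_up(scale_down(down))  # the second scale_down keeps the saved state
print(restored == before)
# prints: True
```

Guarding the save step is the important design choice in this sketch: without it, running the shutdown schedule twice would record a replica count of zero and the morning restore would bring nothing back.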
Invoke from Claude Code: /nightshift (configures schedules interactively)
OpenSRE
Python + Node · Runbook automation engine
OpenSRE converts Markdown runbooks into executable incident workflows. Each runbook is a YAML-annotated Markdown file: human-readable for engineers, machine-executable for the automation layer.
When a runbook is triggered (by a HolmesGPT finding, a PagerDuty webhook, or manually), OpenSRE selects the matching runbook, executes each step, and reports status back to your incident channel. Steps can include kubectl commands, API calls, database queries, and escalation logic.
All runbooks live in Git. Changes go through normal PR review. No proprietary runbook DSL to learn.
- Markdown-native runbook format with YAML frontmatter for metadata
- Step execution: kubectl, bash, HTTP, SQL, PagerDuty escalate
- Alert-to-runbook routing via label matchers
- Dry-run mode: simulate runbook execution without side effects
- Audit log: every step, its output, and the triggering alert are recorded
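A rough Python sketch of the runbook format and label-matcher routing follows. The frontmatter keys (`title`, `match.alertname`, `match.severity`) are hypothetical, and the parser handles only flat `key: value` pairs as a stand-in for a real YAML parser.

```python
RUNBOOK = """\
---
title: Restart crashlooping deployment
match.alertname: KubePodCrashLooping
match.severity: critical
---
# Restart crashlooping deployment

1. Inspect recent events.
2. Roll the deployment.
"""


def parse_runbook(text):
    """Split a runbook into (frontmatter dict, Markdown body).

    Flat `key: value` pairs only; nested YAML is out of scope for this sketch.
    """
    _, header, body = text.split("---\n", 2)
    meta = {}
    for line in header.splitlines():
        key, _, value = line.partition(":")
        meta[key.strip()] = value.strip()
    return meta, body


def matches(meta, alert_labels):
    """True when every `match.<label>` entry equals the alert's label value."""
    prefix = "match."
    matchers = {k[len(prefix):]: v for k, v in meta.items() if k.startswith(prefix)}
    # A runbook with no matchers never matches, to avoid accidental catch-alls.
    return bool(matchers) and all(
        alert_labels.get(label) == want for label, want in matchers.items()
    )


meta, body = parse_runbook(RUNBOOK)
alert = {"alertname": "KubePodCrashLooping", "severity": "critical", "namespace": "shop"}
print(meta["title"], matches(meta, alert))
# prints: Restart crashlooping deployment True
```

Extra labels on the alert (here `namespace`) are ignored, so a runbook only needs to list the labels it cares about.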
Invoke from Claude Code: /opensre (runs a runbook or lists available ones)
Plural
Elixir · GitOps multi-cloud deployment platform
Plural is a GitOps-native platform for managing Helm releases across multiple clusters and cloud providers from a single control plane. It abstracts the differences between EKS, AKS, and GKE so you can promote a release from dev to staging to prod with a single Git merge.
Pilot uses Plural as the deployment layer when you need multi-cloud or multi-region consistency. The /plural slash command scaffolds your Plural configuration from Pilot's generated Helm charts.
Plural's marketplace also provides pre-configured stacks (PostgreSQL, Redis, monitoring) that slot into Pilot's Terraform output.
- Single control plane for EKS, AKS, GKE, and bare-metal clusters
- Promotion pipelines: dev → staging → prod with automated gate checks
- Drift detection: alerts when cluster state diverges from Git
- Built-in marketplace of pre-configured infrastructure stacks
- OIDC-native: all cloud credentials via Workload Identity
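Drift detection boils down to comparing the state declared in Git with the state observed in the cluster. The Python sketch below shows that comparison on flat dictionaries; the keys and values are made up for illustration and say nothing about how Plural represents releases internally.

```python
def detect_drift(desired, live):
    """Return {key: (desired_value, live_value)} for every key that differs."""
    keys = desired.keys() | live.keys()
    return {
        k: (desired.get(k), live.get(k))
        for k in sorted(keys)
        if desired.get(k) != live.get(k)
    }


# Hypothetical release state: what Git declares versus what the cluster reports.
desired = {"chart": "api-1.4.2", "replicas": 3, "image": "api:1.4.2"}
live = {"chart": "api-1.4.2", "replicas": 5, "image": "api:1.4.2"}

drift = detect_drift(desired, live)
print(drift)
# prints: {'replicas': (3, 5)}
```

An empty result means the cluster matches Git; anything else is a candidate for an alert or a reconcile.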
Invoke from Claude Code: /plural (manages Plural deployments and promotions)
SRE Guard
Python · Monitoring daemon with auto-remediation
SRE Guard is a lightweight Python daemon that watches your deployments for SLO breaches and triggers the OpenSRE runbook engine automatically. It bridges the gap between alerting (Prometheus/Alertmanager) and remediation (OpenSRE) without requiring a human to be the relay.
Configuration is a single YAML file listing which alerts map to which runbooks, with optional approval gates for destructive actions. In production, destructive steps always require human confirmation via Slack. In dev/staging, SRE Guard can run fully autonomously.
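The mapping and gating rules described above might look like the following Python sketch. The configuration is written as a dict in place of the YAML file, and its shape (`routes`, `match`, `runbook`, `destructive`) is an assumption for illustration, as are the alert and runbook names.

```python
# Hypothetical configuration shape, standing in for the YAML file.
CONFIG = {
    "routes": [
        {"match": {"alertname": "HighErrorRate"}, "runbook": "rollback-deploy.md", "destructive": True},
        {"match": {"alertname": "DiskAlmostFull"}, "runbook": "rotate-logs.md", "destructive": False},
    ]
}


def route(labels):
    """Return the first route whose matchers all equal the alert's labels, else None."""
    for entry in CONFIG["routes"]:
        if all(labels.get(k) == v for k, v in entry["match"].items()):
            return entry
    return None


def needs_approval(entry, environment):
    """Destructive runbooks require human confirmation in production only."""
    return entry["destructive"] and environment == "production"


entry = route({"alertname": "HighErrorRate", "service": "checkout"})
print(entry["runbook"], needs_approval(entry, "production"), needs_approval(entry, "staging"))
# prints: rollback-deploy.md True False
```

Keeping the approval decision in one small function makes the production rule easy to audit: the same route runs unattended in dev/staging and waits for confirmation in production.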
SRE Guard ships as a GitHub Actions workflow (ci-sre-guard.yml) that runs on every deployment, and as a standalone Kubernetes deployment for continuous monitoring.
- Prometheus Alertmanager webhook receiver
- Alert-to-runbook routing with label-based matching
- Approval gates for destructive runbook steps (Slack DM confirmation)
- GitHub Actions integration: runs post-deploy health checks automatically
- Structured incident report written to GitHub Step Summary
Invoke from Claude Code: /sre-guard (runs a health check or configures the daemon)