Softprobe AI SRE

AI for production ops

Deploy with confidence. The Softprobe AI SRE catches regressions before release, debugs production issues deterministically, and orchestrates fixes through your existing tools. Zero instrumentation required.

Dev
SRE Engineer
QA Engineer
IT Ops
Investigating ...
Thinking T+04s

Correlating checkout 5xx spike with deploy checkout-api@2026.02.11.4.

Tool Call T+08s

Fetching evidence from Grafana, Sentry CHK-991, and Argo rollout events.

grafana.query(service="checkout", metric="http_5xx_rate", window="15m")
Observation T+13s

p95 latency jumped 480ms -> 2.1s only on revision tied to commit a18f2c9.

Tool Call T+18s
kubectl logs deploy/checkout-api -n prod --since=15m | rg "timeout|pool exhausted"
Action T+24s

Executing rollback to checkout-api@2026.02.11.3 and opening INC-214.

Verify T+31s

5xx recovered from 8.2% to 0.3%. Posted incident summary to #prod-incidents.

Softprobe Context Engine
Observability
Prometheus Grafana Datadog Sentry New Relic OpenTelemetry
Knowledge
Slack Jira Notion Confluence Linear Drive
Code
GitLab GitHub Bitbucket
Testing
Cypress Sauce Labs curl
DevOps & CI/CD
AWS GCP K8s Helm Argo Actions TravisCI Jenkins CircleCI

Trusted by leading enterprises

G7 · Lexin · LinkedCare · Ping An · Trip · United Imaging · Xingyun · XYD · Yeahka

Why Customers Choose Softprobe?

80%

Fewer bugs leaking to production

90%

Faster debugging via AI SRE agent

60%

Fewer support escalations


Softprobe AI SRE in action

Helps you resolve incidents, optimize costs, and build on existing systems, directly inside your on-call workflow.

Softprobe APP • Today at 10:42 PM
🚦 Safe Canary Alert | Release v2.3.1 (Paused)
Softprobe Safe Canary
Softprobe Safe Canary replayed live production traffic against the new release in an isolated environment.
No customers were impacted.
Behavioral difference detected in checkout results
  • 21% of replayed requests returned different pricing because of tax calculation
  • Affects customers in EU
  • Error rate and latency unchanged
Status
Release paused automatically
Next steps
  1. Ask Softprobe AI to generate a fix PR
  2. Approve release with exception
  3. Expand replay window for confirmation
Reply with the option number or ask for details.

Prevent issues before customers see them with Softprobe Safe Canary Testing™

Works for both read and write operations.
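
The replay comparison described in the alert above can be pictured roughly as follows. This is a minimal sketch under stated assumptions, not Softprobe's implementation: the `ReplayedRequest` shape, the ignored field names, and the sample data are all hypothetical, and the normalization only handles flat response bodies.

```typescript
// Hypothetical record of one replayed request: the response captured from the
// current release (baseline) and the one produced by the new release (candidate).
interface ReplayedRequest {
  requestId: string;
  baseline: Record<string, unknown>;
  candidate: Record<string, unknown>;
}

// Fields assumed to differ on every run; they are excluded from the comparison.
const IGNORED_FIELDS = ["timestamp", "traceId"];

// Serialize a flat response body with ignored fields dropped and keys sorted,
// so key order alone never counts as a behavioral difference.
function normalize(body: Record<string, unknown>): string {
  const keys = Object.keys(body)
    .filter((key) => IGNORED_FIELDS.indexOf(key) === -1)
    .sort();
  return JSON.stringify(keys.map((key) => [key, body[key]]));
}

// Replayed requests whose candidate response differs from the baseline.
function findDiffs(replays: ReplayedRequest[]): ReplayedRequest[] {
  return replays.filter((r) => normalize(r.baseline) !== normalize(r.candidate));
}

// Fraction of replayed requests with a behavioral difference (0 when empty).
function diffRate(replays: ReplayedRequest[]): number {
  if (replays.length === 0) return 0;
  return findDiffs(replays).length / replays.length;
}

// Made-up sample: only r1 returns a different total under the new release.
const replays: ReplayedRequest[] = [
  { requestId: "r1", baseline: { total: 120, timestamp: 1 }, candidate: { total: 122, timestamp: 2 } },
  { requestId: "r2", baseline: { total: 50, timestamp: 1 }, candidate: { timestamp: 3, total: 50 } },
  { requestId: "r3", baseline: { total: 75, timestamp: 1 }, candidate: { total: 75, timestamp: 4 } },
  { requestId: "r4", baseline: { total: 10, timestamp: 1 }, candidate: { total: 10, timestamp: 5 } },
];

// In this sketch, any behavioral difference is enough to pause the rollout.
const shouldPauseRelease = diffRate(replays) > 0;
```

Comparing normalized bodies rather than status codes is what lets a check like this flag a pricing change even when error rate and latency are unchanged.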

Softprobe APP Today at 8:17 PM
Grafana Alert | Order Service createOrder success rate < 90%
Observed: 95% → 50% | Start: 20:15 | Window: last 5m
Softprobe Agentic RCA
The incident is driven by a dominant business-level failure pattern. Known noisy errors were filtered out.
Findings
  1. Metric inflection: Prometheus confirms a sudden success-rate drop at 20:15, isolated to POST /createOrder.
  2. Dominant error signature: Loki shows a spike of product 002 out of stock accounting for ~80% of createOrder failures during the incident window.
  3. Noise filtered: cache is disabled errors existed before 20:15 and do not correlate with the drop.
Root Cause
Product 002 inventory depletion caused validation failures and a surge of createOrder errors, reducing success rate from 95% → 50%.
Recommended Actions
  • Immediate (low risk): Temporarily disable product 002 from checkout.
  • Graceful degradation: Return "item unavailable" outcome instead of hard failure.
  • Continuity: Enable backorder / fallback SKU if allowed by policy.
View details in Softprobe | Open Jira ticket | Will ask for confirmation before production actions
Reply with "approve option 1", "show evidence", or "check deploy diff".

Softprobe agent responds to your alert

Softprobe agent pulls the right evidence (metrics, logs, traces), code, knowledge and history, identifies the root cause and proposes actionable recommendations.
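
The "dominant error signature" and "noise filtered" findings in the example above can be approximated with a simple count over error logs. The sketch below is illustrative only: the `ErrorLog` shape, the rule that any signature seen before the incident start counts as noise, and the sample data are assumptions, not a description of how the Softprobe agent works internally.

```typescript
// Hypothetical error log entry: a timestamp and a normalized error message.
interface ErrorLog {
  timestamp: number;
  signature: string;
}

interface SignatureShare {
  signature: string;
  count: number;
  share: number; // fraction of all errors inside the incident window
}

// Rank error signatures that first appear at or after incidentStart.
// Simplifying assumption: a signature already present before the incident
// is treated as pre-existing noise and left out of the ranking.
function rankSignatures(logs: ErrorLog[], incidentStart: number): SignatureShare[] {
  const preExisting = new Set<string>();
  logs.forEach((log) => {
    if (log.timestamp < incidentStart) preExisting.add(log.signature);
  });

  const counts = new Map<string, number>();
  let inWindow = 0;
  logs.forEach((log) => {
    if (log.timestamp < incidentStart) return;
    inWindow += 1;
    if (preExisting.has(log.signature)) return; // known noise
    counts.set(log.signature, (counts.get(log.signature) ?? 0) + 1);
  });

  const ranked: SignatureShare[] = [];
  counts.forEach((count, signature) => {
    ranked.push({ signature, count, share: count / inWindow });
  });
  return ranked.sort((a, b) => b.count - a.count);
}

// Made-up sample: the cache error predates the incident, the stock error does not.
const INCIDENT_START = 1000;
const logs: ErrorLog[] = [
  { timestamp: 100, signature: "cache is disabled" },
  { timestamp: 200, signature: "cache is disabled" },
  { timestamp: 1100, signature: "cache is disabled" },
  { timestamp: 1200, signature: "payment gateway timeout" },
];
for (let i = 0; i < 8; i++) {
  logs.push({ timestamp: 1300 + i, signature: "product 002 out of stock" });
}

const ranked = rankSignatures(logs, INCIDENT_START);
```

A production version would compare rates before and after the inflection instead of excluding every pre-existing signature, since a known error can also spike.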

Andrew Today 21:05
Why is checkout latency spiking only for EU users?
Softprobe APP Today 21:06
I won't assume a single cause. Validating competing hypotheses in parallel:
CDN edge vs origin | Flag divergence | DB query plans | Cross-region fan-out
Softprobe APP Today 21:07
Finding: Feature flag pricing_v2 is enabled only in eu-west-1. It added a sync call to a US-only pricing dependency (+400–600ms RTT). The inflection timestamp matches the rollout.
Hypotheses Tested
  • CDN edge latency vs origin latency
  • EU-only feature-flag divergence
  • DB query plan regression
  • Cross-region dependency fan-out
Evidence
  • Only EU traces show the sync pricing hop
  • Cross-region RTT adds ~400–600ms
  • Inflection timestamp matches rollout

Gets you to the real root cause

The Softprobe agent learns how to use your tools, explores competing hypotheses with you, and converges on the real root cause with evidence.
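
The evidence in the example above ("only EU traces show the sync pricing hop") comes down to grouping traces by region and checking which groups contain a given span. A minimal sketch, assuming a hypothetical `Trace` shape, a made-up span name, and invented sample values:

```typescript
// Hypothetical trace summary: region, total duration, and the span names it contains.
interface Trace {
  region: string;
  durationMs: number;
  spans: string[];
}

interface RegionSummary {
  region: string;
  traces: number;
  withSpan: number; // traces in this region that contain the span of interest
  meanDurationMs: number;
}

// Group traces by region, counting how many include spanName and averaging duration.
function summarizeByRegion(traces: Trace[], spanName: string): RegionSummary[] {
  const totals = new Map<string, { traces: number; withSpan: number; totalMs: number }>();
  traces.forEach((t) => {
    const entry = totals.get(t.region) ?? { traces: 0, withSpan: 0, totalMs: 0 };
    entry.traces += 1;
    entry.totalMs += t.durationMs;
    if (t.spans.indexOf(spanName) !== -1) entry.withSpan += 1;
    totals.set(t.region, entry);
  });

  const out: RegionSummary[] = [];
  totals.forEach((entry, region) => {
    out.push({
      region,
      traces: entry.traces,
      withSpan: entry.withSpan,
      meanDurationMs: entry.totalMs / entry.traces,
    });
  });
  // Sort by region name so the output order is stable.
  return out.sort((a, b) => (a.region < b.region ? -1 : a.region > b.region ? 1 : 0));
}

// Made-up sample: the pricing hop appears only in eu-west-1 traces.
const PRICING_SPAN = "pricing.getPrice"; // hypothetical span name
const traces: Trace[] = [
  { region: "eu-west-1", durationMs: 900, spans: ["checkout", PRICING_SPAN] },
  { region: "eu-west-1", durationMs: 1100, spans: ["checkout", PRICING_SPAN] },
  { region: "us-east-1", durationMs: 400, spans: ["checkout"] },
  { region: "us-east-1", durationMs: 500, spans: ["checkout"] },
];

const summary = summarizeByRegion(traces, PRICING_SPAN);
```

A span that appears in every slow-region trace and in none of the fast-region traces is a strong lead, though it still needs to be checked against the rollout timestamp, as the agent does above.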

Andrew Today 21:12
Can you propose a fix and make it safe?
Softprobe APP Today 21:13
Options:
  • Add idempotency key
  • Dedupe at DB layer
  • Retry w/ backoff
Recommendation: idempotency + DB uniqueness to stop duplicate charges. Approve to open a PR?
Andrew Today 21:14
Apply the idempotency fix and open a PR.
Softprobe APP Today 21:16
Done. Here's the idempotency fix and the PR summary.
payments/charge.ts: adds idempotency
-const charge = await stripe.charge(params)
-return charge
+const key = req.headers["idempotency-key"]
+const existing = await charges.findByKey(key)
+if (existing) return existing
+const charge = await stripe.charge({ ...params, idempotencyKey: key })
+await charges.save({ key, chargeId: charge.id })
+return charge

PR: Prevent duplicate charges on retries with idempotency

Summary: retries could re-charge when the network times out
Metrics: duplicates → 0
Rollback: feature flag off

Fixes it with you (and opens the PR)

Softprobe moves from analysis to action, with explicit confirmation, reviewable changes and a PR.
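
For readers who want to see the idempotency idea from the PR example in isolation, here is a self-contained sketch. It is an illustration under assumptions, not the code from the mockup: `FakeProvider` is a hypothetical stand-in for a payment client, and the in-memory map plays the role that a database uniqueness constraint would play in a real service.

```typescript
interface Charge {
  id: string;
  amountCents: number;
}

// Hypothetical stand-in for a payment provider client; counts the calls it receives.
class FakeProvider {
  calls = 0;

  async charge(amountCents: number): Promise<Charge> {
    this.calls += 1;
    return { id: "ch_" + this.calls, amountCents };
  }
}

// Wraps a provider so that repeated requests with the same idempotency key
// share one charge instead of charging again.
class IdempotentCharger {
  // Assumption: an in-memory map; a real service would persist keys durably.
  private byKey = new Map<string, Promise<Charge>>();
  private provider: FakeProvider;

  constructor(provider: FakeProvider) {
    this.provider = provider;
  }

  charge(key: string, amountCents: number): Promise<Charge> {
    const existing = this.byKey.get(key);
    if (existing) return existing; // retry or concurrent duplicate: reuse the result

    // Store the promise before it settles so concurrent retries cannot both charge.
    const pending = this.provider.charge(amountCents);
    this.byKey.set(key, pending);

    // If the charge fails, forget the key so a later retry can try again.
    pending.catch(() => {
      this.byKey.delete(key);
    });
    return pending;
  }
}
```

Storing the pending promise rather than the finished charge closes the gap between "look up the key" and "save the key", which is where two simultaneous retries could otherwise both reach the provider.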