Softprobe AI SRE

Intelligent operations AI for production environments

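Ship with greater confidence with Softprobe AI SRE. It catches regressions before release, deterministically reproduces and debugs failures in production, and automatically orchestrates fixes through your existing tools, with no extra instrumentation required.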

Dev
SRE Engineer
QA Engineer
IT Ops
Investigating ...
Thinking T+04s

Correlating checkout 5xx spike with deploy checkout-api@2026.02.11.4.

Tool Call T+08s

Fetching evidence from Grafana, Sentry CHK-991, and Argo rollout events.

grafana.query(service="checkout", metric="http_5xx_rate", window="15m")
Observation T+13s

p95 latency jumped 480ms -> 2.1s only on revision tied to commit a18f2c9.

Tool Call T+18s
kubectl logs deploy/checkout-api -n prod --since=15m | rg "timeout|pool exhausted"
Action T+24s

Executing rollback to checkout-api@2026.02.11.3 and opening INC-214.

Verify T+31s

5xx recovered from 8.2% to 0.3%. Posted incident summary to #prod-incidents.

Softprobe Context Engine
Observability
Prometheus, Grafana, Datadog, Sentry, New Relic, OpenTelemetry
Knowledge
Slack, Jira, Notion, Confluence, Linear, Drive
Code
GitLab, GitHub, Bitbucket
Testing
Cypress, Sauce Labs, curl
DevOps & CI/CD
AWS, GCP, K8s, Helm, Argo, Actions, Travis CI, Jenkins, CircleCI

Trusted by leading companies

G7, Lexin, LinkedCare, Ping An, Trip, United Imaging, Xingyun, XYD, Yeahka

Why do teams choose Softprobe?

80%

Reduce defects that escape to production by 80%

90%

Speed up debugging by 90% with AI SRE

60%

Reduce support escalations by 60%

Softprobe AI SRE in action

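Resolve incidents and optimize costs on top of the stack you already have, while fitting seamlessly into your existing on-call and operations workflows.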

Softprobe APP Today at 10:42 PM
🚦 Safe Canary Alert | Release v2.3.1 (Paused)
Softprobe Safe Canary
Softprobe Safe Canary replayed live production traffic against the new release in an isolated environment.
No customers were impacted.
Behavioral difference detected in checkout results
  • 21% of replayed requests returned different pricing because of tax calculation
  • Affects customers in EU
  • Error rate and latency unchanged
Status
Release paused automatically
Next steps
  1. Ask Softprobe AI to generate a fix PR
  2. Approve release with exception
  3. Expand replay window for confirmation
Reply with the option number or ask for details.

Catch issues before customers see them with Softprobe Safe Canary Testing™

Works for both read and write APIs
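The replay comparison described in the alert above can be sketched roughly as follows. This is a minimal illustration, not Softprobe's implementation: the `ReplayedRequest` shape and the `diffReplay` and `canonical` helpers are hypothetical names invented for the example. It compares the response recorded in production with the response the new release produced for the same request, and reports what share of requests differ and in which regions.

```typescript
// Hypothetical shape of one replayed request: the response recorded in
// production (baseline) and the response from the new release (candidate).
interface ReplayedRequest {
  id: string;
  region: string;
  baseline: unknown;
  candidate: unknown;
}

interface ReplayDiff {
  total: number;
  differing: number;
  ratio: number; // differing / total, 0 when nothing was replayed
  regions: string[]; // regions that saw at least one differing response
}

// Serialize with sorted object keys so property order alone never
// registers as a behavioral difference.
function canonical(value: unknown): string {
  if (Array.isArray(value)) {
    return "[" + value.map(canonical).join(",") + "]";
  }
  if (value !== null && typeof value === "object") {
    const obj = value as Record<string, unknown>;
    const keys = Object.keys(obj).sort();
    return "{" + keys.map((k) => JSON.stringify(k) + ":" + canonical(obj[k])).join(",") + "}";
  }
  return JSON.stringify(value);
}

function diffReplay(requests: ReplayedRequest[]): ReplayDiff {
  const differing = requests.filter(
    (r) => canonical(r.baseline) !== canonical(r.candidate),
  );
  const regions = differing
    .map((r) => r.region)
    .filter((region, i, all) => all.indexOf(region) === i)
    .sort();
  return {
    total: requests.length,
    differing: differing.length,
    ratio: requests.length === 0 ? 0 : differing.length / requests.length,
    regions,
  };
}

// Example: one EU request prices differently on the new release.
const sample: ReplayedRequest[] = [
  { id: "r1", region: "EU", baseline: { total: 120, tax: 20 }, candidate: { tax: 24, total: 124 } },
  { id: "r2", region: "US", baseline: { total: 50, tax: 0 }, candidate: { tax: 0, total: 50 } },
];
console.log(diffReplay(sample));
```

A release gate could then pause the rollout when `ratio` exceeds a threshold the team chooses; that threshold is a policy decision and is not specified on this page.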

Softprobe APP Today at 8:17 PM
Grafana Alert | Order Service createOrder success rate < 90%
Observed: 95% → 50% | Start: 20:15 | Window: last 5m
Softprobe Agentic RCA
The incident is driven by a dominant business-level failure pattern. Known noisy errors were filtered out.
Findings
  1. Metric inflection: Prometheus confirms a sudden success-rate drop at 20:15, isolated to POST /createOrder.
  2. Dominant error signature: Loki shows a spike of "product 002 out of stock" errors, accounting for ~80% of createOrder failures during the incident window.
  3. Noise filtered: "cache is disabled" errors existed before 20:15 and do not correlate with the drop.
Root Cause
Product 002 inventory depletion caused validation failures and a surge of createOrder errors, reducing success rate from 95% → 50%.
Recommended Actions
  • Immediate (low risk): Temporarily disable product 002 from checkout.
  • Graceful degradation: Return "item unavailable" outcome instead of hard failure.
  • Continuity: Enable backorder / fallback SKU if allowed by policy.
View details in Softprobe | Open Jira ticket | Will ask for confirmation before production actions
Reply with "approve option 1", "show evidence", or "check deploy diff".

Softprobe agents respond to your alerts

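Softprobe agents pull in the right evidence (metrics, logs, traces), code, and knowledge base content, filter out the noise, pinpoint the true business-level root cause, and give actionable recommendations.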
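The noise filtering and dominant-signature steps from the alert above could look something like the sketch below. It is only an illustration under stated assumptions: the `LogEntry` shape and the `rankSignatures` helper are hypothetical, and the rule used here (any error signature already present before the incident started counts as known noise) is one simple heuristic, not a description of how Softprobe works.

```typescript
// Hypothetical log record; timestamps are epoch milliseconds and the
// signature is a normalized error message.
interface LogEntry {
  timestamp: number;
  signature: string;
}

interface SignatureShare {
  signature: string;
  count: number;
  share: number; // fraction of all errors inside the incident window
}

// Signatures seen before the incident started are treated as known noise.
// The remaining signatures are ranked by how many in-window errors they
// account for.
function rankSignatures(logs: LogEntry[], incidentStart: number): SignatureShare[] {
  const noise = new Set<string>();
  for (const entry of logs) {
    if (entry.timestamp < incidentStart) noise.add(entry.signature);
  }

  const counts = new Map<string, number>();
  let total = 0;
  for (const entry of logs) {
    if (entry.timestamp < incidentStart) continue;
    total += 1;
    if (noise.has(entry.signature)) continue;
    counts.set(entry.signature, (counts.get(entry.signature) || 0) + 1);
  }

  const ranked: SignatureShare[] = [];
  counts.forEach((count, signature) => {
    ranked.push({ signature, count, share: count / total });
  });
  return ranked.sort((a, b) => b.count - a.count);
}

// Example with made-up timestamps: "cache is disabled" predates the
// incident, so only the out-of-stock signature is reported.
const incidentStart = 1000;
const logs: LogEntry[] = [
  { timestamp: 900, signature: "cache is disabled" },
  { timestamp: 1010, signature: "cache is disabled" },
  { timestamp: 1020, signature: "product 002 out of stock" },
  { timestamp: 1030, signature: "product 002 out of stock" },
  { timestamp: 1040, signature: "product 002 out of stock" },
  { timestamp: 1050, signature: "product 002 out of stock" },
];
console.log(rankSignatures(logs, incidentStart));
```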

Andrew Today 21:05
Why is checkout latency spiking only for EU users?
Softprobe APP Today 21:06
I won't assume a single cause — validating competing hypotheses in parallel:
CDN edge vs origin | Flag divergence | DB query plans | Cross-region fan-out
Softprobe APP Today 21:07
Finding: Feature flag pricing_v2 is enabled only in eu-west-1. It added a sync call to a US-only pricing dependency (+400–600ms RTT). The inflection timestamp matches the rollout.
Hypotheses Tested
  • CDN edge latency vs origin latency
  • EU-only feature-flag divergence
  • DB query plan regression
  • Cross-region dependency fan-out
Evidence
  • Only EU traces show the sync pricing hop
  • Cross-region RTT adds ~400–600ms
  • Inflection timestamp matches rollout

Helps you find the true root cause

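Softprobe agents work alongside you, inside your own tools, validating multiple hypotheses in parallel and converging on the true root cause based on evidence.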
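Validating hypotheses in parallel maps naturally onto concurrent async checks. The sketch below assumes a hypothetical `Verdict` type and `testHypotheses` helper, with stubbed checks standing in for real queries against tracing, feature-flag, or database tools. It shows the general pattern only and makes no claim about Softprobe's internals.

```typescript
// Hypothetical result of checking one hypothesis against evidence.
interface Verdict {
  hypothesis: string;
  supported: boolean;
  evidence: string[];
}

type Check = () => Promise<Verdict>;

// Run every check concurrently. A check that throws is recorded as
// unsupported rather than aborting the whole investigation.
function testHypotheses(checks: Record<string, Check>): Promise<Verdict[]> {
  const names = Object.keys(checks);
  return Promise.all(
    names.map((name) =>
      checks[name]().catch(
        (err: unknown): Verdict => ({
          hypothesis: name,
          supported: false,
          evidence: ["check failed: " + String(err)],
        }),
      ),
    ),
  );
}

// Stubbed checks for illustration; real checks would query your tools.
const checks: Record<string, Check> = {
  "CDN edge latency vs origin latency": async () => ({
    hypothesis: "CDN edge latency vs origin latency",
    supported: false,
    evidence: [],
  }),
  "EU-only feature-flag divergence": async () => ({
    hypothesis: "EU-only feature-flag divergence",
    supported: true,
    evidence: ["Only EU traces show the sync pricing hop"],
  }),
};

testHypotheses(checks).then((verdicts) => {
  const supported = verdicts.filter((v) => v.supported);
  console.log(supported.map((v) => v.hypothesis));
});
```

Catching failures per check keeps one unreachable data source from hiding the verdicts of the others, which matters when the checks span several independent tools.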

Andrew Today 21:12
Can you propose a fix and make it safe?
Softprobe APP Today 21:13
Options:
Add idempotency key | Dedupe at DB layer | Retry w/ backoff
Recommendation: idempotency + DB uniqueness to stop duplicate charges. Approve to open a PR?
Andrew Today 21:14
Apply the idempotency fix and open a PR.
Softprobe APP Today 21:16
Done — here's the idempotency fix and the PR summary.
payments/charge.ts: Adds idempotency
-const charge = await stripe.charge(params)
-return charge
+const key = req.headers["Idempotency-Key"]
+const existing = await charges.findByKey(key)
+if (existing) return existing
+const charge = await stripe.charge({ ...params, idempotencyKey: key })
+await charges.save({ key, chargeId: charge.id })
+return charge

PR: Prevent duplicate charges on retries with idempotency

Summary: retries could re-charge when the network times out | Metrics: duplicates → 0 | Rollback: feature flag off

Fixes issues together with you (and opens the PR)

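Softprobe goes beyond analysis: with your approval, it delivers reviewable changes and a PR, and pairs them with replay validation to make sure the release is safe.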
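For readers who want to experiment with the idempotency pattern shown in the PR above, here is a self-contained sketch. Everything in it is an assumption made for illustration: `makeIdempotentCharger`, the `CreateCharge` callback, and the in-memory map are hypothetical stand-ins, not a real payment SDK or Softprobe output. Unlike the short diff, it rejects requests without a key and lets concurrent retries share a single in-flight charge.

```typescript
interface Charge {
  id: string;
  amount: number;
}

// Stand-in for a payment provider call; not a real SDK signature.
type CreateCharge = (amount: number, idempotencyKey: string) => Promise<Charge>;

// Returns a charge function that performs at most one provider call per
// idempotency key. The in-memory map is for illustration only; a real
// service would persist keys in a table with a unique constraint.
function makeIdempotentCharger(createCharge: CreateCharge) {
  const byKey = new Map<string, Promise<Charge>>();

  return async function chargeOnce(key: string | undefined, amount: number): Promise<Charge> {
    if (!key) {
      throw new Error("Idempotency-Key header is required");
    }
    const existing = byKey.get(key);
    if (existing) return existing; // a retry reuses the first attempt's result

    const pending = createCharge(amount, key);
    byKey.set(key, pending);
    try {
      return await pending;
    } catch (err) {
      byKey.delete(key); // allow a later retry after a failed attempt
      throw err;
    }
  };
}

// Example with a fake provider that counts how often it is called.
let providerCalls = 0;
const chargeOnce = makeIdempotentCharger(async (amount, _key) => {
  providerCalls += 1;
  return { id: "ch_" + providerCalls, amount };
});

Promise.all([chargeOnce("order-42", 1999), chargeOnce("order-42", 1999)]).then(([first, retry]) => {
  console.log(first.id === retry.id, providerCalls);
});
```

One detail worth checking when adapting the diff: if `req` is a Node.js `http.IncomingMessage`, header names are lowercased, so the key would be read from `req.headers["idempotency-key"]`.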