Dev
SRE Engineer
QA Engineer
IT Ops
Planning
Session Graph Context
Observability
Prometheus, Grafana, Datadog, Sentry, New Relic, OpenTelemetry
Knowledge
Jira, Notion, Confluence, Linear, Drive
Code
GitLab, GitHub, Bitbucket
Testing
Cypress, Sauce Labs, Postman
DevOps & CI/CD
AWS, GCP, K8s, Helm, Argo, Actions, CircleCI, Travis CI, Jenkins

Softprobe AI

Softprobe AI in action

Helps you resolve incidents, optimize costs, and build on existing systems — directly inside your on-call workflow.

Softprobe APP Today at 8:17 PM
🚨 Grafana Alert | Order Service createOrder success rate < 90%
Observed: 95% → 50% · Start: 20:15 · Window: last 5m
Softprobe Agentic RCA
The incident is driven by a dominant business-level failure pattern. Known noisy errors were filtered out.
Findings
  1. Metric inflection: Prometheus confirms a sudden success-rate drop at 20:15, isolated to POST /createOrder.
  2. Dominant error signature: Loki shows a spike of “product 002 out of stock” errors, accounting for ~80% of createOrder failures during the incident window.
  3. Noise filtered: “cache is disabled” errors existed before 20:15 and do not correlate with the drop.
Root Cause
Product 002 inventory depletion caused validation failures and a surge of createOrder errors, reducing the success rate from 95% to 50%.
Recommended Actions
  • Immediate (low risk): Temporarily disable product 002 from checkout.
  • Graceful degradation: Return “item unavailable” outcome instead of hard failure.
  • Continuity: Enable backorder / fallback SKU if allowed by policy.
View details in Softprobe · Open Jira ticket · Will ask for confirmation before production actions
Reply with “approve option 1”, “show evidence”, or “check deploy diff”.

Understands the alert and proposes next steps

Softprobe reads the alert, pulls the right evidence (metrics, logs, traces), and summarizes the likely root cause with safe, actionable recommendations.

  • Investigates first: correlates signals and filters noise.
  • Explains clearly: highlights the dominant failure pattern and impact.
  • Acts safely: suggests reversible mitigations and asks for confirmation.
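The "filters noise" step in the alert above can be illustrated with a small sketch. This is not Softprobe's implementation; it is a simplified heuristic that assumes log lines have already been normalized into error signatures, and it treats any signature that was already present before the incident start as noise. The `LogEvent` shape and the `dominantSignatures` name are hypothetical.

```typescript
// Hypothetical, pre-parsed log record; real Loki entries would need parsing first.
interface LogEvent {
  timestamp: number; // epoch milliseconds
  signature: string; // normalized error message, e.g. "product 002 out of stock"
}

interface SignatureShare {
  signature: string;
  count: number;
  share: number; // fraction of the non-noise failures inside the incident window
}

// Rank error signatures seen at or after incidentStart, dropping any signature
// that also occurred before incidentStart (treated here as pre-existing noise).
function dominantSignatures(events: LogEvent[], incidentStart: number): SignatureShare[] {
  const baseline = new Set<string>();
  const counts = new Map<string, number>();

  for (const e of events) {
    if (e.timestamp < incidentStart) {
      baseline.add(e.signature);
    } else {
      counts.set(e.signature, (counts.get(e.signature) ?? 0) + 1);
    }
  }

  // Filter after the full pass so the result does not depend on event order.
  const remaining = Array.from(counts.entries()).filter(([sig]) => !baseline.has(sig));
  const total = remaining.reduce((sum, [, n]) => sum + n, 0);

  return remaining
    .map(([signature, count]) => ({
      signature,
      count,
      share: total === 0 ? 0 : count / total,
    }))
    .sort((a, b) => b.count - a.count);
}
```

A production version would more likely compare error rates before and after the inflection point than use a strict "seen before" rule, since a pre-existing error can also get worse during an incident.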
Thread
RCA exploration
Andrew Today 21:05
Why is checkout latency spiking only for EU users?
Softprobe APP Today 21:06
I won’t assume a single cause — validating competing hypotheses in parallel:
CDN edge vs origin · Flag divergence · DB query plans · Cross-region fan-out
Softprobe APP Today 21:07
Finding: Feature flag pricing_v2 is enabled only in eu-west-1. It added a sync call to a US-only pricing dependency (+400–600ms RTT). The inflection timestamp matches the rollout.
Hypotheses Tested
  • CDN edge latency vs origin latency
  • EU-only feature-flag divergence
  • DB query plan regression
  • Cross-region dependency fan-out
Evidence
  • Only EU traces show the sync pricing hop
  • Cross-region RTT adds ~400–600ms
  • Inflection timestamp matches rollout

Gets you to the real root cause

Explores competing hypotheses with you — not just dashboards — and converges on the real root cause with evidence.
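One piece of evidence in the thread above is that only EU traces show the sync pricing hop. A check like that could be sketched as follows. The `Trace` shape, the `regionOnlySpans` helper, and the span names used with it are illustrative assumptions, not Softprobe's data model or API.

```typescript
// Hypothetical, simplified trace shape; real tracing backends expose richer span data.
interface Trace {
  region: string; // e.g. "eu-west-1"
  spans: { name: string; durationMs: number }[];
}

// Return the span names that occur in traces from `region` but in no other
// region, mapped to their mean duration in milliseconds.
function regionOnlySpans(traces: Trace[], region: string): Map<string, number> {
  const seenElsewhere = new Set<string>();
  const totals = new Map<string, { sum: number; n: number }>();

  for (const t of traces) {
    for (const s of t.spans) {
      if (t.region !== region) {
        seenElsewhere.add(s.name);
      } else {
        const agg = totals.get(s.name) ?? { sum: 0, n: 0 };
        agg.sum += s.durationMs;
        agg.n += 1;
        totals.set(s.name, agg);
      }
    }
  }

  const result = new Map<string, number>();
  totals.forEach((agg, name) => {
    if (!seenElsewhere.has(name)) result.set(name, agg.sum / agg.n);
  });
  return result;
}
```

A span that shows up only in the slow region, with a mean duration close to the observed latency increase, supports the feature-flag hypothesis over the CDN or query-plan ones.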

Thread
RCA → Fix → PR
Andrew Today 21:12
Can you propose a fix and make it safe?
Softprobe APP Today 21:13
Options:
Add idempotency key · Dedupe at DB layer · Retry w/ backoff
Recommendation: idempotency + DB uniqueness to stop duplicate charges. Approve to open a PR?
Andrew Today 21:14
Apply the idempotency fix and open a PR.
Softprobe APP Today 21:16
Done — here’s the minimal diff and the PR summary.
payments/charge.ts · minimal change adds idempotency
-const charge = await stripe.charge(params)
-return charge
+const key = req.headers["idempotency-key"]
+const existing = await charges.findByKey(key)
+if (existing) return existing
+const charge = await stripe.charge({ ...params, idempotencyKey: key })
+await charges.save({ key, chargeId: charge.id })
+return charge

PR: Prevent duplicate charges on retries with idempotency

Summary: retries could re-charge when the network times out
Metrics: duplicates → 0
Rollback: feature flag off

Fixes it with you (and opens the PR)

Softprobe moves from analysis to action — with explicit confirmation, reviewable changes and a PR.
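The idempotency pattern in the diff above can be tried in isolation with a self-contained sketch. Everything here is a stand-in: `FakeGateway` replaces the payment provider, and the in-memory `Map` plays the role of a table with a unique constraint on the idempotency key. None of these names come from the page's codebase.

```typescript
interface Charge {
  id: string;
  amount: number;
}

// Stand-in for a payment provider; counts how many real charges were created.
class FakeGateway {
  calls = 0;

  charge(amount: number): Charge {
    this.calls += 1;
    return { id: `ch_${this.calls}`, amount };
  }
}

// Wraps the gateway so that repeated requests with the same idempotency key
// return the stored result instead of charging again.
class IdempotentCharger {
  private readonly byKey = new Map<string, Charge>();
  private readonly gateway: FakeGateway;

  constructor(gateway: FakeGateway) {
    this.gateway = gateway;
  }

  charge(key: string, amount: number): Charge {
    const existing = this.byKey.get(key);
    if (existing) return existing; // retry: replay the stored charge

    const charge = this.gateway.charge(amount);
    this.byKey.set(key, charge);
    return charge;
  }
}

// Usage: a client retry with the same key does not create a second charge.
const gateway = new FakeGateway();
const charger = new IdempotentCharger(gateway);
const first = charger.charge("order-123", 500);
const retry = charger.charge("order-123", 500);
console.log(first.id === retry.id, gateway.calls); // prints: true 1
```

This sketch is synchronous, so the lookup and the save cannot interleave. In an async service, two concurrent requests can both pass the lookup before either saves, which is why the thread's recommendation pairs the idempotency key with a uniqueness constraint at the DB layer.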

Softprobe doesn’t just explain incidents — it collaborates with you to fix them and make sure they don’t come back.