Softprobe Context Engine

No more garbage in, garbage out: the AI SRE doesn't require perfect logging.

The Softprobe Context Engine provides the right context for debugging across infrastructure and business logic, without requiring perfect logging.

Move from reactive alert chasing to preventive reliability with runtime proof.

Infrastructure
Session Context Graph
Logs & metrics
Code
Knowledge
Softprobe Context Engine: runtime graph context from full message-body tracing
sequenceDiagram
    participant Checkout
    participant Discount
    participant Tax
    participant OrderDB
    participant Response
    Checkout->>Discount: POST /discounts/apply
    Discount->>Tax: subtotal + discount
    Tax->>OrderDB: INSERT order_id=88421
    OrderDB-->>Response: EUR 118.47

checkout.api

{
  "trace_id": "tr_7af21e",
  "user_id": 118204,
  "request": {
    "sku": "SKU-847",
    "currency": "EUR",
    "region": "DE"
  }
}

[INFO] checkout.api accepted request for user 118204

p95 62ms error 0.03% cpu 41%

Softprobe AI

AI for product operations

The AI knows your production at runtime, not just code & docs.

Example is better than precept

Proactively learns your production context; it doesn't require you to tell it everything.

Learns from both runtime sessions and static source code/docs.

Grounded context

Operates on the full production runtime picture

Makes decisions from production behavior, not just static docs.

Caching the final price is unsafe in production

Problem: "Add a Redis cache to speed up pricing" sounds safe, until you see the real request context.

Evidence:

  • Same SKU produces different prices across coupon_set / loyalty_tier / region
  • Pricing path branches into tax -> discount -> rounding decisions
  • 2.4% of real checkout traffic would receive an incorrect cached price

Conclusion: Caching the final price with a 5-minute TTL won't work.

Recommendation: Cache only stable components. Recompute contextual modifiers.

safe_cache_policy.yaml
# safe_cache_policy.yaml
cache_targets:
  - name: base_sku_price
    key: "sku:{sku_id}"
    ttl: 300s

do_not_cache:
  - final_price  # depends on context
  - tax_amount   # depends on region + address
  - discount     # depends on loyalty_tier + coupon_set

required_cache_dimensions:
  - region
  - loyalty_tier
  - coupon_set_hash
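The policy above can be enforced in code by making every required dimension part of the cache key. A minimal sketch of the idea (the function and field names are illustrative, not Softprobe's API):

```python
import hashlib

def cache_key(sku_id: str, region: str, loyalty_tier: int,
              coupon_set: list[str]) -> str:
    """Build a price-cache key that includes every dimension the price
    depends on: region, loyalty_tier, and a hash of the coupon set.
    Two requests that differ in any dimension get different keys,
    so they can never share a cached price."""
    coupon_set_hash = hashlib.sha256(
        ",".join(sorted(coupon_set)).encode()
    ).hexdigest()[:12]
    return f"sku:{sku_id}:{region}:{loyalty_tier}:{coupon_set_hash}"

# Same SKU, same region and tier, different coupons -> different keys.
k1 = cache_key("SKU-847", "DE", 2, ["SUMMER10"])
k2 = cache_key("SKU-847", "DE", 2, ["VIP5"])
print(k1 != k2)  # True
```

Sorting the coupon set before hashing makes the key order-independent, so `["A", "B"]` and `["B", "A"]` map to the same entry.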

Runtime workflow proof

Observed runtime paths:

Checkout -> Pricing -> DiscountEngine -> Tax -> Rounding -> Response

Branching drivers:

  • coupon_set present? (Y/N)
  • loyalty_tier (0/1/2/3)
  • region (US/EU/UK/JP)
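The three drivers above multiply into the set of distinct pricing paths a cache would have to distinguish. A back-of-envelope count (not Softprobe output):

```python
from itertools import product

# Branching drivers observed in the runtime workflow
drivers = {
    "coupon_set_present": ["Y", "N"],    # 2 options
    "loyalty_tier": [0, 1, 2, 3],        # 4 tiers
    "region": ["US", "EU", "UK", "JP"],  # 4 regions
}

# Every combination of driver values is a potentially distinct price path
paths = list(product(*drivers.values()))
print(len(paths))  # 2 * 4 * 4 = 32 combinations
```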

2:03 AM: Checkout degradation detected

  • CPU: 95%
  • Checkout error rate: ↑ 18%
  • Payment p95 latency: 3.2s
  • No new deployment
  • Traffic normal

Root Cause Identified

Retry amplification triggered by upstream latency spike.

Checkout -> Payment (retry x5)
Payment  -> Fraud (retry x4)
Fraud    -> Bank API slowdown

1 request -> 20 downstream calls

CPU saturation in 2 minutes.
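The amplification arithmetic is simple multiplication: each hop's attempt count multiplies the one before it. A quick sketch:

```python
from math import prod

def amplification(retry_factors: list[int]) -> int:
    """Worst-case downstream calls for one inbound request when every
    hop retries independently: the per-hop attempt counts multiply."""
    return prod(retry_factors)

# Checkout attempts Payment x5; each Payment attempt tries Fraud x4.
print(amplification([5, 4]))  # 20 downstream calls from 1 request
```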

Unsafe Configuration

  • Independent retry policies
  • No global retry budget
  • Timeout < upstream p95
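The missing global retry budget can be addressed by capping retries as a fraction of total traffic, shared across services. A minimal sketch of the idea (not Softprobe's implementation; the 20% ratio is an assumption):

```python
class RetryBudget:
    """Global retry budget: allow a retry only while total retries stay
    under a fixed ratio of total requests. When upstream latency spikes,
    the budget exhausts quickly and stops the amplification cascade."""

    def __init__(self, ratio: float = 0.2):
        self.ratio = ratio
        self.requests = 0
        self.retries = 0

    def record_request(self) -> None:
        self.requests += 1

    def can_retry(self) -> bool:
        # Deny if granting this retry would exceed ratio * requests
        if self.retries + 1 > self.ratio * max(self.requests, 1):
            return False
        self.retries += 1
        return True

budget = RetryBudget(ratio=0.2)
for _ in range(10):
    budget.record_request()
print(budget.can_retry(), budget.can_retry(), budget.can_retry())
# First two retries fit in the 20% budget; the third is denied.
```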

Patch Generated

destination-rule.yaml
# destination-rule.yaml
apiVersion: networking.istio.io/v1beta1
kind: DestinationRule
metadata:
  name: payment-retry-guard        # illustrative name
spec:
  host: payment.svc.cluster.local  # illustrative host; required by DestinationRule
  trafficPolicy:
    connectionPool:
      http:
        maxRetries: 3
    outlierDetection:
      consecutive5xxErrors: 5
      interval: 10s
      baseEjectionTime: 30s

Safe Validation

  • Replay the last 2 hours of traffic
  • Confirm retry amplification is eliminated
  • Gradual rollout
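The replay check can be expressed as a simple gate on the amplification ratio measured from replayed traffic. A sketch with illustrative numbers and an assumed 4x budget:

```python
def amplification_ratio(inbound: int, downstream: int) -> float:
    """Downstream calls generated per inbound request during replay."""
    return downstream / inbound

def passes_gate(inbound: int, downstream: int, max_ratio: float = 4.0) -> bool:
    """Validation gate: rollout proceeds only if replayed traffic shows
    amplification within the budget (4x is an assumed threshold)."""
    return amplification_ratio(inbound, downstream) <= max_ratio

print(passes_gate(100, 2000))  # False: 20x amplification, pre-patch behavior
print(passes_gate(100, 300))   # True: 3x, within the assumed budget
```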

Fast remediation

Debug production like an SRE with x-ray vision

From alert to safe resolution, with validation gates before rollout.

  • Investigate the root cause of production failures instantly
  • Generate a safe remediation plan with a step-by-step guide
  • Validate before full rollout