Show HN: SiClaw – an open-source agent for debugging infrastructure incidents

1 points by SherryWong 8 hours ago

Hi everyone, I’m Fred, one of the builders of SiClaw (https://github.com/scitix/siclaw). We work as SREs at an AI infrastructure company, operating GPU clusters and large-scale model infrastructure across heterogeneous compute environments.

We spend a lot of time debugging production systems. The hard part usually isn’t fixing the problem; it’s figuring out what’s actually happening.

A typical incident means opening logs, metrics, dashboards, and cloud consoles just to piece together the story. Even something simple like a CrashLoopBackOff can turn into a small investigation. We started wondering if an agent could do that part of the work.

We recently open-sourced SiClaw: https://github.com/scitix/siclaw

There’s also a small project site with demos here: https://www.siclaw.ai/

Recently we’ve been experimenting with OpenClaw-style agents for infrastructure diagnostics.

OpenClaw today reminds us a bit of early React: the primitives are powerful, but the patterns for real-world usage are still emerging.

In practice, most teams are still figuring out how to connect agents to real production environments.

With SiClaw, we’re exploring what that might look like for DevOps and SRE teams.

One thing that surprised us is how quickly SiClaw became part of our daily workflow.

Now when something weird happens in our infrastructure, we usually start by asking it to investigate instead of opening five dashboards and running a bunch of kubectl commands.

For example, the other day a pod went into CrashLoopBackOff in a Kubernetes cluster.

Normally this means digging through logs and configs to figure out which dependency failed.
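
For context, the manual version of that digging usually starts with a few standard kubectl calls. Here is a rough Python sketch that just builds those commands. The function name and the pod and namespace values are made up for illustration, and this is not how SiClaw itself is implemented.

```python
from typing import List


def manual_triage_commands(pod: str, namespace: str = "default") -> List[List[str]]:
    """Return the kubectl invocations commonly run by hand for a crashing pod."""
    return [
        # Pod status, restart count, last termination reason, and recent events.
        ["kubectl", "describe", "pod", pod, "-n", namespace],
        # Logs from the previous (crashed) container instance.
        ["kubectl", "logs", pod, "-n", namespace, "--previous", "--tail=100"],
        # Namespace events filtered down to this pod.
        ["kubectl", "get", "events", "-n", namespace,
         "--field-selector", f"involvedObject.name={pod}"],
    ]


if __name__ == "__main__":
    # "checkout-7d9f" and "shop" are hypothetical names.
    for argv in manual_triage_commands("checkout-7d9f", "shop"):
        print(" ".join(argv))
```

The commands are returned as argument lists rather than executed, so the sketch can be read or run without cluster access.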

With SiClaw we just described the issue and let it investigate. It gathered logs, checked service dependencies, and proposed a ranked list of root-cause hypotheses.
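
To give a feel for what a "ranked list of hypotheses" means, here is a deliberately naive sketch that scores candidate causes by counting matching evidence lines. The candidate list, signal strings, and function names are all assumptions for illustration; SiClaw's actual reasoning is not keyword matching and is not shown here.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass(frozen=True)
class Hypothesis:
    cause: str
    signals: Tuple[str, ...]  # lowercase substrings that would support this cause


# Hypothetical catalog of candidate causes, for illustration only.
CANDIDATES = (
    Hypothesis("dependency unreachable", ("connection refused", "no such host", "timeout")),
    Hypothesis("out of memory", ("oomkilled", "out of memory")),
    Hypothesis("bad configuration", ("missing env", "invalid config", "permission denied")),
)


def rank_hypotheses(evidence: List[str]) -> List[Tuple[str, int]]:
    """Score each candidate by the number of evidence lines containing one of its signals."""
    lines = [line.lower() for line in evidence]
    scores = {
        h.cause: sum(1 for line in lines if any(s in line for s in h.signals))
        for h in CANDIDATES
    }
    # Highest score first; ties are broken alphabetically so the order is stable.
    return sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))


if __name__ == "__main__":
    # Made-up evidence lines of the kind found in pod logs and describe output.
    evidence = [
        "dial tcp 10.0.0.5:5432: connection refused",
        "Last State: Terminated, Reason: OOMKilled",
        "retrying: connection refused",
    ]
    for cause, score in rank_hypotheses(evidence):
        print(f"{score}  {cause}")
```

Even in this toy form, the useful property is that every hypothesis comes with the evidence count behind it, so a human can check the reasoning instead of trusting a single answer.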

We’d love people here to try it, break it, and see where it fails.

In particular, we’re curious how it performs on real infrastructure incidents, such as pod crashes, dependency failures, or odd configuration issues.

Looking forward to feedback.