Hi HN, I'm Robel. I built LogClaw because I was tired of paying for Datadog and still waking up to pages that said "something is wrong" with no context.
LogClaw is an open-source log intelligence platform that runs on Kubernetes. It ingests logs via OpenTelemetry and detects anomalies using signal-based composite scoring — not simple threshold alerting. The system extracts 8 failure-type signals (OOM, crashes, resource exhaustion, dependency failures, DB deadlocks, timeouts, connection errors, auth failures) and combines them with statistical z-score analysis, blast radius, error velocity, and recurrence signals into a composite score. Critical failures (OOM, panics) trigger an immediate detection path in <100ms — before a time window even completes. Detection achieves a 99.8% catch rate for critical failures while filtering noise (validation errors and 404s don't fire incidents).
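To make the composite-scoring idea concrete, here is a minimal sketch of how such a score could be assembled. This is illustrative only, not LogClaw's actual code: the signal names, weights, and normalization constants are all assumptions.

```python
from dataclasses import dataclass

# Failure types that bypass the time window entirely (per the post).
CRITICAL_SIGNALS = {"oom", "panic"}

@dataclass
class WindowStats:
    error_rate: float        # errors/sec observed in this window
    baseline_mean: float     # historical mean error rate
    baseline_std: float      # historical std dev of error rate
    affected_services: int   # blast radius
    recurrences: int         # times this failure signature fired recently

def z_score(stats: WindowStats) -> float:
    """How many standard deviations the current error rate sits above baseline."""
    if stats.baseline_std == 0:
        return 0.0
    return (stats.error_rate - stats.baseline_mean) / stats.baseline_std

def composite_score(signals: set[str], stats: WindowStats) -> float:
    """Combine failure-type and statistical signals into one score in [0, 1]."""
    # Critical failures (OOM, panics) short-circuit to the maximum score.
    if signals & CRITICAL_SIGNALS:
        return 1.0
    score = 0.0
    score += 0.1 * len(signals)                               # failure-type signals
    score += 0.3 * min(z_score(stats) / 5.0, 1.0)             # statistical anomaly
    score += 0.2 * min(stats.affected_services / 10.0, 1.0)   # blast radius
    score += 0.2 * min(stats.recurrences / 5.0, 1.0)          # recurrence
    return min(score, 1.0)
```

With weights like these, a single validation error produces a near-zero score while an OOM scores 1.0 immediately, which matches the "critical failures bypass the window" behavior described above.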
Once an anomaly is confirmed, a 5-layer trace correlation engine groups logs by traceId, maps service dependencies, tracks error propagation cascades, and computes blast radius across affected services. Then the Ticketing Agent pulls the correlated timeline, sends it to an LLM for root cause analysis, and creates a deduplicated ticket on Jira, ServiceNow, PagerDuty, OpsGenie, Slack, or Zammad. The loop from log noise to a filed ticket is about 90 seconds.
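The correlation step (group by traceId, reconstruct the timeline, compute blast radius) can be sketched in a few lines. This is a toy illustration, not LogClaw's engine; the field names (`trace_id`, `ts`, `service`, `level`) are assumptions about the log schema.

```python
from collections import defaultdict

def correlate(logs: list[dict]) -> dict:
    """Group an error burst by traceId and compute the affected-service set."""
    by_trace: dict[str, list[dict]] = defaultdict(list)
    for entry in logs:
        by_trace[entry["trace_id"]].append(entry)

    # Sort each trace's entries by timestamp to reconstruct its timeline.
    timelines = {}
    for trace_id, entries in by_trace.items():
        entries.sort(key=lambda e: e["ts"])
        timelines[trace_id] = entries

    # Blast radius: the distinct services that logged an error in this burst.
    affected = {e["service"] for e in logs if e["level"] == "ERROR"}
    return {"timelines": timelines, "blast_radius": sorted(affected)}
```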
Architecture: OTel Collector → Kafka (Strimzi, KRaft mode) → Bridge (Python, 4 concurrent threads: ETL, anomaly detection, OpenSearch indexing, trace correlation) → OpenSearch + Ticketing Agent. The AI layer supports OpenAI, Claude, or Ollama for fully air-gapped deployments. Everything deploys with a single Helm chart per tenant, namespace-isolated, no shared data plane.
To try it locally: https://docs.logclaw.ai/local-development
What it does NOT do yet:

- Metrics and traces — this is logs-only right now. Metrics support is on the roadmap.
- The anomaly detection is signal-based + statistical (composite scoring with z-score), not deep learning. It catches 99.8% of critical failures but won't detect subtle performance drift patterns yet.
- The dashboard is functional but basic. We use OpenSearch Dashboards for the heavy lifting.
Licensed Apache 2.0. The managed cloud version is $0.30/GB ingested if you don't want to self-host.
Hi HN — I’m Robel. I built LogClaw after getting tired of waking up to alerts that only said “something is wrong” with no context. LogClaw is an open-source log intelligence platform for Kubernetes. It ingests logs via OpenTelemetry and detects operational failures using signal-based anomaly detection rather than simple thresholds. Instead of looking at a single metric, LogClaw extracts failure signals from logs (OOMs, crashes, dependency failures, DB deadlocks, timeouts, etc.) and combines them with statistical signals like error velocity, recurrence, z-score anomalies, and blast radius to compute a composite anomaly score. Critical failures bypass time windows and trigger detection in <100ms. Once an anomaly is confirmed, a correlation engine reconstructs the trace timeline across services, detects error propagation, and computes the blast radius. A ticketing agent then generates a root-cause summary and creates deduplicated incidents in Jira, ServiceNow, PagerDuty, OpsGenie, Slack, or Zammad. Architecture: OTel Collector → Kafka → Detection Engine → OpenSearch → Ticketing Agent Repo: https://github.com/logclaw/logclaw Would love feedback from people running large production systems.
I'm a little confused. An agent's value-add is to automate what a human actor (in this case, an SRE) does and thus reduce time to recovery, etc. A human SRE never manually detects an error: we already have well-established anomaly detection implementations, and wiring them to a ticket generation tool is also an established pattern. My confusion is what value the "agent" is bringing here. Nothing wrong with competing with the Datadogs of the world.
The problem is a developer spending time setting up alerts for their new feature. I have done it many times on Splunk, and it is inconvenient: it's limited to the errors the developer expects, for example a status-code-based alert on a feature. And what happens when an alert fires? The developer has to manually trace logs across a bunch of traceIDs. LogClaw wants to solve this. It monitors your logs 24/7 with no need to set up alerts; when an error arises, it creates a ticket with all the logs for the relevant traceId. No time spent in the Splunk/Datadog log dashboards. Besides that, most incidents come from unplanned errors in production; for the planned ones, the developer has already set up graceful handling. What happens if your feature works correctly, but it turns out to be used so heavily that it runs out of memory, database queries slow down, an external API is exhausted, etc., causing the error? There are many unplanned errors that LogClaw will monitor. LogClaw ingests all the logs, so it knows what's happening throughout your whole codebase.
I guess if you don’t want to have to pay for Rapid7 or are too lazy to configure the Teams/Slack integration for your EDR?
But I mean you still have to pay for a Claude API with Moltclaw or whatever no?
It's designed to be SOC 2 compliant with your existing infra. You can spin up local Ollama instead of the Claude/OpenAI APIs. But if you prefer, you can use the external Claude/OpenAI APIs over local Ollama (an in-cluster LLM).
I am confused on the SOC2 compliance part you keep mentioning. How is it SOC2 compliant? You have completed an audit? Is that report or at least an executive summary available? Or it’s all locally hosted and shouldn’t impact my controls?
And the second part about models: if model choice doesn't matter, what do they do? If LogClaw ingests my logs and applies your custom algorithm to automatically create intelligent alerts without me having to configure anything, what does the LLM do?
If the LLMs are necessary for this, then model choice should matter, no? Some 2-year-old version of Mistral, or an Ollama-hosted model, or NanoGPT isn't going to perform as well as OpenAI or Claude, no?
I have not done a SOC 2 audit yet. LogClaw is configured to run locally, and you can deploy it in your org, so technically you own all of your data. Your logs go through many steps. First comes ranking; only the flagged logs, usually 1-30% of your logs, go to the LLM. The LLM is used to understand the root cause and to create a rich-context incident ticket; it is not used to flag your logs. Currently we support standardized OTel logs, so our algorithm can determine 99% of incidents.
Also, with existing tools the developer configures the alerting conditions. LogClaw automatically finds your incidents without you manually setting up alerting conditions in your log dashboard (Splunk/Datadog logs).
>A human SRE never manually detects an error - we already have well-established anomaly detection implementations and wiring them to some ticket generation tool is also an established pattern.
I'm currently dealing with fallout at my job because we were doing all this with humans and no alerts, and we missed a couple of major issues. This product could have prevented a lot of stress in my case, but it'd be a bit like a bandage on a missing limb.
Exactly. Incidents happen with uncaught issues. Something as simple as database query slowness or running out of memory can cause your "perfectly designed feature" to cause a P1. So it is super convenient to have a system that ingests all of your logs and monitors them for you: no need to manually set up alerts, trace traceIDs, or connect logs across microservices.
That still begs the question though: there are existing tools and solutions that do this. Why not use those, and does this being AI make a difference?
"My boss would be more likely to approve it" is a cynical but valid answer.
ALL existing products simply let you set up an alerting system, and that alerting system is configured manually by you; unexpected issues can still arise. LogClaw is not an alerting system. You just send all your logs (it's capable of ingesting terabytes of logs per day), and it automatically ignores all the successful logs and works on the uncaught exceptions and errors from all services and the infrastructure itself.
Logs are pretty dry sometimes.
INFO gives you a ton but it's low SNR.
WARN/ERROR may tell you that something could happen or is happening, but it doesn't tell you what the ramifications may be. It could be nothing!
Now imagine you're getting hundreds, thousands, millions of messages like this an hour. How do you determine what's really important? For instance, if a Kubernetes pod on a single node runs out of space, that could be a problem if your app is only running on that node. But what if your app is spread across 30 nodes?
It's a triage system with context, at least it sounds like it. It's helping you classify based on actual current or potential problems with the app in the ways that a plain log message does not.
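The point above about the same log line meaning different things at different scales can be shown with a toy severity rule. This is purely illustrative; the thresholds and the three-way outcome are my own assumptions, not anything LogClaw documents.

```python
def incident_severity(affected_pods: int, total_pods: int) -> str:
    """Classify a 'disk full' style event by how much of the deployment it hits."""
    frac = affected_pods / total_pods
    if frac >= 0.5:
        return "page"     # most replicas impacted: wake someone up
    if frac >= 0.1:
        return "ticket"   # degraded but still serving: file and triage later
    return "ignore"       # e.g. 1 of 30 nodes: noise at this scale
```

So the identical message yields `"page"` for a single-node app (`incident_severity(1, 1)`) but `"ignore"` when the app is spread across 30 nodes (`incident_severity(1, 30)`), which is exactly the triage-with-context idea.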
LogClaw is capable of ingesting terabytes of logs a day. Our algorithm simply ignores successful request lifecycles, which reduces the strain of analyzing terabytes of logs. The algorithm then ranks and flags potentially problematic logs. Later we retrieve all the logs associated with a flagged log and analyze them further, based on metrics, to decide whether they're worthy of a ticket/incident.
Deciphering ramifications from a log message alone is a pretty unusual way to approach the problem. You still have your 1990s Nagios-style application monitoring, right? So when you wake up to a message that the web monitor says it's not possible to add items to the shopping basket right now, the database monitor signals an unusually long response time, and the application metrics tell you the number of buys is at a fraction of what is normal for this time of day, then that WARN log message from the application telling you that a foreign key constraint was violated is pretty informative!
The quality of your logs is critical. Our algorithm/LLM knows nothing about your code, only the logs. We currently push toward standardizing on OTel-based logs. You can read about it here: https://opentelemetry.io/docs/specs/otel/logs/
How effective are LLMs at triaging issues? Has anyone found success using them to find the root cause? I've only been able to triage effectively for toy examples.
LogClaw's algorithm is the moat here: it flags logs first, and only the flagged ones, usually less than 10% of the logs, are analyzed by the LLM. The LLM is great at finding the root cause if the logs are clear and detailed, so it heavily depends on the quality of your logs. If your logs are rich with info, it will have better insight into understanding them.
Wild Moose just made a blog post[0] about this. They found that putting things into foundation models wasn't cutting it, and that you had to have small, fine-tuned models along with deterministic processes to use AI for RCA.
[0] https://www.wildmoose.ai/post/micro-agents-ai-powered-invest...
Thanks! Looks like I have to request the whitepaper to take a look at the details.
Analyzing logs is not an LLM foundation model issue.
Please upvote if you like our idea: https://www.producthunt.com/products/logclaw
as an iteration: what i'd want from an SRE agent is that it sets up and tests automated alarms
i don't want non-determinism in whether my pager goes off when something breaks.
I also want the agent to get a first look at issues once a ticket has been written. Find relevant logs, metrics, and dashboards, and put them into the ticket.
then, i want it to take a first guess at an RCA, and whether it will solve itself by waiting.
such that by the time i actually am awake, i can read through and decide if anything actually needs to be done.
id also be fine writing up agent skills for how to solve common problems, and be able to run through those, but only if its rock solid. I dont want the agent to make a second issue when i just woke up.
Yes, LogClaw does that. It has a ranking algorithm; if an anomaly is worthy of a ticket, it will give you metrics, blast radius, and evidence logs: all the logs related to that issue. Similar incidents are grouped into one ticket.
And YES, LogClaw.ai is able to ingest terabytes of logs a day.
LLMs aren't the fastest thing in the world, how much data can you realistically parse per second?
We have an algorithm in place. Our system is capable of ingesting terabytes of logs. First they go through an algorithm that ranks every log. To put it simply, the majority of logs are successful requests, health checks, or similar; those are ranked low and not sent to the LLM. Only logs flagged above a certain threshold are analyzed further: we retrieve all the logs associated with them and run a second-pass analysis with the LLM to decide whether they're worthy of a ticket.
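A rough sketch of that pre-LLM filter, under my own assumptions about the log schema (`status`, `level`, `path`, `anomaly_score` are hypothetical field names, and the 0.7 threshold is invented for illustration):

```python
import re

# Paths that should never escalate, e.g. Kubernetes liveness/readiness probes.
NOISE = re.compile(r"health[-_ ]?check|/healthz|/readyz", re.IGNORECASE)

def should_send_to_llm(entry: dict, threshold: float = 0.7) -> bool:
    """Decide whether a log entry survives ranking and reaches the LLM."""
    # Successful request lifecycles are ranked low and dropped immediately.
    if entry.get("status", 0) < 400 and entry.get("level") not in ("ERROR", "FATAL"):
        return False
    # Health checks and similar noise never escalate.
    if NOISE.search(entry.get("path", "")):
        return False
    # Only entries flagged above the threshold get LLM analysis.
    return entry.get("anomaly_score", 0.0) >= threshold
```

A cheap deterministic gate like this is what makes terabyte-scale ingestion compatible with LLM latency: the expensive model only ever sees the small flagged fraction.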
You forgot to remove the bottom part, which is the same message but shortened. Did people just give up in general? I hate this world so much
Thanks for pointing that out. Try LogClaw; we need your input.
Why is this upvoted? The author did not even bother to read what he wrote.
> SOC 2 Type II ready
Huh? You vibecoded the repo in a week and claim it's ready?
I meant that since this is designed to be deployed in a company's private VPC, their data stays with them. Zero vendor data risk. Corrected it. Thanks for pointing it out.
Hey bud, forgot to delete the original prompt at the end.
Appreciate it. Try LogClaw; we need your software insights on it.
when are you renaming it to LogMolt?
I'm waiting for Anthropic's Email. I guess it is not as important as OpenClaw.
[dead]