Launch HN: Sonarly (YC W26) – AI agent to triage and fix your production alerts
by Dimittri
Hey HN, I am Dimittri and we’re building Sonarly (https://sonarly.com), an AI engineer for production. It connects to your observability tools like Sentry, Datadog, or user feedback channels, triages issues, and fixes them to cut your resolution time. Here's a demo: https://www.youtube.com/watch?v=rr3VHv0eRdw.
Sonarly is really about removing the noise from production alerts: it groups duplicates and returns a root cause analysis to save on-call engineers time and cut your MTTR.
Before starting this company, my co-founder and I ran a B2C edtech app that, on some days, had thousands of users. We pushed several times a day, relying on user feedback. Then we set up Sentry; it caught a lot of bugs, but we were getting up to 50 alerts a day, which is a lot for two people. We spent a lot of time filtering the noise to find the real signal so we knew which bug to focus on.
At the same time, we saw how important it is to fix a bug fast when it hits users. In the worst case a bug means churn; at best, a frustrated user. And there are always bugs in production, caused by code errors, database mismatches, infrastructure overload, and behavior specific to individual users. You can't catch all of these beforehand, even with E2E tests or AI code reviews (which catch a lot of bugs, but obviously not all, and take time to run on each deployment). This is even more true with vibe-coding (or agentic engineering).
We started Sonarly with this idea. More software than ever is being built and users should have the best experience possible on every product. The main idea of Sonarly is to reduce the MTTR (Mean Time To Repair).
We started by recreating a Sentry-like tool but without the noise, using only text and session replays as the interface. We built our own frontend tracker (based on open-source rrweb) and used the backend Sentry SDK (open source as well). Companies could just add another tracker in the frontend and add a DSN in their Sentry config to send data to us in addition to Sentry.
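To make that concrete, here's a minimal sketch of what that dual frontend setup looked like conceptually (the DSN and ingest URL below are placeholders, and the buffering is simplified):

    import * as Sentry from "@sentry/browser";
    import { record } from "rrweb";

    // The existing Sentry setup stays untouched.
    Sentry.init({ dsn: "https://examplePublicKey@o0.ingest.sentry.io/0" });

    // Second, lightweight tracker: buffer rrweb session events and ship
    // them to a separate ingest endpoint (placeholder URL).
    const buffer: unknown[] = [];
    record({
      emit(event) {
        buffer.push(event);
        if (buffer.length >= 50) {
          void fetch("https://replay-ingest.example.com/events", {
            method: "POST",
            headers: { "Content-Type": "application/json" },
            body: JSON.stringify(buffer.splice(0)),
          });
        }
      },
    });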
We wanted to build an interface where you don't need to check logs, dashboards, traces, metrics, and code, as the agent would do it for you with plain English to explain the "what," "why," and "how do I fix it."
We quickly realized companies don't want to add a new tracker or change their monitoring stack, as these platforms do the job they're supposed to do. So we decided to build above them. Now we connect to tools like Sentry, Datadog, Slack user feedback channels, and other integrations.
Claude Code is very good at writing code, but handling runtime issues requires more than raw coding ability. It demands deep runtime context, immediate reactivity, and intelligent triage; you can't simply pipe every alert directly into an agent. That's why our first step is converting noise into signal. We group duplicates and filter false positives to isolate clear issues. Once we have a confirmed signal, we trigger Claude Code with the exact context it needs, like the specific Sentry issue and relevant logs fetched via MCP (mostly using grep on Datadog/Grafana). However, things get exponentially harder with multi-repo and multi-service architectures.
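For that first noise-to-signal stage, here's a simplified sketch (the fields and the false-positive rule are illustrative, not our actual schema):

    interface Alert {
      id: string;
      service: string;
      errorType: string;
      message: string;
      count: number;
    }

    // Collapse duplicates with a coarse fingerprint so the agent sees one
    // candidate issue instead of dozens of identical alerts.
    function fingerprint(a: Alert): string {
      return `${a.service}:${a.errorType}:${a.message.slice(0, 80)}`;
    }

    function dedupe(alerts: Alert[]): Alert[] {
      const grouped = new Map<string, Alert>();
      for (const a of alerts) {
        const key = fingerprint(a);
        const existing = grouped.get(key);
        if (existing) existing.count += a.count;
        else grouped.set(key, { ...a });
      }
      return [...grouped.values()];
    }

    // Drop alerts we consider false positives before any agent run is triggered.
    function isFalsePositive(a: Alert): boolean {
      return /health.?check|client network error/i.test(a.message);
    }

    const toSignal = (alerts: Alert[]) =>
      dedupe(alerts).filter((a) => !isFalsePositive(a));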
To handle that, we built an internal map of the production system: basically a .md file updated dynamically. It shows every link between services, logs, and metrics so that Claude Code can understand the issue faster.
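Once an issue is confirmed, the agent invocation is roughly this shape; a sketch with an illustrative file path, prompt, and log excerpt (`claude -p` is Claude Code's non-interactive mode):

    import { execFile } from "node:child_process";
    import { readFile } from "node:fs/promises";
    import { promisify } from "node:util";

    const run = promisify(execFile);

    // Assemble the context the agent needs: the confirmed issue, the
    // dynamically maintained system map, and the matching log lines.
    async function triggerAgent(issueJson: string, logExcerpt: string): Promise<string> {
      const systemMap = await readFile("ops/system-map.md", "utf8"); // illustrative path

      const prompt = [
        "You are debugging a production issue. Root-cause it and propose a fix.",
        "## System map", systemMap,
        "## Sentry issue", issueJson,
        "## Relevant logs", logExcerpt,
      ].join("\n\n");

      // Non-interactive Claude Code run; stdout carries the analysis.
      const { stdout } = await run("claude", ["-p", prompt]);
      return stdout;
    }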
One of our Sentry users was receiving ~180 alerts/day. Here is what their workflow looked like:
- Receive the alert
- Either 1) break focus from their current task (or wake up), or 2) not look at the alert at all (most of the time)
- Go check dashboards to find the root cause (if it's an infra issue), or read the stack trace, events, etc.
- Figure out whether it was a false positive or a real problem (or a known problem already in the fix pipeline)
- Then fix it by giving Claude Code the correct context
We started by cutting the noise: grouping issues took them from 180 alerts/day to 50/day, and assigning a severity based on user/infra impact brought that down to about 5 issues to focus on for the day. Triage happens in 3 steps: deduplicating before triggering a coding agent, gathering the root cause for each alert, and re-grouping by RCA.
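A rough sketch of the re-grouping and ranking step (the types, the exact-match key, and the impact metric are illustrative; the real grouping is fuzzier):

    interface AnalyzedIssue {
      id: string;
      rootCause: string;      // RCA text produced by the agent
      usersAffected: number;
      severity: "low" | "medium" | "high";
    }

    // Issues that share a root cause collapse into one group, and the
    // day's focus list is the handful of highest-impact groups.
    function regroupByRca(issues: AnalyzedIssue[], topN = 5): AnalyzedIssue[][] {
      const groups = new Map<string, AnalyzedIssue[]>();
      for (const issue of issues) {
        const key = issue.rootCause.toLowerCase().trim(); // naive key for the sketch
        if (!groups.has(key)) groups.set(key, []);
        groups.get(key)!.push(issue);
      }
      const impact = (g: AnalyzedIssue[]) =>
        g.reduce((sum, i) => sum + i.usersAffected, 0);
      return [...groups.values()].sort((a, b) => impact(b) - impact(a)).slice(0, topN);
    }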
We launched self-serve (https://sonarly.com) and would love feedback from engineers. We're especially curious about your current workflow when you receive an alert from channels like Sentry (error tracking), Datadog (APM), or user feedback. How do you decide who should fix it? Where do you get the context you need to fix the issue? Do you have any automated workflow to fix bugs, and do you currently use anything to filter the noise from alerts?
We have a large free tier, as we mainly want feedback, and you can self-serve in under 2 minutes. I'll be in the thread with my co-founder to answer your questions, give more technical details, and take your feedback: positive, negative, brutal, everything is constructive!
The dynamic system map (.md file) approach is the most interesting part of this to me. The hardest problem in automated alert triage isn't the deduplication or even the RCA -- it's that the agent doesn't know what "normal" looks like for your system.
I've seen teams try to solve this by feeding every alert directly into an LLM (like nojs describes above), and the failure mode is predictable: the model treats each alert as an isolated incident because it has no topology awareness. It doesn't know that Service A calling Service B with 500ms latency is fine on Tuesday mornings because of the batch job, but a P1 on Wednesday afternoons.
The real question is how you keep that system map accurate as architecture evolves. In my experience, the map drifts within weeks unless it's generated from runtime data (traces, dependency graphs from actual traffic) rather than maintained manually. Static architecture docs are lies within a quarter.
Also curious about the severity scoring -- are you using user impact signals (error rates on user-facing endpoints, session replay data) or purely technical signals? The gap between "this looks bad in logs" and "users are actually affected" is where most alert fatigue comes from. A 500 error on an internal health check endpoint generates the same Sentry noise as a 500 on the checkout flow, but they're not remotely the same priority.
Yes exactly! The purpose isn't to create one PR per alert—that would just move noise from one place to another. The bottleneck we're solving is triaging: cutting the noise and turning it into signal. Once we've done that, we fix issues and show you a PR.
I think it works well because we have two deduplication steps and we group based on RCA—both before and after Claude Code analysis.
Severity also helps cut noise by highlighting which problems/solutions to review first. It's easier when the issue comes from a frontend trigger, but many alerts come from backend errors only. In those cases, Claude Code assigns severity based on its understanding of how a broken feature blocks product usage.
> Especially curious about your current workflows when you receive an alert from any of these channels like Sentry (error tracking), Datadog (APM), or user feedback.
I have a GitHub Action that runs hourly. It pulls new issues from Sentry, grabs as much JSON as it can from the API, and pipes it into Claude. Claude is instructed to either make a PR, open an issue, or add more logging data if there isn't enough to diagnose.
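Roughly, it's something like this (a simplified sketch, not the exact script; the org/project slugs and token env var are placeholders):

    import { execFile } from "node:child_process";
    import { promisify } from "node:util";

    const run = promisify(execFile);

    async function main() {
      // Pull recent unresolved issues from the Sentry API (placeholder slugs).
      const res = await fetch(
        "https://sentry.io/api/0/projects/my-org/my-project/issues/?query=is:unresolved",
        { headers: { Authorization: `Bearer ${process.env.SENTRY_TOKEN}` } },
      );
      const issues: unknown[] = await res.json();

      for (const issue of issues) {
        const prompt =
          "Diagnose this Sentry issue. Open a PR with a fix, file an issue, " +
          "or add logging if the data is insufficient:\n" + JSON.stringify(issue);
        // Non-interactive Claude Code run inside the repo checkout.
        await run("claude", ["-p", prompt]);
      }
    }

    main();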
I would say I can merge about 30% of the PRs; for the remainder, the LLM has applied a bandaid fix without digging deep enough into the root cause.
Also the volume of sentry alerts is high, and the issues being fixed are often unimportant, so it tends to create a lot of “busy work”.
To avoid this 'busy work', we group alerts by RCA (so no duplicate PRs) and filter by severity (so no PRs for false positives or not-that-important issues). We realized early on that turning every alert into a PR just moves the problem from Sentry to GitHub, which defeats the purpose.
Is a one-hour cron job enough to ensure the product's health? Do you receive alerts by email/Slack/other channels for specific issues, or only when a PR is created?
Interesting. Yeah, the only reason it's on a cron is that the Sentry-GitHub integration didn't work for this (can't remember why), and I didn't want to maintain another webhook.
The timing isn't a huge issue though, because the bugs being caught at this stage are rarely so critical that they need to be fixed in less time than that - and the bandwidth is limited by someone reviewing the PR anyway.
The other issue is crazy token wastage, which gets expensive. My gut instinct re: triaging is that I want to do it myself in the prompt - but if it prevents noise from reaching Claude, it may be useful for some folks just for the token savings.
No, I don't receive alerts, because I'm looking at the PR/issues list all day anyway; it would just be noise.
Totally get the 'token wastage' point: sending noise to an LLM is literally burning money.
But another, maybe bigger, cost might be your time reviewing those 'bandaid fixes.' If you're only merging 30%, that means you're spending 70% of your review bandwidth on PRs that shouldn't exist, right?
We deduplicate before the Claude analysis (using the alert context) and again after (based on the RCA), so there's no noise in the PRs you have to review.
Why don't you trust an agent to triage alerts and issues?
Yeah. What I find in practice is that since the majority of these PRs require manual intervention (even if minor, like a single follow-up prompt), it's not significantly better than just hammering them all out myself in one session a few times per week and giving it my full attention for that period.
The exception is when a fix is a) trivial or b) affects a real user and therefore needs to be fixed quickly, in which case the current workflow is useful. But yeah, the real step change was having Claude hit the Sentry APIs directly and get the info it needs, whether async or not.
I'd also imagine that people's experiences with this vary a lot depending on the size and stage of the company - our focus is developing new features quickly rather than maintaining a 100% available critical production service, for example.
Interesting. It makes sense that it depends on the number of alerts you receive. But I'd think that if 70% of the PRs you receive are noise, an AI triager could be useful, provided you give it the context it needs based on your best practices. I'm very curious about the kind of manual intervention you do when a PR requires it. What does the follow-up prompt look like? Is it because the fix was bad, because the RCA itself was wrong, or something else?
I tried the onboarding, but I think it timed out on the Analyzing screen because it couldn't find any issues in my Sentry environment. So I couldn't get too much further.
EDIT: It did let me in, but I don't know why it took so long.
I've worked on teams where there's been one person on rotation every sprint to catch and field issues like these, so taking that job and giving it to an AI agent seems like a reasonable approach.
I think I'd be most concerned about having a separate development process outside of the main issue queue, where agents aren't necessarily integrated into the main workstream.
Hey, thanks for the feedback! After onboarding, we process your recent issues to show you the triage and analysis, so it only works if you have past alerts. Do you have any alerts in Sentry?
We have a Slack bot feature so the agent sits inside the team's workflow and they don't have to go check a separate dashboard.
Sounds interesting. Do you sponsor or otherwise support the open source projects you build on, as mentioned in your description?
We don't have plans to open-source the platform yet, but we prioritize transparency. For example, we display all tool calls and system prompts to help developers verify the RCA immediately. Regarding the open-source projects—are you referring to rrweb and the Sentry SDK? We used them for the first version of our product, but we’ve since switched to connecting directly via OAuth and native integrations.
So we no longer build on top of them as we did before.
Oh ffs my manager is going to be talking about this in the stand up
hopefully it makes your life easier in the end!
Adding more complexity to uncover issues caused by complexity defeats the purpose, but I guess people need to sell shovels.
It's hard to make it simple. The complexity is on our side, but our goal is to cut the noise from production alerts, so we're removing complexity rather than adding it.