Running incident triage from your AI assistant: a walkthrough
AlertKick ships an MCP server, which means your AI assistant can query alerts, inspect hosts, acknowledge pages, and dig into security events in natural language. Here's what that actually looks like during an incident.
The AlertKick team
There’s a particular shape of work that happens during an incident. You wake up, you open a laptop, you start clicking through dashboards. What’s alerting. What’s the host. What are the recent events from that host. When did it start. What changed in the last hour. Is anyone else seeing it.
Most of that work isn’t analytical. It’s context-assembly. You’re a human, gathering the facts that will let you form a hypothesis. The analytical bit - deciding what the problem actually is and what to do about it - happens after. But you spend a disproportionate fraction of each incident chasing the context.
MCP (the Model Context Protocol) changes that, if you use it well. AlertKick ships an MCP server that exposes every interesting bit of the system as a tool your AI assistant can call. Instead of clicking through five dashboards, you open a conversation with your assistant - Claude, Cursor, or anything else that speaks MCP - and ask it what’s happening. The assistant uses the MCP tools to gather the facts for you, in parallel, and tells you what it found.
This is not a speculative feature. It’s live today, it’s open source, and this post is a walkthrough of what it looks like during an actual triage.
What AlertKick exposes over MCP
The AlertKick MCP server exposes a specific set of tools. Every tool corresponds to an action the dashboard lets you do - which means your assistant can do the same actions, from conversation, without needing to translate between the UI and anything else.
The main ones:
list_alerts- active alerts, with filters for status, severity, host, service.get_alert- the full detail for a specific alert: what fired, when, on what host, why.acknowledge_alert,resolve_alert- change the state of an alert.list_servers,get_server- your fleet, either all of it or one host with full details.get_server_containers- what’s running on a host, with resource usage.list_monitors,get_monitor- the configured checks and their recent history.list_heartbeats- every heartbeat, its expected schedule, last ping, current state.list_incidents,get_incident- grouped incidents (multiple related alerts).list_security_events,get_security_event_stats- the eBPF-detection feed, with AI triage verdicts attached.
Every tool is scoped to your team’s data. No configuration beyond pointing your MCP client at the AlertKick endpoint with your API key.
A typical triage conversation
Here’s what this looks like at 3 AM, when a page hits. The prompts below are lightly cleaned up for readability, but the shape is what actually happens.
You: “Something just paged me - can you tell me what’s going on?”
The assistant calls list_alerts with a filter for active, severity=Critical, ordered by recent. It finds the alert. It then calls get_alert for the full detail, and get_server on the affected host.
Assistant: “There’s one Critical alert active. It fired 6 minutes ago on host payments-api-02: disk usage on /var crossed 95% and is still climbing. The host is otherwise healthy - CPU normal, memory fine, network steady. The /var partition has gained about 40% of its capacity in the last three hours, which is well above the host’s historical baseline. Looks like something is writing to /var/log or /var/lib/docker much faster than usual.”
You: “Is there anything else running on that host that could be writing a lot?”
The assistant calls get_server_containers for payments-api-02.
Assistant: “There are 12 containers on the host. One of them - payments-api, the one the host is named after - has a log volume on /var/lib/docker/volumes/ that appears in df output as 73GB and growing. All other containers look normal. The payments-api container itself has been running for 14 days without a restart and is using about what you’d expect for CPU and memory.”
At this point you know where the problem is. You can also ask follow-ups that would have taken several minutes of dashboard-clicking:
You: “Any security events on that host recently that might be related?”
The assistant calls list_security_events filtered to the host and the last few hours.
Assistant: “There are three security events on that host in the last 24 hours. All three are triaged as benign - routine kubectl exec sessions during a deploy this morning. Nothing suspicious, nothing related to the disk issue.”
You: “OK, ack the alert for now - going to restart the container and flush the logs.”
The assistant calls acknowledge_alert with a reason you dictate.
You then do the manual work (SSH in, truncate the log volume, restart the container). While you’re at it, the assistant is, in a separate message, pulling up the last time the container restarted and what changed in its deploy history, so you’ve got that context ready when you come back to explain it to the team.
That whole exchange took maybe three minutes. The same triage without MCP would have been a good ten minutes of clicking across three views.
What works well
Context assembly, in parallel. The assistant can call three tools at once and synthesise the results. You’d be clicking sequentially. This matters most when you’re half-awake and the pressure is on.
Reading security events. Security events in particular have a lot of context attached - the MITRE technique, the parent process, the AI triage verdict. Asking the assistant to summarise what’s in the feed, with a natural-language lens, is materially faster than reading the events yourself, especially when there are several of them.
Writing post-mortems. After the incident is resolved, asking the assistant to walk through the alert history, the relevant hosts, the security events, and produce a timeline is very close to what you’d want in a post-mortem draft. You edit it for voice, not for structure.
Onboarding new team members. The same MCP connection lets a new engineer ask their assistant “show me what a healthy host looks like on this fleet” and get a summary. They don’t need a week to learn the dashboard first.
What doesn’t work well yet
Honesty matters here.
Actions that change state require deliberate intent. Acknowledging and resolving alerts are allowed through MCP. Silencing a check for a maintenance window is allowed. But we don’t let the assistant change thresholds, modify escalation policies, or edit on-call schedules on its own - even if asked. State-changing actions that persist are always confirmed in the UI. That’s a conscious choice, not a limitation to be fixed. An AI assistant making a bad structural change to your monitoring policy is a category of failure worth a design constraint.
Reasoning about correlations across hosts. The model is good at pulling context about individual entities. It’s less reliable at “is this fleet-wide?” without the right prompting. We’re iterating on tool shapes that make fleet-wide questions more natural - in the meantime, asking explicitly (“is this happening on any other host?”) gets better results than expecting the model to think of it.
Novel incidents. For patterns the model has seen before - disk filling, a stuck container, a classic post-exploitation shell - it’s very useful. For genuinely novel failure modes, it’s a good context-gatherer but not a good diagnostician. Use it to assemble the facts; make the call yourself.
How to set it up
If you’re using Claude Desktop, Claude Code, Cursor, or any MCP-compatible client, the AlertKick MCP server is a one-line addition to your client configuration. Point it at our MCP endpoint, give it your API key, and every tool listed at the top of this post is available.
Full configuration instructions are on the MCP feature page. The MCP server itself is open source, so if you want to audit what it does, extend it, or self-host, you can.
Why ship this at all
The short version: investigating an incident is a context-assembly task, and an AI assistant with the right tools is genuinely faster at context assembly than a human clicking through dashboards. That’s a real, measurable improvement in how an on-call person feels about being on-call. It’s not a gimmick.
Teams using it on real incidents don’t go back. The next post in this series will be a tour of every MCP tool with example prompts - the “what can you actually ask?” cheat sheet.
Try it
If you already have AlertKick, connect your AI assistant through MCP and try a triage conversation the next time something pages you. If you don’t yet, get started. The MCP server works on every plan.
One ask: after you’ve used it in an incident or two, tell us what you’d like an MCP tool to do that it doesn’t yet. The surface is small today, and it’s growing along the lines of what actually helps during a triage - not along the lines of what sounds impressive in a demo.