Event-Driven Remediation with Azure Service Bus and Ansible - Solution Guide

Overview

Organizations running workloads on Microsoft Azure use Azure Service Bus as their enterprise messaging backbone – routing events, alerts, and telemetry between services, monitoring tools, and operational systems. When Azure Monitor, Defender for Cloud, or a custom application publishes a critical event to a Service Bus queue or topic, the response today is manual: an engineer reads the message, investigates the affected resource, and remediates by hand.

This guide demonstrates how to connect Azure Service Bus Queues directly to Event-Driven Ansible (EDA), creating a real-time pipeline that consumes Azure events, enriches them with AI-driven root cause analysis, and triggers automated remediation – turning Azure’s messaging layer into the trigger for a closed-loop AIOps workflow.

This guide builds on the AIOps reference architecture.

For the full end-to-end AIOps pipeline – including AI inference, Lightspeed playbook generation, and the Crawl/Walk/Run maturity model – see AIOps automation with Ansible. This guide focuses specifically on using Azure Service Bus as the event transport layer.

Background

Azure Service Bus is a fully managed enterprise messaging service from Microsoft that supports message queuing and publish-subscribe patterns. It decouples event producers (monitoring tools, applications, Azure services) from consumers (automation systems, dashboards, alerting tools), providing reliable message delivery with features like dead-letter queues, sessions, and scheduled delivery.

Azure Service Bus Messaging – microsoft.com

In an AIOps context, Azure Service Bus acts as the event transport layer – sitting between Azure’s monitoring and alerting tools (Azure Monitor, Azure Alerts, Defender for Cloud) and Ansible’s automation engine. Events are published to a Service Bus queue or topic, and Event-Driven Ansible subscribes to consume them in real time.

This architecture is particularly valuable for organizations that already use Azure Service Bus as their event backbone. Instead of building custom integrations for each alert type, you route all events through Service Bus and let a single EDA rulebook handle the dispatch to enrichment and remediation workflows.

What is AIOps? – redhat.com

Solution

What makes up the solution?

EDA is part of Ansible Automation Platform.

EDA uses rulebooks to monitor events, then executes specified job templates or workflows based on the event. Think of it simply as inputs and outputs. EDA is an automatic way for inputs into Ansible Automation Platform, where Automation controller is the output (running a job template or workflow).

Who Benefits

Persona Challenge What They Gain
IT Ops Engineer / SRE Monitoring Azure Service Bus queues for critical events and manually investigating each one across hybrid Azure and on-prem infrastructure Azure events automatically trigger enrichment and remediation workflows – the fix starts before the engineer even sees the alert
Automation Architect Integrating Azure’s messaging layer with Ansible requires understanding Service Bus authentication, EDA event source plugins, and credential management A reference architecture with production-ready EDA rulebooks, Azure Service Bus configuration, and tested collection usage
IT Manager / Director Azure event volume is growing but response times aren’t improving; hybrid cloud operations span Azure and on-prem with no unified automation trigger A single event pipeline that bridges Azure monitoring with Ansible remediation across hybrid environments – measurable MTTR reduction

Prerequisites

Ansible Automation Platform

Collection Type Purpose
ansible.eda Certified EDA event sources and filters – includes the azure_service_bus event source plugin
azure.azcollection Certified Manage Azure resources (VMs, networking, storage, etc.) for remediation tasks
ansible.controller Certified Automation Controller configuration as code (job templates, workflows, surveys)
redhat.ai Certified AI model inference using the OpenAI-compatible API via InstructLab
ansible.scm Certified Git operations (commit and push generated playbooks)

External Systems

System Required Notes
Azure Service Bus namespace Yes With at least one queue or topic/subscription configured
Azure subscription Yes With permissions to create Service Bus resources and configure Shared Access Policies
AI inference endpoint Yes Red Hat AI (RHEL AI + InstructLab) or any OpenAI-compatible API
Ansible Lightspeed Recommended For dynamic playbook generation at the Run maturity level
Git repository Yes GitHub, GitLab, or Gitea for storing generated playbooks
Chat or ITSM tool Recommended Slack, Mattermost, or ServiceNow for human-in-the-loop notifications

Azure Service Bus to Ansible Workflow

The workflow has four stages, matching the AIOps reference architecture:

  1. Azure Event → Service Bus → EDA – Azure Monitor, Defender for Cloud, or a custom application publishes an event to a Service Bus queue. EDA subscribes to the queue and consumes the message.
  2. Enrichment Workflow – AAP gathers additional context from the affected Azure resource or on-prem host, sends the enriched data to Red Hat AI for root cause analysis, and notifies the operations team.
  3. Remediation Workflow – Ansible Lightspeed generates a remediation playbook from the AI analysis, commits it to Git, and creates a Job Template.
  4. Execute Remediation – The generated playbook runs against the affected infrastructure (Azure VMs, on-prem hosts, or network devices), resolving the issue.

Operational Impact per Stage

Stage Operational Impact Why
1. Azure Event → EDA None Read-only – EDA reads a message from the Service Bus queue. No changes to systems.
2. Enrichment Workflow Low Collects logs and system info, calls an AI API, posts to chat/ITSM. No infrastructure changes.
3. Remediation Workflow Low Generates a playbook, commits to Git, creates a Job Template. Prepares the fix but does not touch production.
4. Execute Remediation High Modifies production infrastructure. Should go through a change window or approval gate.

Workflow Diagram

Azure Monitor/Alert → Service Bus Queue → EDA Rulebook → Enrichment Workflow → AI Analysis → Remediation Workflow → Execute Fix

Azure Service Bus is the event transport layer.

In the AIOps reference architecture, Kafka or a direct webhook handles event transport. When using Azure Service Bus, it replaces that layer – Azure services publish events to a queue, and EDA subscribes directly using the built-in ansible.eda.azure_service_bus plugin.

Solution Walkthrough

1. Configure Azure Service Bus Queue

Operational Impact: None (Azure configuration only)

Create a Service Bus namespace and queue in Azure, then configure a Shared Access Policy that EDA will use to authenticate.

Using the Azure CLI:

az servicebus namespace create \
  --name aiops-events \
  --resource-group rg-aiops \
  --location eastus \
  --sku Standard

az servicebus queue create \
  --name infrastructure-alerts \
  --namespace-name aiops-events \
  --resource-group rg-aiops

az servicebus namespace authorization-rule create \
  --name eda-listener \
  --namespace-name aiops-events \
  --resource-group rg-aiops \
  --rights Listen

Or automate the same setup with the azure.azcollection:

- name: Create Azure Service Bus queue for AIOps events
  azure.azcollection.azure_rm_servicebusqueue:
    resource_group: rg-aiops
    namespace: aiops-events
    name: infrastructure-alerts
    max_size_in_mb: 1024
    default_message_time_to_live: "PT1H"
  delegate_to: localhost

Route Azure Monitor alerts to Service Bus.

In the Azure Portal, navigate to Monitor → Alerts → Action Groups and create an action group with a Service Bus Queue action type. This routes Azure Monitor alerts directly into your queue where EDA can consume them.

2. EDA Rulebook for Azure Service Bus Events

Operational Impact: None

The EDA rulebook subscribes to the Azure Service Bus queue using the built-in ansible.eda.azure_service_bus event source plugin. When a message arrives, the rulebook evaluates conditions and triggers the enrichment workflow.

---
- name: Azure Service Bus AIOps response
  hosts: all
  sources:
    - ansible.eda.azure_service_bus:
        conn_str: "{{ azure_service_bus_connection_string }}"
        queue_name: infrastructure-alerts
        logging_enable: true

  rules:
    - name: Azure infrastructure alert detected
      condition: event.body is defined
      action:
        run_workflow_template:
          organization: "Default"
          name: "Azure Alert Enrichment Workflow"
          extra_vars:
            alert_name: "{{ event.body.data.essentials.alertRule | default('Unknown Alert') }}"
            severity: "{{ event.body.data.essentials.severity | default('Sev3') }}"
            affected_resource: "{{ event.body.data.essentials.alertTargetIDs[0] | default('') }}"
            description: "{{ event.body.data.essentials.description | default('') }}"
            fired_time: "{{ event.body.data.essentials.firedDateTime | default('') }}"

The rulebook extracts key fields from the Azure Monitor alert schema – the alert rule name, severity, affected Azure resource ID, description, and timestamp – and passes them to the enrichment workflow.

Azure Monitor Common Alert Schema.

Azure Monitor alerts follow a Common Alert Schema with essentials (severity, alert rule, target resource) and alertContext (metric/log details). Your rulebook conditions can filter on any of these fields – for example, only triggering automation on Sev0 or Sev1 alerts.

3. Enrichment Workflow – Gather Context and Analyze with AI

Operational Impact: Low

Once EDA triggers the enrichment workflow, Ansible gathers additional context from the affected Azure resource and sends everything to Red Hat AI for root cause analysis. This is the same pattern described in the AIOps reference architecture – Log Enrichment and Prompt Generation Workflow.

Step 3a: Gather context from the affected Azure VM

- name: Gather Azure VM context for alert
  hosts: localhost
  tasks:
    - name: Get Azure VM details
      azure.azcollection.azure_rm_virtualmachine_info:
        resource_group: "{{ resource_group }}"
        name: "{{ vm_name }}"
      register: vm_info

    - name: Get recent Azure activity log entries
      azure.azcollection.azure_rm_resource_info:
        url: "/subscriptions/{{ subscription_id }}/providers/Microsoft.Insights/eventtypes/management/values"
        api_version: "2015-04-01"
      register: activity_log

Step 3b: Gather diagnostics from the host itself

- name: Gather host diagnostics
  hosts: "{{ affected_host }}"
  become: true
  tasks:
    - name: Check for failed systemd services
      ansible.builtin.shell:
        cmd: "systemctl list-units --state=failed --no-pager"
      register: failed_services

    - name: Collect recent error logs
      ansible.builtin.command:
        cmd: "journalctl --since '2 hours ago' --priority=err --no-pager"
      register: recent_errors

    - name: Check disk and memory usage
      ansible.builtin.setup:
        filter:
          - ansible_memtotal_mb
          - ansible_memfree_mb
          - ansible_mounts

Step 3c: Analyze with Red Hat AI

- name: Analyze Azure alert with Red Hat AI
  hosts: localhost
  tasks:
    - name: Build enriched prompt
      ansible.builtin.set_fact:
        enriched_prompt: |
          An Azure Monitor alert fired: {{ alert_name }}
          Severity: {{ severity }}
          Affected resource: {{ affected_resource }}
          Description: {{ description }}
          VM status: {{ vm_info.vms[0].power_state | default('unknown') }}
          Failed services: {{ hostvars[affected_host].failed_services.stdout | default('none') }}
          Recent errors: {{ hostvars[affected_host].recent_errors.stdout | default('none') | truncate(2000) }}

          Diagnose the root cause and recommend a remediation step.

    - name: Send to Red Hat AI for analysis
      redhat.ai.completion:
        base_url: "http://{{ rhelai_server }}:{{ rhelai_port }}"
        token: "{{ rhelai_token }}"
        prompt: "{{ enriched_prompt }}"
        model_path: "/root/.cache/instructlab/models/granite-8b-lab-v1"
      delegate_to: localhost
      register: ai_response

4. Notify and Remediate

Operational Impact: Low (notification) → High (remediation execution)

The enrichment workflow posts the AI analysis to your team’s communication channel and – depending on your maturity level – either queues a remediation playbook for human approval or executes it automatically.

    - name: Post AI diagnosis to Slack
      ansible.builtin.uri:
        url: "{{ slack_webhook_url }}"
        method: POST
        body_format: json
        body:
          text: |
            :rotating_light: *Azure Alert:* {{ alert_name }}
            *Severity:* {{ severity }}
            *Resource:* {{ affected_resource }}
            *AI Diagnosis:* {{ ai_response.choice_0_text }}

    - name: Create ServiceNow incident with enriched context
      servicenow.itsm.incident:
        instance:
          host: "{{ snow_instance }}"
          username: "{{ snow_username }}"
          password: "{{ snow_password }}"
        short_description: "Azure Alert: {{ alert_name }}"
        description: "AI Analysis: {{ ai_response.choice_0_text }}"
        priority: "{{ '1' if severity == 'Sev0' else '2' if severity == 'Sev1' else '3' }}"
      when: snow_instance is defined

From here, the remediation follows the same pattern as the AIOps Remediation Workflow – Lightspeed generates a playbook, it gets committed to Git, and a Job Template is created for execution.

Validation

Stage What to Verify Success Indicator
1. Azure Event → EDA Message arrives in queue and EDA consumes it EDA Controller shows the rulebook activation as Running; event log shows the Azure alert payload
2. Enrichment Workflow AI analyzed the alert and notifications were sent Workflow Visualizer shows all nodes green; Slack/ITSM received the AI diagnosis
3. Remediation Workflow Playbook was generated and committed New playbook file exists in the Git repository; Job Template was created
4. Execute Remediation The fix was applied Azure resource returns to healthy state; Azure Monitor alert auto-resolves

Quick validation test – Send a test message to the Service Bus queue using the Azure CLI:

az servicebus queue send \
  --namespace-name aiops-events \
  --queue-name infrastructure-alerts \
  --resource-group rg-aiops \
  --body '{
    "data": {
      "essentials": {
        "alertRule": "Test High CPU Alert",
        "severity": "Sev2",
        "alertTargetIDs": ["/subscriptions/xxx/resourceGroups/rg-aiops/providers/Microsoft.Compute/virtualMachines/web01"],
        "description": "CPU usage exceeded 95% for 5 minutes",
        "firedDateTime": "2026-02-18T14:30:00Z"
      }
    }
  }'

Troubleshooting

Symptom Likely Cause Fix
EDA rulebook is active but no events are received Connection string is wrong or queue name doesn’t match Verify conn_str has Listen permissions; check queue name matches exactly
EDA receives event but condition doesn’t match Azure alert payload structure differs from expected schema Use debug action in the rulebook to print event.body and compare against your conditions
Enrichment workflow runs but AI response is empty Prompt is too vague or AI endpoint is unreachable Verify the Red Hat AI server is running; test with a simple prompt first
Azure VM info retrieval fails Azure credentials in AAP are expired or lack permissions Verify the Azure Service Principal credential; test with azure.azcollection.azure_rm_virtualmachine_info
Messages pile up in the dead-letter queue EDA consumer is crashing or message format is unexpected Check EDA Controller logs; inspect dead-letter messages in Azure Portal for error details

Maturity Path

Maturity Approach How It Works AI Role
Crawl Alert Enrichment Azure alert → Service Bus → EDA → AI diagnoses → enriched context posted to Slack/ITSM → human investigates and remediates Read-only: AI interprets, humans act
Walk Curated Remediation Azure alert → Service Bus → EDA → AI diagnoses → AI selects a pre-approved playbook → human approves → playbook executes AI selects from existing automation
Run Auto-Remediation Azure alert → Service Bus → EDA → AI diagnoses → Lightspeed generates a remediation playbook → policy engine validates → playbook executes AI generates new automation within policy boundaries

Start with Crawl.

Route Azure Monitor alerts to Service Bus and deploy the EDA rulebook and enrichment workflow first. Let your team see AI-enriched Azure alerts in Slack for a few weeks before adding automated remediation. This builds confidence in the pipeline and catches edge cases early.


Summary

With Azure Service Bus connected to Event-Driven Ansible, your Azure monitoring alerts become the trigger for automated enrichment and remediation. Instead of engineers manually triaging every Azure Monitor alert, events flow through Service Bus into an AIOps pipeline that diagnoses root cause with AI and remediates with Ansible – reducing MTTR and bridging the gap between Azure’s cloud monitoring and your operational automation, whether the affected systems are in Azure, on-prem, or hybrid.