AIOps automation with Ansible - Solution Guide

Overview

aiops

Traditional event-driven automation is deterministic – for every event you want to handle, you write a specific rule and a corresponding action. Ten known failure scenarios means ten rules. A hundred means a hundred. This creates a linear scaling problem: as your IT environment grows in complexity, so does the number of rules you must author, test, and maintain.

Approach Events Rules Required Actions
Traditional EDA 10 10 10
Traditional EDA 100 100 100
Traditional EDA 1,000 1,000 1,000
AIOps with EDA 1,000 1 (+ AI inference) Dynamic

AIOps breaks this linear relationship by inserting AI inference between the event and the action. Instead of hand-coding a rule for every possible failure mode, a single intelligent workflow captures the event, uses AI to diagnose the root cause, and generates the remediation dynamically. This guide demonstrates how to build that workflow using Ansible Automation Platform.

Where does AIOps fit?

The industry is moving through three generations of IT automation: task-based automation (running known playbooks on a schedule or trigger), event-driven automation (reacting to known conditions with pre-written rules), and agent-driven automation (handling novel situations with real-time contextual judgment). Most organizations today rely heavily on task-based automation, with event-driven adoption growing and agent-driven still emerging. AIOps with Ansible bridges event-driven and agent-driven by using AI to handle situations you didn’t explicitly write rules for, without requiring a fully autonomous agent.

Background

AIOps stands for artificial intelligence for IT operations. It refers both to a modern approach to managing IT operations and to the software systems that implement it. AIOps uses data science, big data, and machine learning to augment–or even automate–many traditionally manual IT tasks. The goal is to improve issue detection, root cause analysis, and system resolution.

What is AIOps? – redhat.com

There are three major parts of AIOps:

aiops diagram

Ansible Automation Platform connects observability and inference to build self-healing infrastructure.

AIOps adoption can be incremental.

You don’t need full automation on Day One. Start small, think big!

Solution

What makes up the solution?

EDA is part of Ansible Automation Platform.

It is referred to separately sometimes depending on the workflow. EDA uses rulebooks to monitor events, then executes specified job templates or workflows based on the event. Think of it simply as inputs and outputs. EDA is an automatic way for inputs into Ansible Automation Platform, where Automation controller / Automation execution is the output (running a job template or workflow).

Who Benefits

Persona Challenge What They Gain
IT Ops Engineer / SRE Manually triaging alerts, running the same diagnostic steps repeatedly, and writing one-off remediation scripts Automated incident detection, AI-generated diagnosis, and dynamically generated playbooks – less toil, faster recovery
Automation Architect Designing event-driven workflows that scale beyond what deterministic rules can cover A reference architecture for bridging EDA, AI inference, and playbook generation – adaptable to any observability tool or ITSM
IT Manager / Director Justifying AI investment and managing operational risk while reducing MTTR Incremental adoption path (Crawl → Walk → Run), measurable reduction in manual intervention, and governance guardrails before full automation

Prerequisites

Ansible Automation Platform

Collection Type Purpose
ansible.eda Certified EDA event sources and filters (Kafka, webhooks, etc.)
ansible.controller Certified Automation Controller configuration as code (job templates, workflows, surveys)
ansible.scm Certified Git operations (commit and push generated playbooks)
infra.ai Validated Provisions RHEL AI infrastructure (AWS, Azure, GCP, bare metal)
redhat.ai Certified Configures and serves AI models using InstructLab

Need to deploy your own AI inference endpoint?

The infra.ai and redhat.ai collections automate the full stack – from provisioning a GPU instance to serving a model. See the companion guide AI Infrastructure automation with Ansible for a complete walkthrough.

External Systems

System Required Examples
Observability tool Yes Filebeat, IBM Instana, Splunk, Dynatrace, Prometheus
Message queue Optional (depends on observability tool) Apache Kafka, AWS SQS, Azure Service Bus
AI inference endpoint Yes Red Hat AI (RHEL AI + InstructLab) or any OpenAI-compatible API
Ansible Lightspeed Yes Ansible Lightspeed with IBM watsonx Code Assistant
Git repository Yes GitHub, GitLab, Gitea
Chat or ITSM tool Recommended Mattermost, Slack, ServiceNow

AIOps Workflow

An AIOps workflow has four (4) parts:

  1. Event-Driven Ansible (EDA) Response

    Respond to an event, such as a systemd application outage. This falls under the Observability part of the AIOps workflow. EDA allows us to plug-in observability to an automation workflow.

  2. Log Enrichment and Prompt Generation Workflow

    AAP coordinates with Red Hat AI, notifies your chat application or ITSM (IT service management tool). This falls under the Inference part of the AIOps workflow. Again, Ansible Automation Platform ties this into an automation workflow.

  3. Remediation Workflow

    Generates a playbook via Ansible Lightspeed, syncs it to Git, builds another Job Template. This also falls under the Inference part of the AIOps workflow. For this solution we are using one AI LLM endpoint (Red Hat AI) to figure out what the issue is and diagnose it, and one AI LLM endpoint to create an Ansible Playbook to remediate the issue. This is considered a multi-LLM workflow, as we are using 2 or more LLM endpoints within our workflow.

  4. Execute Remediation

    The final Job Template that fixes the issue on your IT infrastructure. This is executing the Ansible Playbook that was generated in the previous workflow. This falls under the automation part of AIOps and wraps up our self healing infrastructure use-case.

Operational Impact per Stage

Stage Operational Impact Why
1. EDA Response None Read-only – EDA listens to events and triggers a workflow. No changes to systems.
2. Enrichment Workflow Low Collects logs, calls an AI API, posts to chat/ITSM. The only write is a notification message – no infrastructure changes.
3. Remediation Workflow Low Generates a playbook, commits to Git, creates a Job Template. Prepares the fix but does not touch production infrastructure.
4. Execute Remediation High Modifies production infrastructure (restarting services, changing config files, etc.). Should go through a change window or approval gate.

Stages 1-3 are safe to experiment with in any environment. Stage 4 is where production risk lives – which is why the guide recommends a manual approval gate at the Walk maturity level and policy-governed auto-approval at the Run level.

Could this be one workflow?

Yes – but it’s broken up for review points and easier adoption.

Example Workflow Diagram

This is a workflow example from our hands-on workshop Introduction to AI-Driven Ansible Automation

overview_diagram

This is a high level diagram.

It shows an opinionated approach for AIOps that is easily customizable for a variety of IT infrastructure use-cases.

1. Event-Driven Ansible (EDA) Response

The first part of the AIOps workflow is the Event-Driven Ansible (EDA) Response. Here is a breakdown of the four main components:

  1. IT infrastructure event
  2. Observability tool picks up event
  3. EDA sees event in message queue
  4. Execute Enrichment workflow

Example Walkthrough for EDA Response

In our hands-on workshop we simulate an httpd application outage. This is how the workflow works:

  1. IT infrastructure event: The student will break httpd using a Job Template
  2. Observability tool picks up event: Filebeat monitors Apache logs, Kafka acts as the event transport
  3. EDA sees event in message queue: EDA listens to Kafka, launches automation workflows
  4. Execute Enrichment workflow Ansible Automation Platform kicks off part 2, Log Enrichment and Prompt Generation Workflow

Why Kafka?

Apache Kafka is a distributed streaming platform used for building real-time data pipelines and streaming applications, enabling applications to publish, consume, and process high volumes of data streams. It is all open source and self hosted and works great for workshops. This could be replaced by any event bus of your choosing. Event-Driven Ansible has numerous plugins including integrations with AWS SQS, AWS CloudTrail, Azure Service Bus, and Prometheus.

Why Filebeat?

Filebeat is a lightweight shipper for logs. It is also free and open source and works great for lab environments. Event-Driven Ansible has numerous plugins including integrations with BigPanda, Dynatrace, IBM Instana, Zabbix, and CyberArk.

Do you need both a message bus and an observability tool?

It depends on the particular integration. Generally combining a message bus and an observability tool will scale the most, but it really depends on your particular use-case, amount of events, etc. Many observability platforms can work directly with Event-Driven Ansible just fine.

Now that you understand our workflow example, lets dive into production-grade examples:

1. IT infrastructure event:

What kind of events are relevant in an AIOps workflow? EDA’s value proposition is that it is extremely versatile and pluggable. This means that Ansible Automation Platform can effectively handle any type of event.

However here is a great list of ideas:

Application-Level Events

Event Source Example Response
application service crash or stopped systemd, monit, or Prometheus alert Restart, notify on Slack, check last logs via journald
High 5xx error rate in NGINX/Apache Web server logs, Prometheus metrics Trigger Ansible to roll back a recent deployment or redirect traffic
Application log shows exception spike Log aggregator (ELK, Loki, Datadog) Run Ansible remediation that restarts service and clears cache
Web app fails readiness check Kubernetes liveness/readiness probe Reboot pod, scale another replica, or notify developer team
API latency exceeds threshold APM tool (e.g., Dynatrace, Instana) Provision more backend instances or restart slow services

Infrastructure & Platform Events

Event Source Example Response
Disk space usage > 90% Prometheus Node Exporter, Zabbix Clean temp files, archive logs, or extend volume
High CPU load on EC2 or VM CloudWatch, Telegraf, etc. Scale out VM set or move workloads
OOM Kill in container Kubernetes Events Restart pod, notify engineering, increase memory limits
Node goes NotReady in Kubernetes K8s API Cordon node, reassign pods, and open a ticket
Filesystem becomes read-only dmesg, audit logs, OS-level alerts Unmount and remount or migrate app to another host

Network & Security Events

Event Source Example Response
Interface down / Link failure SNMP traps, syslog Notify NOC, run Ansible playbook to reroute traffic
Unauthorized SSH attempt detected Fail2Ban, syslog, SIEM Block IP, rotate SSH keys, notify SOC
SSL certificate expiring soon Certbot, monitoring tool Auto-renew with Let’s Encrypt, push new cert with Ansible
DNS resolution failures systemd-resolved, DNS logs Switch DNS provider or fix /etc/resolv.conf

Observability-Driven Triggers

Event Source Example Response
SLO breach warning (latency or error rate) SRE observability tool Pre-emptively scale or rollback new features
New critical error introduced in logs post-deploy ELK, Honeycomb, etc. Rollback deployment via Ansible
User reports spike via feedback or ticket ITSM tools Use AI to correlate symptoms, gather diagnostics, and kick off a fix

2. Observability tool picks up event:

Now that you have a great understanding of types of events, what are great examples of observability tools that can plug-in to an AIOps workflow:

Filebeat

Filebeat is a lightweight, open-source log shipper from Elastic that forwards logs from end-systems to a message aggregator. It is not an observability platform on its own – it requires a message bus like Kafka to transport events to EDA. We use it in the AIOps workshop because it is free, low-overhead, and easy to deploy in lab environments.

IBM Instana

IBM Instana on Automation hub

IBM Instana provides real-time observability across hybrid and multicloud environments with automatic change detection, end-to-end tracing, and context-rich alerts. Its built-in anomaly detection and contextual correlation make it a natural event source for EDA rulebooks – Instana can trigger automation workflows directly or through a message bus like Kafka.

Splunk

Splunk on Automation hub

Splunk ingests logs, metrics, traces, and events from virtually any source, providing a centralized view of system health across complex IT environments. Its machine learning and anomaly detection capabilities can proactively surface issues, making it an ideal trigger source for EDA-driven remediation workflows.

3. EDA sees event in message queue

Message queues are optional depending on the observability tool. For example IBM Instana can work directly with Event-Driven Ansible to trigger automation jobs, or it can work with a message queue like Apache Kafka. Here are some examples of other message queues that Ansible Automation Platform works with:

AWS SQS

AWS on Automation hub

Amazon SQS (Simple Queue Service) is a managed message queuing service that decouples event producers from consumers. In an AIOps workflow, observability tools or AWS CloudWatch can publish events to an SQS queue, and EDA subscribes to that queue to trigger automation.

Azure Service Bus

Azure Service Bus on Automation hub

Azure Service Bus is a fully managed enterprise messaging service with message queuing and publish-subscribe capabilities. EDA can subscribe to Service Bus topics or queues to detect events from Azure resources and third-party systems, making it a natural fit for cloud-centric AIOps workflows.

Kafka

Kafka on Automation hub

Apache Kafka is a high-throughput, fault-tolerant event streaming platform that collects telemetry data, logs, alerts, and state changes from diverse sources. EDA listens to Kafka topics for specific patterns and triggers remediation workflows. Kafka’s ability to decouple producers and consumers makes it ideal for scaling AIOps pipelines across large environments.

Example rulebook for Kafka:

---
- name: Web app issue
  hosts: all
  sources:
   - ansible.eda.kafka:
       host: service1
       port: 9092
       topic: httpd-error-logs

  rules:
    - name: apache shutdown detected
      condition: event.body.message is search("shutting down")
      action:
        run_workflow_template:
          organization: "Default"
          name: "{{ workflow_template_name | default('AI Insights and Lightspeed prompt generation') }}"

    - name: show event messages
      condition: event.body.message is defined
      action:
        debug:
          msg: "{{ event.body.message }}"

2. Log Enrichment and Prompt Generation Workflow

The second part of the AIOps workflow is the Log Enrichment and Prompt Generation Workflow (or for shorthand the Enrichment Workflow). Here is a breakdown of the four main components:

  1. Capture Additional Information
  2. Red Hat AI: Analyze Incident
  3. Notify Chat / ITSM
  4. Build Ansible Lightspeed Job Template

1. Capture Additional Information

In our AIops workshop we have an Ansible Playbook that captures additional information from the host. If server01 reports an outage with an application (e.g. httpd), we can have Ansible collect additional information to

  1. Confirm the systemd process is down
  2. Collect relevant logs
  3. Collect system information (operating system, memory, etc).

This process is similar to agentic workflow, where we capture information, just as we need it.

What is an agentic workflow?

An agentic workflow with AI refers to a system where an AI agent is empowered to make decisions, take actions, and pursue goals across multiple steps–often autonomously. Unlike a single API call or a static prompt, agentic workflows involve planning, reasoning, tool use, and possibly interacting with other agents or services. These agents maintain state, adjust behavior based on feedback, and operate in loops (like ReAct or AutoGPT). The goal is to replicate more human-like problem solving, where the AI isn’t just responding, but actively working through a task. This is especially useful in AIOps, automation, and multi-step orchestration.

In this case we have a static workflow versus a fully agentic workflow, but unlike just a static LLM query, we are giving inputs from multiple sources, the event, the observability tool, system logs, and an Ansible Playbook running on the host to retrieve any additional info. In the future you will see increasingly more and more ability for the operations team to allow AI tools the ability to act more autonomously.

2. Red Hat AI: Analyze Incident

To interface with Red Hat AI we can use the redhat.ai content collection, which wraps the OpenAI-compatible API into native Ansible modules. Many AI tools, including Red Hat AI, are using the OpenAI API standard.

Why is the OpenAI API the standard?

The OpenAI API (/v1/completions, /v1/chat/completions) became a standard for interacting with LLMs because:

  • It was the first widely adopted commercial LLM API
  • It has a clean, JSON-based format that is easy to integrate
  • Tons of apps, SDKs, frameworks (like LangChain, AutoGen, Semantic Kernel) built around it

In an AIOps context, inference refers to the process where an AI model receives a question or input–such as a system error, log snippet, or performance metric–via an API, and returns a prediction or insight. This could include identifying root causes, classifying incidents, or suggesting remediation steps. The model has already been trained, so inference is the real-time application of that knowledge to live operational data. It’s a key part of integrating AI into IT workflows, enabling automated, intelligent responses without human intervention.

Here is an Ansible task for inference to Red Hat AI using the redhat.ai collection:

    - name: Analyze incident with Red Hat AI
      redhat.ai.completion:
        base_url: "http://{{ rhelai_server }}:{{ rhelai_port }}"
        token: "{{ rhelai_token }}"
        prompt: "{{ gpt_prompt | default('What is the capital of the USA?') }}"
        model_path: "/root/.cache/instructlab/models/granite-8b-lab-v1"
      delegate_to: localhost
      register: gpt_response

The response is available as gpt_response.choice_0_text (the first answer) and gpt_response.raw_response (the full JSON payload).

For advanced use cases that need full control over parameters like temperature, max_tokens, top_p, and n, use the raw_input parameter instead:

    - name: Analyze incident with Red Hat AI (advanced)
      redhat.ai.completion:
        base_url: "http://{{ rhelai_server }}:{{ rhelai_port }}"
        token: "{{ rhelai_token }}"
        raw_input:
          prompt: "{{ gpt_prompt }}"
          model: "/root/.cache/instructlab/models/granite-8b-lab-v1"
          max_tokens: 50
          temperature: 0
          top_p: 1
          n: 1
      delegate_to: localhost
      register: gpt_response

Alternatively, you can call the OpenAI-compatible API directly with ansible.builtin.uri. This approach works with any model server – not just Red Hat AI:

    - name: Send POST request using uri module
      ansible.builtin.uri:
        url: "http://{{ rhelai_server }}:{{ rhelai_port }}/v1/completions"
        method: POST
        headers:
          accept: "application/json"
          Content-Type: "application/json"
          Authorization: "Bearer {{ rhelai_token }}"
        body:
          prompt: "{{ gpt_prompt | default('What is the capital of the USA?') }}"
          max_tokens: "{{ max_tokens | default('50') }}"
          model: "/root/.cache/instructlab/models/granite-8b-lab-v1"
          temperature: "{{ input_temperature | default(0) }}"
          top_p: "{{ input_top_p | default(1) }}"
          n: "{{ input_n | default(1) }}"
        body_format: json
        return_content: true
        status_code: 200
        timeout: 60
      register: gpt_response

Both approaches produce the same result – the redhat.ai.completion module handles the HTTP details (headers, URL path, body format) for you, while ansible.builtin.uri gives you full control over the raw request.

Why set n: 1?

You’d want n: 1 when you’re only interested in getting a single best response from the AI:

  • Reduces overhead: Less processing time and memory usage, especially important when running local models like Red Hat AI.
  • Simplifies parsing: You don’t have to iterate over multiple completions or choose the best one.
  • Keeps things deterministic (especially with temperature: 0): When you’re aiming for predictable, repeatable automation, one clear response is ideal.
  • n: 1 is perfect for most automation tasks where you just need one solid, confident answer without extra noise.

Tools That Support the OpenAI-Compatible API

Tool Description
vLLM Efficient LLM serving engine with OpenAI-compatible endpoints for both completion and chat APIs
llama.cpp Lightweight C++ LLM inference with a --server mode that mimics OpenAI API
Ollama CLI + local model manager that serves LLaMA and Mistral models via OpenAI-compatible endpoints
LM Studio GUI for local LLMs that exposes an OpenAI-compatible endpoint
Text Generation WebUI Hugely flexible GUI for multiple models (GGUF, HF, etc), includes OpenAI API adapter
LocalAI Drop-in OpenAI replacement server with multi-backend support (llama.cpp, GPT4All, etc)
DeepInfra Model hosting provider offering OpenAI-compatible APIs
Replicate Platform for hosting ML models with optional OpenAI-style API access
Anyscale Provides OpenAI-compatible APIs for hosted open models with high performance

3. Notify Chat / ITSM

This is where we synchronize the output from Red Hat AI to another outside system. Here some great options:

Mattermost

Mattermost Documentation

Mattermost is an open-source, self-hostable chat platform with robust API and webhook integrations. We use it in the AIOps workshop as a free alternative to Slack or Microsoft Teams – Ansible posts AI-generated diagnoses and remediation updates to a Mattermost channel in real time.

ServiceNow

ServiceNow on Automation hub

ServiceNow is an enterprise-grade IT Service Management (ITSM) platform and the system of record for incidents, changes, and problem management. In AIOps workflows, AI-driven insights enrich ServiceNow tickets automatically, and Ansible closes the loop by triggering remediation directly from ticket data.

Slack

Slack is an enterprise messaging platform with a rich API and app ecosystem that integrates seamlessly with Ansible, ServiceNow, and monitoring tools. In AIOps workflows, Slack channels serve as a real-time hub where AI-driven alerts, diagnostics, and remediation updates are delivered directly to operations teams.

4. Build Ansible Lightspeed Job Template

The final piece of this workflow is creating (or updating) a Ansible Automation Platform job template with the insights we just gained from Red Hat AI.

We have a problem, such as an application outage, and we now have a solution: for example: “there is a mis-configuration on this line” and now we can use this information we gleaned and now prompt Ansible Lightspeed in the next workflow, the Remediation workflow. We are basically using one AI endpoint (Red Hat AI) to help create a prompt to a second AI endpoint (Ansible Lightspeed) to create an Ansible Playbook to help remediate the issue.

A way to do this (an opinionated way, but not the only way) is to use an Ansible Survey.

In the above screenshot, the left is the prompt we will use in the next workflow, while the right is the insights we gleaned from Red Hat AI based on all the information it had. This is a natural breakpoint where the human can course correct the prompt. We will still have time to review the solution before we move it into production, but it may make sense for your IT operations team to review this prompt before we move onto Lightspeed.

Could we make this one workflow?

YES! Absolutely. This is just showing how it is easy to adopt AIOps incrementally and add natural breakpoints to review what is happening as you adopt AI into your IT workflows.

Here is an excerpt from the Ansible Playbook used in our AIOps workshop:

    - name: Create Job Template
      ansible.controller.job_template:
        name: "<img src="https://cdn.jsdelivr.net/gh/twitter/twemoji@14.0.2/assets/72x72/1f9e0.png" width="20" style="vertical-align:text-bottom;"> Lightspeed Remediation Playbook Generator"
        job_type: "run"
        inventory: "{{ input_inventory | default('Demo Inventory') }}"
        project: "{{ input_project | default('AI-EDA') }}"
        playbook: "{{ input_playbook | default('playbooks/lightspeed_generate.yml') }}"
        credentials:
          - "{{ input_credential | default('AAP') }}"
          - "{{ vault_credential | default('ansible-vault') }}"
        validate_certs: true
        execution_environment: "Default execution environment"

    - name: Load remediation workflow template
      ansible.controller.workflow_job_template:
        name: Remediation Workflow
        organization: Default
        state: present
        survey_enabled: true
        survey_spec: "{{ lookup('template', playbook_dir + '/templates/survey.j2') }}"

We are able to dynamically update the survey inside the workflow by using the survey_spec. The survey spec looks like the following:

{
  "name": "Prompt Lightspeed",
  "description": "Survey for working with Ansible Lightspeed",
  "spec": [
    {
      "type": "textarea",
      "question_name": "Enter a prompt to create a remediation playbook?",
      "question_description": "This prompt will be used with Ansible Lightspeed to automatically generate an Ansible Playbook",
      "variable": "lightspeed_prompt",
      "required": true,
      "default": "For all hosts, use become true,  Remove line that contains InvalidDirectiveHere from /etc/httpd/conf/httpd.conf and restart httpd"
    },
    {
      "type": "textarea",
      "question_name": "This box shows the prompt generated from our Workflow job template",
      "question_description": "This area shows you that AI was able to understand the error and create a prompt for Ansible Lightspeed.",
      "variable": "ai_lightspeed_prompt",
      "required": false,
      "default": "{{ gpt_generated_prompt }}"
    }
  ]
}

Now everytime the first workflow Log Enrichment and Prompt Generation Workflow runs, the second workflow Remediation Workflow automatically has its survey updates to be relevant to that workflow. This is doing Config as Code using the ansible.controller content collection.

Please consider using the AAP Configuration Collection on Automation hub. This content collection simplifies interactions for Ansible Automation Platform.

3. Remediation Workflow

The third part of the AIOps workflow is the Remediation Workflow. This workflow will take a prompt from the previous workflow, allow the human operator to customize this prompt, then build an Ansible Playbook to remediate the issue, sync this to git and build a job template that will run this playbook for the final step.

Here is a breakdown of the four main components:

  1. Lightspeed Remediation Playbook Generator
  2. Commit Fix to Git
  3. Sync Project
  4. Build Remediation Template

1. Lightspeed Remediation Playbook Generator

Ansible Lightspeed also has an API. We can communicate to this API similarly as we do with Red Hat AI solutions. Here is an example task:

    - name: Send request to AI API
      ansible.builtin.uri:
        url: "{{ input_lightspeed_url | default('https://c.ai.ansible.redhat.com/api/v0/ai/generations/') }}"
        method: POST
        headers:
          Content-Type: "application/json"
          Authorization: "Bearer {{ lightspeed_wca_token }}"
        body_format: json
        body:
          text: "{{ lightspeed_prompt }}"
      register: response

The API will respond with the Ansible Playbook as part of the payload under the playbook keyword. You can see the response in the Job Output window:

2. Commit Fix to Git

Once the Ansible Playbook is retrieved from Ansible Lightspeed, we need to store it in a Git repo. We can use the Ansible SCM (source control management) content collection to easily publish the content to any specified repo.

Here is an excerpt from our AIOps Workshop:

    - name: Commit and push final version of playbook to Gitea
      ansible.scm.git_publish:
        path: "{{ repository['path'] }}"
        token: "{{ gitea_token }}"

3. Sync Project

This is a special node type inside the Workflow Visualizer that will sync a Project. Since we just pushed this Ansible Playbook to Git we need to sync the Project to update and retrieve this playbook to be used in an Ansible Job Template. Here is a screen shot from the Workflow visualizer showing the Project Sync:

4. Build Remediation Template

The final job template inside this workflow is creating a new job template with the newly created Ansible Playbook. We can use the ansible.controller content collection to dynamically build this. Here is an example:

    - name: Create Job Template
      ansible.controller.job_template:
        name: "<img src="https://cdn.jsdelivr.net/gh/twitter/twemoji@14.0.2/assets/72x72/1f527.png" width="20" style="vertical-align:text-bottom;"><img src="https://cdn.jsdelivr.net/gh/twitter/twemoji@14.0.2/assets/72x72/2705.png" width="20" style="vertical-align:text-bottom;"> Execute HTTPD Remediation"
        job_type: "run"
        inventory: "{{ input_inventory | default('lab-inventory') }}"
        project: "{{ input_project | default('Lightspeed-Playbooks') }}"
        playbook: "{{ input_playbook | default('lightspeed-response.yml') }}"
        credential: "{{ input_credential | default('lab-credential') }}"
        validate_certs: true
        execution_environment: "Default execution environment"
        become_enabled: true
        ask_limit_on_launch: true

Could we just run the playbook now?

YES, but you may want to run the playbook during a specific change window. This is another natural breakpoint where you can add more guard rails before you push an Ansible job into production. Now that you have the Job Template queued up, you can run it whenever you want.

4. Execute Remediation

The final step is running the remediation Job Template that was dynamically created in the previous workflow. This is the Ansible Playbook that Lightspeed generated – committed to Git, synced to a project, and loaded into a Job Template – now ready to execute against the affected host.

In the workflow diagram above, this is the Execute HTTPD Remediation node at the bottom – the final step before the system returns to steady state. Notice this is marked as a Manual Step. This is intentional: the human operator reviews the AI-generated playbook and decides when to execute, giving organizations a natural approval gate before changes reach production.

Why not fully automate this last step?

You absolutely can. For organizations further along their AIOps maturity, this final step can be wired directly into the Remediation Workflow so the fix executes automatically. The manual breakpoint exists for teams that want to adopt AIOps incrementally – gaining confidence in the AI-generated playbooks before removing the human gate.

Policy Enforcement

Before executing AI-generated playbooks in production, organizations should consider adding policy guardrails. Ansible Automated Policy as Code enables teams to define and enforce rules about what automation is allowed to do – for example, restricting which hosts can be targeted, which modules are permitted, or requiring approval workflows before high-impact changes.

In an AIOps context, policy enforcement is the safety net that lets you increase automation confidence over time:

Maturity Policy Approach
Crawl Human reviews every AI-generated playbook before execution
Walk Policy engine validates playbook content automatically; human approves execution
Run Policy engine validates and auto-approves within defined boundaries; exceptions escalate to human

Validation

Because this guide covers a reference architecture rather than a single tool integration, validation is done per workflow stage rather than with a single test command. Each stage has a clear success indicator that confirms the pipeline is working before moving to the next.

Stage What to Verify Success Indicator
1. EDA Response Rulebook activation is running and receiving events EDA Controller shows the rulebook activation as Running; event log shows received events
2. Enrichment Workflow AI analyzed the incident and notifications were sent Workflow Visualizer shows all nodes green; chat/ITSM received the AI-generated diagnosis
3. Remediation Workflow Lightspeed generated a playbook and it was committed to Git New playbook file exists in the Git repository; Job Template was created with the correct playbook
4. Execute Remediation The AI-generated playbook resolved the issue Job Template run completes successfully; the application or service returns to steady state

Want hands-on validation?

The companion workshop Hands-On AIOps: Building Self-Healing, Observability-Driven Automation with Ansible walks through this entire pipeline end-to-end with a live lab environment. Part 1 covers Apache service remediation, Part 2 extends the same pattern to network automation with Splunk and Cisco routers.

Troubleshooting Across Workflow Boundaries

Most failures in an AIOps pipeline happen at the integration boundaries – where one stage hands off to the next. Here are the most common issues:

Symptom Boundary Likely Cause Fix
EDA rulebook is active but workflow never launches EDA → Enrichment Workflow Event payload doesn’t match rulebook condition Compare the actual event JSON against the rulebook condition field
Enrichment Workflow runs but AI response is empty or generic Enrichment → Red Hat AI Prompt is missing context (no logs, no system info) Verify the “Capture Additional Information” job collected data and passed it to the AI prompt
Lightspeed returns a playbook but it doesn’t fix the issue Remediation → Execute Prompt to Lightspeed was too vague or AI diagnosis was incorrect Review the prompt in the Remediation Workflow survey; refine the AI-generated prompt before Lightspeed generates the playbook
Job Template was created but playbook is missing Remediation → Git/Project Sync Git commit failed or project sync didn’t run Check Gitea/GitHub for the commit; verify the Project Sync node succeeded in the Workflow Visualizer

AIOps Maturity Path

The Broader AIOps Journey

AIOps is not a single use case – it is a maturity journey. Organizations typically progress through increasing levels of complexity, starting with read-only enrichment and advancing toward autonomous, self-healing systems. Each stage builds on the previous one.

Maturity Use Cases What AI Does
Crawl Incident & Ticket Enrichment, Cost & Resource Optimization AI interprets operational signals and attaches context – no changes to systems
Walk Curated Automation Remediation, Intelligent Capacity Orchestration AI selects from pre-approved automation – proven playbooks, governed execution
Run Self-Healing Infrastructure, System-Level Drift & Policy Enforcement AI generates or adapts remediation automation on-the-fly – constrained by policy guardrails

AIOps is the outcome. Agentic is a capability.

Agentic workflows – where AI plans, uses tools, reflects on results, and iterates – can enhance any stage of this journey. But AIOps does not require agentic capabilities to deliver value. A deterministic EDA rulebook that enriches a ServiceNow ticket is AIOps at the Crawl stage. A fully autonomous agent that reasons about OSPF failures is AIOps at the Run stage. Start where you are.

Self-Healing Infrastructure: Crawl, Walk, Run

This guide covers the self-healing infrastructure use case, which has its own maturity progression. The key difference at each stage is how much autonomy AI has over the remediation:

Maturity Approach How It Works AI Role
Crawl Ticket Enrichment EDA detects event → AI diagnoses root cause → enriched context is posted to chat/ITSM → human remediates manually Read-only: AI interprets, humans act
Walk Curated Remediation EDA detects event → AI diagnoses root cause → AI selects the right playbook from a pre-approved library → human approves → playbook executes AI selects from existing automation; no new code is created
Run Self-Healing EDA detects event → AI diagnoses root cause → Ansible Lightspeed generates a new remediation playbook → policy engine validates → playbook executes AI generates new automation on-the-fly within policy boundaries

The workflow in this guide demonstrates the Run stage – AI generates a remediation playbook that didn’t exist before. But organizations can start at Crawl by using only the Enrichment Workflow (parts 1-2) and stopping before the Remediation Workflow. Each stage reuses the same underlying architecture; the difference is how far down the pipeline you automate.


Summary

With this workflow in place, your team moves from manually triaging every alert to AI-diagnosed, dynamically remediated incidents. Instead of writing hundreds of rules to match hundreds of failure modes, a single AIOps pipeline uses AI inference to diagnose issues and Ansible Lightspeed to generate remediation playbooks on-the-fly. The result is faster mean time to resolution (MTTR), less operational toil, and an automation strategy that scales with your environment rather than against it.