Dinindu Suriyamudali
·Modernising Apps | Unlocking Potential with AI/ML | Building the Future, One Cloud at a Time ☁️

Building an AI-Powered Cloud Engineer Agent

How I built a comprehensive cloud engineering solution powered by Amazon Bedrock, MCP servers, and AWS Strands for automated operations, cost optimisation, root cause analysis, and intelligent infrastructure management

Introduction

Managing cloud infrastructure at scale requires constant monitoring, rapid response to issues, deep expertise across multiple AWS services, proactive cost optimisation, and comprehensive architectural guidance. What if you could have an AI-powered cloud engineer that never sleeps, automatically responds to errors, creates Jira tickets, generates pull requests, performs Well-Architected reviews, conducts root cause analysis, optimises costs, and provides expert guidance 24/7?

That's exactly what I built - a comprehensive Cloud Engineer Agent that combines the power of Amazon Bedrock's Claude model, Model Context Protocol (MCP) servers, and AWS Strands to create an intelligent, automated cloud operations platform accessible through Slack. This system goes far beyond a simple error response. It's a complete cloud engineering companion that handles everything from routine operations to complex architectural assessments.

Architecture Overview

Our solution represents a sophisticated multi-component architecture that seamlessly integrates various AWS services, external APIs, and AI capabilities:

Key Components Deep Dive

1. Multi-Input Architecture

Our system is designed to handle two primary input sources:

Slack Interface: Users can interact naturally with the Cloud Engineer Agent through Slack channels, asking questions about AWS services, requesting infrastructure changes, or seeking troubleshooting assistance.

CloudWatch Log Events: The system automatically monitors CloudWatch logs for errors and anomalies, triggering automated response workflows without human intervention.

2. AWS Strands Integration

At the heart of our Lambda function lies AWS Strands, which provides a powerful toolkit of integrated capabilities:

  • aws_cloudwatch: CloudWatch monitoring, logging, and alerting capabilities

  • aws_cost_explorer: Cost analysis, budget tracking, and optimisation recommendation

  • atlassian: Jira integration for issue tracking and project management

  • aws_eks: Elastic Kubernetes Service management and orchestration

  • aws_ecs: Elastic Container Service operations and container management

  • use_aws: Direct AWS service interactions and resource management

  • memory: Context retention and conversation history across sessions

3. MCP Server Architecture

AWS Documentation MCP Server: Maintains up-to-date access to AWS documentation, architectural patterns, and technical guides.

Atlassian MCP Server: Provides comprehensive Jira integration for automated ticket creation, project management, and workflow orchestration.

GitHub MCP Server: Enables seamless repository management, automated pull request creation, and version control integration.

AWS EKS MCP Server: Offers Kubernetes cluster management, pod orchestration, and container deployment capabilities.

Cost Explorer MCP Server: Delivers real-time cost analysis, billing insights, and resource optimisation recommendations.

CloudWatch MCP Server: Provides comprehensive monitoring, logging, and alerting services with custom dashboard creation and metric analysis.

4. Amazon Bedrock Integration

Amazon Bedrock's comprehensive suite powers AI capabilities:

Claude Model: Advanced language understanding and generation

Guardrails: Content filtering and safety validation

Knowledge Base: RAG implementation for internal knowledge repository

Enhanced Data Flow

The system follows a sophisticated data flow pattern:

  1. Input Processing: Slack messages or CloudWatch log events trigger API Gateway

  2. Lambda Orchestration: AWS Strands-powered Lambda processes requests using integrated tools

  3. Service Integration: MCP Proxy provides load-balanced access to Fargate-hosted MCP servers

  4. AI Processing: Amazon Bedrock processes requests with the Claude model and safety guardrails

  5. Response Aggregation: Lambda combines responses from all integrated services

  6. Output Delivery: Processed responses return to Slack, with automated Jira ticket creation and GitHub PR generation

CloudAgent Capabilities Showcase

Fast-Tracking Errors to Pull Requests

Time Reduction: 15-30 minutes → 2 minutes (87% faster)

Manual Process:

  1. DETECT - Check alerts, user reports, monitoring (varies)

  2. INVESTIGATE - Review logs, search errors, check metrics (varies)

  3. LOCATE - Code review, recent changes, dependencies (varies)

  4. ANALYSE - Root cause, trace flow, find pattern (varies)

  5. IMPLEMENT - Write fix, code review, raise PR (varies)

  6. DOCUMENT - Update JIRA, notify team, postmortem (varies)

CloudAgent Automation: Automated 9-step workflow from error detection to pull request creation:

  1. Error simulation triggers the Lambda agent

  2. MCP initialisation and root cause analysis

  3. Automated issue creation and PR generation

  4. Complete Slack integration for team notifications

Well-Architected Reviews at Speed

Time Reduction: 1-2 hours → 7 minutes (94% faster)

Manual Process:

  1. Preparation - Identify instance and gather docs (15-30 mins)

  2. Assessment - Review each pillar comprehensively (30-60 mins)

  3. Documentation - Document findings, improvements & risks in Jira (15-30 mins)

CloudAgent Automation:

  • Comprehensive AWS Well-Architected Review completed automatically

  • Detailed analysis across all six pillars (Operational Excellence, Security, Reliability, Performance Efficiency, Cost Optimisation, Sustainability)

  • Automated findings documentation with specific recommendations

  • Priority-based action items with immediate implementation guidance

ClickOps Tracked NL Ops

Every action is traceable. Every command is accountable.

Demonstrated Capabilities:

  • Natural language AWS operations with full audit trails

  • Automated security group rule management with Jira ticket creation

  • Real-time execution tracking with detailed command logs

  • Complete integration with existing ITSM processes (Jira, Slack)

Example Operations:

  • "Open port 3389 on 'win_test' (Sydney) for 10.122.0.0/16. Create a Jira to track removal—temporary rule"

  • Automated security group modifications with rule IDs and descriptions

  • Immediate Jira ticket creation for tracking and compliance

Talk to the Cloud

Time Reduction: 30-60 minutes → ~5 minutes (92% faster)

Troubleshooting through conversation. No more digging through logs or CloudTrail

Manual Process:

  1. Identify - Root cause from logs (10-20 mins)

  2. Cross-check - CloudTrail events (10-20 mins)

  3. Find related - metrics or alarms (5-10 mins)

  4. Raising - Jira (5-10 mins)

CloudAgent Automation:

  • Conversational troubleshooting interface

  • Automatic log analysis and correlation

  • Real-time CloudTrail event investigation

  • Intelligent error pattern recognition

  • Automated ticket creation with complete context

Unified DevOps Intelligence via MCP Servers

Time Reduction: 30-60 minutes → ~5 minutes (92% faster)

Explore code, tickets, and changes - all through one interface

Manual Process:

  1. Understand - the Framework (15-30 mins)

  2. Search - GitHub PRs (15-30 mins)

  3. Map - PRs to Framework (15-30 mins)

  4. Verify - Jira Tickets (15-30 mins)

CloudAgent Automation:

  • Unified interface connecting GitHub, Jira, and compliance frameworks

  • Automated Enterprise Security & Compliance Framework analysis

  • Real-time PR mapping to security requirements

  • Comprehensive vulnerability assessments with remediation guidance

  • Colour-coded priority system (Critical/High/Medium/Low)

Predicting Spend, Made Simple

Real-time AWS cost forecasting through natural language queries

Demonstrated Capabilities:

  • "What is the forecasted cost for ECS clusters in the Sydney region?"

  • 3-month cost predictions with confidence levels (80% confidence)

  • Monthly breakdown with range estimates

  • Historical usage pattern analysis

  • Automated trend identification and cost optimisation insights

Example Output:

  • Forecast Period: August 12 - November 12, 2025

  • Total 3-Month Forecast: $37.12 (range: $28.12 - $46.13)

  • Monthly breakdown with detailed range predictions

  • Trend analysis identifying cost patterns and optimisation opportunities

Rapid Compliance at Scale

Run a compliance check linked to the definition in Confluence, generate a PR + Jira ticket

Development Journey: Lessons Learned

The AI Tooling Revolution

This project showcased the power of modern AI development tools:

  • Product Development & Planning: Claude assisted with PRD creation and architectural planning

  • Large-Scale Development: Cline + Mantel API Gateway enabled rapid codebase development and refactoring

  • Documentation: Gemini generated comprehensive documentation from demo screenshots

  • Visual Assets: aws-diagram-mcp automated architecture diagram creation

  • Surgical Code Fixes: Amazon Q provided precise, targeted problem resolution

  • Development Acceleration: GitHub Copilot delivered real-time completions and commit message generation

System Prompt Engineering Challenges

Achieving surgical precision in automated responses required extensive system prompt refinement. The challenge was balancing comprehensive capabilities with focused execution - ensuring the agent could handle complex scenarios while maintaining minimal, targeted fixes for specific issues.

Multi-Agent vs. Single-Agent Architecture

Initial exploration of a multi-agent architecture revealed significant challenges:

Multi-Agent Challenges:

  • Context fragmentation across specialised agents

  • Over-specialisation leading to broader changes than necessary

  • Communication overhead and information loss during handoffs

  • Competing objectives between different agents

Single-Agent Superiority:

  • Complete context awareness without information fragmentation

  • Clear single objective focused on specific problem resolution

  • Simplified execution path, eliminating orchestration overhead

  • Consistent precision in delivering minimal, targeted changes

This architectural insight proved crucial for achieving surgical precision in automated error response workflows.

Future Roadmap

Planned enhancements include:

  • Enhanced RAG Implementation: Bedrock Knowledge Base or S3 Vector integration for improved contextual responses

  • Advanced Memory Management: Memory Strands tool for sophisticated context retention

  • Cost Intelligence: CloudWatch Dashboard integration for comprehensive cost monitoring

  • Enterprise Security: Advanced API security and authentication mechanisms

Conclusion

Building an AI-powered Cloud Engineer Agent represents a significant leap forward in cloud operations automation. By combining Amazon Bedrock's AI capabilities, MCP servers, and AWS Strands, I've created a system that not only responds to infrastructure issues but proactively manages and optimises cloud environments.

The key lessons learned - particularly around single-agent architecture superiority and the power of modern AI development tools - provide valuable insights for anyone building similar systems. The result is a comprehensive solution that transforms how teams interact with and manage their AWS infrastructure.

The future of cloud engineering lies in intelligent automation, and this architecture provides a robust foundation for organisations looking to scale their cloud operations while maintaining reliability, security, and cost effectiveness.

Resources

GitHub Repository

09/03/2025