Claude Agent Skill · by Github

Azure Resource Health Diagnose

The azure-resource-health-diagnose skill analyzes Azure resources to assess their health status, diagnoses issues by examining logs and telemetry data, and gene

Install
Terminal · npx
$npx skills add https://github.com/github/awesome-copilot --skill azure-resource-health-diagnose
Works with Paperclip

How Azure Resource Health Diagnose fits into a Paperclip company.

Azure Resource Health Diagnose drops into any Paperclip agent that handles this kind of work. Assign it to a specialist inside a pre-configured PaperclipOrg company and the skill becomes available on every heartbeat — no prompt engineering, no tool wiring.

S
SaaS FactoryPaired

Pre-configured AI company — 18 agents, 18 skills, one-time purchase.

$27$59
Explore pack
Source file
SKILL.md290 lines
Expand
---name: azure-resource-health-diagnosedescription: 'Analyze Azure resource health, diagnose issues from logs and telemetry, and create a remediation plan for identified problems.'--- # Azure Resource Health & Issue Diagnosis This workflow analyzes a specific Azure resource to assess its health status, diagnose potential issues using logs and telemetry data, and develop a comprehensive remediation plan for any problems discovered. ## Prerequisites- Azure MCP server configured and authenticated- Target Azure resource identified (name and optionally resource group/subscription)- Resource must be deployed and running to generate logs/telemetry- Prefer Azure MCP tools (`azmcp-*`) over direct Azure CLI when available ## Workflow Steps ### Step 1: Get Azure Best Practices**Action**: Retrieve diagnostic and troubleshooting best practices**Tools**: Azure MCP best practices tool**Process**:1. **Load Best Practices**:   - Execute Azure best practices tool to get diagnostic guidelines   - Focus on health monitoring, log analysis, and issue resolution patterns   - Use these practices to inform diagnostic approach and remediation recommendations ### Step 2: Resource Discovery & Identification**Action**: Locate and identify the target Azure resource**Tools**: Azure MCP tools + Azure CLI fallback**Process**:1. **Resource Lookup**:   - If only resource name provided: Search across subscriptions using `azmcp-subscription-list`   - Use `az resource list --name <resource-name>` to find matching resources   - If multiple matches found, prompt user to specify subscription/resource group   - Gather detailed resource information:     - Resource type and current status     - Location, tags, and configuration     - Associated services and dependencies 2. **Resource Type Detection**:   - Identify resource type to determine appropriate diagnostic approach:     - **Web Apps/Function Apps**: Application logs, performance metrics, dependency tracking     - **Virtual Machines**: System logs, performance counters, boot diagnostics     - **Cosmos DB**: Request metrics, throttling, partition statistics     - **Storage Accounts**: Access logs, performance metrics, availability     - **SQL Database**: Query performance, connection logs, resource utilization     - **Application Insights**: Application telemetry, exceptions, dependencies     - **Key Vault**: Access logs, certificate status, secret usage     - **Service Bus**: Message metrics, dead letter queues, throughput ### Step 3: Health Status Assessment**Action**: Evaluate current resource health and availability**Tools**: Azure MCP monitoring tools + Azure CLI**Process**:1. **Basic Health Check**:   - Check resource provisioning state and operational status   - Verify service availability and responsiveness   - Review recent deployment or configuration changes   - Assess current resource utilization (CPU, memory, storage, etc.) 2. **Service-Specific Health Indicators**:   - **Web Apps**: HTTP response codes, response times, uptime   - **Databases**: Connection success rate, query performance, deadlocks   - **Storage**: Availability percentage, request success rate, latency   - **VMs**: Boot diagnostics, guest OS metrics, network connectivity   - **Functions**: Execution success rate, duration, error frequency ### Step 4: Log & Telemetry Analysis**Action**: Analyze logs and telemetry to identify issues and patterns**Tools**: Azure MCP monitoring tools for Log Analytics queries**Process**:1. **Find Monitoring Sources**:   - Use `azmcp-monitor-workspace-list` to identify Log Analytics workspaces   - Locate Application Insights instances associated with the resource   - Identify relevant log tables using `azmcp-monitor-table-list` 2. **Execute Diagnostic Queries**:   Use `azmcp-monitor-log-query` with targeted KQL queries based on resource type:    **General Error Analysis**:   ```kql   // Recent errors and exceptions   union isfuzzy=true        AzureDiagnostics,       AppServiceHTTPLogs,       AppServiceAppLogs,       AzureActivity   | where TimeGenerated > ago(24h)   | where Level == "Error" or ResultType != "Success"   | summarize ErrorCount=count() by Resource, ResultType, bin(TimeGenerated, 1h)   | order by TimeGenerated desc   ```    **Performance Analysis**:   ```kql   // Performance degradation patterns   Perf   | where TimeGenerated > ago(7d)   | where ObjectName == "Processor" and CounterName == "% Processor Time"   | summarize avg(CounterValue) by Computer, bin(TimeGenerated, 1h)   | where avg_CounterValue > 80   ```    **Application-Specific Queries**:   ```kql   // Application Insights - Failed requests   requests   | where timestamp > ago(24h)   | where success == false   | summarize FailureCount=count() by resultCode, bin(timestamp, 1h)   | order by timestamp desc      // Database - Connection failures   AzureDiagnostics   | where ResourceProvider == "MICROSOFT.SQL"   | where Category == "SQLSecurityAuditEvents"   | where action_name_s == "CONNECTION_FAILED"   | summarize ConnectionFailures=count() by bin(TimeGenerated, 1h)   ``` 3. **Pattern Recognition**:   - Identify recurring error patterns or anomalies   - Correlate errors with deployment times or configuration changes   - Analyze performance trends and degradation patterns   - Look for dependency failures or external service issues ### Step 5: Issue Classification & Root Cause Analysis**Action**: Categorize identified issues and determine root causes**Process**:1. **Issue Classification**:   - **Critical**: Service unavailable, data loss, security breaches   - **High**: Performance degradation, intermittent failures, high error rates   - **Medium**: Warnings, suboptimal configuration, minor performance issues   - **Low**: Informational alerts, optimization opportunities 2. **Root Cause Analysis**:   - **Configuration Issues**: Incorrect settings, missing dependencies   - **Resource Constraints**: CPU/memory/disk limitations, throttling   - **Network Issues**: Connectivity problems, DNS resolution, firewall rules   - **Application Issues**: Code bugs, memory leaks, inefficient queries   - **External Dependencies**: Third-party service failures, API limits   - **Security Issues**: Authentication failures, certificate expiration 3. **Impact Assessment**:   - Determine business impact and affected users/systems   - Evaluate data integrity and security implications   - Assess recovery time objectives and priorities ### Step 6: Generate Remediation Plan**Action**: Create a comprehensive plan to address identified issues**Process**:1. **Immediate Actions** (Critical issues):   - Emergency fixes to restore service availability   - Temporary workarounds to mitigate impact   - Escalation procedures for complex issues 2. **Short-term Fixes** (High/Medium issues):   - Configuration adjustments and resource scaling   - Application updates and patches   - Monitoring and alerting improvements 3. **Long-term Improvements** (All issues):   - Architectural changes for better resilience   - Preventive measures and monitoring enhancements   - Documentation and process improvements 4. **Implementation Steps**:   - Prioritized action items with specific Azure CLI commands   - Testing and validation procedures   - Rollback plans for each change   - Monitoring to verify issue resolution ### Step 7: User Confirmation & Report Generation**Action**: Present findings and get approval for remediation actions**Process**:1. **Display Health Assessment Summary**:   ```   🏥 Azure Resource Health Assessment      📊 Resource Overview:   • Resource: [Name] ([Type])   • Status: [Healthy/Warning/Critical]   • Location: [Region]   • Last Analyzed: [Timestamp]      🚨 Issues Identified:   • Critical: X issues requiring immediate attention   • High: Y issues affecting performance/reliability     • Medium: Z issues for optimization   • Low: N informational items      🔍 Top Issues:   1. [Issue Type]: [Description] - Impact: [High/Medium/Low]   2. [Issue Type]: [Description] - Impact: [High/Medium/Low]   3. [Issue Type]: [Description] - Impact: [High/Medium/Low]      🛠️ Remediation Plan:   • Immediate Actions: X items   • Short-term Fixes: Y items     • Long-term Improvements: Z items   • Estimated Resolution Time: [Timeline]      ❓ Proceed with detailed remediation plan? (y/n)   ``` 2. **Generate Detailed Report**:   ```markdown   # Azure Resource Health Report: [Resource Name]      **Generated**: [Timestamp]     **Resource**: [Full Resource ID]     **Overall Health**: [Status with color indicator]      ## 🔍 Executive Summary   [Brief overview of health status and key findings]      ## 📊 Health Metrics   - **Availability**: X% over last 24h   - **Performance**: [Average response time/throughput]   - **Error Rate**: X% over last 24h   - **Resource Utilization**: [CPU/Memory/Storage percentages]      ## 🚨 Issues Identified      ### Critical Issues   - **[Issue 1]**: [Description]     - **Root Cause**: [Analysis]     - **Impact**: [Business impact]     - **Immediate Action**: [Required steps]      ### High Priority Issues     - **[Issue 2]**: [Description]     - **Root Cause**: [Analysis]     - **Impact**: [Performance/reliability impact]     - **Recommended Fix**: [Solution steps]      ## 🛠️ Remediation Plan      ### Phase 1: Immediate Actions (0-2 hours)   ```bash   # Critical fixes to restore service   [Azure CLI commands with explanations]   ```      ### Phase 2: Short-term Fixes (2-24 hours)   ```bash   # Performance and reliability improvements   [Azure CLI commands with explanations]   ```      ### Phase 3: Long-term Improvements (1-4 weeks)   ```bash   # Architectural and preventive measures   [Azure CLI commands and configuration changes]   ```      ## 📈 Monitoring Recommendations   - **Alerts to Configure**: [List of recommended alerts]   - **Dashboards to Create**: [Monitoring dashboard suggestions]   - **Regular Health Checks**: [Recommended frequency and scope]      ## ✅ Validation Steps   - [ ] Verify issue resolution through logs   - [ ] Confirm performance improvements   - [ ] Test application functionality   - [ ] Update monitoring and alerting   - [ ] Document lessons learned      ## 📝 Prevention Measures   - [Recommendations to prevent similar issues]   - [Process improvements]   - [Monitoring enhancements]   ``` ## Error Handling- **Resource Not Found**: Provide guidance on resource name/location specification- **Authentication Issues**: Guide user through Azure authentication setup- **Insufficient Permissions**: List required RBAC roles for resource access- **No Logs Available**: Suggest enabling diagnostic settings and waiting for data- **Query Timeouts**: Break down analysis into smaller time windows- **Service-Specific Issues**: Provide generic health assessment with limitations noted ## Success Criteria- ✅ Resource health status accurately assessed- ✅ All significant issues identified and categorized- ✅ Root cause analysis completed for major problems- ✅ Actionable remediation plan with specific steps provided- ✅ Monitoring and prevention recommendations included- ✅ Clear prioritization of issues by business impact- ✅ Implementation steps include validation and rollback procedures