Claude Agent Skill · by Acedergren

Monitoring Operations

Install Monitoring Operations skill for Claude Code from acedergren/oci-agent-skills.

Install
Terminal · npx
$ npx skills add https://github.com/acedergren/oci-agent-skills --skill monitoring-operations
Source file
SKILL.md (183 lines)
---
name: monitoring-operations
description: Use when setting up metrics, alarms, or troubleshooting missing data in OCI Monitoring. Covers metric namespace confusion, alarm threshold gotchas, log collection setup, and common monitoring gaps.
license: MIT
metadata:
  author: alexander-cedergren
  version: "2.0.0"
---

# OCI Monitoring and Observability - Expert Knowledge

## 🏗️ Use OCI Landing Zone Terraform Modules

**Don't reinvent the wheel.** Use [oracle-terraform-modules/landing-zone](https://github.com/oracle-terraform-modules/terraform-oci-landing-zones) for observability stack.

**Landing Zone solves:**
- ❌ Bad Practice #10: No logging, monitoring, notifications (Landing Zone deploys complete observability)
- ❌ Bad Practice #7: Limited security services (Landing Zone integrates Cloud Guard, VSS, OSMS)

**This skill provides**: Metrics, alarms, and troubleshooting for monitoring deployed WITHIN a Landing Zone.

---

## ⚠️ OCI CLI/API Knowledge Gap

**You don't know OCI CLI commands or OCI API structure.**

Your training data has limited and outdated knowledge of:
- OCI CLI syntax and parameters (updates monthly)
- OCI API endpoints and request/response formats
- Monitoring service CLI operations (`oci monitoring alarm`, `oci monitoring metric`)
- Metric namespaces and MQL (Monitoring Query Language)
- Latest Logging and Service Connector features

**When OCI operations are needed:**
1. Use exact CLI commands from this skill's references
2. Do NOT guess metric namespace names
3. Do NOT assume AWS CloudWatch patterns work in OCI
4. Load reference files for detailed MQL documentation

**What you DO know:**
- General observability concepts
- Alerting and threshold design principles
- Log aggregation patterns

This skill bridges the gap by providing current OCI-specific monitoring patterns and gotchas.

---

## NEVER Do This

❌ **NEVER assume metrics are instant (10-15 minute lag)**
- Metrics published every 1-5 minutes
- Processing delay: 5-10 minutes
- **Total lag**: 10-15 minutes from event to visible metric
- Don't debug "missing metrics" within first 15 minutes of resource creation

❌ **NEVER use `=` for alarm thresholds with sparse metrics**
```
# WRONG - alarm never fires if metric has gaps
MetricName[1m].mean() = 0

# RIGHT - handle missing data
MetricName[1m]{dataMissing=zero}.mean() > 0
```

❌ **NEVER forget metric dimensions (causes "no data")**
```
# WRONG - missing required dimension
CPUUtilization[1m].mean()

# RIGHT - include resourceId dimension
CPUUtilization[1m]{resourceId="<instance-ocid>"}.mean()
```

❌ **NEVER set alarm thresholds without trigger delay (alert fatigue)**
```
# BAD - fires on every CPU spike
CPUUtilization[1m].mean() > 80

# BETTER - sustained high CPU
CPUUtilization[5m].mean() > 80
Trigger delay: 5 minutes (fires after 5 consecutive breaches)
```

❌ **NEVER create alarms without notification channels**
```
# WRONG - alarm fires but nobody knows
oci monitoring alarm create ... --destinations '[]'

# RIGHT - always link to notification topic
oci monitoring alarm create ... --destinations '["<notification-topic-ocid>"]'
```
Cost impact: Undetected outages cost $5,000-50,000/hour in production

❌ **NEVER ignore Cloud Guard findings (security audit failure)**
- Cloud Guard detects misconfigurations BEFORE they become incidents
- Integrate Cloud Guard → Notifications → Email/Slack/PagerDuty
- Cost impact: $100,000+ per security breach vs $0 for proactive remediation

## Metric Namespace Gotchas

**OCI Metrics Use Service-Specific Namespaces:**

| Service | Namespace | Example Metric |
|---------|-----------|----------------|
| Compute | `oci_computeagent` | `CPUUtilization`, `MemoryUtilization` |
| Autonomous DB | `oci_autonomous_database` | `CpuUtilization`, `StorageUtilization` |
| Load Balancer | `oci_lbaas` | `HttpRequests`, `UnHealthyBackendServers` |
| Object Storage | `oci_objectstorage` | `ObjectCount`, `BytesUploaded` |

**Common Mistake**: Using wrong namespace (`oci_compute` vs `oci_computeagent`)

## Alarm Missing Data Handling

| Setting | Behavior | Use When |
|---------|----------|----------|
| `treatMissingDataAsBreaching` | Alarm fires if no data | Critical services (outage = breach) |
| `treatMissingDataAsNotBreaching` | Alarm silent if no data | Optional monitoring |
| `{dataMissing=zero}` | Treat missing as 0 | Counters (requests/sec) |

## Log Collection Common Gaps

**Problem**: Logs not showing in Log Analytics

```
Logs not appearing?
├─ Is log enabled on resource?
│  └─ Compute: oci-compute-agent must be running
│  └─ Function: Logging enabled in function config
├─ Is Service Connector configured?
│  └─ Source: Log Group → Target: Log Analytics
│  └─ Check: Service Connector status = ACTIVE
├─ IAM policy for Service Connector?
│  └─ "Allow any-user to use log-content in tenancy"
│  └─ "Allow service loganalytics to READ logcontent in tenancy"
└─ 10-15 minute ingestion lag?
   └─ Wait before debugging
```

## Metric Query Optimization

**Expensive** (slow):
```
# Queries ALL instances
CPUUtilization[1m].mean()
```

**Optimized** (filter by dimension):
```
# Query specific instance
CPUUtilization[1m]{resourceId='<instance-ocid>'}.mean()
```

**Cost**: Queries free, but rate limited (1000 req/min)

## Progressive Loading References

### OCI Monitoring Reference (Official Oracle Documentation)

**WHEN TO LOAD** [`oci-monitoring-reference.md`](references/oci-monitoring-reference.md):
- Need comprehensive list of all OCI service metrics
- Understanding MQL (Monitoring Query Language) in depth
- Implementing complex alarm conditions and composites
- Need official Oracle guidance on Logging and Service Connector
- Setting up Log Analytics and APM integration

**Do NOT load** for:
- Quick alarm setup (examples in this skill)
- Common metric patterns (tables above)
- Troubleshooting decision trees (covered above)

---

## When to Use This Skill

- Alarms: threshold configuration, missing data handling, trigger delay
- Troubleshooting: metrics not showing, alarms not firing, namespace errors
- Log collection: Service Connector, IAM policies, missing logs
- Performance: query optimization, dimension filtering