SIPmon Root Cause Analyzer

Introduction

What is Root Cause Analysis?

Root Cause Analysis is a systematic process of identifying the fundamental causes of problems or events in your system. Rather than addressing symptoms, RCA helps you discover the underlying issues that lead to anomalies, performance degradation, or failures.

Why Use SIPmon RCA?

Traditional troubleshooting can be time-consuming and requires manual correlation of multiple metrics and logs. SIPmon RCA automates this process by:

  • Analyzing patterns across multiple data sources simultaneously
  • Correlating temporal relationships between different metrics
  • Providing confidence-rated hypotheses based on evidence
  • Reducing mean time to resolution (MTTR)
  • Offering structured action plans for remediation

How SIPmon RCA Works

For each analysis run, the RCA module performs the following steps:

  1. Collect metrics and traces from your selected application
  2. Detect anomalies and unusual patterns in the specified time range
  3. Search for and identify previous similar incidents in historical data
  4. Correlate multiple signals to identify potential root causes
  5. Generate hypotheses with confidence levels (typically 70%+)
  6. Classify issues by severity (high, medium, low)
  7. Provide categorized remediation actions
  8. Display visual representations of system relationships
  9. Enable interactive exploration through AI-powered chat

How to Use

Accessing the Root Cause Analyzer

  1. Navigate to Monitoring in the main menu
  2. Click on Root Cause Analysis
  3. Select the RCA Overview tab

Figure: Root Cause Analysis Initial Dashboard View

Running an RCA Analysis

Step 1: Select an Application

  1. Click on the Application dropdown menu
  2. Browse or search for your application
  3. Select the application you want to analyze

Note: Ensure you have the appropriate permissions to access the selected application's metrics.

Figure: Root Cause Analysis Select an Application View

Step 2: Choose a Time Range

  1. Click on the Time Range dropdown menu
  2. Select from preset time ranges, or define a custom range
  3. Consider selecting a time range that:
    • Includes the period when issues were observed
    • Provides sufficient context (typically 15-60 minutes)
    • Covers both normal and anomalous behavior for comparison

Figure: Root Cause Analysis Select Time Range

Tip: For transient issues, narrow the time range to focus on the specific incident window.
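
As a minimal sketch of the time-range guidance above, the following Python snippet pads a known incident window with context on both sides, so the selected range covers both normal and anomalous behavior. The function name `analysis_window`, the 15-minute padding, and the example timestamps are illustrative assumptions; in SIPmon the range itself is chosen through the Time Range dropdown.

```python
from datetime import datetime, timedelta

def analysis_window(incident_start, incident_end, context_minutes=15):
    """Pad the incident window with context before and after it."""
    pad = timedelta(minutes=context_minutes)
    return incident_start - pad, incident_end + pad

# Hypothetical incident observed between 11:15 and 11:30.
start, end = analysis_window(
    datetime(2024, 5, 14, 11, 15),
    datetime(2024, 5, 14, 11, 30),
)
duration_minutes = (end - start).total_seconds() / 60
print(start, end, duration_minutes)
```

With these example values the resulting window is 45 minutes long, which falls inside the 15-60 minute range suggested above.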

Step 3: Run the Analysis

  1. Click the Run button (rocket icon)
  2. Wait for the analysis to complete (typically 10-30 seconds)
  3. Review the generated results

Figure: Root Cause Analysis Run the Analysis

Important: SIPmon AI can make mistakes. Always verify important findings with additional investigation.

Understanding the Results

Once the analysis completes, you'll see several main sections:

1. Hypotheses Tab

This tab displays potential root causes ranked by confidence level.

What You'll See:

  • Hypothesis ID (e.g., HYP-1, HYP-2)
  • Title: Brief description of the potential root cause
  • Confidence Level: Indicates how certain the system is about this hypothesis
    • Accepted (green): High confidence (typically 70%+)
    • Inconclusive (gray): Moderate confidence, requires further investigation
  • Description: Detailed explanation of the hypothesis
  • Supporting Evidence: Specific observations that support this hypothesis

Example:

Figure: Root Cause Hypotheses Tab View

How to Use:

  • Start with accepted hypotheses as they have the highest likelihood
  • Review the evidence to validate the hypothesis
  • Use the raw data links to drill deeper into specific metrics
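
The triage order described above can be sketched in Python, assuming you have copied the hypothesis IDs and confidence values out of the tab by hand. The dictionary layout, the hypothesis titles, and the 0.70 cutoff (taken from the "typically 70%+" wording) are assumptions for illustration only, not a SIPmon export format or API.

```python
ACCEPT_THRESHOLD = 0.70  # assumption, based on "typically 70%+"

def status(hypothesis):
    """Map a confidence value to the label shown in the Hypotheses tab."""
    if hypothesis["confidence"] >= ACCEPT_THRESHOLD:
        return "Accepted"
    return "Inconclusive"

def triage(hypotheses):
    """Order hypotheses from highest to lowest confidence."""
    return sorted(hypotheses, key=lambda h: h["confidence"], reverse=True)

# Made-up example values.
hypotheses = [
    {"id": "HYP-1", "title": "Frequent metric scrapes drive CPU load", "confidence": 0.55},
    {"id": "HYP-2", "title": "Recent deployment raised CPU usage", "confidence": 0.82},
]

ranked = triage(hypotheses)
for h in ranked:
    print(h["id"], status(h), h["title"])
```

Sorting by confidence keeps accepted hypotheses at the top while leaving inconclusive ones visible, so they can still be reviewed rather than discarded.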

2. Issues Tab

This tab lists specific issues detected during the analysis period.

What You'll See:

  • Issue ID (e.g., ISSUE-1, ISSUE-2, ISSUE-3)
  • Title: Brief description of the issue
  • Severity Badge:
    • High (red): Critical issues requiring immediate attention
    • Medium (orange): Moderate issues that should be addressed soon
    • Low (green): Minor issues or informational alerts
  • Description: Detailed explanation including affected metrics and values

Example:

Figure: Root Cause Issues Tab View

How to Use:

  • Prioritize high-severity issues first
  • Compare issue timestamps with known events or deployments
  • Cross-reference with hypotheses to understand root causes
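
The first two points above can be illustrated with a short Python sketch that sorts issues by severity and flags those detected shortly after a deployment. The record layout, the 30-minute window, and all timestamps are hypothetical; SIPmon does not necessarily expose issues in this form.

```python
from datetime import datetime, timedelta

SEVERITY_ORDER = {"high": 0, "medium": 1, "low": 2}

def prioritize(issues):
    """Sort issues so high severity comes first, then by detection time."""
    return sorted(
        issues,
        key=lambda i: (SEVERITY_ORDER[i["severity"]], i["detected_at"]),
    )

def deployments_before(issue, deployments, window=timedelta(minutes=30)):
    """Return deployments that happened at most `window` before the issue."""
    return [
        d for d in deployments
        if timedelta(0) <= issue["detected_at"] - d <= window
    ]

# Made-up example data.
issues = [
    {"id": "ISSUE-1", "severity": "low", "detected_at": datetime(2024, 5, 14, 11, 5)},
    {"id": "ISSUE-2", "severity": "high", "detected_at": datetime(2024, 5, 14, 11, 20)},
    {"id": "ISSUE-3", "severity": "medium", "detected_at": datetime(2024, 5, 14, 11, 22)},
]
deployments = [datetime(2024, 5, 14, 11, 10)]

ordered = prioritize(issues)
for issue in ordered:
    nearby = deployments_before(issue, deployments)
    print(issue["id"], issue["severity"], "deployments nearby:", len(nearby))
```

An issue detected before a deployment (ISSUE-1 in this example) is not flagged, since the deployment cannot have caused it.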

3. Evidence / Downstream Tab

This tab provides supporting evidence and downstream impacts.

What You'll See:

  • Metric details (names, timestamps, values)
  • Anomaly scores indicating deviation from normal behavior
  • Correlation information between different metrics
  • Visual indicators showing relationships

Example:

Figure: Evidence / Downstream Tab View

How to Use:

  • Use evidence to validate hypotheses
  • Identify patterns across multiple metrics
  • Understand the scope and impact of issues
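
To give a feel for what an anomaly score expresses, here is a small Python sketch that uses a z-score (distance from the baseline mean, measured in standard deviations). This is a stand-in technique chosen for illustration; this guide does not document how SIPmon computes its scores, and the threshold of 3.0 and the sample values are assumptions.

```python
from statistics import mean, stdev

def zscore_anomalies(baseline, observed, threshold=3.0):
    """Return (index, z-score) pairs for observed points far from the baseline."""
    mu = mean(baseline)
    sigma = stdev(baseline)
    if sigma == 0:
        return []  # a flat baseline gives no scale to measure deviation against
    flagged = []
    for index, value in enumerate(observed):
        z = (value - mu) / sigma
        if abs(z) >= threshold:
            flagged.append((index, z))
    return flagged

# Made-up CPU utilisation percentages.
baseline_cpu = [40, 42, 41, 39, 43, 40, 41, 42]
observed_cpu = [41, 44, 95, 42]

flagged = zscore_anomalies(baseline_cpu, observed_cpu)
for index, z in flagged:
    print(f"sample {index} deviates by {z:.1f} standard deviations")
```

Only the third observed sample (95) is flagged here; the others stay within three standard deviations of the baseline.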

4. Actions Sections

The RCA provides four categories of recommended actions:

Figure: Actions Sections Tab View

Immediate Actions

Actions to take right now to stabilize the system or gather critical information.

Examples:

  • Check current CPU usage on affected instances
  • Review recent deployments or configuration changes
  • Collect thread dumps if CPU is still elevated

Short-term Actions

Actions to implement within hours or days to address the immediate problem.

Examples:

  • Adjust Prometheus scrape intervals
  • Increase resource allocations
  • Implement rate limiting

Long-term Actions

Strategic changes to prevent recurrence and improve system resilience.

Examples:

  • Optimize application performance
  • Implement better monitoring and alerting
  • Review and update capacity planning

Verification Steps

Steps to confirm that the issue is resolved and monitor for recurrence.

Examples:

  • Monitor CPU metrics for 24-48 hours
  • Verify scrape frequency has decreased
  • Check that the correlation between CPU and scrape events has weakened
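
One way to check the last verification example by hand is to compare a Pearson correlation coefficient before and after the fix. The sketch below assumes you have exported per-interval scrape counts and CPU percentages yourself; the sample numbers are made up, and Pearson correlation is an illustrative choice rather than the measure SIPmon is known to use.

```python
from math import sqrt

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length series."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = sqrt(sum((y - mean_y) ** 2 for y in ys))
    if spread_x == 0 or spread_y == 0:
        return 0.0  # a constant series has no defined correlation
    return cov / (spread_x * spread_y)

# Made-up samples: scrapes per interval and CPU percent.
scrapes_before = [1, 5, 2, 6, 1, 5]
cpu_before = [30, 80, 35, 85, 32, 78]
scrapes_after = [1, 2, 1, 2, 1, 2]
cpu_after = [31, 31, 35, 35, 30, 30]

r_before = pearson(scrapes_before, cpu_before)
r_after = pearson(scrapes_after, cpu_after)
print(f"before: {r_before:.2f}, after: {r_after:.2f}")
```

A coefficient close to 1 before the change and close to 0 afterwards would support the claim that CPU load no longer tracks scrape activity.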

How to Use Actions:

  1. Expand each action category by clicking on it
  2. Review all recommended actions
  3. Prioritize based on your operational needs
  4. Document actions taken for future reference
  5. Use verification steps to confirm resolution

Detecting Previous Incidents

The RCA automatically detects and displays previous incidents related to the current analysis, helping you identify patterns and recurring issues.

What You'll See:

  • A list of historical incidents with similar characteristics
  • Timestamps of when previous incidents occurred
  • Severity and issue types from past incidents
  • Links to view details of previous RCA results

How to Use:

  1. After running an RCA, look for the "Previous Incidents" section
  2. Review the list of detected historical incidents
  3. Click on any previous incident to view its full RCA results
  4. Compare patterns between current and previous incidents
  5. Identify if this is a recurring issue or a new problem

Benefits:

  • Pattern Recognition: Identify if similar issues have occurred before
  • Trend Analysis: Understand if problems are increasing in frequency
  • Solution Reference: Review what actions were taken in previous incidents
  • Root Cause Validation: Confirm if the same root cause appears repeatedly
  • Prevention Planning: Use historical data to implement long-term fixes

Example Use Case: If the current RCA shows a CPU spike, previous incidents might reveal:

  • Similar CPU spikes occurred every Tuesday at 2 PM (suggesting a scheduled job)
  • The issue started after a specific deployment date
  • Previous mitigations that worked or didn't work
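
The "every Tuesday at 2 PM" pattern in this use case can be spotted by grouping incident timestamps by weekday and hour. The following Python sketch assumes you have noted the timestamps of previous incidents yourself; the function name `recurring_slots`, the minimum count of 3, and the dates are illustrative assumptions.

```python
from collections import Counter
from datetime import datetime

DAYS = ["Monday", "Tuesday", "Wednesday", "Thursday", "Friday", "Saturday", "Sunday"]

def recurring_slots(timestamps, min_count=3):
    """Return (weekday, hour) slots that contain at least `min_count` incidents."""
    slots = Counter((DAYS[ts.weekday()], ts.hour) for ts in timestamps)
    return [(slot, count) for slot, count in slots.most_common() if count >= min_count]

# Made-up timestamps of previous incidents.
previous_incidents = [
    datetime(2024, 5, 7, 14, 5),
    datetime(2024, 5, 14, 14, 2),
    datetime(2024, 5, 17, 9, 30),
    datetime(2024, 5, 21, 14, 10),
]

patterns = recurring_slots(previous_incidents)
for (day, hour), count in patterns:
    print(f"{count} incidents on {day} around {hour}:00")
```

Three of the four example incidents fall on a Tuesday in the 14:00 hour, which is the kind of clustering that points to a scheduled job.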

Using the Interactive Chat

The RCA includes an AI-powered chat interface where you can ask questions about the analysis results and get instant clarifications.

Accessing the Chat:

  1. Look for the chat box interface (typically at the bottom or side of the RCA results)
  2. Click on the chat input field to start typing
  3. Enter your question and press Enter or click Send

What You Can Ask:

  • Clarification Questions: "What does hypothesis HYP-1 mean in simple terms?"
  • Evidence Details: "Show me more details about the CPU spike at 11:20 AM"
  • Action Guidance: "Which action should I prioritize first?"
  • Comparison Questions: "How is this different from the incident last week?"
  • Technical Queries: "What is the anomaly score threshold?"
  • Next Steps: "What should I check next?"

Example Chat Interactions:

Figure: Chat View

Best Practices for Using Chat:

  • Be Specific: Ask about particular hypotheses, issues, or metrics
  • Reference IDs: Use hypothesis IDs (HYP-1) or issue IDs (ISSUE-1) for clarity
  • Ask Follow-ups: Build on previous questions to dig deeper
  • Request Examples: Ask for concrete examples or scenarios
  • Verify Understanding: Rephrase the AI's explanation to confirm comprehension

Chat Limitations:

  • The chat is context-aware of the current RCA results only
  • It cannot execute actions or make changes to your system
  • Always verify critical information with actual metric data
  • The AI may not have information beyond what's in the RCA results

Viewing the RCA Context Diagram

The visual diagram shows relationships between different components and hypotheses.

Figure: RCA Context Diagram

Features:

  • Color-coded nodes representing different hypothesis categories
  • Connecting lines showing relationships and dependencies
  • Minimap for navigation in complex diagrams
  • Diagram controls (zoom in +, zoom out -, fullscreen, download)

How to Use:

  • Click + to zoom in, - to zoom out
  • Use the minimap (bottom right) to navigate large diagrams
  • Click nodes to highlight related components
  • Download the diagram for documentation or sharing

Best Practices

Before Running RCA:

  • Ensure you have a clear understanding of when the issue occurred
  • Select an appropriate time range that captures the anomaly
  • Have relevant context about recent changes or deployments

When Analyzing Results:

  • Don't rely solely on the highest confidence hypothesis
  • Review multiple hypotheses to get a complete picture
  • Cross-reference issues with your own observations
  • Validate findings with actual metric data
  • Check previous incidents for patterns and recurring issues
  • Use the chat feature to clarify unclear points

After RCA:

  • Document your findings and actions taken
  • Share insights with your team
  • Implement recommended actions systematically
  • Use verification steps to confirm resolution
  • Compare with previous incidents to identify trends

For Recurring Issues:

  • Compare RCA results across multiple incidents
  • Look for common patterns in hypotheses
  • Prioritize long-term actions to prevent recurrence
  • Review previous incidents to see what solutions worked
  • Use the chat to ask about differences between incidents

Using the Chat Effectively:

  • Ask specific questions about hypotheses or issues
  • Request clarification on technical terms or metrics
  • Inquire about recommended next steps
  • Compare current and previous incidents
  • Validate your understanding before taking action