SIPmon Root Cause Analyzer

Introduction

What is Root Cause Analysis?

Root Cause Analysis is a systematic process of identifying the fundamental causes of problems or events in your system. Rather than addressing symptoms, RCA helps you discover the underlying issues that lead to anomalies, performance degradation, or failures.

Why Use SIPmon RCA?

Traditional troubleshooting can be time-consuming and requires manual correlation of multiple metrics and logs. SIPmon RCA automates this process by:

Analyzing patterns across multiple data sources simultaneously
Correlating temporal relationships between different metrics
Providing confidence-rated hypotheses based on evidence
Reducing mean time to resolution (MTTR)
Offering structured action plans for remediation

How SIPmon RCA Works

The RCA module uses advanced algorithms to:

Collect metrics and traces from your selected application
Detect anomalies and unusual patterns in the specified time range
Search for and identify previous similar incidents in historical data
Correlate multiple signals to identify potential root causes
Generate hypotheses with confidence levels (typically 70%+)
Classify issues by severity (high, medium, low)
Provide categorized remediation actions
Display visual representations of system relationships
Enable interactive exploration through AI-powered chat

How to Use

Accessing the Root Cause Analyzer

Navigate to Monitoring in the main menu
Click on Root Cause Analysis
Select the RCA Overview tab

Root Cause Analyzer Figure: Root Cause Analysis Initial Dashboard View

Running an RCA Analysis

Step 1: Select an Application

Click on the Application dropdown menu
Browse or search for your application
Select the application you want to analyze

Note: Ensure you have the appropriate permissions to access the selected application's metrics.

Figure: Root Cause Analysis Select an Application View

Step 2: Choose a Time Range

Click on the Time Range dropdown menu
Select from preset time ranges, or define a custom range
Consider selecting a time range that:
- Includes the period when issues were observed
- Provides sufficient context (typically 15-60 minutes)
- Covers both normal and anomalous behavior for comparison

Figure: Root Cause Analysis Select Time Range

Tip: For transient issues, narrow the time range to focus on the specific incident window.

Step 3: Run the Analysis

Click the Run button (rocket icon)
Wait for the analysis to complete (typically 10-30 seconds)
Review the generated results

Figure: Root Cause Analysis Run the Analysis

Important: SIPMon AI can make mistakes. Always verify important findings with additional investigation.

Understanding the Results

Once the analysis completes, you'll see several main sections:

1. Hypotheses Tab

This tab displays potential root causes ranked by confidence level.

What You'll See:

Hypothesis ID (e.g., HYP-1, HYP-2)
Title: Brief description of the potential root cause
Confidence Level: Indicates how certain the system is about this hypothesis
- Accepted (green): High confidence (typically 70%+)
- Inconclusive (gray): Moderate confidence, requires further investigation
Description: Detailed explanation of the hypothesis
Supporting Evidence: Specific observations that support this hypothesis

Example: Root Cause Analysis Figure: Root Cause Hypotheses tab View

How to Use:

Start with accepted hypotheses as they have the highest likelihood
Review the evidence to validate the hypothesis
Use the raw data links to drill deeper into specific metrics

2. Issues Tab

This tab lists specific issues detected during the analysis period.

What You'll See:

Issue ID (e.g., ISSUE-1, ISSUE-2, ISSUE-3)
Title: Brief description of the issue
Severity Badge:
- High (red): Critical issues requiring immediate attention
- Medium (orange): Moderate issues that should be addressed soon
- Low (green): Minor issues or informational alerts
Description: Detailed explanation including affected metrics and values

Example: Root Cause Analysis Figure: Root Cause Issues tab View

How to Use:

Prioritize high-severity issues first
Compare issue timestamps with known events or deployments
Cross-reference with hypotheses to understand root causes

3. Evidence / Downstream Tab

This tab provides supporting evidence and downstream impacts.

What You'll See:

Metric details (names, timestamps, values)
Anomaly scores indicating deviation from normal behavior
Correlation information between different metrics
Visual indicators showing relationships

Example: Root Cause Analysis Figure: Evidence / Downstream Tab View

How to Use:

Use evidence to validate hypotheses
Identify patterns across multiple metrics
Understand the scope and impact of issues

4. Actions Sections

The RCA provides four categories of recommended actions:

Root Cause Analysis Figure: Actions Sections Tab View

Immediate Actions

Actions to take right now to stabilize the system or gather critical information.

Examples:

Check current CPU usage on affected instances
Review recent deployments or configuration changes
Collect thread dumps if CPU is still elevated

Short-term Actions

Actions to implement within hours or days to address the immediate problem.

Examples:

Adjust Prometheus scrape intervals
Increase resource allocations
Implement rate limiting

Long-term Actions

Strategic changes to prevent recurrence and improve system resilience.

Examples:

Optimize application performance
Implement better monitoring and alerting
Review and update capacity planning

Verification Steps

Steps to confirm that the issue is resolved and monitor for recurrence.

Examples:

Monitor CPU metrics for 24-48 hours
Verify scrape frequency has decreased
Check correlation between CPU and scrape events has weakened

How to Use Actions:

Expand each action category by clicking on it
Review all recommended actions
Prioritize based on your operational needs
Document actions taken for future reference
Use verification steps to confirm resolution

Detecting Previous Incidents

The RCA automatically detects and displays previous incidents related to the current analysis, helping you identify patterns and recurring issues.

What You'll See:

A list of historical incidents with similar characteristics
Timestamps of when previous incidents occurred
Severity and issue types from past incidents
Links to view details of previous RCA results

How to Use:

After running an RCA, look for the "Previous Incidents" section
Review the list of detected historical incidents
Click on any previous incident to view its full RCA results
Compare patterns between current and previous incidents
Identify if this is a recurring issue or a new problem

Benefits:

Pattern Recognition: Identify if similar issues have occurred before
Trend Analysis: Understand if problems are increasing in frequency
Solution Reference: Review what actions were taken in previous incidents
Root Cause Validation: Confirm if the same root cause appears repeatedly
Prevention Planning: Use historical data to implement long-term fixes

Example Use Case: If the current RCA shows a CPU spike, previous incidents might reveal:

Similar CPU spikes occurred every Tuesday at 2 PM (suggesting scheduled job)
The issue started after a specific deployment date
Previous mitigations that worked or didn't work

Using the Interactive Chat

The RCA includes an AI-powered chat interface where you can ask questions about the analysis results and get instant clarifications.

Accessing the Chat:

Look for the chat box interface (typically at the bottom or side of the RCA results)
Click on the chat input field to start typing
Enter your question and press Enter or click Send

What You Can Ask:

Clarification Questions: "What does hypothesis HYP-1 mean in simple terms?"
Evidence Details: "Show me more details about the CPU spike at 11:20 AM"
Action Guidance: "Which action should I prioritize first?"
Comparison Questions: "How is this different from the incident last week?"
Technical Queries: "What is the anomaly score threshold?"
Next Steps: "What should I check next?"

Example Chat Interactions: Root Cause Analysis Figure: Chat View

Best Practices for Using Chat:

Be Specific: Ask about particular hypotheses, issues, or metrics
Reference IDs: Use hypothesis IDs (HYP-1) or issue IDs (ISSUE-1) for clarity
Ask Follow-ups: Build on previous questions to dig deeper
Request Examples: Ask for concrete examples or scenarios
Verify Understanding: Rephrase the AI's explanation to confirm comprehension

Chat Limitations:

The chat is context-aware of the current RCA results only
It cannot execute actions or make changes to your system
Always verify critical information with actual metric data
The AI may not have information beyond what's in the RCA results

How to Use Actions:

Expand each action category by clicking on it
Review all recommended actions
Prioritize based on your operational needs
Document actions taken for future reference
Use verification steps to confirm resolution

Viewing the RCA Context Diagram

The visual diagram shows relationships between different components and hypotheses.

Root Cause Analysis Figure: RCA Context Diagram

Features:

Color-coded nodes representing different hypothesis categories
Connecting lines showing relationships and dependencies
Minimap for navigation in complex diagrams
Zoom controls (+, -, fullscreen, download)

How to Use:

Click + to zoom in, - to zoom out
Use the minimap (bottom right) to navigate large diagrams
Click nodes to highlight related components
Download the diagram for documentation or sharing

Best Practices

Before Running RCA:

Ensure you have a clear understanding of when the issue occurred
Select an appropriate time range that captures the anomaly
Have relevant context about recent changes or deployments

When Analyzing Results:

Don't rely solely on the highest confidence hypothesis
Review multiple hypotheses to get a complete picture
Cross-reference issues with your own observations
Validate findings with actual metric data
Check previous incidents for patterns and recurring issues
Use the chat feature to clarify unclear points

After RCA:

Document your findings and actions taken
Share insights with your team
Implement recommended actions systematically
Use verification steps to confirm resolution
Compare with previous incidents to identify trends

For Recurring Issues:

Compare RCA results across multiple incidents
Look for common patterns in hypotheses
Prioritize long-term actions to prevent recurrence
Review previous incidents to see what solutions worked
Use the chat to ask about differences between incidents

Using the Chat Effectively:

Ask specific questions about hypotheses or issues
Request clarification on technical terms or metrics
Inquire about recommended next steps
Compare current and previous incidents
Validate your understanding before taking action

Reporting

Advanced Reporting

Auto Discovery

Topology Dashboard

Anomaly Detection

Plugin Pack

IP Phone

Netflow Analyzer

Root Cause Analyzer

Mobile App

SIPmon Root Cause Analyzer

Introduction

What is Root Cause Analysis?

Why Use SIPmon RCA?

How SIPmon RCA Works

How to Use

Accessing the Root Cause Analyzer

Running an RCA Analysis

Step 1: Select an Application

Step 2: Choose a Time Range

Step 3: Run the Analysis

Understanding the Results

1. Hypotheses Tab

2. Issues Tab

3. Evidence / Downstream Tab

4. Actions Sections

Immediate Actions

Short-term Actions

Long-term Actions

Verification Steps

Detecting Previous Incidents

Using the Interactive Chat

Viewing the RCA Context Diagram

Best Practices

SIPmon Root Cause Analyzer ​

Introduction ​

What is Root Cause Analysis? ​

Why Use SIPmon RCA? ​

How SIPmon RCA Works ​

How to Use ​

Accessing the Root Cause Analyzer ​

Running an RCA Analysis ​

Step 1: Select an Application ​

Step 2: Choose a Time Range ​

Step 3: Run the Analysis ​

Understanding the Results ​

1. Hypotheses Tab ​

2. Issues Tab ​

3. Evidence / Downstream Tab ​

4. Actions Sections ​

Immediate Actions ​

Short-term Actions ​

Long-term Actions ​

Verification Steps ​

Detecting Previous Incidents ​

Using the Interactive Chat ​

Viewing the RCA Context Diagram ​

Best Practices ​

SIPmon Root Cause Analyzer

Introduction

What is Root Cause Analysis?

Why Use SIPmon RCA?

How SIPmon RCA Works

How to Use

Accessing the Root Cause Analyzer

Running an RCA Analysis

Step 1: Select an Application

Step 2: Choose a Time Range

Step 3: Run the Analysis

Understanding the Results

1. Hypotheses Tab

2. Issues Tab

3. Evidence / Downstream Tab

4. Actions Sections

Immediate Actions

Short-term Actions

Long-term Actions

Verification Steps

Detecting Previous Incidents

Using the Interactive Chat

Viewing the RCA Context Diagram

Best Practices