    Azure Service Health delivers personalized alerts about service issues, planned maintenance, and health advisories that affect your Azure environment. This documentation explains how to interpret Service Health notifications and recommends actions for resolving them.

    How Monitoring Works


    We monitor the following Azure Service Health event types:

    • Degraded Resource
    • Health Advisory
    • Planned Maintenance
    • Service Issue
    • Unavailable Resource

    While Azure Monitor tracks metrics and Azure Service Health provides event notifications, Blue Matador correlates health events with your actual infrastructure state and performance metrics. Its filtering and alerting surface the Service Health events that require action rather than overwhelming teams with routine notifications. Combining Service Health monitoring with broader infrastructure observability lets administrators respond to issues faster and maintain availability.

     

    Understanding Event Categories


    Azure Service Health delivers several categories of events, and the category of an event determines the appropriate response.

    Event Types

    • Service Issue: Active problems currently affecting Azure services in the regions where your subscriptions have resources, requiring immediate investigation and potential mitigation.
    • Unavailable Resource: Critical events indicating complete unavailability of specific resources, demanding urgent response.
    • Degraded Resource: Events indicating partial functionality or performance degradation of resources, requiring assessment and potential intervention.
    • Planned Maintenance: Scheduled maintenance activities that may impact resource availability during specified timeframes.
    • Health Advisory: Informational notifications regarding service changes, deprecations, or recommendations that may affect your environment.
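
    As a rough illustration of how the category drives the response, the sketch below orders incoming events by urgency. The numeric ranking, the `URGENCY` table, the `sort_by_urgency` helper, and the event dictionaries are all assumptions made for this example; they are not part of Azure Service Health or Blue Matador, and the ranking should be adjusted to your own policy.

```python
from enum import Enum


class EventType(Enum):
    SERVICE_ISSUE = "Service Issue"
    UNAVAILABLE_RESOURCE = "Unavailable Resource"
    DEGRADED_RESOURCE = "Degraded Resource"
    PLANNED_MAINTENANCE = "Planned Maintenance"
    HEALTH_ADVISORY = "Health Advisory"


# Hypothetical urgency ranking (1 = most urgent); tune to your own policy.
URGENCY = {
    EventType.UNAVAILABLE_RESOURCE: 1,
    EventType.SERVICE_ISSUE: 2,
    EventType.DEGRADED_RESOURCE: 3,
    EventType.PLANNED_MAINTENANCE: 4,
    EventType.HEALTH_ADVISORY: 5,
}


def sort_by_urgency(events):
    """Return events ordered from most to least urgent.

    Each event is a dict whose "type" value is one of the EventType names.
    """
    return sorted(events, key=lambda e: URGENCY[EventType(e["type"])])


events = [
    {"type": "Health Advisory"},
    {"type": "Unavailable Resource"},
    {"type": "Service Issue"},
]
ordered = [e["type"] for e in sort_by_urgency(events)]
```

    Here `ordered` places the unavailable resource first and the health advisory last, mirroring the urgency described in the list above.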

     

    Responding to Service Issues

    Possible Solutions

    • Identify Affected Services and Regions
      • Review event details to determine which Azure services and regions are experiencing issues.
      • Cross-reference affected services with your deployed resources to understand potential impact scope.
    • Assess Application Impact
      • Determine whether the service issue is causing user-facing degradation or if redundancy mechanisms are maintaining availability.
      • Consult monitoring dashboards and application metrics to validate actual service impact versus potential impact.
    • Implement Mitigation Strategies
      • For multi-region deployments, consider redirecting traffic to unaffected regions through Traffic Manager or Front Door.
      • Evaluate whether deploying temporary resources in unaffected regions can maintain service continuity.
      • Document incident timeline and mitigation actions for post-incident analysis.
    • Monitor Issue Resolution
      • Track Azure's provided updates on issue status and expected resolution timeframes.
      • Maintain communication channels with stakeholders regarding service status and mitigation efforts.

    Effective service issue response requires distinguishing between problems that can be mitigated through infrastructure adjustments and situations where Azure service restoration is necessary. Where regional failover or resource redeployment can restore service, implement those changes promptly. When resolution depends on Azure service restoration, monitor progress and maintain appropriate escalation channels through Azure Support.
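
    The cross-referencing step can be sketched as a comparison between the event's affected services and regions and your own resource inventory. The dictionary shapes, field names, and the `affected_resources` and `failover_options` helpers below are hypothetical; a real event payload and inventory will look different and would need to be mapped into this form first.

```python
def _norm(name):
    """Normalize a service or region name for comparison."""
    return name.strip().lower()


def affected_resources(event, inventory):
    """Return inventory entries whose service AND region appear in the event."""
    services = {_norm(s) for s in event["services"]}
    regions = {_norm(r) for r in event["regions"]}
    return [
        res for res in inventory
        if _norm(res["service"]) in services and _norm(res["region"]) in regions
    ]


def failover_options(event, inventory):
    """Map each affected service to the unaffected regions where it is also deployed.

    An empty list means there is nowhere to redirect traffic, so recovery
    depends on Azure restoring the service.
    """
    impacted_regions = {_norm(r) for r in event["regions"]}
    options = {}
    for res in affected_resources(event, inventory):
        service = _norm(res["service"])
        options[service] = sorted({
            _norm(other["region"])
            for other in inventory
            if _norm(other["service"]) == service
            and _norm(other["region"]) not in impacted_regions
        })
    return options


# Hypothetical inventory and event used only for illustration.
inventory = [
    {"name": "web-eastus", "service": "App Service", "region": "eastus"},
    {"name": "web-westus", "service": "App Service", "region": "westus"},
    {"name": "db-eastus", "service": "SQL Database", "region": "eastus"},
]
event = {"services": ["App Service", "SQL Database"], "regions": ["eastus"]}
options = failover_options(event, inventory)
```

    In this example the web tier has a second region to fail over to, while the database has none, which is the situation where you monitor Azure's progress and escalate through Azure Support.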

     

    Addressing Unavailable and Degraded Resources


    Possible Solutions

    • Verify Resource Status
      • Access the Azure portal to confirm resource status and review any additional diagnostic information.
      • Check resource-specific metrics and logs to understand the nature and extent of the issue.
    • Assess Redundancy Configuration
      • Determine whether the affected resource is part of a redundant configuration with automatic failover capabilities.
      • For resources without redundancy, evaluate impact on dependent applications and services.
    • Implement Recovery Actions
      • For unavailable resources, attempt restart or redeployment procedures if Azure's guidance indicates this may restore functionality.
      • For degraded resources, consider scaling operations or traffic redistribution to healthy instances.
      • If the resource is critical and recovery is not immediate, deploy replacement resources in unaffected availability zones or regions.
    • Engage Azure Support
      • For resource-specific issues not resolved through standard recovery procedures, open a support case with Azure Support.
      • Provide detailed resource information and observed symptoms to expedite troubleshooting.

    Resource-specific health events often indicate issues that may be isolated to particular instances or configurations. Rapid assessment of redundancy status and implementation of recovery procedures can minimize service disruption while Azure investigates and resolves underlying issues.
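
    One way to make the recovery guidance above repeatable is to encode it as a small decision function. This is a minimal sketch under stated assumptions: the `recommend_action` name, its parameters, the 15-minute default for `max_wait_minutes`, and the returned strings are all invented for illustration and should be replaced with your own runbook steps.

```python
def recommend_action(status, redundant, critical, minutes_elapsed, max_wait_minutes=15):
    """Suggest a next step for an unavailable or degraded resource.

    status:           "unavailable" or "degraded"
    redundant:        True if the resource sits behind automatic failover
    critical:         True if dependent applications cannot tolerate the outage
    minutes_elapsed:  time since the event was raised
    max_wait_minutes: hypothetical limit before replacing a critical resource
    """
    if status not in ("unavailable", "degraded"):
        raise ValueError(f"unexpected status: {status!r}")
    if redundant:
        # Failover should already be carrying traffic; verify rather than intervene.
        return "verify failover and monitor"
    if critical and minutes_elapsed >= max_wait_minutes:
        return "deploy replacement in an unaffected zone or region"
    if status == "degraded":
        return "scale out or shift traffic to healthy instances"
    return "attempt restart or redeploy per Azure guidance"
```

    For example, `recommend_action("unavailable", False, True, 20)` suggests deploying a replacement, while the same resource five minutes into the event would first get a restart or redeploy attempt. If none of these steps resolves the issue, the next step remains opening a case with Azure Support.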

     

    Managing Planned Maintenance

    Possible Solutions

    • Evaluate Maintenance Impact
      • Review the scheduled maintenance timeframe and assess potential conflicts with business-critical operations.
      • Determine whether maintenance affects resources with built-in redundancy or presents risk of service interruption.
    • Plan Resource Adjustments
      • For maintenance affecting non-redundant resources, consider migrating workloads to unaffected resources before the maintenance window.
      • Utilize availability sets, availability zones, or multi-region deployments to ensure service continuity during maintenance.
    • Review Self-Service Options
      • Some planned maintenance allows self-service scheduling, enabling you to select an optimal maintenance window.
      • Access the Azure portal to review available self-service options and reschedule maintenance when applicable.
    • Prepare Validation Procedures
      • Ensure rollback procedures are documented and tested in case post-maintenance issues arise.
      • Establish monitoring and validation protocols to verify resource health following maintenance completion.

    Planned maintenance notifications give you time to take proactive measures that minimize service disruption. Decide whether to accept the scheduled window as-is or whether workload migration or rescheduling is needed to maintain service availability during the maintenance window.
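
    Checking a maintenance window against business-critical periods is a simple interval-overlap test. The sketch below assumes you keep your own table of critical windows; the window names, dates, and the `conflicting_windows` and `maintenance_plan` helpers are hypothetical and only illustrate the evaluation described above.

```python
from datetime import datetime


def overlaps(start_a, end_a, start_b, end_b):
    """True if two half-open intervals [start, end) share any time."""
    return start_a < end_b and start_b < end_a


def conflicting_windows(maintenance, critical_windows):
    """Return names of business-critical windows that overlap the maintenance window."""
    m_start, m_end = maintenance
    return [
        name for name, (start, end) in critical_windows.items()
        if overlaps(m_start, m_end, start, end)
    ]


def maintenance_plan(conflicts, redundant, self_service):
    """Pick a response: accept, reschedule, or migrate (illustrative policy only)."""
    if not conflicts or redundant:
        return "accept window and validate afterwards"
    if self_service:
        return "reschedule via self-service"
    return "migrate workloads before the window"


# Hypothetical schedule used only for illustration.
maintenance = (datetime(2024, 6, 1, 2, 0), datetime(2024, 6, 1, 6, 0))
critical = {
    "month-end batch": (datetime(2024, 6, 1, 5, 0), datetime(2024, 6, 1, 8, 0)),
    "weekly report": (datetime(2024, 6, 3, 9, 0), datetime(2024, 6, 3, 10, 0)),
}
conflicts = conflicting_windows(maintenance, critical)
plan = maintenance_plan(conflicts, redundant=False, self_service=True)
```

    With these sample values only the month-end batch overlaps the maintenance window, and because self-service scheduling is available the sketch recommends rescheduling rather than migrating workloads.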

     

    Responding to Health Advisories


    Possible Solutions

    • Security and Compliance Advisories
      • Review security recommendations and assess current configuration against advised best practices.
      • Implement recommended security controls or configuration changes to address identified vulnerabilities or gaps.
    • Service Deprecation Notices
      • Establish migration timelines for deprecated services, features, or API versions identified in advisories.
      • Test replacement services or updated APIs in non-production environments before deprecation deadlines.
      • Update documentation and runbooks to reflect service changes.
    • Performance and Configuration Recommendations
      • Evaluate performance optimization or configuration recommendations against current resource utilization.
      • Implement advised changes during scheduled maintenance windows to improve resource efficiency or reliability.
    • Billing and Pricing Updates
      • Review pricing change notifications to understand impact on operational costs.
      • Adjust resource configurations or reserved capacity planning based on pricing updates.

    Triage health advisories to separate those that demand immediate action from those that provide general guidance. While many advisories are informational, security recommendations and deprecation notices may require planning and a timely response to maintain security posture and service continuity.
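
    Advisory triage can be sketched as a rule that looks at the advisory's category and, for deprecations, how close the deadline is. The `category` and `deadline` fields, the 90-day default for `warn_days`, and the three outcome labels are assumptions for this example; advisories do not necessarily arrive with these fields, so some manual or automated tagging is presumed.

```python
from datetime import date


def triage(advisory, today, warn_days=90):
    """Classify an advisory as "act now", "plan", or "review".

    advisory:  dict with a "category" key ("security", "deprecation",
               "performance", or "billing") and an optional "deadline" date
    today:     the date to measure deadlines against
    warn_days: hypothetical threshold for treating a deprecation as urgent
    """
    category = advisory["category"]
    if category == "security":
        return "act now"
    if category == "deprecation":
        deadline = advisory.get("deadline")
        if deadline is not None and (deadline - today).days <= warn_days:
            return "act now"
        return "plan"
    # Performance and billing advisories are informational by default.
    return "review"
```

    For instance, `triage({"category": "deprecation", "deadline": date(2024, 2, 1)}, date(2024, 1, 1))` falls inside the 90-day threshold and is treated as urgent, while a deprecation a year away is scheduled into a migration plan.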


    Remember to correlate Azure Service Health events with your monitoring and observability platforms to validate actual service impact on your applications. Review event history to identify patterns in maintenance schedules or recurring issues that may indicate architectural considerations for improved resilience. Consult Azure Support for clarification on event details or to escalate issues requiring additional assistance or expedited resolution.
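
    Reviewing event history for recurring issues can start with a simple count per service and region. The sketch below assumes a history exported as a list of dictionaries with `type`, `service`, and `region` fields; that shape, the `recurring_issues` helper, and the `min_count` threshold of 3 are hypothetical.

```python
from collections import Counter


def recurring_issues(history, min_count=3):
    """Return (service, region) pairs with at least min_count service issues.

    Results are ordered from most to least frequent.
    """
    counts = Counter(
        (e["service"], e["region"])
        for e in history
        if e["type"] == "Service Issue"
    )
    return [(key, n) for key, n in counts.most_common() if n >= min_count]


# Hypothetical history used only for illustration.
history = [
    {"type": "Service Issue", "service": "Storage", "region": "eastus"},
    {"type": "Service Issue", "service": "Storage", "region": "eastus"},
    {"type": "Planned Maintenance", "service": "Storage", "region": "eastus"},
    {"type": "Service Issue", "service": "SQL Database", "region": "westus"},
    {"type": "Service Issue", "service": "Storage", "region": "eastus"},
]
hotspots = recurring_issues(history)
```

    A service and region pair that keeps appearing in `hotspots` is a candidate for the architectural changes mentioned above, such as adding a second region or zone redundancy.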