Understanding Monitor Categories
Blue Matador's Event Hubs monitors fall into four broad concern areas: resource saturation, throughput anomalies, quota and throttling, and errors. Understanding which category you're addressing determines the appropriate response strategy.
Event Types
Resource Saturation
- High CPU / Namespace CPU Usage: CPU utilization on the Event Hubs cluster or namespace is elevated, indicating processing pressure that may affect request handling latency.
- High Memory Usage / Namespace Memory Usage: Memory consumption is elevated at the cluster or namespace level, which can precede instability or degraded throughput.
- High Cluster Utilization: Overall cluster resource utilization is high, indicating the namespace is approaching its provisioned capacity ceiling.
Throughput Anomalies
- High Incoming Bytes: Ingress byte volume is elevated, which may indicate a producer traffic spike or misconfigured producer behavior.
- High Outgoing Bytes: Egress byte volume is elevated, which may indicate consumer fan-out behavior or unexpected downstream demand.
- Incoming Messages Drop: Incoming messages are being dropped, indicating ingestion failures at the namespace level.
- Low Outgoing Messages: Outgoing message volume has fallen unexpectedly, which may indicate consumer group failures or downstream processing stalls.
Quota and Throttling
- Quota Exceeded: The namespace has exceeded an applicable quota, blocking further ingestion or egress until the quota resets or is increased.
- Throttled Requests: Requests are being throttled due to throughput unit exhaustion, causing producers or consumers to receive server busy responses.
Errors
- Server Errors: The Event Hubs service is returning server-side errors, which may indicate a platform issue or resource exhaustion condition.
- User Errors: Requests are returning user-side errors, typically indicating authentication failures, malformed requests, or references to nonexistent resources.
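For teams routing these alerts programmatically, the taxonomy above can be expressed as a simple lookup table. The mapping below is a hypothetical sketch built from the monitor names listed above, not part of Blue Matador's tooling:

```python
# Hypothetical routing table mapping each Event Hubs monitor name
# (from the taxonomy above) to its concern area.
MONITOR_CATEGORIES = {
    "High CPU": "resource saturation",
    "Namespace CPU Usage": "resource saturation",
    "High Memory Usage": "resource saturation",
    "Namespace Memory Usage": "resource saturation",
    "High Cluster Utilization": "resource saturation",
    "High Incoming Bytes": "throughput anomalies",
    "High Outgoing Bytes": "throughput anomalies",
    "Incoming Messages Drop": "throughput anomalies",
    "Low Outgoing Messages": "throughput anomalies",
    "Quota Exceeded": "quota and throttling",
    "Throttled Requests": "quota and throttling",
    "Server Errors": "errors",
    "User Errors": "errors",
}

def concern_area(monitor_name):
    """Return the concern area for a monitor, or 'unknown' if unrecognized."""
    return MONITOR_CATEGORIES.get(monitor_name, "unknown")
```

A routing layer could use the returned category to pick an escalation path, for example paging on-call for "errors" while auto-scaling on "resource saturation".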
Responding to Resource Saturation
Possible Solutions
- Scale Namespace Capacity
- Increase throughput unit or processing unit allocation for the affected namespace. For namespaces with Auto-Inflate enabled, verify the maximum TU ceiling is not constraining automatic scaling.
- For Premium or Dedicated tier namespaces, evaluate whether additional processing units or cluster capacity are required to accommodate current workload.
- Review Producer and Consumer Behavior
- Identify whether specific producers or consumer groups are contributing disproportionate load.
- Evaluate batching, compression, and retry configurations at the application level to reduce per-request overhead.
- Assess whether workloads can be distributed across multiple namespaces to reduce per-namespace resource pressure.
Sustained resource saturation indicates that provisioned capacity is consistently misaligned with actual workload. Address the immediate gap through scaling while evaluating whether current load represents a permanent shift requiring architectural adjustment.
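As a rough aid to the scaling decision above, the required throughput unit count can be estimated from observed peak traffic. The sketch below assumes the documented Standard-tier per-TU limits (up to 1 MB/s or 1,000 events/s of ingress and 2 MB/s of egress); the function name and headroom factor are illustrative assumptions:

```python
import math

# Documented Standard-tier limits per throughput unit (TU):
# ingress up to 1 MB/s or 1,000 events/s, egress up to 2 MB/s.
INGRESS_MBPS_PER_TU = 1.0
INGRESS_EVENTS_PER_TU = 1000
EGRESS_MBPS_PER_TU = 2.0

def required_throughput_units(ingress_mbps, ingress_events_per_sec,
                              egress_mbps, headroom=1.2):
    """Estimate the TU count needed to cover observed peaks with headroom."""
    candidates = (
        ingress_mbps * headroom / INGRESS_MBPS_PER_TU,
        ingress_events_per_sec * headroom / INGRESS_EVENTS_PER_TU,
        egress_mbps * headroom / EGRESS_MBPS_PER_TU,
    )
    # The binding constraint is whichever dimension demands the most TUs.
    return max(1, math.ceil(max(candidates)))
```

Compare the result against the namespace's current TU allocation (or Auto-Inflate maximum) to decide whether scaling is warranted.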
Responding to Throughput Anomalies
Possible Solutions
- Investigate Incoming Messages Drop
- Incoming message drops indicate that the namespace is rejecting ingestion requests.
- Check for concurrent throttling or quota exceeded events, as drops commonly occur alongside capacity exhaustion.
- Review producer-side error telemetry to confirm whether drops are being surfaced as errors to upstream applications and whether retry logic is in place.
- Investigate Low Outgoing Messages
- A drop in outgoing message volume may indicate that consumer groups have stalled, crashed, or lost connectivity.
- Review consumer application health and offset progression metrics to identify which consumer groups are affected.
- Check whether the Event Hub itself has received reduced incoming traffic, which would naturally reduce outgoing volume. This distinguishes a consumer-side failure from an ingestion-side drop.
- Respond to High Incoming or Outgoing Bytes
- Identify whether byte volume elevation is expected given current producer or consumer activity, or whether it represents anomalous behavior such as a misconfigured producer sending oversized payloads. If volume is legitimate but exceeds comfortable capacity margins, evaluate scaling throughput units or redistributing producers across namespaces.
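The key distinction above, a consumer-side stall versus a natural drop caused by reduced ingress, can be sketched as a comparison of current rates against recent baselines. The thresholds and function name here are illustrative assumptions, not part of Blue Matador's monitors:

```python
def diagnose_low_outgoing(incoming_rate, outgoing_rate,
                          baseline_incoming, baseline_outgoing,
                          tolerance=0.5):
    """Classify a low-outgoing-messages condition.

    A rate is considered "dropped" when it falls below the given
    fraction (tolerance) of its baseline.
    """
    incoming_dropped = incoming_rate < baseline_incoming * tolerance
    outgoing_dropped = outgoing_rate < baseline_outgoing * tolerance
    if not outgoing_dropped:
        return "healthy"
    if incoming_dropped:
        # Egress fell because ingress fell; investigate producers upstream.
        return "ingestion-side drop"
    # Ingress is normal but egress collapsed; suspect consumer groups.
    return "consumer-side stall"
```

Feeding this with the namespace's incoming/outgoing message metrics (for example from Azure Monitor) gives a quick first-pass answer to which side of the pipeline to investigate.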
Responding to Quota Exceeded and Throttled Requests
Possible Solutions
- Identify the Exhausted Quota or Throughput Limit
- Review namespace metrics and Azure portal quota information to identify which specific limit has been reached: throughput units, namespace-level message size limits, or subscription-level quotas.
- Determine whether the breach is a one-time spike or a sustained trend, as this affects whether immediate scaling or longer-term capacity planning is the appropriate response.
- Scale or Request Quota Increases
- For throughput unit exhaustion, increase TU allocation or raise the Auto-Inflate maximum. For subscription-level quota limits, submit a quota increase request through the Azure portal.
- Evaluate whether producer retry storms following throttling are amplifying the problem. Implement exponential backoff in producer clients if not already in place.
- Distribute Load
- If quota limits cannot be increased quickly enough to address immediate demand, evaluate distributing producers across multiple namespaces to stay within per-namespace limits.
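The exponential backoff recommended above can be sketched as a full-jitter delay schedule. The parameters below are illustrative defaults, not Event Hubs SDK values; the Azure SDKs ship their own configurable retry policies, which should be preferred when available:

```python
import random

def backoff_delays(max_retries=5, base=0.8, cap=30.0, rng=random.random):
    """Full-jitter exponential backoff delays (seconds) for retrying
    throttled (server busy) sends. Jitter desynchronizes producers so a
    retry storm does not re-trigger throttling in lockstep."""
    delays = []
    for attempt in range(max_retries):
        ceiling = min(cap, base * (2 ** attempt))
        delays.append(rng() * ceiling)  # full jitter: uniform in [0, ceiling)
    return delays
```

A producer would sleep for each successive delay between retries, giving up once the schedule is exhausted and surfacing the failure to its caller.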
Responding to Server Errors and User Errors
Possible Solutions
- Triage Server Errors
- Server errors indicate a platform-side issue. Cross-reference with Azure Service Health to determine whether an active incident is affecting Event Hubs in your region.
- If no platform incident is active, review namespace resource utilization; server errors can occur under extreme resource saturation even in the absence of a broader platform issue.
- Triage User Errors
- User errors are client-side and typically fall into three categories: authentication and authorization failures, references to nonexistent Event Hubs or consumer groups, and malformed requests.
- Review producer and consumer application logs to identify the specific error codes being returned. Common culprits include expired SAS tokens, incorrect connection strings, or consumer groups that have been deleted.
- Engage Azure Support
- For persistent server errors not attributable to resource saturation or a known platform incident, open a support case with Azure Support.
- Provide the namespace resource ID, affected timeframe, and observed error codes to expedite investigation.
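Assuming failures surface with HTTP-style status codes (as they do for the Event Hubs REST API), a first-pass triage along the lines above might be sketched as follows. The mapping is illustrative, not an exhaustive list of Event Hubs error codes:

```python
def classify_event_hubs_error(status_code):
    """Rough triage of Event Hubs request failures by HTTP status code."""
    if status_code in (401, 403):
        return "user: auth"        # expired SAS token, wrong key or connection string
    if status_code == 404:
        return "user: not found"   # deleted or misnamed event hub / consumer group
    if status_code == 400:
        return "user: malformed"   # bad request payload or parameters
    if status_code == 503:
        return "throttled/busy"    # server busy; retry with backoff
    if 500 <= status_code < 600:
        return "server error"      # platform-side; check Azure Service Health
    return "other"
```

Running recent error codes from application logs through a classifier like this helps separate what you can fix in client configuration from what warrants an Azure support case.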
Remember to correlate Event Hubs monitor alerts with related signals across your infrastructure — throttling alongside high cluster utilization tells a different story than throttling in isolation. Review alert history to identify patterns in recurring saturation or error conditions that may indicate architectural considerations for improved pipeline resilience. Consult Azure Support for clarification on platform-side error conditions or to escalate issues requiring expedited resolution.