Published on
Nov 6, 2024
As machine learning data engineers at Kloudfuse, we're excited to highlight the advanced ML and AI capabilities of our platform. As we learn more from customers who already use these capabilities to proactively identify issues and enhance their user experience, we will continue to improve existing features and add new AI-powered ones.
Here is a short summary of our current ML and AI features, along with the latest updates in Kloudfuse 3.0.
1. Rolling Quantile
Use Case: Anomaly Detection
We use rolling quantiles to detect anomalies in time series data. By calculating quantiles (such as the 25th and 75th percentiles) and the interquartile range (IQR) over a moving time window, we establish a statistical range of expected behavior for each window. When real-time values deviate beyond this range, Kloudfuse signals a potential anomaly, allowing engineers to react promptly.
When monitoring server response times, for example, we can compute the 75th percentile over the past hour and add 1.5 times the IQR (the difference between the 75th and 25th percentiles) to set an upper performance threshold. If the current response time exceeds this threshold, an alert is triggered, indicating a possible performance issue.
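As a minimal sketch of this technique (the response-time values and five-point window below are illustrative assumptions, not Kloudfuse's actual implementation), pandas' rolling quantiles make the threshold easy to compute:

```python
import pandas as pd

# Hypothetical response times (ms), one sample per interval.
resp = pd.Series([120, 118, 125, 122, 130, 119, 124, 121, 320, 123])

window = 5  # stands in for the "past hour" window in the text
q75 = resp.rolling(window).quantile(0.75)
q25 = resp.rolling(window).quantile(0.25)
iqr = q75 - q25

# Upper threshold: 75th percentile + 1.5 * IQR, as described above.
threshold = q75 + 1.5 * iqr

# Compare each point against the threshold from the *previous* window,
# so an anomalous point does not inflate its own baseline.
anomalies = resp[resp > threshold.shift(1)]
```

Here the spike to 320 ms at index 8 lands well above the threshold derived from the preceding window and is flagged, while ordinary fluctuations stay inside the band.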
2. SARIMA (Seasonal Autoregressive Integrated Moving Average)
Use Case: Time Series Forecasting and Anomaly Detection
We use SARIMA for forecasting future values in seasonal time series data such as server load and request counts. Its ability to incorporate trends and seasonality makes it a popular choice in observability, particularly for capacity planning and identifying when resources may be insufficient.
For instance, SARIMA allows Kloudfuse to analyze historical data that shows increased service requests on weekends. By leveraging these forecasts, we can anticipate request volumes for the upcoming week, enabling proactive resource scaling to prevent outages during peak times.
SARIMA can also be used for anomaly detection by forecasting a range (upper and lower bounds) and flagging the real-time value of the time series as an anomaly if it falls outside that range. We give users the ability to create anomaly alerts using SARIMA, which, unlike the rolling quantile method, is robust to seasonal changes.
3. DBSCAN (Density-Based Spatial Clustering of Applications with Noise)
Use Case: Time Series Outlier Detection
We use DBSCAN to identify clusters in high-dimensional observability data, detecting patterns and highlighting unusual behavior in system performance. This technique is particularly useful for uncovering correlated failures.
For example, when facing an excessive volume of log entries, DBSCAN can cluster similar sources together, isolating the sources producing abnormal log messages that suggest specific system failures. This accelerates the identification of root causes during incident investigations.
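A small sketch of the idea using scikit-learn's `DBSCAN` (the per-service feature vectors and the `eps`/`min_samples` values are made-up assumptions for illustration):

```python
import numpy as np
from sklearn.cluster import DBSCAN

# Hypothetical per-service features: [avg latency (ms), error rate (%)].
features = np.array([
    [100, 1.0], [105, 1.2], [98, 0.9],   # healthy services form a dense cluster
    [102, 1.1], [99, 1.0],
    [480, 9.5],                          # one misbehaving service
])

labels = DBSCAN(eps=15, min_samples=2).fit_predict(features)

# DBSCAN labels points that fit no dense cluster as -1 (noise);
# those are the services whose behavior diverges from their peers.
outliers = np.where(labels == -1)[0]
```

The healthy services group into one cluster, while the service with extreme latency and error rate is labeled noise, pointing the investigation at it directly.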
4. Seasonal Decomposition
Use Case: Anomaly Detection and Forecasting
We use seasonal decomposition to break down time series data into trend and seasonal components. This helps differentiate seasonal fluctuations from anomalies, making it easier to diagnose whether a deviation is part of normal behavior or indicative of a problem.
For example, CPU traffic can be decomposed into its trend and daily/weekly seasonal patterns. If traffic spikes outside these patterns, additional resources can be provisioned, or the spike can be investigated as a potential problem.
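A minimal sketch of classical additive decomposition, assuming a synthetic hourly CPU metric with a daily cycle (the data and the period are illustrative, not production values):

```python
import numpy as np

# Hypothetical hourly CPU metric: linear trend + daily (period-24) cycle.
period = 24
t = np.arange(7 * period)
series = 0.05 * t + 10 * np.sin(2 * np.pi * t / period) + 50

# Trend: centered moving average over one full period; window edges
# are left undefined (NaN) rather than padded.
half = period // 2
trend = np.full(len(series), np.nan)
kernel = np.ones(period) / period
trend[half:half + len(series) - period + 1] = np.convolve(series, kernel, mode="valid")

# Seasonal component: average of the detrended values at each position
# within the period.
detrended = series - trend
seasonal = np.array([np.nanmean(detrended[i::period]) for i in range(period)])

# Residual: whatever trend and seasonality don't explain;
# large residuals are candidate anomalies.
residual = detrended - np.tile(seasonal, len(series) // period)
```

In practice a library routine such as statsmodels' `seasonal_decompose` does the same steps; the point is that the residual, not the raw series, is what gets checked against alert thresholds.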
5. Pearson Correlation Coefficient
Use Case: Time Series Correlation Analysis
We recently added the Pearson correlation coefficient to correlate metrics, events, logs, and traces (MELT) data. Understanding these correlations can highlight potential bottlenecks or dependencies in system performance.
For example, when monitoring CPU usage and application response times, the Pearson correlation coefficient can reveal that high CPU usage coincides with increased response times. This finding may indicate a resource allocation issue that warrants further investigation.
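The computation itself is a one-liner; in this sketch the paired CPU and response-time samples are made-up values for illustration:

```python
import numpy as np

# Hypothetical paired samples collected at the same timestamps:
# CPU usage (%) and application response time (ms).
cpu = np.array([35, 40, 52, 60, 71, 80, 88, 93])
resp = np.array([110, 118, 135, 150, 180, 210, 260, 300])

# Pearson r lies in [-1, 1]; values near +1 indicate a strong
# positive linear relationship between the two series.
r = np.corrcoef(cpu, resp)[0, 1]
```

A value of `r` close to 1 here would support the hypothesis that response time degrades as CPU saturates, whereas a value near 0 would suggest looking elsewhere for the bottleneck.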
6. Prophet
Use Case: Time Series Forecasting and Anomaly Detection
In our latest Kloudfuse 3.0 release, we've integrated Prophet for enhanced forecasting and anomaly detection.
In many ways, Prophet is well suited to observability data: it excels at handling irregular time series and tolerates missing values and significant outliers that would otherwise require preprocessing. More details can be found in the official Prophet documentation.
For instance, when monitoring the number of incoming CPU hits over time, there may be gaps in the data due to system outages or periods of low activity. Prophet can model the underlying trend and seasonal patterns, such as regular spikes at particular times of day, and predict future volumes even in the presence of these gaps.
Prophet outperforms many earlier models in speed and its ability to deliver real-time insights. It requires no explicit hyperparameter tuning via time-consuming cross-validation: users can easily set periodicities, such as daily or weekly, and specify multiple seasonalities, making adjustments straightforward without additional tuning.
Moreover, as a Bayesian algorithm, Prophet provides better confidence bounds by predicting a distribution for each timestamp, which lowers the risk of false alarms compared to the single point estimates typical of models that generate confidence bounds with a frequentist approach. This is especially useful when training data is limited: in those scenarios, Prophet generates broader confidence bounds, allowing for larger thresholds and reducing alert fatigue. As the algorithm gathers more data, the confidence bounds it generates become increasingly precise and narrow.
Conclusion
Incorporating advanced machine learning algorithms into our observability platform has allowed Kloudfuse to transform how we manage and predict system performance. By leveraging these powerful tools, we can detect anomalies, forecast demand, and identify patterns with greater accuracy. This not only enhances our capacity planning but also ensures a more resilient infrastructure that can adapt to changing needs.
Stay tuned for more as we continue to refine our approach.