Why We Built an Observability Data Lake

Observability has evolved significantly over the past couple of decades. At Kloudfuse, our journey began with a clear vision: to transition from fragmented, vendor-specific observability solutions to a unified platform. After speaking with hundreds of users of tools like DataDog, New Relic, and various open-source alternatives, it became clear that the primary challenge was tool fragmentation. Developers and SREs often found themselves navigating multiple tools for diagnostics, relying on everything from log analysis and metrics monitoring to APM to pinpoint the root causes of failures. 

To tackle this challenge, we recognized that observability requires seamless backend integration at the data layer. This would eliminate manual intervention and speed up root cause analysis. By enabling this integration, one could quickly identify whether an application's performance issues stem from faulty code, slow infrastructure, complex interactions among microservices, or a combination of all three.

This led us to consider data lakes—a proven technology in the data and analytics space—as the foundational architecture for our observability platform.

Key factors we evaluated in selecting a data lake for our solution included the ability to perform sub-second real-time monitoring, handle massive volumes of observability data, uncover causal relationships for root cause analysis, and integrate advanced machine learning algorithms for pattern detection and forecasting.

Today, we're proud of our decision. We see competitors following our lead, and our customers consistently affirm that Kloudfuse addresses their current challenges while providing a solution for their future needs. 

To delve deeper, here are the key reasons why Kloudfuse’s Observability Data Lake stands out:

Unification of Fragmented Observability

In a market filled with traditional observability tools that separate metrics, traces, and logs, data silos complicate operations. Kloudfuse unifies all telemetry data in a single observability data lake, an approach illustrated in the short sketch after the list below. This allows for:

  • Correlation across various data types—metrics, traces, logs, and more—without managing multiple tools.

  • A unified query language, user interface, and alerting system, enabling users to easily identify and analyze performance issues across the full stack.

  • Faster troubleshooting by maintaining context throughout investigations.
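
To make cross-signal correlation concrete, here is a minimal sketch in Python of joining spans and logs that share a trace ID so an investigation keeps its context in one place. This is an illustration only, not Kloudfuse's query engine, and the field names and data shapes are assumptions:

    # Minimal illustration of cross-signal correlation (hypothetical data shapes).
    # Spans and logs that carry the same trace_id are grouped together so a slow
    # request can be inspected alongside the log lines it produced.
    from collections import defaultdict

    spans = [
        {"trace_id": "a1", "service": "checkout", "duration_ms": 2300},
        {"trace_id": "b2", "service": "search",   "duration_ms": 45},
    ]
    logs = [
        {"trace_id": "a1", "level": "ERROR", "message": "payment gateway timeout"},
        {"trace_id": "a1", "level": "WARN",  "message": "retrying request"},
    ]

    logs_by_trace = defaultdict(list)
    for log in logs:
        logs_by_trace[log["trace_id"]].append(log)

    # Surface slow traces together with their correlated log lines.
    for span in spans:
        if span["duration_ms"] > 1000:
            print(span["service"], span["duration_ms"], "ms")
            for log in logs_by_trace[span["trace_id"]]:
                print("  ", log["level"], log["message"])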

Scalability and Cost Efficiency

Observability generates massive amounts of data, resulting in skyrocketing costs primarily due to the processing and storage requirements associated with this data. Traditional vendors often charge millions of dollars in fees and overages to monitor and track this data for their customers. 

Kloudfuse addresses this challenge by allowing customers to:

  • Deploy their observability data lake in a cloud-prem environment (in their own cloud account), giving them control over costs.

  • Utilize cloud credits and discounts for storage.

  • Store data affordably using low-cost Amazon S3 storage.

  • Eliminate egress fees associated with moving data to vendor clouds.

  • Scale up as telemetry data grows, without incurring usage-based costs.

Schemaless Ingest and Real-Time Analytics

Observability requires real-time insights. We designed our data lake to ingest data in real time without preprocessing, while ensuring fast query performance. We chose Apache Pinot, a distributed, real-time OLAP datastore, as the base layer. While Pinot by itself is not purpose-built for observability, it provides:

  • Ultra-low query latencies for immediate monitoring and alerts.

  • High query concurrency for handling large workloads.

Building on top of Pinot, we have implemented extensive functionality to purpose-build it for observability, a high-cardinality, metadata-heavy workload. These advancements include:

  • Instant data availability upon ingestion, achieved through high-speed, schemaless indexing that enables flexible storage of diverse data without a predefined structure. This approach allows the platform to accommodate all types of observability data, both now and in the future.

  • A fingerprinting index that separates the static and dynamic parts of observability events, enabling efficient indexing and improved query performance. This automatic facet extraction from logs facilitates faster searches without the need for extensive human-written parsing rules. Additionally, deduplication of event signatures reduces the storage footprint, resulting in significant cost savings (a minimal fingerprinting sketch follows this list).

  • Decoupling of storage and compute layers, along with computation pushdown, to enhance query performance and analysis.

  • Compression techniques for both low-cardinality and high-cardinality datasets, typical of observability data streams.
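
To illustrate the fingerprinting idea, here is a minimal sketch that masks the dynamic parts of log lines (numbers, IP addresses, hex IDs), hashes the remaining static template, and deduplicates lines by that signature. It is not the Kloudfuse indexer; the masking rules and log lines are illustrative assumptions:

    # Minimal log-fingerprinting sketch: mask dynamic tokens, hash the static
    # template, and group/deduplicate log lines by that signature.
    import hashlib
    import re
    from collections import Counter

    MASKS = [
        (re.compile(r"\b\d{1,3}(\.\d{1,3}){3}\b"), "<IP>"),  # IPv4 addresses
        (re.compile(r"\b[0-9a-f]{8,}\b"), "<HEX>"),          # long hex ids
        (re.compile(r"\b\d+\b"), "<NUM>"),                   # any number
    ]

    def fingerprint(line: str) -> tuple[str, str]:
        """Return (template, signature) for a raw log line."""
        template = line
        for pattern, token in MASKS:
            template = pattern.sub(token, template)
        signature = hashlib.sha1(template.encode()).hexdigest()[:12]
        return template, signature

    lines = [
        "user 42 logged in from 10.0.0.7 in 83 ms",
        "user 97 logged in from 10.0.1.9 in 61 ms",
        "cache miss for key 7f3a9c21d4",
    ]

    counts = Counter()
    templates = {}
    for line in lines:
        template, sig = fingerprint(line)
        counts[sig] += 1
        templates[sig] = template

    # Two of the three raw lines collapse into one deduplicated signature.
    for sig, n in counts.items():
        print(n, sig, templates[sig])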

Data Transformation and Optimization

As observability data comes from various applications and platforms, and because we are committed to keeping the data lake open to ingest any type of telemetry data, we recognized the need for flexibility in transforming this data for efficient analysis. In this context, Kloudfuse enables (a short transformation sketch follows the list):

  • Metrics rollups and aggregations to manage high-cardinality data, improving querying and analysis during troubleshooting.

  • Dimensionality reduction by filtering out unnecessary labels, tags, or attributes, reducing storage costs and enhancing processing efficiency.

  • Mapping and transformation (e.g., converting IP addresses to zones) to speed up query performance in root cause analysis, minimizing the need to process every data row.

  • Custom data shaping to optimize querying and analysis for better performance.
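
As a rough illustration of these transformations, the sketch below drops a high-cardinality label, maps IP addresses to zones, and rolls raw metric samples up to one-minute averages. The label names and zone mapping are invented for the example and do not reflect Kloudfuse's actual pipeline:

    # Illustrative transformation pipeline: label dropping, mapping, and rollup.
    from collections import defaultdict

    # Hypothetical zone mapping; real mappings would come from configuration.
    ZONE_BY_PREFIX = {"10.0.": "us-east-1a", "10.1.": "us-east-1b"}
    DROP_LABELS = {"pod_uid"}  # a high-cardinality label we choose not to keep

    samples = [
        {"ts": 61, "metric": "latency_ms", "value": 120, "ip": "10.0.3.4", "pod_uid": "x1"},
        {"ts": 65, "metric": "latency_ms", "value": 180, "ip": "10.0.3.4", "pod_uid": "x2"},
        {"ts": 62, "metric": "latency_ms", "value": 90,  "ip": "10.1.7.8", "pod_uid": "y9"},
    ]

    def transform(sample):
        s = {k: v for k, v in sample.items() if k not in DROP_LABELS}
        prefix = ".".join(s["ip"].split(".")[:2]) + "."
        s["zone"] = ZONE_BY_PREFIX.get(prefix, "unknown")  # IP -> zone mapping
        del s["ip"]
        return s

    # Roll up to one-minute averages per (metric, zone).
    buckets = defaultdict(list)
    for s in map(transform, samples):
        minute = s["ts"] // 60
        buckets[(s["metric"], s["zone"], minute)].append(s["value"])

    for (metric, zone, minute), values in buckets.items():
        print(metric, zone, minute, sum(values) / len(values))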

Open Platform and Open Standards: Avoiding Vendor Lock-In

From our field interviews, a common sentiment emerged: organizations have been burned by previous observability investments and are keenly aware of the risks of vendor lock-in. At the same time, OpenTelemetry has gained traction and is becoming more comprehensive. In response, we designed Kloudfuse to support open-source agents and collectors, along with open query languages and an open architecture for integration. This includes:

  • Support for OpenTelemetry collectors across all metrics, logs, traces, and profiles in various languages (e.g., Java, Python, Go); a brief SDK-to-collector example follows this list.

  • Poly-agent data input, accepting data from various agents such as DataDog, New Relic, Elastic, Fluent Bit, Prometheus, and others.

  • Compatibility with multiple query languages, such as PromQL, TraceQL, LogQL, SQL, and GraphQL, ensuring seamless data access.

  • Interoperability with a range of tools, including SIEM, CI/CD pipelines, incident management systems, and notification channels.

  • An open platform to ensure that observability data remains accessible and developer-friendly.
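
As a brief example of the OpenTelemetry path, the sketch below uses the standard Python OpenTelemetry SDK to export spans over OTLP to a collector. The endpoint address is a placeholder assumption; in practice the collector would forward telemetry to a Kloudfuse ingest endpoint:

    # Minimal OpenTelemetry tracing setup in Python, exporting spans over OTLP
    # to a collector (the endpoint below is a placeholder assumption).
    # Requires: opentelemetry-sdk, opentelemetry-exporter-otlp
    from opentelemetry import trace
    from opentelemetry.sdk.resources import Resource
    from opentelemetry.sdk.trace import TracerProvider
    from opentelemetry.sdk.trace.export import BatchSpanProcessor
    from opentelemetry.exporter.otlp.proto.grpc.trace_exporter import OTLPSpanExporter

    provider = TracerProvider(resource=Resource.create({"service.name": "checkout"}))
    provider.add_span_processor(
        BatchSpanProcessor(OTLPSpanExporter(endpoint="http://otel-collector:4317", insecure=True))
    )
    trace.set_tracer_provider(provider)

    tracer = trace.get_tracer(__name__)

    with tracer.start_as_current_span("place-order") as span:
        span.set_attribute("order.items", 3)
        # ... application logic being traced ...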

Simplified Migrations

With many organizations already invested in observability tools, we prioritized making the transition to Kloudfuse as smooth as possible. To enable this, our data lake provides:

  • Migration of existing alerts and dashboards, so customers keep their organizational knowledge and onboard into Kloudfuse quickly; we build custom table views and dashboards on top of our data lake that mirror their existing dashboards.

  • Support for vendor-specific agents (e.g., DataDog or New Relic) to collect data into Kloudfuse’s unified data lake. Customers moving from multiple vendors gain a single view for their observability analytics and monitoring.

  • Embedded Grafana dashboards alongside Kloudfuse dashboards, enabling queries against Kloudfuse’s unified data across metrics, logs, traces, digital experience monitoring, and more, using tools that SREs and developers already know.

Enabling AI and ML for Root Cause Analysis

AI is transforming observability by facilitating automated root cause analysis. However, this requires a vast amount of robust data to model correlations and establish historical baselines. The Kloudfuse Observability Data Lake enables (a short sketch of these techniques follows the list):

  • Correlation algorithms using Pearson correlation to gain deep insights into how metrics, events, logs, and traces (MELT) are interconnected, highlighting potential bottlenecks and dependencies.

  • Anomaly and Outlier detection using DBSCAN, Rolling Quantile, and Seasonal Decomposition to identify patterns such as excessive log entries that can help pinpoint sources of abnormal log messages.

  • Forecasting using SARIMA to analyze historical data that shows increased service requests during weekends, enabling proactive resource scaling to prevent outages during peak times.

  • Utilization of Prophet, an algorithm developed by Facebook, to model underlying trends and seasonal patterns, such as incoming CPU hits over time, even accounting for gaps in data due to outages or low activity periods.

  • Integration of custom models, allowing customers to bring their own ML models into the platform. Because the open data lake keeps data in the customer’s own environment, they retain ownership of it and avoid egress costs and API rate limits, making the solution seamless and cost-effective.
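
To give a flavor of the techniques named above, here is a minimal sketch on synthetic data that computes a Pearson correlation between two metric series, flags points above a rolling quantile, and fits a small SARIMA forecast with a daily seasonal component. The data and parameters are illustrative assumptions, not Kloudfuse's production models:

    # Illustrative sketches of the techniques mentioned above, on synthetic data.
    # Requires: numpy, pandas, statsmodels
    import numpy as np
    import pandas as pd
    from statsmodels.tsa.statespace.sarimax import SARIMAX

    rng = np.random.default_rng(0)
    hours = pd.date_range("2024-01-01", periods=24 * 14, freq="h")
    latency = 100 + 20 * np.sin(np.arange(len(hours)) * 2 * np.pi / 24) + rng.normal(0, 5, len(hours))
    errors = 0.2 * latency + rng.normal(0, 3, len(hours))

    # 1) Pearson correlation between two metric series (how related are they?).
    pearson = np.corrcoef(latency, errors)[0, 1]
    print("Pearson correlation:", round(pearson, 3))

    # 2) Rolling-quantile anomaly detection: flag points above the recent p99.
    series = pd.Series(latency, index=hours)
    threshold = series.rolling(window=48, min_periods=24).quantile(0.99)
    anomalies = series[series > threshold]
    print("anomalous points:", len(anomalies))

    # 3) SARIMA forecast with a daily (24-hour) seasonal component.
    model = SARIMAX(series, order=(1, 1, 1), seasonal_order=(1, 1, 1, 24))
    fit = model.fit(disp=False)
    forecast = fit.forecast(steps=24)  # next 24 hours
    print(forecast.head())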

The Next Evolution: Agentic Workflows and LLMs

Large Language Models (LLMs) have taken the industry by storm, leading to the creation of new Gen AI applications and agents for various use cases. This surge has made observability for LLMs a crucial domain, establishing data lakes as essential components for several reasons:

  • End-to-End Monitoring: Enables comprehensive tracking of LLM application traces, evaluating the entire stack of an LLM-powered application, including the LLM itself, external data sources, retrieval calls from vector databases, agent performance, and other functions.

  • Quality Evaluation: Facilitates the assessment of model-generated responses by providing analytics and comparison metrics to measure the accuracy, relevance, and appropriateness of LLM outputs against ground truth or golden datasets, with a focus on addressing hallucination, toxicity, sensitivity, and code embedding (a small evaluation sketch follows this list).

  • Cost Efficiency: Allows organizations to keep LLM Observability local, deployed in their own environment, eliminating egress fees associated with transferring observability data to hosted platforms, which can add significant costs on top of already expensive LLM applications.

  • Facilitating Agentic Workflows: Supports the creation of workflows such as explainability and summarization through unrestricted access to data. A data lake serves as a centralized repository enriched by telemetry data correlations provided by Kloudfuse, along with the potential integration of other public or private data sources to create agents tailored to specific use cases.
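
As a small illustration of quality evaluation against a golden dataset, the sketch below scores model outputs with exact match and token-level F1, two common comparison metrics. The example data is invented, and this is not Kloudfuse's evaluation pipeline:

    # Minimal golden-dataset evaluation sketch: exact match and token-level F1.
    from collections import Counter

    def exact_match(prediction: str, reference: str) -> float:
        return float(prediction.strip().lower() == reference.strip().lower())

    def token_f1(prediction: str, reference: str) -> float:
        pred_tokens = prediction.lower().split()
        ref_tokens = reference.lower().split()
        common = Counter(pred_tokens) & Counter(ref_tokens)
        overlap = sum(common.values())
        if overlap == 0:
            return 0.0
        precision = overlap / len(pred_tokens)
        recall = overlap / len(ref_tokens)
        return 2 * precision * recall / (precision + recall)

    golden = [  # invented (prompt, reference answer, model output) triples
        ("capital of France?", "Paris", "Paris"),
        ("largest planet?", "Jupiter", "The largest planet is Jupiter"),
    ]

    for prompt, reference, output in golden:
        print(prompt, "EM:", exact_match(output, reference),
              "F1:", round(token_f1(output, reference), 2))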

In our upcoming article, we will dive deeper into these points and explore further reasons why data lakes are crucial for LLM observability.

Final Thoughts

We believe that a data-first approach to observability offers the most flexibility, scalability, and intelligence. By building our platform on a real-time, unified, and AI-powered observability data lake, we empower enterprises to move beyond fragmented, high-cost observability solutions.

This is just the beginning. As LLM observability evolves and AI-driven monitoring takes center stage, Kloudfuse is poised to drive the next wave of innovation in data-centric observability.

Observe. Analyze. Automate.
