Behind the Scenes: Building Service Catalog

Elevating Collaboration and On-Call Coverage

Published on

Nov 6, 2024

As an engineer leading Application Performance Monitoring (APM) at Kloudfuse, I’m excited to share our latest addition: the Service Catalog. My experience with distributed systems at Google, Stripe and Nvidia has shaped my approach to building a robust APM platform that helps developers monitor their services efficiently and root-cause issues quickly.

What is Kloudfuse APM?

Kloudfuse offers application performance monitoring to track how distributed services interact with each other, including service endpoints, backend components, and databases. With features like service maps, request flows, and flame graphs, developers can quickly visualize service requests and understand their impact on other services. 

Kloudfuse Span Analytics provides insights into application performance and resource usage based on key metrics (rate, errors, duration) in real time. We have also built our “Analyze with K-Lens” feature to detect outliers and identify root causes of issues more quickly and with greater precision. More about that in another blog post. 

To get the most out of our APM capabilities, we needed to not only monitor active services but also identify those that have stopped sending data. We also needed a platform that is conducive to collaboration and that shortens the time it takes to find impacted services and related workflows.

What’s New?

The Service Catalog introduces key enhancements to our APM solution:

  • Service Discovery: Users can now find services across various APM tools like Kloudfuse and OpenTelemetry, including details on service dependencies and version changes.

  • Visibility into Non-Active Services: The platform now lets users see services that are not currently sending span data but did send it at some point during the past month. This feature helps identify services that may have gone offline unexpectedly or have encountered instrumentation issues that prevent span data from reaching Kloudfuse.

  • Centralized Collaboration: The Service Catalog acts as a central hub for managing microservice ownership and collaboration during incidents, breaking down knowledge silos and improving on-call coverage.

Enhanced Functionality

With the new Service Catalog, users can easily switch between viewing only active services and seeing all services, including those that were active previously.
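One way to picture this toggle is as a filter over a per-service last-seen timestamp. The sketch below is only an illustration under stated assumptions: the `list_services` function, the one-hour "active" window, and the sample service names are all hypothetical, and only the one-month lookback comes from the description above.

```python
from datetime import datetime, timedelta, timezone

# Assumption: a service counts as "active" if it sent spans within the last hour.
ACTIVE_WINDOW = timedelta(hours=1)
# The catalog keeps services that sent spans during the past month.
RETENTION = timedelta(days=30)


def list_services(last_seen, now, active_only):
    """Return service names whose last-seen time falls inside the chosen window."""
    window = ACTIVE_WINDOW if active_only else RETENTION
    cutoff = now - window
    return sorted(name for name, seen in last_seen.items() if seen >= cutoff)


# Hypothetical sample data for illustration only.
now = datetime(2024, 11, 6, 12, 0, tzinfo=timezone.utc)
last_seen = {
    "checkout": now - timedelta(minutes=10),  # still sending spans
    "billing": now - timedelta(days=3),       # stopped a few days ago
    "legacy": now - timedelta(days=45),       # outside the one-month lookback
}

active = list_services(last_seen, now, active_only=True)
everything = list_services(last_seen, now, active_only=False)
```

Here `active` contains only `checkout`, while `everything` also includes `billing`, the kind of service that may have gone offline or lost its instrumentation.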

We focused on ensuring compatibility while managing relational data, analytics data, and cached data across multiple components. Each storage system has its strengths, and balancing them was crucial for maintaining performance.

One of our primary challenges was to effectively detect span events from new services or new versions of existing services during data ingestion, without any performance penalties. We aimed to avoid executing complex service detection logic on each span event received by Kloudfuse, as this would significantly slow down our data ingestion rate. 

We also needed to track “first-seen” and “last-seen” timestamps for services without analyzing every individual span event. Implementing the “first-seen” functionality was simpler: we used a distributed cache to detect new service-version combinations. Tracking “last-seen” was harder to do without inspecting every span event. To address this, instead of recording every “last-seen” timestamp, we implemented a “last-seen-time-bucket”. This lets us inspect the “last-seen” time only once per time bucket (one hour by default), which reduces the frequency of checks. The time bucket is now part of the cache key alongside the service and version, so we can determine whether the service was active at any point during that bucket.
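The bucketed cache key can be sketched roughly as follows. This is a minimal illustration, not Kloudfuse's implementation: the `SeenTracker` class and its fields are hypothetical, plain Python sets stand in for the distributed cache, and dictionaries stand in for whatever store holds the catalog. Only the one-hour default and the (service, version, time bucket) key come from the description above.

```python
from dataclasses import dataclass, field

BUCKET_SECONDS = 3600  # one hour, the default bucket size described above


def bucket_start(ts, bucket_seconds=BUCKET_SECONDS):
    """Round a Unix timestamp down to the start of its time bucket."""
    return ts - (ts % bucket_seconds)


@dataclass
class SeenTracker:
    # Assumption: in-memory sets stand in for the distributed cache.
    first_seen_keys: set = field(default_factory=set)
    bucket_keys: set = field(default_factory=set)
    # Assumption: dicts stand in for the catalog's persistent store.
    first_seen: dict = field(default_factory=dict)
    last_seen_bucket: dict = field(default_factory=dict)
    catalog_writes: int = 0  # counts the "expensive" updates

    def observe(self, service, version, ts):
        key = (service, version)
        # First-seen: only a cache miss on (service, version) triggers a write.
        if key not in self.first_seen_keys:
            self.first_seen_keys.add(key)
            self.first_seen[key] = ts
            self.catalog_writes += 1

        # Last-seen: the bucket is part of the cache key, so at most one
        # write happens per (service, version) per bucket.
        bucket = bucket_start(ts)
        bucket_key = (service, version, bucket)
        if bucket_key not in self.bucket_keys:
            self.bucket_keys.add(bucket_key)
            previous = self.last_seen_bucket.get(key, bucket)
            self.last_seen_bucket[key] = max(previous, bucket)
            self.catalog_writes += 1


tracker = SeenTracker()
# Three spans in the same hour, then one in a later hour.
for ts in (7200, 7260, 9000, 10800):
    tracker.observe("checkout", "v1", ts)
```

In this sketch, four span events produce three catalog writes: one for first-seen and one per distinct hour. Every other span costs only a cache lookup, which is the property that keeps per-span work cheap during ingestion.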

We rolled out this feature without any performance issues. In fact, some APM queries for Service Discovery are now up to 100 times faster, with queries that took 10 seconds now completing in just 100 milliseconds!

Immediate Impact

The feedback from users has been overwhelmingly positive. It’s rewarding to see how this feature has immediately improved query performance and validated our work.

As we continue to develop Kloudfuse, we’re dedicated to enhancing user experience and providing powerful tools for effective service monitoring and problem resolution.

Stay tuned for more updates!

Observe. Analyze. Automate.

All Rights Reserved ® Kloudfuse 2024

Terms and Conditions
