Rails request Apdex SLI
Introduced in GitLab 14.4
NOTE: This SLI is used for service monitoring, but by default not for error budgets for stage groups. You can opt in.
The request Apdex SLI (Service Level Indicator) is an SLI defined in the application. It measures the duration of successful requests as an indicator for application performance. This includes the REST and GraphQL API, and the regular controller endpoints. It consists of these counters:
gitlab_sli:rails_request_apdex:total: This counter gets incremented for every request that did not result in a response with a 5xx status code. It ensures slow failures are not counted twice, because the request is already counted in the error SLI.
gitlab_sli:rails_request_apdex:success_total: This counter gets incremented for every successful request that completed faster than the target duration defined by the endpoint's urgency.
Both these counters are labeled with:
endpoint_id: The identification of the Rails Controller or the Grape-API endpoint.
feature_category: The feature category specified for that controller or API endpoint.
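The counting rules above can be sketched in plain Ruby. This is a minimal illustration, not the GitLab implementation: the `ApdexCounters` class, its `record` method, and the in-memory hashes are hypothetical stand-ins for the Prometheus counters, and the caller is assumed to supply the target duration for the endpoint.

```ruby
# Hypothetical in-memory stand-in for the two counters described above.
# Not the GitLab implementation; names and structure are assumptions.
class ApdexCounters
  attr_reader :total, :success_total

  def initialize
    # Counters keyed by their label set; missing keys start at 0.
    @total = Hash.new(0)
    @success_total = Hash.new(0)
  end

  # status is the HTTP status code; duration and target are in seconds.
  def record(endpoint_id:, feature_category:, status:, duration:, target:)
    # 5xx responses are already counted by the error SLI, so skip them here.
    return if status >= 500

    labels = { endpoint_id: endpoint_id, feature_category: feature_category }
    @total[labels] += 1
    # Only requests faster than the target count as a success.
    @success_total[labels] += 1 if duration < target
  end
end
```

In this sketch, a slow request with a 200 status increments only the total, while a request with a 503 status increments neither counter.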
Request Apdex SLO
These counters can be combined into a success ratio. The objective for this ratio is defined in the service catalog per service. For this SLI to meet SLO, the ratio recorded must be higher than:
For example: for the web-service, we want at least 99.8% of requests to be faster than their target duration.
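As a rough sketch of that arithmetic, the ratio and the SLO check could look like the following. The helper names `apdex_ratio` and `meets_slo?` are hypothetical, the 0.998 default is taken from the web-service example, and treating "no traffic" as meeting the objective is an assumption of this sketch.

```ruby
# Hypothetical helper: share of counted requests that met their target duration.
def apdex_ratio(success_total, total)
  # Assumption: with no traffic, nothing violated the target.
  return 1.0 if total.zero?

  success_total.to_f / total
end

# Hypothetical helper: compare the ratio to a per-service objective.
# The default of 0.998 mirrors the web-service example ("at least 99.8%").
def meets_slo?(success_total, total, slo: 0.998)
  apdex_ratio(success_total, total) >= slo
end
```

For instance, 9,990 fast requests out of 10,000 counted requests gives a ratio of 0.999, which satisfies a 0.998 objective, while 9,970 out of 10,000 does not.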
We use these targets for alerting and service monitoring. Set durations taking these targets into account, so we don't cause alerts. The goal, however, is to set the urgency to a target that satisfies our users.
Both successful measurements and unsuccessful ones affect the error budget for stage groups.
Adjusting request urgency
Not all endpoints perform the same type of work, so it is possible to define different urgency levels for different endpoints. An endpoint with a lower urgency can have a longer request duration than endpoints with high urgency.
Long-running requests are more expensive for our infrastructure. While serving one request, the thread remains occupied for the duration of that request. The thread can handle nothing else. Due to Ruby's Global VM Lock, the thread might keep the lock and stall other requests handled by the same Puma worker process. The request is, in fact, a noisy neighbor for other requests handled by the worker. We cap the upper bound for a target duration at 5 seconds for this reason.
Decreasing the urgency (setting a higher target duration)
You can decrease the urgency on an existing endpoint on a case-by-case basis. Take the following into account:
Apdex is about perceived performance. If a user is actively waiting for the result of a request, waiting 5 seconds might not be acceptable. However, if the endpoint is used by an automation requiring a lot of data, 5 seconds could be acceptable.
A product manager can help to identify how an endpoint is used.
The workload for some endpoints can sometimes differ greatly depending on the parameters specified by the caller. The urgency needs to accommodate those differences. In some cases, you could define a separate application SLI for what the endpoint is doing.
When the endpoints in certain cases turn into no-ops, making them very fast, we should ignore these fast requests when setting the target. For example, if the MergeRequests::DraftsController is hit for every merge request being viewed, but rarely renders anything, then we should pick the target that would still accommodate the endpoint performing work.
Consider the dependent resources consumed by the endpoint. If the endpoint loads a lot of data from Gitaly or the database, and this causes unsatisfactory performance, consider optimizing the way the data is loaded rather than increasing the target duration by lowering the urgency.
In these cases, it might be appropriate to temporarily decrease urgency to make the endpoint meet SLO, if this is bearable for the infrastructure. In such cases, create a code comment linking to an issue.
If the endpoint consumes a lot of CPU time, also consider that these requests are exactly the noisy neighbors we should try to keep as short as possible.
Traffic characteristics should also be taken into account. If the traffic to the endpoint sometimes bursts, like CI traffic spinning up a big batch of jobs hitting the same endpoint, then having these endpoints take five seconds is unacceptable from an infrastructure point of view. We cannot scale up the fleet fast enough to accommodate the incoming slow requests alongside the regular traffic.
When lowering the urgency for an existing endpoint, please involve a Scalability team member in the review. We can use request rates and durations available in the logs to come up with a recommendation. You can pick a threshold using the same process as for increasing urgency, picking a duration that is higher than the SLO for the service.
We shouldn't set the longest durations on endpoints in the merge request that introduces them, because we don't yet have data to support the decision.
Increasing urgency (setting a lower target duration)
When increasing the urgency, we must make sure the endpoint still meets SLO for the fleet that handles the request. You can use the information in the logs to check:
Open this table in Kibana
The table loads information for the busiest endpoints by default. To speed the response, add both:
A filter for
The identifier you're interested in, for example:
Check the appropriate percentile duration for the service handling the endpoint. The overall duration should be lower than your intended target.
If the overall duration is below the intended target, check the peaks over time in this graph in Kibana. Here, the percentile in question should not peak above the target duration we want to set.
Because decreasing a threshold too much could result in alerts for Apdex degradation, please also involve a Scalability team member in the merge request.
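The percentile check in the steps above can be approximated offline once you have a list of request durations. The sketch below is only an illustration: `percentile` and `safe_target?` are hypothetical helpers, the nearest-rank method is one of several ways to compute a percentile and may differ from what Kibana reports, and the 99.8 default is an assumption borrowed from the web-service example.

```ruby
# Nearest-rank percentile of a list of durations in seconds.
# Assumption: this may not match the estimation method used by Kibana.
def percentile(durations, pct)
  raise ArgumentError, "no durations given" if durations.empty?

  sorted = durations.sort
  # Rank is 1-based; clamp to at least 1 so pct = 0 returns the minimum.
  rank = (pct / 100.0 * sorted.size).ceil
  sorted[[rank, 1].max - 1]
end

# Hypothetical check: is the chosen percentile below the intended target?
# pct should match the SLO of the service handling the endpoint.
def safe_target?(durations, target:, pct: 99.8)
  percentile(durations, pct) < target
end
```

Run this against both the overall sample and the samples from peak periods, because a target that holds on average can still be exceeded during peaks.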
How to adjust the urgency
You can specify urgency similar to how endpoints get a feature category. Endpoints without a specific target use the default urgency: 1s duration. These configurations are available:
| Urgency | Duration in seconds | Notes |
|  | 1s | The default when nothing is specified. |
An urgency can be specified for all actions in a controller:
class Boards::ListsController < ApplicationController
  urgency :high
end
To specify the urgency for certain actions in a controller:
class Boards::ListsController < ApplicationController
  urgency :high, [:index, :show]
end
To specify the urgency for an entire API class:
module API
  class Issues < ::API::Base
    urgency :low
  end
end
To specify the urgency for certain actions in an API class:
module API
  class Issues < ::API::Base
    urgency :medium, [
      '/groups/:id/issues',
      '/groups/:id/issues_statistics'
    ]
  end
end
Or, we can specify the urgency per endpoint:
get 'client/features', urgency: :low do
  # endpoint logic
end
Error budget attribution and ownership
This SLI is used for service level monitoring. It feeds into the error budget for stage groups. For this particular SLI, we have opted everyone out by default to give time to set the correct urgencies on endpoints before it affects a group's error budget.
To include this SLI in the error budget, remove the
ignored_components array in the entry for your group. Read
more about what is configurable in the
For more information, read the epic for defining custom SLIs and incorporating them into error budgets. The endpoints for the SLI feed into a group's error budget based on the feature category declared on them.
To know which endpoints are included for your group, you can see the request rates on the group dashboard for your group. In the Budget Attribution row, the Puma Apdex log link shows you how many requests are not meeting a 1s or 5s target.
Learn more about the content of the dashboard in the documentation for Dashboards for stage groups. For more information on our exploration of the error budget itself, read the infrastructure issue Stage group error budget exploration dashboard.