
Improving Istio Propagation Delay | by Ying Zhu | The Airbnb Tech Blog | Mar, 2023


A case study in service mesh performance optimization

In this article, we’ll showcase how we identified and addressed a service mesh performance problem at Airbnb, providing insights into the process of troubleshooting service mesh issues.


At Airbnb, we use a microservices architecture, which requires efficient communication between services. Initially, we developed a homegrown service discovery system called Smartstack exactly for this purpose. As the company grew, however, we encountered scalability issues¹. To address this, in 2019, we invested in a modern service mesh solution called AirMesh, built on the open-source Istio software. Currently, over 90% of our production traffic has been migrated to AirMesh, with plans to complete the migration by 2023.

The Symptom: Increased Propagation Delay

After we upgraded Istio from 1.11 to 1.12, we noticed a puzzling increase in the propagation delay, that is, the time between when the Istio control plane gets notified of a change event and when the change is processed and pushed to a workload. This delay is important for our service owners because they depend on it to make critical routing decisions. For example, servers need to have a graceful shutdown period longer than the propagation delay; otherwise clients can send requests to already-shut-down server workloads and get 503 errors.
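The relationship between the graceful shutdown period and the propagation delay can be sketched with a deliberately simple model. It assumes the server starts shutting down at the same instant the control plane is notified, and the function name and the numbers are hypothetical, chosen only for illustration.

```python
def at_risk_window(graceful_shutdown_s: float, propagation_delay_s: float) -> float:
    """Seconds during which clients may still route to a stopped server.

    Assumption: shutdown begins at the moment the control plane is notified,
    so clients learn about the change propagation_delay_s later.
    """
    return max(0.0, propagation_delay_s - graceful_shutdown_s)


# Hypothetical values: a 3 s grace period against a 4.5 s propagation delay
# leaves a window in which requests can hit a stopped server.
print(at_risk_window(3.0, 4.5))  # 1.5
# A grace period longer than the delay closes the window.
print(at_risk_window(5.0, 4.5))  # 0.0
```

Any non-zero result is a window in which clients can see 503 errors, which is why an unnoticed increase in propagation delay matters to service owners.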

Data Gathering: Propagation Delay Metrics

Here’s how we discovered the problem: we had been monitoring the Istio metric pilot_proxy_convergence_time for propagation delay when we noticed an increase from 1.5 seconds (p90 in Istio 1.11) to 4.5 seconds (p90 in Istio 1.12). pilot_proxy_convergence_time is one of several metrics Istio records for propagation delay. The complete list of metrics is:

  • pilot_proxy_convergence_time: measures the time from when a push request is added to the push queue to when it is processed and pushed to a workload proxy. (Note that change events are converted into push requests and are batched through a process called debounce before being added to the queue, which we’ll cover in detail later.)
  • pilot_proxy_queue_time: measures the time between a push request's enqueue and dequeue.
  • pilot_xds_push_time: measures the time for building and sending the xDS resources. Istio leverages Envoy as its data plane. Istiod, the control plane of Istio, configures Envoy through the xDS API (where x can be viewed as a variable, and DS stands for discovery service).
  • pilot_xds_send_time: measures the time for actually sending the xDS resources.
A high-level graph to help understand the metrics related to propagation delay.

xDS Lock Contention

CPU profiling showed no noticeable changes between 1.11 and 1.12, but handling push requests took longer, indicating that the time was being spent waiting rather than computing. This led us to suspect lock contention.

  • Endpoint Discovery Service (EDS): describes how to discover members of an upstream cluster.
  • Cluster Discovery Service (CDS): describes how to discover upstream clusters used during routing.
  • Route Discovery Service (RDS): describes how to discover the route configuration for an HTTP connection manager filter at runtime.
  • Listener Discovery Service (LDS): describes how to discover the listeners at runtime.

  • Control plane:
    – 1 Istiod pod (memory 26 G, CPU 10 cores)
  • Data plane:
    – 50 services and 500 pods
    – We mimicked changes by restarting deployments randomly every 10 seconds and changing virtual service routings randomly every 5 seconds
A table of results² for the performance testing.


Here’s a twist in our diagnosis: during the deep dive into the Istio code base, we realized that pilot_proxy_convergence_time doesn’t actually fully capture propagation delay. We observed in production that 503 errors happen during server deployment even when we set the graceful shutdown time longer than pilot_proxy_convergence_time. This metric doesn’t accurately reflect what we want it to reflect, and we need to redefine it. Let’s revisit our network diagram, zoomed out to include the debounce process, to capture the full life of a change event.

A high-level diagram of the life of a change event.
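To make the debounce step concrete, here is a minimal offline sketch of debounce-style batching. It is a simplified model and not Istiod's implementation: it assumes a batch is flushed once no new event has arrived for a quiet period, or once a maximum delay has passed since the first event in the batch, whichever comes first. Istio exposes settings of this kind (for example PILOT_DEBOUNCE_AFTER and PILOT_DEBOUNCE_MAX), but the function name and the values below are hypothetical.

```python
def debounce_batches(event_times_ms, quiet_after_ms, max_delay_ms):
    """Group event timestamps into batches; return (flush_time_ms, batch_size) pairs.

    A batch flushes at the earlier of:
      - last event time + quiet_after_ms (no new events for a while), or
      - first event time + max_delay_ms (cap on how long events can wait).
    """
    batches = []
    batch_start = None  # timestamp of the first event in the open batch
    last_event = None   # timestamp of the most recent event in the open batch
    count = 0
    for t in sorted(event_times_ms):
        if batch_start is not None:
            flush_at = min(last_event + quiet_after_ms, batch_start + max_delay_ms)
            if t >= flush_at:
                # The open batch was flushed before this event arrived.
                batches.append((flush_at, count))
                batch_start, count = None, 0
        if batch_start is None:
            batch_start = t
        last_event = t
        count += 1
    if batch_start is not None:
        flush_at = min(last_event + quiet_after_ms, batch_start + max_delay_ms)
        batches.append((flush_at, count))
    return batches


# Three events close together are merged into one push request at 220 ms;
# the event at 0 ms therefore waits 220 ms before it even reaches the push queue.
print(debounce_batches([0, 50, 120, 1000], quiet_after_ms=100, max_delay_ms=10_000))
# [(220, 3), (1100, 1)]
```

In this model the wait between an event's arrival and its batch's flush time happens before the push request is enqueued, so a metric that starts at enqueue time would not include it.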
A CPU profile of Istiod.
A CPU profile of Istiod after the DeepCopy improvement.

To conclude our diagnosis, we found that:

  • We should use both pilot_debounce_time and pilot_proxy_convergence_time to track propagation delay.
  • The xDS cache can help with CPU usage but can hurt propagation delay because of lock contention; tune PILOT_ENABLE_CDS_CACHE and PILOT_ENABLE_RDS_CACHE to see what works best for your system.
  • Restrict the visibility of your Istio manifests by setting the exportTo field.
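As an illustration of the last point, the sketch below builds a VirtualService manifest with exportTo set to ".", which in Istio limits visibility to the manifest's own namespace ("*" would export it to all namespaces). The manifest is built as a Python dictionary and printed as JSON, which kubectl accepts alongside YAML. The service name, namespace, and host are hypothetical, and this is only one way such a manifest might look.

```python
import json


def virtual_service(name, namespace, host, export_to=(".",)):
    """Return a minimal VirtualService manifest with restricted visibility.

    All names passed in are illustrative; "." means "same namespace only".
    """
    return {
        "apiVersion": "networking.istio.io/v1beta1",
        "kind": "VirtualService",
        "metadata": {"name": name, "namespace": namespace},
        "spec": {
            "hosts": [host],
            "exportTo": list(export_to),
            "http": [{"route": [{"destination": {"host": host}}]}],
        },
    }


manifest = virtual_service(
    "reviews", "team-a", "reviews.team-a.svc.cluster.local"
)
print(json.dumps(manifest, indent=2))
```

Presumably, the narrower the visibility, the fewer workloads need to receive each configuration change, though how much this helps will depend on the mesh.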

Thanks to the Istio community for creating a great open source project and for collaborating with us to make it even better. Also, a call-out to the whole AirMesh team for building, maintaining, and improving the service mesh layer at Airbnb. Thanks to Lauren Mackevich, Mark Giangreco, and Surashree Kulkarni for editing the post.
