
Managing Service Dependency for Cloud
Reliability: The Industrial Practice
Tianyi Yang∗, Baitong Li∗, Jiacheng Shen∗, Yuxin Su†, Yongqiang Yang‡, and Michael R. Lyu∗
∗Department of Computer Science and Engineering, The Chinese University of Hong Kong, Hong Kong SAR.
Email: {tyyang, btli, jcshen, lyu}@cse.cuhk.edu.hk
†School of Software Engineering, Sun Yat-Sen University, Zhuhai, China. Email: suyx35@mail.sysu.edu.cn
‡Computing and Networking Innovation Lab, Huawei Cloud, Shenzhen, China. Email: yangyongqiang@huawei.com
Abstract—Interactions between cloud services result in service
dependencies. Evaluating and managing the cascading impacts
caused by service dependencies is critical to the reliability of
cloud systems. This paper summarizes the dependency types in
cloud systems and demonstrates the design of the Dependency
Management System (DMS), a platform for managing the service
dependencies in the production cloud system. DMS features the
full-lifecycle support for service reliability (i.e., initial service de-
ployment, service upgrade, proactive architectural optimization,
and reactive failure mitigation) and refined characterization of
the intensity of dependencies.
Index Terms—cloud computing, software reliability, AIOps,
service dependency
I. BACKGROUND AND MOTIVATION
Modern cloud systems, including Huawei Cloud, are often
constructed from a complex and large-scale hierarchy of distributed
software modules. The common practice is to develop
and deploy these software modules as cloud microservices
that collectively comprise multiple cloud services [1], e.g.,
resource allocation, virtual network management, and virtual
machine management. Different microservices serve different
functionalities. The microservices communicate through well-
defined APIs and respond to external requests as a whole
through service invocations.
Such an architecture benefits scalability, robustness, and
agility but also complicates system reliability engineering.
In particular, the interactions between services create dependencies,
which result in cascading impacts on the system. Despite
various fault-tolerance mechanisms introduced, it is still possi-
ble for minor anomalies to magnify their impacts and escalate
into system outages. When a cloud service or microservice
enters an anomalous status, the anomaly can cascadingly prop-
agate through the service-calling structure, causing a degraded
user experience or even a service outage [2].
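As an illustration of how an anomaly can propagate through the service-calling structure, the following minimal sketch walks the call graph backwards from a faulty service and collects every service that transitively calls it. The service names, the edge list, and the function `impacted_services` are hypothetical and are not taken from DMS; the sketch also treats every call edge as equally strong, which is a simplification.

```python
from collections import defaultdict, deque


def impacted_services(calls, faulty):
    """Return every service that directly or transitively calls `faulty`."""
    # Reverse the call edges: callee -> set of direct callers.
    callers = defaultdict(set)
    for caller, callee in calls:
        callers[callee].add(caller)

    # Breadth-first search upstream from the faulty service.
    impacted, queue = set(), deque([faulty])
    while queue:
        current = queue.popleft()
        for caller in callers[current]:
            if caller != faulty and caller not in impacted:
                impacted.add(caller)
                queue.append(caller)
    return impacted


# Hypothetical (caller, callee) edges, for illustration only.
calls = [
    ("web-portal", "vm-management"),
    ("vm-management", "resource-allocation"),
    ("vm-management", "virtual-network"),
    ("billing", "resource-allocation"),
]
print(sorted(impacted_services(calls, "resource-allocation")))
# prints ['billing', 'vm-management', 'web-portal']
```

A reachability check of this kind only bounds which services could be affected; it says nothing about how strongly each one depends on the faulty service.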
The cascading impacts hinder system operation and maintenance
and degrade customer satisfaction. For instance, during
the initial deployment or an upgrade of a service, all the
services it relies on must be ready. During the failure
mitigation and recovery, the cascading impact will slow the
recovery. Therefore, evaluating and managing the cascading
impacts caused by service dependencies is crucial.
This work was supported by Key-Area Research and Development Program
of Guangdong Province (No. 2020B010165002), Key Program of Fundamental
Research from Shenzhen Science and Technology Innovation Commission
(No. JCYJ20200109113403826), and the Research Grants Council of the
Hong Kong Special Administrative Region, China (CUHK 14210920).
Fig. 1. The architecture of DMS. Diagram labels: Engineer Specification,
Configuration Files, Manual Update, Configuration Parser, AID, Data Source,
Dependency Analysis, and Application Scenarios (Initial Service Deployment,
Service Upgrade, Architectural Optimization, Failure Mitigation).
II. KEY INNOVATIONS
This paper classifies the dependency types in cloud systems
and demonstrates the design of the Dependency Management
System (DMS), an end-to-end platform for managing the service
dependencies in the production cloud system. DMS provides
full-lifecycle support for service reliability, i.e., initial
service deployment, service upgrade, proactive architectural
optimization, and reactive failure mitigation. DMS integrates
our previous study on the aggregated intensity of service
dependency [2] to characterize the degree of cascading impacts
and provides a refined characterization of dependencies. In
addition, DMS features automatic configuration parsing
and multi-source dependency fusion for practicality.
III. DEPENDENCY TYPES
The dependency relations in a cloud system are diverse.
In Huawei Cloud, we categorize the dependencies according
to the architectural level, i.e., service-level dependencies and
microservice-level dependencies.
1) Service-level dependency: If the dependency is between
two cloud services, we call it a service-level dependency.
Service-level dependency can be further divided into the
following three subtypes, i.e., deployment dependency, run-
time dependency, and operational dependency.
Deployment dependency indicates dependency during the
deployment of a cloud service. The deployment phase may
rely on some cloud services to create and configure resources.
For example, the elastic computing service depends on the
API management service to register public APIs. The elastic
computing service also depends on the block storage service
to allocate the required resource.
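The deployment dependencies in the example above can be sketched as typed edges, from which a valid deployment order follows by topological sorting. This is a minimal sketch under stated assumptions: the enum `ServiceDependencyType`, the function `deployment_order`, and the service identifiers are hypothetical names chosen for illustration, and the paper does not state how DMS stores or orders dependencies.

```python
from enum import Enum
from graphlib import TopologicalSorter  # standard library, Python 3.9+


class ServiceDependencyType(Enum):
    """The three service-level subtypes named in the text."""
    DEPLOYMENT = "deployment"
    RUNTIME = "runtime"
    OPERATIONAL = "operational"


# (dependent service, required service, dependency type)
# The identifiers are hypothetical spellings of the services in the example.
dependencies = [
    ("elastic-computing", "api-management", ServiceDependencyType.DEPLOYMENT),
    ("elastic-computing", "block-storage", ServiceDependencyType.DEPLOYMENT),
]


def deployment_order(deps):
    """Order services so that each one is deployed after the services it needs."""
    sorter = TopologicalSorter()
    for dependent, required, kind in deps:
        if kind is ServiceDependencyType.DEPLOYMENT:
            # `required` must be ready before `dependent` is deployed.
            sorter.add(dependent, required)
    # static_order() raises graphlib.CycleError on circular dependencies.
    return list(sorter.static_order())


order = deployment_order(dependencies)
print(order)  # "elastic-computing" comes last
```

In this sketch the API management and block storage services are deployed first, in either order, and the elastic computing service last. A circular deployment dependency surfaces as a `CycleError` rather than as a silently wrong order.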
arXiv:2210.06249v1 [cs.DC] 28 Aug 2022