field of machine learning (ML), distributed learning frameworks [4], [5] that keep the training data local have been developed to protect data privacy and reduce network energy/time costs.
Recently, federated learning (FL) [1], [4], [5] has been proposed as a promising solution for distributed ML: it enables multiple devices to execute local training on their own datasets and to collaboratively build a shared ML model under the coordination of a parameter server (PS) (e.g., an access point or a base station). Since only model parameters rather than raw data are exchanged between the devices and the PS, FL significantly relieves the communication burden and protects data privacy [4], [6], with wide-ranging applications such as vehicle-to-vehicle communications [7] and content recommendation for smartphones [5].
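As a rough illustration of this training loop, the sketch below implements FedAvg-style aggregation in plain NumPy; the quadratic local objective, the synthetic per-device datasets, and all variable names are hypothetical placeholders rather than the system model considered here.

```python
import numpy as np

rng = np.random.default_rng(0)

def local_update(w, X, y, lr=0.1, epochs=5):
    """Device-side training: a few gradient steps on a local
    least-squares loss ||Xw - y||^2 / n (a hypothetical objective)."""
    w = w.copy()
    for _ in range(epochs):
        w -= lr * 2 * X.T @ (X @ w - y) / len(y)
    return w

# Hypothetical local datasets for 10 devices with unequal sizes.
devices = [(rng.normal(size=(int(n), 3)), rng.normal(size=int(n)))
           for n in rng.integers(20, 100, size=10)]

w_global = np.zeros(3)
for _ in range(50):  # communication rounds coordinated by the PS
    # Each device trains locally; only model parameters are uploaded.
    w_locals = [local_update(w_global, X, y) for X, y in devices]
    sizes = [len(y) for _, y in devices]
    # PS aggregation: dataset-size-weighted average of the local models.
    w_global = np.average(w_locals, axis=0, weights=sizes)
```

Note that only the short model vector crosses the network in each round; the raw (X, y) pairs never leave the devices.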
In contrast to centralized ML, the PS in FL needs to exchange models with multiple devices over hundreds to thousands of communication rounds to achieve the desired training accuracy. The main challenge in realizing FL over wireless networks therefore arises from communication stragglers, i.e., devices with unfavorable links [8], [9]. For example, in over-the-air computation (AirComp)-based analog FL [6], [10], communication stragglers dominate the overall model aggregation error caused by channel fading and communication noise, since the devices with better channel qualities have to reduce their transmit power to align the local models at the PS.
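A minimal NumPy sketch of this alignment effect under channel-inversion power control (the power budget, fading model, and noise level below are illustrative assumptions, not parameters from [6], [10]): the weakest channel fixes the common scaling factor, so every other device backs off its transmit power, and the residual noise after de-scaling, i.e., the aggregation error, is dictated by that straggler.

```python
import numpy as np

rng = np.random.default_rng(1)
K, d, P_max, sigma = 10, 3, 1.0, 0.1          # devices, model size, power budget, noise std

w_local = rng.normal(size=(K, d))             # local updates to be aggregated
h = rng.rayleigh(scale=np.sqrt(0.5), size=K)  # fading channel magnitudes

# Channel inversion: device k transmits (sqrt(eta) / h_k) * w_k, so all
# models arrive aligned. Its power (eta / h_k^2) * ||w_k||^2 <= P_max,
# hence the weakest channel (the straggler) fixes the common factor eta.
eta = np.min(P_max * h**2 / np.sum(w_local**2, axis=1))

rx = np.sqrt(eta) * w_local.sum(axis=0) + sigma * rng.normal(size=d)
w_hat = rx / (np.sqrt(eta) * K)               # PS de-scales to recover the average

err = np.linalg.norm(w_hat - w_local.mean(axis=0))
print(f"eta = {eta:.4f}, aggregation error = {err:.4f}")  # small eta => amplified noise
```

Shrinking the smallest h_k drives eta down and inflates the printed error, which is exactly the straggler-dominated behavior described above.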
Moreover, in digital synchronous FL [8], [11], communication stragglers significantly slow down the model aggregation process and dominate the cumulative communication delay, since the PS must wait until it receives the training updates from all participants. If the number of communication stragglers is large, the overall communication delay becomes unacceptable. The straggler issue is thus the main bottleneck in designing communication-efficient FL systems.
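In other words, the per-round delay of synchronous aggregation is the maximum of the participants' individual delays, so a single slow uploader sets the pace; a toy sketch with made-up, heavy-tailed delays:

```python
import numpy as np

rng = np.random.default_rng(2)
# Hypothetical upload delays (seconds) of 100 devices in one round.
delays = rng.lognormal(mean=0.0, sigma=1.0, size=100)

t_round = delays.max()  # the synchronous PS waits for the slowest device
print(f"round delay {t_round:.1f}s vs. median device {np.median(delays):.1f}s")
```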
There have been many efforts to mitigate the communication straggler effect in FL, such as device scheduling [6], [10], [11]. For instance, to reduce the model misalignment error incurred by stragglers in AirComp-based FL, the authors in [6], [10] scheduled the devices with reliable channels for concurrent model uploading. In addition, to reduce the communication delay incurred by stragglers in digital FL, devices with large contributions to the global model [11] and/or with favorable channel conditions [12], [13] are generally selected. Nevertheless, such device scheduling is biased: it utilizes a smaller amount of training data, which in turn may damage the global model update and degrade the learning performance of FL.
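A minimal sketch of such a channel-aware policy (the top-K rule, the Rayleigh gains, and the dataset sizes are illustrative, not the exact schedulers of [11]–[13]) makes the bias visible: picking only the best channels shortens the round but shrinks the share of training data behind each update.

```python
import numpy as np

rng = np.random.default_rng(3)
N, K = 100, 20                          # total vs. scheduled devices
gains = rng.rayleigh(size=N)            # per-round channel gains
data = rng.integers(50, 500, size=N)    # local dataset sizes (samples)

sched = np.argsort(gains)[-K:]          # schedule the K best channels
covered = data[sched].sum() / data.sum()
print(f"scheduled {K}/{N} devices covering {covered:.0%} of the data")
```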
To alleviate this communication-learning tradeoff, recent research has investigated the integration of advanced technologies (e.g., relays [14] and reconfigurable intelligent surfaces [15]–[17]) into FL systems to improve the stragglers' communication qualities and thus further refine the device scheduling policy for the reduction of communication errors. These existing frameworks require a terrestrial BS to provide network coverage to the devices for model aggregation. However, many