The method selects the subset of active clients that will receive the server broadcast and perform local
training. The clients are selected using self.selection_scheme.
If self.selection_scheme is None, all active clients are selected.
Clean up the algorithm state by clearing auxiliary variables from agents.
This method is used to free up memory used by auxiliary variables that are not needed after training.
Can be overridden to control what gets cleaned up.
Note
Override cleanup_agents() to control which
agents are cleaned up.
Parameters:
network – provides the agents and topology for this algorithm.
This method first calls initialize(), then step() for the specified number of iterations.
Optionally call cleanup() after run() to clear auxiliary variables
and free up memory.
Parameters:
network – provides the agents and topology for this algorithm.
start_iteration – iteration number to start from, used when resuming from a checkpoint. If greater than 0,
initialize() will be skipped.
progress_callback – optional callback to report progress after each iteration.
Raises:
ValueError – if start_iteration is not in [0, iterations]
Warning
Do not override this method. Instead, override initialize() and step() as needed.
FedAdagrad uses local SGD on clients and an Adagrad-style adaptive server update [1].
Each selected client starts from the broadcast global model \(\mathbf{x}_t\) and performs
num_local_steps local SGD steps with client step size step_size.
Here \(\eta_l\) is the client learning rate (the corresponding argument is step_size), \(K\) is the
number of local SGD steps (the corresponding argument is num_local_steps), \(\eta\) is the server
learning rate (the corresponding argument is server_step_size), \(\beta_1\) is the first-moment
coefficient (the corresponding argument is beta_1), \(\epsilon\) is the numerical stability term (the
corresponding argument is epsilon), and \(S_t\) is the set of clients whose uploads are actually received
in round \(t\). Aggregation is always uniform across the received clients.
Costs that preserve the EmpiricalRiskCost abstraction use mini-batch local updates;
generic costs use their usual full-gradient updates.
FedAdam uses local SGD on clients and an Adam-style adaptive server update [1].
Each selected client starts from the broadcast global model \(\mathbf{x}_t\) and performs
num_local_steps local SGD steps with client step size step_size.
Here \(\eta_l\) is the client learning rate (the corresponding argument is step_size), \(K\) is the
number of local SGD steps (the corresponding argument is num_local_steps), \(\eta\) is the server
learning rate (the corresponding argument is server_step_size), \(\beta_1\) and \(\beta_2\) are the
first- and second-moment coefficients (the corresponding arguments are beta_1 and beta_2),
\(\epsilon\) is the numerical stability term (the corresponding argument is epsilon), and \(S_t\) is
the set of clients whose uploads are actually received in round \(t\). Aggregation is always uniform across
the received clients. Costs that preserve the
EmpiricalRiskCost abstraction use mini-batch local updates; generic costs use their
usual full-gradient updates.
where \(t\) indexes the local training epochs on each client and \(E\) is the number of local epochs per
round (the corresponding argument is num_local_steps), \(\eta\) is the step size (the corresponding
argument is step_size), and \(S_k\) is the set of participating clients at round \(k\). In FedAvg,
each selected client performs num_local_steps local SGD epochs, then the server
aggregates the final local models to form \(\mathbf{x}_{k+1}\) using uniform averaging over the participating
clients. Client selection (subsampling) defaults to uniform sampling with fraction 1.0 (all active clients) and can
be customized via selection_scheme. Costs that preserve the
EmpiricalRiskCost abstraction use client-side mini-batches of size
EmpiricalRiskCost.batch_size; generic cost wrappers fall
back to full-gradient local updates.
Federated Dynamic Regularization (FedDyn) with local gradient steps [3].
FedDyn keeps one dynamic state \(g_i\) per client and one server auxiliary vector \(h\). In each
communication round, the paper writes the selected-client update as the exact minimizer of the
dynamic-regularized local objective
In practice, and following the local SGD device update used in the paper’s experiments, this implementation
approximates that minimization by running num_local_steps gradient steps
from the received server model, with local gradient
After local training, each participating client updates its dynamic state as
\[g_i^+ = g_i - \alpha(\theta_i^+ - \theta^t).\]
The server aggregates only the selected client models it actually receives. If \(R_t\) is the received subset,
\(m\) is the total number of clients, and \(\theta^t\) is the server model before aggregation, then
Here \(\alpha\) is the dynamic regularization coefficient (the corresponding argument is penalty), and the
local step size is the scalar used in local SGD (the corresponding argument is step_size).
If no selected client model is received, the server model and h remain unchanged. Unselected clients and
selected clients that miss the server broadcast keep their previous local model and dynamic state. Costs that
preserve the EmpiricalRiskCost abstraction use mini-batch local updates; generic costs
keep their usual full-gradient behavior. Client selection defaults to uniform sampling with fraction 1.0.
Federated Local Training (Fed-LT) with cost-driven local gradients [4][5].
Fed-LT maintains one auxiliary variable \(z_i\) per client. This \(z_i\) state is maintained during
execution and is not a constructor argument. At the start of round \(k\), the server
computes the broadcast variable
\[y_{k+1} = \operatorname{prox}_{\rho h / N}\left(\frac{1}{N}\sum_{i=1}^N z_{i,k}\right)\]
where \(N\) is the number of clients, and \(\rho\) is the regularization strength (the corresponding
argument is penalty). The server sends \(y_{k+1}\) to the selected clients.
The initial auxiliary states z0 follow the same InitialStates convention as
x0. If z0 is None, Fed-LT initializes \(z_{i,0}=x_{i,0}\) for every client.
In this implementation, each client cost \(f_i\) is treated as the full local objective already available on
that client. The global regularizer \(h\) is represented
by the server cost’s proximal operator; with the default ZeroCost server, this is the
documented \(h=0\) case and the server step is plain averaging.
Local solvers.
A selected client \(i\) sets \(w^0_{i,k}=x_{i,k}\) and
\(v_{i,k}=2y_{k+1}-z_{i,k}\), then uses local_solver to approximately minimize the regularized local
objective
Costs preserving the EmpiricalRiskCost abstraction use its default mini-batch
sampling, so gradient-based local solvers use mini-batches. Generic Cost objects use
their normal full-gradient behavior. Solver-specific hyperparameters are passed through solver_args.
Gradient descent.
The default local_solver="gd" uses step_size as the local step size and expects empty solver_args:
Here \(\gamma\) is the local step size (the corresponding argument is step_size).
Nesterov.
The local_solver="nesterov" option applies a Nesterov-style update to the same local gradient. It initializes
\(u_i^0=w^0_{i,k}\) and uses step_size as the local step size. Its solver_args may contain
"momentum"; the default is 0.9:
Here \(\beta\) is the Nesterov momentum coefficient (the corresponding argument is
solver_args["momentum"]).
One possible centralized strongly-convex choice is
\(\beta=(\sqrt{L_i + 1/\rho} - \sqrt{\mu_i + 1/\rho}) / (\sqrt{L_i + 1/\rho}
+ \sqrt{\mu_i + 1/\rho})\), with local step size \(1/(L_i + 1/\rho)\), where
m_smooth supplies \(L_i\) and m_cvx supplies \(\mu_i\).
Adam.
The local_solver="adam" option applies Adam to the same local gradient. Adam moments are reset at the start
of every local solve because Fed-LT locally trains the current round’s subproblem rather than maintaining a
persistent optimizer state across rounds. Its solver_args may contain "beta1", "beta2", and
"epsilon"; the defaults are 0.9, 0.999, and 1e-8:
for \(\ell=1,\ldots,N_e\), with \(m^0_{i,k}=s^0_{i,k}=0\). Here \(N_e\) is the local step count
(the corresponding argument is num_local_steps), \(\beta_1\) and \(\beta_2\) are the Adam moment
coefficients (the corresponding arguments are solver_args["beta1"] and solver_args["beta2"]), and
\(\epsilon\) is the Adam numerical stability term (the corresponding argument is
solver_args["epsilon"]). The terms
\(\hat m^\ell_{i,k}\) and \(\hat s^\ell_{i,k}\) are the Adam bias-corrected moments.
and uploads \(z_{i,k+1}\) to the server. Inactive clients keep
\(x_{i,k+1}=x_{i,k}\) and \(z_{i,k+1}=z_{i,k}\). For later server averages, the server stores a received
fresh \(z_{i,k+1}\) when the upload arrives and otherwise keeps its previous stored \(z_i\).
Fed-PLT is the privacy-noise version of Fed-LT [4].
Store received Fed-LT z uploads for future server averages.
Clients whose uploads are not received keep their previous server-side z value, matching the stale-value
aggregation in partial participation and lossy communication settings.
FedNova with optional local momentum, proximal correction, and server momentum [6].
Each selected client starts from the broadcast global model \(\mathbf{x}_t = \mathbf{x}^{(t,0)}\) and performs
\(\tau_i\) local steps with client step size step_size. At local step \(k\), client \(i\)
computes the gradient
when use_prox=True and \(a_i^{(k+1)} = a_i^{(k)} + s_i^{(k+1)}\) otherwise.
During initialize, the server resolves and stores each client’s sample count \(n_i\).
After \(\tau_i\) local steps, client \(i\) first uploads the FedNova coefficient \(a_i\) and then
uploads the cumulative local update \(\mathbf{c}_i^t\) in a second transmission. For the clients in the
current round whose two uploads are both actually received, the server forms the data-proportional client weight
\[p_i = \frac{n_i}{\sum_{j \in S_t} n_j}.\]
The server forms the weighted effective local-step coefficient
When use_momentum=False, use_prox=False, and use_server_momentum=False, this reduces to the plain
local-SGD FedNova variant. In that plain setting, FedNova reduces to FedAvg if and only if all participating
clients use the same number of local steps (\(\tau_i = \tau_j\) for all \(i, j \in S_t\)) and
FedNova and FedAvg both use data-proportional aggregation weights.
Here \(\tau_i\) is the number of local SGD steps used by client \(i\) (the corresponding argument is
num_local_steps), \(\eta_l\) is the local step size (the corresponding argument is step_size),
\(\mu\) is the proximal coefficient (the corresponding argument is penalty), \(\beta\) is the local
momentum coefficient (the corresponding argument is momentum), and \(\gamma\) is the server momentum
coefficient (the corresponding argument is server_momentum).
In this implementation, \(n_i\) is inferred once during initialize from each client’s local cost via
infer_client_data_size(), then stored on the server for later rounds.
If no first-phase a_i uploads are received in a round under network impairments, the server skips that round
without error.
Aggregate FedNova client uploads following the Local-SGD FedNova pseudocode.
This method assumes the current round has already cached the received FedNova coefficients a_i in
server.aux_vars["received_a_i"] and that cumulative local updates \(\mathbf{c}_i^t\) are read from
the server inbox under the cumulative-gradient channel. Client sample counts are looked up from the mapping
stored on the server during initialize. Only clients with both uploads available in the current round are
aggregated; if none are available, this method returns without updating the server model.
Raises:
ValueError – if any received FedNova coefficient a_i is non-positive.
Shared FedOpt template with client local SGD and server adaptive optimization [1].
Each selected client starts from the broadcast global model \(\mathbf{x}_t\) and performs
num_local_steps local SGD steps with client step size step_size:
Here \(\eta_l\) is the client learning rate (the corresponding argument is step_size), \(K\) is the
number of local SGD steps (the corresponding argument is num_local_steps), \(\eta\) is the server
learning rate (the corresponding argument is server_step_size), \(\beta_1\) is the first-moment
coefficient (the corresponding argument is beta_1), \(\epsilon\) is the numerical stability term (the
corresponding argument is epsilon), and \(S_t\) is the set of clients whose uploads are actually received
in round \(t\). The second-moment update \(\Phi\) is
variant-specific and is defined by subclasses. Aggregation is always uniform across the received clients.
Costs that preserve the EmpiricalRiskCost abstraction use mini-batch local updates;
generic costs use their usual full-gradient updates.
Federated Primal-Dual (FedPD) with local gradient steps [7].
FedPD rewrites federated learning as a global consensus problem with client primal variables
\(\mathbf{x}_i\), local centre variables \(\mathbf{x}_{0,i}\), and dual variables
\(\lambda_i\). In each round, all active clients run num_local_steps gradient steps on the local
augmented Lagrangian
Here \(\eta\) is the quadratic-penalty and dual-update scale (the corresponding argument is penalty).
With probability 1-skip_probability, clients upload their local centre candidates and the server uniformly
averages the candidates it actually receives. If at least one candidate is received, the server centre is then
sent back to all active clients; clients that do not receive the server’s synchronized centre keep their local
centre candidate. With probability skip_probability, aggregation is skipped and every participating client
keeps its local centre candidate.
Costs that preserve the EmpiricalRiskCost abstraction use mini-batch local updates;
generic costs keep their usual full-gradient behavior. num_local_steps can be homogeneous (single integer) or
heterogeneous via a mapping keyed by client agent. Partial client participation is not supported.
Aggregate received FedPD centre candidates and broadcast the synchronized centre.
Unlike most federated algorithms, a FedPD communication round couples aggregation with centre
synchronization: after the server averages the received centre candidates, it immediately broadcasts the
updated centre back to all participating clients.
If no centre candidates are received, the server state is left unchanged and no synchronization is sent.
where \(\mathbf{w}^t\) is the server model broadcast at the start of round \(k\), held fixed
throughout each selected client’s local epochs, \(\mu \geq 0\) is the proximal coefficient (the
corresponding argument is penalty), \(\eta\) is the step size (the corresponding argument is
step_size), and \(S_k\) is the set of participating clients. Setting penalty=0.0
recovers FedAvg exactly. Aggregation uses uniform averaging
over the participating clients. Client selection defaults to uniform sampling with fraction 1.0. For
EmpiricalRiskCost, local updates use mini-batches of size
EmpiricalRiskCost.batch_size; for generic costs,
local updates use full-batch gradients.
FedYogi uses local SGD on clients and a Yogi-style adaptive server update [1].
Each selected client starts from the broadcast global model \(\mathbf{x}_t\) and performs
num_local_steps local SGD steps with client step size step_size.
Here \(\eta_l\) is the client learning rate (the corresponding argument is step_size), \(K\) is the
number of local SGD steps (the corresponding argument is num_local_steps), \(\eta\) is the server
learning rate (the corresponding argument is server_step_size), \(\beta_1\) and \(\beta_2\) are the
first- and second-moment coefficients (the corresponding arguments are beta_1 and beta_2),
\(\epsilon\) is the numerical stability term (the corresponding argument is epsilon), and \(S_t\) is
the set of clients whose uploads are actually received in round \(t\). Aggregation is always uniform across
the received clients. Costs that preserve the
EmpiricalRiskCost abstraction use mini-batch local updates; generic costs use their
usual full-gradient updates.
SCAFFOLD with client/server control variates for variance-reduced local training [9].
When the server control variate \(\mathbf{c}\) and all client control variates \(\mathbf{c}_i\)
are zero, the local updates reduce to FedAvg.
Here, \(\eta_l\) is the local client step size (the corresponding argument is step_size),
\(\eta_g\) is the global server step size (the corresponding argument is server_step_size),
\(S\) is the set of selected clients, and \(|S|\) its size.
Selected clients perform local steps with the correction
and update their control variates using the practical SCAFFOLD rule, where \(K\) is the number of
local steps (the corresponding argument is num_local_steps) and \(\mathbf{y}_i\) is the final local model
after those \(K\) steps:
where both aggregated deltas are averaged uniformly over the selected clients.
Note
x0 and c0 follow the InitialStates convention and are resolved
per agent during initialize via
initial_states(). For federated networks, c0 dictionaries can
include both client control variates and the server control variate. If the server entry is missing, it is
inferred as the average of the client control variates.