Distributed training
AdaNet uses the same distributed training model as tf.estimator.Estimator.
For training TensorFlow estimators on Google Cloud ML Engine, please see this guide.
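As with any tf.estimator.Estimator, each task in the cluster learns the cluster topology and its own role from the TF_CONFIG environment variable. A minimal sketch, with placeholder hostnames, ports, and task counts:

```python
import json
import os

# Minimal sketch of a cluster spec for Estimator-style distributed training.
# Hostnames, ports, and task counts are placeholders; each task in the
# cluster sets its own "task" entry before constructing the estimator.
os.environ["TF_CONFIG"] = json.dumps({
    "cluster": {
        "chief": ["chief-0:2222"],
        "worker": ["worker-0:2222", "worker-1:2222"],
        "ps": ["ps-0:2222", "ps-1:2222"],
    },
    "task": {"type": "worker", "index": 0},
})
```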
Placement Strategies
Given a cluster of workers and parameter servers, AdaNet will manage distributed training automatically. When creating an AdaNet Estimator, you can specify an adanet.distributed.PlacementStrategy to decide which subnetworks each worker will be responsible for training.
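For illustration, here is a rough sketch of wiring a strategy into the estimator. SimpleDNNGenerator is a hypothetical adanet.subnetwork.Generator standing in for your own search space, and the experimental_placement_strategy keyword is an assumption; check the adanet.Estimator API reference for the exact argument name in your AdaNet version.

```python
import adanet
import tensorflow as tf

# `SimpleDNNGenerator` is a hypothetical adanet.subnetwork.Generator
# standing in for your own subnetwork search space.
subnetwork_generator = SimpleDNNGenerator(learning_rate=0.001)

estimator = adanet.Estimator(
    head=tf.estimator.BinaryClassHead(),
    subnetwork_generator=subnetwork_generator,
    max_iteration_steps=1000,
    # Controls which subnetworks each worker trains. The keyword name is
    # an assumption; ReplicationStrategy is the default either way.
    experimental_placement_strategy=adanet.distributed.ReplicationStrategy(),
    config=tf.estimator.RunConfig(model_dir="/tmp/adanet_model"),
)
```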
Replication Strategy
The default distributed training strategy is the same as the default tf.estimator.Estimator model: each worker will create the full training graph, including all subnetworks and ensembles, and optimize all the trainable parameters. Each variable will be randomly allocated to a parameter server to minimize bottlenecks when workers fetch it. Each worker's updates will be sent to the parameter servers, which apply them to the variables they manage.
(Figure: Replication strategy)
To learn more, see the implementation at adanet.distributed.ReplicationStrategy.
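This placement follows the standard between-graph replication used by tf.estimator.Estimator, in which a device setter spreads variables across the ps tasks while the training ops stay on each worker. A rough, non-AdaNet-specific sketch of that mechanism:

```python
import tensorflow.compat.v1 as tf

tf.disable_v2_behavior()

# Sketch of Estimator-style between-graph replication: a replica_device_setter
# assigns each new variable to one of the ps tasks, while the training ops
# remain on the local worker.
with tf.device(tf.train.replica_device_setter(ps_tasks=2)):
    weights = tf.get_variable("weights", shape=[100, 10])  # placed on a ps task
    bias = tf.get_variable("bias", shape=[10])  # placed on a ps task
```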
Round Robin Strategy (experimental)
A strategy that scales better than the Replication Strategy is the experimental Round Robin Strategy. Instead of replicating the same graph on each worker, AdaNet will assign workers to subnetworks in round-robin order, so that each worker trains only a single subnetwork.
(Figure: Round robin strategy)
To learn more, see the implementation at adanet.distributed.RoundRobinStrategy.
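Opting into this strategy mirrors the earlier constructor sketch; the keyword is again an assumption, and head and subnetwork_generator are as defined there.

```python
import adanet

# Swap the strategy in the (assumed) placement-strategy argument from the
# earlier sketch; `head` and `subnetwork_generator` are as defined there.
estimator = adanet.Estimator(
    head=head,
    subnetwork_generator=subnetwork_generator,
    max_iteration_steps=1000,
    experimental_placement_strategy=adanet.distributed.RoundRobinStrategy(),
)
```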