Distributed training

AdaNet uses the same distributed training model as tf.estimator.Estimator.

For training TensorFlow estimators on Google Cloud ML Engine, please see this guide.

Placement Strategies

Given a cluster of workers and parameter servers, AdaNet manages distributed training automatically. When creating an AdaNet Estimator, you can specify an adanet.distributed.PlacementStrategy to decide which subnetworks each worker will be responsible for training.
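For example, a minimal sketch of constructing an adanet.Estimator with an explicit placement strategy might look like the following. Note that `MySubnetworkGenerator` is a hypothetical stand-in for your own adanet.subnetwork.Generator, and the `experimental_placement_strategy` keyword argument is an assumption whose exact name may differ between AdaNet releases.

```python
import adanet
import tensorflow as tf

# A minimal sketch, not a complete working example.
strategy = adanet.distributed.ReplicationStrategy()  # the default strategy

estimator = adanet.Estimator(
    head=tf.estimator.RegressionHead(),
    subnetwork_generator=MySubnetworkGenerator(),  # hypothetical generator
    max_iteration_steps=5000,
    # Assumed argument name for selecting a placement strategy.
    experimental_placement_strategy=strategy,
    config=tf.estimator.RunConfig(model_dir="/tmp/adanet_model"),
)
```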

Replication Strategy

The default distributed training strategy is the same as the default tf.estimator.Estimator model: each worker creates the full training graph, including all subnetworks and ensembles, and optimizes all of the trainable parameters. Each variable is randomly allocated to a parameter server to minimize bottlenecks when workers fetch it. Workers send their updates to the parameter servers, which apply them to the variables they manage.
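Under this strategy, nothing AdaNet-specific is needed to run distributed training: each task is configured exactly as for any other tf.estimator.Estimator, typically through the TF_CONFIG environment variable. A minimal sketch follows, with hypothetical host addresses and input functions (train_input_fn, eval_input_fn); estimator is the adanet.Estimator from the sketch above.

```python
import json
import os

import tensorflow as tf

# Hypothetical cluster: one chief, two workers, and two parameter servers.
# In practice, each task sets its own "task" entry before starting.
os.environ["TF_CONFIG"] = json.dumps({
    "cluster": {
        "chief": ["host0:2222"],
        "worker": ["host1:2222", "host2:2222"],
        "ps": ["host3:2222", "host4:2222"],
    },
    "task": {"type": "worker", "index": 0},
})

# With the replication strategy, every worker runs the same train_and_evaluate
# call, builds the full AdaNet graph, and reads variables from the parameter
# servers.
tf.estimator.train_and_evaluate(
    estimator,
    train_spec=tf.estimator.TrainSpec(input_fn=train_input_fn, max_steps=100000),
    eval_spec=tf.estimator.EvalSpec(input_fn=eval_input_fn),
)
```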

[Figure: Replication strategy]

To learn more, see the implementation at adanet.distributed.ReplicationStrategy.

Round Robin Strategy (experimental)

A strategy that scales better than the Replication Strategy is the experimental Round Robin Strategy. Instead of replicating the same graph on every worker, AdaNet assigns each worker a single subnetwork to train, in round-robin order.

[Figure: Round robin strategy]

To learn more, see the implementation at adanet.distributed.RoundRobinStrategy.
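Under the same assumptions as the sketches above (the hypothetical `MySubnetworkGenerator` and the assumed `experimental_placement_strategy` argument), opting into the round robin strategy would then be a one-argument change when constructing the estimator:

```python
import adanet
import tensorflow as tf

# Sketch only: each worker is assigned a single subnetwork to train instead of
# replicating the entire graph.
estimator = adanet.Estimator(
    head=tf.estimator.RegressionHead(),
    subnetwork_generator=MySubnetworkGenerator(),  # hypothetical generator
    max_iteration_steps=5000,
    experimental_placement_strategy=adanet.distributed.RoundRobinStrategy(),
    config=tf.estimator.RunConfig(model_dir="/tmp/adanet_model"),
)
```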