This package provides methods for distributing computation using the TensorFlow computation graph.

## PlacementStrategy

Abstract placement strategy for distributed training.

Given a cluster of workers, the placement strategy determines which subgraph each worker constructs.

config

Returns this strategy’s configuration.

Returns: The tf.estimator.RunConfig instance that defines the cluster.

should_build_ensemble(num_subnetworks)

Whether to build the ensemble on the current worker.

Parameters: num_subnetworks – Integer number of subnetworks to train in the current iteration.

Returns: Boolean whether to build the ensemble on the current worker.

should_build_subnetwork(num_subnetworks, subnetwork_index)

Whether to build the given subnetwork on the current worker.

Parameters: num_subnetworks – Integer number of subnetworks to train in the current iteration. subnetwork_index – Integer index of the subnetwork in the list of the current iteration’s subnetworks.

Returns: Boolean whether to build the given subnetwork on the current worker.

should_train_subnetworks(num_subnetworks)

Whether to train subnetworks on the current worker.

Parameters: num_subnetworks – Integer number of subnetworks to train in the current iteration.

Returns: Boolean whether to train subnetworks on the current worker.

subnetwork_devices(num_subnetworks, subnetwork_index)

A context for assigning subnetwork ops to devices.
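
The interface above can be sketched in plain Python to show how a caller might consult a strategy while building a worker's graph. This is only an illustration: `PlacementStrategySketch`, `BuildEverything`, and `describe_worker` are hypothetical names that do not exist in the package, and the real class additionally exposes the `config` property, which is omitted here.

```python
import abc
import contextlib


class PlacementStrategySketch(abc.ABC):
    """Hypothetical stand-in that mirrors the methods documented above."""

    @abc.abstractmethod
    def should_build_ensemble(self, num_subnetworks):
        """Returns True if the current worker builds the ensemble."""

    @abc.abstractmethod
    def should_build_subnetwork(self, num_subnetworks, subnetwork_index):
        """Returns True if the current worker builds the given subnetwork."""

    @abc.abstractmethod
    def should_train_subnetworks(self, num_subnetworks):
        """Returns True if the current worker trains subnetworks."""

    @abc.abstractmethod
    def subnetwork_devices(self, num_subnetworks, subnetwork_index):
        """Returns a context manager for placing subnetwork ops."""


class BuildEverything(PlacementStrategySketch):
    """Illustrative strategy: this worker builds and trains everything."""

    def should_build_ensemble(self, num_subnetworks):
        return True

    def should_build_subnetwork(self, num_subnetworks, subnetwork_index):
        return True

    def should_train_subnetworks(self, num_subnetworks):
        return True

    @contextlib.contextmanager
    def subnetwork_devices(self, num_subnetworks, subnetwork_index):
        # No device pinning in this sketch; a real strategy would scope
        # op placement for the duration of the `with` block.
        yield


def describe_worker(strategy, num_subnetworks):
    """Collects what the current worker would build under `strategy`."""
    built = []
    for index in range(num_subnetworks):
        if strategy.should_build_subnetwork(num_subnetworks, index):
            with strategy.subnetwork_devices(num_subnetworks, index):
                built.append(index)  # Graph construction would happen here.
    return {
        "subnetworks": built,
        "ensemble": strategy.should_build_ensemble(num_subnetworks),
        "train_subnetworks": strategy.should_train_subnetworks(num_subnetworks),
    }
```

Calling `describe_worker(BuildEverything(), 3)` reports subnetworks 0, 1, and 2 as built, with the ensemble built and subnetwork training enabled. A strategy that partitions work would return False from some of these predicates depending on the worker.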

## ReplicationStrategy

A simple strategy that replicates the same graph on every worker.

This strategy does not scale well as the numbers of subnetworks and workers increase. For $$m$$ workers, $$n$$ parameter servers, and $$k$$ subnetworks, this strategy will scale with $$O(m)$$ training speedup, $$O(m*n*k)$$ variable fetches from parameter servers, and $$O(k)$$ memory required per worker. Additionally, there will be $$O(m)$$ stale gradients per subnetwork when training with asynchronous SGD.

Returns: A ReplicationStrategy instance for the current cluster.

should_build_ensemble(num_subnetworks)

Whether to build the ensemble on the current worker.

Parameters: num_subnetworks – Integer number of subnetworks to train in the current iteration.

Returns: Boolean whether to build the ensemble on the current worker.

should_build_subnetwork(num_subnetworks, subnetwork_index)

Whether to build the given subnetwork on the current worker.

Parameters: num_subnetworks – Integer number of subnetworks to train in the current iteration. subnetwork_index – Integer index of the subnetwork in the list of the current iteration’s subnetworks.

Returns: Boolean whether to build the given subnetwork on the current worker.

should_train_subnetworks(num_subnetworks)

Whether to train subnetworks on the current worker.

Parameters: num_subnetworks – Integer number of subnetworks to train in the current iteration.

Returns: Boolean whether to train subnetworks on the current worker.

subnetwork_devices(num_subnetworks, subnetwork_index)

A context for assigning subnetwork ops to devices.

## RoundRobinStrategy

A strategy that round-robin assigns subgraphs to specific workers.

Specifically, it selects dedicated workers to only train ensemble variables, and round-robin assigns subnetworks to dedicated subnetwork-training workers.

Unlike ReplicationStrategy, this strategy scales better with the number of subnetworks, workers, and parameter servers. For $$m$$ workers, $$n$$ parameter servers, and $$k$$ subnetworks, this strategy will scale with $$O(m/k)$$ training speedup, $$O(m*n/k)$$ variable fetches from parameter servers, and $$O(1)$$ memory required per worker. Additionally, there will only be $$O(m/k)$$ stale gradients per subnetwork when training with asynchronous SGD, which reduces training instability versus ReplicationStrategy.
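
To make the comparison concrete, the growth terms quoted for the two strategies can be evaluated for a sample cluster. The helpers below are hypothetical and are not part of the package. Big-O notation hides constant factors, so the numbers indicate relative growth only, not measured costs.

```python
def replication_terms(m, n, k):
    """Growth terms stated for ReplicationStrategy (constants ignored)."""
    return {
        "speedup": m,
        "variable_fetches": m * n * k,
        "memory_per_worker": k,
        "stale_gradients": m,
    }


def round_robin_terms(m, n, k):
    """Growth terms stated for RoundRobinStrategy (constants ignored)."""
    return {
        "speedup": m / k,
        "variable_fetches": m * n / k,
        "memory_per_worker": 1,
        "stale_gradients": m / k,
    }


# Sample cluster: 12 workers, 3 parameter servers, 4 subnetworks.
replication = replication_terms(m=12, n=3, k=4)
round_robin = round_robin_terms(m=12, n=3, k=4)
```

For this sample cluster, the variable-fetch term drops from 144 to 9 and the stale-gradient term from 12 to 3, while the speedup term drops from 12 to 3. The trade is less raw parallelism per subnetwork in exchange for lower parameter-server traffic, memory, and gradient staleness.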

When there are more workers than subnetworks, this strategy assigns subnetworks to workers modulo the number of subnetworks.

Conversely, when there are more subnetworks than workers, this strategy assigns subnetworks to workers modulo the number of workers, so certain workers may end up training more than one subnetwork.

This strategy gracefully handles scenarios when the number of subnetworks does not perfectly divide the number of workers and vice versa. It also supports different numbers of subnetworks at different iterations, and reloading training with a resized cluster.

Parameters: drop_remainder – Bool whether to drop remaining subnetworks that haven’t been assigned to a worker in the remainder after perfect division of workers by the current iteration’s num_subnetworks + 1. When True, each subnetwork worker will only train a single subnetwork, and subnetworks that have not been assigned to a worker are dropped. NOTE: This can result in subnetworks not being assigned to any worker when num_workers < num_subnetworks + 1. When False, remaining subnetworks during the round-robin assignment will be placed on workers that already have a subnetwork.

Returns: A RoundRobinStrategy instance for the current cluster.
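
The assignment rules and the effect of `drop_remainder` described above can be sketched as a small standalone function. This is a simplified illustration, not the library's implementation: `round_robin_plan` is a hypothetical helper, and it assumes that workers are indexed from 0, that at least two workers exist, and that every `(num_subnetworks + 1)`-th worker starting at worker 0 is a dedicated ensemble worker.

```python
def round_robin_plan(num_workers, num_subnetworks, drop_remainder=False):
    """Returns (plan, dropped) for a simplified round-robin placement.

    `plan` maps each worker index to "ensemble" or to the list of
    subnetwork indices it trains; `dropped` lists unassigned subnetworks.
    """
    if num_workers < 2 or num_subnetworks < 1:
        raise ValueError("Sketch needs >= 2 workers and >= 1 subnetwork.")

    # Assumption: slot 0 is the ensemble slot, slots 1..k are subnetworks.
    num_slots = num_subnetworks + 1
    plan = {}
    for worker in range(num_workers):
        slot = worker % num_slots
        plan[worker] = "ensemble" if slot == 0 else [slot - 1]

    subnetwork_workers = [w for w in plan if plan[w] != "ensemble"]
    assigned = {i for w in subnetwork_workers for i in plan[w]}

    dropped = []
    for index in range(num_subnetworks):
        if index in assigned:
            continue
        if drop_remainder:
            # Each subnetwork worker keeps exactly one subnetwork.
            dropped.append(index)
        else:
            # Wrap around onto workers that already have a subnetwork.
            target = subnetwork_workers[index % len(subnetwork_workers)]
            plan[target].append(index)
    return plan, dropped
```

With 7 workers and 2 subnetworks, workers 0, 3, and 6 become ensemble workers, workers 1 and 4 train subnetwork 0, and workers 2 and 5 train subnetwork 1. With 3 workers and 4 subnetworks, `round_robin_plan(3, 4)` places subnetworks 0 and 2 on worker 1 and subnetworks 1 and 3 on worker 2, whereas `round_robin_plan(3, 4, drop_remainder=True)` leaves one subnetwork per worker and drops subnetworks 2 and 3.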
should_build_ensemble(num_subnetworks)

Whether to build the ensemble on the current worker.

Parameters: num_subnetworks – Integer number of subnetworks to train in the current iteration.

Returns: Boolean whether to build the ensemble on the current worker.

should_build_subnetwork(num_subnetworks, subnetwork_index)

Whether to build the given subnetwork on the current worker.

Parameters: num_subnetworks – Integer number of subnetworks to train in the current iteration. subnetwork_index – Integer index of the subnetwork in the list of the current iteration’s subnetworks.

Returns: Boolean whether to build the given subnetwork on the current worker.

should_train_subnetworks(num_subnetworks)

Whether to train subnetworks on the current worker.

Parameters: num_subnetworks – Integer number of subnetworks to train in the current iteration.

Returns: Boolean whether to train subnetworks on the current worker.

subnetwork_devices(num_subnetworks, subnetwork_index)

A context for assigning subnetwork ops to devices.