decent_bench.datasets

class decent_bench.datasets.DatasetHandler

Bases: ABC

Abstract wrapper for datasets used in decentralized optimization benchmark problems.

This class provides an interface for accessing datasets in a partitioned format for decentralized optimization scenarios. Rather than storing the data directly, DatasetHandler implementations act as wrappers that return data in the required format when queried.

In decentralized optimization, the dataset is typically divided among multiple agents in a network, where each agent has access to only a subset (partition) of the complete dataset. This class abstracts that partitioning scheme.

When defining benchmark problems, a DatasetHandler instance can be used to:

- Provide local datasets to each agent in the network via get_partitions()
- Define the overall optimization problem (e.g., empirical risk minimization)
- Serve as a test set for evaluating decentralized algorithms on the full dataset (e.g., via get_datapoints()) by setting the test_data field of BenchmarkProblem
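To make the empirical risk minimization use case concrete, the sketch below writes a global objective as a size-weighted average of per-agent losses. It is illustrative only: the names squared_loss, local_risk and global_risk are hypothetical, the linear model is an assumption, and plain Python lists stand in for the library's Array and Dataset types.

```python
from typing import Sequence

# Stand-in types for illustration; not the library's own Array/Dataset types.
Datapoint = tuple[list[float], float]
Partition = list[Datapoint]


def squared_loss(w: Sequence[float], datapoint: Datapoint) -> float:
    """Squared error of a linear model w on one (features, target) datapoint."""
    features, target = datapoint
    prediction = sum(wi * xi for wi, xi in zip(w, features))
    return (prediction - target) ** 2


def local_risk(w: Sequence[float], partition: Partition) -> float:
    """Average loss over one agent's local dataset."""
    return sum(squared_loss(w, dp) for dp in partition) / len(partition)


def global_risk(w: Sequence[float], partitions: Sequence[Partition]) -> float:
    """Empirical risk over all datapoints, as a size-weighted average of local risks."""
    n_total = sum(len(p) for p in partitions)
    return sum(len(p) * local_risk(w, p) for p in partitions) / n_total


partitions = [
    [([1.0, 0.0], 1.0), ([0.0, 1.0], 2.0)],  # local dataset of agent 0
    [([1.0, 1.0], 3.0)],                     # local dataset of agent 1
]
w = [1.0, 2.0]
print(global_risk(w, partitions))
```

Weighting each local risk by its partition size keeps the global objective equal to the average loss over the full dataset even when partitions have different sizes.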

Data Structure: The dataset consists of datapoints, each a (features, targets) tuple. Features and targets are Array objects or, in special cases, framework-specific tensor objects. For unsupervised learning, targets are usually None. A partition is a sequence of such datapoints, which makes it easy to distribute local datasets among agents.
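The layout described above can be pictured with a small sketch. Plain Python lists stand in for Array objects here, and the variable names are hypothetical; the actual types returned by the library may differ.

```python
from typing import Optional

# A datapoint is a (features, targets) tuple; targets may be None when unsupervised.
Datapoint = tuple[list[float], Optional[list[float]]]

supervised: list[Datapoint] = [
    ([5.1, 3.5], [0.0]),
    ([6.2, 2.9], [1.0]),
    ([4.7, 3.2], [0.0]),
    ([5.9, 3.0], [1.0]),
]

# Unsupervised data keeps the tuple shape but carries no targets.
unsupervised: list[Datapoint] = [(features, None) for features, _ in supervised]

# A partition is just a sequence of datapoints; here two agents get two datapoints each.
partitions = [supervised[:2], supervised[2:]]

for agent_id, local_dataset in enumerate(partitions):
    for features, targets in local_dataset:
        print(agent_id, features, targets)
```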

Note

Implementations may load data from various sources (files, generators, synthetic data, etc.) and are not required to store all datapoints in memory.
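As a rough illustration of a handler that does not keep datapoints in memory, the sketch below regenerates each partition on demand from a seed. LazySyntheticHandler is a hypothetical class that only mirrors the property and method names documented here; it deliberately does not subclass the real DatasetHandler, so it runs without the library installed, and its data-generation rule is an arbitrary assumption.

```python
import random

Datapoint = tuple[list[float], list[float]]


class LazySyntheticHandler:
    """Hypothetical handler mirroring the documented interface without subclassing it."""

    def __init__(self, n_features: int, n_samples_per_partition: int, n_partitions: int, seed: int = 0) -> None:
        self._n_features = n_features
        self._per_partition = n_samples_per_partition
        self._n_partitions = n_partitions
        self._seed = seed

    @property
    def n_samples(self) -> int:
        return self._per_partition * self._n_partitions

    @property
    def n_partitions(self) -> int:
        return self._n_partitions

    @property
    def n_features(self) -> int:
        return self._n_features

    @property
    def n_targets(self) -> int:
        return 1

    def _make_partition(self, index: int) -> list[Datapoint]:
        # A per-partition seed makes regeneration deterministic, so nothing is cached.
        rng = random.Random(self._seed * 1_000_003 + index)
        partition = []
        for _ in range(self._per_partition):
            features = [rng.gauss(0.0, 1.0) for _ in range(self._n_features)]
            targets = [sum(features)]  # arbitrary rule, for illustration only
            partition.append((features, targets))
        return partition

    def get_partitions(self) -> list[list[Datapoint]]:
        return [self._make_partition(i) for i in range(self._n_partitions)]

    def get_datapoints(self) -> list[Datapoint]:
        return [dp for partition in self.get_partitions() for dp in partition]


handler = LazySyntheticHandler(n_features=3, n_samples_per_partition=4, n_partitions=2)
print(handler.n_samples, len(handler.get_partitions()))
```

Because every call rebuilds the data from the seed, repeated calls return equal partitions while the handler itself stores only a few integers.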

abstract property n_samples: int

Total number of datapoints in the dataset.

abstract property n_partitions: int

Total number of partitions in the dataset.

abstract property n_features: int

Number of feature dimensions.

abstract property n_targets: int

Number of target dimensions.

abstractmethod get_datapoints() → Dataset

Return all datapoints in the dataset.

Can be used for evaluation on the full dataset or creation of test datasets.

abstractmethod get_partitions() → Sequence[Dataset]

Return the dataset divided into partitions for distribution among agents.

This method provides the core partitioning functionality for decentralized optimization. Each partition represents the local dataset of an agent in the network.

Returns: Sequence of Dataset objects, one per partition; each is a list of (features, targets) tuples.
Return type: Sequence[Dataset]
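One way the returned partitions might be handed out to agents is sketched below. The helper assign_partitions and the agent identifiers are hypothetical, and the nested lists stand in for whatever get_partitions() returns in practice.

```python
from typing import Sequence, TypeVar

P = TypeVar("P")


def assign_partitions(agent_ids: Sequence[str], partitions: Sequence[P]) -> dict[str, P]:
    """Pair each agent with exactly one partition, in order."""
    if len(agent_ids) != len(partitions):
        raise ValueError(f"{len(agent_ids)} agents but {len(partitions)} partitions")
    return dict(zip(agent_ids, partitions))


# Stand-in for the result of get_partitions(): three partitions of one datapoint each.
partitions = [[([0.0], [1.0])], [([2.0], [3.0])], [([4.0], [5.0])]]
local_data = assign_partitions(["agent-0", "agent-1", "agent-2"], partitions)
print(local_data["agent-1"])
```

Checking that the number of agents equals the number of partitions up front avoids silently dropping data, which zip would otherwise do on a length mismatch.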

class decent_bench.datasets.KaggleDatasetHandler(kaggle_handle: str, path: str, feature_columns: list[str], target_columns: list[str], n_partitions: int = 1, *, framework: SupportedFrameworks = SupportedFrameworks.NUMPY, device: SupportedDevices = SupportedDevices.CPU, dtype: DTypeLike = np.float64, samples_per_partition: int | None = None)

Bases: DatasetHandler

property n_samples: int

Total number of datapoints in the dataset.

property n_partitions: int

Total number of partitions in the dataset.

property n_features: int

Number of feature dimensions.

property n_targets: int

Number of target dimensions.

get_datapoints() → Dataset

Return all datapoints in the dataset.

Can be used for evaluation on the full dataset or creation of test datasets.

get_partitions() → Sequence[Dataset]

Return the dataset divided into partitions for distribution among agents.

This method provides the core partitioning functionality for decentralized optimization. Each partition represents the local dataset of an agent in the network.

Each partition is sampled uniformly at random from the dataset without replacement.

Returns: Sequence of Dataset objects, one per partition; each is a list of (features, targets) tuples.
Return type: Sequence[Dataset]
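Uniform sampling without replacement can be sketched as follows. This is not the handler's actual implementation: the function sample_partitions is hypothetical, and two details are assumptions made for the example, namely that partitions are mutually disjoint and that an unset samples_per_partition defaults to an even split.

```python
import random
from typing import Optional, Sequence, TypeVar

T = TypeVar("T")


def sample_partitions(
    datapoints: Sequence[T],
    n_partitions: int,
    samples_per_partition: Optional[int] = None,
    seed: int = 0,
) -> list[list[T]]:
    """Draw disjoint partitions uniformly at random, without replacement."""
    rng = random.Random(seed)
    if samples_per_partition is None:
        # Assumed default: split the dataset as evenly as possible.
        samples_per_partition = len(datapoints) // n_partitions
    needed = n_partitions * samples_per_partition
    if needed > len(datapoints):
        raise ValueError("not enough datapoints for the requested partitions")
    # random.sample returns distinct indices, so no datapoint is drawn twice.
    indices = rng.sample(range(len(datapoints)), needed)
    return [
        [datapoints[i] for i in indices[k * samples_per_partition:(k + 1) * samples_per_partition]]
        for k in range(n_partitions)
    ]


data = list(range(10))
parts = sample_partitions(data, n_partitions=3, samples_per_partition=2)
print(parts)
```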

class decent_bench.datasets.PyTorchDatasetHandler(torch_dataset: torch.utils.data.Dataset[Any], n_features: int, n_targets: int, n_partitions: int = 1, *, samples_per_partition: int | None = None, heterogeneity: bool = False, targets_per_partition: int = 1)

Bases: DatasetHandler

property n_samples: int

Total number of datapoints in the dataset.

property n_partitions: int

Total number of partitions in the dataset.

property n_features: int

Number of feature dimensions.

property n_targets: int

Number of target dimensions.

get_datapoints() → Dataset

Return all datapoints in the dataset.

Can be used for evaluation on the full dataset or creation of test datasets.

get_partitions() → list[Dataset]

Return the dataset divided into partitions for distribution among agents.

This method provides the core partitioning functionality for decentralized optimization. Each partition represents the local dataset of an agent in the network.

If heterogeneity is False, each partition is sampled uniformly at random from the dataset without replacement. Otherwise, each partition contains targets_per_partition unique classes, and the number of datapoints per partition equals min(samples_per_partition, number of available datapoints for the selected classes).

Returns: List of Dataset objects, one per partition; each is a list of (features, targets) tuples.
Return type: list[Dataset]
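The heterogeneous case can be sketched roughly as below. This is an illustration under stated assumptions, not the handler's actual code: heterogeneous_partitions is a hypothetical function, targets are assumed to be hashable class labels, "unique classes" is read as class sets that do not overlap between partitions, and samples_per_partition is required here although the constructor allows None.

```python
import random
from collections import defaultdict
from typing import Hashable, Sequence

Datapoint = tuple[list[float], Hashable]


def heterogeneous_partitions(
    datapoints: Sequence[Datapoint],
    n_partitions: int,
    targets_per_partition: int,
    samples_per_partition: int,
    seed: int = 0,
) -> list[list[Datapoint]]:
    """Give each partition its own set of classes, then sample within those classes."""
    rng = random.Random(seed)
    by_class: dict[Hashable, list[Datapoint]] = defaultdict(list)
    for features, label in datapoints:
        by_class[label].append((features, label))

    classes = sorted(by_class)
    if n_partitions * targets_per_partition > len(classes):
        raise ValueError("not enough classes for non-overlapping partitions")
    rng.shuffle(classes)

    partitions = []
    for k in range(n_partitions):
        selected = classes[k * targets_per_partition:(k + 1) * targets_per_partition]
        pool = [dp for label in selected for dp in by_class[label]]
        # Size rule from the description: min(samples_per_partition, available datapoints).
        size = min(samples_per_partition, len(pool))
        partitions.append(rng.sample(pool, size))
    return partitions


# 20 datapoints spread evenly over 4 classes (5 per class).
data = [([float(i)], i % 4) for i in range(20)]
parts = heterogeneous_partitions(data, n_partitions=2, targets_per_partition=2, samples_per_partition=6)
print([sorted({label for _, label in p}) for p in parts])
```

With two classes per partition, each pool holds 10 datapoints, so requesting 6 samples yields 6 per partition, while requesting more than 10 would be capped at 10.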

class decent_bench.datasets.SyntheticClassificationDatasetHandler(n_targets: int, n_features: int, n_samples_per_partition: int, n_partitions: int = 1, *, framework: SupportedFrameworks = SupportedFrameworks.NUMPY, device: SupportedDevices = SupportedDevices.CPU, feature_dtype: DTypeLike = np.float64, target_dtype: DTypeLike = np.int64, squeeze_targets: bool = False)

Bases: DatasetHandler

property n_samples: int

Total number of datapoints in the dataset.

property n_partitions: int

Total number of partitions in the dataset.

property n_features: int

Number of feature dimensions.

property n_targets: int

Number of target dimensions.

get_datapoints() → Dataset

Return all datapoints in the dataset.

Can be used for evaluation on the full dataset or creation of test datasets.

get_partitions() → list[Dataset]

Return the dataset divided into partitions for distribution among agents.

This method provides the core partitioning functionality for decentralized optimization. Each partition represents the local dataset of an agent in the network.

Returns: List of Dataset objects, one per partition; each is a list of (features, targets) tuples.
Return type: list[Dataset]

class decent_bench.datasets.SyntheticRegressionDatasetHandler(n_targets: int, n_features: int, n_samples_per_partition: int, n_partitions: int = 1, *, framework: SupportedFrameworks = SupportedFrameworks.NUMPY, device: SupportedDevices = SupportedDevices.CPU, feature_dtype: DTypeLike = np.float64, target_dtype: DTypeLike = np.float64, squeeze_targets: bool = False)

Bases: DatasetHandler

property n_samples: int

Total number of datapoints in the dataset.

property n_partitions: int

Total number of partitions in the dataset.

property n_features: int

Number of feature dimensions.

property n_targets: int

Number of target dimensions.

get_datapoints() → Dataset

Return all datapoints in the dataset.

Can be used for evaluation on the full dataset or creation of test datasets.

get_partitions() → list[Dataset]

Return the dataset divided into partitions for distribution among agents.

This method provides the core partitioning functionality for decentralized optimization. Each partition represents the local dataset of an agent in the network.

Returns: List of Dataset objects, one per partition; each is a list of (features, targets) tuples.
Return type: list[Dataset]