Creates federated data from the provided centralized data (features and labels) to exemplify identically and
non-identically distributed labels and features across the local nodes (clients). It allows one to select between
two methods of data federation (percent_noniid and dirichlet). It works only for classification problems
(labels as classes).
Creates a federated dataset split across the local nodes (clients) using the desired method (percent_noniid or dirichlet). It works only for classification problems (labels as classes) with quantitative (numeric) features.
image_list
List of numpy arrays (or a pandas DataFrame) with the images (i.e. features) from the centralized data.
label_list : array-like
The target values (class labels in classification) from the centralized data.
num_clients : int
Number of local nodes (clients) used in the federated learning paradigm.
prefix_cli : str
Prefix for the clients' names, e.g., client_1, client_2, etc.
method : str
Method to create the federated data based on label skew. Possible options: "percent_noniid" (default), "dirichlet", "no-label-skew".
alpha : float
Concentration parameter of the Dirichlet distribution defining the desired degree of non-IID-ness for the labels of the federated data.
percent_noniid : float
Desired percentage (between 0 and 100) of non-IID-ness for the labels of the federated data.
sigma_noise : float
Noise level (sigma parameter of the Gaussian distribution) to be added to the features. Applicable only for feat_skew_method="gaussian-noise".
bins : int or str
Number of bins used to build the histogram of features when checking feature skew. It can be the word "n_samples" or the integer number of bins to use. If "n_samples" (default) is selected, the number of bins is set to the number of examples in image_list. Applicable only for feat_skew_method="gaussian-noise".
feat_sample_rate : float
Proportion (between 0 and 1) of the features to be sampled. This parameter is useful when dealing with datasets with many features (e.g., images). Applicable only for feat_skew_method="gaussian-noise".
feat_skew_method : str
Method to create the federated data based on feature skew. Possible options: "gaussian-noise" (default), "hist-dirichlet".
alpha_feat_split : float
Concentration parameter of the Dirichlet distribution defining the desired degree of non-IID-ness for the features of the federated data. Applicable only for feat_skew_method="hist-dirichlet".
idx_feat : int or str
Position (index) of the feature used to simulate feature skew. It can be the word "feat-mean" or the integer position to use. If "feat-mean" (default) is selected, the mean of all the features is computed and used as the representative feature. Applicable only for feat_skew_method="hist-dirichlet".
feat_quantile : int
Number of quantiles to use in the feature skew simulation: 20 for ventiles (default), 10 for deciles, 4 for quartiles, etc. Applicable only for feat_skew_method="hist-dirichlet".
quant_skew_method : str
Method to create the federated data based on quantity skew. Possible options: "no-quant-skew" (default), "dirichlet", "minsize-dirichlet".
alpha_quant_split : float
Concentration parameter of the Dirichlet distribution defining the desired degree of non-IID-ness for the quantity skew of the federated data. Applicable only for quant_skew_method="dirichlet".
spa_temp_skew_method : str
Method to create the federated data based on spatio-temporal skew. Possible options: "no-spatemp-skew" (default), "st-dirichlet".
alpha_spa_temp : float
Concentration parameter of the Dirichlet distribution defining the desired degree of non-IID-ness for the spatio-temporal skew of the federated data. Applicable only for spa_temp_skew_method="st-dirichlet".
spa_temp_var : array-like
The spatio-temporal variable from the centralized data. Applicable only for spa_temp_skew_method="st-dirichlet".
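To make the role of a Dirichlet concentration parameter such as alpha concrete, the following is a minimal sketch of label-skew partitioning with NumPy. It is not the library's implementation: the helper name `dirichlet_label_split` and its `seed` argument are hypothetical, and the per-class splitting scheme is one common way to realize this idea, assumed here for illustration.

```python
import numpy as np


def dirichlet_label_split(labels, num_clients, alpha, seed=0):
    """Return one index array per client (hypothetical helper, not the library API)."""
    rng = np.random.default_rng(seed)
    labels = np.asarray(labels)
    client_ids = [[] for _ in range(num_clients)]
    for cls in np.unique(labels):
        idx = np.flatnonzero(labels == cls)
        rng.shuffle(idx)
        # Share of this class that each client receives.
        shares = rng.dirichlet(alpha * np.ones(num_clients))
        # Turn the cumulative shares into split points over this class's indices.
        cuts = (np.cumsum(shares)[:-1] * len(idx)).astype(int)
        for client, part in enumerate(np.split(idx, cuts)):
            client_ids[client].extend(part.tolist())
    return [np.array(sorted(ids), dtype=int) for ids in client_ids]


labels = np.repeat(np.arange(4), 50)  # 200 examples, 4 balanced classes
parts = dirichlet_label_split(labels, num_clients=5, alpha=0.5)
for i, ids in enumerate(parts):
    # Per-client class counts; small alpha tends to concentrate classes on few clients.
    print(f"client_{i + 1}:", np.bincount(labels[ids], minlength=4))
```

In this sketch, small values of alpha tend to give each client most of its examples from a few classes, while large values push every client toward the global class proportions.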
fed_data : dict
Contains features (images) and labels for each local node (client) after federating the data. Includes the "with_class_completion" and "without_class_completion" cases.
ids_list_fed_data : array-like
Indexes of the examples (partition) taken for each local node (client).
num_missing_classes : array-like
Number of missing classes for each local node when creating the federated dataset.
distances : dict
Distances calculated while measuring the heterogeneity (non-IID-ness) of the label distribution among clients. Includes the "with_class_completion" and "without_class_completion" cases.
spatemp_fed_data : dict
Contains the categories of the spatio-temporal variable for each local node (client) after federating the data. It is generated only when spa_temp_skew_method="st-dirichlet".
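The text above does not say which distances are reported. As an illustration only, the sketch below assumes the Hellinger distance between each client's label distribution and the global label distribution, averaged over clients. The function names are hypothetical, and every client is assumed to hold at least one example.

```python
import numpy as np


def label_distribution(labels, num_classes):
    """Relative frequency of each class (assumes at least one example)."""
    counts = np.bincount(labels, minlength=num_classes)
    return counts / counts.sum()


def hellinger(p, q):
    """Hellinger distance between two discrete distributions, in [0, 1]."""
    return float(np.sqrt(0.5 * np.sum((np.sqrt(p) - np.sqrt(q)) ** 2)))


def mean_client_distance(client_labels, num_classes):
    """Average distance of each client's label distribution to the global one."""
    global_dist = label_distribution(np.concatenate(client_labels), num_classes)
    per_client = [
        hellinger(label_distribution(lab, num_classes), global_dist)
        for lab in client_labels
    ]
    return float(np.mean(per_client))


# Three clients with identical label mixes versus three clients with skewed mixes.
iid_clients = [np.array([0, 1, 2, 3])] * 3
skewed_clients = [np.array([0, 0, 0, 0]), np.array([1, 1, 1, 1]), np.array([2, 2, 3, 3])]

iid_score = mean_client_distance(iid_clients, num_classes=4)
skewed_score = mean_client_distance(skewed_clients, num_classes=4)
print(iid_score, skewed_score)
```

Under this assumption, a score near 0 indicates clients whose label mix matches the global one, and larger scores indicate stronger label skew.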
Note: When creating federated data with heterogeneous distributions (i.e. high values of percent_noniid or small values of alpha), it becomes more likely that a client holds examples from only one class. For this reason, two cases (for labels and features) are returned in fed_data and distances:
"with_class_completion": In this case, each client is completed with one (random) example of each class it is missing, so that every client holds all the label classes.
"without_class_completion": In this case, the clients are NOT completed with one (random) example of each missing class. Consequently, the numbers of examples held by the clients sum to the total number of examples (the number of rows in image_list).
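The difference between the two cases can be sketched as follows. This is an illustrative reconstruction of the idea described in the note, not the library's code: the helper `complete_missing_classes` and its `seed` argument are hypothetical, and drawing the extra example from the whole dataset is an assumption.

```python
import numpy as np


def complete_missing_classes(client_ids, labels, seed=0):
    """Add one random example of each missing class to every client (hypothetical helper)."""
    rng = np.random.default_rng(seed)
    labels = np.asarray(labels)
    classes = np.unique(labels)
    completed, num_missing = [], []
    for ids in client_ids:
        ids = np.asarray(ids, dtype=int)
        # Classes that never appear among this client's examples.
        missing = np.setdiff1d(classes, labels[ids])
        # One random example per missing class, drawn from the whole dataset (assumption).
        extra = [rng.choice(np.flatnonzero(labels == cls)) for cls in missing]
        completed.append(np.concatenate([ids, np.array(extra, dtype=int)]))
        num_missing.append(len(missing))
    return completed, num_missing


labels = np.array([0, 0, 1, 1, 2, 2])
without_completion = [np.array([0, 1]), np.array([2, 3, 4, 5])]  # client 1 holds only class 0
with_completion, num_missing = complete_missing_classes(without_completion, labels)

print(num_missing)                               # missing classes per client
print(sum(len(c) for c in without_completion))   # equals len(labels)
print(sum(len(c) for c in with_completion))      # larger, because examples were added
```

In the sketch, the "without" partition keeps the total example count equal to the size of the dataset, while the "with" partition grows by one example per missing class per client.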