Each Component wraps a FederatedML Module. Modules implement machine learning algorithms for federated learning, while Components provide a convenient interface for easy model building.
Input encapsulates all upstream input to a component in a job workflow. There are two classes of input: data and model. Not all components have both classes of input, and a component may accept only some types within a class. For information on each component's input, check the list below.
Here is an example of accessing a component's input:
from pipeline.component import DataIO
dataio_0 = DataIO(name="dataio_0")
input_all = dataio_0.input
input_data = dataio_0.input.data
input_model = dataio_0.input.model
Like Input, Output encapsulates the output data and model of a component in a FATE job. Not all components have both classes of output. For information on each component's output, check the list below.
Here is an example of accessing a component's output:
from pipeline.component import DataIO
dataio_0 = DataIO(name="dataio_0")
output_all = dataio_0.output
output_data = dataio_0.output.data
output_model = dataio_0.output.model
To download a component's output table or model, use the task info interface.
In most cases, data sets are wrapped into data when being passed between modules. For instance, in the mini demo, the data output of dataio_0 is set as the data input to intersection_0:
pipeline.add_component(intersection_0, data=Data(data=dataio_0.output.data))
For data sets used in different modeling stages (e.g., train & validate) of the same component, the additional keywords train_data and validate_data are used to distinguish the data sets. Also from the mini demo, the results from intersection_0 and intersection_1 are set as the train and validate data of hetero logistic regression, respectively.
pipeline.add_component(hetero_lr_0, data=Data(train_data=intersection_0.output.data,
validate_data=intersection_1.output.data))
Another case for the keywords train_data and validate_data is consuming the data output of the DataSplit module, which always has three data outputs: train_data, validate_data, and test_data.
pipeline.add_component(hetero_lr_0,
data=Data(train_data=hetero_data_split_0.output.data.train_data))
A special data type is predict_input, which is only used to specify the data input when running a prediction task.
Here is an example of running prediction with an upstream model within the same pipeline:
pipeline.add_component(hetero_lr_1,
                       data=Data(predict_input=hetero_data_split_0.output.data.test_data),
                       model=Model(model=hetero_lr_0.output.model))
To run prediction with new data, the data source needs to be updated in the prediction job. Below is an example from the mini demo, where the data input of the original dataio_0 component is set to the data output of reader_2.
reader_2 = Reader(name="reader_2")
reader_2.get_party_instance(role="guest", party_id=guest).component_param(table=guest_eval_data)
reader_2.get_party_instance(role="host", party_id=host).component_param(table=host_eval_data)
# add data reader onto predict pipeline
predict_pipeline.add_component(reader_2)
predict_pipeline.add_component(pipeline,
data=Data(predict_input={pipeline.dataio_0.input.data: reader_2.output.data}))
The table below lists all five types of data and whether Input and Output include them.
Data Name | Input | Output | Use Case |
---|---|---|---|
data | Yes | Yes | single data input or output |
train_data | Yes | Yes | model training; output of DataSplit component |
validate_data | Yes | Yes | model training with validate data; output of DataSplit component |
test_data | No | Yes | output of DataSplit component |
predict_input | Yes | No | model prediction |
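The table above can be restated as a small lookup, which may be handy for sanity-checking a keyword before wiring components together. This is a minimal sketch in plain Python and is not part of the FATE Pipeline API; `DATA_TYPES` and `check_data_keyword` are hypothetical names introduced here for illustration.

```python
# Hypothetical helper, not part of FATE Pipeline: mirrors the table above.
# Each entry maps a data name to (usable as input, usable as output).
DATA_TYPES = {
    "data": (True, True),
    "train_data": (True, True),
    "validate_data": (True, True),
    "test_data": (False, True),
    "predict_input": (True, False),
}


def check_data_keyword(name, direction):
    """Return True if `name` is listed as valid for `direction` ("input" or "output")."""
    if direction not in ("input", "output"):
        raise ValueError("direction must be 'input' or 'output'")
    if name not in DATA_TYPES:
        raise KeyError(f"unknown data type: {name}")
    as_input, as_output = DATA_TYPES[name]
    return as_input if direction == "input" else as_output
```

For example, `check_data_keyword("test_data", "input")` returns `False`, matching the row that marks test_data as output-only.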
All input and output data of components need to be wrapped into Data objects when being passed between components. For information on the valid data types of each component, check the list below.
Here is an example of chaining components with different types of data input and output:
from pipeline.backend.pipeline import PipeLine
from pipeline.component import DataIO, HeteroDataSplit, HeteroLR, Intersection, Reader
from pipeline.interface.data import Data
# example guest party id; replace with your own
guest = 9999
# initialize a pipeline
pipeline = PipeLine().set_initiator(role='guest', party_id=guest).set_roles(guest=guest)
# define all components
reader_0 = Reader(name="reader_0")
dataio_0 = DataIO(name="dataio_0")
intersection_0 = Intersection(name="intersection_0")
hetero_data_split_0 = HeteroDataSplit(name="hetero_data_split_0")
hetero_lr_0 = HeteroLR(name="hetero_lr_0", max_iter=20)
# chain together all components
pipeline.add_component(reader_0)
pipeline.add_component(dataio_0, data=Data(data=reader_0.output.data))
pipeline.add_component(intersection_0, data=Data(data=dataio_0.output.data))
pipeline.add_component(hetero_data_split_0, data=Data(data=intersection_0.output.data))
pipeline.add_component(hetero_lr_0, data=Data(train_data=hetero_data_split_0.output.data.train_data,
                                              validate_data=hetero_data_split_0.output.data.test_data))
There are two types of Model: model and isometric_model. When the current component is of the same class as the previous component and receives model, it replicates all model parameters from the previous component. When a model from a previous component is used as input but the current component is of a different class, isometric_model is used.
Below is a case from the mini demo, where the model from dataio_0 is passed to dataio_1:
pipeline.add_component(dataio_1,
data=Data(data=reader_1.output.data),
model=Model(dataio_0.output.model))
Here is a case of using isometric_model. HeteroFeatureSelection uses isometric_model from HeteroFeatureBinning to select the most important features.
pipeline.add_component(hetero_feature_selection_0,
                       data=Data(data=intersection_0.output.data),
                       model=Model(isometric_model=hetero_feature_binning_0.output.model))
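The rule for choosing between the two model types can be summarized in a few lines. The sketch below is plain Python and does not call FATE; `choose_model_keyword` is a hypothetical helper, and component classes are represented simply by their names as strings.

```python
def choose_model_keyword(upstream_class, current_class):
    """Pick the Model keyword for passing an upstream model to the current component.

    Same class      -> "model" (parameters are replicated from upstream).
    Different class -> "isometric_model".
    """
    return "model" if upstream_class == current_class else "isometric_model"


# DataIO -> DataIO replicates parameters; binning -> selection is isometric.
print(choose_model_keyword("DataIO", "DataIO"))
print(choose_model_keyword("HeteroFeatureBinning", "HeteroFeatureSelection"))
```

The two calls correspond to the dataio_0/dataio_1 and binning/selection cases described above.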
Warning
Please note that when using the stepwise or cross validation method, components do not have model output. For information on the valid model types of each component, check the list below.
Parameters of the underlying module can be set for all job participants or per individual participant.
- Parameters for all participants may be specified when defining a component:
from pipeline.component import DataIO
dataio_0 = DataIO(name="dataio_0", input_format="dense", output_format="dense",
outlier_replace=False)
- Parameters can be set for each party individually:
# set guest dataio_0 component parameters
guest_dataio_0 = dataio_0.get_party_instance(role='guest', party_id=9999)
guest_dataio_0.component_param(with_label=True)
# set host dataio_0 component parameters
dataio_0.get_party_instance(role='host', party_id=10000).component_param(with_label=False)
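One way to picture the two levels of configuration is as a dictionary of shared parameters plus per-party overrides. The sketch below is plain Python and independent of FATE; `resolve_params` is a hypothetical helper, and it assumes that a value set for an individual party takes precedence over the shared value for that party only.

```python
def resolve_params(common, per_party, role, party_id):
    """Merge shared parameters with overrides for one (role, party_id).

    Assumption (not stated in the text above): party-level values win
    over shared values, and other parties are unaffected.
    """
    resolved = dict(common)  # copy so the shared settings stay untouched
    resolved.update(per_party.get((role, party_id), {}))
    return resolved


# Shared settings, as passed when defining the component.
common = {"input_format": "dense", "output_format": "dense", "outlier_replace": False}
# Party-specific settings, keyed by (role, party_id).
per_party = {
    ("guest", 9999): {"with_label": True},
    ("host", 10000): {"with_label": False},
}

guest_params = resolve_params(common, per_party, "guest", 9999)
host_params = resolve_params(common, per_party, "host", 10000)
```

Here `guest_params` and `host_params` share the three common settings and differ only in `with_label`.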
Output data and model information of Components can be retrieved with the Pipeline task info API. Currently Pipeline supports these four types of query on components:
- get_output_data: returns downloaded output data; use the parameter limits to limit the number of output lines
- get_output_data_table: returns output data table information (including table name and namespace)
- get_model_param: returns fitted model parameters
- get_summary: returns model summary
To obtain the output of a component, the component first needs to be extracted from the pipeline:
print(pipeline.get_component("dataio_0").get_output_data(limits=10))
The table below lists the input and output elements of each component.
Algorithm | Component Name | Description | Acceptable Input Data | Acceptable Output Data | Acceptable Input Model | Acceptable Output Model |
---|---|---|---|---|---|---|
Reader | Reader | This component is always the first component of a pipeline task (except for upload). It loads raw data from storage. | None | data | None | None |
DataIO | DataIO | This component usually follows Reader. It transforms user-uploaded data into Instance objects. | data | data | model | model |
Intersect | Intersection | Compute intersect data set of multiple parties without leakage of difference set information. Mainly used in hetero scenario task. | data | data | model | model |
Federated Sampling | FederatedSample | Samples data so that its distribution becomes balanced in each party. This module supports standalone and federated versions. | data | data | model | model |
Feature Scale | FeatureScale | Feature scaling and standardization. | data | data | model | model |
Hetero Feature Binning | HeteroFeatureBinning | Bins input data, calculates each column's IV and WOE, and transforms data according to the binned information. | data | data | model | model |
OneHot Encoder | OneHotEncoder | Transfer a column into one-hot format. | data | data | model | model |
Hetero Feature Selection | HeteroFeatureSelection | Provides 5 types of filters. Each filter can select columns according to user config. | data | data | model; isometric model | model |
Union | Union | Combine multiple data tables into one. | List[data] | data | model | model |
Hetero-LR | HeteroLR | Build hetero logistic regression module through multiple parties. | train_data; validate_data; predict_input | data | model | model |
Local Baseline | LocalBaseline | Wrapper that runs sklearn(scikit-learn) Logistic Regression model with local data. | train_data; validate_data; predict_input | data | None | None |
Hetero-LinR | HeteroLinR | Build hetero linear regression module through multiple parties. | train_data; validate_data; predict_input | data | model | model |
Hetero-Poisson | HeteroPoisson | Build hetero poisson regression module through multiple parties. | train_data; validate_data; predict_input | data | model | model |
Homo-LR | HomoLR | Build homo logistic regression module through multiple parties. | train_data; validate_data; predict_input | data | model | model |
Homo-NN | HomoNN | Build homo neural network module through multiple parties. | train_data; validate_data; predict_input | data | model | model |
Hetero Secure Boosting | HeteroSecureBoost | Build hetero secure boosting module through multiple parties | train_data; validate_data; predict_input | data | model | model |
Evaluation | Evaluation | Output the model evaluation metrics for user. | data; List[data] | None | model | None |
Hetero Pearson | HeteroPearson | Calculate hetero correlation of features from different parties. | data | None | model | model |
Hetero-NN | HeteroNN | Build hetero neural network module. | train_data; validate_data; predict_input | data | model | model |
Homo Secure Boosting | HomoSecureBoost | Build homo secure boosting module through multiple parties | train_data; validate_data; predict_input | data | model | model |
Homo OneHot Encoder | HomoOneHotEncoder | Build homo onehot encoder module through multiple parties. | data | data | model | model |
Data Split | Data Split | Split one data table into 3 tables by given ratio or count | data | train_data & validate_data & test_data | None | None |
Column Expand | Column Expand | Add arbitrary number of columns with user-provided values. | data | data | model | model |
Hetero KMeans | Hetero KMeans | Build Hetero KMeans module through multiple parties | train_data; validate_data; predict_input | data | model | model |
Data Statistics | Data Statistics | Compute Data Statistics | data | data | None | model |
Scorecard | Scorecard | Scale predict score to credit score by given scaling parameters | data | data | None | None |
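When assembling a pipeline, it can help to check a planned connection against the table above. The following is a minimal sketch in plain Python that copies a few rows of the table into a dictionary; `ACCEPTED_INPUT_DATA` and `accepts_input` are hypothetical names, and only a subset of components is included.

```python
# A few rows copied from the table above (component name -> acceptable input data).
# Hypothetical lookup for illustration; not part of FATE Pipeline.
ACCEPTED_INPUT_DATA = {
    "Reader": set(),
    "DataIO": {"data"},
    "Intersection": {"data"},
    "HeteroLR": {"train_data", "validate_data", "predict_input"},
}


def accepts_input(component, data_keyword):
    """Return True if the table lists `data_keyword` as input data for `component`."""
    return data_keyword in ACCEPTED_INPUT_DATA[component]
```

For instance, `accepts_input("HeteroLR", "data")` returns `False`, which reflects that the table lists only train_data, validate_data, and predict_input for HeteroLR.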