etiq.datasets.builders package

Submodules

etiq.datasets.builders.bias_dataset_builder module

class etiq.datasets.builders.bias_dataset_builder.BiasDatasetBuilder

Bases: SimpleDatasetBuilder

classmethod bias_params(bias_params: BiasParams | None = None, protected: str | None = None, privileged: Any | None = None, unprivileged: Any | None = None, positive_outcome_label: Any | None = None, negative_outcome_label: Any | None = None)

Returns a BiasParams object

Parameters:
  • bias_params – A bias params object to clone. This defaults to None if not provided.

  • protected – Protected feature name for example ‘gender’. Defaults to None.

  • privileged – Privileged label within the protected feature for example ‘male’. This defaults to None if not provided.

  • unprivileged – Unprivileged label within the protected feature for example ‘female’. This defaults to None if not provided.

  • positive_outcome_label – The label of a “positive” outcome within a target feature. Defaults to None.

  • negative_outcome_label – The label of a “negative” outcome within a target feature. Defaults to None.

Returns:

A BiasParams object
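
As an illustration of what the privileged, unprivileged and positive_outcome_label values refer to, the following is a minimal sketch in plain Python. It does not use etiq and is not how the library computes anything; the helper name positive_rate_by_group and the (gender, income) rows are assumptions made for this example only.

```python
from collections import defaultdict

# Illustrative (gender, income) pairs; hypothetical data for this sketch.
rows = [("M", 1), ("F", 0), ("F", 1), ("M", 0), ("F", 1),
        ("F", 0), ("M", 1), ("F", 0), ("M", 1), ("M", 1)]


def positive_rate_by_group(rows, positive_outcome_label):
    """Return the share of rows carrying the positive outcome, per group."""
    totals = defaultdict(int)
    positives = defaultdict(int)
    for group, outcome in rows:
        totals[group] += 1
        if outcome == positive_outcome_label:
            positives[group] += 1
    return {group: positives[group] / totals[group] for group in totals}


rates = positive_rate_by_group(rows, positive_outcome_label=1)
privileged_rate = rates["M"]    # privileged='M'
unprivileged_rate = rates["F"]  # unprivileged='F'
```

Here the protected feature is gender, and the two rates show how often each group receives the positive outcome label.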

classmethod dataset(features: DataFrame, target: DataFrame | None = None, label: str | None = None, prediction: str | None = None, cat_col: List[str] | None = None, cont_col: List[str] | None = None, train_valid_test_splits: Tuple[float, float, float] = (0.8, 0.2, 0.0), id_col: List[str] | None = None, date_col: List[str] | None = None, bias_params: BiasParams | None = None, convert_date_cols: bool = False, datetime_format: str = '', remove_protected_from_features: bool = True, random_seed: int = 2, name: str = None, register_creation: bool = True) BiasDataset

Creates a BiasDataset object given pandas dataframe(s).

Use the dataset builder like this:

from etiq import BiasDatasetBuilder
from etiq.biasparams import BiasParams
import pandas as pd
a = [
        ["2022-10-10", 'M', 2, 3, 4, 5, 6, 1],
        ["2022-10-11", 'F', 8, 9, 10, 11, 12, 0],
        ["", 'F', 2, 3, 4, 5, 6, 1],
        ["2022-10-13", 'M', 8, 9, 10, 11, 12, 0],
        ["2022-10-14", 'F', 2, 3, 4, 5, 6, 1],
        ["2022-10-15", 'F', 8, 9, 10, 11, 12, 0],
        ["2022-10-16", 'M', 2, 3, 4, 5, 6, 1],
        ["2022-10-17", 'F', 8, 9, 10, 11, 12, 0],
        ["2022-10-18", 'M', 14, 15, 16, 17, 18, 1],
        ["2022-10-19", 'M', 15, 16, 17, 18, 19, 1],
    ]
df = pd.DataFrame(a,
                columns=["start_date", "gender", "age2", "age3", "age4",
                        "age5", "age6", "income"])
adataset = BiasDatasetBuilder.dataset(
                features=df,
                label="income",
                cat_col=["age2", "age3", "income"],
                cont_col=["age4", "age5", "age6"],
                date_col=["start_date"],
                bias_params = BiasParams(protected='gender',
                                         privileged='M',
                                         unprivileged='F',
                                         positive_outcome_label= 1,
                                         negative_outcome_label= 0),
                remove_protected_from_features = True,
                convert_date_cols=True,
                name="test_dataset")
Parameters:
  • features – Pandas dataframe containing the dataset features as columns.

  • target – Pandas dataframe containing the target feature as a column. This defaults to None in which case the target feature is assumed to be either the last column in features dataset or the column name specified in the label argument.

  • label – The name of the column containing the target. This defaults to None in which case the target is assumed to either be the last column of the features dataframe or the first column of the target dataframe if this is not None.

  • prediction – The name of the column containing the prediction data. This defaults to None in which case the assumption is that the dataset contains no prediction data.

  • cat_col – List of categorical features. This defaults to None in which case categorical features are determined automatically.

  • cont_col – List of continuous features. This defaults to None in which case continuous features are determined automatically.

  • id_col – List of id features. This defaults to None in which case it is assumed the dataset contains no id features.

  • date_col – List of datetime features. This defaults to None in which case it is assumed the dataset contains no datetime features.

  • bias_params – This contains demographic data (the protected feature) needed to create the bias dataset. This defaults to None in which case a fake random protected feature is created.

  • train_valid_test_splits – This parameter specifies the proportions to use when splitting the data into training, validation and test subsets. This defaults to (0.8, 0.2, 0.0).

  • random_seed – Random number seed (for reproducibility) used when splitting the data into random training, validation and test subsets. This defaults to 2.

  • remove_protected_from_features – Set this to True to remove the protected feature from the normal features, so that it is not treated as a feature used by the model. Otherwise the protected feature is treated as a normal feature. This defaults to True.

  • convert_date_cols – Set this to True to convert any date features into datetime objects. This defaults to False.

  • datetime_format – The specific datetime format (assumes a common datetime is used for all datetime features). This defaults to an empty string in which case the datetime format is guessed.

  • name – The name to use for the dataset. This defaults to None in which case a random name is assigned.

  • register_creation – Set this to True to register the dataset to the database (note that only a hash and/or fingerprint of the data is stored). This defaults to True.

Returns:

A BiasDataset object.
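
The train_valid_test_splits and random_seed parameters describe a seeded random partition of the rows. The sketch below shows one way such a partition could work, using only the standard library. The function split_indices is hypothetical and the rounding rule is an assumption; the library's actual splitting logic may differ.

```python
import random


def split_indices(n_rows, splits=(0.8, 0.2, 0.0), seed=2):
    """Shuffle row indices with a fixed seed and cut them into three parts."""
    if abs(sum(splits) - 1.0) > 1e-9:
        raise ValueError("split proportions must sum to 1")
    indices = list(range(n_rows))
    # A dedicated Random instance keeps the shuffle reproducible for a given seed.
    random.Random(seed).shuffle(indices)
    n_train = round(splits[0] * n_rows)
    n_valid = round(splits[1] * n_rows)
    train = indices[:n_train]
    valid = indices[n_train:n_train + n_valid]
    # Assumption: whatever remains after training and validation goes to testing.
    test = indices[n_train + n_valid:]
    return train, valid, test


train, valid, test = split_indices(10)
```

With ten rows and the default proportions this gives eight training rows, two validation rows and no testing rows, and calling it again with the same seed returns the same partition.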

classmethod datasets(training_features: DataFrame | None = None, training_target: DataFrame | None = None, validation_features: DataFrame | None = None, validation_target: DataFrame | None = None, testing_features: DataFrame | None = None, testing_target: DataFrame | None = None, label: str | None = None, prediction: str | None = None, cat_col: List[str] | None = None, cont_col: List[str] | None = None, bias_params: BiasParams | None = None, remove_protected_from_features: bool = True, id_col: List[str] | None = None, date_col: List[str] | None = None, convert_date_cols: bool = False, datetime_format: str = '', name: str | None = None, register_creation: bool = True) BiasDataset

Creates a BiasDataset object given pandas dataframe(s).

Use this builder like:

from etiq import BiasDatasetBuilder
from etiq.biasparams import BiasParams
import pandas as pd
training = [
        ["2022-10-10", 'M', 2, 3, 4, 5, 6, 1],
        ["2022-10-11", 'F', 8, 9, 10, 11, 12, 0],
        ["", 'F', 2, 3, 4, 5, 6, 1],
        ["2022-10-13", 'M', 8, 9, 10, 11, 12, 0],
        ["2022-10-14", 'F', 2, 3, 4, 5, 6, 1]
        ]
validation = [
        ["2022-10-15", 'F', 8, 9, 10, 11, 12, 0],
        ["2022-10-16", 'M', 2, 3, 4, 5, 6, 1],
        ["2022-10-17", 'F', 8, 9, 10, 11, 12, 0],
        ["2022-10-18", 'M', 14, 15, 16, 17, 18, 1],
        ["2022-10-19", 'M', 15, 16, 17, 18, 19, 1]
        ]
df1 = pd.DataFrame(training,
                columns=["start_date", "gender", "age2", "age3", "age4",
                        "age5", "age6", "income"])
df2 = pd.DataFrame(validation,
                columns=["start_date", "gender", "age2", "age3", "age4",
                        "age5", "age6", "income"])
adataset = BiasDatasetBuilder.datasets(
                training_features=df1,
                validation_features=df2,
                label="income",
                cat_col=["age2", "age3", "income"],
                cont_col=["age4", "age5", "age6"],
                date_col=["start_date"],
                bias_params = BiasParams(protected='gender',
                                         privileged='M',
                                         unprivileged='F',
                                         positive_outcome_label= 1,
                                         negative_outcome_label= 0),
                remove_protected_from_features = True,
                convert_date_cols=True,
                name="test_dataset")
Parameters:
  • training_features – Pandas dataframe containing the training dataset features. This defaults to None in which case we assume there is no training data.

  • training_target – Pandas dataframe containing the target training data as a column. This defaults to None in which case the target feature is assumed to be either the last column in training features dataset or the column name specified in the label argument.

  • validation_features – Pandas dataframe containing the validation dataset features. This defaults to None in which case we assume there is no validation data.

  • validation_target – Pandas dataframe containing the target validation data as a column. This defaults to None in which case the target feature is assumed to be either the last column in validation features dataset or the column name specified in the label argument.

  • testing_features – Pandas dataframe containing the testing dataset features. This defaults to None in which case we assume there is no testing data.

  • testing_target – Pandas dataframe containing the target testing data as a column. This defaults to None in which case the target feature is assumed to be either the last column in testing features dataset or the column name specified in the label argument.

  • label – The name of the column containing the target. This defaults to None in which case the target is assumed to either be the last column of the features dataframe or the first column of the target dataframe if this is not None.

  • prediction – The name of the column containing the prediction data. This defaults to None in which case the assumption is that the dataset contains no prediction data.

  • cat_col – List of categorical features. This defaults to None in which case categorical features are determined automatically.

  • cont_col – List of continuous features. This defaults to None in which case continuous features are determined automatically.

  • id_col – List of id features. This defaults to None in which case it is assumed the dataset contains no id features.

  • date_col – List of datetime features. This defaults to None in which case it is assumed the dataset contains no datetime features.

  • bias_params – This contains demographic data (the protected feature) needed to create the bias dataset. This defaults to None in which case a fake random protected feature is created.

  • train_valid_test_splits – This parameter specifies the proportions to use when splitting the data into training, validation and test subsets. This defaults to (0.8, 0.2, 0.0).

  • random_seed – Random number seed used when splitting the data into random training, validation and test subsets. This defaults to 2.

  • remove_protected_from_features – Set this to True to remove the protected feature from the normal features, so that it is not treated as a feature used by the model. Otherwise the protected feature is treated as a normal feature. This defaults to True.

  • convert_date_cols – Set this to True to convert any date features into datetime objects. This defaults to False.

  • datetime_format – The specific datetime format (assumes a common datetime is used for all datetime features). This defaults to an empty string in which case the datetime format is guessed.

  • name – The name to use for the dataset. This defaults to None in which case a random name is assigned.

  • register_creation – Set this to True to register the dataset to the database (note that only a hash and/or fingerprint of the data is stored). This defaults to True.

Returns:

A BiasDataset object.
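
The effect of remove_protected_from_features can be pictured as a column selection step. The following is a rough sketch under that reading, not the library's code; select_feature_columns is a hypothetical helper, and excluding the label column from the feature list is an assumption.

```python
def select_feature_columns(columns, label, protected,
                           remove_protected_from_features=True):
    """Return the columns a model would see as features."""
    excluded = {label}  # assumption: the target is never a model feature
    if remove_protected_from_features:
        excluded.add(protected)
    return [column for column in columns if column not in excluded]


columns = ["gender", "age2", "age3", "age4", "age5", "age6", "income"]
without_protected = select_feature_columns(columns, "income", "gender")
with_protected = select_feature_columns(
    columns, "income", "gender", remove_protected_from_features=False
)
```

With the default setting gender is dropped from the feature list; with False it stays alongside the other features.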

etiq.datasets.builders.simple_dataset_builder module

exception etiq.datasets.builders.simple_dataset_builder.DatasetError

Bases: Exception

Base class for dataset errors

class etiq.datasets.builders.simple_dataset_builder.SimpleDatasetBuilder

Bases: object

A builder for the SimpleDataset class

classmethod dataset(features: DataFrame, target: DataFrame | None = None, label: str | None = None, prediction: str | None = None, cat_col: List[str] | None = None, cont_col: List[str] | None = None, id_col: List[str] | None = None, date_col: List[str] | None = None, train_valid_test_splits: Tuple[float, float, float] = (0.8, 0.2, 0.0), random_seed: int = 2, convert_date_cols: bool = False, datetime_format: str = '', name: str | None = None, register_creation: bool = True) SimpleDataset

Creates a SimpleDataset object given pandas dataframe(s).

Use this builder like:

from etiq import SimpleDatasetBuilder
import pandas as pd
a = [
        ["2022-10-10", 2, 3, 4, 5, 6, 1],
        ["2022-10-11", 8, 9, 10, 11, 12, 0],
        ["", 2, 3, 4, 5, 6, 1],
        ["2022-10-13", 8, 9, 10, 11, 12, 0],
        ["2022-10-14", 2, 3, 4, 5, 6, 1],
        ["2022-10-15", 8, 9, 10, 11, 12, 0],
        ["2022-10-16", 2, 3, 4, 5, 6, 1],
        ["2022-10-17", 8, 9, 10, 11, 12, 0],
        ["2022-10-18", 14, 15, 16, 17, 18, 1],
        ["2022-10-19", 15, 16, 17, 18, 19, 1],
    ]
df = pd.DataFrame(a,
                columns=["start_date", "age2", "age3", "age4",
                        "age5", "age6", "income"])
adataset = SimpleDatasetBuilder.dataset(
                features=df,
                label="income",
                cat_col=["age2", "age3", "income"],
                cont_col=["age4", "age5", "age6"],
                date_col=["start_date"],
                convert_date_cols=True,
                name="test_dataset")
Parameters:
  • features – Pandas dataframe containing the dataset features as columns.

  • target – Pandas dataframe containing the target feature as a column. This defaults to None in which case the target feature is assumed to be either the last column in features dataset or the column name specified in the label argument.

  • label – The name of the column containing the target. This defaults to None in which case the target is assumed to either be the last column of the features dataframe or the first column of the target dataframe if this is not None.

  • prediction – The name of the column containing the prediction data. This defaults to None in which case the assumption is that the dataset contains no prediction data.

  • cat_col – List of categorical features. This defaults to None in which case categorical features are determined automatically.

  • cont_col – List of continuous features. This defaults to None in which case continuous features are determined automatically.

  • id_col – List of id features. This defaults to None in which case it is assumed the dataset contains no id features.

  • date_col – List of datetime features. This defaults to None in which case it is assumed the dataset contains no datetime features.

  • train_valid_test_splits – This parameter specifies the proportions to use when splitting the data into training, validation and test subsets. This defaults to (0.8, 0.2, 0.0).

  • random_seed – Random number seed used when splitting the data into random training, validation and test subsets. This defaults to 2.

  • convert_date_cols – Set this to True to convert any date features into datetime objects. This defaults to False.

  • datetime_format – The specific datetime format (assumes a common datetime is used for all datetime features). This defaults to an empty string in which case the datetime format is guessed.

  • name – The name to use for the dataset. This defaults to None in which case a random name is assigned.

  • register_creation – Set this to True to register the dataset to the database (note that only a hash and/or fingerprint of the data is stored). This defaults to True.

Returns:

A SimpleDataset object.
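
The interplay of convert_date_cols and datetime_format can be sketched with the standard library. This is only an illustration: parse_dates and CANDIDATE_FORMATS are hypothetical names, the list of guessed formats is an assumption, and mapping unparseable values such as an empty string to None is also an assumption about behaviour this reference does not specify.

```python
from datetime import datetime

# Assumption: a small set of formats to try when no format is given.
CANDIDATE_FORMATS = ("%Y-%m-%d", "%d/%m/%Y", "%Y-%m-%d %H:%M:%S")


def parse_dates(values, datetime_format=""):
    """Convert date strings to datetime objects, or None when parsing fails."""
    formats = (datetime_format,) if datetime_format else CANDIDATE_FORMATS
    parsed = []
    for value in values:
        result = None
        for fmt in formats:
            try:
                result = datetime.strptime(value, fmt)
                break
            except ValueError:
                continue  # try the next candidate format
        parsed.append(result)
    return parsed


dates = parse_dates(["2022-10-10", "", "2022-10-13"])
```

Passing an explicit format, for example parse_dates(values, "%Y-%m-%d"), skips the guessing step in this sketch.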

classmethod datasets(training_features: DataFrame | None = None, training_target: DataFrame | None = None, validation_features: DataFrame | None = None, validation_target: DataFrame | None = None, testing_features: DataFrame | None = None, testing_target: DataFrame | None = None, label: str | None = None, prediction: str | None = None, cat_col: List[str] | None = None, cont_col: List[str] | None = None, id_col: List[str] | None = None, date_col: List[str] | None = None, convert_date_cols=False, datetime_format='', name: str | None = None, register_creation: bool = True) SimpleDataset

Creates a SimpleDataset object given pandas dataframe(s).

Use this builder like:

from etiq import SimpleDatasetBuilder
import pandas as pd
training = [
                ["2022-10-10", 2, 3, 4, 5, 6, 1],
                ["2022-10-11", 8, 9, 10, 11, 12, 0],
                ["", 2, 3, 4, 5, 6, 1],
                ["2022-10-13", 8, 9, 10, 11, 12, 0],
                ["2022-10-14", 2, 3, 4, 5, 6, 1]
            ]
validation = [
                ["2022-10-15", 8, 9, 10, 11, 12, 0],
                ["2022-10-16", 2, 3, 4, 5, 6, 1],
                ["2022-10-17", 8, 9, 10, 11, 12, 0],
                ["2022-10-18", 14, 15, 16, 17, 18, 1],
                ["2022-10-19", 15, 16, 17, 18, 19, 1]
            ]
df1 = pd.DataFrame(training,
                columns=["start_date", "age2", "age3", "age4",
                        "age5", "age6", "income"])
df2 = pd.DataFrame(validation,
                columns=["start_date", "age2", "age3", "age4",
                        "age5", "age6", "income"])
adataset = SimpleDatasetBuilder.datasets(
                training_features=df1,
                validation_features=df2,
                label="income",
                cat_col=["age2", "age3", "income"],
                cont_col=["age4", "age5", "age6"],
                date_col=["start_date"],
                convert_date_cols=True,
                name="test_dataset")
Parameters:
  • training_features – Pandas dataframe containing the training dataset features. This defaults to None in which case we assume there is no training data.

  • training_target – Pandas dataframe containing the target training data as a column. This defaults to None in which case the target feature is assumed to be either the last column in training features dataset or the column name specified in the label argument.

  • validation_features – Pandas dataframe containing the validation dataset features. This defaults to None in which case we assume there is no validation data.

  • validation_target – Pandas dataframe containing the target validation data as a column. This defaults to None in which case the target feature is assumed to be either the last column in validation features dataset or the column name specified in the label argument.

  • testing_features – Pandas dataframe containing the testing dataset features. This defaults to None in which case we assume there is no testing data.

  • testing_target – Pandas dataframe containing the target testing data as a column. This defaults to None in which case the target feature is assumed to be either the last column in testing features dataset or the column name specified in the label argument.

  • label – The name of the column containing the target. This defaults to None in which case the target is assumed to either be the last column of the features dataframe or the first column of the target dataframe if this is not None.

  • prediction – The name of the column containing the prediction data. This defaults to None in which case the assumption is that the dataset contains no prediction data.

  • cat_col – List of categorical features. This defaults to None in which case categorical features are determined automatically.

  • cont_col – List of continuous features. This defaults to None in which case continuous features are determined automatically.

  • id_col – List of id features. This defaults to None in which case it is assumed the dataset contains no id features.

  • date_col – List of datetime features. This defaults to None in which case it is assumed the dataset contains no datetime features.

  • convert_date_cols – Set this to True to convert any date features into datetime objects. This defaults to False.

  • datetime_format – The specific datetime format (assumes a common datetime is used for all datetime features). This defaults to an empty string in which case the datetime format is guessed.

  • name – The name to use for the dataset. This defaults to None in which case a random name is assigned.

  • register_creation – Set this to True to register the dataset to the database (note that only a hash and/or fingerprint of the data is stored). This defaults to True.

Returns:

A SimpleDataset object.
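
The fallback rule described for the label and target parameters can be written out as a small function. This is a sketch of the documented rule only; resolve_label is a hypothetical helper and the precedence between an explicit label and a target dataframe is an assumption.

```python
def resolve_label(feature_columns, target_columns=None, label=None):
    """Pick the target column name following the documented defaults."""
    if label is not None:
        return label  # an explicit label is used as given
    if target_columns:
        return target_columns[0]  # first column of the target dataframe
    return feature_columns[-1]  # last column of the features dataframe


feature_columns = ["start_date", "age2", "age3", "age4", "age5", "age6", "income"]
default_label = resolve_label(feature_columns)
from_target = resolve_label(feature_columns, target_columns=["outcome"])
explicit = resolve_label(feature_columns, label="age2")
```

In the usage examples above label="income" is passed explicitly, which matches what the default would pick because income is the last column.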

Module contents

class etiq.datasets.builders.BiasDatasetBuilder

Bases: SimpleDatasetBuilder

classmethod bias_params(bias_params: BiasParams | None = None, protected: str | None = None, privileged: Any | None = None, unprivileged: Any | None = None, positive_outcome_label: Any | None = None, negative_outcome_label: Any | None = None)

Returns a BiasParams object

Parameters:
  • bias_params – A bias params object to clone. This defaults to None if not provided.

  • protected – Protected feature name for example ‘gender’. Defaults to None.

  • privileged – Privileged label within the protected feature for example ‘male’. This defaults to None if not provided.

  • unprivileged – Unprivileged label within the protected feature for example ‘female’. This defaults to None if not provided.

  • positive_outcome_label – The label of a “positive” outcome within a target feature. Defaults to None.

  • negative_outcome_label – The label of a “negative” outcome within a target feature. Defaults to None.

Returns:

A BiasParams object

classmethod dataset(features: DataFrame, target: DataFrame | None = None, label: str | None = None, prediction: str | None = None, cat_col: List[str] | None = None, cont_col: List[str] | None = None, train_valid_test_splits: Tuple[float, float, float] = (0.8, 0.2, 0.0), id_col: List[str] | None = None, date_col: List[str] | None = None, bias_params: BiasParams | None = None, convert_date_cols: bool = False, datetime_format: str = '', remove_protected_from_features: bool = True, random_seed: int = 2, name: str = None, register_creation: bool = True) BiasDataset

Creates a BiasDataset object given pandas dataframe(s).

Use the dataset builder like this:

from etiq import BiasDatasetBuilder
from etiq.biasparams import BiasParams
import pandas as pd
a = [
        ["2022-10-10", 'M', 2, 3, 4, 5, 6, 1],
        ["2022-10-11", 'F', 8, 9, 10, 11, 12, 0],
        ["", 'F', 2, 3, 4, 5, 6, 1],
        ["2022-10-13", 'M', 8, 9, 10, 11, 12, 0],
        ["2022-10-14", 'F', 2, 3, 4, 5, 6, 1],
        ["2022-10-15", 'F', 8, 9, 10, 11, 12, 0],
        ["2022-10-16", 'M', 2, 3, 4, 5, 6, 1],
        ["2022-10-17", 'F', 8, 9, 10, 11, 12, 0],
        ["2022-10-18", 'M', 14, 15, 16, 17, 18, 1],
        ["2022-10-19", 'M', 15, 16, 17, 18, 19, 1],
    ]
df = pd.DataFrame(a,
                columns=["start_date", "gender", "age2", "age3", "age4",
                        "age5", "age6", "income"])
adataset = BiasDatasetBuilder.dataset(
                features=df,
                label="income",
                cat_col=["age2", "age3", "income"],
                cont_col=["age4", "age5", "age6"],
                date_col=["start_date"],
                bias_params = BiasParams(protected='gender',
                                         privileged='M',
                                         unprivileged='F',
                                         positive_outcome_label= 1,
                                         negative_outcome_label= 0),
                remove_protected_from_features = True,
                convert_date_cols=True,
                name="test_dataset")
Parameters:
  • features – Pandas dataframe containing the dataset features as columns.

  • target – Pandas dataframe containing the target feature as a column. This defaults to None in which case the target feature is assumed to be either the last column in features dataset or the column name specified in the label argument.

  • label – The name of the column containing the target. This defaults to None in which case the target is assumed to either be the last column of the features dataframe or the first column of the target dataframe if this is not None.

  • prediction – The name of the column containing the prediction data. This defaults to None in which case the assumption is that the dataset contains no prediction data.

  • cat_col – List of categorical features. This defaults to None in which case categorical features are determined automatically.

  • cont_col – List of continuous features. This defaults to None in which case continuous features are determined automatically.

  • id_col – List of id features. This defaults to None in which case it is assumed the dataset contains no id features.

  • date_col – List of datetime features. This defaults to None in which case it is assumed the dataset contains no datetime features.

  • bias_params – This contains demographic data (the protected feature) needed to create the bias dataset. This defaults to None in which case a fake random protected feature is created.

  • train_valid_test_splits – This parameter specifies the proportions to use when splitting the data into training, validation and test subsets. This defaults to (0.8, 0.2, 0.0).

  • random_seed – Random number seed (for reproducibility) used when splitting the data into random training, validation and test subsets. This defaults to 2.

  • remove_protected_from_features – Set this to True to remove the protected feature from the normal features, so that it is not treated as a feature used by the model. Otherwise the protected feature is treated as a normal feature. This defaults to True.

  • convert_date_cols – Set this to True to convert any date features into datetime objects. This defaults to False.

  • datetime_format – The specific datetime format (assumes a common datetime is used for all datetime features). This defaults to an empty string in which case the datetime format is guessed.

  • name – The name to use for the dataset. This defaults to None in which case a random name is assigned.

  • register_creation – Set this to True to register the dataset to the database (note that only a hash and/or fingerprint of the data is stored). This defaults to True.

Returns:

A BiasDataset object.

classmethod datasets(training_features: DataFrame | None = None, training_target: DataFrame | None = None, validation_features: DataFrame | None = None, validation_target: DataFrame | None = None, testing_features: DataFrame | None = None, testing_target: DataFrame | None = None, label: str | None = None, prediction: str | None = None, cat_col: List[str] | None = None, cont_col: List[str] | None = None, bias_params: BiasParams | None = None, remove_protected_from_features: bool = True, id_col: List[str] | None = None, date_col: List[str] | None = None, convert_date_cols: bool = False, datetime_format: str = '', name: str | None = None, register_creation: bool = True) BiasDataset

Creates a BiasDataset object given pandas dataframe(s).

Use this builder like:

from etiq import BiasDatasetBuilder
from etiq.biasparams import BiasParams
import pandas as pd
training = [
        ["2022-10-10", 'M', 2, 3, 4, 5, 6, 1],
        ["2022-10-11", 'F', 8, 9, 10, 11, 12, 0],
        ["", 'F', 2, 3, 4, 5, 6, 1],
        ["2022-10-13", 'M', 8, 9, 10, 11, 12, 0],
        ["2022-10-14", 'F', 2, 3, 4, 5, 6, 1]
        ]
validation = [
        ["2022-10-15", 'F', 8, 9, 10, 11, 12, 0],
        ["2022-10-16", 'M', 2, 3, 4, 5, 6, 1],
        ["2022-10-17", 'F', 8, 9, 10, 11, 12, 0],
        ["2022-10-18", 'M', 14, 15, 16, 17, 18, 1],
        ["2022-10-19", 'M', 15, 16, 17, 18, 19, 1]
        ]
df1 = pd.DataFrame(training,
                columns=["start_date", "gender", "age2", "age3", "age4",
                        "age5", "age6", "income"])
df2 = pd.DataFrame(validation,
                columns=["start_date", "gender", "age2", "age3", "age4",
                        "age5", "age6", "income"])
adataset = BiasDatasetBuilder.datasets(
                training_features=df1,
                validation_features=df2,
                label="income",
                cat_col=["age2", "age3", "income"],
                cont_col=["age4", "age5", "age6"],
                date_col=["start_date"],
                bias_params = BiasParams(protected='gender',
                                         privileged='M',
                                         unprivileged='F',
                                         positive_outcome_label= 1,
                                         negative_outcome_label= 0),
                remove_protected_from_features = True,
                convert_date_cols=True,
                name="test_dataset")
Parameters:
  • training_features – Pandas dataframe containing the training dataset features. This defaults to None in which case we assume there is no training data.

  • training_target – Pandas dataframe containing the target training data as a column. This defaults to None in which case the target feature is assumed to be either the last column in training features dataset or the column name specified in the label argument.

  • validation_features – Pandas dataframe containing the validation dataset features. This defaults to None in which case we assume there is no validation data.

  • validation_target – Pandas dataframe containing the target validation data as a column. This defaults to None in which case the target feature is assumed to be either the last column in validation features dataset or the column name specified in the label argument.

  • testing_features – Pandas dataframe containing the testing dataset features. This defaults to None in which case we assume there is no testing data.

  • testing_target – Pandas dataframe containing the target testing data as a column. This defaults to None in which case the target feature is assumed to be either the last column in testing features dataset or the column name specified in the label argument.

  • label – The name of the column containing the target. This defaults to None in which case the target is assumed to either be the last column of the features dataframe or the first column of the target dataframe if this is not None.

  • prediction – The name of the column containing the prediction data. This defaults to None in which case the assumption is that the dataset contains no prediction data.

  • cat_col – List of categorical features. This defaults to None in which case categorical features are determined automatically.

  • cont_col – List of continuous features. This defaults to None in which case continuous features are determined automatically.

  • id_col – List of id features. This defaults to None in which case it is assumed the dataset contains no id features.

  • date_col – List of datetime features. This defaults to None in which case it is assumed the dataset contains no datetime features.

  • bias_params – This contains demographic data (the protected feature) needed to create the bias dataset. This defaults to None in which case a fake random protected feature is created.

  • train_valid_test_splits – This parameter specifies the proportions to use when splitting the data into training, validation and test subsets. This defaults to (0.8, 0.2, 0.0).

  • random_seed – Random number seed used when splitting the data into random training, validation and test subsets. This defaults to 2.

  • remove_protected_from_features – Set this to True to remove the protected feature from the normal features, so that it is not treated as a feature used by the model. Otherwise the protected feature is treated as a normal feature. This defaults to True.

  • convert_date_cols – This is set to True in order to convert an date features into datetime objects. This defaults to False.

  • datetime_format – The specific datetime format (assumes a common datetime is used for all datetime features). This defaults to an empty string in which case the datetime format is guessed.

  • name – The name to use for the dataset. This defaults to None in which case a random name is assigned.

  • register_creation – This is set to True to enable the dataset to be registered to the database (note that only a hash and/or fingerprint of the data is stored). This defaults to True.

Returns:

A BiasDataset object.
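
The effect described for remove_protected_from_features can be pictured with plain pandas. The following is a minimal sketch, not etiq's implementation: the helper split_protected, the column names and the data are all hypothetical and exist only to illustrate the documented behaviour.

```python
import pandas as pd

# Hypothetical data; "gender" plays the role of the protected feature.
df = pd.DataFrame({
    "gender": ["M", "F", "F", "M"],
    "age": [31, 45, 27, 52],
    "income": [1, 0, 1, 0],
})


def split_protected(features, protected, remove_protected_from_features=True):
    """Illustrative helper (not part of etiq).

    Returns the features a model would see together with the protected column.
    """
    protected_series = features[protected]
    if remove_protected_from_features:
        # The protected column is kept aside and hidden from the model features.
        model_features = features.drop(columns=[protected])
    else:
        # The protected column stays in place as an ordinary feature.
        model_features = features.copy()
    return model_features, protected_series


model_features, protected_series = split_protected(df, "gender")
kept_features, _ = split_protected(
    df, "gender", remove_protected_from_features=False
)
```

In this sketch model_features no longer contains the gender column, while kept_features still does; in both cases the protected values remain available separately.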

class etiq.datasets.builders.SimpleDatasetBuilder

Bases: object

A builder for the SimpleDataset class

classmethod dataset(features: DataFrame, target: DataFrame | None = None, label: str | None = None, prediction: str | None = None, cat_col: List[str] | None = None, cont_col: List[str] | None = None, id_col: List[str] | None = None, date_col: List[str] | None = None, train_valid_test_splits: Tuple[float, float, float] = (0.8, 0.2, 0.0), random_seed: int = 2, convert_date_cols: bool = False, datetime_format: str = '', name: str | None = None, register_creation: bool = True) SimpleDataset

Creates a SimpleDataset object given pandas dataframe(s).

Use this builder like:

from etiq import SimpleDatasetBuilder
import pandas as pd
a = [
        ["2022-10-10", 2, 3, 4, 5, 6, 1],
        ["2022-10-11", 8, 9, 10, 11, 12, 0],
        ["", 2, 3, 4, 5, 6, 1],
        ["2022-10-13", 8, 9, 10, 11, 12, 0],
        ["2022-10-14", 2, 3, 4, 5, 6, 1],
        ["2022-10-15", 8, 9, 10, 11, 12, 0],
        ["2022-10-16", 2, 3, 4, 5, 6, 1],
        ["2022-10-17", 8, 9, 10, 11, 12, 0],
        ["2022-10-18", 14, 15, 16, 17, 18, 1],
        ["2022-10-19", 15, 16, 17, 18, 19, 1],
    ]
df = pd.DataFrame(a,
                columns=["start_date", "age2", "age3", "age4",
                        "age5", "age6", "income"])
adataset = SimpleDatasetBuilder.dataset(
                features=df,
                label="income",
                cat_col=["age2", "age3", "income"],
                cont_col=["age4", "age5", "age6"],
                date_col=["start_date"],
                convert_date_cols=True,
                name="test_dataset")
Parameters:
  • features – Pandas dataframe containing the dataset features as columns.

  • target – Pandas dataframe containing the target feature as a column. This defaults to None in which case the target feature is assumed to be either the last column in the features dataset or the column name specified in the label argument.

  • label – The name of the column containing the target. This defaults to None in which case the target is assumed to be either the last column of the features dataframe or, if a target dataframe is provided, the first column of that dataframe.

  • prediction – The name of the column containing the prediction data. This defaults to None in which case the assumption is that the dataset contains no prediction data.

  • cat_col – List of categorical features. This defaults to None in which case categorical features are determined automatically.

  • cont_col – List of continuous features. This defaults to None in which case continuous features are determined automatically.

  • id_col – List of id features. This defaults to None in which case it is assumed the dataset contains no id features.

  • date_col – List of datetime features. This defaults to None in which case it is assumed the dataset contains no datetime features.

  • train_valid_test_splits – This parameter specifies the proportions to use when splitting the data into training, validation and test subsets. This defaults to (0.8, 0.2, 0.0).

  • random_seed – Random number seed used when splitting the data into random training, validation and test subsets. This defaults to 2.

  • convert_date_cols – This is set to True in order to convert any date features into datetime objects. This defaults to False.

  • datetime_format – The specific datetime format (assumes a common datetime is used for all datetime features). This defaults to an empty string in which case the datetime format is guessed.

  • name – The name to use for the dataset. This defaults to None in which case a random name is assigned.

  • register_creation – This is set to True to enable the dataset to be registered to the database (note that only a hash and/or fingerprint of the data is stored). This defaults to True.

Returns:

A SimpleDataset object.
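
To make the roles of train_valid_test_splits and random_seed concrete, here is a minimal pandas sketch of a seeded proportional split. It is an assumption about the general technique, not a description of how etiq splits data internally; the helper split_by_proportions and the sample data are hypothetical.

```python
import pandas as pd


def split_by_proportions(df, splits=(0.8, 0.2, 0.0), random_seed=2):
    """Illustrative helper (not part of etiq).

    Shuffles the rows with a fixed seed, then cuts them by proportion.
    """
    shuffled = df.sample(frac=1.0, random_state=random_seed)
    n = len(shuffled)
    n_train = int(round(splits[0] * n))
    n_valid = int(round(splits[1] * n))
    train = shuffled.iloc[:n_train]
    valid = shuffled.iloc[n_train:n_train + n_valid]
    # Whatever remains after training and validation forms the test subset.
    test = shuffled.iloc[n_train + n_valid:]
    return train, valid, test


df = pd.DataFrame({"x": range(10), "income": [1, 0] * 5})
train, valid, test = split_by_proportions(df)
print(len(train), len(valid), len(test))  # 8 2 0
```

With the default proportions and ten rows, the sketch yields eight training rows, two validation rows and an empty test subset. Reusing the same seed reproduces the same partition.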

classmethod datasets(training_features: DataFrame | None = None, training_target: DataFrame | None = None, validation_features: DataFrame | None = None, validation_target: DataFrame | None = None, testing_features: DataFrame | None = None, testing_target: DataFrame | None = None, label: str | None = None, prediction: str | None = None, cat_col: List[str] | None = None, cont_col: List[str] | None = None, id_col: List[str] | None = None, date_col: List[str] | None = None, convert_date_cols=False, datetime_format='', name: str | None = None, register_creation: bool = True) SimpleDataset

Creates a SimpleDataset object from separate training, validation and testing pandas dataframe(s).

Use this builder like:

from etiq import SimpleDatasetBuilder
import pandas as pd
training = [
                ["2022-10-10", 2, 3, 4, 5, 6, 1],
                ["2022-10-11", 8, 9, 10, 11, 12, 0],
                ["", 2, 3, 4, 5, 6, 1],
                ["2022-10-13", 8, 9, 10, 11, 12, 0],
                ["2022-10-14", 2, 3, 4, 5, 6, 1]
            ]
validation = [
                ["2022-10-15", 8, 9, 10, 11, 12, 0],
                ["2022-10-16", 2, 3, 4, 5, 6, 1],
                ["2022-10-17", 8, 9, 10, 11, 12, 0],
                ["2022-10-18", 14, 15, 16, 17, 18, 1],
                ["2022-10-19", 15, 16, 17, 18, 19, 1]
            ]
df1 = pd.DataFrame(training,
                columns=["start_date", "age2", "age3", "age4",
                        "age5", "age6", "income"])
df2 = pd.DataFrame(validation,
                columns=["start_date", "age2", "age3", "age4",
                        "age5", "age6", "income"])
adataset = SimpleDatasetBuilder.datasets(
                training_features=df1,
                validation_features=df2,
                label="income",
                cat_col=["age2", "age3", "income"],
                cont_col=["age4", "age5", "age6"],
                date_col=["start_date"],
                convert_date_cols=True,
                name="test_dataset")
Parameters:
  • training_features – Pandas dataframe containing the training dataset features. This defaults to None in which case we assume there is no training data.

  • training_target – Pandas dataframe containing the target training data as a column. This defaults to None in which case the target feature is assumed to be either the last column in the training features dataset or the column name specified in the label argument.

  • validation_features – Pandas dataframe containing the validation dataset features. This defaults to None in which case we assume there is no validation data.

  • validation_target – Pandas dataframe containing the target validation data as a column. This defaults to None in which case the target feature is assumed to be either the last column in the validation features dataset or the column name specified in the label argument.

  • testing_features – Pandas dataframe containing the testing dataset features. This defaults to None in which case we assume there is no testing data.

  • testing_target – Pandas dataframe containing the target testing data as a column. This defaults to None in which case the target feature is assumed to be either the last column in the testing features dataset or the column name specified in the label argument.

  • label – The name of the column containing the target. This defaults to None in which case the target is assumed to be either the last column of the features dataframe or, if a target dataframe is provided, the first column of that dataframe.

  • prediction – The name of the column containing the prediction data. This defaults to None in which case the assumption is that the dataset contains no prediction data.

  • cat_col – List of categorical features. This defaults to None in which case categorical features are determined automatically.

  • cont_col – List of continuous features. This defaults to None in which case continuous features are determined automatically.

  • id_col – List of id features. This defaults to None in which case it is assumed the dataset contains no id features.

  • date_col – List of datetime features. This defaults to None in which case it is assumed the dataset contains no datetime features.

  • convert_date_cols – This is set to True in order to convert any date features into datetime objects. This defaults to False.

  • datetime_format – The specific datetime format (assumes a common datetime is used for all datetime features). This defaults to an empty string in which case the datetime format is guessed.

  • name – The name to use for the dataset. This defaults to None in which case a random name is assigned.

  • register_creation – This is set to True to enable the dataset to be registered to the database (note that only a hash and/or fingerprint of the data is stored). This defaults to True.

Returns:

A SimpleDataset object.
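
The convert_date_cols and datetime_format options can be understood by analogy with pandas.to_datetime. The sketch below is only that analogy, assuming a conversion of this general kind; it does not claim to show what etiq calls internally. It reuses the start_date column name from the examples above, including the empty entry.

```python
import pandas as pd

df = pd.DataFrame({"start_date": ["2022-10-10", "", "2022-10-13"]})

# Explicit format, comparable to passing a non-empty datetime_format.
# errors="coerce" turns entries that cannot be parsed into NaT.
explicit = pd.to_datetime(df["start_date"], format="%Y-%m-%d", errors="coerce")

# No format given, comparable to an empty datetime_format:
# pandas infers the format from the data.
guessed = pd.to_datetime(df["start_date"], errors="coerce")

print(explicit.isna().tolist())  # [False, True, False]
```

The empty string in the sample data becomes a missing timestamp (NaT) rather than raising an error, which is worth keeping in mind when date columns contain blanks.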