butterfree.transform.features package

Submodules

Feature entity.

class butterfree.transform.features.feature.Feature(name, description, dtype=None, from_column=None, transformation=None)

Bases: object

Defines a Feature.

A Feature is the result of a transformation over one (or more) data columns over an input dataframe. Transformations can be as simple as renaming, casting types, mathematical expressions or complex functions/models.

name

feature name. Can be used by the transformation to derive multiple output columns.

description

brief explanation regarding the feature.

dtype

data type for the output columns of this feature.

from_column

original column used to build the feature. Used when there is no transformation, or when the transformation has no reference to the column it should use.

transformation

transformation that will be applied to create this feature.
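
As a rough illustration of the simplest case (deriving a named, typed output column from an original column), the following plain-Python sketch mimics the idea on a list of dicts. It does not use butterfree or Spark; apply_feature and the sample rows are hypothetical names introduced only for this example.

```python
from typing import Any, Callable, Dict, List

Row = Dict[str, Any]


def apply_feature(
    rows: List[Row], name: str, from_column: str, cast: Callable[[Any], Any]
) -> List[Row]:
    """Return new rows with an extra `name` column derived from `from_column`.

    Hypothetical helper: it only mirrors the "rename + cast" idea and is
    not how butterfree implements Feature.transform.
    """
    return [{**row, name: cast(row[from_column])} for row in rows]


# Sample input: the price arrives as text and should become a float feature.
rows = [{"id": 1, "price_str": "10.5"}, {"id": 2, "price_str": "7"}]
result = apply_feature(rows, name="price", from_column="price_str", cast=float)
print(result[0]["price"])
```

The input rows are left untouched; each output row is a copy with the new column added.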

property dtype

Attribute dtype getter.

Returns

The data type for this feature.

get_output_columns() → List[str]

Get output columns that will be generated by this feature engineering.

Returns

Output column names.

transform(dataframe: pyspark.sql.dataframe.DataFrame) → pyspark.sql.dataframe.DataFrame

Performs a transformation to the feature pipeline.

Parameters

dataframe – input dataframe for the transformation.

Returns

Transformed dataframe.

property transformation

Attribute transformation getter.

Returns

A transformation for this feature.

KeyFeature entity.

class butterfree.transform.features.key_feature.KeyFeature(name: str, description: str, dtype: butterfree.constants.data_type.DataType, from_column: str = None, transformation: butterfree.transform.transformations.transform_component.TransformComponent = None)

Bases: butterfree.transform.features.feature.Feature

Defines a KeyFeature.

A FeatureSet must contain one or more KeyFeatures, which will be used as keys when storing the feature set dataframe as tables. The FeatureSet may validate that keys are unique for the latest state of a feature set.

name

key name. Can be used by the transformation to derive multiple key columns.

description

brief explanation regarding the key.

dtype

data type for the output column of this key.

from_column

original column used to build the key. Used when there is no transformation, or when the transformation has no reference to the column it should use.

transformation

transformation that will be applied to create this key. Keys can be derived by transformations over any data column, for example a location hash based on latitude and longitude.
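
To make the idea of a key derived from latitude and longitude concrete, here is a minimal plain-Python sketch. It is only an assumption about what such a derivation could look like: location_key, the rounding precision and the choice of SHA-256 are all hypothetical and are not part of butterfree.

```python
import hashlib


def location_key(latitude: float, longitude: float, precision: int = 4) -> str:
    """Build a deterministic key from a coordinate pair.

    Hypothetical helper: coordinates are rounded to `precision` decimals so
    that tiny float differences map to the same key, then hashed with SHA-256.
    """
    text = f"{latitude:.{precision}f},{longitude:.{precision}f}"
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


key_a = location_key(-23.5505, -46.6333)
key_b = location_key(-23.55050001, -46.63330001)
print(key_a == key_b)  # both inputs round to the same 4-decimal coordinates
```

In a real pipeline this kind of logic would live inside the transformation passed to the KeyFeature, which would then be responsible for producing the key column.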

TimestampFeature entity.

class butterfree.transform.features.timestamp_feature.TimestampFeature(from_column: str = None, transformation: butterfree.transform.transformations.transform_component.TransformComponent = None, from_ms: bool = False, mask: str = None)

Bases: butterfree.transform.features.feature.Feature

Defines a TimestampFeature.

A FeatureSet must contain one TimestampFeature, which will be used as a time tag for the state of all features. By containing a timestamp feature, users may time travel over their features. The Feature Set may validate that the set of keys and timestamp are unique for a feature set.

By defining a TimestampFeature, the feature set will always contain a data column called “timestamp” of TimestampType (spark dtype).
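
The two ideas above, uniqueness of the key and timestamp combination and time travel over feature states, can be sketched in plain Python as follows. This is an illustrative sketch only: has_unique_keys, state_as_of and the sample rows are hypothetical, and butterfree performs its own checks on Spark dataframes.

```python
from datetime import datetime
from typing import Any, Dict, List, Optional

Row = Dict[str, Any]


def has_unique_keys(rows: List[Row], key: str) -> bool:
    """True when no (key, timestamp) pair appears more than once."""
    pairs = [(row[key], row["timestamp"]) for row in rows]
    return len(pairs) == len(set(pairs))


def state_as_of(
    rows: List[Row], key: str, key_value: Any, moment: datetime
) -> Optional[Row]:
    """Return the most recent row for `key_value` at or before `moment`."""
    candidates = [
        row for row in rows if row[key] == key_value and row["timestamp"] <= moment
    ]
    return max(candidates, key=lambda row: row["timestamp"], default=None)


rows = [
    {"user_id": 1, "timestamp": datetime(2020, 1, 1), "orders": 3},
    {"user_id": 1, "timestamp": datetime(2020, 2, 1), "orders": 5},
    {"user_id": 2, "timestamp": datetime(2020, 1, 15), "orders": 1},
]
# "Time travel": what did user 1 look like on 2020-01-20?
snapshot = state_as_of(rows, "user_id", 1, datetime(2020, 1, 20))
print(snapshot)
```

Because every row carries a timestamp, older states are never overwritten, which is what makes the as-of lookup possible.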

from_column

original column used to build the “timestamp” feature column. Used when there is no transformation, or when the transformation has no reference to the column it should use. If from_column is None, the FeatureSet will assume the input dataframe already has a data column called “timestamp”.

transformation

transformation that will be applied to create the “timestamp”. Type casting happens automatically when no transformation is given. A timestamp can also be derived from multiple columns, such as year, month and day. The transformation must always handle naming and typing.

from_ms

true if the timestamp column is expressed in milliseconds. A conversion is then performed.

mask

timestamp format specified by the user.
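
The two options above correspond to two common ways of turning raw values into timestamps. The sketch below shows both in plain Python, under the assumption that the millisecond values are Unix epochs; from_milliseconds and from_mask are hypothetical helpers. Note that it uses Python strptime directives, whereas a mask passed to butterfree would presumably follow Spark's datetime pattern syntax instead.

```python
from datetime import datetime, timezone


def from_milliseconds(epoch_ms: int) -> datetime:
    """Convert a Unix epoch expressed in milliseconds to a UTC datetime."""
    return datetime.fromtimestamp(epoch_ms / 1000, tz=timezone.utc)


def from_mask(text: str, mask: str) -> datetime:
    """Parse a timestamp string using an explicit format.

    The format here uses Python strptime directives (for example "%Y-%m-%d"),
    chosen only so the example runs without Spark.
    """
    return datetime.strptime(text, mask)


print(from_milliseconds(1_577_836_800_000).isoformat())
print(from_mask("2020-01-01 12:30", "%Y-%m-%d %H:%M").isoformat())
```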

transform(dataframe: pyspark.sql.dataframe.DataFrame) → pyspark.sql.dataframe.DataFrame

Performs a transformation to the feature pipeline.

Parameters

dataframe – input dataframe for the transformation.

Returns

Transformed dataframe.

Module contents

Holds all feature types to be part of a FeatureSet.

class butterfree.transform.features.Feature(name, description, dtype=None, from_column=None, transformation=None)

Bases: object

Defines a Feature.

A Feature is the result of a transformation over one (or more) data columns over an input dataframe. Transformations can be as simple as renaming, casting types, mathematical expressions or complex functions/models.

name

feature name. Can be used by the transformation to derive multiple output columns.

description

brief explanation regarding the feature.

dtype

data type for the output columns of this feature.

from_column

original column used to build the feature. Used when there is no transformation, or when the transformation has no reference to the column it should use.

transformation

transformation that will be applied to create this feature.

property dtype

Attribute dtype getter.

Returns

The data type for this feature.

get_output_columns() → List[str]

Get output columns that will be generated by this feature engineering.

Returns

Output column names.

transform(dataframe: pyspark.sql.dataframe.DataFrame) → pyspark.sql.dataframe.DataFrame

Performs a transformation to the feature pipeline.

Parameters

dataframe – input dataframe for the transformation.

Returns

Transformed dataframe.

property transformation

Attribute transformation getter.

Returns

A transformation for this feature.

class butterfree.transform.features.KeyFeature(name: str, description: str, dtype: butterfree.constants.data_type.DataType, from_column: str = None, transformation: butterfree.transform.transformations.transform_component.TransformComponent = None)

Bases: butterfree.transform.features.feature.Feature

Defines a KeyFeature.

A FeatureSet must contain one or more KeyFeatures, which will be used as keys when storing the feature set dataframe as tables. The FeatureSet may validate that keys are unique for the latest state of a feature set.

name

key name. Can be used by the transformation to derive multiple key columns.

description

brief explanation regarding the key.

dtype

data type for the output column of this key.

from_column

original column used to build the key. Used when there is no transformation, or when the transformation has no reference to the column it should use.

transformation

transformation that will be applied to create this key. Keys can be derived by transformations over any data column, for example a location hash based on latitude and longitude.

class butterfree.transform.features.TimestampFeature(from_column: str = None, transformation: butterfree.transform.transformations.transform_component.TransformComponent = None, from_ms: bool = False, mask: str = None)

Bases: butterfree.transform.features.feature.Feature

Defines a TimestampFeature.

A FeatureSet must contain one TimestampFeature, which will be used as a time tag for the state of all features. By containing a timestamp feature, users may time travel over their features. The Feature Set may validate that the set of keys and timestamp are unique for a feature set.

By defining a TimestampFeature, the feature set will always contain a data column called “timestamp” of TimestampType (spark dtype).

from_column

original column used to build the “timestamp” feature column. Used when there is no transformation, or when the transformation has no reference to the column it should use. If from_column is None, the FeatureSet will assume the input dataframe already has a data column called “timestamp”.

transformation

transformation that will be applied to create the “timestamp”. Type casting happens automatically when no transformation is given. A timestamp can also be derived from multiple columns, such as year, month and day. The transformation must always handle naming and typing.

from_ms

true if the timestamp column is expressed in milliseconds. A conversion is then performed.

mask

timestamp format specified by the user.

transform(dataframe: pyspark.sql.dataframe.DataFrame) → pyspark.sql.dataframe.DataFrame

Performs a transformation to the feature pipeline.

Parameters

dataframe – input dataframe for the transformation.

Returns

Transformed dataframe.