butterfree.load package

Submodules

butterfree.load.sink module

Holds the Sink class.

class butterfree.load.sink.Sink(writers: List[butterfree.load.writers.writer.Writer], validation: butterfree.validations.validation.Validation = None)

Bases: object

Define the destinations for the feature set pipeline.

A Sink is created from a set of writers. The main goal of the Sink is to trigger the load in each defined writer. After the load, the validate method can be used to make sure that all data was written properly.

writers

list of Writers to use to load the data.

validation

validation to check the data before starting to write.
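For illustration, a minimal sketch of building a Sink. The writers below are butterfree's built-in historical and online writers; using them with their default configurations is an assumption, so adapt the configs to your deployment:

    from butterfree.load import Sink
    from butterfree.load.writers import (
        HistoricalFeatureStoreWriter,  # batch load to the historical store
        OnlineFeatureStoreWriter,      # latest-state load to the online store
    )

    # Default configurations are assumed here; real deployments usually
    # pass explicit db_config objects to each writer.
    sink = Sink(
        writers=[
            HistoricalFeatureStoreWriter(),
            OnlineFeatureStoreWriter(),
        ]
    )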

flush(feature_set: butterfree.transform.feature_set.FeatureSet, dataframe: pyspark.sql.dataframe.DataFrame, spark_client: butterfree.clients.spark_client.SparkClient) → List[pyspark.sql.streaming.StreamingQuery]

Trigger a write job in all the defined Writers.

Parameters
  • feature_set – object processed with feature set metadata.

  • dataframe – Spark dataframe containing data from a feature set.

  • spark_client – client used to run a query.

Returns

Streaming query handlers, one per writer, when writing streaming dataframes.
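A sketch of calling flush. The sink, feature_set, and dataframe names are assumptions standing in for objects built in earlier pipeline stages:

    from butterfree.clients import SparkClient

    spark_client = SparkClient()

    # Triggers every writer's load job for this feature set's dataframe.
    handlers = sink.flush(
        feature_set=feature_set,
        dataframe=dataframe,
        spark_client=spark_client,
    )

    # Handlers are only returned for streaming dataframes; block until
    # each streaming write terminates.
    for handler in handlers or []:
        handler.awaitTermination()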

validate(feature_set: butterfree.transform.feature_set.FeatureSet, dataframe: pyspark.sql.dataframe.DataFrame, spark_client: butterfree.clients.spark_client.SparkClient)

Trigger a validation job in all the defined Writers.

Parameters
  • feature_set – object processed with feature set metadata.

  • dataframe – Spark dataframe containing data from a feature set.

  • spark_client – client used to run a query.

Raises

RuntimeError – if any of the Writers returns a failed validation.
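A sketch of validating after a flush, reusing the objects from the example above. Since a failed validation surfaces as a RuntimeError, callers typically wrap the call:

    try:
        sink.validate(
            feature_set=feature_set,
            dataframe=dataframe,
            spark_client=spark_client,
        )
    except RuntimeError as error:
        # At least one writer could not confirm the data was written.
        print(f"Sink validation failed: {error}")
        raise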

property validation

Validation to check the data before starting to write.

property writers

List of Writers to use to load the data.

Module contents

Holds the Sink component of a feature set pipeline.

class butterfree.load.Sink(writers: List[butterfree.load.writers.writer.Writer], validation: butterfree.validations.validation.Validation = None)

Bases: object

Define the destinations for the feature set pipeline.

A Sink is created from a set of writers. The main goal of the Sink is to trigger the load in each defined writer. After the load, the validate method can be used to make sure that all data was written properly.

writers

list of Writers to use to load the data.

validation

validation to check the data before starting to write.
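A sketch of attaching a validation at construction time. The BasicValidation import is an assumption about butterfree's built-in count-check validation; substitute your own Validation subclass as needed:

    from butterfree.load import Sink
    from butterfree.load.writers import HistoricalFeatureStoreWriter
    from butterfree.validations import BasicValidation  # assumed built-in

    sink = Sink(
        writers=[HistoricalFeatureStoreWriter()],
        validation=BasicValidation(),
    )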

flush(feature_set: butterfree.transform.feature_set.FeatureSet, dataframe: pyspark.sql.dataframe.DataFrame, spark_client: butterfree.clients.spark_client.SparkClient) → List[pyspark.sql.streaming.StreamingQuery]

Trigger a write job in all the defined Writers.

Parameters
  • feature_set – object processed with feature set metadata.

  • dataframe – Spark dataframe containing data from a feature set.

  • spark_client – client used to run a query.

Returns

Streaming query handlers, one per writer, when writing streaming dataframes.

validate(feature_set: butterfree.transform.feature_set.FeatureSet, dataframe: pyspark.sql.dataframe.DataFrame, spark_client: butterfree.clients.spark_client.SparkClient)

Trigger a validation job in all the defined Writers.

Parameters
  • feature_set – object processed with feature set metadata.

  • dataframe – Spark dataframe containing data from a feature set.

  • spark_client – client used to run a query.

Raises

RuntimeError – if any of the Writers returns a failed validation.

property validation

Validation to check the data before starting to write.

property writers

List of Writers to use to load the data.
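For context, a sketch of where a Sink fits in a full feature set pipeline. The source and feature_set objects are assumptions standing in for real pipeline definitions:

    from butterfree.pipelines import FeatureSetPipeline

    # The pipeline extracts from source, transforms into feature_set,
    # and finishes by flushing (and validating) through the sink.
    pipeline = FeatureSetPipeline(
        source=source,
        feature_set=feature_set,
        sink=sink,
    )
    pipeline.run()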