butterfree.extract package

Submodules

butterfree.extract.source module

Holds the Source class.

class butterfree.extract.source.Source(readers: List[butterfree.extract.readers.reader.Reader], query: str)

Bases: object

The definition of the entry point data for the ETL pipeline.

A FeatureSet (the next step in the pipeline) expects a single dataframe as input. This dataframe is built by composing data from one or more readers defined in the Source. There is only one Source per pipeline.

TODO: refactor query into multiple query components.

TODO: make it harder to do query injection.

readers

List of readers from which the source will get data.

query

Spark SQL query to run against the readers.

Example

Simple example regarding Source class instantiation.

>>> from butterfree.extract import Source
>>> from butterfree.extract.readers import TableReader, FileReader
>>> from butterfree.clients import SparkClient
>>> spark_client = SparkClient()
>>> source = Source(
...    readers=[
...        TableReader(
...            id="table_reader_id",
...            database="table_reader_db",
...            table="table_reader_table",
...        ),
...        FileReader(id="file_reader_id", path="data_sample_path", format="json"),
...    ],
...    query="select a.*, b.feature2 "
...    "from table_reader_id a "
...    "inner join file_reader_id b on a.id = b.id",
...)
>>> df = source.construct(spark_client)

This last method uses the Spark client, by default, to create a temporary view for each reader, then runs the given query against those views and returns the resulting dataframe.
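The register-views-then-query pattern described above can be mimicked outside Spark with the stdlib sqlite3 module. This is only an illustrative analogy, not the butterfree API: the column names and rows below are invented, and each named table stands in for the temporary view a reader would register.

```python
import sqlite3

# In-memory database standing in for the Spark session.
conn = sqlite3.connect(":memory:")

# Each "reader" contributes a named dataset, analogous to a temporary view
# registered under the reader's id (names/columns invented for this sketch).
conn.execute("CREATE TABLE table_reader_id (id INTEGER, feature1 TEXT)")
conn.execute("CREATE TABLE file_reader_id (id INTEGER, feature2 REAL)")
conn.executemany("INSERT INTO table_reader_id VALUES (?, ?)", [(1, "a"), (2, "b")])
conn.executemany("INSERT INTO file_reader_id VALUES (?, ?)", [(1, 10.0), (2, 20.0)])

# One query joins the registered names into a single result, just as
# Source.construct runs its Spark SQL query across the readers' views.
query = (
    "SELECT a.id, a.feature1, b.feature2 "
    "FROM table_reader_id a "
    "INNER JOIN file_reader_id b ON a.id = b.id "
    "ORDER BY a.id"
)
result = conn.execute(query).fetchall()
print(result)  # [(1, 'a', 10.0), (2, 'b', 20.0)]
```

The key point the analogy preserves is that the query references readers by their id strings, so the ids double as table names in the SQL.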

construct(client: butterfree.clients.spark_client.SparkClient) → pyspark.sql.dataframe.DataFrame

Construct an entry point dataframe for a feature set.

This method assembles multiple readers by building each one and querying them with Spark SQL.

After that, the resulting dataframe is cached; since cache() in Spark is lazy, an action is triggered to force persistence.

Parameters

client – client responsible for connecting to the Spark session.

Returns

DataFrame with the query result against all readers.
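The note that cache() is lazy, so an action must be triggered, can be illustrated without Spark. The sketch below uses a plain Python generator as the lazy computation and list() as the "action" that forces evaluation; it is an analogy for the behavior described above, not Spark code.

```python
# Record which rows were actually computed, to observe laziness.
evaluated = []

def build_rows():
    # Nothing in this body runs until the generator is iterated,
    # just as df.cache() alone materializes nothing in Spark.
    for i in range(3):
        evaluated.append(i)  # side effect marks real computation
        yield i * i

rows = build_rows()
assert evaluated == []  # still lazy: defining the pipeline did no work

# The "action": forcing iteration materializes (and here stores) the data,
# analogous to calling an eager operation such as count() after cache().
cached = list(rows)
assert evaluated == [0, 1, 2]
assert cached == [0, 1, 4]
```

This is why construct() deliberately triggers an action after caching: without it, the cache would remain empty until some later consumer happened to evaluate the dataframe.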

Module contents

The Source Component of a Feature Set.

class butterfree.extract.Source(readers: List[butterfree.extract.readers.reader.Reader], query: str)

Alias of butterfree.extract.source.Source, re-exported at the package level; see the class documentation above.