butterfree.configs.db package

Submodules

Abstract classes for database configurations with Spark.

class butterfree.configs.db.abstract_config.AbstractWriteConfig

Bases: abc.ABC

Abstract class for database write configurations with Spark.

abstract property database

Database name.

abstract property format_

Config option “format” for spark write.

Returns

format.

Return type

str

abstract property mode

Config option “mode” for spark write.

Returns

mode.

Return type

str

abstract translate(schema: Any) → List[Dict[Any, Any]]

Translate feature set spark schema to the corresponding database.

Parameters

schema – feature set schema

Returns

Corresponding database schema.
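As an illustration of this contract, here is a minimal, hypothetical subclass. The sketch redeclares the same abstract interface locally so it runs without butterfree installed; `JsonWriteConfig` and all its values are invented for the example and are not part of the library.

```python
from abc import ABC, abstractmethod
from typing import Any, Dict, List


class AbstractWriteConfig(ABC):
    """Local stand-in for butterfree.configs.db.abstract_config.AbstractWriteConfig."""

    @property
    @abstractmethod
    def database(self) -> str:
        """Database name."""

    @property
    @abstractmethod
    def format_(self) -> str:
        """Config option "format" for Spark write."""

    @property
    @abstractmethod
    def mode(self) -> str:
        """Config option "mode" for Spark write."""

    @abstractmethod
    def translate(self, schema: Any) -> List[Dict[Any, Any]]:
        """Translate feature set Spark schema to the corresponding database."""


class JsonWriteConfig(AbstractWriteConfig):
    """Hypothetical config that writes JSON files in overwrite mode."""

    @property
    def database(self) -> str:
        return "json_files"

    @property
    def format_(self) -> str:
        return "json"

    @property
    def mode(self) -> str:
        return "overwrite"

    def translate(self, schema: Any) -> List[Dict[Any, Any]]:
        # JSON needs no type translation in this toy example.
        return list(schema)


config = JsonWriteConfig()
print(config.format_, config.mode)  # prints: json overwrite
```

A concrete subclass must implement all four abstract members before it can be instantiated; Python raises `TypeError` otherwise.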

Holds configurations to read and write with Spark to Cassandra DB.

class butterfree.configs.db.cassandra_config.CassandraConfig(username: str = None, password: str = None, host: str = None, keyspace: str = None, mode: str = None, format_: str = None, stream_processing_time: str = None, stream_output_mode: str = None, stream_checkpoint_path: str = None, read_consistency_level: str = None, write_consistency_level: str = None, local_dc: str = None)

Bases: butterfree.configs.db.abstract_config.AbstractWriteConfig

Configuration for Spark to connect to Cassandra DB.

References can be found [here](https://docs.databricks.com/data/data-sources/cassandra.html).

username

username to use in connection.

password

password to use in connection.

host

host to use in connection.

keyspace

Cassandra DB keyspace to write data.

mode

write mode for Spark.

format_

write format for Spark.

stream_processing_time

processing time interval for streaming jobs.

stream_output_mode

mode for writing streaming data.

stream_checkpoint_path

path on S3 to save checkpoints for the stream job.

read_consistency_level

read consistency level used in connection.

write_consistency_level

write consistency level used in connection.

More information about processing_time, output_mode and checkpoint_path can be found in Spark documentation: [here](https://spark.apache.org/docs/latest/structured-streaming-programming-guide.html)

property database

Database name.

property format_

Write format for Spark.

get_options(table: str) → Dict[Optional[str], Optional[str]]

Get options for connecting to Cassandra DB.

Options will be a dictionary with the write and read configuration for Spark to Cassandra.

Parameters

table – table name within the Cassandra keyspace.

Returns

Configuration to connect to Cassandra DB.
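For reference, the returned dictionary follows the option names of the DataStax spark-cassandra-connector. The standalone helper below sketches the shape of such a dictionary; it is illustrative, not butterfree's implementation, and the exact set of keys may differ by library version.

```python
from typing import Dict, Optional


def build_cassandra_options(
    username: str, password: str, host: str, keyspace: str, table: str
) -> Dict[Optional[str], Optional[str]]:
    # Option names follow the DataStax spark-cassandra-connector;
    # get_options assembles a dictionary of roughly this shape.
    return {
        "table": table,
        "keyspace": keyspace,
        "spark.cassandra.auth.username": username,
        "spark.cassandra.auth.password": password,
        "spark.cassandra.connection.host": host,
    }


options = build_cassandra_options(
    "user", "secret", "127.0.0.1", "feature_store", "user_features"
)
# Such a dictionary can then be passed to a Spark writer, e.g.:
# df.write.format("org.apache.spark.sql.cassandra").options(**options).save()
```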

property host

Host used in connection to Cassandra DB.

property keyspace

Cassandra DB keyspace to write data.

property local_dc

Local DC for Cassandra connection.

property mode

Write mode for Spark.

property password

Password used in connection to Cassandra DB.

property read_consistency_level

Read consistency level for Cassandra.

property stream_checkpoint_path

Path on S3 to save checkpoints for the stream job.

property stream_output_mode

Mode for writing streaming data.

property stream_processing_time

Processing time interval for streaming jobs.

translate(schema: List[Dict[str, Any]]) → List[Dict[str, Any]]

Translate the feature set Spark schema to the Cassandra schema.

The output will be a list of dictionaries describing the Cassandra database schema.

Parameters

schema – feature set schema in spark.

Returns

Cassandra schema.
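To make the translation concrete, here is a simplified, self-contained sketch of the idea. The mapping below covers only a handful of Spark types and is illustrative; it is not butterfree's exact conversion table.

```python
from typing import Any, Dict, List

# Illustrative (partial) Spark-to-Cassandra type mapping.
SPARK_TO_CASSANDRA = {
    "IntegerType": "int",
    "LongType": "bigint",
    "StringType": "text",
    "DoubleType": "double",
    "TimestampType": "timestamp",
}


def translate(schema: List[Dict[str, Any]]) -> List[Dict[str, Any]]:
    """Map each Spark column definition to a Cassandra column definition."""
    return [
        {
            "column_name": column["column_name"],
            "type": SPARK_TO_CASSANDRA[column["type"]],
            "primary_key": column.get("primary_key", False),
        }
        for column in schema
    ]


spark_schema = [
    {"column_name": "id", "type": "LongType", "primary_key": True},
    {"column_name": "feature", "type": "DoubleType"},
]
print(translate(spark_schema))
```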

property username

Username used in connection to Cassandra DB.

property write_consistency_level

Write consistency level for Cassandra.

Holds configurations to read and write with Spark to Kafka.

class butterfree.configs.db.kafka_config.KafkaConfig(kafka_topic: str = None, kafka_connection_string: str = None, mode: str = None, format_: str = None, stream_processing_time: str = None, stream_output_mode: str = None, stream_checkpoint_path: str = None)

Bases: butterfree.configs.db.abstract_config.AbstractWriteConfig

Configuration for Spark to connect to Kafka.

kafka_topic

string with kafka topic name.

kafka_connection_string

string with hosts and ports to connect.

mode

write mode for Spark.

format_

write format for Spark.

stream_processing_time

processing time interval for streaming jobs.

stream_output_mode

mode for writing streaming data.

stream_checkpoint_path

path on S3 to save checkpoints for the stream job.

More information about processing_time, output_mode and checkpoint_path can be found in Spark documentation: [here](https://spark.apache.org/docs/latest/structured-streaming-programming-guide.html)

property database

Database name.

property format_

Write format for Spark.

get_options(topic: str) → Dict[Optional[str], Optional[str]]

Get options for connecting to Kafka.

Options will be a dictionary with the write and read configuration for Spark to Kafka.

Parameters

topic – Kafka topic name.

Returns

Configuration to connect to Kafka.
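The standalone helper below sketches the shape of the returned dictionary; it is illustrative, not butterfree's implementation. The option names are the standard ones from Spark's Kafka integration.

```python
from typing import Dict, Optional


def build_kafka_options(
    connection_string: str, topic: str
) -> Dict[Optional[str], Optional[str]]:
    # "kafka.bootstrap.servers" and "topic" are the standard options of
    # Spark's Kafka source/sink; get_options returns a dictionary of this shape.
    return {
        "kafka.bootstrap.servers": connection_string,
        "topic": topic,
    }


options = build_kafka_options("host1:9092,host2:9092", "feature-set-events")
# df.write.format("kafka").options(**options).save()
```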

property kafka_connection_string

Kafka connection string with hosts and ports to connect.

property kafka_topic

Kafka topic name.

property mode

Write mode for Spark.

property stream_checkpoint_path

Path on S3 to save checkpoints for the stream job.

property stream_output_mode

Mode for writing streaming data.

property stream_processing_time

Processing time interval for streaming jobs.

translate(schema: List[Dict[str, Any]]) → List[Dict[str, Any]]

Translate the feature set Spark schema to the Kafka schema.

The output will be a list of dictionaries describing the Kafka schema.

Parameters

schema – feature set schema in spark.

Returns

Kafka schema.

Holds configurations to read and write with Spark to AWS S3.

class butterfree.configs.db.metastore_config.MetastoreConfig(path: str = None, mode: str = None, format_: str = None, file_system: str = None)

Bases: butterfree.configs.db.abstract_config.AbstractWriteConfig

Configuration for a Spark Metastore-backed database.

By default the configuration is for AWS S3.

path

database root location.

mode

writing mode used by writers.

format_

expected stored file format.

file_system

file system scheme, e.g. s3a or file.

property database

Database name.

property file_system

File system scheme, e.g. s3a or file.

property format_

Expected stored file format.

get_options(key: str) → Dict[Optional[str], Optional[str]]

Get options for Metastore.

Options will be a dictionary with the write and read configuration for Spark Metastore.

Parameters

key – path to save data into Metastore.

Returns

Options configuration for Metastore.

get_path_with_partitions(key: str, dataframe: pyspark.sql.dataframe.DataFrame) → List

Get paths for AWS S3 from a partitioned parquet file.

The output will be a list of partition paths for Spark to read from and write to in AWS S3.

Parameters
  • key – path to save data into AWS S3 bucket.

  • dataframe – spark dataframe containing data from a feature set.

Returns

A list of strings for file-system-backed data sources.
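A self-contained sketch of what such partition paths can look like. The helper and the year/month/day layout are hypothetical for this example; butterfree derives the actual partition values from the dataframe itself.

```python
from typing import Dict, List


def get_path_with_partitions(
    file_system: str, bucket: str, key: str, partitions: List[Dict[str, int]]
) -> List[str]:
    """Build one fully qualified path per (year, month, day) partition."""
    return [
        f"{file_system}://{bucket}/{key}"
        f"/year={p['year']}/month={p['month']}/day={p['day']}"
        for p in partitions
    ]


paths = get_path_with_partitions(
    "s3a",
    "my-bucket",
    "feature_sets/user_features",
    [{"year": 2021, "month": 4, "day": 1}, {"year": 2021, "month": 4, "day": 2}],
)
print(paths[0])  # prints: s3a://my-bucket/feature_sets/user_features/year=2021/month=4/day=1
```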

property mode

Writing mode used by writers.

property path

Bucket name.

translate(schema: List[Dict[str, Any]]) → List[Dict[str, Any]]

Translate feature set spark schema to the corresponding database.

Module contents

This module holds database configurations to be used by clients.

The classes below are re-exported at the package level; their behavior and full documentation are identical to the submodule entries above.

class butterfree.configs.db.AbstractWriteConfig

Abstract class for database write configurations with Spark. See butterfree.configs.db.abstract_config.AbstractWriteConfig above.

class butterfree.configs.db.CassandraConfig

Configuration for Spark to connect to Cassandra DB. See butterfree.configs.db.cassandra_config.CassandraConfig above.

class butterfree.configs.db.KafkaConfig

Configuration for Spark to connect to Kafka. See butterfree.configs.db.kafka_config.KafkaConfig above.

class butterfree.configs.db.MetastoreConfig

Configuration for a Spark Metastore-backed database. See butterfree.configs.db.metastore_config.MetastoreConfig above.