butterfree.configs.db package

Submodules

butterfree.configs.db.abstract_config module

Abstract classes for database configurations with Spark.

class butterfree.configs.db.abstract_config.AbstractWriteConfig

Bases: ABC

Abstract class for database write configurations with Spark.

abstract property database: str

Database name.

abstract property format_: Any

Config option “format” for Spark write.

Returns:

format.

Return type:

str

abstract property mode: Any

Config option “mode” for Spark write.

Returns:

mode.

Return type:

str

abstract translate(schema: Any) → List[Dict[Any, Any]]

Translate the feature set’s Spark schema to the corresponding database schema.

Parameters:

schema – feature set schema

Returns:

Corresponding database schema.
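
A minimal concrete subclass might look like the sketch below; the class name and returned values are hypothetical, and a real translate would map Spark types to the target database’s type system.

```python
from typing import Any, Dict, List

from butterfree.configs.db import AbstractWriteConfig


class InMemoryConfig(AbstractWriteConfig):
    """Hypothetical write config, used only to illustrate the interface."""

    @property
    def database(self) -> str:
        return "in_memory"

    @property
    def format_(self) -> Any:
        # Spark write format, e.g. "parquet".
        return "parquet"

    @property
    def mode(self) -> Any:
        # Spark write mode, e.g. "overwrite" or "append".
        return "overwrite"

    def translate(self, schema: Any) -> List[Dict[Any, Any]]:
        # A real implementation converts each Spark type in the feature
        # set schema to the corresponding database type; this sketch
        # passes entries through unchanged.
        return [dict(item) for item in schema]
```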

butterfree.configs.db.cassandra_config module

Holds configurations to read and write with Spark to Cassandra DB.

class butterfree.configs.db.cassandra_config.CassandraConfig(username: Optional[str] = None, password: Optional[str] = None, host: Optional[str] = None, keyspace: Optional[str] = None, mode: Optional[str] = None, format_: Optional[str] = None, stream_processing_time: Optional[str] = None, stream_output_mode: Optional[str] = None, stream_checkpoint_path: Optional[str] = None, read_consistency_level: Optional[str] = None, write_consistency_level: Optional[str] = None, local_dc: Optional[str] = None)

Bases: AbstractWriteConfig

Configuration for Spark to connect to Cassandra DB.

References can be found [here](https://docs.databricks.com/data/data-sources/cassandra.html).

username

username to use in connection.

password

password to use in connection.

host

host to use in connection.

keyspace

Cassandra DB keyspace to write data.

mode

write mode for Spark.

format_

write format for Spark.

stream_processing_time

processing time interval for streaming jobs.

stream_output_mode

output mode for writing streaming data.

stream_checkpoint_path

path on S3 to save checkpoints for the stream job.

read_consistency_level

read consistency level used in connection.

write_consistency_level

write consistency level used in connection.

More information about processing_time, output_mode and checkpoint_path can be found in the [Spark Structured Streaming documentation](https://spark.apache.org/docs/latest/structured-streaming-programming-guide.html).
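
For example, a streaming-oriented configuration might be built as below; all values are illustrative, and unset parameters are assumed to fall back to the library’s defaults (e.g. environment variables).

```python
from butterfree.configs.db import CassandraConfig

# Illustrative values: "10 seconds" is a standard Spark processing-time
# trigger interval, and "update" is a standard streaming output mode.
config = CassandraConfig(
    host="cassandra.example.com",
    keyspace="feature_store",
    stream_processing_time="10 seconds",
    stream_output_mode="update",
    stream_checkpoint_path="s3a://my-bucket/checkpoints/user_features",
)
```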

property database: str

Database name.

property format_: Optional[str]

Write format for Spark.

get_options(table: str) → Dict[Optional[str], Optional[str]]

Get options for connecting to Cassandra DB.

Options will be a dictionary with the write and read configuration for Spark to Cassandra.

Parameters:

table – table name in the Cassandra DB keyspace.

Returns:

Configuration to connect to Cassandra DB.
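
A minimal usage sketch, assuming illustrative connection values; the exact option keys in the returned dictionary are an implementation detail of the Spark-to-Cassandra setup.

```python
from butterfree.configs.db import CassandraConfig

config = CassandraConfig(
    username="user",
    password="secret",
    host="cassandra.example.com",
    keyspace="feature_store",
)

options = config.get_options(table="user_features")
# The returned dictionary can be passed straight to a Spark writer, e.g.:
# df.write.format(config.format_).options(**options).mode(config.mode).save()
```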

property host: Optional[str]

Host used in connection to Cassandra DB.

property keyspace: Optional[str]

Cassandra DB keyspace to write data.

property local_dc: Optional[str]

Local DC for Cassandra connection.

property mode: Optional[str]

Write mode for Spark.

property password: Optional[str]

Password used in connection to Cassandra DB.

property read_consistency_level: Optional[str]

Read consistency level for Cassandra.

property stream_checkpoint_path: Optional[str]

Path on S3 to save checkpoints for the stream job.

property stream_output_mode: Optional[str]

Output mode for writing streaming data.

property stream_processing_time: Optional[str]

Processing time interval for streaming jobs.

translate(schema: List[Dict[str, Any]]) → List[Dict[str, Any]]

Get the feature set schema to be translated.

The output will be a list of dictionaries describing the Cassandra database schema.

Parameters:

schema – feature set schema in spark.

Returns:

Cassandra schema.
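
A sketch of a translation call; the list-of-dicts layout with column_name, type and primary_key keys is an assumption of this example, not something this reference guarantees.

```python
from pyspark.sql.types import LongType, TimestampType

from butterfree.configs.db import CassandraConfig

config = CassandraConfig(keyspace="feature_store")

# Assumed schema layout: one dict per column, with Spark type objects.
spark_schema = [
    {"column_name": "id", "type": LongType(), "primary_key": True},
    {"column_name": "timestamp", "type": TimestampType(), "primary_key": False},
]

cassandra_schema = config.translate(spark_schema)
# Each entry should now carry the corresponding Cassandra column type.
```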

property username: Optional[str]

Username used in connection to Cassandra DB.

property write_consistency_level: Optional[str]

Write consistency level for Cassandra.

butterfree.configs.db.kafka_config module

Holds configurations to read and write with Spark to Kafka.

class butterfree.configs.db.kafka_config.KafkaConfig(kafka_topic: Optional[str] = None, kafka_connection_string: Optional[str] = None, mode: Optional[str] = None, format_: Optional[str] = None, stream_processing_time: Optional[str] = None, stream_output_mode: Optional[str] = None, stream_checkpoint_path: Optional[str] = None)

Bases: AbstractWriteConfig

Configuration for Spark to connect to Kafka.

kafka_topic

string with the Kafka topic name.

kafka_connection_string

string with the hosts and ports to connect to.

mode

write mode for Spark.

format_

write format for Spark.

stream_processing_time

processing time interval for streaming jobs.

stream_output_mode

output mode for writing streaming data.

stream_checkpoint_path

path on S3 to save checkpoints for the stream job.

More information about processing_time, output_mode and checkpoint_path can be found in the [Spark Structured Streaming documentation](https://spark.apache.org/docs/latest/structured-streaming-programming-guide.html).

property database: str

Database name.

property format_: Optional[str]

Write format for Spark.

get_options(topic: str) → Dict[Optional[str], Optional[str]]

Get options for connecting to Kafka.

Options will be a dictionary with the write and read configuration for Spark to Kafka.

Parameters:

topic – topic related to Kafka.

Returns:

Configuration to connect to Kafka.
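
A minimal sketch; the broker addresses and topic name are illustrative.

```python
from butterfree.configs.db import KafkaConfig

config = KafkaConfig(
    kafka_topic="user-features",
    kafka_connection_string="broker-1:9092,broker-2:9092",
)

options = config.get_options(topic="user-features")
# `options` holds the Spark-to-Kafka read/write configuration; the exact
# keys are an implementation detail.
```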

property kafka_connection_string: Optional[str]

Kafka connection string with hosts and ports to connect.

property kafka_topic: Optional[str]

Kafka topic name.

property mode: Optional[str]

Write mode for Spark.

property stream_checkpoint_path: Optional[str]

Path on S3 to save checkpoints for the stream job.

property stream_output_mode: Optional[str]

Output mode for writing streaming data.

property stream_processing_time: Optional[str]

Processing time interval for streaming jobs.

translate(schema: List[Dict[str, Any]]) → List[Dict[str, Any]]

Get the feature set schema to be translated.

The output will be a list of dictionaries describing the Kafka schema.

Parameters:

schema – feature set schema in spark.

Returns:

Kafka schema.

butterfree.configs.db.metastore_config module

Holds configurations to read and write with Spark to AWS S3.

class butterfree.configs.db.metastore_config.MetastoreConfig(path: Optional[str] = None, mode: Optional[str] = None, format_: Optional[str] = None, file_system: Optional[str] = None)

Bases: AbstractWriteConfig

Configuration for the storage backing the Spark metastore database.

By default the configuration targets AWS S3.

path

database root location.

mode

writing mode used by writers.

format_

expected stored file format.

file_system

file system URI scheme, such as s3a or file.

property database: str

Database name.

property file_system: Optional[str]

File system URI scheme, such as s3a or file.

property format_: Optional[str]

Expected stored file format.

get_options(key: str) → Dict[Optional[str], Optional[str]]

Get options for Metastore.

Options will be a dictionary with the write and read configuration for Spark Metastore.

Parameters:

key – path to save data into Metastore.

Returns:

Options configuration for Metastore.
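
A minimal sketch, assuming an S3 bucket named my-bucket and an illustrative key.

```python
from butterfree.configs.db import MetastoreConfig

config = MetastoreConfig(
    path="my-bucket",
    mode="overwrite",
    format_="parquet",
    file_system="s3a",
)

options = config.get_options(key="feature_sets/user_features")
# `options` holds the Spark read/write configuration for the metastore.
```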

get_path_with_partitions(key: str, dataframe: DataFrame) → List

Get paths on AWS S3 for a partitioned parquet file.

The output will be a list of paths, one per partition, for Spark to read from or write to AWS S3.

Parameters:
  • key – path to save data into AWS S3 bucket.

  • dataframe – spark dataframe containing data from a feature set.

Returns:

A list of strings for file-system-backed data sources.
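
A sketch under stated assumptions: the year/month/day partition columns below are hypothetical, since the real partitioning scheme depends on how the feature set writes its data.

```python
from pyspark.sql import SparkSession

from butterfree.configs.db import MetastoreConfig

spark = SparkSession.builder.master("local[1]").getOrCreate()

# Hypothetical partition columns.
df = spark.createDataFrame(
    [(1, 2023, 1, 15), (2, 2023, 1, 16)],
    ["id", "year", "month", "day"],
)

config = MetastoreConfig(path="my-bucket", file_system="s3a")
paths = config.get_path_with_partitions("feature_sets/user_features", df)
# Expect one fully qualified path per distinct partition in the dataframe.
```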

property mode: Optional[str]

Writing mode used by writers.

property path: Optional[str]

Bucket name.

translate(schema: List[Dict[str, Any]]) → List[Dict[str, Any]]

Translate the feature set’s Spark schema to the corresponding database schema.

Module contents

This module holds database configurations to be used by clients.

The following classes are re-exported at the package level; they are the same classes documented in the submodule sections above.

class butterfree.configs.db.AbstractWriteConfig

Alias of butterfree.configs.db.abstract_config.AbstractWriteConfig.

class butterfree.configs.db.CassandraConfig

Alias of butterfree.configs.db.cassandra_config.CassandraConfig.

class butterfree.configs.db.KafkaConfig

Alias of butterfree.configs.db.kafka_config.KafkaConfig.

class butterfree.configs.db.MetastoreConfig

Alias of butterfree.configs.db.metastore_config.MetastoreConfig.