butterfree.configs.db package

Submodules

Abstract classes for database configurations with Spark.

class butterfree.configs.db.abstract_config.AbstractWriteConfig

Bases: abc.ABC

Abstract class for database write configurations with Spark.

abstract property database

Database name.

abstract property format_

Config option “format” for spark write.

Returns

format.

Return type

str

abstract property mode

Config option “mode” for spark write.

Returns

mode.

Return type

str

abstract translate(schema: Any) → List[Dict[Any, Any]]

Translate feature set spark schema to the corresponding database.

Parameters

schema – feature set schema

Returns

Corresponding database schema.
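As an illustration of this contract, here is a minimal, hypothetical subclass. The sketch redeclares the same abstract interface locally so it runs without butterfree installed; `JsonWriteConfig` and all its values are invented for the example and are not part of the library.

```python
from abc import ABC, abstractmethod
from typing import Any, Dict, List


class AbstractWriteConfig(ABC):
    """Local stand-in for butterfree.configs.db.abstract_config.AbstractWriteConfig."""

    @property
    @abstractmethod
    def database(self) -> str:
        """Database name."""

    @property
    @abstractmethod
    def format_(self) -> str:
        """Config option "format" for Spark write."""

    @property
    @abstractmethod
    def mode(self) -> str:
        """Config option "mode" for Spark write."""

    @abstractmethod
    def translate(self, schema: Any) -> List[Dict[Any, Any]]:
        """Translate feature set Spark schema to the corresponding database."""


class JsonWriteConfig(AbstractWriteConfig):
    """Hypothetical config that writes JSON files in overwrite mode."""

    @property
    def database(self) -> str:
        return "json_files"

    @property
    def format_(self) -> str:
        return "json"

    @property
    def mode(self) -> str:
        return "overwrite"

    def translate(self, schema: Any) -> List[Dict[Any, Any]]:
        # JSON needs no type translation in this toy example.
        return list(schema)


config = JsonWriteConfig()
print(config.format_, config.mode)  # prints: json overwrite
```

A concrete subclass must implement all four abstract members before it can be instantiated; Python raises `TypeError` otherwise.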

Holds configurations to read and write with Spark to Cassandra DB.

class butterfree.configs.db.cassandra_config.CassandraConfig(username: str = None, password: str = None, host: str = None, keyspace: str = None, mode: str = None, format_: str = None, stream_processing_time: str = None, stream_output_mode: str = None, stream_checkpoint_path: str = None, read_consistency_level: str = None, write_consistency_level: str = None, local_dc: str = None)

Bases: butterfree.configs.db.abstract_config.AbstractWriteConfig

Configuration for Spark to connect to Cassandra DB.

References can be found [here](https://docs.databricks.com/data/data-sources/cassandra.html).

username

username to use in connection.

password

password to use in connection.

host

host to use in connection.

keyspace

Cassandra DB keyspace to write data.

mode

write mode for Spark.

format_

write format for Spark.

stream_processing_time

processing time interval for streaming jobs.

stream_output_mode

mode for writing streaming data.

stream_checkpoint_path

path on S3 to save checkpoints for the stream job.

read_consistency_level

read consistency level used in connection.

write_consistency_level

write consistency level used in connection.

More information about processing_time, output_mode and checkpoint_path can be found in Spark documentation: [here](https://spark.apache.org/docs/latest/structured-streaming-programming-guide.html)

property database

Database name.

property format_

Write format for Spark.

get_options(table: str) → Dict[Optional[str], Optional[str]]

Get options for connecting to Cassandra DB.

Options will be a dictionary with the write and read configuration for Spark to Cassandra.

Parameters

table – table name within the Cassandra keyspace.

Returns

Configuration to connect to Cassandra DB.
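For reference, the returned dictionary follows the option names of the DataStax spark-cassandra-connector. The standalone helper below sketches the shape of such a dictionary; it is illustrative, not butterfree's implementation, and the exact set of keys may differ by library version.

```python
from typing import Dict, Optional


def build_cassandra_options(
    username: str, password: str, host: str, keyspace: str, table: str
) -> Dict[Optional[str], Optional[str]]:
    # Option names follow the DataStax spark-cassandra-connector;
    # get_options assembles a dictionary of roughly this shape.
    return {
        "table": table,
        "keyspace": keyspace,
        "spark.cassandra.auth.username": username,
        "spark.cassandra.auth.password": password,
        "spark.cassandra.connection.host": host,
    }


options = build_cassandra_options(
    "user", "secret", "127.0.0.1", "feature_store", "user_features"
)
# Such a dictionary can then be passed to a Spark writer, e.g.:
# df.write.format("org.apache.spark.sql.cassandra").options(**options).save()
```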

property host

Host used in connection to Cassandra DB.

property keyspace

Cassandra DB keyspace to write data.

property local_dc

Local DC for Cassandra connection.

property mode

Write mode for Spark.

property password

Password used in connection to Cassandra DB.

property read_consistency_level

Read consistency level for Cassandra.

property stream_checkpoint_path

Path on S3 to save checkpoints for the stream job.

property stream_output_mode

Mode for writing streaming data.

property stream_processing_time

Processing time interval for streaming jobs.

translate(schema: List[Dict[str, Any]]) → List[Dict[str, Any]]

Translate the feature set Spark schema to the Cassandra schema.

The output will be a list of dictionaries describing the Cassandra database schema.

Parameters

schema – feature set schema in spark.

Returns

Cassandra schema.
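To make the translation concrete, here is a simplified, self-contained sketch of the idea. The mapping below covers only a handful of Spark types and is illustrative; it is not butterfree's exact conversion table.

```python
from typing import Any, Dict, List

# Illustrative (partial) Spark-to-Cassandra type mapping.
SPARK_TO_CASSANDRA = {
    "IntegerType": "int",
    "LongType": "bigint",
    "StringType": "text",
    "DoubleType": "double",
    "TimestampType": "timestamp",
}


def translate(schema: List[Dict[str, Any]]) -> List[Dict[str, Any]]:
    """Map each Spark column definition to a Cassandra column definition."""
    return [
        {
            "column_name": column["column_name"],
            "type": SPARK_TO_CASSANDRA[column["type"]],
            "primary_key": column.get("primary_key", False),
        }
        for column in schema
    ]


spark_schema = [
    {"column_name": "id", "type": "LongType", "primary_key": True},
    {"column_name": "feature", "type": "DoubleType"},
]
print(translate(spark_schema))
```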

property username

Username used in connection to Cassandra DB.

property write_consistency_level

Write consistency level for Cassandra.

Holds configurations to read and write with Spark to Kafka.

class butterfree.configs.db.kafka_config.KafkaConfig(kafka_topic: str = None, kafka_connection_string: str = None, mode: str = None, format_: str = None, stream_processing_time: str = None, stream_output_mode: str = None, stream_checkpoint_path: str = None)

Bases: butterfree.configs.db.abstract_config.AbstractWriteConfig

Configuration for Spark to connect to Kafka.

kafka_topic

string with kafka topic name.

kafka_connection_string

string with hosts and ports to connect.

mode

write mode for Spark.

format_

write format for Spark.

stream_processing_time

processing time interval for streaming jobs.

stream_output_mode

mode for writing streaming data.

stream_checkpoint_path

path on S3 to save checkpoints for the stream job.

More information about processing_time, output_mode and checkpoint_path can be found in Spark documentation: [here](https://spark.apache.org/docs/latest/structured-streaming-programming-guide.html)

property database

Database name.

property format_

Write format for Spark.

get_options(topic: str) → Dict[Optional[str], Optional[str]]

Get options for connecting to Kafka.

Options will be a dictionary with the write and read configuration for Spark to Kafka.

Parameters

topic – Kafka topic name.

Returns

Configuration to connect to Kafka.
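The standalone helper below sketches the shape of the returned dictionary; it is illustrative, not butterfree's implementation. The option names are the standard ones from Spark's Kafka integration.

```python
from typing import Dict, Optional


def build_kafka_options(
    connection_string: str, topic: str
) -> Dict[Optional[str], Optional[str]]:
    # "kafka.bootstrap.servers" and "topic" are the standard options of
    # Spark's Kafka source/sink; get_options returns a dictionary of this shape.
    return {
        "kafka.bootstrap.servers": connection_string,
        "topic": topic,
    }


options = build_kafka_options("host1:9092,host2:9092", "feature-set-events")
# df.write.format("kafka").options(**options).save()
```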

property kafka_connection_string

Kafka connection string with hosts and ports to connect.

property kafka_topic

Kafka topic name.

property mode

Write mode for Spark.

property stream_checkpoint_path

Path on S3 to save checkpoints for the stream job.

property stream_output_mode

Mode for writing streaming data.

property stream_processing_time

Processing time interval for streaming jobs.

translate(schema: List[Dict[str, Any]]) → List[Dict[str, Any]]

Translate the feature set Spark schema to the Kafka schema.

The output will be a list of dictionaries describing the Kafka schema.

Parameters

schema – feature set schema in spark.

Returns

Kafka schema.

Holds configurations to read and write with Spark to AWS S3.

class butterfree.configs.db.metastore_config.MetastoreConfig(path: str = None, mode: str = None, format_: str = None, file_system: str = None)

Bases: butterfree.configs.db.abstract_config.AbstractWriteConfig

Configuration for a Spark Metastore-backed database.

By default the configuration is for AWS S3.

path

database root location.

mode

writing mode used by writers.

format_

expected stored file format.

file_system

file system scheme, e.g. s3a or file.

property database

Database name.

property file_system

File system scheme, e.g. s3a or file.

property format_

Expected stored file format.

get_options(key: str) → Dict[Optional[str], Optional[str]]

Get options for Metastore.

Options will be a dictionary with the write and read configuration for Spark Metastore.

Parameters

key – path to save data into Metastore.

Returns

Options configuration for Metastore.

get_path_with_partitions(key: str, dataframe: pyspark.sql.dataframe.DataFrame) → List

Get paths for AWS S3 from a partitioned parquet file.

The output will be a list of partition paths for Spark to read from and write to in AWS S3.

Parameters
  • key – path to save data into AWS S3 bucket.

  • dataframe – spark dataframe containing data from a feature set.

Returns

A list of strings for file-system-backed data sources.
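A self-contained sketch of what such partition paths can look like. The helper and the year/month/day layout are hypothetical for this example; butterfree derives the actual partition values from the dataframe itself.

```python
from typing import Dict, List


def get_path_with_partitions(
    file_system: str, bucket: str, key: str, partitions: List[Dict[str, int]]
) -> List[str]:
    """Build one fully qualified path per (year, month, day) partition."""
    return [
        f"{file_system}://{bucket}/{key}"
        f"/year={p['year']}/month={p['month']}/day={p['day']}"
        for p in partitions
    ]


paths = get_path_with_partitions(
    "s3a",
    "my-bucket",
    "feature_sets/user_features",
    [{"year": 2021, "month": 4, "day": 1}, {"year": 2021, "month": 4, "day": 2}],
)
print(paths[0])  # prints: s3a://my-bucket/feature_sets/user_features/year=2021/month=4/day=1
```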

property mode

Writing mode used by writers.

property path

Bucket name.

translate(schema: List[Dict[str, Any]]) → List[Dict[str, Any]]

Translate feature set spark schema to the corresponding database.

Module contents

This module holds database configurations to be used by clients.

The classes below are re-exported at the package level; their behavior and full documentation are identical to the submodule entries above.

class butterfree.configs.db.AbstractWriteConfig

Abstract class for database write configurations with Spark. See butterfree.configs.db.abstract_config.AbstractWriteConfig above.

class butterfree.configs.db.CassandraConfig

Configuration for Spark to connect to Cassandra DB. See butterfree.configs.db.cassandra_config.CassandraConfig above.

class butterfree.configs.db.KafkaConfig

Configuration for Spark to connect to Kafka. See butterfree.configs.db.kafka_config.KafkaConfig above.

class butterfree.configs.db.MetastoreConfig

Configuration for a Spark Metastore-backed database. See butterfree.configs.db.metastore_config.MetastoreConfig above.