butterfree.dataframe_service package

Submodules

Module containing the repartition methods.

butterfree.dataframe_service.repartition.repartition_df(dataframe: pyspark.sql.dataframe.DataFrame, partition_by: List[str], num_partitions: int = None, num_processors: int = None)

Partition the DataFrame.

Parameters
  • dataframe – Spark DataFrame.

  • partition_by – list of columns to partition by.

  • num_processors – number of processors.

  • num_partitions – number of partitions.

Returns

Partitioned dataframe.
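To make "partition by columns" concrete, here is a minimal pure-Python sketch of the idea: rows that share the same values in the partition_by columns end up in the same bucket. It is a toy model, not Butterfree's or Spark's implementation. The names model_repartition and FALLBACK_NUM_PARTITIONS are hypothetical, and because this page does not say how num_processors influences the partition count, the sketch leaves that parameter out.

```python
from collections import defaultdict
from typing import Dict, List, Optional

# Hypothetical fallback used only in this sketch; not a Butterfree constant.
FALLBACK_NUM_PARTITIONS = 4


def model_repartition(
    rows: List[dict],
    partition_by: List[str],
    num_partitions: Optional[int] = None,
) -> Dict[int, List[dict]]:
    """Toy model: bucket rows by a hash of their partition_by column values."""
    n = num_partitions or FALLBACK_NUM_PARTITIONS
    buckets: Dict[int, List[dict]] = defaultdict(list)
    for row in rows:
        # Rows with equal key tuples always hash to the same bucket index.
        key = tuple(row[col] for col in partition_by)
        buckets[hash(key) % n].append(row)
    return dict(buckets)


rows = [
    {"id": 1, "value": "a"},
    {"id": 2, "value": "b"},
    {"id": 1, "value": "c"},
]
parts = model_repartition(rows, partition_by=["id"], num_partitions=3)
for index, bucket in sorted(parts.items()):
    print(index, bucket)
```

In PySpark itself, the comparable operation is DataFrame.repartition(numPartitions, *cols), which hash-partitions the rows by the given column expressions.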

butterfree.dataframe_service.repartition.repartition_sort_df(dataframe: pyspark.sql.dataframe.DataFrame, partition_by: List[str], order_by: List[str], num_processors: int = None, num_partitions: int = None)

Partition and sort the DataFrame.

Parameters
  • dataframe – Spark DataFrame.

  • partition_by – list of columns to partition by.

  • order_by – list of columns to order by.

  • num_processors – number of processors.

  • num_partitions – number of partitions.

Returns

Partitioned and sorted dataframe.
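The following pure-Python sketch illustrates what partitioning followed by sorting could look like. It assumes the sort is applied inside each partition rather than across the whole dataset, which this page does not state explicitly. The function name model_repartition_sort and the sample rows are hypothetical, and the code is a toy model rather than the library's implementation.

```python
from collections import defaultdict
from typing import Dict, List


def model_repartition_sort(
    rows: List[dict],
    partition_by: List[str],
    order_by: List[str],
    num_partitions: int,
) -> Dict[int, List[dict]]:
    """Toy model: bucket rows by partition_by, then sort each bucket by order_by."""
    buckets: Dict[int, List[dict]] = defaultdict(list)
    for row in rows:
        key = tuple(row[col] for col in partition_by)
        buckets[hash(key) % num_partitions].append(row)
    # Assumption: ordering is local to each bucket, not global.
    return {
        index: sorted(bucket, key=lambda r: tuple(r[col] for col in order_by))
        for index, bucket in buckets.items()
    }


events = [
    {"id": 1, "ts": 30},
    {"id": 2, "ts": 10},
    {"id": 1, "ts": 20},
    {"id": 2, "ts": 5},
]
sorted_parts = model_repartition_sort(
    events, partition_by=["id"], order_by=["ts"], num_partitions=2
)
for index, bucket in sorted(sorted_parts.items()):
    print(index, [row["ts"] for row in bucket])
```

In PySpark, a per-partition sort of this kind is available as DataFrame.sortWithinPartitions(*cols), typically chained after DataFrame.repartition.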

Module contents

DataFrame optimization components for Butterfree.

butterfree.dataframe_service.repartition_df(dataframe: pyspark.sql.dataframe.DataFrame, partition_by: List[str], num_partitions: int = None, num_processors: int = None)

Partition the DataFrame.

Parameters
  • dataframe – Spark DataFrame.

  • partition_by – list of columns to partition by.

  • num_processors – number of processors.

  • num_partitions – number of partitions.

Returns

Partitioned dataframe.

butterfree.dataframe_service.repartition_sort_df(dataframe: pyspark.sql.dataframe.DataFrame, partition_by: List[str], order_by: List[str], num_processors: int = None, num_partitions: int = None)

Partition and sort the DataFrame.

Parameters
  • dataframe – Spark DataFrame.

  • partition_by – list of columns to partition by.

  • order_by – list of columns to order by.

  • num_processors – number of processors.

  • num_partitions – number of partitions.

Returns

Partitioned and sorted dataframe.