Streaming Feature Sets in Butterfree¶
Spark handles stream processing in a very powerful way. For an introduction to Spark's streaming capabilities, you can read more in this link. Like core Spark, Butterfree also lets you declare pipelines that deal with streaming data. Best of all, the pipeline declaration is almost the same as for batch use cases, so it isn't too complex to tackle this type of challenge with Butterfree's tools.
Streaming feature sets are those that declare at least one streaming source of data in the readers of a FeatureSetPipeline. The pipeline is considered a streaming job if it has at least one reader in streaming mode. Readers in streaming mode use Spark's readStream API instead of the normal read, so they produce a streaming dataframe (df.isStreaming == True) instead of a normal Spark dataframe.
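The batch-versus-stream distinction can be sketched in plain Python (Spark itself is not assumed available here; all names are illustrative): a batch source hands over its whole dataset at once, while a streaming source yields records incrementally as they arrive.

```python
from typing import Iterator, List


def batch_read(rows: List[dict]) -> List[dict]:
    # Batch mode: the whole dataset is materialized at once,
    # analogous to Spark's spark.read producing a static dataframe.
    return list(rows)


def stream_read(rows: List[dict]) -> Iterator[dict]:
    # Streaming mode: records are yielded one at a time as they
    # "arrive", analogous to spark.readStream producing a streaming
    # dataframe (the one for which df.isStreaming == True).
    for row in rows:
        yield row


events = [{"id": 1, "value": 10}, {"id": 2, "value": 20}]

batch_df = batch_read(events)    # all data available up front
stream_df = stream_read(events)  # a lazy, incremental source

# A crude analogue of Spark's isStreaming flag: the stream is lazy.
is_streaming = hasattr(stream_df, "__next__")
```

The point of the sketch is only the shape of the two sources: the batch result is complete immediately, while the streaming one must be consumed incrementally.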
Online Feature Store Writer¶
OnlineFeatureStoreWriter is currently the only writer that supports streaming dataframes. It upserts data to Cassandra in real time, using Spark's foreachBatch functionality: each micro-batch of the stream is handed to a function that writes it to the sink. You can read more about foreachBatch in Spark's Structured Streaming documentation.
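The upsert semantics can be sketched in plain Python (no Spark required; store and function names are hypothetical): a foreachBatch-style callback receives each micro-batch and writes it into a key-value store, where a later record for the same key overwrites the earlier one, as a Cassandra upsert would.

```python
from typing import Callable, Dict, List

# In-memory stand-in for the Cassandra table, keyed by primary key.
online_store: Dict[int, dict] = {}


def upsert_batch(batch: List[dict], batch_id: int) -> None:
    # Analogous to the function passed to Spark's foreachBatch:
    # it is invoked once per micro-batch with that batch's rows.
    for row in batch:
        # Upsert: insert the row, or overwrite an existing row
        # with the same primary key (last write wins).
        online_store[row["id"]] = row


def run_stream(micro_batches: List[List[dict]],
               sink: Callable[[List[dict], int], None]) -> None:
    # Drive the stream: hand each micro-batch to the sink callback.
    for batch_id, batch in enumerate(micro_batches):
        sink(batch, batch_id)


run_stream(
    [
        [{"id": 1, "value": 10}, {"id": 2, "value": 20}],
        [{"id": 1, "value": 99}],  # newer row for id=1 upserts the old one
    ],
    upsert_batch,
)
# online_store now holds the latest value per key:
# {1: {"id": 1, "value": 99}, 2: {"id": 2, "value": 20}}
```

The design point mirrored here is that foreachBatch turns a continuous stream into a sequence of ordinary batch writes, which is what makes per-key upserts straightforward.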
You can use the
OnlineFeatureStoreWriter in debug mode (
debug_mode=True) with streaming dataframes. Instead of writing to Cassandra, the data is written to an in-memory table, which you can query to inspect the output as it is being calculated. This is normally used to test whether the defined features produce the expected results in real time.
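Debug mode can be sketched the same way (again in plain Python, with hypothetical names): the sink callback appends each micro-batch to an in-memory table instead of writing to Cassandra, and that table can be queried while the stream runs.

```python
from typing import Dict, List

# Hypothetical in-memory table standing in for the debug-mode sink.
debug_table: List[dict] = []


def debug_sink(batch: List[dict], batch_id: int) -> None:
    # Debug mode: append to an in-memory table instead of Cassandra.
    debug_table.extend(batch)


def query_latest(table: List[dict]) -> Dict[int, dict]:
    # A simple "query" over the in-memory table: latest row per key.
    latest: Dict[int, dict] = {}
    for row in table:
        latest[row["id"]] = row
    return latest


micro_batches = [
    [{"id": 1, "value": 10}],
    [{"id": 1, "value": 15}, {"id": 2, "value": 7}],
]
for batch_id, batch in enumerate(micro_batches):
    debug_sink(batch, batch_id)
    # The table can be inspected between micro-batches, showing the
    # feature values as they are being calculated.
    snapshot = query_latest(debug_table)
```

After the second micro-batch, the snapshot reflects the newest row per key, which is what makes debug mode useful for checking feature results in real time.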
Unlike a batch run, a pipeline running on a streaming dataframe will not "finish running". It keeps reading data from the stream, processing it, and saving it to the defined sinks. So when managing a job that uses this feature, operations need to be designed around a continuously-running streaming job.