Storage Layer

All code related to the storage layer is located inside the cerebro.storage package. Cerebro currently supports two storage media: the local file system/NFS (cerebro.storage.LocalStore) and HDFS (cerebro.storage.HDFSStore). The storage medium is used for the following tasks:

  • Storing training data
  • Storing model checkpoints
  • Storing model training metrics in TensorBoard format

To create a Storage object, the user has to specify the prefix path to a directory. Cerebro will create the following sub-directory structure inside that directory to organize the above data artifacts.

  • <prefix_path>/train_data : Contains all the training data
  • <prefix_path>/val_data : Contains all the validation data
  • <prefix_path>/runs : (System generated) Contains the latest checkpoint and logs for every model, each organized in its own sub-directory named by the model ID.

It is also possible to override the above default naming convention at Storage object creation time. For more details see here.
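As an example, a Storage object can be created as shown in the following sketch. The keyword arguments used to override the sub-directory locations (train_path, val_path, runs_path) are assumptions based on the layout described above; check the Store constructor for the exact parameter names.

    from cerebro.storage import LocalStore, HDFSStore

    # Default layout: train_data/, val_data/, and runs/ are created under the prefix path.
    storage = LocalStore(prefix_path='/mnt/nfs/exp_data')

    # Overriding the default sub-directory locations (illustrative keyword names).
    storage = HDFSStore(
        'hdfs://host:port/exp_data',
        train_path='hdfs://host:port/exp_data/my_train_data',
        val_path='hdfs://host:port/exp_data/my_val_data',
        runs_path='hdfs://host:port/exp_data/my_runs',
    )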

Generating Training and Validation Data

Training and validation data can be generated in two ways:

  1. As part of the model selection invocation : With this approach, the user can simply pass the input data (e.g., in the form of a Spark DataFrame) directly into the model selection invocation's .fit(df) method. Behind the scenes, Cerebro will first materialize the data into the storage medium before invoking the model selection process. Training and validation splits are generated based on a user-specified fraction or a column indicator in the input DataFrame, which has to be set when initializing the model selection object (a sketch of this invocation appears after this list).

  2. As a separate step without any model selection invocation : With this approach, users can materialize the training data as a separate step. In order to do this, they have to create a Cerebro backend object (e.g., cerebro.backend.SparkBackend) and invoke the .prepare_data(..) method, providing a storage object and input data (e.g., a Spark DataFrame) as follows:

     from cerebro.backend import SparkBackend
     from cerebro.storage import HDFSStore

     backend = SparkBackend(spark_context=...)
     storage = HDFSStore('hdfs://host:port/exp_data')
     backend.prepare_data(storage, input_df, validation=0.25)
    

    After the above one-time step, users can create a Storage object pointing to the same storage directory and use it for performing model selection. Instead of the .fit(df) method, they now need to invoke the .fit_on_prepared_data() method, as shown in the sketch below.
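The following sketch shows both invocation styles end to end, assuming a simple grid search over learning rates. GridSearch and hp_choice come from cerebro.tune; estimator_gen_fn, the column names, and the hyperparameter values are placeholders for user-specific code, and the exact keyword arguments may differ slightly from this sketch.

    from cerebro.tune import GridSearch, hp_choice

    # estimator_gen_fn is a user-supplied function that takes a dictionary of
    # hyperparameter values and returns a Cerebro estimator wrapping the model.
    search_space = {'lr': hp_choice([1e-3, 1e-4])}

    model_selection = GridSearch(backend, storage, estimator_gen_fn, search_space,
                                 num_epochs=10, evaluation_metric='loss',
                                 feature_columns=['features'], label_columns=['label'],
                                 validation=0.25)

    # Approach 1: pass the input DataFrame directly; Cerebro materializes it first.
    best_model = model_selection.fit(input_df)

    # Approach 2: the data was already materialized via backend.prepare_data(...).
    best_model = model_selection.fit_on_prepared_data()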

For both approaches, Cerebro uses the Petastorm library behind the scenes to materialize the data and subsequently read it during model training. Petastorm enables the use of Parquet storage from TensorFlow, PyTorch, and other Python-based ML training frameworks. One could also generate training and validation data using Petastorm outside of Cerebro and still use them for model training in Cerebro.
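For reference, the following is a minimal Petastorm sketch of generating such a dataset outside of Cerebro. The schema, row generator, and output path are hypothetical; see the Petastorm documentation for the full API.

    import numpy as np
    from pyspark.sql import SparkSession
    from pyspark.sql.types import FloatType, IntegerType
    from petastorm.codecs import ScalarCodec
    from petastorm.etl.dataset_metadata import materialize_dataset
    from petastorm.unischema import Unischema, UnischemaField, dict_to_spark_row

    # Hypothetical two-column schema: one scalar feature and one integer label.
    MySchema = Unischema('MySchema', [
        UnischemaField('feature', np.float32, (), ScalarCodec(FloatType()), False),
        UnischemaField('label', np.int32, (), ScalarCodec(IntegerType()), False),
    ])

    spark = SparkSession.builder.getOrCreate()
    output_url = 'hdfs://host:port/exp_data/train_data'  # hypothetical path

    # materialize_dataset writes the Petastorm metadata needed to read the
    # resulting Parquet files back during model training.
    with materialize_dataset(spark, output_url, MySchema, row_group_size_mb=64):
        rows = spark.sparkContext.parallelize(range(100)) \
            .map(lambda i: {'feature': float(i), 'label': i % 2}) \
            .map(lambda d: dict_to_spark_row(MySchema, d))
        spark.createDataFrame(rows, MySchema.as_spark_schema()) \
            .write.mode('overwrite').parquet(output_url)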

Adding Support for a New Storage Medium

To add support for a new storage medium, one has to implement a new class extending the Store class. The Store class provides an abstraction over a file system (e.g., local or HDFS) or blob storage service. It provides the basic semantics for reading and writing objects and defines how the data artifacts described above (training data, checkpoints, and logs) are located within the store.
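A schematic skeleton of such an extension is shown below, using a hypothetical S3-backed store. The method names are illustrative only; the exact set of methods to override is the one defined on cerebro.storage.Store (and implemented by LocalStore and HDFSStore).

    from cerebro.storage import Store

    class S3Store(Store):
        """Hypothetical store backed by an S3 bucket (illustrative skeleton)."""

        def __init__(self, prefix_path):
            # The base Store constructor may take additional arguments; shown minimally here.
            super(S3Store, self).__init__()
            self.prefix_path = prefix_path

        def exists(self, path):
            # Return True if an object exists at the given path in the bucket.
            raise NotImplementedError()

        def read(self, path):
            # Return the raw bytes of the object stored at the given path.
            raise NotImplementedError()

        def get_train_data_path(self):
            # Where materialized training data is written/read.
            return '{}/train_data'.format(self.prefix_path)

        def get_val_data_path(self):
            # Where materialized validation data is written/read.
            return '{}/val_data'.format(self.prefix_path)

        def get_runs_path(self):
            # Parent directory for per-model checkpoints and TensorBoard logs.
            return '{}/runs'.format(self.prefix_path)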