What is Goldilox?

Goldilox is a tool that empowers data scientists to bring machine learning solutions into production.

Goldilox is under active development; please wait for the first stable version.

Key features

One line from POC to production
Flexible and yet simple
Technology agnostic
Things you didn't know you wanted:
- Serialization validation
- Missing values validation
- Output validation
- I/O examples
- Variables and description queries

Installing

With pip:

$ pip install goldilox

Pandas + Sklearn support

Any Sklearn + Pandas pipeline, transformer, or estimator can be turned into a Goldilox pipeline with one line of code, which you can then save and run as a server with the CLI.

Vaex native

Vaex is an open-source big data technology with similar APIs to Pandas.
We use some of Vaex's special sauce to allow extreme flexibility for advanced pipeline solutions while ensuring we have a tool that works on big data.

Examples

1. Data science

Sklearn

import pandas as pd
from xgboost.sklearn import XGBClassifier
from goldilox.datasets import load_iris

# Get the data
df, features, target = load_iris()

# modeling
model = XGBClassifier().fit(df[features], df[target])

Vaex

import vaex
from goldilox.datasets import load_iris
from vaex.ml.xgboost import XGBoostModel
import numpy as np

df, features, target = load_iris()
df = vaex.from_pandas(df)

# feature engineering example
df["petal_ratio"] = df["petal_length"] / df["petal_width"]

features.append('petal_ratio')
# modeling
booster = XGBoostModel(
    features=features,
    target=target,
    prediction_name="prediction",
    num_boost_round=500,
)
booster.fit(df)
df = booster.transform(df)

# post modeling processing example 
df["prediction"] = np.around(df["prediction"])
df["label"] = df["prediction"].map({0: "setosa", 1: "versicolor", 2: "virginica"})

2. Build a production ready pipeline

In one line (-:

from goldilox import Pipeline

# sklearn - When using sklearn, we want to have an example of the raw production query data
pipeline = Pipeline.from_sklearn(model, raw=Pipeline.to_raw(df[features]))

# vaex
pipeline = Pipeline.from_vaex(df)

# Save and load
pipeline.save(<path>)
pipeline = Pipeline.from_file(<path>)
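
The `raw` argument above is an example of the data as it would arrive in a production query. As a rough illustration only (this is not Goldilox's implementation, and `first_row_as_raw` is a hypothetical helper), such an example can be thought of as the first row of the data, restricted to the feature columns and expressed as a plain JSON-serializable record:

```python
import json


def first_row_as_raw(rows, features):
    """Hypothetical helper: return the first row as a {column: value} record."""
    if not rows:
        raise ValueError("need at least one row to build a raw example")
    first = rows[0]
    # Keep only the columns a production query would contain (no target).
    return {name: first[name] for name in features}


rows = [
    {"sepal_length": 5.1, "sepal_width": 3.5, "petal_length": 1.4, "petal_width": 0.2, "target": 0},
    {"sepal_length": 7.0, "sepal_width": 3.2, "petal_length": 4.7, "petal_width": 1.4, "target": 1},
]
features = ["sepal_length", "sepal_width", "petal_length", "petal_width"]

raw = first_row_as_raw(rows, features)
payload = json.dumps(raw)  # the same record, as it could be sent in a query
```

Keeping the example as a plain record makes it easy to store alongside the pipeline and to reuse as a test query.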

3. Deploy

glx serve <path>

[2021-11-16 18:54:44 +0100] [74906] [INFO] Starting gunicorn 20.1.0
[2021-11-16 18:54:44 +0100] [74906] [INFO] Listening at: http://127.0.0.1:5000 (74906)
[2021-11-16 18:54:44 +0100] [74906] [INFO] Using worker: uvicorn.workers.UvicornH11Worker
[2021-11-16 18:54:44 +0100] [74911] [INFO] Booting worker with pid: 74911
[2021-11-16 18:54:44 +0100] [74911] [INFO] Started server process [74911]
[2021-11-16 18:54:44 +0100] [74911] [INFO] Waiting for application startup.
[2021-11-16 18:54:44 +0100] [74911] [INFO] Application startup complete.
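
Once the server is listening, predictions can be requested over HTTP. The sketch below only builds the request, so it runs without a server. The `/inference` path and the list-of-records JSON body are assumptions that are not stated in this document, and `build_inference_request` is a hypothetical helper; check both against your installed version.

```python
import json
import urllib.request


def build_inference_request(records, host="http://127.0.0.1:5000", endpoint="/inference"):
    """Build (but do not send) a JSON POST request for a served pipeline.

    The endpoint path is an assumption, not a documented Goldilox route.
    """
    body = json.dumps(records).encode("utf-8")
    return urllib.request.Request(
        url=host + endpoint,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


records = [{"sepal_length": 5.1, "sepal_width": 3.5, "petal_length": 1.4, "petal_width": 0.2}]
request = build_inference_request(records)

# To actually send it once `glx serve <path>` is running:
# with urllib.request.urlopen(request) as response:
#     predictions = json.loads(response.read())
```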

4. Training: For experiments, cloud training, automations, etc.

With Vaex, you put everything you want to do into a function which receives and returns a Vaex DataFrame.

from vaex.ml.datasets import load_iris
from goldilox import Pipeline


def fit(df):
    from vaex.ml.xgboost import XGBoostModel
    import numpy as np

    df = df.copy()  # work on a copy of the DataFrame passed in, rather than reloading the data

    # feature engineering example
    df["petal_ratio"] = df["petal_length"] / df["petal_width"]

    # modeling
    booster = XGBoostModel(
        features=['petal_length', 'petal_width', 'sepal_length', 'sepal_width', 'petal_ratio'],
        target='class_',
        prediction_name="prediction",
        num_boost_round=500,
    )
    booster.fit(df)
    df = booster.transform(df)

    # post modeling processing example
    df['prediction'] = np.around(df['prediction'])
    df["label"] = df["prediction"].map({0: "setosa", 1: "versicolor", 2: "virginica"})
    return df


df = load_iris()
pipeline = Pipeline.from_vaex(df, fit=fit).fit(df)

With Sklearn, fit takes the standard X and y.

import pandas as pd
from sklearn.datasets import load_iris
from xgboost.sklearn import XGBClassifier
from goldilox import Pipeline

iris = load_iris()
features = iris.feature_names
df = pd.DataFrame(iris.data, columns=features)
df['target'] = iris.target

# We don't need to provide a raw example if we train from the Goldilox Pipeline; it is taken automatically from the first row.
classifier = XGBClassifier(n_estimators=10, verbosity=0, use_label_encoder=False)
pipeline = Pipeline.from_sklearn(classifier).fit(df[features], df['target'])

WARNING: Pipeline doesn't handle na for sepal_length
WARNING: Pipeline doesn't handle na for sepal_width
WARNING: Pipeline doesn't handle na for petal_length
WARNING: Pipeline doesn't handle na for petal_width
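
To make the warnings above concrete, here is one way a per-column missing-value check could work. This is a sketch under stated assumptions, not Goldilox's actual implementation: `find_unhandled_missing`, `naive_infer`, and `robust_infer` are hypothetical names, and the toy models stand in for a real pipeline. Each column of a raw example is set to `None` in turn, and columns for which inference raises are reported.

```python
def find_unhandled_missing(infer, raw):
    """Return the column names for which `infer` fails when the value is missing."""
    unhandled = []
    for column in raw:
        probe = dict(raw)
        probe[column] = None  # simulate a missing value in this one column
        try:
            infer(probe)
        except Exception:
            unhandled.append(column)
    return unhandled


def naive_infer(row):
    """Toy model: arithmetic on None raises a TypeError."""
    return row["petal_length"] * 0.5 + row["petal_width"]


def robust_infer(row):
    """Same toy model, with missing values filled with 0.0 first."""
    filled = {key: (0.0 if value is None else value) for key, value in row.items()}
    return naive_infer(filled)


raw = {"petal_length": 1.4, "petal_width": 0.2}
for column in find_unhandled_missing(naive_infer, raw):
    print(f"WARNING: Pipeline doesn't handle na for {column}")
```

With `robust_infer` in place of `naive_infer`, the same check reports no columns.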

We do not handle missing values? Let's fix that!

import sklearn.pipeline
from goldilox.sklearn.transformers import Imputer

classifier = XGBClassifier(n_estimators=10, verbosity=0, use_label_encoder=False)

# Impute missing values before they reach the classifier
sk_pipeline = sklearn.pipeline.Pipeline([('imputer', Imputer(features=features)),
                                         ('classifier', classifier)])

pipeline = Pipeline.from_sklearn(sk_pipeline).fit(df[features], df['target'])

We can still deploy a pipeline that doesn't deal with missing values if we want. Other validations, such as serialization and prediction-on-raw, must pass.
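
The two mandatory checks can be pictured with a small sketch. It is an illustration under assumptions rather than Goldilox's code: the "pipeline" here is a hypothetical dict of learned parameters with a toy `inference` function, and `validate` is a made-up helper. The idea is that the pipeline must predict on the raw example, survive a serialization round trip, and give the same prediction afterwards.

```python
import pickle


def inference(state, row):
    """Toy stand-in for a pipeline: `state` is a plain dict of learned parameters."""
    is_large = row["petal_length"] > state["threshold"]
    return state["labels"][int(is_large)]


def validate(state, raw):
    """Check prediction-on-raw works and is unchanged by a pickle round trip."""
    expected = inference(state, raw)              # prediction-on-raw must not raise
    restored = pickle.loads(pickle.dumps(state))  # serialization must not raise
    return inference(restored, raw) == expected


state = {"threshold": 2.5, "labels": ["setosa", "other"]}
raw = {"petal_length": 1.4}
ok = validate(state, raw)
```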

CLI

Some tools

# Serve model
glx serve <pipeline-path>

# get the variables straight from the file.
glx variables <pipeline-path>

# get the description straight from the file.
glx description <pipeline-path>

# get the raw data example from the file.
glx raw <pipeline-path>

# Get the pipeline requirements
glx freeze <pipeline-path> <path-to-requirements-file-output.txt>

# Update a pipeline file metadata or variables 
glx update <pipeline-path> key value --file --variable

Docker

You can build a docker image from a pipeline.

Reference

glx build <pipeline-path> --platform=linux/amd64

MLOps

Export to MLFlow

pipeline.export_mlflow(path, **kwargs)

Export to Gunicorn

pipeline.export_gunicorn(path, **kwargs)

Data science examples

Example Notebooks

Classification / Regression
- LightGBM
- XGBoost
- CatBoost
- Sklearn
Clustering
- Kmeans
- hdbscan
Nearest Neighbours
Recommendations
- Implicit (Matrix Factorization)
- Lightfm (Matrix Factorization with side features)
Online Learning
- River
- Vowpal Wabbit
Predictions with Explanations
- SHAP
- Interpret
NLP
Deep Learning
- PyTorch
- MXNet #TODO
- Keras #TODO
- Tensorflow #TODO
Training
Advanced
- Sklearn vs Vaex vs PySpark
- Using a package which is not picklable

FAQ

Why the name "Goldilox"?
Because most solutions out there either require you to build everything from scratch for each solution, or force you to take them as they are. We consider ourselves in between: you can do most things, with minimal adjustments.
Why do you work with Vaex and not just Pandas?
Vaex handles big data on normal computers, which is our target audience. We also rely heavily on its lazy evaluation, which Pandas doesn't have.
Why do you use "inference" for predictions and not "predict" or "transform"?
Sklearn has a standard: "transform" returns a dataframe and "predict" returns a numpy array, so we wanted a separate word for inference. We want the pipeline to also follow the Sklearn standard with fit, transform, and predict.
M1 mac with docker?
You probably want to use --platform=linux/amd64
How to send arguments to the docker serve?
docker run -p 127.0.0.1:5000:5000 --rm -it --platform=linux/amd64 goldilox glx serve $PIPELINE_PATH <args>

Example:

docker run -p 127.0.0.1:5000:5000 --rm -it --platform=linux/amd64 goldilox glx serve $PIPELINE_PATH --host=0.0.0.0:5000

Contributing

See contributing page.

Notebooks can be a great contribution too!

Roadmap

See roadmap page.
