# 🚀 Composer v0.12.0

Composer v0.12.0 is released! Install via pip:

```bash
pip install mosaicml==0.12.0
```
## New Features

### 🪵 Logging and ObjectStore Enhancements

There are multiple improvements to our logging and object store support in this release.

- **Image visualization using our `CometMLLogger` (#1710)**

  We've added support for using our `ImageVisualizer` callback with CometML to log images and segmentation masks to CometML.

  ```python
  from composer.callbacks import ImageVisualizer
  from composer.loggers import CometMLLogger
  from composer.trainer import Trainer

  trainer = Trainer(
      ...,
      callbacks=[ImageVisualizer()],
      loggers=[CometMLLogger()],
  )
  ```
- **Added direct support for Oracle Cloud Infrastructure (OCI) as an `ObjectStore` (#1774) and support for Google Cloud Storage (GCS) via URI (#1833)**

  To use, simply set your `save_folder` or `load_path` to a URI beginning with `oci://` or `gs://` to save and load with OCI or GCS, respectively.

  ```python
  from composer.trainer import Trainer

  # Checkpoint saving to Google Cloud Storage.
  trainer = Trainer(
      model=model,
      save_folder="gs://my-bucket/{run_name}/checkpoints",
      run_name='my-run',
      save_interval="1ep",
      save_filename="ep{epoch}.pt",
      save_num_checkpoints_to_keep=0,  # delete all checkpoints locally
      ...
  )

  trainer.fit()
  ```
- **Added basic support for logging with MLFlow (#1795)**

  We've added basic support for using MLFlow to log experiment metrics.

  ```python
  from composer.loggers import MLFlowLogger
  from composer.trainer import Trainer

  mlflow_logger = MLFlowLogger(
      experiment_name=mlflow_exp_name,
      run_name=mlflow_run_name,
      tracking_uri=mlflow_uri,
  )
  trainer = Trainer(..., loggers=[mlflow_logger])
  ```
- **Simplified console and progress bar logging (#1694)**

  To turn off the progress bar, set `progress_bar=False`. To turn on logging directly to the console, set `log_to_console=True`. To control the frequency of logging to the console, set `console_log_interval` (e.g. to `1ep` or `1ba`).

- **Our `get_file` utility now supports URIs directly (`s3://`, `oci://`, and `gs://`) for downloading files.**
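URI handling like this generally reduces to dispatching on the URI scheme. A minimal, framework-free sketch of the pattern (the backend labels are illustrative; they are not Composer's exact classes):

```python
from urllib.parse import urlparse

# Hypothetical mapping from URI scheme to a storage backend label.
_BACKENDS = {'s3': 'S3', 'oci': 'OCI', 'gs': 'GCS'}

def backend_for(uri: str) -> str:
    """Pick a storage backend based on the URI scheme; fall back to local files."""
    scheme = urlparse(uri).scheme
    return _BACKENDS.get(scheme, 'local')
```

For example, `backend_for('gs://my-bucket/ckpt.pt')` selects the GCS backend, while a plain filesystem path falls through to local file handling.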
### 🏃‍♀️ Support for Mid-Epoch Resumption with the latest release of Streaming

We've added support in Composer for the latest release of our Streaming library. This includes awesome new features like instant mid-epoch resumption and deterministic shuffling, regardless of the number of nodes. See the Streaming release notes for more!
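Shuffling that is deterministic regardless of node count is typically achieved by deriving the shuffle from the seed and epoch alone, so every node computes the same global order and then takes its own slice. A simplified sketch of that idea (not Streaming's actual implementation; names are illustrative):

```python
import random

def epoch_order(num_samples: int, seed: int, epoch: int):
    """Same global shuffle on every node: the order depends only on (seed, epoch)."""
    rng = random.Random(seed * 1_000_003 + epoch)  # mix seed and epoch into one int
    order = list(range(num_samples))
    rng.shuffle(order)
    return order

def rank_shard(order, rank: int, world_size: int):
    """Each rank takes a strided slice of the shared global order."""
    return order[rank::world_size]
```

Because the order never depends on `world_size`, resuming on a different number of nodes still walks the same sample sequence.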
### 🚨 New algorithm - `GyroDropout`!

Thanks to @jelite for adding a new algorithm, `GyroDropout`, to Composer! Please see the method card for more details.
### 🤗 HuggingFace + Composer improvements

We've added a new utility to load a 🤗 HuggingFace model and tokenizer out of a Composer checkpoint (#1754), making the pretraining -> finetuning workflow even easier in Composer. Check out the docs for more details, and our example notebook for a full tutorial (#1775)!
### 📉 GradMonitor -> OptimizerMonitor

Renames our `GradMonitor` callback to `OptimizerMonitor`, and adds the ability to track optimizer-specific metrics. Check out the docs for more details, and add it to your code just like any other callback!

```python
from composer.callbacks import OptimizerMonitor
from composer.trainer import Trainer

trainer = Trainer(
    ...,
    callbacks=[OptimizerMonitor(log_optimizer_metrics=log_optimizer_metrics)],
)
```
### 🐳 New PyTorch and CUDA versions

We've expanded our library of Docker images with support for PyTorch 1.13 + CUDA 11.7:

- `mosaicml/pytorch:1.13.0_cu117-python3.10-ubuntu20.04`
- `mosaicml/pytorch:1.13.0_cpu-python3.10-ubuntu20.04`

The `mosaicml/pytorch:latest`, `mosaicml/pytorch:cpu_latest`, and `mosaicml/composer:0.12.0` tags are now built from PyTorch 1.13-based images. Please see our DockerHub repository for additional details.
## API changes

### Replace `grad_accum` with `device_train_microbatch_size` (#1749, #1776)

We're deprecating the `grad_accum` Trainer argument in favor of the more intuitive `device_train_microbatch_size`. Instead of thinking about how to divide your specified minibatch into microbatches, simply specify the size of your microbatch. For example, let's say you want to split your minibatch of 2048 into two microbatches of 1024:

```python
from composer import Trainer

trainer = Trainer(
    ...,
    device_train_microbatch_size=1024,
)
```

If you want Composer to tune the microbatch size for you automatically, enable automatic microbatching as follows:

```python
from composer import Trainer

trainer = Trainer(
    ...,
    device_train_microbatch_size='auto',
)
```
The `grad_accum` argument is still supported but will be deprecated in the next Composer release.
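The relationship between the old and new arguments is simple arithmetic: the number of gradient-accumulation steps is the per-device minibatch size divided by the microbatch size. A quick sketch of that conversion (pure Python; the helper name is ours, not Composer's):

```python
def accumulation_steps(device_minibatch_size: int, device_microbatch_size: int) -> int:
    """How many forward/backward passes are accumulated per optimizer step."""
    if device_minibatch_size % device_microbatch_size != 0:
        raise ValueError('microbatch size must evenly divide the minibatch size')
    return device_minibatch_size // device_microbatch_size
```

Splitting a minibatch of 2048 into microbatches of 1024 gives 2 accumulation steps, matching what `grad_accum=2` expressed before.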
### Renamed precisions (#1761)

We've renamed precision attributes for clarity. The following values have been removed: `['amp', 'fp16', 'bf16']`.

We have added the following values, prefixed with `amp_` to clarify when an Automatic Mixed Precision type is being used: `['amp_fp16', 'amp_bf16']`.

The `fp32` precision value remains unchanged.
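A small lookup table is a handy way to migrate old configs to the new names. A sketch of such a shim (the helper is ours, not Composer's; mapping the old `'amp'` to `'amp_fp16'` is an assumption based on AMP defaulting to fp16):

```python
# Old precision names -> new names per the v0.12.0 rename; 'fp32' is unchanged.
# Assumption: the old 'amp' value meant fp16 mixed precision.
_PRECISION_RENAMES = {'amp': 'amp_fp16', 'fp16': 'amp_fp16', 'bf16': 'amp_bf16'}

def migrate_precision(name: str) -> str:
    """Translate a pre-0.12.0 precision name to its new equivalent."""
    return _PRECISION_RENAMES.get(name, name)
```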
## Deprecations

- Removed support for YAHP (#1512)
- Removed COCO and SSD datasets (#1717)
- Fully removed Streaming v1 support; please see the mosaicml/streaming project for our next-gen streaming datasets (#1787)
- Deprecated the `FusedLayerNorm` algorithm (#1789)
- Fully removed the `grad_clip_norm` training argument; please use the `GradientClipping` algorithm instead (#1768)
- Removed `data_fit`, `data_epoch`, and `data_batch` from `Logger` (#1826)
## Bug Fixes

- Fix FSDP checkpoint strategy (#1734)
- Fix gradient clipping with FSDP (#1740)
- Add more supported FSDP config flags (`sync_module_states`, `forward_prefetch`, `limit_all_gathers`) (#1794)
- Allow `FULL` precision with FSDP (#1796)
- Fix `eval_microbatch` modification on the `EVAL_BEFORE_FORWARD` event (#1739)
- Fix algorithm API backwards compatibility in checkpoints (#1741)
- Fix a bad `None` check preventing setting `device_id` to `0` (#1767)
- Unregister engine to make cleaning up memory easier (#1769)
- Fix issue if `metric_names` is not a list (#1798)
- Match implementation for list and tensor batch splitting (#1804)
- Fix infinite eval issue (#1815)
## What's Changed
- Update installation constraints for streaming by @karan6181 in #1661
- Update decoupled_weight_decay.md by @jacobfulano in #1672
- Notebooks part 2 by @dakinggg in #1659
- Add trainer arg for engine passes by @mvpatel2000 in #1673
- Autoload algorithms by @mvpatel2000 in #1658
- Faster metrics calculations + Fix warnings added by the new version of torchmetrics by @dskhudia in #1674
- Update coolname requirement from <2,>=1.1.0 to >=1.1.0,<3 by @dependabot in #1666
- Bump ipykernel from 6.16.0 to 6.16.1 by @dependabot in #1667
- Bump traitlets from 5.4.0 to 5.5.0 by @dependabot in #1668
- Image viz by @dakinggg in #1676
- Update checks for Gated Linear Units Method by @jacobfulano in #1575
- ADE20k streaming factory method by @Landanjs in #1626
- Deyahpify cifar10 by @growlix in #1677
- Nuke YAHP by @hanlint in #1512
- Imagenet streaming factory method by @codestar12 in #1649
- Bump ipykernel from 6.16.1 to 6.16.2 by @dependabot in #1683
- Bump pytest from 7.1.3 to 7.2.0 by @dependabot in #1684
- Bump pypandoc from 1.9 to 1.10 by @dependabot in #1680
- Update py-cpuinfo requirement from <9,>=8.0.0 to >=8.0.0,<10 by @dependabot in #1681
- Uncomment and clean up algorithms documentation by @growlix in #1685
- Update glu check by @mvpatel2000 in #1689
- fix backwards compatability by @mvpatel2000 in #1693
- Fix engine pass registration by @mvpatel2000 in #1692
- Add Low Precision LayerNorm by @nik-mosaic in #1525
- Update codeowners by @mvpatel2000 in #1691
- Add nccl env var by @mvpatel2000 in #1695
- Fix eval timestamp by @mvpatel2000 in #1697
- Update distributed docs by @mvpatel2000 in #1696
- Return empty dict if wandb disabled by @dakinggg in #1698
- Autoresume related error messages by @dakinggg in #1687
- Add log_image to wandb, cometml, and LoggerDestination by @eracah in #1675
- Pin PyTorch and supporting package versions by @bandish-shah in #1688
- Add in unit tests for log_image function for CometMLLogger and WandBLogger by @eracah in #1701
- refactor devices by @mvpatel2000 in #1699
- remove as in device by @mvpatel2000 in #1704
- Fix device imports by @mvpatel2000 in #1705
- Fix typing in EMA's _move_params_to_device() by @coryMosaicML in #1707
- Add docs for saving and loading checkpoints with GCS by @eracah in #1702
- Clean up imports by @mvpatel2000 in #1700
- Add rud docs by @eracah in #1709
- Bump cryptography from 38.0.1 to 38.0.3 by @dependabot in #1712
- GHA workflow for code quality checks by @bandish-shah in #1719
- Add support for Path in CheckpointSaver by @cojennin in #1721
- Docs Typo by @mvpatel2000 in #1723
- Bump nbsphinx from 0.8.9 to 0.8.10 by @dependabot in #1725
- Bump sphinx-argparse from 0.3.2 to 0.4.0 by @dependabot in #1726
- Simple nlp tests by @dakinggg in #1716
- Build Streaming CIFAR10 Factory Function by @growlix in #1729
- Change `build_streaming_cifar10_dataloader()` to use v2 by default by @growlix in #1730
- Clear the Optimizer before wrapping with FSDP by @bcui19 in #1732
- Add inf eval check by @mvpatel2000 in #1733
- Fix fsdp checkpoint strategy by @bcui19 in #1734
- Assign eval microbatch to self.state.batch by @dakinggg in #1739
- Add masks to wandblogger.log_image and cometmllogger.log_image and refactor ImageVisualizer to use log_image [WIP] by @eracah in #1710
- Protect backwards compatability by @mvpatel2000 in #1741
- Add composer version state by @dakinggg in #1742
- Adds auto object store creation to `get_file` by @dakinggg in #1750
- Log console interval by @eracah in #1694
- Bump sphinxcontrib-katex from 0.9.0 to 0.9.3 by @dependabot in #1757
- Bump pandoc from 2.2 to 2.3 by @dependabot in #1756
- Bump cryptography from 38.0.3 to 38.0.4 by @dependabot in #1755
- Add more event tests by @mvpatel2000 in #1762
- Add python 3.10, pytorch 1.13, cuda 11.7 by @mvpatel2000 in #1735
- Add huggingface info to state dict by @dakinggg in #1744
- Global batch size by @mvpatel2000 in #1746
- Add device to state by @mvpatel2000 in #1765
- Rename precisions by @mvpatel2000 in #1761
- Device id none by @dakinggg in #1767
- Autoload HuggingFace model/tokenizer by @dakinggg in #1754
- Supporting `train_device_microbatch_size` by @mvpatel2000 in #1749
- Switch flash attention to tag by @mvpatel2000 in #1766
- remove grad clip norm by @mvpatel2000 in #1768
- unregister engine for memory cleanup by @mvpatel2000 in #1769
- Fix hf tokenizer test for new hf version by @dakinggg in #1772
- Decrease microbatch size if batch size is smaller by @mvpatel2000 in #1771
- remove deprecated code by @mvpatel2000 in #1773
- cache call to cpuinfo by @dakinggg in #1778
- device train microbatch size pt 2 by @mvpatel2000 in #1776
- Huggingface pretrain + finetune notebook by @dakinggg in #1775
- Bump traitlets from 5.5.0 to 5.6.0 by @dependabot in #1781
- Bump deepspeed from 0.7.5 to 0.7.6 by @dependabot in #1780
- Minor docs fix for deepspeed typo by @mvpatel2000 in #1784
- Update Auto Microbatching by @mvpatel2000 in #1785
- Adding GyroDropout as an algorithm to Composer by @jelite in #1718
- Add Deprecation warning for Fused LayerNorm by @nik-mosaic in #1789
- Update error msgs by @mvpatel2000 in #1791
- Change gyro emoji by @nik-mosaic in #1792
- Speeding up tests by @dakinggg in #1779
- Add durations arg to pytest by @dakinggg in #1793
- Properly implement gradient clipping for FSDP by @bcui19 in #1740
- Updating FSDP supported config flags by @bcui19 in #1794
- Remove streaming v1 datasets. by @knighton in #1787
- Remove references to validate in docs by @dakinggg in #1800
- Install latest Git in Docker images by @bandish-shah in #1770
- move to pypi release for flash attn by @mvpatel2000 in #1777
- Check and make sure that metric names is a list of strings by @dakinggg in #1798
- Adding in the possibility of 'None' for MixedPrecision FSDP by @bcui19 in #1796
- Updating assertion check for gradient clipping and updating gradient clip tests for FSDP by @bcui19 in #1802
- Moving Pytest CPU to GHA by @mvpatel2000 in #1790
- Bump sphinxext-opengraph from 0.6.3 to 0.7.3 by @dependabot in #1760
- Update distributed_training.rst by @lupesko in #1731
- Use streaming v3 by @knighton in #1797
- Bump traitlets from 5.6.0 to 5.7.0 by @dependabot in #1806
- Bump ipykernel from 6.16.2 to 6.19.2 by @dependabot in #1810
- Update packaging requirement from <22,>=21.3.0 to >=21.3.0,<23 by @dependabot in #1808
- match list batch splitting and tensor batch splitting by @dakinggg in #1804
- Add type ignore for onnx import by @mvpatel2000 in #1811
- Remove pip install all from coverage action by @dakinggg in #1805
- Remove coco and ssd by @growlix in #1717
- Rename matrix by @mvpatel2000 in #1813
- Add OCI ObjectStore by @eracah in #1774
- Add MLFlowLogger by @eracah in #1795
- Object store docs by @dakinggg in #1817
- fix inf eval by @mvpatel2000 in #1815
- Add `fsdp_config` to `state` and add fsdp_config to trainer docstring by @growlix in #1821
- Add SHARP support to docker by @mvpatel2000 in #1818
- Testing Infra Cleanup by @mvpatel2000 in #1822
- Remove dead code in dockerfile by @mvpatel2000 in #1823
- Fix Export Docs by @mvpatel2000 in #1824
- Remove old deprecated logger methods by @eracah in #1826
- NLP metrics tests by @dakinggg in #1830
- Nlp pipeline test by @dakinggg in #1828
- Add tests for uri helper functions by @eracah in #1827
- Add pip targets to installation.rst docs by @eracah in #1829
## New Contributors

**Full Changelog**: v0.11.1...v0.12.0