# 🚀 Composer v0.12.0

Composer v0.12.0 is released! Install via pip:

```bash
pip install mosaicml==0.12.0
```
## New Features

### 🪵 Logging and ObjectStore Enhancements

There are multiple improvements to our logging and object store support in this release.

- **Image visualization using our `CometMLLogger` (#1710)**

  We've added support for using our `ImageVisualizer` callback with CometML to log images and segmentation masks to CometML.

  ```python
  from composer.callbacks import ImageVisualizer
  from composer.loggers import CometMLLogger
  from composer.trainer import Trainer

  trainer = Trainer(
      ...,
      callbacks=[ImageVisualizer()],
      loggers=[CometMLLogger()],
  )
  ```
- **Added direct support for Oracle Cloud Infrastructure (OCI) as an `ObjectStore` (#1774) and support for Google Cloud Storage (GCS) via URI (#1833)**

  To use, simply set your `save_folder` or `load_path` to a URI beginning with `oci://` or `gs://` to save and load with OCI or GCS, respectively.

  ```python
  from composer.trainer import Trainer

  # Checkpoint saving to Google Cloud Storage.
  trainer = Trainer(
      model=model,
      save_folder="gs://my-bucket/{run_name}/checkpoints",
      run_name='my-run',
      save_interval="1ep",
      save_filename="ep{epoch}.pt",
      save_num_checkpoints_to_keep=0,  # delete all checkpoints locally
      ...
  )

  trainer.fit()
  ```
- **Added basic support for logging with MLFlow (#1795)**

  We've added basic support for using MLFlow to log experiment metrics.

  ```python
  from composer.loggers import MLFlowLogger
  from composer.trainer import Trainer

  mlflow_logger = MLFlowLogger(
      experiment_name=mlflow_exp_name,
      run_name=mlflow_run_name,
      tracking_uri=mlflow_uri,
  )
  trainer = Trainer(..., loggers=[mlflow_logger])
  ```
- **Simplified console and progress bar logging (#1694)**

  To turn off the progress bar, set `progress_bar=False`. To turn on logging directly to the console, set `log_to_console=True`. To control the frequency of logging to the console, set `console_log_interval` (e.g. to `1ep` or `1ba`).

- **Our `get_file` utility now supports URIs directly (`s3://`, `oci://`, and `gs://`) for downloading files.**
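URI handling like this generally reduces to dispatching on the URI scheme. A minimal, framework-free sketch of the pattern (the backend labels are illustrative; they are not Composer's exact classes):

```python
from urllib.parse import urlparse

# Hypothetical mapping from URI scheme to a storage backend label.
_BACKENDS = {'s3': 'S3', 'oci': 'OCI', 'gs': 'GCS'}

def backend_for(uri: str) -> str:
    """Pick a storage backend based on the URI scheme; fall back to local files."""
    scheme = urlparse(uri).scheme
    return _BACKENDS.get(scheme, 'local')
```

For example, `backend_for('gs://my-bucket/ckpt.pt')` selects the GCS backend, while a plain filesystem path falls through to local file handling.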
### 🏃‍♀️ Support for Mid-Epoch Resumption with the latest release of Streaming

We've added support in Composer for the latest release of our Streaming library. This includes awesome new features like instant mid-epoch resumption and deterministic shuffling, regardless of the number of nodes. See the Streaming release notes for more!
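Shuffling that is deterministic regardless of node count is typically achieved by deriving the shuffle from the seed and epoch alone, so every node computes the same global order and then takes its own slice. A simplified sketch of that idea (not Streaming's actual implementation; names are illustrative):

```python
import random

def epoch_order(num_samples: int, seed: int, epoch: int):
    """Same global shuffle on every node: the order depends only on (seed, epoch)."""
    rng = random.Random(seed * 1_000_003 + epoch)  # mix seed and epoch into one int
    order = list(range(num_samples))
    rng.shuffle(order)
    return order

def rank_shard(order, rank: int, world_size: int):
    """Each rank takes a strided slice of the shared global order."""
    return order[rank::world_size]
```

Because the order never depends on `world_size`, resuming on a different number of nodes still walks the same sample sequence.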
### 🚨 New algorithm - `GyroDropout`!

Thanks to @jelite for adding a new algorithm, `GyroDropout`, to Composer! Please see the method card for more details.
### 🤗 HuggingFace + Composer improvements

We've added a new utility to load a 🤗 HuggingFace model and tokenizer out of a Composer checkpoint (#1754), making the pretraining -> finetuning workflow even easier in Composer. Check out the docs for more details, and our example notebook for a full tutorial (#1775)!
### 📉 GradMonitor -> OptimizerMonitor

Renames our `GradMonitor` callback to `OptimizerMonitor`, and adds the ability to track optimizer-specific metrics. Check out the docs for more details, and add it to your code just like any other callback!

```python
from composer.callbacks import OptimizerMonitor
from composer.trainer import Trainer

trainer = Trainer(
    ...,
    callbacks=[OptimizerMonitor(log_optimizer_metrics=log_optimizer_metrics)],
)
```
### 🐳 New PyTorch and CUDA versions

We've expanded our library of Docker images with support for PyTorch 1.13 + CUDA 11.7:

- `mosaicml/pytorch:1.13.0_cu117-python3.10-ubuntu20.04`
- `mosaicml/pytorch:1.13.0_cpu-python3.10-ubuntu20.04`

The `mosaicml/pytorch:latest`, `mosaicml/pytorch:cpu_latest`, and `mosaicml/composer:0.12.0` tags are now built from PyTorch 1.13-based images. Please see our DockerHub repository for additional details.
## API changes

### Replace `grad_accum` with `device_train_microbatch_size` (#1749, #1776)

We're deprecating the `grad_accum` Trainer argument in favor of the more intuitive `device_train_microbatch_size`. Instead of thinking about how to divide your specified minibatch into microbatches, simply specify the size of your microbatch. For example, let's say you want to split your minibatch of 2048 into two microbatches of 1024:

```python
from composer import Trainer

trainer = Trainer(
    ...,
    device_train_microbatch_size=1024,
)
```

If you want Composer to tune the microbatch size for you automatically, enable automatic microbatching as follows:

```python
from composer import Trainer

trainer = Trainer(
    ...,
    device_train_microbatch_size='auto',
)
```
The `grad_accum` argument is still supported but will be deprecated in the next Composer release.
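The relationship between the old and new arguments is simple arithmetic: the number of gradient-accumulation steps is the per-device minibatch size divided by the microbatch size. A quick sketch of that conversion (pure Python; the helper name is ours, not Composer's):

```python
def accumulation_steps(device_minibatch_size: int, device_microbatch_size: int) -> int:
    """How many forward/backward passes are accumulated per optimizer step."""
    if device_minibatch_size % device_microbatch_size != 0:
        raise ValueError('microbatch size must evenly divide the minibatch size')
    return device_minibatch_size // device_microbatch_size
```

Splitting a minibatch of 2048 into microbatches of 1024 gives 2 accumulation steps, matching what `grad_accum=2` expressed before.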
### Renamed precisions (#1761)

We've renamed precision attributes for clarity. The following values have been removed: `['amp', 'fp16', 'bf16']`.

We have added the following values, prefixed with `amp_` to clarify when an Automatic Mixed Precision type is being used: `['amp_fp16', 'amp_bf16']`.

The `fp32` precision value remains unchanged.
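A small lookup table is a handy way to migrate old configs to the new names. A sketch of such a shim (the helper is ours, not Composer's; mapping the old `'amp'` to `'amp_fp16'` is an assumption based on AMP defaulting to fp16):

```python
# Old precision names -> new names per the v0.12.0 rename; 'fp32' is unchanged.
# Assumption: the old 'amp' value meant fp16 mixed precision.
_PRECISION_RENAMES = {'amp': 'amp_fp16', 'fp16': 'amp_fp16', 'bf16': 'amp_bf16'}

def migrate_precision(name: str) -> str:
    """Translate a pre-0.12.0 precision name to its new equivalent."""
    return _PRECISION_RENAMES.get(name, name)
```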
## Deprecations

- Removed support for YAHP (#1512)
- Removed COCO and SSD datasets (#1717)
- Fully removed Streaming v1 support; please see the mosaicml/streaming project for our next-gen streaming datasets (#1787)
- Deprecated the `FusedLayerNorm` algorithm (#1789)
- Fully removed the `grad_clip_norm` training argument; please use the `GradientClipping` algorithm instead (#1768)
- Removed `data_fit`, `data_epoch`, and `data_batch` from `Logger` (#1826)
## Bug Fixes

- Fix FSDP checkpoint strategy (#1734)
- Fix gradient clipping with FSDP (#1740)
- Add more supported FSDP config flags (`sync_module_states`, `forward_prefetch`, `limit_all_gathers`) (#1794)
- Allow `FULL` precision with FSDP (#1796)
- Fix `eval_microbatch` modification on the `EVAL_BEFORE_FORWARD` event (#1739)
- Fix algorithm API backwards compatibility in checkpoints (#1741)
- Fix a bad `None` check preventing setting `device_id` to `0` (#1767)
- Unregister engine to make cleaning up memory easier (#1769)
- Fix issue if `metric_names` is not a list (#1798)
- Match implementation for list and tensor batch splitting (#1804)
- Fix infinite eval issue (#1815)
## What's Changed
- Update installation constraints for streaming by @karan6181 in #1661
- Update decoupled_weight_decay.md by @jacobfulano in #1672
- Notebooks part 2 by @dakinggg in #1659
- Add trainer arg for engine passes by @mvpatel2000 in #1673
- Autoload algorithms by @mvpatel2000 in #1658
- Faster metrics calculations + Fix warnings added by the new version of torchmetrics by @dskhudia in #1674
- Update coolname requirement from <2,>=1.1.0 to >=1.1.0,<3 by @dependabot in #1666
- Bump ipykernel from 6.16.0 to 6.16.1 by @dependabot in #1667
- Bump traitlets from 5.4.0 to 5.5.0 by @dependabot in #1668
- Image viz by @dakinggg in #1676
- Update checks for Gated Linear Units Method by @jacobfulano in #1575
- ADE20k streaming factory method by @Landanjs in #1626
- Deyahpify cifar10 by @growlix in #1677
- Nuke YAHP by @hanlint in #1512
- Imagenet streaming factory method by @codestar12 in #1649
- Bump ipykernel from 6.16.1 to 6.16.2 by @dependabot in #1683
- Bump pytest from 7.1.3 to 7.2.0 by @dependabot in #1684
- Bump pypandoc from 1.9 to 1.10 by @dependabot in #1680
- Update py-cpuinfo requirement from <9,>=8.0.0 to >=8.0.0,<10 by @dependabot in #1681
- Uncomment and clean up algorithms documentation by @growlix in #1685
- Update glu check by @mvpatel2000 in #1689
- fix backwards compatability by @mvpatel2000 in #1693
- Fix engine pass registration by @mvpatel2000 in #1692
- Add Low Precision LayerNorm by @nik-mosaic in #1525
- Update codeowners by @mvpatel2000 in #1691
- Add nccl env var by @mvpatel2000 in #1695
- Fix eval timestamp by @mvpatel2000 in #1697
- Update distributed docs by @mvpatel2000 in #1696
- Return empty dict if wandb disabled by @dakinggg in #1698
- Autoresume related error messages by @dakinggg in #1687
- Add log_image to wandb, cometml, and LoggerDestination by @eracah in #1675
- Pin PyTorch and supporting package versions by @bandish-shah in #1688
- Add in unit tests for log_image function for CometMLLogger and WandBLogger by @eracah in #1701
- refactor devices by @mvpatel2000 in #1699
- remove as in device by @mvpatel2000 in #1704
- Fix device imports by @mvpatel2000 in #1705
- Fix typing in EMA's _move_params_to_device() by @coryMosaicML in #1707
- Add docs for saving and loading checkpoints with GCS by @eracah in #1702
- Clean up imports by @mvpatel2000 in #1700
- Add rud docs by @eracah in #1709
- Bump cryptography from 38.0.1 to 38.0.3 by @dependabot in #1712
- GHA workflow for code quality checks by @bandish-shah in #1719
- Add support for Path in CheckpointSaver by @cojennin in #1721
- Docs Typo by @mvpatel2000 in #1723
- Bump nbsphinx from 0.8.9 to 0.8.10 by @dependabot in #1725
- Bump sphinx-argparse from 0.3.2 to 0.4.0 by @dependabot in #1726
- Simple nlp tests by @dakinggg in #1716
- Build Streaming CIFAR10 Factory Function by @growlix in #1729
- Change `build_streaming_cifar10_dataloader()` to use v2 by default by @growlix in #1730
- Clear the Optimizer before wrapping with FSDP by @bcui19 in #1732
- Add inf eval check by @mvpatel2000 in #1733
- Fix fsdp checkpoint strategy by @bcui19 in #1734
- Assign eval microbatch to self.state.batch by @dakinggg in #1739
- Add masks to wandblogger.log_image and cometmllogger.log_image and refactor ImageVisualizer to use log_image [WIP] by @eracah in #1710
- Protect backwards compatability by @mvpatel2000 in #1741
- Add composer version state by @dakinggg in #1742
- Adds auto object store creation to `get_file` by @dakinggg in #1750
- Log console interval by @eracah in #1694
- Bump sphinxcontrib-katex from 0.9.0 to 0.9.3 by @dependabot in #1757
- Bump pandoc from 2.2 to 2.3 by @dependabot in #1756
- Bump cryptography from 38.0.3 to 38.0.4 by @dependabot in #1755
- Add more event tests by @mvpatel2000 in #1762
- Add python 3.10, pytorch 1.13, cuda 11.7 by @mvpatel2000 in #1735
- Add huggingface info to state dict by @dakinggg in #1744
- Global batch size by @mvpatel2000 in #1746
- Add device to state by @mvpatel2000 in #1765
- Rename precisions by @mvpatel2000 in #1761
- Device id none by @dakinggg in #1767
- Autoload HuggingFace model/tokenizer by @dakinggg in #1754
- Supporting `train_device_microbatch_size` by @mvpatel2000 in #1749
- Switch flash attention to tag by @mvpatel2000 in #1766
- remove grad clip norm by @mvpatel2000 in #1768
- unregister engine for memory cleanup by @mvpatel2000 in #1769
- Fix hf tokenizer test for new hf version by @dakinggg in #1772
- Decrease microbatch size if batch size is smaller by @mvpatel2000 in #1771
- remove deprecated code by @mvpatel2000 in #1773
- cache call to cpuinfo by @dakinggg in #1778
- device train microbatch size pt 2 by @mvpatel2000 in #1776
- Huggingface pretrain + finetune notebook by @dakinggg in #1775
- Bump traitlets from 5.5.0 to 5.6.0 by @dependabot in #1781
- Bump deepspeed from 0.7.5 to 0.7.6 by @dependabot in #1780
- Minor docs fix for deepspeed typo by @mvpatel2000 in #1784
- Update Auto Microbatching by @mvpatel2000 in #1785
- Adding GyroDropout as an algorithm to Composer by @jelite in #1718
- Add Deprecation warning for Fused LayerNorm by @nik-mosaic in #1789
- Update error msgs by @mvpatel2000 in #1791
- Change gyro emoji by @nik-mosaic in #1792
- Speeding up tests by @dakinggg in #1779
- Add durations arg to pytest by @dakinggg in #1793
- Properly implement gradient clipping for FSDP by @bcui19 in #1740
- Updating FSDP supported config flags by @bcui19 in #1794
- Remove streaming v1 datasets. by @knighton in #1787
- Remove references to validate in docs by @dakinggg in #1800
- Install latest Git in Docker images by @bandish-shah in #1770
- move to pypi release for flash attn by @mvpatel2000 in #1777
- Check and make sure that metric names is a list of strings by @dakinggg in #1798
- Adding in the possibility of 'None' for MixedPrecision FSDP by @bcui19 in #1796
- Updating assertion check for gradient clipping and updating gradient clip tests for FSDP by @bcui19 in #1802
- Moving Pytest CPU to GHA by @mvpatel2000 in #1790
- Bump sphinxext-opengraph from 0.6.3 to 0.7.3 by @dependabot in #1760
- Update distributed_training.rst by @lupesko in #1731
- Use streaming v3 by @knighton in #1797
- Bump traitlets from 5.6.0 to 5.7.0 by @dependabot in #1806
- Bump ipykernel from 6.16.2 to 6.19.2 by @dependabot in #1810
- Update packaging requirement from <22,>=21.3.0 to >=21.3.0,<23 by @dependabot in #1808
- match list batch splitting and tensor batch splitting by @dakinggg in #1804
- Add type ignore for onnx import by @mvpatel2000 in #1811
- Remove pip install all from coverage action by @dakinggg in #1805
- Remove coco and ssd by @growlix in #1717
- Rename matrix by @mvpatel2000 in #1813
- Add OCI ObjectStore by @eracah in #1774
- Add MLFlowLogger by @eracah in #1795
- Object store docs by @dakinggg in #1817
- fix inf eval by @mvpatel2000 in #1815
- Add `fsdp_config` to `state` and add fsdp_config to trainer docstring by @growlix in #1821
- Add SHARP support to docker by @mvpatel2000 in #1818
- Testing Infra Cleanup by @mvpatel2000 in #1822
- Remove dead code in dockerfile by @mvpatel2000 in #1823
- Fix Export Docs by @mvpatel2000 in #1824
- Remove old deprecated logger methods by @eracah in #1826
- NLP metrics tests by @dakinggg in #1830
- Nlp pipeline test by @dakinggg in #1828
- Add tests for uri helper functions by @eracah in #1827
- Add pip targets to installation.rst docs by @eracah in #1829
## New Contributors

**Full Changelog**: v0.11.1...v0.12.0