Issues: kubeflow/training-operator
Update Prometheus monitoring docs for Training Operator
area/docs
good first issue
help wanted
kind/feature
#2254
opened Sep 10, 2024 by andreyvelich
Create Slurm runtime for model training using V2 APIs
area/runtime
kind/feature
#2249
opened Sep 5, 2024 by andreyvelich
[Test] E2e Tests for Notebook Examples
area/testing
good first issue
help wanted
kind/feature
#2246
opened Sep 2, 2024 by Electronic-Waste
KEP-2170: Create model exporter for checkpointing and training output
area/storage
#2245
opened Aug 30, 2024 by andreyvelich
Include multiple files for TrainingClient().create_job()
area/sdk
kind/feature
#2233
opened Aug 23, 2024 by u66u
Support Local Execution of Training Jobs
area/sdk
kind/feature
#2231
opened Aug 21, 2024 by franciscojavierarceo
KEP-2170: Provide the client-go library for the TrainJob and TrainingRuntime
area/api
#2224
opened Aug 16, 2024 by tenzen-y
Can the tf-job-operator v1.0 metrics expose specific failed pods?
area/monitoring
kind/feature
#2220
opened Aug 15, 2024 by SecretSun
KEP-2170: Implement validations for TrainingRuntime and ClusterTrainingRuntime
area/webhook
#2219
opened Aug 14, 2024 by tenzen-y
KEP-2170: Support the PodSpecOverrides API in TrainJob
area/controller
#2218
opened Aug 14, 2024 by andreyvelich
KEP-2170: Create LLM training runtime for Llama 2 7b
area/runtime
#2212
opened Aug 14, 2024 by andreyvelich
KEP-2170: Create PyTorch multi-node distributed training runtime
area/runtime
#2211
opened Aug 14, 2024 by andreyvelich
KEP-2170: Create dataset and model initializers
area/storage
#2210
opened Aug 14, 2024 by andreyvelich
KEP-2170: Create Kustomize manifests to deploy JobSet and TrainJob controllers
area/controller
#2208
opened Aug 14, 2024 by andreyvelich