[WIP] Juno do not merge #6

Open: wants to merge 69 commits into base: 2025-1

69 commits
2b0f370
add user quotas to fs by default, 1g per user on each fs
mboisson Sep 30, 2024
9f0c173
add juno to the configuration, make ncpu and friends into a structure…
mboisson Sep 30, 2024
dc85fe5
initial config for mcgill scs
mboisson Sep 30, 2024
112dcb0
fix typo
mboisson Sep 30, 2024
ab0cb41
fix typo
mboisson Sep 30, 2024
c1b5eaf
define network parameters, required for Juno
mboisson Sep 30, 2024
749e10e
removing static node, using cpu snapshot for new cpu nodes
mboisson Sep 30, 2024
94d1e91
add nvidia-driver-cuda to the passthrough packages
mboisson Sep 30, 2024
65ef7b8
use snapshot for gpu nodes too
mboisson Sep 30, 2024
c34c244
remove disk size for gpu nodes
mboisson Sep 30, 2024
c54a287
re-create new snapshot from smaller node
mboisson Sep 30, 2024
4ede73d
try new snapshot for new pool node, add pool nodes
mboisson Sep 30, 2024
003ac59
reduce suspend_time of Slurm
mboisson Sep 30, 2024
f520307
restrict number of MIGs to 1, add more GPU nodes in the pool
mboisson Oct 1, 2024
b3fc7b6
increase the suspend time to 1 day temporarily, for debugging
mboisson Oct 1, 2024
fcd0d4b
up commit of MC to latest, pin GPU drivers to 550
mboisson Oct 1, 2024
e7569b3
removed cpupool nodes
mboisson Oct 1, 2024
cfef6c0
update snapshot for gpu node
mboisson Oct 1, 2024
afe4959
reduce suspend_time to 15 minutes
mboisson Oct 1, 2024
fed018c
destroy all gpupool nodes
mboisson Oct 1, 2024
b8ae432
re-add gpupool nodes
mboisson Oct 1, 2024
e0a795b
test with drivers 555-dkms
mboisson Oct 2, 2024
5b84c5b
retry with dkms-560
mboisson Oct 2, 2024
68740d4
shuffle MIG config between node types
mboisson Oct 2, 2024
d80f2f1
remove MIG from nodegpupool16* to test
mboisson Oct 2, 2024
76b1294
disable mig on all nodes
mboisson Oct 2, 2024
a9b217b
add MIG back with drivers 560, from GPUs that have MIG disabled
mboisson Oct 2, 2024
2584ff7
roll back to 550 drivers
mboisson Oct 2, 2024
eb1178f
swap mig config
mboisson Oct 2, 2024
c2d1fec
restore the use of the snapshot
mboisson Oct 2, 2024
0e96d0f
make config_git_url configurable
mboisson Oct 2, 2024
4609e19
test MC from Maxime's fork, with StdEnv/2023
mboisson Oct 2, 2024
8c08fb3
add new cpu node
mboisson Oct 2, 2024
5ad4022
update custom.tf for the test cluster to the n. structure
mboisson Oct 2, 2024
04bdd0b
remove static node
mboisson Oct 2, 2024
0e5e28f
merge 2024-3 in juno
mboisson Oct 9, 2024
201d2ba
change mig config to use only 3g.20gb, and bump puppet-mc commit
mboisson Oct 16, 2024
070c824
initial configuration for nvidia workshop
mboisson Oct 16, 2024
c1e6115
use 'internal' as cluster purpose
mboisson Oct 16, 2024
e3d3a6d
add cuda/12.2 in the default modules
mboisson Oct 16, 2024
c8a044f
configure admin validation required
mboisson Oct 16, 2024
3c65f22
require admin verification
mboisson Oct 16, 2024
24af347
customize number of default users
mboisson Oct 16, 2024
6bb643e
configure more nodes in 1g.5gb
mboisson Oct 17, 2024
0fca901
bump version of slurm
mboisson Oct 17, 2024
c7e68be
test limit of 15 gres/gpu per user
mboisson Oct 17, 2024
fd68622
removed MaxTRESRunMinsPerUser as this is not a cluster option, but a …
mboisson Oct 17, 2024
e66f9c3
removed configuration per class
mboisson Oct 17, 2024
afca279
require admin verification
mboisson Oct 17, 2024
2700939
require admin verification by default
mboisson Oct 17, 2024
82b0ddc
add cq and j nodes
mboisson Oct 18, 2024
9e8b287
add cq and j nodes
mboisson Oct 18, 2024
b1dbfd8
boot new node from Rocky 8 image
mboisson Oct 18, 2024
add5c15
boot second gpu node
mboisson Oct 18, 2024
32f0ef5
configure mig in 7x1g.5gb
mboisson Oct 18, 2024
5a7e16e
switch static node to smaller flavor
mboisson Oct 18, 2024
609f9ff
remove static nodes
mboisson Oct 18, 2024
1a333dc
re-create static nodes
mboisson Oct 18, 2024
1279109
remove static nodes
mboisson Oct 18, 2024
cb5e036
recreate static node
mboisson Oct 18, 2024
1ff4627
remove static node, switch to new snapshot
mboisson Oct 18, 2024
471059d
remove 80gb gpus
mboisson Oct 22, 2024
f3d2040
restrict the flavours of GPUs that can be requested
mboisson Oct 22, 2024
0ea918a
remove 80gb gpus, add more flavors of gpus
mboisson Oct 22, 2024
1164642
remove comments
mboisson Oct 22, 2024
bf06824
fix spacing
mboisson Oct 23, 2024
d314ed1
renamed n to nnodes for clarity
mboisson Oct 23, 2024
28ac226
Merge branch '2025-1' into juno
mboisson Oct 23, 2024
262aab3
Merge branch '2025-1' into juno
mboisson Oct 23, 2024
6 changes: 5 additions & 1 deletion common/config.yaml
@@ -4,8 +4,11 @@ jupyterhub::jupyterhub_config_hash:
  SlurmFormSpawner:
    start_timeout: 900

profile::freeipa::mokey::require_verify_admin: false
profile::freeipa::mokey::require_verify_admin: true
profile::slurm::base::slurm_version: '23.02'
# when using snapshots, it is quick enough to boot nodes that 900 seconds is enough for suspend
profile::slurm::base::suspend_time: 900
profile::gpu::install::passthrough::nvidia_driver_stream: '550-dkms'

prometheus::global_config:
  scrape_interval: '1m'
@@ -22,3 +25,4 @@ prometheus::remote_write_configs:
    basic_auth:
      username: 'cqformation'
      password: "%{alias('prometheus_password')}"

332 changes: 240 additions & 92 deletions common/main.tf

Large diffs are not rendered by default.

29 changes: 29 additions & 0 deletions mcgill-scs/config.yaml
@@ -0,0 +1,29 @@
jupyterhub::jupyterhub_config_hash:
  SbatchForm:
    runtime:
      min: 3.5
      def: 3.5
      max: 5.0
    nprocs:
      min: 1
      def: 1
      max: 1
    memory:
      min: 1024
      max: 2048
      def: 2048
    oversubscribe:
      def: true
      lock: true
    gpus:
      def: 'gpu:0'
      choices: ['gpu:0', 'gpu:1g.5gb:1', 'gpu:3g.20gb:1' ]
    ui:
      def: 'lab'
  SlurmFormSpawner:
    disable_form: false

profile::software_stack::lmod_default_modules: ['StdEnv/2023', 'gcc/12.3', 'openmpi/4.1.5', 'python/3.10', 'ipython-kernel/3.10']
profile::freeipa::mokey::require_verify_admin: true
profile::slurm::base::slurm_version: '24.05'
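
Review note: the SbatchForm values above can be sanity-checked with a few lines of Python. This is only a sketch of the arithmetic, not how the JupyterHub form spawner validates its configuration. The dictionaries are hand-copied from the YAML, and the helper name `out_of_range` is hypothetical.

```python
# Numeric form fields copied by hand from mcgill-scs/config.yaml.
form = {
    "runtime": {"min": 3.5, "def": 3.5, "max": 5.0},
    "nprocs": {"min": 1, "def": 1, "max": 1},
    "memory": {"min": 1024, "max": 2048, "def": 2048},
}
# GPU choices copied from the same file.
gpus = {"def": "gpu:0", "choices": ["gpu:0", "gpu:1g.5gb:1", "gpu:3g.20gb:1"]}


def out_of_range(form):
    """Return the names of fields whose default lies outside [min, max]."""
    return [
        name
        for name, field in form.items()
        if not field["min"] <= field["def"] <= field["max"]
    ]


print(out_of_range(form))              # prints []
print(gpus["def"] in gpus["choices"])  # prints True
```

Both checks pass for the values in this file: every default sits inside its bounds, and the default GPU request is one of the offered choices.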

37 changes: 37 additions & 0 deletions mcgill-scs/custom.tf
@@ -0,0 +1,37 @@
locals {
  custom = {
    nnodes = {
      cpu = 0
      cpupool = 0
      gpupool16 = 6
      gpupool16-cq = 4
      gpupool12 = 2
      gpupool12-j = 8
      gpupool80 = 0
    }
    home_size = 100
    project_size = 500
    scratch_size = 400
    nb_users = 1

    user_quotas = {
      home = "1g"
      project = "2g"
      scratch = "4g"
    }

    mig = {
      gpupool16-cq = { "1g.5gb" = 7 }
      gpupool16 = { "1g.5gb" = 7 }
      gpupool12 = { "3g.20gb" = 2 }
      gpupool12-j = { "1g.5gb" = 7 }
    }

    image_cpu = "snapshot-cpunode-2024-R810.5"
    image_gpu = "snapshot-gpunode-2024-R810.5"

    config_version = "dc6b37f4d2c077a37d88bf4862ba57a09eed7213"
  }

  name = "mcgill-scs"
}
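
Review note: to see how many MIG instances this layout would expose, the `nnodes` and `mig` maps can be multiplied out. The sketch below assumes one physical GPU per node, which the diff does not state for the pool flavours, and the helper `mig_totals` is hypothetical, not part of the Terraform code.

```python
# Node counts and per-GPU MIG layouts copied by hand from mcgill-scs/custom.tf.
nnodes = {
    "cpu": 0, "cpupool": 0,
    "gpupool16": 6, "gpupool16-cq": 4,
    "gpupool12": 2, "gpupool12-j": 8, "gpupool80": 0,
}
mig = {
    "gpupool16-cq": {"1g.5gb": 7},
    "gpupool16": {"1g.5gb": 7},
    "gpupool12": {"3g.20gb": 2},
    "gpupool12-j": {"1g.5gb": 7},
}

GPUS_PER_NODE = 1  # assumption: not stated anywhere in this diff


def mig_totals(nnodes, mig, gpus_per_node=GPUS_PER_NODE):
    """Return {profile: total MIG instances} summed over all node pools."""
    totals = {}
    for pool, layout in mig.items():
        node_count = nnodes.get(pool, 0)
        for profile, per_gpu in layout.items():
            instances = node_count * gpus_per_node * per_gpu
            totals[profile] = totals.get(profile, 0) + instances
    return totals


print(mig_totals(nnodes, mig))  # prints {'1g.5gb': 126, '3g.20gb': 4}
```

Under that assumption, a fully booted pool would offer 126 `1g.5gb` instances and 4 `3g.20gb` instances.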
1 change: 1 addition & 0 deletions mcgill-scs/main.tf
54 changes: 54 additions & 0 deletions nvidia-workshop/config.yaml
@@ -0,0 +1,54 @@
jupyterhub::jupyterhub_config_hash:
  SbatchForm:
    runtime:
      min: 3.5
      def: 3.5
      max: 5.0
    nprocs:
      min: 1
      def: 1
      max: 1
    memory:
      min: 1024
      max: 2048
      def: 2048
    oversubscribe:
      def: true
      lock: true
    gpus:
      def: 'gpu:0'
      choices: ['gpu:0', 'gpu:1g.5gb:1', 'gpu:3g.20gb:1' ]
    ui:
      def: 'lab'
  SlurmFormSpawner:
    disable_form: false

profile::software_stack::lmod_default_modules: ['StdEnv/2023', 'nvhpc/23.9', 'cuda/12.2', 'openmpi/4.1.5', 'python/3.11', 'ipython-kernel/3.11']
profile::freeipa::mokey::require_verify_admin: true
#profile::slurm::base::suspend_time: 86400

profile::users::ldap::users:
  dummy_cours1:
    count: 1
    groups: ['def-cours1']

  dummy_cours2:
    count: 1
    groups: ['def-cours2']

  dummy_cours3:
    count: 1
    groups: ['def-cours3']

profile::slurm::accounting::accounts:
  def-cours1:
    Fairshare: 1
    MaxJobs: 1
  def-cours2:
    Fairshare: 1
    MaxJobs: 10
  def-cours3:
    Fairshare: 1
    MaxJobs: 10


46 changes: 46 additions & 0 deletions nvidia-workshop/custom.tf
@@ -0,0 +1,46 @@
locals {
  custom = {
    nnodes = {
      cpu = 0
      cpupool = 0
      gpu = 0
      gpupool16 = 16
      gpupool16-cq = 4
      gpupool12 = 4
      gpupool12-j = 20
      gpupool80 = 0
    }
    cluster_purpose = "internal"
    home_size = 50
    project_size = 50
    scratch_size = 50

    user_quotas = {
      home = "1g"
      project = "1g"
      scratch = "1g"
    }

    instances_type_map = {
      juno = {
        gpu = "gpu12-120-850gb-a100x1"
      }
    }
    mig = {
      gpu = { "1g.5gb" = 7 }
      gpupool16 = { "1g.5gb" = 7 }
      gpupool16-cq = { "1g.5gb" = 7 }
      gpupool12 = { "3g.20gb" = 2 }
      gpupool12-j = { "1g.5gb" = 7 }
      gpupool80 = { "2g.20gb" = 3 }
    }

    image_cpu = "snapshot-cpunode-2024-R810.5"
    #image_gpu = "Rocky-8.10"
    image_gpu = "snapshot-gpunode-2024-R810.5"

    config_version = "dc6b37f4d2c077a37d88bf4862ba57a09eed7213"
  }

  name = "nvidia-workshop"
}
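
Review note: a quick way to catch a mistyped MIG layout is to sum the compute slices requested per GPU. The sketch below assumes every pool uses A100-class GPUs, which expose at most 7 MIG compute slices. The diff only names an `a100x1` flavour for the `gpu` type, so this is an assumption for the other pools. The check is necessary but not sufficient, since it ignores memory slices and placement rules. The helper names are hypothetical.

```python
import re

MAX_COMPUTE_SLICES = 7  # an A100 exposes at most 7 MIG compute slices

# Per-GPU MIG layouts copied by hand from nvidia-workshop/custom.tf.
mig = {
    "gpu": {"1g.5gb": 7},
    "gpupool16": {"1g.5gb": 7},
    "gpupool16-cq": {"1g.5gb": 7},
    "gpupool12": {"3g.20gb": 2},
    "gpupool12-j": {"1g.5gb": 7},
    "gpupool80": {"2g.20gb": 3},
}


def compute_slices(layout):
    """Sum the compute slices requested by one per-GPU layout."""
    total = 0
    for profile, count in layout.items():
        # Profile names look like "<compute slices>g.<memory>gb".
        match = re.fullmatch(r"(\d+)g\.(\d+)gb", profile)
        if match is None:
            raise ValueError(f"unrecognised MIG profile: {profile!r}")
        total += int(match.group(1)) * count
    return total


def oversubscribed_pools(mig, limit=MAX_COMPUTE_SLICES):
    """Return the sorted names of pools that request more slices than the limit."""
    return sorted(
        pool for pool, layout in mig.items() if compute_slices(layout) > limit
    )


print(oversubscribed_pools(mig))  # prints []
```

Every layout in this file stays within the limit: the `1g.5gb` pools use all 7 slices, while the `3g.20gb` and `2g.20gb` pools use 6.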
1 change: 1 addition & 0 deletions nvidia-workshop/main.tf
1 change: 0 additions & 1 deletion test-mc-infra-cours/config.yaml
@@ -21,7 +21,6 @@ jupyterhub::jupyterhub_config_hash:
    disable_form: false

profile::freeipa::mokey::require_verify_admin: false
profile::software_stack::lmod_default_modules: ['StdEnv/2023', 'gcc/12.3', 'openmpi/4.1.5', 'python/3.10', 'ipython-kernel/3.10']

profile::users::ldap::users:
  dummy_cours1:
16 changes: 10 additions & 6 deletions test-mc-infra-cours/custom.tf
@@ -1,15 +1,19 @@
locals {
  custom = {
    ncpu = 0
    ncpupool = 1
    ngpu = 0
    ngpupool = 1
    nnodes = {
      cpu = 0
      cpupool = 1
      gpu = 0
      gpupool = 1
    }
    # home_size = 100
    # project_size = 100
    # scratch_size = 50
    image_cpu = "snapshot-cpunode-2024-R810.4"
    image_gpu = "snapshot-gpunode-2024-R810.4"
    image_cpu = "snapshot-cpunode-2024-R810.5"
    image_gpu = "snapshot-gpunode-2024-R810.5"

    config_git_url = "https://github.com/mboisson/puppet-magic_castle.git"
    config_version = "1b45e1f"
    volumes = {
      nfs = {
        home = { size = 100, quota = "1g" }