wandb/terraform-google-wandb

Weights & Biases Google Module

This is a Terraform module for provisioning a Weights & Biases Cluster on Google Cloud. Weights & Biases Local is our self-hosted distribution of wandb.ai. It offers enterprises a private instance of the Weights & Biases application, with no resource limits and with additional enterprise-grade architectural features like audit logging and single sign-on.

About This Module

Pre-requisites

This module is intended to run in a Google Cloud account with minimal preparation. However, it does have the following pre-requisites:

Terraform version >= 1

Credentials / Permissions

Google Services Used

  • Google Cloud SQL (MySQL)
  • Google Kubernetes Engine
  • Google Storage Bucket
  • Google PubSub
  • Google Managed Certificates
  • Google Cloud DNS

How to Use This Module

  • Ensure your account meets the module pre-requisites above.
  • Create a Terraform configuration that pulls in this module and specifies values of the required variables:
provider "google" {
  project = "<desired google project>"
  region  = "<desired google region>"
  zone    = "<desired google zone>"
}

module "wandb" {
  source    = "<filepath to cloned module directory>"
  namespace = "<prefix for naming google resources>"
  license   = "<your wandb/local license>"
}
  • Run terraform init and terraform apply

Cluster Sizing

By default, the Kubernetes instance type, the number of instances, the Redis cluster size, and the database instance size are standardized by the configurations in ./deployment-size.tf and selected via the size input variable.

Available sizes are small, medium, large, xlarge, and xxlarge. The default is small.

All the values set via deployment-size.tf can be overridden by setting the appropriate input variables.

  • gke_machine_type - The instance type for the GKE nodes
  • gke_min_node_count - The minimum number of nodes in the GKE cluster
  • gke_max_node_count - The maximum number of nodes in the GKE cluster
  • redis_memory_size_gb - The memory size of the redis cluster
  • database_machine_type - The instance type for the database

Examples

We have included documentation and reference examples for common installation scenarios, as well as examples for supporting resources that lack official modules.

Requirements

Name Version
terraform ~> 1.0
google ~> 5.30
helm ~> 2.10
kubernetes ~> 2.23
time 0.11.2
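
If you want the calling configuration to declare the same constraints, a minimal sketch could look like the block below. The version constraints are copied from the table above; the provider source addresses are an assumption (the usual hashicorp/* registry namespaces) and are not stated in this document.

```hcl
terraform {
  required_version = "~> 1.0"

  required_providers {
    google = {
      source  = "hashicorp/google" # assumed registry source
      version = "~> 5.30"
    }
    helm = {
      source  = "hashicorp/helm" # assumed registry source
      version = "~> 2.10"
    }
    kubernetes = {
      source  = "hashicorp/kubernetes" # assumed registry source
      version = "~> 2.23"
    }
    time = {
      source  = "hashicorp/time" # assumed registry source
      version = "0.11.2"
    }
  }
}
```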

Providers

Name Version
google ~> 5.30

Modules

Name Source Version
app_gke ./modules/app_gke n/a
app_lb ./modules/app_lb n/a
clickhouse ./modules/clickhouse n/a
database ./modules/database n/a
gke_app wandb/wandb/kubernetes 1.14.1
kms ./modules/kms n/a
kms_default_bucket ./modules/kms n/a
kms_default_sql ./modules/kms n/a
networking ./modules/networking n/a
private_link ./modules/private_link n/a
project_factory_project_services terraform-google-modules/project-factory/google//modules/project_services ~> 14.0
redis ./modules/redis n/a
service_accounts ./modules/service_accounts n/a
sleep matti/resource/shell 1.5.0
storage ./modules/storage n/a
wandb wandb/wandb/helm 1.2.0

Resources

Name Type
google_client_config.current data source
google_compute_forwarding_rules.all data source

Inputs

Name Description Type Default Required
allowed_inbound_cidrs Which IPv4 addresses/ranges to allow access. This must be explicitly provided, and by default is set to ["*"] list(string) ["*"] no
allowed_project_names A map of allowed projects where each key is a project number and the value is the connection limit. map(number) {} no
app_wandb_env Extra environment variables for W&B map(string) {} no
bucket_default_encryption Boolean to determine if a default bucket encryption key should be used. If true, a default key will be created. Takes precedence over bucket_kms_key_id. bool false no
bucket_kms_key_id ID of the customer-provided bucket KMS key. string null no
bucket_location Location of the bucket (US, EU, ASIA) string "US" no
bucket_name Use an existing bucket. string "" no
bucket_path path of where to store data for the instance-level bucket string "" no
clickhouse_private_endpoint_service_name ClickHouse private endpoint 'Service name' (ends in -clickhouse-cloud). string "" no
clickhouse_region ClickHouse region (us-east1, us-central1, etc). string "" no
clickhouse_subnetwork_cidr ClickHouse private service connect subnetwork string "10.50.0.0/24" no
controller_image_tag Tag of the controller image to deploy string "1.14.0" no
create_private_link Whether to create a private link service. bool false no
create_redis Boolean indicating whether to provision a Redis instance (true) or not (false). bool false no
create_workload_identity Flag to indicate whether to create a workload identity for the service account. bool false no
database_machine_type Specifies the machine type to be allocated for the database. Defaults to null and value from deployment-size.tf is used string null no
database_sort_buffer_size Specifies the sort_buffer_size value to set for the database number 67108864 no
database_version Version for MySQL string "MYSQL_8_0_31" no
db_kms_key_id ID of the customer-provided SQL KMS key. string null no
deletion_protection If the instance should have deletion protection enabled. The database / Bucket can't be deleted when this value is set to true. bool true no
disable_code_saving Boolean indicating if code saving is disabled bool false no
domain_name Domain for accessing the Weights & Biases UI. string null no
enable_stackdriver n/a bool false no
force_ssl Enforce SSL through the usage of the Cloud SQL Proxy (cloudsql://) in the DB connection string bool false no
gke_machine_type Specifies the machine type for nodes in the GKE cluster. Defaults to null and value from deployment-size.tf is used string null no
gke_max_node_count Maximum number of nodes for the GKE cluster. Defaults to null and value from deployment-size.tf is used number null no
gke_min_node_count Initial number of nodes for the GKE cluster. If gke_max_node_count is set, this is the minimum number of nodes. Defaults to null and value from deployment-size.tf is used number null no
ilb_proxynetwork_cidr Internal load balancer proxy subnetwork string "10.127.0.0/24" no
labels Labels to apply to resources map(string) {} no
license Your wandb/local license string n/a yes
local_restore Restores W&B to a stable state if needed bool false no
namespace String used for prefix resources. string n/a yes
network Pre-existing network self link string null no
oidc_auth_method OIDC auth method string "implicit" no
oidc_client_id The Client ID of application in your identity provider string "" no
oidc_issuer A URL to your OpenID Connect identity provider, e.g. https://cognito-idp.us-east-1.amazonaws.com/us-east-1_uiIFNdacd string "" no
oidc_secret The Client secret of application in your identity provider string "" no
operator_chart_version Version of the operator chart to deploy string "1.3.4" no
other_wandb_env Extra environment variables for W&B map(string) {} no
parquet_wandb_env Extra environment variables for W&B map(string) {} no
psc_subnetwork_cidr Private link service reserved subnetwork string "192.168.0.0/24" no
public_access Whether to create a public endpoint for wandb access. bool true no
redis_memory_size_gb Specifies the memory size in GB for the Redis instance. Defaults to null and value from deployment-size.tf is used number null no
redis_reserved_ip_range Reserved IP range for REDIS peering connection string "10.30.0.0/16" no
redis_tier Specifies the tier for this Redis instance string "STANDARD_HA" no
resource_limits Specifies the resource limits for the wandb deployment map(string) { "cpu": null, "memory": null } no
resource_requests Specifies the resource requests for the wandb deployment map(string) { "cpu": "2000m", "memory": "2G" } no
size Deployment size for the instance string "small" no
skip_bucket_admin_role Flag to indicate whether to skip the bucket policy creation. bool false no
sql_default_encryption Boolean to determine if a default SQL encryption key should be used. If true, a default key will be created. Takes precedence over db_kms_key_id. bool false no
ssl Enable SSL certificate bool true no
stackdriver_sa_name n/a string "wandb-stackdriver" no
subdomain Subdomain for accessing the Weights & Biases UI. Default creates record at Route53 Route. string null no
subnetwork Pre-existing subnetwork self link string null no
use_internal_queue Uses an internal redis queue instead of using google pubsub. bool false no
wandb_image Docker repository to pull the wandb image from. string "wandb/local" no
wandb_version The version of Weights & Biases local to deploy. string "latest" no
weave_wandb_env Extra environment variables for W&B map(string) {} no
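
To illustrate how a few of the optional inputs above fit together, here is a hedged sketch. Every value is a placeholder chosen for illustration (the domain, the CIDR range, and the OIDC issuer are hypothetical), and the combination shown is an assumption about a typical setup rather than a configuration taken from this module's examples.

```hcl
module "wandb" {
  source    = "<filepath to cloned module directory>"
  namespace = "<prefix for naming google resources>"
  license   = "<your wandb/local license>"

  # Hostname for the W&B UI (hypothetical domain).
  domain_name = "example.com"
  subdomain   = "wandb"

  # Restrict inbound access instead of the default ["*"] (hypothetical range).
  allowed_inbound_cidrs = ["203.0.113.0/24"]

  # Have the module create default KMS keys for the bucket and the database.
  bucket_default_encryption = true
  sql_default_encryption    = true

  # Single sign-on settings (placeholders for your identity provider).
  oidc_issuer    = "https://idp.example.com"
  oidc_client_id = "<client id from your identity provider>"
}
```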

Outputs

Name Description
address n/a
bucket_name Name of google bucket.
bucket_path path of where to store data for the instance-level bucket
bucket_queue_name Pubsub queue created for google bucket file upload events.
clickhouse_private_endpoint_id ClickHouse Private endpoint Endpoint ID to secure access inside VPC
cluster_ca_certificate Certificate of the kubernetes (GKE) cluster.
cluster_client_certificate n/a
cluster_client_key n/a
cluster_endpoint Endpoint of the kubernetes (GKE) cluster.
cluster_id ID of the kubernetes (GKE) cluster.
cluster_name n/a
cluster_node_pool Default node pool where Weights & Biases should be deployed into.
cluster_self_link Self link of the kubernetes (GKE) cluster.
database_connection_string Full database connection string. You must be in the VPC to access the database.
database_instance_type n/a
fqdn The FQDN to the W&B application
gke_max_node_count n/a
gke_node_count n/a
gke_node_instance_type n/a
private_attachement_id n/a
sa_account_email This output provides the email address of the service account created for workload identity, if workload identity is enabled. Otherwise, it returns null
service_account Weights & Biases service account used to manage resources.
standardized_size n/a
url The URL to the W&B application
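
Outputs can be re-exported from the calling configuration. The sketch below assumes the module block is labeled "wandb", as in the usage snippet earlier; the output names wandb_url and wandb_bucket_name are arbitrary names chosen for this example.

```hcl
# Surface the application URL after `terraform apply`.
output "wandb_url" {
  description = "The URL to the W&B application"
  value       = module.wandb.url
}

# Surface the name of the bucket used by the deployment.
output "wandb_bucket_name" {
  description = "Name of google bucket."
  value       = module.wandb.bucket_name
}
```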

Migrations

5.x -> 6.x

6.0.0 introduced autoscaling to the GKE cluster and made the size variable the preferred way to set the cluster size. Previously, unless the size variable was set explicitly, there were default values for the following variables:

  • gke_machine_type
  • gke_node_count
  • redis_memory_size_gb
  • db_machine_type

The size variable now defaults to small, and the following variables can be used to partially override the values set by the size variable:

  • gke_machine_type
  • gke_min_node_count
  • gke_max_node_count
  • redis_memory_size_gb
  • database_machine_type

For more information on the available sizes, see the Cluster Sizing section.

If you do not want the cluster to scale nodes in and out, set gke_min_node_count and gke_max_node_count to the same value.
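
A minimal sketch of pinning the cluster to a fixed node count, assuming three nodes is the size you want (the count is an illustrative value):

```hcl
module "wandb" {
  source    = "<filepath to cloned module directory>"
  namespace = "<prefix for naming google resources>"
  license   = "<your wandb/local license>"

  # Equal min and max leave the autoscaler no room to add or remove nodes.
  gke_min_node_count = 3 # illustrative value
  gke_max_node_count = 3 # illustrative value
}
```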

3.x -> 4.x

3.6.0 introduced a change in the Google Provider that isn't backwards compatible with prior versions. No action is needed to upgrade.