A wrapper around the Apache Spark Connect client with additional functionality that lets applications communicate with a remote Dataproc Spark cluster over the Spark Connect protocol without any additional setup.
```shell
pip install dataproc_spark_connect
```

To uninstall:

```shell
pip uninstall dataproc_spark_connect
```
This client requires permissions to manage Dataproc sessions and session templates. If you are running the client outside of Google Cloud, you must set the following environment variables:

- GOOGLE_CLOUD_PROJECT - The Google Cloud project you use to run Spark workloads.
- GOOGLE_CLOUD_REGION - The Compute Engine region where you run the Spark workload.
- GOOGLE_APPLICATION_CREDENTIALS - The path to your Application Default Credentials file.
- DATAPROC_SPARK_CONNECT_SESSION_DEFAULT_CONFIG (Optional) - The location of a session config file, such as `tests/integration/resources/session.textproto`.
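For example, before starting your application outside of Google Cloud you might export these variables in your shell (the project ID, region, and key path below are placeholders, not values from this project):

```shell
# Placeholder values -- replace with your own project, region, and key file.
export GOOGLE_CLOUD_PROJECT="my-project-id"
export GOOGLE_CLOUD_REGION="us-central1"
export GOOGLE_APPLICATION_CREDENTIALS="$HOME/keys/service-account.json"
```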
- Install the latest versions of the Dataproc Python client and Dataproc Spark Connect modules:

  ```shell
  pip install google_cloud_dataproc --force-reinstall
  pip install dataproc_spark_connect --force-reinstall
  ```
- Add the required import into your PySpark application or notebook:

  ```python
  from google.cloud.dataproc_spark_connect import DataprocSparkSession
  ```
- There are two ways to create a Spark session:

  - Start a Spark session using the properties defined in DATAPROC_SPARK_CONNECT_SESSION_DEFAULT_CONFIG:

    ```python
    spark = DataprocSparkSession.builder.getOrCreate()
    ```

  - Start a Spark session with the following code instead of using a config file:

    ```python
    from google.cloud.dataproc_v1 import SparkConnectConfig
    from google.cloud.dataproc_v1 import Session

    dataproc_config = Session()
    dataproc_config.spark_connect_session = SparkConnectConfig()
    dataproc_config.environment_config.execution_config.subnetwork_uri = "<subnet>"
    dataproc_config.runtime_config.version = '3.0'
    spark = DataprocSparkSession.builder.dataprocConfig(dataproc_config).getOrCreate()
    ```
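The config file referenced by DATAPROC_SPARK_CONNECT_SESSION_DEFAULT_CONFIG holds a `Session` message in protobuf text format. A minimal sketch of such a file, mirroring the settings from the code example above (the `<subnet>` placeholder and runtime version come from that example; field names follow the `google.cloud.dataproc_v1.Session` proto):

```textproto
# Sketch of a session config in protobuf text format; field names follow
# google.cloud.dataproc_v1.Session. Replace <subnet> with your subnetwork URI.
spark_connect_session {}
environment_config {
  execution_config {
    subnetwork_uri: "<subnet>"
  }
}
runtime_config {
  version: "3.0"
}
```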
Because this client runs the Spark workload on Dataproc, your project is billed according to Dataproc Serverless pricing. This applies even if you are running the client from a non-GCE instance.
- Install the requirements in a virtual environment:

  ```shell
  pip install -r requirements.txt
  ```
- Build the code:

  ```shell
  python setup.py sdist bdist_wheel
  ```
- Copy the generated `.whl` file to Cloud Storage. Use the version specified in the `setup.py` file:

  ```shell
  VERSION=<version>
  gsutil cp dist/dataproc_spark_connect-${VERSION}-py2.py3-none-any.whl gs://<your_bucket_name>
  ```
- Download the new SDK on Vertex, then uninstall the old version and install the new one:

  ```shell
  %%bash
  export VERSION=<version>
  gsutil cp gs://<your_bucket_name>/dataproc_spark_connect-${VERSION}-py2.py3-none-any.whl .
  pip uninstall -y dataproc_spark_connect
  pip install dataproc_spark_connect-${VERSION}-py2.py3-none-any.whl
  ```