Provides a Spark backend for executing Dataflow pipelines.

dennishuo/spark-dataflow

 
 


spark-dataflow

Spark-dataflow is an early prototype, and we encourage outside contributions, so hack away! To see which areas we have already identified as needing improvement, check out the issues listed in the GitHub repo.

Spark-dataflow allows users to execute Dataflow pipelines with Spark. Executing a pipeline on a Spark cluster is easy: depend on spark-dataflow in your project, then execute your pipeline from a program by calling SparkPipelineRunner.run.

The Maven coordinates of the current version of this project are:

<groupId>com.cloudera.dataflow.spark</groupId>
<artifactId>dataflow-spark</artifactId>
<version>0.0.1</version>

To run a Dataflow pipeline with the default options (a single-threaded Spark instance in local mode), we would do the following:

Pipeline p = <logic for pipeline creation >
EvaluationResult result = SparkPipelineRunner.create().run(p);

To create a pipeline runner that runs against a different Spark cluster with a custom master URL, we would do the following:

Pipeline p = <logic for pipeline creation >
SparkPipelineOptions options = SparkPipelineOptionsFactory.create();
options.setSparkMaster("spark://host:port");
EvaluationResult result = SparkPipelineRunner.create(options).run(p);
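
The string passed to setSparkMaster follows Spark's own master URL conventions: "local" for one worker thread in local mode, "local[N]" for N local threads, and "spark://host:port" for a standalone cluster. As a minimal sketch, the helper below builds those strings with basic validation. The class SparkMasterUrls and its methods are hypothetical names introduced for illustration; they are not part of spark-dataflow or Spark.

```java
import java.util.Objects;

/** Hypothetical helper (not part of spark-dataflow) that formats Spark master URLs. */
final class SparkMasterUrls {
    private SparkMasterUrls() {}

    /** Local mode with a single worker thread. */
    static String local() {
        return "local";
    }

    /** Local mode with a fixed number of worker threads, e.g. "local[4]". */
    static String local(int threads) {
        if (threads < 1) {
            throw new IllegalArgumentException("threads must be >= 1, got " + threads);
        }
        return "local[" + threads + "]";
    }

    /** Standalone cluster master, e.g. "spark://host:7077". */
    static String standalone(String host, int port) {
        Objects.requireNonNull(host, "host");
        if (host.isEmpty()) {
            throw new IllegalArgumentException("host must not be empty");
        }
        if (port < 1 || port > 65535) {
            throw new IllegalArgumentException("port out of range: " + port);
        }
        return "spark://" + host + ":" + port;
    }
}
```

Under these assumptions, the call above could be written as options.setSparkMaster(SparkMasterUrls.standalone("host", 7077)), where 7077 is the port a Spark standalone master listens on by default.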
