spark-dataflow

Spark-dataflow is an early prototype. If this project interests you, you should know that we encourage outside contributions. So, hack away! To get an idea of what we have already identified as areas that need improvement, checkout the issues listed in the github repo.

Spark-dataflow allows users to execute dataflow pipelines with Spark. Executing a pipeline on a spark cluster is easy: Depend on spark-dataflow in your project and execute your pipeline in a program by calling SparkPipelineRunner.run.

The Maven coordinates of the current version of this project are:

<groupId>com.cloudera.dataflow.spark</groupId>
<artifactId>dataflow-spark</artifactId>
<version>0.0.1</version>

If we wanted to run a dataflow pipeline with the default options of a single threaded spark instance in local mode, we would do the following:

Pipeline p = <logic for pipeline creation >
EvaluationResult result = SparkPipelineRunner.create().run(p);

To create a pipeline runner to run against a different spark cluster, with a custom master url we would do the following:

Pipeline p = <logic for pipeline creation >
SparkPipelineOptions options = SparkPipelineOptionsFactory.create();
options.setSparkMaster("spark://host:port");
EvaluationResult result = SparkPipelineRunner.create(options).run(p);

Name		Name	Last commit message	Last commit date
Latest commit History 34 Commits
build-resources		build-resources
src		src
.gitignore		.gitignore
LICENSE		LICENSE
README.md		README.md
pom.xml		pom.xml

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

spark-dataflow

About

Releases

Packages

Languages

License

dennishuo/spark-dataflow

Folders and files

Latest commit

History

Repository files navigation

spark-dataflow

About

Resources

License

Stars

Watchers

Forks

Releases

Packages 0

Languages

Packages