This repository has been archived by the owner on Dec 30, 2019. It is now read-only.

Support JavaRDD<T> #3

Open
dfdx opened this issue May 24, 2015 · 0 comments

dfdx commented May 24, 2015

This is partly a feature request and partly a discussion issue. Currently the only way to create a new RDD is to call parallelize(), which boxes every value in the collection into a JuliaObject and returns the corresponding JavaRDD<JuliaObject>. In practice, however, we will need to deal with RDDs of other element types, i.e. JavaRDD<T>.

The simplest way to deal with this is to restrict T to be either a byte array or, as a special case, a string. This will let us call things like textFile and get an RDD{String}, which is already enough for real applications.
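To make the byte-array restriction more concrete, here is a minimal sketch of how records could be framed on the Java side before being handed to Julia. Everything in it is an assumption made for illustration: the class name ByteFraming, the method names, and the length-prefixed layout are hypothetical and not taken from the existing code.

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of the existing code base.
// Each record is written as a 4-byte big-endian length followed by the payload.
final class ByteFraming {
    private ByteFraming() {}

    // Special case from the text above: a string is just a UTF-8 byte array.
    static byte[] fromString(String s) {
        return s.getBytes(StandardCharsets.UTF_8);
    }

    static String asString(byte[] bytes) {
        return new String(bytes, StandardCharsets.UTF_8);
    }

    // Concatenate all records into one length-prefixed buffer.
    static byte[] frame(List<byte[]> records) {
        int size = 0;
        for (byte[] r : records) {
            size += Integer.BYTES + r.length;
        }
        ByteBuffer buf = ByteBuffer.allocate(size);
        for (byte[] r : records) {
            buf.putInt(r.length);
            buf.put(r);
        }
        return buf.array();
    }

    // Inverse of frame(): split the buffer back into individual records.
    static List<byte[]> unframe(byte[] data) {
        ByteBuffer buf = ByteBuffer.wrap(data);
        List<byte[]> records = new ArrayList<>();
        while (buf.hasRemaining()) {
            byte[] r = new byte[buf.getInt()];
            buf.get(r);
            records.add(r);
        }
        return records;
    }
}
```

With a layout like this, the Julia side only needs to read a length and then that many bytes, and for the string case decode them as UTF-8.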

A more interesting and trickier way is to support custom serializers/deserializers. Say, we could ask interested users to implement some kind of JuliaSerializer<T> that transforms T into a byte array on the Java side, plus a corresponding convert method that constructs the matching object on the Julia side.
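As a rough sketch of what the Java half of such a contract might look like: only the name JuliaSerializer<T> comes from the idea above; the method name serialize, the Point type, and PointSerializer are hypothetical and exist purely for illustration.

```java
import java.nio.ByteBuffer;

// Hypothetical interface; the serialize() signature is an assumption.
interface JuliaSerializer<T> {
    // Turn one element of a JavaRDD<T> into bytes that Julia can decode.
    byte[] serialize(T value);
}

// Made-up user type standing in for an arbitrary T.
final class Point {
    final double x;
    final double y;

    Point(double x, double y) {
        this.x = x;
        this.y = y;
    }
}

// Example implementation a user might write for their own type.
final class PointSerializer implements JuliaSerializer<Point> {
    @Override
    public byte[] serialize(Point p) {
        // Two 8-byte doubles; ByteBuffer writes big-endian by default.
        return ByteBuffer.allocate(2 * Double.BYTES)
                .putDouble(p.x)
                .putDouble(p.y)
                .array();
    }
}
```

The matching convert method on the Julia side would then read two Float64 values from the byte array, taking care of byte order, since ByteBuffer defaults to big-endian.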

I'm currently looking at PySpark's implementation of its default AutoBatchedSerializer, but any ideas are welcome.
