
Support for streaming? #27

Open
gnilrets opened this issue Mar 24, 2014 · 4 comments

@gnilrets
Contributor

From the examples in the homepage README, it looks like any bulk query you submit must fit into memory as a Ruby array of hashes. Would it be possible to stream the results to a file for queries that pull back a lot of data?

@yatish27
Owner

Need to work on that.
If you have a POC, you can send a PR.

@thomasdziedzic
Contributor

When you're getting the batches and merging them [0], you could instead just yield on individual batch results.

This feature is essential, since handling large data sets (which the bulk API is supposed to be good at) becomes very slow or outright impossible. For reference, when I pull down all my Accounts (~50,000 records) with all my fields, I eat up all 16 GB of my memory and have to kill the process. Now imagine if I wanted to pull down my Tasks (~700,000 records). I know there are much larger orgs out there, too.

[0] - https://github.com/yatish27/salesforce_bulk_api/blob/master/lib/salesforce_bulk_api/job.rb#L197
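As a rough sketch of what yielding per batch could look like (the method names here, like `each_batch_result` and `get_batch_result`, are illustrative assumptions, not the gem's actual API):

```ruby
# Hypothetical sketch: yield each batch's records instead of merging them.
# `batch_ids` and `get_batch_result` are illustrative names, not the gem's
# actual internals.
class SalesforceBulkApi::Job
  def each_batch_result
    return enum_for(:each_batch_result) unless block_given?

    batch_ids.each do |batch_id|
      records = get_batch_result(batch_id) # one batch, parsed into hashes
      yield records                        # caller decides what to do with it
    end
  end
end

# Usage: write each batch to disk as it arrives, so the full result set is
# never held in memory at once.
# job.each_batch_result { |records| records.each { |r| file.puts r.to_json } }
```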

Apparently Salesforce doesn't expose batches for bulk queries...
"Bulk batch sizes are not used for bulk queries." (taken from https://www.salesforce.com/us/developer/docs/api_asynch/)

According to the documentation, Salesforce will return up to 15 files of up to 1 GiB apiece, so 1 GiB is effectively our batch size. What I mentioned earlier still applies, but yielding per batch is far less useful at 1 GiB a batch, which would still exhaust memory on just about any machine in use today.

I think streaming from HTTP into an XML parser should still be possible, but it's more tedious, since the batching is now left up to the library to implement.

Here is some documentation on how to stream HTTP responses in Ruby:
http://ruby-doc.org/stdlib-2.2.0/libdoc/net/http/rdoc/Net/HTTP.html#class-Net::HTTP-label-Streaming+Response+Bodies
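For example, a minimal sketch of streaming a bulk result response straight to disk (the URL, API version, and session id are placeholder assumptions, not values the gem provides by these names):

```ruby
require 'net/http'

# Placeholder values; a real client would take these from the job/batch info.
session_id = ENV['SFDC_SESSION_ID']
uri = URI('https://instance.salesforce.com/services/async/29.0/job/JOB_ID/batch/BATCH_ID/result/RESULT_ID')

Net::HTTP.start(uri.host, uri.port, use_ssl: true) do |http|
  request = Net::HTTP::Get.new(uri)
  request['X-SFDC-Session'] = session_id

  http.request(request) do |response|
    File.open('batch_result.xml', 'wb') do |file|
      # read_body with a block streams the body in chunks rather than
      # buffering the whole (potentially 1 GiB) response in memory.
      response.read_body { |chunk| file.write(chunk) }
    end
  end
end
```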

Another option for HTTP streaming is downloading individual byte ranges using the HTTP Range header and yielding each chunk until the end.
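A sketch of that Range-based variant, assuming the server honors Range requests (the helper is hypothetical, not part of the gem):

```ruby
require 'net/http'

# Download a resource in fixed-size byte ranges, yielding each chunk.
# Assumes the server supports Range requests (returns 206 Partial Content).
def each_range_chunk(uri, chunk_size: 1024 * 1024)
  offset = 0
  Net::HTTP.start(uri.host, uri.port, use_ssl: uri.scheme == 'https') do |http|
    loop do
      request = Net::HTTP::Get.new(uri)
      request['Range'] = "bytes=#{offset}-#{offset + chunk_size - 1}"

      response = http.request(request)
      break unless response.is_a?(Net::HTTPPartialContent)

      yield response.body
      offset += response.body.bytesize
      break if response.body.bytesize < chunk_size # short read => end of file
    end
  end
end

# each_range_chunk(URI('https://example.com/result')) { |chunk| file.write(chunk) }
```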

Here is an example of a gem that does XML stream parsing:
https://github.com/craigambrose/sax_stream
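I don't have a sax_stream example handy, so here is roughly the same idea using Nokogiri's SAX interface as an illustration (sax_stream's API would differ); the `records` element name is my guess at the bulk query result XML, so the exact structure should be double-checked:

```ruby
require 'nokogiri'

# SAX handler that emits one hash per <records> element instead of building
# a DOM for the whole (potentially 1 GiB) document.
class RecordHandler < Nokogiri::XML::SAX::Document
  def initialize(&block)
    @block = block
  end

  def start_element(name, attrs = [])
    if name == 'records'
      @record = {}
    elsif @record
      @field = name
      @value = +''
    end
  end

  def characters(string)
    @value << string if @value
  end

  def end_element(name)
    if name == 'records'
      @block.call(@record)
      @record = nil
    elsif @record && name == @field
      @record[@field] = @value
      @field = @value = nil
    end
  end
end

parser = Nokogiri::XML::SAX::Parser.new(RecordHandler.new { |record| puts record.inspect })
parser.parse(File.open('batch_result.xml'))
```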

I believe that combining HTTP streaming with XML stream parsing will accomplish what this feature request is asking for.
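Putting the two together could look something like this, feeding each HTTP chunk into a push-style SAX parser (Nokogiri's PushParser here, reusing the record-emitting handler sketched above; `uri` and the `handle` callback are placeholders):

```ruby
require 'net/http'
require 'nokogiri'

# Feed each HTTP chunk straight into a push parser, so neither the raw XML
# nor the full record set ever sits in memory at once.
handler = RecordHandler.new { |record| handle(record) } # handler class from the sketch above
push_parser = Nokogiri::XML::SAX::PushParser.new(handler)

Net::HTTP.start(uri.host, uri.port, use_ssl: true) do |http|
  http.request(Net::HTTP::Get.new(uri)) do |response|
    response.read_body { |chunk| push_parser << chunk }
  end
end

push_parser.finish # flush any buffered XML at the end of the stream
```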

@thomasdziedzic
Contributor

Just to add to my previous comment: you could also download the entire batch file and then stream the file into an XML stream parser.
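With a SAX-style parser that is just a matter of handing it the file, e.g. (reusing the handler from the earlier sketch; `output` is a placeholder IO):

```ruby
# Stream an already-downloaded batch file through the SAX handler; only a
# single record is held in memory at any point.
handler = RecordHandler.new { |record| output.puts record.to_json }
Nokogiri::XML::SAX::Parser.new(handler).parse_file('batch_result.xml')
```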

@yatish27
Owner

@gostrc Can you send a branch with proposed changes?
