Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add Platform test to show Hadoop and Tez partitioning difference #59

Open
wants to merge 1 commit into
base: wip-3.2
Choose a base branch
from

Conversation

piyushnarang
Copy link

Noticed this on one of our test jobs that we were using to compare the performance of MR and Tez.
I've built a unit test to show a subset of the graph where Cascading on Hadoop is combining more nodes and thus lowering the quantity of data streamed between nodes / steps.

The job starts off with two vertices V0, V1 reading around 3,025,369,753 tuples (10 odd TB). They're then merged + grouped in vertex V2. This is then passed on to Vertex V3 which performs some aggregations (everys) and reduces the data to around 1 TB.

In case of Hadoop, V0, V1 are done on the job's mappers. V2 + V3 are combined and done on the reducers. We then end up writing out this 1TB or so of data and that's picked up by the downstream steps.

Wondering if we should have a rule to collapse these aggregations into the step doing the groupBy?

@cwensel
Copy link
Owner

cwensel commented Jun 14, 2023

Leaving this open in the hope I have time to look into it, even though it's likely no longer a concern.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants