-
Notifications
You must be signed in to change notification settings - Fork 0
/
presentation.txt
159 lines (96 loc) · 3.34 KB
/
presentation.txt
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
------------------------------------------------
MapReduce System (MRS)
Lars Johnson + Tej Kanwar
[Diagram of Network]
------------------------------------------------
Project Objectives
* Explore applying concepts from 6.945 to systems programming
* Emphasize flexibility by creating a clean abstraction
* Enable operations on data sets in a simple and fundamental way
------------------------------------------------
What is MapReduce?
Google -- Jeffrey Dean and Sanjay Ghemawat
------------------------------------------------
What is MapReduce?
Map: (k, v) => (k', v')+
(doc-id, doc-text) => (word, 1)...
Reduce: (k', [list of v']) -> (k'', v'')
(word, [1, 1, 1 …]) => (word, total)
------------------------------------------------
Basic example using (mrs:run-computation)
[word count scheme code example]
[+ simple network]
------------------------------------------------
Extending MapReduce:
Map: (k,v) --> (k', v')
Filter: (k,v) --> #t/#f
Reduce: (k,[v]) --> (k', v')
Aggregate: [(k1,v1), (k2, v2), (k1, v3)] -->
[(k1, [v1,v3]), (k2, [v2])]
Common theme: Operate on data sets
[(k1, v1) (k2, v2) (k3, v3) ... ]
------------------------------------------------
Key Ideas:
1. Build a graph of data sets connected by operations
2. Feed data into data sets and it will be processed in a distributed
manner across a worker pool
3. Abstraction system to allow for streaming implementations
4. Provide programmers with a combinator-like family of reusable parts
------------------------------------------------
Abstract Design
* Data set: (list (k1 . v1) (k2 . v2) (k3 . v3) ...)
* Operations: Full multi-map function (DS in --> DS out)
- Combinator-like design, for higher-level operations
- Special case: Aggregate
* Workers: Single instance of multi-map function
- (k,v) in --> zero or more (k’, v’) out
------------------------------------------------
Combinators!
Reduce => [Aggregate + aggregated + Map]
------------------------------------------------
Design Implementation
User Operations:
(mrs:map...)
(mrs:reduce...)
(mrs:create-data-set)
(mrs:run-computation)
Master:
create-distributor
create-data-sets
manage data sets...
Distributor:
create-workers
distribute tasks
poll data-sets and workers
Worker:
multi-map function
------------------------------------------------
Design Implementation:
* Data sets: multi-reader, multi-writer queues of elements
* Operations: Distributor system with multi-map workers
- Generic operators for flexibility
* User control:
- Build data sets and operation bindings
- Feed inputs to data sets
- (run-computation thunk)
------------------------------------------------
Demonstration
[Scheme + 2d/3d vector demo]
[Map/Reduce Network]
------------------------------------------------
"Done" Propagation
Problem:
* Aggregator different from other operations
* Branched case, looped case
Current solution:
* Restrict to DAGs
* Top-down propagation
------------------------------------------------
Future Work:
1) Correctly handle operation loops
* Currently non-aggregator loops work
2) Proper REPL environment for interactive data set operations
3) Additional implementation of generic workers, for multi-computer
processing
4) Failure handling
------------------------------------------------