-
Notifications
You must be signed in to change notification settings - Fork 70
/
Copy pathassignment4-451.html
512 lines (393 loc) · 18.6 KB
/
assignment4-451.html
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
364
365
366
367
368
369
370
371
372
373
374
375
376
377
378
379
380
381
382
383
384
385
386
387
388
389
390
391
392
393
394
395
396
397
398
399
400
401
402
403
404
405
406
407
408
409
410
411
412
413
414
415
416
417
418
419
420
421
422
423
424
425
426
427
428
429
430
431
432
433
434
435
436
437
438
439
440
441
442
443
444
445
446
447
448
449
450
451
452
453
454
455
456
457
458
459
460
461
462
463
464
465
466
467
468
469
470
471
472
473
474
475
476
477
478
479
480
481
482
483
484
485
486
487
488
489
490
491
492
493
494
495
496
497
498
499
500
501
502
503
504
505
506
507
508
509
510
511
512
<!DOCTYPE html>
<html lang="en">
<head>
<meta charset="utf-8">
<meta http-equiv="X-UA-Compatible" content="IE=edge">
<meta name="viewport" content="width=device-width, initial-scale=1">
<!-- The above 3 meta tags *must* come first in the head; any other head content must come *after* these tags -->
<meta name="description" content="Course homepage for CS 451/651 431/631 Data-Intensive Distributed Computing (Winter 2018) at the University of Waterloo">
<meta name="author" content="Jimmy Lin">
<title>Data-Intensive Distributed Computing</title>
<!-- Bootstrap core CSS -->
<link href="css/bootstrap.min.css" rel="stylesheet">
<!-- IE10 viewport hack for Surface/desktop Windows 8 bug -->
<link href="css/ie10-viewport-bug-workaround.css" rel="stylesheet">
<!-- Just for debugging purposes. Don't actually copy these 2 lines! -->
<!--[if lt IE 9]><script src="../../assets/js/ie8-responsive-file-warning.js"></script><![endif]-->
<script src="js/ie-emulation-modes-warning.js"></script>
<!-- HTML5 shim and Respond.js for IE8 support of HTML5 elements and media queries -->
<!--[if lt IE 9]>
<script src="https://oss.maxcdn.com/html5shiv/3.7.3/html5shiv.min.js"></script>
<script src="https://oss.maxcdn.com/respond/1.4.2/respond.min.js"></script>
<![endif]-->
<style>
body {
padding-top: 60px; /* 60px to make the container go all the way to the bottom of the topbar */
}
</style>
</head>
<body>
<nav class="navbar navbar-inverse navbar-fixed-top">
<div class="container">
<div class="navbar-header">
<button type="button" class="navbar-toggle collapsed" data-toggle="collapse" data-target="#navbar" aria-expanded="false" aria-controls="navbar">
<span class="sr-only">Toggle navigation</span>
<span class="icon-bar"></span>
<span class="icon-bar"></span>
<span class="icon-bar"></span>
</button>
</div>
<div id="navbar" class="collapse navbar-collapse">
<ul class="nav navbar-nav">
<li><a href="index.html">Overview</a></li>
<li><a href="organization.html">Organization</a></li>
<li><a href="syllabus.html">Syllabus</a></li>
<li class="active"><a href="assignments.html">Assignments</a></li>
<li><a href="software.html">Software</a></li>
</ul>
</div><!--/.nav-collapse -->
</div>
</nav>
<div class="container">
<div class="page-header">
<div style="float: right"><img width="250" src="images/waterloo_logo.png" alt="University of Waterloo logo"/></div>
<h1>Assignments <br/><small>Data-Intensive Distributed Computing (Winter 2018)</small></h1>
</div>
<p>Note that there separate sets of assignments for CS 451/651 and CS
431/631. Make sure you work on the correct asssignments!</p>
<p><a href="assignments-451.html" class="btn btn-success btn-large">CS 451/651 Assignments</a></p>
<div class="subnav">
<ul class="nav nav-pills">
<li><a href="assignment0-451.html">0</a></li>
<li><a href="assignment1-451.html">1</a></li>
<li><a href="assignment2-451.html">2</a></li>
<li><a href="assignment3-451.html">3</a></li>
<li><a href="assignment4-451.html">4</a></li>
<li><a href="assignment5-451.html">5</a></li>
<li><a href="assignment6-451.html">6</a></li>
<li><a href="assignment7-451.html">7</a></li>
<li><a href="project-451.html">Final Project</a></li>
</ul>
</div>
<h3>Assignment 4 <small>due 2:30pm February 27</small></h3>
<p>For this assignment, you will be working in the same repo as
before, except that everything should go into the package namespace
<code>ca.uwaterloo.cs451.a4</code>.</p>
<p>Begin by taking the time to understand
the <a href="https://github.com/lintool/bespin/tree/master/src/main/java/io/bespin/java/mapreduce/pagerank">PageRank
reference implementation</a> in <a href="http://bespin.io">Bespin</a>
(particularly <code>RunPageRankBasic</code>). For this assignment,
you are going to implement multiple-source personalized PageRank. As
we discussed in class, personalized PageRank is different from
ordinary PageRank in a few respects:</p>
<ul>
<li>There is the notion of a <i>source</i> node, which is what we're
computing the personalization with respect to.</li>
<li>When initializing PageRank, instead of a uniform distribution
across all nodes, the source node gets a mass of one and every other
node gets a mass of zero.</li>
<li>Whenever the model makes a random jump, the random jump is
always back to the source node; this is unlike in ordinary PageRank,
where there is an equal probability of jumping to any node.</li>
<li>All mass lost in the dangling nodes are put back into the source
node; this is unlike in ordinary PageRank, where the missing mass is
evenly distributed across all nodes.</li>
</ul>
<p>Here are some publications about personalized PageRank if you're
interested. They're just provided for background; neither is necessary
for completing the assignment.</p>
<ul>
<li>Daniel Fogaras, Balazs Racz, Karoly Csalogany, and Tamas Sarlos. (2005) <a href="assignments/Fogaras_etal_2005.pdf">Towards Scaling Fully Personalized PageRank: Algorithms, Lower Bounds, and Experiments.</a> Internet Mathematics, 2(3):333-358.</li>
<li>Bahman Bahmani, Abdur Chowdhury, and Ashish Goel. (2010) <a href="assignments/Bahmani_etal_VLDB2010.pdf">Fast Incremental and Personalized PageRank.</a> Proceedings of the 36th International Conference on Very Large Data Bases (VLDB 2010).</li>
</ul>
<p>Your implementation is going to run multiple personalized PageRank
computations in parallel, one with respect to each source. The sources
will be specified on the command line. This means that each
PageRank node object (i.e., <code>Writable</code>) is going to contain
an array of PageRank values.</p>
<p>Start by working with the Gnutella graph
(a <a href="http://snap.stanford.edu/data/p2p-Gnutella08.html">snapshot</a>
of the Gnutella peer-to-peer file sharing network from August 2002),
the same as the one used in the Bespin demo:</p>
<pre>
$ mkdir data
$ curl http://lintool.github.io/bespin-data/p2p-Gnutella08-adj.txt > data/p2p-Gnutella08-adj.txt
</pre>
<p>Here's how the implementation is going to work: it largely follows
the reference implementation of <code>RunPageRankBasic</code>. You
must make your implementation work with respect to the command-line
invocations specified below.</p>
<p>First, convert the adjacency list into PageRank node records:</p>
<pre>
$ hadoop jar target/assignments-1.0.jar \
ca.uwaterloo.cs451.a4.BuildPersonalizedPageRankRecords \
-input data/p2p-Gnutella08-adj.txt -output cs451-lintool-a4-Gnutella-PageRankRecords \
-numNodes 6301 -sources 367,249,145
</pre>
<p>The <code>-sources</code> option specifies the source nodes for the
personalized PageRank computations. In this case, we're running three
computations in parallel, with respect to node ids 367, 249, and
145. You can expect the option value to be in the form of a
comma-separated list, and that all node ids actually exist in the
graph. The list of source nodes may be arbitrarily long, but for
practical purposes we won't test your code with more than a few.</p>
<p>Since we're running three personalized PageRank computations in
parallel, each PageRank node is going to hold an array of three
values, the personalized PageRank values with respect to the first
source, second source, and third source. You can expect the array
positions to correspond exactly to the position of the node id in the
source string.</p>
<p>Next, partition the graph (hash partitioning) and get ready to
iterate:</p>
<pre>
$ hadoop fs -mkdir cs451-lintool-a4-Gnutella-PageRank
$ hadoop jar target/assignments-1.0.jar \
ca.uwaterloo.cs451.a4.PartitionGraph \
-input cs451-lintool-a4-Gnutella-PageRankRecords \
-output cs451-lintool-a4-Gnutella-PageRank/iter0000 -numPartitions 5 -numNodes 6301
</pre>
<p>After setting everything up, iterate multi-source personalized
PageRank:</p>
<pre>
$ hadoop jar target/assignments-1.0.jar \
ca.uwaterloo.cs451.a4.RunPersonalizedPageRankBasic \
-base cs451-lintool-a4-Gnutella-PageRank -numNodes 6301 -start 0 -end 20 -sources 367,249,145
</pre>
<p>Note that the sources are passed in from the command-line
again. Here, we're running twenty iterations.</p>
<p>Finally, run a program to extract the top ten personalized PageRank
values, with respect to each source.</p>
<pre>
$ hadoop jar target/assignments-1.0.jar \
ca.uwaterloo.cs451.a4.ExtractTopPersonalizedPageRankNodes \
-input cs451-lintool-a4-Gnutella-PageRank/iter0020 -output cs451-lintool-a4-Gnutella-PageRank-top10 \
-top 10 -sources 367,249,145
</pre>
<p>The above program should print the following answer to stdout:</p>
<pre>
Source: 367
0.35257 367
0.04181 264
0.03889 266
0.03888 559
0.03883 5
0.03860 1317
0.03824 666
0.03817 7
0.03799 4
0.00850 249
Source: 249
0.34089 249
0.04034 251
0.03721 762
0.03688 123
0.03686 250
0.03668 753
0.03627 755
0.03623 351
0.03622 983
0.00949 367
Source: 145
0.36937 145
0.04195 1317
0.04120 265
0.03847 390
0.03606 367
0.03566 246
0.03525 667
0.03519 717
0.03513 149
0.03496 2098
</pre>
<h4 style="padding-top: 10px">Additional Specifications</h4>
<p>To make the final output easier to read, in the
class <code>ExtractTopPersonalizedPageRankNodes</code>, use the
following format to print each (personalized PageRank value, node id)
pair:</p>
<pre>
String.format("%.5f %d", pagerank, nodeid)
</pre>
<p>This will generate the final results in the same format as
above. Also note: print actual probabilities, not log
probabilities—although during the actual PageRank computation
keep values as log probabilities.</p>
<p>The final class <code>ExtractTopPersonalizedPageRankNodes</code>
does not need to be a MapReduce job (but it does need to read from
HDFS). Obviously, the other classes need to run MapReduce jobs.</p>
<p>The reference implementation of PageRank in Bespin has many
options: you can either use in-mapper combining or ordinary
combiners. In your implementation, use ordinary combiners.
Also, the reference implementation has
an option to either use range partitioning or hash partitioning: you
only need to implement hash partitioning. You can start with the
reference implementation and remove code that you don't need (see #2
below).</p>
<h4 style="padding-top: 10px">Hints and Suggestion</h4>
<p>To help you out, there's a small helper program in Bespin that
computes personalized PageRank using a sequential algorithm. Use it to
check your answers:</p>
<pre>
$ hadoop jar target/bespin-1.0.2-SNAPSHOT-fatjar.jar io.bespin.java.mapreduce.pagerank.SequentialPersonalizedPageRank \
-input data/p2p-Gnutella08-adj.txt -source 367
</pre>
<p>Note that this isn't actually a MapReduce job; we're simply using
Hadoop to run the <code>main</code> for convenience. The values from
your implementation should be pretty close to the output of the above
program, but might differ a bit due to convergence issues. After 20
iterations, the output of the MapReduce implementation should match to
at least the fourth decimal place.</p>
<p>This is a complex assignment. We would suggest breaking the
implementation into the following steps:</p>
<ol>
<li>First, copy the reference PageRank implementation into your own
assignments repo (renaming the classes appropriately). Make sure you
can get it to run and output the correct results with ordinary
PageRank.</li>
<li>Simplify the code; i.e., if you decide to use the in-mapper
combiner, remove the code that works with ordinary combiners.</li>
<li>Implement personalized PageRank from a single source; that is, if
the user sets option <code>-sources w,x,y,z</code>, simply
ignore <code>x,y,z</code> and run personalized PageRank with respect
to <code>w</code>. This can be accomplished with the
existing <code>PageRankNode</code>, which holds a single floating
point value.</li>
<li>Extend the <code>PageRankNode</code> class to store an array of
floats (length of array is the number of sources) instead of a single
float. Make sure single-source personalized PageRank still runs.</li>
<li>Implement multi-source personalized PageRank.</li>
</ol>
<p>In particular, #3 is a nice checkpoint. If you're not able to get
the multiple-source personalized PageRank to work, at least completing
the single-source implementation will allow us to give you partial
credit.</p>
<h4 style="padding-top: 10px">Running on the Altiscale cluster</h4>
<p>Once you get your implementation working and debugged in the Linux
environment, you're going to run your code on a non-trivial graph: the
link structure of (all of) Wikipedia. The adjacency lists are stored
in <code>/shared/uwaterloo/cs451/data/wiki-adj</code> on HDFS. The graph has
16,117,779 vertices and 155,472,640 edges.</p>
<p>First, convert the adjacency list into PageRank node records:</p>
<pre>
$ hadoop jar target/assignments-1.0.jar \
ca.uwaterloo.cs451.a4.BuildPersonalizedPageRankRecords \
-input /shared/uwaterloo/cs451/data/wiki-adj -output cs451-lintool-a4-wiki-PageRankRecords \
-numNodes 16117779 -sources 73273,73276
</pre>
<p>Next, partition the graph (hash partitioning) and get ready to
iterate:</p>
<pre>
$ hadoop fs -mkdir cs451-lintool-a4-wiki-PageRank
$ hadoop jar target/assignments-1.0.jar \
ca.uwaterloo.cs451.a4.PartitionGraph \
-input cs451-lintool-a4-wiki-PageRankRecords \
-output cs451-lintool-a4-wiki-PageRank/iter0000 -numPartitions 10 -numNodes 16117779
</pre>
<p>After setting everything up, iterate multi-source
personalized PageRank:</p>
<pre>
$ hadoop jar target/assignments-1.0.jar \
ca.uwaterloo.cs451.a4.RunPersonalizedPageRankBasic \
-base cs451-lintool-a4-wiki-PageRank -numNodes 16117779 -start 0 -end 20 -sources 73273,73276
</pre>
<p>On the Altiscale cluster, each iteration shouldn't take more than a
couple of minutes to complete. If it's taking more than five minutes
per iteration, you're doing something wrong.</p>
<p>Finally, run a program to extract the top ten personalized PageRank
values, with respect to each source.</p>
<pre>
$ hadoop jar target/assignments-1.0.jar \
ca.uwaterloo.cs451.a4.ExtractTopPersonalizedPageRankNodes \
-input cs451-lintool-a4-wiki-PageRank/iter0020 -output cs451-lintool-a4-wiki-PageRank-top10 \
-top 10 -sources 73273,73276
</pre>
<p>In the example above, we're running personalized PageRank with
respect to two sources: 73273 and 73276. What articles do these ids
correspond to? There's a file on HDFS
at <code>/shared/uwaterloo/cs451/data/wiki-titles.txt</code> that provides the
answer. How do you know if your implementation is correct? You can
sanity check your results by taking the ids and looking up the
articles that they correspond to. Do the results make sense?</p>
<h4 style="padding-top: 10px">Turning in the Assignment</h4>
<p>Please follow these instructions carefully!</p>
<p>Make sure your repo has the following items:</p>
<ul>
<li>Similar to the previous assignments, you'll create a file
called <code>bigdata2018w/assignment4.md</code> (more below).</li>
<li>All the implementations described above should be in
package <code>ca.uwaterloo.cs451.a4</code>.</li>
</ul>
<p>Make sure your implementation runs in the Linux student CS
environment on the Gnutella graph and on the Wikipedia graph on the
Altiscale cluster.</p>
<p>In <code>bigdata2018w/assignment4.md</code>, tell us if you were
able to successfully complete the assignment. This is in case we can't
get your code to run, we need to know if it is because you weren't able
to complete the assignment successfully, or if it is due to some other
issue. If you were not able to implement everything, describe how far
you've gotten. Feel free to use this space to tell us additional
things we should look for in your implementation.</p>
<p>Also, in this file, copy the output of your implementation on the
Altiscale cluster, i.e., personalized PageRank with respect to
vertices 73273 and 73276. This will give us something to look at and
check if we can't get your code to run successfully. Something that
looks like this:</p>
<pre>
Source: 73273
0.XXXXX XXX
...
Source: 73276
0.XXXXX XXX
...
</pre>
<p>When grading, we will clone your repo and use the below check
scripts:</p>
<ul>
<li><a href="assignments/check_assignment4_public_linux.py"><code>check_assignment4_public_linux.py</code></a>
in the Linux Student CS environment.</li>
<li><a href="assignments/check_assignment4_public_altiscale.py"><code>check_assignment4_public_altiscale.py</code></a>
on the Altiscale cluster.</li>
</ul>
<p>When you've done everything, commit to your repo and remember to
push back to origin. You should be able to see your edits in the web
interface. Before you consider the assignment "complete", we would
recommend that you verify everything above works by performing a clean
clone of your repo and run the public check scripts.</p>
<p>That's it!</p>
<h4 style="padding-top: 10px">Grading</h4>
<p>The entire assignment is worth 55 points:
<ul>
<li>A correct implementation of single-source personalized PageRank is
worth 10 points.</li>
<li>That we are able to run the single-source personalized PageRank
implementation in the Linux Student CS environment is worth 5
points.</li>
<li>A correct implementation of multiple-source personalized PageRank
is worth 15 points.</li>
<li>That we are able to run the multiple-source personalized PageRank
implementation in the Linux Student CS environment is worth 5
points.</li>
<li>Scaling the single-source personalized PageRank implementation on
Altiscale is worth 10 points.</li>
<li>Scaling the multiple-source personalized PageRank implementation
on Altiscale is worth 10 points.</li>
</ul>
<p>In our private check scripts, we will specify arbitrary source
nodes to verify the correctness of your implementation.</p>
<p>Note that this grading scheme discretizes each component of the
assignment. For example, if you implement everything and it works
correctly on the Linux Student CS environment, but can't get it to
scale on the Altiscale cluster to the larger graph, you'll receive 35
out of 55 points. On the other hand, if you implement single-source
personalized PageRank correctly and it runs in both the Linux Student
CS environment and Altiscale, you will receive 25 out of 55
points. And combinations thereof.</p>
<p style="padding-top: 20px"><a href="#">Back to top</a></p>
<div style="padding-bottom: 100px"></div>
</div><!-- /.container -->
<!-- Placed at the end of the document so the pages load faster -->
<script src="https://ajax.googleapis.com/ajax/libs/jquery/1.12.4/jquery.min.js"></script>
<script src="js/bootstrap.min.js"></script>
<!-- IE10 viewport hack for Surface/desktop Windows 8 bug -->
<script src="js/ie10-viewport-bug-workaround.js"></script>
</body>
</html>