add JobLogger to Spark #573

Closed
wants to merge 15 commits into from

Conversation

shimingfei
Contributor

Add a new class named JobLogger.

1. Each SparkContext has one JobLogger, and one folder is created for every JobLogger.
2. The JobLogger manages the history files of all active jobs running in that SparkContext; each file is named by its job ID.

Job history includes:

1. Additional information from outside, for example the query plan from Shark.
2. The RDD graph for each job, printed using a top-down approach.
3. Stage information and tasks' start/stop and shuffle information, obtained from TaskMetrics and the DAGScheduler.
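A minimal sketch of that layout (the directory naming, method names, and timestamp format below are illustrative assumptions, not necessarily this PR's exact code): one folder per JobLogger, one history file per job ID, with an optional timestamp on each record.

import java.io.{File, FileWriter, PrintWriter}
import java.text.SimpleDateFormat
import java.util.Date

import scala.collection.mutable.HashMap

class JobLogger(logDirRoot: String) {
  // One folder is created for every JobLogger instance.
  private val logDir = new File(logDirRoot, "spark-joblogger-" + System.currentTimeMillis)
  logDir.mkdirs()

  // One history file per active job, keyed and named by job ID.
  private val jobIDToWriter = new HashMap[Int, PrintWriter]
  private val dateFormat = new SimpleDateFormat("yy/MM/dd HH:mm:ss")

  def createLogWriter(jobID: Int) {
    jobIDToWriter(jobID) = new PrintWriter(new FileWriter(new File(logDir, jobID.toString)))
  }

  // withTime controls whether a timestamp is prepended to the record.
  def jobLogInfo(jobID: Int, info: String, withTime: Boolean = true) {
    val line = if (withTime) dateFormat.format(new Date()) + ": " + info else info
    jobIDToWriter.get(jobID).foreach(_.println(line))
  }

  def closeLogWriter(jobID: Int) {
    jobIDToWriter.remove(jobID).foreach(_.close())
  }
}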

@AmplabJenkins

I'm the Jenkins test bot for the UC Berkeley AMPLab. I've noticed your pull request and will test it once an admin authorizes me to. Thanks for your submission!

@rxin
Member

rxin commented Apr 18, 2013

I will take a look at this today.

Jenkins: this is ok to test

@rxin
Member

rxin commented Apr 18, 2013

Thanks, Mingfei.

Below are some high-level comments:

  1. Instead of adding JobLogger-specific calls to DAGScheduler, why don't you extend SparkListener to add more callback APIs? (For every recordTaskMetrics call in DAGScheduler, create a callback API in SparkListener.)
  2. With 1, JobLogger becomes an implementation of the SparkListener (a sketch follows this list).
  3. JobLogger itself should also expose the information it collects so we can build a web UI for the JobLogger.
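A hedged sketch of points 1 and 2 (the trait and callback names are assumptions for illustration, not the exact API this patch would add):

trait JobEventListener {
  def onJobStart(jobID: Int) { }
  def onTaskEnd(stageID: Int, metrics: String) { }
  def onJobEnd(jobID: Int, succeeded: Boolean) { }
}

// JobLogger is then just one implementation of the listener; the DAGScheduler only
// invokes the generic callbacks and knows nothing about logging.
class JobLogger extends JobEventListener {
  override def onJobStart(jobID: Int) {
    println("JOB_START " + jobID)
  }
  override def onTaskEnd(stageID: Int, metrics: String) {
    println("TASK_END stage=" + stageID + " " + metrics)
  }
  override def onJobEnd(jobID: Int, succeeded: Boolean) {
    println("JOB_END " + jobID + " succeeded=" + succeeded)
  }
}

The real implementation would write to the per-job history files rather than printing.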

I have some detailed comments on the code that I will also post inline.

import scala.collection.mutable.HashMap
import scala.collection.mutable.ListBuffer
import spark.scheduler.Stage
import scala.io.Source
Member

For imports, sort them in the following order:

  1. java packages
  2. scala packages
  3. everything else in alphabetical order.

Add a blank line between imports from different domains.

Do this for other files - Spark code didn't strictly follow this but we are enforcing that now ...
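Applied to the hunk above, that ordering would be:

import scala.collection.mutable.HashMap
import scala.collection.mutable.ListBuffer
import scala.io.Source

import spark.scheduler.Stage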

@rxin
Member

rxin commented Apr 18, 2013

I am done with my detailed comments.

Overall this looks like a great first step - we really appreciate you doing this!

There is one more comment about the SparkListener. Right now, DAGScheduler accepts a list of SparkListeners and uses a for loop to invoke the listeners. I think that can become expensive in large clusters. It would be great to have the DAGScheduler accept only one SparkListener, and if multiple SparkListeners are needed, we can create a composed SparkListener implementation that simply delegates to multiple SparkListeners.
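Reusing the listener trait sketched earlier, a composed listener could simply delegate each event (an illustration, not the real SparkListener interface):

class CompositeListener(listeners: Seq[JobEventListener]) extends JobEventListener {
  override def onJobStart(jobID: Int) {
    listeners.foreach(_.onJobStart(jobID))
  }
  override def onTaskEnd(stageID: Int, metrics: String) {
    listeners.foreach(_.onTaskEnd(stageID, metrics))
  }
  override def onJobEnd(jobID: Int, succeeded: Boolean) {
    listeners.foreach(_.onJobEnd(jobID, succeeded))
  }
}

The DAGScheduler would then hold a single listener, whether one consumer or many are registered.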

Again, thanks a lot for doing this.

@AmplabJenkins

I'm the Jenkins test bot for the UC Berkeley AMPLab. I've noticed your pull request and will test it once an admin authorizes me to. Thanks for your submission!

case None =>
}
}

Contributor

Avoid pattern matching on Option -- especially when you are doing nothing with the None case:

jobIDToStageIDs.get(jobID).map(_.foreach(stageid => stageIDToJobID -= stageid))

@markhamstra
Contributor

@rxin 3. JobLogger itself should also expose the information it collects so we can build a web UI for the JobLogger.

That's a bit trickier. Not only do you need to construct one or more data structures to which you must add the appropriate information when tasks/stages/jobs start/complete/fail/resubmit, but you also need to figure out when that information is no longer needed so that you can remove it from the appropriate data structure, avoiding that structure's unbounded growth.

I'm working on a substantial portion of that right now.
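A hedged sketch of that bookkeeping (the map names follow the ones discussed above; the cleanup hook itself is an assumption): entries are added when a job starts and removed when it finishes, so neither structure grows without bound.

import scala.collection.mutable.HashMap

class JobStateTracker {
  private val jobIDToStageIDs = new HashMap[Int, Seq[Int]]
  private val stageIDToJobID = new HashMap[Int, Int]

  def jobStarted(jobID: Int, stageIDs: Seq[Int]) {
    jobIDToStageIDs(jobID) = stageIDs
    stageIDs.foreach(stageID => stageIDToJobID(stageID) = jobID)
  }

  // Called when the job completes or fails, so the maps stay bounded.
  def jobFinished(jobID: Int) {
    jobIDToStageIDs.remove(jobID).foreach(_.foreach(stageIDToJobID -= _))
  }
}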

@shimingfei
Contributor Author

Reynold and markhamstra,
Thanks for your comments. I will modify the code accordingly.

}

//write log information to log file by stageID, withTime parameter controls whether to recored time stamp for the information
private def stageLogInfo(stageID: Int, info: String, withTime: Boolean) = stageIDToJobID.get(stageID).foreach(jobID => jobLogInfo(jobID, info, withTime))
Contributor

This line is well over 100 characters
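Wrapped under 100 characters, the same definition could read:

private def stageLogInfo(stageID: Int, info: String, withTime: Boolean) =
  stageIDToJobID.get(stageID).foreach(jobID => jobLogInfo(jobID, info, withTime))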

@pwendell
Contributor

Hi Mingfei,

Thanks for updating this - the new approach is exactly what Reynold and I were suggesting.

I've made several comments about style for your latest patch. We adhere to the Scala style guide with a few changes. Lines must be at most 100 characters. If you could review this patch (not only where I pointed out nits, but the entire thing) for style, that would be great. Here is our style guide:
http://spark-project.org/docs/latest/contributing-to-spark.html

From a code perspective, there are a few remaining things:

  • It would be great if you could add some tests for this to verify the logic inside of JobLogger.scala
  • I'm still a bit confused why you can't use getCallSite instead of the generator class. Could you fill us in on the thinking there?

Thanks again for taking time to update this.

  • Patrick

@shimingfei
Contributor Author

Hi Patrick
Sometimes it is impossible to get the RDD's generator from the callsite information, because several classes may call into the same code where the RDD is generated (for example, in Shark many operators call executeProcessPartition in the Operator class to generate a mapPartitionsWithIndexRDD), so it is better to let user code set who generated the RDD. I have also merged the getRddGenerator function into getSparkCallSite.
I added it only for analysis; if it is not suitable to add it to the RDD class, I will remove it from the code and use the callsite information instead.
Besides I am adding test case for JobLogger now.

Thanks
Mingfei

@shimingfei
Contributor Author

Hi Patrick

I have modified the code according to your comments and added a test case for JobLogger.

Thanks
Mingfei

logInfo("Starting job: " + callSite)
val start = System.nanoTime
val result = dagScheduler.runJob(rdd, func, partitions, callSite, allowLocal, resultHandler)
val result = dagScheduler.runJob(rdd, func, partitions, callSite + "|" + addInfo.value.toString, allowLocal, resultHandler)
Contributor

This line is over 100 characters

@pwendell
Contributor

Hi Mingfei,

I added a handful of comments on this latest submission.

  1. For the callsite stuff, I proposed a slightly different refactoring than the one you give here. In general, though, I support refactoring this to use the same code path.
  2. I would prefer not to have the JobLogger enabled via a Spark configuration option just yet. I'd rather have it be something that people need to explicitly add. Also, you changed the code to add a StatsListener by default, which wasn't the case before. My proposal was just to remove the entire StatsListener object, which would fix both of these issues.
  3. There were some more style issues which I commented on.
  4. For the unit test - ideally you want to write a test in a way that actually verifies some of the logic in the JobLogger class. For instance, you could try to verify that the hashmaps you maintain are being set up correctly. There are a few ways to do this. One is to mock out calls to the various handler functions and then look directly at the state of the JobLogger. Another is to add a constructor option that avoids cleaning the state of the maps inside of JobLogger (default false), set this option to true in the tests, and then just run a job and verify the contents of the hashmaps (a sketch follows this list).
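A hedged sketch of that second approach, assuming ScalaTest's FunSuite and a simplified stand-in for JobLogger (the keepState flag and map names are illustrative, not the patch's actual API):

import scala.collection.mutable.HashMap

import org.scalatest.FunSuite

// Stand-in for JobLogger: keepState = true skips cleanup so a test can inspect the maps.
class JobLoggerStub(keepState: Boolean = false) {
  val jobIDToStageIDs = new HashMap[Int, Seq[Int]]

  def onJobStart(jobID: Int, stageIDs: Seq[Int]) {
    jobIDToStageIDs(jobID) = stageIDs
  }

  def onJobEnd(jobID: Int) {
    if (!keepState) jobIDToStageIDs -= jobID
  }
}

class JobLoggerSuite extends FunSuite {
  test("job-to-stage map is populated and retained when keepState is set") {
    val logger = new JobLoggerStub(keepState = true)
    logger.onJobStart(0, Seq(0, 1))
    logger.onJobEnd(0)
    assert(logger.jobIDToStageIDs(0) === Seq(0, 1))
  }
}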

Thanks,
Patrick

@shimingfei
Contributor Author

Hi Patrick

I have modified the code and added a unit test according to your suggestions.

Thanks
Mingfei

@@ -65,6 +65,9 @@ class SparkContext(
// Ensure logging is initialized before we spawn any threads
initLogging()

// Allows higher layer frameworks to describe the context of a job
val annotation = new DynamicVariable[String]("")
Contributor

Rather than having this be a possibly blank string, how about:

new DynamicVariableOption[String]

@pwendell
Contributor

pwendell commented Jun 6, 2013

Hi Mingfei,

A few notes about the annotation field - it would be good to not use string parsing as a way to pass additional arguments. Instead, you can use an optional value.

Also, I notice that now you've changed the default format for the callsite info - it would be good to actually leave that unchanged.
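For illustration only (these helper names are not in the patch), the difference between packing the annotation into the callsite string and passing it as an optional value:

// String parsing, as in the diff above: the annotation is packed into the callsite
// string with a separator and has to be split back out downstream.
def describePacked(callSiteWithAnnotation: String): String = {
  val parts = callSiteWithAnnotation.split('|')
  "call site: " + parts.head + parts.tail.map(", annotation: " + _).mkString
}

// Optional value: no parsing, and "no annotation" is explicit rather than an empty string.
def describeOptional(callSite: String, annotation: Option[String]): String =
  "call site: " + callSite + annotation.map(", annotation: " + _).getOrElse("")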

Overall, this is looking in good shape. Once you fix these I'll ask Matei or Reynold to do a pass.

Jenkins, test this please

@shimingfei
Contributor Author

Hi Patrick,
I would like to use the "localProperties" variable in the latest SparkContext to pass the job annotation information, so that the interfaces of "runJob" and "submitJob" will not be changed.
I also intend to use a daemon thread to do the JobLogger's work, since the JobLogger does a lot of disk access.
But the Spark version I forked does not have the "localProperties" variable in SparkContext; do I need to close the current pull request and create a new one?
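A minimal sketch of that daemon-thread idea (names are illustrative): the listener callbacks only enqueue log lines, and a single daemon thread drains the queue and does the disk writes.

import java.util.concurrent.LinkedBlockingQueue

class AsyncLogWriter(write: String => Unit) {
  private val queue = new LinkedBlockingQueue[String]

  private val worker = new Thread("job-logger") {
    setDaemon(true)
    override def run() {
      while (true) {
        write(queue.take())  // blocks until a line is available
      }
    }
  }
  worker.start()

  // Called from the scheduler side; never touches the disk itself.
  def log(line: String) { queue.put(line) }
}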

@pwendell
Contributor

Okay I think we can close this now. I'm going to take a look at the new PR.

@pwendell pwendell closed this Jun 10, 2013
pwendell pushed a commit to andyk/mesos-spark that referenced this pull request May 5, 2014
Small bug fix to make sure the "spark contents" are copied to the
deployment directory correctly.

Author: Rahul Singhal <[email protected]>

Closes mesos#573 from rahulsinghaliitd/SPARK-1651 and squashes the following commits:

402c999 [Rahul Singhal] SPARK-1651: Delete existing deployment directory