parallelCrawl emits visits/stream-results in one chunk #34

Open
visox opened this issue Aug 28, 2017 · 0 comments

visox commented Aug 28, 2017

Code to reproduce:

import com.marekkadek.scraper.Document
import com.marekkadek.scraper.jsoup.JsoupBrowser
import com.marekkadek.scrawler.crawlers.{Visit, YieldData, Yield, Crawler}
import fs2.{Strategy, Stream, Task}
import scala.concurrent.duration._

class BadCrawler extends Crawler[Task, Int](Seq(JsoupBrowser[Task](
  connectionTimeout = 20.seconds
))) {

  var visited = 0

  override protected def onDocument(document: Document): Stream[Task, Yield[Int]] = {
    val visit = (1 to 10).map{_ =>
      Visit("http://example.com/")
    }

    visited = visited + 1

    println(s"visited: $visited")

    Stream.emit(YieldData(visited)) ++ Stream.emits(visit)
  }
}

object BadCrawler extends App {
  implicit val strategy: Strategy = Strategy.fromFixedDaemonPool(100)

  val crawler = new BadCrawler()

  val stream: Stream[Task, Int] = crawler.parallelCrawl("http://example.com/", maxConnections = 10)

  stream
    .map{result =>
      println(s"result: $result")
      result
    }
    .runLog
    .unsafeRun()

}

When run, the output looks like this:

visited: 1
result: 1
visited: 2
visited: 3
visited: 4
visited: 5
visited: 6
visited: 7
visited: 8
visited: 9
visited: 10
visited: 11
result: 2
result: 3
result: 4
result: 5
result: 6
result: 7
result: 8
result: 9
result: 10
result: 11
visited: 12
...
visited: 111
result: 12
...
result: 111
visited: 112
// FOR SOME TIME NOTHING 
...
visited: 1111
result: 112
...
result: 1111
// NOTHING HAPPENS (only after quite some time)

I don't mind that 10 visits need to happen before I get 10 results, but when there are more pages to visit than maxConnections, the output lags: both the visited and the result lines appear all at once after a long evaluation.
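
The chunk boundaries in the log (1, 11, 111, 1111) are what you would expect if each page yields 10 new visits and a whole level of pages is finished before its results come out. That is only my reading of the output, not something confirmed from the scrawler source. A minimal sketch of the arithmetic, where `ChunkSizes` and `cumulativeVisits` are made-up names for illustration:

```scala
object ChunkSizes {
  // Assumption: every page yields `fanOut` new visits (10 in the reproduction)
  // and results are emitted only once a whole level of pages is done.
  def cumulativeVisits(fanOut: Int, levels: Int): List[Int] =
    Iterator
      .iterate(1)(_ * fanOut) // pages per level: 1, 10, 100, ...
      .take(levels)
      .scanLeft(0)(_ + _)     // running total of visited pages
      .drop(1)                // drop the initial 0
      .toList
}
```

`ChunkSizes.cumulativeVisits(fanOut = 10, levels = 4)` gives `List(1, 11, 111, 1111)`, which matches the points in the log where the output stalls.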

The desired behavior is to emit each result as soon as it is available.
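
To make the expected ordering concrete, here is a rough single-threaded sketch in plain Scala. It does not use scrawler or fs2 and ignores parallelism entirely; `IncrementalCrawl`, `crawl`, `Fetch` and `maxPages` are hypothetical names. The point is only that each result reaches the consumer before the next page is fetched:

```scala
import scala.collection.mutable

object IncrementalCrawl {
  // Hypothetical stand-in for fetching a page: returns one result and the
  // links found on that page. No network access happens in this sketch.
  type Fetch = String => (Int, Seq[String])

  // Lazily walks the frontier: each next() fetches exactly one page and
  // hands its result to the consumer before any further page is fetched.
  def crawl(start: String, fetch: Fetch, maxPages: Int): Iterator[Int] = {
    val frontier = mutable.Queue(start)
    Iterator
      .continually(frontier)
      .takeWhile(_.nonEmpty)      // stop when there is nothing left to visit
      .map { queue =>
        val (result, links) = fetch(queue.dequeue())
        queue ++= links           // newly found links wait at the back
        result
      }
      .take(maxPages)             // keep the sketch finite
  }
}
```

With a fetch function that prints "visited: n" and a consumer that prints "result: n", this interleaves the two lines one by one instead of printing them in growing chunks.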

For now, as a private workaround, I keep the collection of URLs to visit myself and hand them out from onDocument in batches of a managed size, so I only have to wait for the next 10 results.
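
Roughly, the workaround looks like the sketch below. `UrlFrontier`, `add` and `nextBatch` are hypothetical names and not part of scrawler. The `synchronized` blocks are there on the assumption that onDocument may be called from several threads under parallelCrawl:

```scala
import scala.collection.mutable

// Hypothetical helper, not part of scrawler: keeps pending URLs outside the
// crawler and releases them in bounded batches.
final class UrlFrontier(batchSize: Int) {
  require(batchSize > 0, "batchSize must be positive")

  private val pending = mutable.Queue.empty[String]

  // Remember newly discovered URLs instead of returning them all at once.
  def add(urls: Seq[String]): Unit = synchronized { pending ++= urls }

  // Release at most batchSize URLs in FIFO order; the rest stay queued.
  def nextBatch(): List[String] = synchronized {
    List.fill(math.min(batchSize, pending.size))(pending.dequeue())
  }

  // Number of URLs still waiting to be released.
  def size: Int = synchronized(pending.size)
}
```

In onDocument the discovered links go into `add`, and only the URLs returned by `nextBatch()` are wrapped in Visit and emitted. When to release the next batch is left to the caller.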

visox added the bug label Aug 28, 2017