Requirement - what kind of business use case are you trying to solve?
Runtime and resource consumption are very high when running the dependencies job on a large dataset.
Problem - what in Jaeger blocks you from solving the requirement?
Currently the Spark job uses JavaEsSpark.esJsonRDD, which offers none of the optimizations that DataFrames provide, in particular pushdown (https://www.elastic.co/guide/en/elasticsearch/hadoop/master/spark.html#spark-pushdown). So, apart from a custom query (as in PR #86), all documents are fetched and instantiated into full Span objects, even though not all span fields are required for the dependency extraction. This transfers many gigabytes of data from Elasticsearch to the instance running the Spark job, creates a massive memory footprint, and causes a lot of object churn / GC on the JVM running the job.
In addition, the extracted dependencies are written directly to Elasticsearch rather than through the dependencystore API (https://github.com/jaegertracing/jaeger/blob/master/storage/dependencystore/interface.go). While not a problem in itself, it means storage changes always have to cover both sides.
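For reference, a minimal, self-contained sketch of the current RDD-based read path (the index/type name and connection settings are placeholders, not the job's actual configuration):

```java
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaPairRDD;
import org.apache.spark.api.java.JavaSparkContext;
import org.elasticsearch.spark.rdd.api.java.JavaEsSpark;

public class CurrentRddReadSketch {
    public static void main(String[] args) {
        SparkConf conf = new SparkConf()
                .setAppName("jaeger-dependencies")
                .setMaster("local[*]")                 // illustrative only
                .set("es.nodes", "elasticsearch:9200"); // hypothetical ES address
        JavaSparkContext sc = new JavaSparkContext(conf);

        // esJsonRDD returns every matching document as a (docId, json) pair,
        // so complete span documents travel from ES to the Spark workers.
        JavaPairRDD<String, String> spanJson =
                JavaEsSpark.esJsonRDD(sc, "jaeger-span-2017-10-30/span");

        System.out.println("span documents fetched: " + spanJson.count());
        sc.close();
    }
}
```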
Proposal - what do you suggest to solve the problem or improve the existing situation?
I propose to use Spark DataFrames when reading and selecting data from Elasticsearch, so that pushdown can apply filters and projections directly in the storage layer and only the fields needed for dependency extraction are transferred.
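A minimal sketch of what the DataFrame-based read could look like, assuming the elasticsearch-spark SQL data source; the index name, selected field names, and connection settings are placeholders and would have to match the actual Jaeger span mapping:

```java
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

public class DataFrameReadSketch {
    public static void main(String[] args) {
        SparkSession spark = SparkSession.builder()
                .appName("jaeger-dependencies")
                .master("local[*]")   // illustrative only
                .getOrCreate();

        // The elasticsearch-spark connector registers the "org.elasticsearch.spark.sql"
        // data source; pushdown is on by default and translates Spark filters into ES queries.
        Dataset<Row> spans = spark.read()
                .format("org.elasticsearch.spark.sql")
                .option("es.nodes", "elasticsearch:9200") // hypothetical ES address
                .option("pushdown", "true")
                .load("jaeger-span-2017-10-30");          // hypothetical index name

        // Project only the fields needed for dependency extraction, so ES returns
        // a small slice of each document instead of the full span. Array fields
        // (e.g. references) may additionally need "es.read.field.as.array.include".
        Dataset<Row> slim = spans.select("traceID", "spanID", "references", "process.serviceName");

        slim.show(10, false);
        spark.stop();
    }
}
```

With this approach, any filter added on the DataFrame (e.g. a time-range predicate) would also be pushed down into the Elasticsearch query instead of being evaluated on the Spark side.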