This repository has been archived by the owner on Jun 14, 2024. It is now read-only.

MinMax analysis util throws exception on large dataset #528

Open
dai-chen opened this issue Feb 10, 2022 · 1 comment
Labels
untriaged This is the default tag for a newly created issue

Comments

@dai-chen

Describe the issue

I tried to use MinMaxAnalysisUtil to analyze the distribution of a column. It worked well on a small dataset, but threw an exception on my TPC-H dataset, which has around 10GB of data in 1k partitions.

To Reproduce

Run the analysis tool on the TPC-H 10GB dataset:

scala> println(MinMaxAnalysisUtil.analyze(df, Seq("l_discount", "l_quantity"), format = "text"))
java.lang.ClassCastException: java.math.BigDecimal cannot be cast to org.apache.spark.sql.types.Decimal
  at org.apache.spark.sql.types.Decimal$DecimalIsFractional$.compare(Decimal.scala:688)
  at scala.math.Ordering.equiv(Ordering.scala:103)
  at scala.math.Ordering.equiv$(Ordering.scala:103)
  at org.apache.spark.sql.types.Decimal$DecimalIsFractional$.equiv(Decimal.scala:688)
  at com.microsoft.hyperspace.util.MinMaxAnalyzer.$anonfun$analyzeMinMaxHistogram$2(MinMaxAnalysisUtil.scala:661)
  at com.microsoft.hyperspace.util.MinMaxAnalyzer.$anonfun$analyzeMinMaxHistogram$2$adapted(MinMaxAnalysisUtil.scala:661)
  at scala.math.Ordering$$anon$6.compare(Ordering.scala:203)
  at java.util.TimSort.countRunAndMakeAscending(TimSort.java:355)
  at java.util.TimSort.sort(TimSort.java:234)
  at java.util.Arrays.sort(Arrays.java:1438)
  at scala.collection.SeqLike.sorted(SeqLike.scala:659)
  at scala.collection.SeqLike.sorted$(SeqLike.scala:647)
  at scala.collection.AbstractSeq.sorted(Seq.scala:45)
  at scala.collection.SeqLike.sortWith(SeqLike.scala:612)
  at scala.collection.SeqLike.sortWith$(SeqLike.scala:612)
  at scala.collection.AbstractSeq.sortWith(Seq.scala:45)
  at com.microsoft.hyperspace.util.MinMaxAnalyzer.analyzeMinMaxHistogram(MinMaxAnalysisUtil.scala:661)
  at com.microsoft.hyperspace.util.MinMaxAnalyzer.analyzeMinMaxHistogram$(MinMaxAnalysisUtil.scala:635)
  at com.microsoft.hyperspace.util.DataframeMinMaxAnalyzer.analyzeMinMaxHistogram(MinMaxAnalysisUtil.scala:735)
  at com.microsoft.hyperspace.util.MinMaxAnalyzer.$anonfun$analyze$1(MinMaxAnalysisUtil.scala:630)
  at com.microsoft.hyperspace.util.MinMaxAnalyzer.$anonfun$analyze$1$adapted(MinMaxAnalysisUtil.scala:629)
  at scala.collection.immutable.List.foreach(List.scala:392)
  at com.microsoft.hyperspace.util.MinMaxAnalyzer.analyze(MinMaxAnalysisUtil.scala:629)
  at com.microsoft.hyperspace.util.MinMaxAnalyzer.analyze$(MinMaxAnalysisUtil.scala:624)
  at com.microsoft.hyperspace.util.DataframeMinMaxAnalyzer.analyze(MinMaxAnalysisUtil.scala:735)
  at com.microsoft.hyperspace.util.MinMaxAnalysis.analyzeDataframe(MinMaxAnalysisUtil.scala:763)
  at com.microsoft.hyperspace.util.MinMaxAnalysis.analyzeDataframe$(MinMaxAnalysisUtil.scala:760)
  at com.microsoft.hyperspace.util.MinMaxAnalysisUtil$.analyzeDataframe(MinMaxAnalysisUtil.scala:768)
  at com.microsoft.hyperspace.util.MinMaxAnalysisUtil$.analyze(MinMaxAnalysisUtil.scala:774)
  ... 49 elided

Expected behavior

Print the histogram as usual.

Environment

Please complete the following information if applicable:

  • OS: iOS
  • Apache Spark Version: 3.1.2
  • Platform: local with master branch code
@dai-chen dai-chen added the untriaged This is the default tag for a newly created issue label Feb 10, 2022
@dai-chen
Author

This seems to be caused by the BigDecimal column type rather than by the size of the data. If the column is created as, or cast to, float/double, the utility generates the histogram without any issue.
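
Until decimal columns are handled, the cast described above can be applied before calling the utility. Below is a minimal sketch for spark-shell, assuming the analyzed columns can tolerate the precision loss of a double; `castDecimalsToDouble` is a hypothetical helper and not part of Hyperspace.

```scala
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.col
import org.apache.spark.sql.types.{DecimalType, DoubleType}

// Hypothetical helper (not part of Hyperspace): cast the named DecimalType
// columns to DoubleType and leave columns of any other type untouched.
def castDecimalsToDouble(df: DataFrame, columns: Seq[String]): DataFrame =
  columns.foldLeft(df) { (acc, name) =>
    acc.schema(name).dataType match {
      case _: DecimalType => acc.withColumn(name, col(name).cast(DoubleType))
      case _              => acc
    }
  }

// In the spark-shell session from the report, where `df` is already defined:
//   import com.microsoft.hyperspace.util.MinMaxAnalysisUtil
//   val cols = Seq("l_discount", "l_quantity")
//   println(MinMaxAnalysisUtil.analyze(castDecimalsToDouble(df, cols), cols, format = "text"))
```

This only sidesteps the exception; bucket boundaries for decimal columns are then computed on approximate double values.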
