Investigate options for running on a spark cluster #27
Comments
UKCEH's DataLabs has the capability of using Spark clusters. I will try there with a minimal version of the pipeline first, to see if it is at least viable in general.
This is proving difficult. TL;DR: Full: The first way follows the Beam docs (see the "Deploying Spark with your application" section): start the service using Docker, for which an image is already available, or run the service from the source code. What they don't say about the latter is that the necessary 'gradlew' file is only available directly from the GitHub repository. The release zip files on GitHub and their equivalents in conda/PyPI do not include the file.
This can apparently be worked around, but it either needs some extra arguments passed in to a particular part of the program, which I don't know enough to be able to do, or is something to do with incompatibilities between Beam, Java and Spark (see https://expertbeacon.com/solving-the-cannot-access-class-sun-nio-ch-directbuffer-error/). For example, Beam claims it offers support for the 2.3.x version of Spark, but the version on DataLabs is 2.5.x, with the Java version in /usr/bin/java being 17.0.12. The second way of running it using Beam is to compile the pipeline as an 'UberJar', which is something vaguely similar to a container but specific to Spark, and which contains the Beam JobServer (I think...).
However, this fails too, as the Spark Master (in other words, the scheduler) needs to have a REST API available for the pipeline 'job' to be submitted to it this way, and from what I could tell, this isn't available on the DataLabs Spark implementation. The third way of running it is similar: the pipeline is compiled to a Jar as before, but not an UberJar, and is then submitted to the Spark Master using the spark-submit executable.
This again compiles fine but runs into a problem where, for some reason, a connection to the Spark Master cannot be established. This is the same error I get if I try to run a simple Spark example via a Jupyter notebook, which leads me to believe this is something to do with the DataLabs implementation of Spark and not my code.
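One way to narrow down the connection problem independently of Beam might be a plain TCP reachability check from the notebook host. The sketch below uses only the Python standard library. It assumes the Spark standalone defaults of port 7077 for the master and port 6066 for the REST submission server, and the hostname in the usage example is a placeholder, not the real DataLabs address. An open port only shows that something is listening there; it does not prove that the Spark versions are compatible.

```python
import socket


def port_is_open(host: str, port: int, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port can be opened."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts and DNS failures.
        return False


def check_spark_master(host: str, master_port: int = 7077, rest_port: int = 6066) -> dict:
    """Check the (assumed default) master and REST submission ports."""
    return {
        "master": port_is_open(host, master_port),
        "rest": port_is_open(host, rest_port),
    }


if __name__ == "__main__":
    # Placeholder hostname: replace with the Spark Master address of the cluster.
    print(check_spark_master("spark-master.example"))
```

If "master" comes back False from the same machine that runs the notebook, the problem is more likely networking or cluster configuration than the pipeline code. If "rest" is False, that would be consistent with the REST API not being enabled.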
Next thing to try might be this? https://github.com/moradology/beam-pyspark-runner
There are some problems using the DirectRunner on JASMIN (it crashes on HTTP errors).
We need a better way to run the scripts. A Spark cluster would probably be the most stable option, using Beam's Spark runner.
Investigate how we can do this: does JASMIN have an existing Spark cluster, or is there one available to CEH somewhere (Iain might know or be able to point in a direction)?
Another option is spinning up our own Spark cluster on JASMIN; investigate whether this is possible/reasonable.
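For reference, a rough sketch of what switching a Python Beam pipeline from the DirectRunner to Spark could look like on the pipeline side. It assumes the job-server route from the Beam Spark runner docs, where a Beam job server is already running and the pipeline is pointed at it with the PortableRunner. The endpoint `localhost:8099` and the `LOOPBACK` environment are the values used in the docs' local example, not anything specific to JASMIN or DataLabs, and the helper name `spark_pipeline_args` is made up for illustration. The flag names should be checked against the installed Beam version.

```python
from typing import List


def spark_pipeline_args(job_endpoint: str = "localhost:8099",
                        environment_type: str = "LOOPBACK") -> List[str]:
    """Build pipeline flags that target a running Beam job server.

    job_endpoint: address of the Beam job server (assumed, from the docs' example).
    environment_type: LOOPBACK is only intended for local testing.
    """
    return [
        "--runner=PortableRunner",
        f"--job_endpoint={job_endpoint}",
        f"--environment_type={environment_type}",
    ]


if __name__ == "__main__":
    # With apache-beam installed, these flags would be passed on as:
    #   from apache_beam.options.pipeline_options import PipelineOptions
    #   options = PipelineOptions(spark_pipeline_args())
    print(spark_pipeline_args())
```

Keeping the flags in one helper means the same pipeline code can still be run with the DirectRunner for small tests by simply not passing them.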
acceptance criteria