n1850.ne30_tn14.hybrid_fatessp.20241219/ fails after 107 years #628

Closed
adagj opened this issue Jan 23, 2025 · 6 comments

@adagj
Contributor

adagj commented Jan 23, 2025

Describe the bug

  • NorESM version: noresm2 alpba8d

  • HPC platform: Betzy

  • Error message (if applicable):

    90305.6321842380 -> 30000.0000000000
    764: [b3359:542858] *** An error occurred in MPI_Waitall
    764: [b3359:542858] *** reported by process [4461265905046126592,4461274545734550268]
    764: [b3359:542858] *** on communicator MPI COMMUNICATOR 67 CREATE FROM 66
    764: [b3359:542858] *** MPI_ERR_TRUNCATE: message truncated
    764: [b3359:542858] *** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,
    764: [b3359:542858] *** and potentially your MPI job)
    0: slurmstepd: error: *** STEP 1060566.0 ON b3352 CANCELLED AT 2025-01-22T13:47:50 ***

  • Case described here: n1850.ne30_tn14.hybrid_fatessp.20241219 noresm3_dev_simulations#67

  • Case folder on BETZY: /cluster/projects/nn9560k/adagj/cases-noresm2.5/n1850.ne30_tn14.hybrid_fatessp.20241219/

  • Work folder on BETZY: /cluster/work/users/adagj/noresm/n1850.ne30_tn14.hybrid_fatessp.20241219/

  • Case folder on NIRD: /datalake/NS9560K/noresm3/cases/n1850.ne30_tn14.hybrid_fatessp.20241219/

Additional context
It was decided today to continue this simulation, but I can't because it keeps failing.

@adagj adagj added the bug Something isn't working label Jan 23, 2025
@adagj adagj added this to the NorESM2.5 milestone Jan 23, 2025
@mvdebolskiy

I don't see atm.log* files in the run directory.
Also, I see that b3559 node is giving the error every time. Can you try to exclude it?
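One way to check which node keeps showing up in the abort messages is to tally the host names in the Open MPI error lines. This is only a minimal sketch: it assumes the log lines follow the `rank: [host:pid] ***` pattern seen in the error message above, and the function name `count_aborting_hosts` and the sample list are made up for illustration.

```python
import re
from collections import Counter

# Matches Open MPI abort lines such as
# "764: [b3359:542858] *** An error occurred in MPI_Waitall"
ABORT_LINE = re.compile(r"^\s*\d+:\s*\[(?P<host>[^:\]]+):\d+\]\s*\*\*\*")


def count_aborting_hosts(lines):
    """Return a Counter of host names that appear in MPI abort lines."""
    counts = Counter()
    for line in lines:
        match = ABORT_LINE.match(line)
        if match:
            counts[match.group("host")] += 1
    return counts


# Sample lines copied from the error message in this issue.
sample = [
    "764: [b3359:542858] *** An error occurred in MPI_Waitall",
    "764: [b3359:542858] *** MPI_ERR_TRUNCATE: message truncated",
    "0: slurmstepd: error: *** STEP 1060566.0 ON b3352 CANCELLED AT 2025-01-22T13:47:50 ***",
]

hosts = count_aborting_hosts(sample)
```

Running the same function over the log of each failed submission would show whether one host name dominates before deciding what to pass to `--exclude`.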

@oyvindseland

#587 shows how to exclude a node.

@adagj
Contributor Author

adagj commented Jan 23, 2025

Ah, thanks! I will try to exclude it. Do I need to rebuild?

@adagj
Contributor Author

adagj commented Jan 23, 2025

So #587 says $SRCROOT/ccsm_config/machines/betzy/env_batch.xml, but that folder doesn't exist in alpha08. I can find $SRCROOT/ccs_config/machines/betzy, but it contains no env_batch.xml file. Can somebody provide a path?

@adagj
Contributor Author

adagj commented Jan 23, 2025

OK, so if someone else runs into this problem, do this:

open $SRCROOT/ccs_config/machines/betzy/config_batch.xml

and add

  <directives>
    <directive> --ntasks={{ total_tasks }}</directive>
    <directive> --export=ALL</directive>
    <directive> --switches=1</directive>
    <directive> --exclude=b3559</directive> <!-- add this line -->
  </directives>
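For anyone who would rather script the edit than do it by hand, here is a rough sketch of the same change using Python's standard `xml.etree.ElementTree`. It is only an illustration: the helper name `add_exclude_directive` is made up, it works on an XML string rather than the real file, and it adds the directive to every `<directives>` block it finds, which may be broader than what you want in the full config_batch.xml.

```python
import xml.etree.ElementTree as ET


def add_exclude_directive(xml_text, node):
    """Return xml_text with an --exclude=<node> directive added to each
    <directives> block that does not already have it."""
    root = ET.fromstring(xml_text)
    wanted = f"--exclude={node}"
    # list() so the tree is not modified while it is being walked
    for block in list(root.iter("directives")):
        existing = [(d.text or "").strip() for d in block.findall("directive")]
        if wanted not in existing:
            new = ET.SubElement(block, "directive")
            new.text = " " + wanted
    return ET.tostring(root, encoding="unicode")


# The directives block as it looks before the fix described above.
snippet = """<directives>
    <directive> --ntasks={{ total_tasks }}</directive>
    <directive> --export=ALL</directive>
    <directive> --switches=1</directive>
</directives>"""

patched = add_exclude_directive(snippet, "b3559")
```

Note that the default ElementTree parser drops XML comments, so for the real file a manual edit as shown above is the safer route.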

@mvdebolskiy

@adagj can this be closed?

@adagj adagj closed this as completed Jan 24, 2025
@github-project-automation github-project-automation bot moved this from Todo to Done in NorESM Development Jan 24, 2025