Memory Issue with RunIII2024Summer24DRPremix #46975
Comments
A new Issue was created by @vlimant. @Dr15Jones, @antoniovilela, @makortel, @mandrenguyen, @rappoccio, @sextonkennedy, @smuzaffar can you please review it and eventually sign/assign? Thanks. cms-bot commands are listed here
assign core
New categories assigned: core @Dr15Jones,@makortel,@smuzaffar you have been requested to review this Pull request/Issue and eventually sign? Thanks
Plotting the RSS and VSIZE from
It seems to me the overall memory footprint is too much for the 8 GB limit on 4 cores.
We would probably need the same for a successful job, then, to figure out why production is not failing across the board. There must be something specific to the 3% of input files (the overall failure rate of the workflow) that makes the job start so high in memory usage.
Is there any correlation between failures and sites?
I ran the job from
I then reran the job reading the same files over xrootd, and again without modifying the pileup file list, and got this
In the full pileup file list case, the
This behavior is reproducible with 1 thread (the extra cost being around 900 MB), and is visible at the level of memory allocations (e.g. with MaxMemoryPreload AllocMonitor).
I think I found the culprit. Comparing IgProf live profiles after the 10th event (running with 1 thread) between 1 local pileup file and all xrootd pileup files shows a 488 MB increase (per stream!) in
The job was configured with 499465 pileup files, translating to ~1025 bytes per file. On a quick look I see some potential (and confusing) duplication of file name data in the
Another quick thought would be to avoid replicating the
I also can't avoid asking if the scale of 499465 pileup files is really something that every job has to see?
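For reference, the per-file figure above can be reproduced with a few lines of Python. This is only a sanity check of the arithmetic, under the assumption that the 488 MB reported by the profile is in binary megabytes (MiB); the function name is made up for illustration.

```python
# Sanity check of the "~1025 bytes per file" figure quoted above.
# Assumption: the 488 MB from the IgProf comparison is in MiB.
MIB = 1024 * 1024


def bytes_per_file(increase_mib: float, n_files: int) -> float:
    """Average per-stream memory cost of one pileup file entry, in bytes."""
    return increase_mib * MIB / n_files


if __name__ == "__main__":
    cost = bytes_per_file(488, 499465)
    print(f"{cost:.0f} bytes per pileup file, per stream")
```

With decimal megabytes instead, the result would come out slightly below 1000 bytes per file, so the conclusion (roughly 1 kB per configured pileup file, per stream) does not depend much on the unit convention.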
There are some other changes I'd like to make to EmbeddedRootSource to share information across streams (e.g., the mapping from the file identifier to the filename ought to be cached and shared). Refactoring the InputFileCatalog would be a natural fit.
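To illustrate the idea of sharing across streams (not the actual CMSSW code), here is a minimal Python sketch in which the mapping from file identifier to file name is built once and every stream keeps only a reference to it. `FileCatalog`, `Stream` and `make_streams` are hypothetical names chosen for this example; they do not correspond to EmbeddedRootSource or InputFileCatalog interfaces.

```python
# Illustrative sketch only: FileCatalog, Stream and make_streams are
# hypothetical names, not CMSSW classes or functions.
from typing import Dict, List


class FileCatalog:
    """Read-only mapping from a file identifier to its file name, built once."""

    def __init__(self, file_names: List[str]) -> None:
        self._names: Dict[int, str] = dict(enumerate(file_names))

    def name(self, file_id: int) -> str:
        return self._names[file_id]

    def __len__(self) -> int:
        return len(self._names)


class Stream:
    """A processing stream holding a reference to the shared catalog."""

    def __init__(self, stream_id: int, catalog: FileCatalog) -> None:
        self.stream_id = stream_id
        self.catalog = catalog  # a reference to the shared object, not a copy


def make_streams(n_streams: int, file_names: List[str]) -> List[Stream]:
    """Build the catalog once and hand the same object to every stream."""
    catalog = FileCatalog(file_names)
    return [Stream(i, catalog) for i in range(n_streams)]


if __name__ == "__main__":
    streams = make_streams(4, ["pileup_a.root", "pileup_b.root"])
    # All streams see the same catalog object, so its cost is paid once per job.
    print(all(s.catalog is streams[0].catalog for s in streams))
```

In this arrangement the catalog cost no longer scales with the number of streams, which is the effect the comment above is after.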
Now that a cause is known, the memory usage can well depend on the site. The
Nice!
#47013 removes duplication (really triplication) of file name data. That seemed simple enough to be done quickly and to be backported to 14_0_X and 14_1_X. MaxMemoryPreload showed a 197 MB reduction per stream, so on a 4-thread/stream job that would translate to 787 MB. On a local test at CERN the RSS and VSIZE decreased like this
There is more potential (like ~900 MB on 4 streams) with #46975 (comment), but that will take more time. I hope #47013 would at least allow more jobs to stay under the memory limit, if not avoid all the failures. I can think of further memory reduction options in
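As a rough cross-check of the numbers above, the per-stream reduction can be scaled to the whole job and compared with the 8 GB limit mentioned earlier in the thread. The sketch assumes the reduction scales linearly with the number of streams and that 8 GB means 8192 MB; the helper names are made up for illustration.

```python
# Rough scaling of the per-stream reduction to a whole job.
# Assumptions: linear scaling with stream count, and 8 GB == 8192 MB.


def total_reduction_mb(per_stream_mb: float, n_streams: int) -> float:
    """Job-level reduction if every stream saves the same amount."""
    return per_stream_mb * n_streams


def fraction_of_limit(reduction_mb: float, limit_gb: float) -> float:
    """Reduction expressed as a fraction of the job memory limit."""
    return reduction_mb / (limit_gb * 1024)


if __name__ == "__main__":
    saved = total_reduction_mb(197, 4)
    print(f"{saved:.0f} MB saved, {fraction_of_limit(saved, 8):.1%} of the limit")
```

With the rounded 197 MB per stream this gives 788 MB, which agrees with the 787 MB quoted above to within rounding of the per-stream value, and corresponds to a bit under a tenth of the 8 GB limit.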
Going through the logs again, I see
The
And just to report findings from IgProf MEM_LIVE after the 10th event, on one local pileup file (so won't show the
Total allocated memory at the time of the dump was 2100 MB, divided mainly in
I also looked at the total number of memory allocations (as an indication of memory churn). The profile showed a total of 143 million allocations, of which
@vlimant The backportable fix #47013 was merged in master, and I opened backports to
I'd like to see a few rounds of IBs with #47013 before the backports, so I'd expect to sign the backports on Monday next week.
Given the current interest in memory leaks (#46901), I am putting here a report of large memory usage in MC production in the Summer24 campaign.
Using this workflow : https://cmsweb.cern.ch/reqmgr2/fetch?rid=cmsunified_task_GEN-RunIII2024Summer24wmLHEGS-00051__v1_T_241126_105820_5192 as an example (there are several others with the same symptom)
The error report: https://cms-unified.web.cern.ch/cms-unified/report/cmsunified_task_GEN-RunIII2024Summer24wmLHEGS-00051__v1_T_241126_105820_5192#GEN-RunIII2024Summer24wmLHEGS-00051_0 shows a good fraction of jobs going beyond 8 GB with 4 cores.
Logs are available under https://cms-unified.web.cern.ch/cms-unified/joblogs/cmsunified_task_GEN-RunIII2024Summer24wmLHEGS-00051__v1_T_241126_105820_5192/50660/GEN-RunIII2024Summer24wmLHEGS-00051_0/
and the cmsRun2 indeed does get interrupted early : https://cms-unified.web.cern.ch/cms-unified/joblogs/cmsunified_task_GEN-RunIII2024Summer24wmLHEGS-00051__v1_T_241126_105820_5192/50660/GEN-RunIII2024Summer24wmLHEGS-00051_0/487de694-b24d-4a5e-9722-90188c9e4bbe-750-0-logArchive/job/WMTaskSpace/cmsRun2/cmsRun2-stdout.log
(memory issue not necessarily related to the last module in the last line of the MemoryCheck though)
generic cmsDriver
and configuration file https://cmsweb.cern.ch/couchdb/reqmgr_config_cache/09cd8d21fa53b131732b49d6e27a16d1/configFile
https://cms-unified.web.cern.ch/cms-unified/joblogs/cmsunified_task_GEN-RunIII2024Summer24wmLHEGS-00051__v1_T_241126_105820_5192/50660/GEN-RunIII2024Summer24wmLHEGS-00051_0/487de694-b24d-4a5e-9722-90188c9e4bbe-750-0-logArchive/job/WMTaskSpace/cmsRun2/PSet.pkl and https://cms-unified.web.cern.ch/cms-unified/joblogs/cmsunified_task_GEN-RunIII2024Summer24wmLHEGS-00051__v1_T_241126_105820_5192/50660/GEN-RunIII2024Summer24wmLHEGS-00051_0/487de694-b24d-4a5e-9722-90188c9e4bbe-750-0-logArchive/job/WMTaskSpace/cmsRun2/PSet.py for that particular failed cmsRun2.
Could someone look into this ?
Side note, to be propagated to WMCore: even though cmsRun2 was killed, the next steps (cmsRun3, 4, 5) are run regardless, and given that the full job is marked as failed, the output will just be sent to the bin. I wonder how (in)efficient it is to keep running steps even though the output will be tossed away.