Excessive memory usage when using source grouping #1808

Open
keflavich opened this issue Jul 10, 2024 · 11 comments

Comments

@keflavich
Contributor

I have an MWE that fails reliably now. It's not all that minimal, but minimal enough I hope.

import numpy as np
from photutils.psf import PSFPhotometry, IterativePSFPhotometry, SourceGrouper
from photutils.detection import DAOStarFinder
from photutils.background import LocalBackground

import scipy.ndimage
from astropy.io import fits
from astropy.modeling.fitting import LevMarLSQFitter

from webbpsf.utils import to_griddedpsfmodel

basepath = '/blue/adamginsburg/adamginsburg/jwst/brick/'

im1 = fits.open(f'{basepath}/analysis/MWE_example.fits')
obsdate = im1[0].header['DATE-OBS']
module = 'merged'
data = im1['SCI'].data[:400, :400]
err = im1['ERR'].data[:400, :400]

# Build an inverse-error weight map and flag unusable pixels
weight = err**-1
maxweight = np.percentile(weight[np.isfinite(weight)], 95)
minweight = np.percentile(weight[np.isfinite(weight)], 5)
badweight = np.percentile(weight[np.isfinite(weight)], 1)
weight[err < 1e-5] = 0
weight[np.isnan(weight)] = 0
bad = np.isnan(weight) | (data == 0) | np.isnan(data) | (weight == 0) | (err == 0) | (data < 1e-5)

mask = bad

# Clean up the mask: binary opening then closing to remove isolated pixels
mask = scipy.ndimage.binary_dilation(scipy.ndimage.binary_erosion(mask, iterations=1), iterations=1)
mask = scipy.ndimage.binary_erosion(scipy.ndimage.binary_dilation(mask, iterations=1), iterations=1)

# Use the median error as the detection threshold reference
filtered_errest = np.nanmedian(err)

# PSF FWHM in pixels
fwhm_pix = 2.302

daofind_tuned = DAOStarFinder(threshold=5 * filtered_errest,
                              fwhm=fwhm_pix, roundhi=1.0, roundlo=-1.0,
                              sharplo=0.30, sharphi=1.40,
                              exclude_border=True
                              )
# Group sources within 2 FWHM of each other for simultaneous fitting
grouper = SourceGrouper(2 * fwhm_pix)


oversample = 1
proposal_id = 1182
field = '004'
filtername = 'f444w'

blur_ = '_blur'
# Load the gridded PSF model and force non-negative fluxes
dao_psf_model = to_griddedpsfmodel(f'{basepath}/psfs/{filtername.upper()}_{proposal_id}_{field}_merged_PSFgrid_oversample{oversample}{blur_}.fits')
dao_psf_model.flux.min = 0

print("Initializing IterativePSFPhotometry")
phot_g_iter = IterativePSFPhotometry(finder=daofind_tuned,
                                     localbkg_estimator=LocalBackground(5, 15),
                                     grouper=grouper,
                                     psf_model=dao_psf_model,
                                     fitter=LevMarLSQFitter(),
                                     maxiters=5,
                                     fit_shape=(5, 5),
                                     aperture_radius=2 * fwhm_pix,
                                     progress_bar=True,
                                     xy_bounds=2,
                                     mode='all',
                                     )

print("Beginning iterations")
result_g_iter = phot_g_iter(data, mask=mask)

This is running on a 400x400 image, and it runs out of memory on an 8 GB node. I'm running it under memory_profiler on 16 GB, 32 GB, and 64 GB nodes to see what the peak usage is and whether it can complete.

I think the peak memory usage is happening somewhere in the n-th iteration.
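(As an aside, here's a rough sketch of how the peak could be captured for a single call with memory_profiler's memory_usage function instead of @profile decorators; phot_g_iter, data, and mask are the objects from the MWE above.)

from memory_profiler import memory_usage

# Sample the process memory every 0.5 s while the photometry call runs and
# report only the peak (in MiB); depending on the memory_profiler version
# this returns a float or a one-element list.
peak = memory_usage((phot_g_iter, (data,), {'mask': mask}),
                    interval=0.5, max_usage=True)
print(f'peak memory: {peak} MiB')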

@keflavich
Contributor Author

Here's the report for 8 GB, 16 GB, and 32 GB:

Initializing IterativePSFPhotometry
Beginning iterations
Fit source/group:  20%|██        | 402/2007 [00:08<00:37, 42.77it/s]/tmp/slurmd/job36932843/slurm_script: line 4: 152380 Killed                  /blue/adamginsburg/adamginsburg/miniconda3/envs/python310/bin/python -m memory_profiler /blue/adamginsburg/adamginsburg/jwst/brick/analysis/dao_iterative_memory_mwe.py
slurmstepd: error: Detected 1 oom-kill event(s) in StepId=36932843.batch. Some of your processes may have been killed by the cgroup out-of-memory handler.

Initializing IterativePSFPhotometry
Beginning iterations
Fit source/group:  42%|████▏     | 847/2007 [00:56<06:04,  3.18it/s]/tmp/slurmd/job36934208/slurm_script: line 4: 3603512 Killed                  /blue/adamginsburg/adamginsburg/miniconda3/envs/python310/bin/python -m memory_profiler /blue/adamginsburg/adamginsburg/jwst/brick/analysis/dao_iterative_memory_mwe.py
slurmstepd: error: Detected 1 oom-kill event(s) in StepId=36934208.batch. Some of your processes may have been killed by the cgroup out-of-memory handler.

Initializing IterativePSFPhotometry
Beginning iterations
Fit source/group:  90%|████████▉ | 1803/2007 [00:55<00:06, 33.04it/s]/tmp/slurmd/job36934209/slurm_script: line 4: 25232 Killed                  /blue/adamginsburg/adamginsburg/miniconda3/envs/python310/bin/python -m memory_profiler /blue/adamginsburg/adamginsburg/jwst/brick/analysis/dao_iterative_memory_mwe.py
slurmstepd: error: Detected 1 oom-kill event(s) in StepId=36934209.batch. Some of your processes may have been killed by the cgroup out-of-memory handler.

and the 64 GB run is still going:

Initializing IterativePSFPhotometry
Beginning iterations
Fit source/group: 100%|██████████| 2007/2007 [01:02<00:00, 32.20it/s]
Add model sources: 100%|██████████| 3205/3205 [00:04<00:00, 761.94it/s]

@keflavich
Contributor Author

The 64 GB run eventually failed:

Initializing IterativePSFPhotometry
Beginning iterations
Fit source/group: 100%|██████████| 2007/2007 [01:02<00:00, 32.20it/s]
Add model sources: 100%|██████████| 3205/3205 [00:04<00:00, 761.94it/s]
Fit source/group: 100%|██████████| 1262/1262 [04:09<00:00,  5.07it/s]
Add model sources: 100%|██████████| 6761/6761 [00:09<00:00, 743.57it/s]
Fit source/group:   2%|▏         | 15/783 [00:59<50:28,  3.94s/it]
Traceback (most recent call last):
  File "/blue/adamginsburg/adamginsburg/miniconda3/envs/python310/lib/python3.10/runpy.py", line 196, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/blue/adamginsburg/adamginsburg/miniconda3/envs/python310/lib/python3.10/runpy.py", line 86, in _run_code
    exec(code, run_globals)
  File "/blue/adamginsburg/adamginsburg/miniconda3/envs/python310/lib/python3.10/site-packages/memory_profiler.py", line 1351, in <module>
    exec_with_profiler(script_filename, prof, args.backend, script_args)
  File "/blue/adamginsburg/adamginsburg/miniconda3/envs/python310/lib/python3.10/site-packages/memory_profiler.py", line 1252, in exec_with_profiler
    exec(compile(f.read(), filename, 'exec'), ns, ns)
  File "/blue/adamginsburg/adamginsburg/jwst/brick/analysis/dao_iterative_memory_mwe.py", line 69, in <module>
    result_g_iter = phot_g_iter(data, mask=mask)
  File "/blue/adamginsburg/adamginsburg/repos/photutils/photutils/psf/photometry.py", line 1944, in __call__
    new_tbl = self._psfphot(residual_data, mask=mask, error=error,
  File "/blue/adamginsburg/adamginsburg/repos/photutils/photutils/psf/photometry.py", line 1389, in __call__
    fit_params = self._fit_sources(data, init_params, error=error,
  File "/blue/adamginsburg/adamginsburg/repos/photutils/photutils/psf/photometry.py", line 1048, in _fit_sources
    yi, xi, cutout = self._define_fit_data(sources_, data, mask)
  File "/blue/adamginsburg/adamginsburg/repos/photutils/photutils/psf/photometry.py", line 893, in _define_fit_data
    raise ValueError(msg)
ValueError: Source at (55.492494305009046, 10.849163838754908) is completely masked. Remove the source from init_params or correct the input mask.

@keflavich
Contributor Author

That last failure may indicate a problem with source-parameter validation in IterativePSFPhotometry.
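A possible stopgap (just a sketch, not what IterativePSFPhotometry does internally; the helper and the x_init/y_init column names are only illustrative) would be to drop sources whose entire fit_shape cutout is masked before fitting, as the error message suggests:

import numpy as np

def drop_fully_masked_sources(init_params, mask, fit_shape=(5, 5)):
    """Remove sources whose entire fit_shape cutout falls on masked pixels."""
    half_y, half_x = fit_shape[0] // 2, fit_shape[1] // 2
    keep = []
    for row in init_params:
        x = int(round(row['x_init']))
        y = int(round(row['y_init']))
        cutout = mask[max(y - half_y, 0):y + half_y + 1,
                      max(x - half_x, 0):x + half_x + 1]
        # Keep the source only if at least one unmasked pixel remains
        keep.append(cutout.size > 0 and not cutout.all())
    return init_params[np.array(keep)]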

@larrybradley
Member

The question is why a source landed on a completely masked region.

@keflavich
Contributor Author

One possible explanation for the OOM is the compound models:

from tqdm.auto import tqdm

# Repeatedly add the PSF model to a growing compound model, mimicking what
# source grouping does for a very large group
models = []
x = 1000000
for i in tqdm(range(x)):
    models.append(dao_psf_model.copy())
    if i == 0:
        psf_model = dao_psf_model
    else:
        psf_model += dao_psf_model

(example suggested by Larry on Slack)

gives me

  0%|          | 2303/1000000 [01:02<8:40:40, 31.94it/s]/tmp/slurmd/job36937239/slurm_script: line 4: 155531 Killed                  /blue/adamginsburg/adamginsburg/miniconda3/envs/python310/bin/python -m memory_profiler /blue/adamginsburg/adamginsburg/jwst/brick/analysis/dao_iterative_memory_mwe.py
slurmstepd: error: Detected 1 oom-kill event(s) in StepId=36937239.batch. Some of your processes may have been killed by the cgroup out-of-memory handler.

@keflavich
Contributor Author

Confirmed that grouping causes the problem; without the grouper, the 8 GB run completes, while the grouped run that follows is killed:

Initializing PSFPhotometry (no groups)
Beginning iterations
Fit source/group: 100%|██████████| 3205/3205 [00:20<00:00, 155.58it/s]
WARNING: One or more fit(s) may not have converged. Please check the "flags" column in the output table. [photutils.psf.photometry]
Initializing PSFPhotometry
Beginning iterations
Fit source/group:  20%|█▉        | 400/2007 [00:08<00:38, 41.62it/s]/tmp/slurmd/job36937400/slurm_script: line 4: 2472349 Killed                  /blue/adamginsburg/adamginsburg/miniconda3/envs/python310/bin/python -m memory_profiler /blue/adamginsburg/adamginsburg/jwst/brick/analysis/dao_iterative_memory_mwe.py
slurmstepd: error: Detected 1 oom-kill event(s) in StepId=36937400.batch. Some of your processes may have been killed by the cgroup out-of-memory handler.

@larrybradley
Member

The issue is due to excessive memory used by compound Astropy models: astropy/astropy#16701

@SterlingYM

SterlingYM commented Jul 28, 2024

I was looking into this issue (it has also been a problem for me while running PSF photometry on a 250x250 cutout with 100-400 sources, often requiring more than 16 GB and sometimes 32 GB on a cluster) and noticed that the memory allocation (heap size) keeps increasing during a loop in which I repeatedly initialize and call IterativePSFPhotometry (as a wrapped function call).

Maybe I'm not understanding how Python memory management and garbage collection work, but I was expecting memory usage to rise and drop for each iteration.

Is this system-specific behavior, the upstream Astropy issue, or something about my object structure? I'm trying to find a temporary workaround for the memory issue while we wait for the upstream fix (see the cleanup sketch after the code below).

It's not an MWE, but the structure is something like this:

class MyClass:
    def __init__(self, data, error):
        self.data = data
        self.error = error

    def do_photometry(self, **kwargs):
        # . . .
        # initialize SourceGrouper, LocalBackground, and DAOStarFinder here...
        self.do_some_stuff()
        self.do_more_stuff()
        # . . .

        # newly initialize an IterativePSFPhotometry object
        psf_iter = IterativePSFPhotometry(**some_kwargs)
        phot_result = psf_iter(self.data, error=self.error)

        # don't the models and other large memory-eating objects get deleted upon return?
        return phot_result

if __name__ == "__main__":
    my_class = MyClass(data, error)

    # iterating over some kwargs to test photometry with different settings
    for kwargs in list_of_kwargs[:10]:
        my_class.do_photometry(**kwargs)
        # memory heap size keeps increasing during the whole loop
        # resident size drops once or twice, but generally keeps increasing too
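One temporary measure I'm trying (just a sketch; I'm not sure it helps if the compound models themselves are the culprit, but it at least rules out collection latency) is to drop the result and force a garbage-collection pass between configurations:

import gc

for kwargs in list_of_kwargs[:10]:
    phot_result = my_class.do_photometry(**kwargs)
    # ... use or save phot_result here ...

    # Drop the reference and force a collection so the IterativePSFPhotometry
    # object and its (possibly compound) models can be freed before the next
    # configuration runs.
    del phot_result
    gc.collect()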

@larrybradley changed the title from "Excessive memory usage in IterativePSFPhotometry" to "Excessive memory usage when using source grouping" on Aug 5, 2024
@larrybradley
Member

@SterlingYM The excessive memory issue is triggered when using source grouping. Source grouping creates a compound Astropy model: every source in the group contributes a term to the compound model so that the group can be fit simultaneously. If the number of sources in a group gets large, the compound Astropy model requires a huge amount of memory. When astropy/astropy#16701 is fixed, this will no longer be a problem.

In the meantime, if you are running into this issue, you can try limiting your group sizes by using a smaller minimum separation.
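For example (a sketch only; 1.5 FWHM is an arbitrary illustration and the right value depends on how crowded the field is), shrinking the min_separation given to SourceGrouper keeps the groups, and therefore the compound models built from them, smaller:

from photutils.psf import SourceGrouper

fwhm_pix = 2.302  # FWHM in pixels, from the MWE above

# Sources closer than min_separation are placed in the same group, so a
# smaller value yields smaller groups and smaller compound models.
grouper = SourceGrouper(min_separation=1.5 * fwhm_pix)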

@larrybradley
Member

I've changed the title of the issue to indicate that this is specifically triggered by source grouping.

@SterlingYM

Adjusting group size helped with memory usage. Thank you for the suggestion!
