Excessive memory usage when using source grouping #1808

Open
keflavich opened this issue Jul 10, 2024 · 11 comments

Comments

@keflavich
Contributor

I have an MWE that fails reliably now. It's not all that minimal, but minimal enough I hope.

import numpy as np
from photutils.psf import PSFPhotometry, IterativePSFPhotometry, SourceGrouper
from photutils.detection import DAOStarFinder
from photutils.background import LocalBackground

import scipy.ndimage
from astropy.io import fits
from astropy.modeling.fitting import LevMarLSQFitter

from webbpsf.utils import to_griddedpsfmodel

basepath = '/blue/adamginsburg/adamginsburg/jwst/brick/'

im1 = fits.open(f'{basepath}/analysis/MWE_example.fits')
obsdate = im1[0].header['DATE-OBS']
module = 'merged'
data = im1['SCI'].data[:400, :400]
err = im1['ERR'].data[:400, :400]

# Build an inverse-error weight map and flag unusable pixels
weight = err**-1
maxweight = np.percentile(weight[np.isfinite(weight)], 95)
minweight = np.percentile(weight[np.isfinite(weight)], 5)
badweight = np.percentile(weight[np.isfinite(weight)], 1)
weight[err < 1e-5] = 0
weight[np.isnan(weight)] = 0
bad = np.isnan(weight) | (data == 0) | np.isnan(data) | (weight == 0) | (err == 0) | (data < 1e-5)

mask = bad

# Clean up the mask: binary opening then closing to remove isolated pixels
mask = scipy.ndimage.binary_dilation(scipy.ndimage.binary_erosion(mask, iterations=1), iterations=1)
mask = scipy.ndimage.binary_erosion(scipy.ndimage.binary_dilation(mask, iterations=1), iterations=1)

# Use the median error as the detection threshold reference
filtered_errest = np.nanmedian(err)

# PSF FWHM in pixels
fwhm_pix = 2.302

daofind_tuned = DAOStarFinder(threshold=5 * filtered_errest,
                              fwhm=fwhm_pix, roundhi=1.0, roundlo=-1.0,
                              sharplo=0.30, sharphi=1.40,
                              exclude_border=True
                              )
# Group sources within 2 FWHM of each other for simultaneous fitting
grouper = SourceGrouper(2 * fwhm_pix)


oversample = 1
proposal_id = 1182
field = '004'
filtername = 'f444w'

blur_ = '_blur'
# Load the gridded PSF model and force non-negative fluxes
dao_psf_model = to_griddedpsfmodel(f'{basepath}/psfs/{filtername.upper()}_{proposal_id}_{field}_merged_PSFgrid_oversample{oversample}{blur_}.fits')
dao_psf_model.flux.min = 0

print("Initializing IterativePSFPhotometry")
phot_g_iter = IterativePSFPhotometry(finder=daofind_tuned,
                                     localbkg_estimator=LocalBackground(5, 15),
                                     grouper=grouper,
                                     psf_model=dao_psf_model,
                                     fitter=LevMarLSQFitter(),
                                     maxiters=5,
                                     fit_shape=(5, 5),
                                     aperture_radius=2 * fwhm_pix,
                                     progress_bar=True,
                                     xy_bounds=2,
                                     mode='all',
                                     )

print("Beginning iterations")
result_g_iter = phot_g_iter(data, mask=mask)

This is running on a 400x400 image, and it runs out of memory on an 8 GB node. I'm running it under memory_profiler on 16 GB, 32 GB, and 64 GB nodes to see what the peak usage is and whether it can complete.

I think the peak memory usage is happening somewhere in the n-th iteration.
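(As an aside, here's a rough sketch of how the peak could be captured for a single call with memory_profiler's memory_usage function instead of @profile decorators; phot_g_iter, data, and mask are the objects from the MWE above.)

from memory_profiler import memory_usage

# Sample the process memory every 0.5 s while the photometry call runs and
# report only the peak (in MiB); depending on the memory_profiler version
# this returns a float or a one-element list.
peak = memory_usage((phot_g_iter, (data,), {'mask': mask}),
                    interval=0.5, max_usage=True)
print(f'peak memory: {peak} MiB')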

@keflavich
Contributor Author

Here's the report for 8 GB, 16 GB, and 32 GB:

Initializing IterativePSFPhotometry
Beginning iterations
Fit source/group:  20%|██        | 402/2007 [00:08<00:37, 42.77it/s]/tmp/slurmd/job36932843/slurm_script: line 4: 152380 Killed                  /blue/adamginsburg/adamginsburg/miniconda3/envs/python310/bin/python -m memory_profiler /blue/adamginsburg/adamginsburg/jwst/brick/analysis/dao_iterative_memory_mwe.py
slurmstepd: error: Detected 1 oom-kill event(s) in StepId=36932843.batch. Some of your processes may have been killed by the cgroup out-of-memory handler.

Initializing IterativePSFPhotometry
Beginning iterations
Fit source/group:  42%|████▏     | 847/2007 [00:56<06:04,  3.18it/s]/tmp/slurmd/job36934208/slurm_script: line 4: 3603512 Killed                  /blue/adamginsburg/adamginsburg/miniconda3/envs/python310/bin/python -m memory_profiler /blue/adamginsburg/adamginsburg/jwst/brick/analysis/dao_iterative_memory_mwe.py
slurmstepd: error: Detected 1 oom-kill event(s) in StepId=36934208.batch. Some of your processes may have been killed by the cgroup out-of-memory handler.

Initializing IterativePSFPhotometry
Beginning iterations
Fit source/group:  90%|████████▉ | 1803/2007 [00:55<00:06, 33.04it/s]/tmp/slurmd/job36934209/slurm_script: line 4: 25232 Killed                  /blue/adamginsburg/adamginsburg/miniconda3/envs/python310/bin/python -m memory_profiler /blue/adamginsburg/adamginsburg/jwst/brick/analysis/dao_iterative_memory_mwe.py
slurmstepd: error: Detected 1 oom-kill event(s) in StepId=36934209.batch. Some of your processes may have been killed by the cgroup out-of-memory handler.

and the 64 GB run is still going:

Initializing IterativePSFPhotometry
Beginning iterations
Fit source/group: 100%|██████████| 2007/2007 [01:02<00:00, 32.20it/s]
Add model sources: 100%|██████████| 3205/3205 [00:04<00:00, 761.94it/s]

@keflavich
Contributor Author

The 64 GB run eventually failed:

Initializing IterativePSFPhotometry
Beginning iterations
Fit source/group: 100%|██████████| 2007/2007 [01:02<00:00, 32.20it/s]
Add model sources: 100%|██████████| 3205/3205 [00:04<00:00, 761.94it/s]
Fit source/group: 100%|██████████| 1262/1262 [04:09<00:00,  5.07it/s]
Add model sources: 100%|██████████| 6761/6761 [00:09<00:00, 743.57it/s]
Fit source/group:   2%|▏         | 15/783 [00:59<50:28,  3.94s/it]
Traceback (most recent call last):
  File "/blue/adamginsburg/adamginsburg/miniconda3/envs/python310/lib/python3.10/runpy.py", line 196, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/blue/adamginsburg/adamginsburg/miniconda3/envs/python310/lib/python3.10/runpy.py", line 86, in _run_code
    exec(code, run_globals)
  File "/blue/adamginsburg/adamginsburg/miniconda3/envs/python310/lib/python3.10/site-packages/memory_profiler.py", line 1351, in <module>
    exec_with_profiler(script_filename, prof, args.backend, script_args)
  File "/blue/adamginsburg/adamginsburg/miniconda3/envs/python310/lib/python3.10/site-packages/memory_profiler.py", line 1252, in exec_with_profiler
    exec(compile(f.read(), filename, 'exec'), ns, ns)
  File "/blue/adamginsburg/adamginsburg/jwst/brick/analysis/dao_iterative_memory_mwe.py", line 69, in <module>
    result_g_iter = phot_g_iter(data, mask=mask)
  File "/blue/adamginsburg/adamginsburg/repos/photutils/photutils/psf/photometry.py", line 1944, in __call__
    new_tbl = self._psfphot(residual_data, mask=mask, error=error,
  File "/blue/adamginsburg/adamginsburg/repos/photutils/photutils/psf/photometry.py", line 1389, in __call__
    fit_params = self._fit_sources(data, init_params, error=error,
  File "/blue/adamginsburg/adamginsburg/repos/photutils/photutils/psf/photometry.py", line 1048, in _fit_sources
    yi, xi, cutout = self._define_fit_data(sources_, data, mask)
  File "/blue/adamginsburg/adamginsburg/repos/photutils/photutils/psf/photometry.py", line 893, in _define_fit_data
    raise ValueError(msg)
ValueError: Source at (55.492494305009046, 10.849163838754908) is completely masked. Remove the source from init_params or correct the input mask.

@keflavich
Contributor Author

That last failure may indicate a problem with source-parameter validation in IterativePSFPhotometry.
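A possible stopgap (just a sketch, not what IterativePSFPhotometry does internally; the helper and the x_init/y_init column names are only illustrative) would be to drop sources whose entire fit_shape cutout is masked before fitting, as the error message suggests:

import numpy as np

def drop_fully_masked_sources(init_params, mask, fit_shape=(5, 5)):
    """Remove sources whose entire fit_shape cutout falls on masked pixels."""
    half_y, half_x = fit_shape[0] // 2, fit_shape[1] // 2
    keep = []
    for row in init_params:
        x = int(round(row['x_init']))
        y = int(round(row['y_init']))
        cutout = mask[max(y - half_y, 0):y + half_y + 1,
                      max(x - half_x, 0):x + half_x + 1]
        # Keep the source only if at least one unmasked pixel remains
        keep.append(cutout.size > 0 and not cutout.all())
    return init_params[np.array(keep)]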

@larrybradley
Member

The question is why a source landed on a completely masked region.

@keflavich
Contributor Author

One possible explanation for the OOM is the compound models:

from tqdm.auto import tqdm

# Repeatedly add the PSF model to a growing compound model, mimicking what
# source grouping does for a very large group
models = []
x = 1000000
for i in tqdm(range(x)):
    models.append(dao_psf_model.copy())
    if i == 0:
        psf_model = dao_psf_model
    else:
        psf_model += dao_psf_model

(example suggested by Larry on Slack)

gives me

  0%|          | 2303/1000000 [01:02<8:40:40, 31.94it/s]/tmp/slurmd/job36937239/slurm_script: line 4: 155531 Killed                  /blue/adamginsburg/adamginsburg/miniconda3/envs/python310/bin/python -m memory_profiler /blue/adamginsburg/adamginsburg/jwst/brick/analysis/dao_iterative_memory_mwe.py
slurmstepd: error: Detected 1 oom-kill event(s) in StepId=36937239.batch. Some of your processes may have been killed by the cgroup out-of-memory handler.

@keflavich
Contributor Author

Confirmed that grouping causes the problem; without the grouper, the 8 GB run completes, while the grouped run that follows is killed:

Initializing PSFPhotometry (no groups)
Beginning iterations
Fit source/group: 100%|██████████| 3205/3205 [00:20<00:00, 155.58it/s]
WARNING: One or more fit(s) may not have converged. Please check the "flags" column in the output table. [photutils.psf.photometry]
Initializing PSFPhotometry
Beginning iterations
Fit source/group:  20%|█▉        | 400/2007 [00:08<00:38, 41.62it/s]/tmp/slurmd/job36937400/slurm_script: line 4: 2472349 Killed                  /blue/adamginsburg/adamginsburg/miniconda3/envs/python310/bin/python -m memory_profiler /blue/adamginsburg/adamginsburg/jwst/brick/analysis/dao_iterative_memory_mwe.py
slurmstepd: error: Detected 1 oom-kill event(s) in StepId=36937400.batch. Some of your processes may have been killed by the cgroup out-of-memory handler.

@larrybradley
Member

The issue is due to excessive memory used by compound Astropy models: astropy/astropy#16701

@SterlingYM

SterlingYM commented Jul 28, 2024

I was looking into this issue (it has also been a problem for me while running PSF photometry on a 250x250 cutout with 100-400 sources, often requiring more than 16 GB and sometimes 32 GB on a cluster) and noticed that the memory allocation (heap size) keeps increasing during a loop in which I repeatedly initialize and call IterativePSFPhotometry (as a wrapped function call).

Maybe I'm not understanding how Python memory management and garbage collection work, but I was expecting memory usage to rise and drop for each iteration.

Is this system-specific behavior, the upstream Astropy issue, or something about my object structure? I'm trying to find a temporary workaround for the memory issue while we wait for the upstream fix (see the cleanup sketch after the code below).

It's not an MWE, but the structure is something like this:

class MyClass:
    def __init__(self, data, error):
        self.data = data
        self.error = error

    def do_photometry(self, **kwargs):
        # . . .
        # initialize SourceGrouper, LocalBackground, and DAOStarFinder here...
        self.do_some_stuff()
        self.do_more_stuff()
        # . . .

        # newly initialize an IterativePSFPhotometry object
        psf_iter = IterativePSFPhotometry(**some_kwargs)
        phot_result = psf_iter(self.data, error=self.error)

        # don't the models and other large memory-eating objects get deleted upon return?
        return phot_result

if __name__ == "__main__":
    my_class = MyClass(data, error)

    # iterating over some kwargs to test photometry with different settings
    for kwargs in list_of_kwargs[:10]:
        my_class.do_photometry(**kwargs)
        # memory heap size keeps increasing during the whole loop
        # resident size drops once or twice, but generally keeps increasing too
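One temporary measure I'm trying (just a sketch; I'm not sure it helps if the compound models themselves are the culprit, but it at least rules out collection latency) is to drop the result and force a garbage-collection pass between configurations:

import gc

for kwargs in list_of_kwargs[:10]:
    phot_result = my_class.do_photometry(**kwargs)
    # ... use or save phot_result here ...

    # Drop the reference and force a collection so the IterativePSFPhotometry
    # object and its (possibly compound) models can be freed before the next
    # configuration runs.
    del phot_result
    gc.collect()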

@larrybradley changed the title from "Excessive memory usage in IterativePSFPhotometry" to "Excessive memory usage when using source grouping" on Aug 5, 2024
@larrybradley
Member

@SterlingYM The excessive memory issue is triggered when using source grouping. Source grouping creates a compound Astropy model: every source in the group contributes a term to the compound model so that the group can be fit simultaneously. If the number of sources in a group gets large, the compound Astropy model requires a huge amount of memory. When astropy/astropy#16701 is fixed, this will no longer be a problem.

In the meantime, if you are running into this issue, you can try limiting your group sizes by using a smaller minimum separation.
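For example (a sketch only; 1.5 FWHM is an arbitrary illustration and the right value depends on how crowded the field is), shrinking the min_separation given to SourceGrouper keeps the groups, and therefore the compound models built from them, smaller:

from photutils.psf import SourceGrouper

fwhm_pix = 2.302  # FWHM in pixels, from the MWE above

# Sources closer than min_separation are placed in the same group, so a
# smaller value yields smaller groups and smaller compound models.
grouper = SourceGrouper(min_separation=1.5 * fwhm_pix)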

@larrybradley
Member

I've changed the title of the issue to indicate that this is specifically triggered by source grouping.

@SterlingYM

Adjusting group size helped with memory usage. Thank you for the suggestion!
