Multiprocessing Issues on SLURM Cluster Servers #187

Open
aaron-jencks opened this issue Nov 8, 2024 · 0 comments

Describe the bug
I'm trying to convert a dataset from English into IPA using Hugging Face's datasets package, but when I ask it to use more than 4 processes, it crashes with:

Traceback (most recent call last):
  File "/users/PAS2836/ajencks/workspace/epitran-transcription/./main.py", line 33, in <module>
    tokenized = dataset.map(
  File "/users/PAS2836/ajencks/.conda/envs/epitran/lib/python3.10/site-packages/datasets/dataset_dict.py", line 886, in map
    {
  File "/users/PAS2836/ajencks/.conda/envs/epitran/lib/python3.10/site-packages/datasets/dataset_dict.py", line 887, in <dictcomp>
    k: dataset.map(
  File "/users/PAS2836/ajencks/.conda/envs/epitran/lib/python3.10/site-packages/datasets/arrow_dataset.py", line 560, in wrapper
    out: Union["Dataset", "DatasetDict"] = func(self, *args, **kwargs)
  File "/users/PAS2836/ajencks/.conda/envs/epitran/lib/python3.10/site-packages/datasets/arrow_dataset.py", line 3147, in map
    for rank, done, content in iflatmap_unordered(
  File "/users/PAS2836/ajencks/.conda/envs/epitran/lib/python3.10/site-packages/datasets/utils/py_utils.py", line 711, in iflatmap_unordered
    raise RuntimeError(
RuntimeError: One of the subprocesses has abruptly died during map operation.To debug the error, disable multiprocessing.

I've tried other packages and haven't had this problem. The code looks something like this:

import logging
import os

from datasets import load_dataset
from phonemizer.backend import EspeakBackend


if __name__ == '__main__':
    logger = logging.getLogger()
    logger.addHandler(logging.NullHandler())

    phone = EspeakBackend('en-us', preserve_punctuation=True, with_stress=True, logger=logger)

    dataset = load_dataset('openwebtext', num_proc=os.cpu_count(), trust_remote_code=True)

    def process(example):
        ipa = phone.phonemize(
                [example['text']]
        )
        out = {'ipa': ipa}
        return out

    tokenized = dataset.map(
        process,
        desc="converting to ipa",
        num_proc=16,
    )

    tokenized.save_to_disk('openwebtextipa.hf')

    print('done!')
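
One variant that might help narrow this down: in the script above, `phone` is created once in the parent process and captured by `process`, so every worker gets a copy of the same backend object. The sketch below builds one backend per worker on first use instead. `get_backend` and `make_espeak` are hypothetical helper names introduced here for illustration, and whether this changes the crash is untested; it only removes the shared backend as a variable.

```python
import logging
import os

# Per-process cache: each worker builds its own backend the first time it is needed.
_BACKEND = None
_BACKEND_PID = None


def get_backend(factory):
    """Return a backend owned by the current process, creating it on first use."""
    global _BACKEND, _BACKEND_PID
    # Rebuild if nothing is cached or if the cache was inherited from another process.
    if _BACKEND is None or _BACKEND_PID != os.getpid():
        _BACKEND = factory()
        _BACKEND_PID = os.getpid()
    return _BACKEND


def make_espeak():
    """Build the backend with the same arguments as the original script."""
    # Imported here so the parent process never has to create a backend itself.
    from phonemizer.backend import EspeakBackend

    logger = logging.getLogger()
    logger.addHandler(logging.NullHandler())
    return EspeakBackend('en-us', preserve_punctuation=True, with_stress=True, logger=logger)


def process(example):
    # Same transformation as before, but using the worker-local backend.
    ipa = get_backend(make_espeak).phonemize([example['text']])
    return {'ipa': ipa}
```

`process` can then be passed to `dataset.map(...)` exactly as before, without creating `phone` in the `__main__` block.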

Phonemizer version
The output of phonemize --version from the command line, very helpful!

phonemizer-3.3.0
available backends: espeak-ng-1.50, segments-2.2.1
uninstalled backends: espeak-mbrola, festival

System
Your OS (Linux distribution, Windows, ...) and, optionally, your Python version.

Linux ...hpc.osc.edu 5.14.0-284.88.1.el9_2.x86_64 #1 SMP PREEMPT_DYNAMIC Fri Oct 4 11:09:23 EDT 2024 x86_64 x86_64 x86_64 GNU/Linux
Python 3.10.15 (main, Oct  3 2024, 07:27:34) [GCC 11.2.0] on linux

To reproduce
A short example (Python script or command) reproducing the bug.

See the script supplied above. It requires the packages datasets and phonemizer.

Expected behavior
A clear and concise description of what you expected to happen.

The program runs, converts the desired dataset into IPA, and saves it to disk, using parallelization to speed up transcription.

Additional context
Add any other context about the problem here.

The error occurs only in the SLURM job environment. When I run the code on my own machine, it works fine.
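
Since the failure shows up only under SLURM, one thing worth checking is how many CPUs the job is actually allowed to use. `os.cpu_count()` reports the CPUs on the whole node, which can be larger than the job's allocation. The sketch below is a diagnostic only, not a confirmed cause: `cpu_report` and `safe_num_proc` are hypothetical helper names, and `SLURM_CPUS_PER_TASK` is only set when the job requests `--cpus-per-task`.

```python
import os


def cpu_report():
    """Compare the node's CPU count with what this process may actually use."""
    total = os.cpu_count()
    # sched_getaffinity is Linux-only; it reflects CPU pinning applied to the job.
    if hasattr(os, 'sched_getaffinity'):
        usable = len(os.sched_getaffinity(0))
    else:
        usable = total
    slurm = os.environ.get('SLURM_CPUS_PER_TASK')
    return {
        'node_cpus': total,
        'usable_cpus': usable,
        'slurm_cpus_per_task': int(slurm) if slurm else None,
    }


def safe_num_proc(requested):
    """Cap the requested worker count at the CPUs available to this job."""
    report = cpu_report()
    limit = report['slurm_cpus_per_task'] or report['usable_cpus'] or 1
    return max(1, min(requested, limit))


if __name__ == '__main__':
    print(cpu_report())
    print('num_proc to try:', safe_num_proc(16))
```

Running this inside the same job script shows whether `num_proc=16` exceeds the allocation. A worker that dies abruptly can also be a sign of the job hitting its memory limit, which this check does not cover.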
