Multiprocessing Issues on SLURM Cluster Servers #187

aaron-jencks · 2024-11-08T18:25:05Z

Describe the bug
I'm trying to convert a dataset from English into IPA using Hugging Face's datasets package, but when I ask it to use more than 4 processes, it crashes with:

Traceback (most recent call last):
  File "/users/PAS2836/ajencks/workspace/epitran-transcription/./main.py", line 33, in <module>
    tokenized = dataset.map(
  File "/users/PAS2836/ajencks/.conda/envs/epitran/lib/python3.10/site-packages/datasets/dataset_dict.py", line 886, in map
    {
  File "/users/PAS2836/ajencks/.conda/envs/epitran/lib/python3.10/site-packages/datasets/dataset_dict.py", line 887, in <dictcomp>
    k: dataset.map(
  File "/users/PAS2836/ajencks/.conda/envs/epitran/lib/python3.10/site-packages/datasets/arrow_dataset.py", line 560, in wrapper
    out: Union["Dataset", "DatasetDict"] = func(self, *args, **kwargs)
  File "/users/PAS2836/ajencks/.conda/envs/epitran/lib/python3.10/site-packages/datasets/arrow_dataset.py", line 3147, in map
    for rank, done, content in iflatmap_unordered(
  File "/users/PAS2836/ajencks/.conda/envs/epitran/lib/python3.10/site-packages/datasets/utils/py_utils.py", line 711, in iflatmap_unordered
    raise RuntimeError(
RuntimeError: One of the subprocesses has abruptly died during map operation.To debug the error, disable multiprocessing.

I've tried other packages and haven't had this problem. The code looks something like this:

import logging
import os

from datasets import load_dataset
from phonemizer.backend import EspeakBackend


if __name__ == '__main__':
    logger = logging.getLogger()
    logger.addHandler(logging.NullHandler())

    phone = EspeakBackend('en-us', preserve_punctuation=True, with_stress=True, logger=logger)

    dataset = load_dataset('openwebtext', num_proc=os.cpu_count(), trust_remote_code=True)

    def process(example):
        ipa = phone.phonemize(
                [example['text']]
        )
        out = {'ipa': ipa}
        return out

    tokenized = dataset.map(
        process,
        desc="converting to ipa",
        num_proc=16,
    )

    tokenized.save_to_disk('openwebtextipa.hf')

    print('done!')

Phonemizer version
The output of phonemize --version from the command line, very helpful!

phonemizer-3.3.0
available backends: espeak-ng-1.50, segments-2.2.1
uninstalled backends: espeak-mbrola, festival

System
Your OS (Linux distribution, Windows, ...) and, if possible, Python version.

Linux ...hpc.osc.edu 5.14.0-284.88.1.el9_2.x86_64 #1 SMP PREEMPT_DYNAMIC Fri Oct 4 11:09:23 EDT 2024 x86_64 x86_64 x86_64 GNU/Linux
Python 3.10.15 (main, Oct  3 2024, 07:27:34) [GCC 11.2.0] on linux

To reproduce
A short example (Python script or command) reproducing the bug.

See the script supplied above. It requires the datasets and phonemizer packages.

Expected behavior
A clear and concise description of what you expected to happen.

The program runs and converts the desired dataset into IPA, then saves it to disk, using parallelization to speed up transcription.

Additional context
Add any other context about the problem here.

The error only occurs in the SLURM job environment. When I run the code on my own machine, it works just fine.
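
One difference between the two environments that may be worth checking is how many CPUs the process is actually allowed to use. os.cpu_count() reports the logical CPUs of the whole node, not the job's allocation. The sketch below is a minimal check, assuming a Linux node and a job submitted with --cpus-per-task. The allocated_cpus helper is hypothetical and not part of datasets or phonemizer, and this is not confirmed to be the cause of the crash.

```python
import os


def allocated_cpus(default=1):
    """Best-effort count of the CPUs this process may use (hypothetical helper)."""
    # SLURM exports SLURM_CPUS_PER_TASK when --cpus-per-task was requested.
    value = os.environ.get("SLURM_CPUS_PER_TASK", "")
    if value.isdigit() and int(value) > 0:
        return int(value)
    # On Linux, the affinity set lists the CPUs this process is allowed to run on.
    if hasattr(os, "sched_getaffinity"):
        return len(os.sched_getaffinity(0))
    # Last resort: the node-wide count, which may exceed the job's allocation.
    return os.cpu_count() or default


if __name__ == "__main__":
    print("node CPUs:     ", os.cpu_count())
    print("allocated CPUs:", allocated_cpus())
```

If the two numbers differ inside the job, passing the smaller one as num_proc to load_dataset and dataset.map would be one thing to try.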
