Can we do multiprocessing or distributed computing through PyFlink? #21

abhalawat · 2022-02-22T10:06:40Z

I am trying to get about 14 million records and want this process to run faster. Is there any way PyFlink could help?

dianfu · 2022-02-22T11:40:37Z

@abhalawat PyFlink processes the data in a distributed way. This is a capability of the Flink runtime.

abhalawat · 2022-02-22T20:12:22Z

How will I achieve that? Will PyFlink help me do that?

dianfu · 2022-02-23T05:58:43Z

Yes, when you submit a PyFlink job to a remote cluster, e.g. YARN, K8s, etc., Flink will execute it in a distributed manner (you need to set the parallelism to a value greater than 1). See https://nightlies.apache.org/flink/flink-docs-release-1.14/docs/deployment/cli/#submitting-pyflink-jobs for more details on how to submit PyFlink jobs to a remote cluster.
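As a rough illustration of the parallelism setting mentioned above, here is a minimal sketch for the Table API on Flink 1.14. The helper name `configure_parallelism` and the value 4 are assumptions made for this example, not something from the thread. The right value depends on how many task slots your cluster has.

```python
def configure_parallelism(t_env, parallelism=4):
    """Set the default parallelism for jobs built from this TableEnvironment.

    `configure_parallelism` is a hypothetical helper; the value 4 is only
    an example and should not exceed the task slots available on the cluster.
    """
    t_env.get_config().get_configuration().set_string(
        "parallelism.default", str(parallelism)
    )
    return t_env


def main():
    # Imported lazily so the helper above can be read and tested
    # on a machine without PyFlink installed.
    from pyflink.table import EnvironmentSettings, TableEnvironment

    t_env = TableEnvironment.create(EnvironmentSettings.in_streaming_mode())
    configure_parallelism(t_env, 4)
    # ... define sources and sinks, then run the job with t_env ...
```

Call `main()` from the script you submit to the cluster. With parallelism left at 1, the job still runs on the cluster but each operator runs as a single task.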

abhalawat · 2022-02-23T06:04:02Z

I followed this doc and created a task manager and a job manager. But while running the first example, i.e. 1_word_count.py,

I am facing this error:

dianfu · 2022-02-24T03:56:05Z

Could you post the full exception stack?

abhalawat · 2022-02-24T05:47:32Z

Sure here it is:

dianfu · 2022-02-24T06:00:08Z

@abhalawat From the exception stack, the job failed to create the output directory. This is most likely a permission issue. You could switch the sink to another connector, such as [print](https://nightlies.apache.org/flink/flink-docs-release-1.14/docs/connectors/table/print/), to work around this problem.
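A minimal sketch of that workaround, assuming the sink table is named `mySink` with the `word` and `count` columns used in the word count example. The names `PRINT_SINK_DDL` and `create_print_sink` are made up for this illustration. Only the sink definition changes; the source table can keep whatever connector it already uses.

```python
# Sink definition using the print connector instead of filesystem.
# The print connector needs no path, so no output directory is created.
PRINT_SINK_DDL = """
    CREATE TABLE mySink (
        word STRING,
        `count` BIGINT
    ) WITH (
        'connector' = 'print'
    )
"""


def create_print_sink(t_env):
    """Register the print sink on an existing TableEnvironment.

    `create_print_sink` is a hypothetical helper for this example.
    """
    return t_env.execute_sql(PRINT_SINK_DDL)
```

When the job runs on a cluster, the printed rows go to the standard output of the task managers, not to the client console.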

abhalawat · 2022-02-24T06:05:08Z

Should I change this command:

t_env.execute_sql("""
         CREATE TABLE mySink (
           word STRING,
           `count` BIGINT
         ) WITH (
           'connector' = 'filesystem',
           'format' = 'csv',
           'path' = '/opt/examples/table/output/word_count_output'
         )
     """)

abhalawat · 2022-02-24T06:12:53Z

I created a new folder for the output, named result, but it still showed an error:

abhalawat · 2022-02-24T09:33:36Z

Also, the connector here has to be filesystem because I am attaching a file in the SQL query.
