Replication job to trigger setup and carbon flow for replica tables [WIP] #276

rohitkum2506 · 2025-01-09T18:38:52Z

Summary

Adds a new job workflow that runs the Replication setup process on Airflow. It applies to primary tables that have a ReplicationConfig defined.

Design decisions:

The Replication job does not use JobsClient to trigger a job.
Instead, it will use an Airflow client to trigger an Airflow job and manage its lifecycle.
The Replication task definition will live on the Li-Openhouse side, since it needs to leverage AirflowClient.
A table can have multiple ReplicationConfigs, so the ReplicationTask goes over each config sequentially and triggers a setup job for that config.
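
The sequential per-config trigger described above could look roughly like the following. This is a minimal sketch, not the PR's implementation: `ReplicationSetupSketch`, `ConfigSketch`, its `cluster` field, and the `trigger` function are hypothetical names, and the `trigger` function only stands in for the AirflowClient call that is listed as future work.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.BiFunction;

class ReplicationSetupSketch {

  /** Hypothetical stand-in for one replication config; the real class may carry other fields. */
  static final class ConfigSketch {
    final String cluster; // assumed field name, for illustration only

    ConfigSketch(String cluster) {
      this.cluster = cluster;
    }
  }

  /**
   * Triggers one setup job per config, in list order, and returns the job ids.
   * The trigger function is a placeholder for the future AirflowClient call.
   */
  static List<String> triggerSetupJobs(
      String tableName,
      List<ConfigSketch> configs,
      BiFunction<String, ConfigSketch, String> trigger) {
    List<String> jobIds = new ArrayList<>();
    for (ConfigSketch config : configs) {
      // Sequential on purpose: the next job is triggered only after this call returns.
      jobIds.add(trigger.apply(tableName, config));
    }
    return jobIds;
  }
}
```

The sketch leaves failure handling open (for example, whether a failed trigger should stop the remaining configs), since the description does not specify it.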

Future work:

Develop AirflowClient, which will allow triggering Airflow jobs and managing their state, and integrate it with the Replication job run.
Develop CarbonClient, which can trigger Carbon jobs to set up scheduled replication flows.
Integrate CarbonClient with the replica table setup job.

Changes

Testing Done

Manually Tested on local docker setup. Please include commands ran, and their output.
Added new tests for the changes made.
Updated existing tests to reflect the changes made.
No tests added or updated. Please explain why. If unsure, please feel free to ask for help.
Some other form of testing like staging or soak time in production. Please explain.

Tested on local Docker setup:

Ran the new Replication job on the local Docker setup after adding a table with a replicationConfig.
Observed:

The new job gets picked up by the JobScheduler and follows the task flow.
Only primary tables with a defined replicationConfig are considered; all others are filtered out.

For all the boxes checked, include a detailed description of the testing done for the changes made in this pull request.

Additional Information

Breaking Changes
Deprecations
Large PR broken into smaller PRs, and PR plan linked in the description.

For all the boxes checked, include additional details of the changes made in this pull request.

abhisheknath2011

Thanks @rohitkum2506. Did an initial pass and added some comments.

apps/spark/src/main/java/com/linkedin/openhouse/jobs/scheduler/tasks/TableReplicationTask.java

apps/spark/src/main/java/com/linkedin/openhouse/jobs/util/ReplicationConfig.java

abhisheknath2011 · 2025-01-09T22:19:58Z

apps/spark/src/main/java/com/linkedin/openhouse/jobs/client/TablesClient.java

+    }
+    List<ReplicationConfig> replicationConfigList = new ArrayList<>();
+    Replication replication = response.getPolicies().getReplication();
+    List<com.linkedin.openhouse.tables.client.model.ReplicationConfig> replicationConfig =


So we are using two classes here with different namespaces. Is the class com.linkedin.openhouse.jobs.util.ReplicationConfig internal here, and is it what the method returns?

Yes, com.linkedin.openhouse.jobs.util.ReplicationConfig is meant as a translation layer between the client model object and the maintenance jobs.
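
For illustration, a translation layer of the kind described in this reply might be sketched as below. All class and field names here (`ClientReplicationConfig`, `JobsReplicationConfig`, `destination`, `interval`) are hypothetical stand-ins rather than the real OpenHouse types, and returning an empty `Optional` for a missing or empty list is an assumption, not something the diff confirms.

```java
import java.util.List;
import java.util.Optional;
import java.util.stream.Collectors;

class ReplicationConfigTranslationSketch {

  /** Hypothetical stand-in for the generated tables-client model class. */
  static final class ClientReplicationConfig {
    final String destination; // assumed field
    final String interval; // assumed field

    ClientReplicationConfig(String destination, String interval) {
      this.destination = destination;
      this.interval = interval;
    }
  }

  /** Hypothetical stand-in for the jobs-side class used by maintenance jobs. */
  static final class JobsReplicationConfig {
    final String destination;
    final String interval;

    JobsReplicationConfig(String destination, String interval) {
      this.destination = destination;
      this.interval = interval;
    }
  }

  /** Copies each client-model config into the jobs-side type; empty when nothing is configured. */
  static Optional<List<JobsReplicationConfig>> toJobsConfigs(
      List<ClientReplicationConfig> clientConfigs) {
    if (clientConfigs == null || clientConfigs.isEmpty()) {
      return Optional.empty();
    }
    return Optional.of(
        clientConfigs.stream()
            .map(c -> new JobsReplicationConfig(c.destination, c.interval))
            .collect(Collectors.toList()));
  }
}
```

Keeping the two types separate means the job code depends only on the jobs-side class, so changes to the generated client model stay contained in this one mapping.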

apps/spark/src/main/java/com/linkedin/openhouse/jobs/scheduler/tasks/TableReplicationTask.java

abhisheknath2011 · 2025-01-09T23:54:56Z

apps/spark/src/main/java/com/linkedin/openhouse/jobs/scheduler/tasks/OperationTasksBuilder.java

+            .filter(m -> m.isPrimary() && (m.getReplicationConfig() != null))
+            .collect(Collectors.toList());
+    log.info(
+        "Fetched metadata for {} tables for replication setup task",


Nit: Better to print the table names as a comma-separated list.

But the list can be long. Is there a way to track which tables replication is being set up for in each scheduled run?

The list of tables could be large. Each task runs for a single table, and that task's log should capture this detail.

Sounds good.

abhisheknath2011 · 2025-01-10T00:51:28Z

apps/spark/src/main/java/com/linkedin/openhouse/jobs/client/TablesClient.java

@@ -86,6 +93,31 @@ private Optional<RetentionConfig> getTableRetention(GetTableResponseBody respons
            .build());
  }

+  private Optional<List<ReplicationConfig>> getTableReplication(GetTableResponseBody response) {


How do we plan to filter tables, given that replication setup is a one-time activity? Would that be part of the li repo?

Identify the tables for which replication setup is needed, such as recently created tables.

Identify the tables for which the replication config has been updated.

Or could we consider the last updated time and include that change in this PR?
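
As a rough illustration of the last-updated-time idea raised in this comment, a filter could look like the sketch below. It is not part of the PR: `TableMetadataSketch`, its fields, and the `lastRun` cutoff are hypothetical, and how a real scheduler would obtain or persist the last run time is left open.

```java
import java.time.Instant;
import java.util.List;
import java.util.stream.Collectors;

class ReplicationSetupFilterSketch {

  /** Hypothetical, simplified view of table metadata; not the real TableMetadata class. */
  static final class TableMetadataSketch {
    final String name;
    final boolean primary;
    final boolean hasReplicationConfig;
    final Instant lastUpdated; // assumed to be available per table

    TableMetadataSketch(
        String name, boolean primary, boolean hasReplicationConfig, Instant lastUpdated) {
      this.name = name;
      this.primary = primary;
      this.hasReplicationConfig = hasReplicationConfig;
      this.lastUpdated = lastUpdated;
    }
  }

  /** Keeps primary tables with a replication config that changed after the previous run. */
  static List<TableMetadataSketch> needsSetup(List<TableMetadataSketch> tables, Instant lastRun) {
    return tables.stream()
        .filter(t -> t.primary && t.hasReplicationConfig)
        .filter(t -> t.lastUpdated.isAfter(lastRun))
        .collect(Collectors.toList());
  }
}
```

A cutoff like this would cover both cases listed above (new tables and updated configs) only if the last-updated time changes whenever the replication config changes, which is an assumption worth verifying.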

Replication job to trigger setup and carbon flow for replica tables

4ceabdd

rohitkum2506 force-pushed the rohikuma/ReplicationJob branch from 727def4 to 4ceabdd on January 9, 2025 18:47

Removing sparkApp for Replication

2ad947f

rohitkum2506 marked this pull request as ready for review January 9, 2025 20:51

rohitkum2506 requested a review from abhisheknath2011 January 9, 2025 21:21

Adding tables client test

9e0bbdc

abhisheknath2011 reviewed Jan 9, 2025

apps/spark/src/main/java/com/linkedin/openhouse/jobs/scheduler/tasks/TableReplicationTask.java (outdated)

abhisheknath2011 reviewed Jan 9, 2025

apps/spark/src/main/java/com/linkedin/openhouse/jobs/scheduler/tasks/TableReplicationTask.java (outdated)

rohitkum2506 added 2 commits January 9, 2025 14:50

Removing Replication task as it can be added on Li side

36b1f1f

Addressing review comments

0164574

abhisheknath2011 reviewed Jan 9, 2025

abhisheknath2011 approved these changes Jan 10, 2025

abhisheknath2011 reviewed Jan 10, 2025
