-
Notifications
You must be signed in to change notification settings - Fork 52
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Replication job to trigger setup and carbon flow for replica tables [WIP] #276
base: main
Are you sure you want to change the base?
Conversation
727def4
to
4ceabdd
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks @rohitkum2506. Did initial pass and added some comments.
apps/spark/src/main/java/com/linkedin/openhouse/jobs/scheduler/tasks/TableReplicationTask.java
Outdated
Show resolved
Hide resolved
apps/spark/src/main/java/com/linkedin/openhouse/jobs/util/ReplicationConfig.java
Outdated
Show resolved
Hide resolved
} | ||
List<ReplicationConfig> replicationConfigList = new ArrayList<>(); | ||
Replication replication = response.getPolicies().getReplication(); | ||
List<com.linkedin.openhouse.tables.client.model.ReplicationConfig> replicationConfig = |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
So we are using two classes here with different namespace. Is the class com.linkedin.openhouse.jobs.util.ReplicationConfig
internal here and returned as part of method response?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, com.linkedin.openhouse.jobs.util.ReplicationConfig
is meant as translation layer between jobs model object and maintenance jobs
apps/spark/src/main/java/com/linkedin/openhouse/jobs/scheduler/tasks/TableReplicationTask.java
Outdated
Show resolved
Hide resolved
apps/spark/src/main/java/com/linkedin/openhouse/jobs/scheduler/tasks/TableReplicationTask.java
Outdated
Show resolved
Hide resolved
.filter(m -> m.isPrimary() && (m.getReplicationConfig() != null)) | ||
.collect(Collectors.toList()); | ||
log.info( | ||
"Fetched metadata for {} tables for replication setup task", |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nit: Better to print the table names as comma separated list.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
But the list can be more. Is there a way to track for which tables replication is being setup for every scheduled run?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
List of tables could be large. Each task will run for a table and log should capture this detail task.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sounds good.
@@ -86,6 +93,31 @@ private Optional<RetentionConfig> getTableRetention(GetTableResponseBody respons | |||
.build()); | |||
} | |||
|
|||
private Optional<List<ReplicationConfig>> getTableReplication(GetTableResponseBody response) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
How do we plan to filter tables as replication setup is one time activity? Would that be part of li repo?
- Identify the tables for which replication setup is needed like recent tables.
- Identify the tables for which replication config is updated.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Or we can consider last updated time and include the change in this PR?
Summary
New job workflow to run Replication setup process on Airflow. Applies to primary tables with defined ReplicationConfig.
Design decisions:
Instead it will use Airflow client to trigger and manage lifecycle of a Airflow job.
Future work:
Changes
For all the boxes checked, please include additional details of the changes made in this pull request.
Testing Done
Tested on Local docker setup:
Ran the new Replication Job with Local Docker setup. Added a table with replicationConfig.
Observed:
For all the boxes checked, include a detailed description of the testing done for the changes made in this pull request.
Additional Information
For all the boxes checked, include additional details of the changes made in this pull request.