Allow flexible column names #27

Merged
merged 1 commit into from
Nov 29, 2023

Conversation

Collaborator

@psainics psainics commented Oct 27, 2023

Allow flexible column names (Japanese Characters)

Jira : PLUGIN-1718

Let the user enter non-English characters as column names.

A column name can contain the letters (a-z, A-Z), numbers (0-9), or underscores (_), and it must start with a letter or underscore. For more flexible column name support, see flexible column names.

Code change

  • Moved some classes from the com.google.cloud.hadoop.io.bigquery package of the bigquery-connector library, because that dependency was not being maintained and used an old regex. All moved classes are under the lib package.
  • Updated the docs for BQ and BQMT
  • Updated the column name regex
  • Updated the table regex as per the requirements of flexible column names
  • Updated when the JSON writer is used
    • If the field name contains non-English characters, the JSON format is used, because the Avro load job in BQ does not support non-English characters in field names for now
  • Added new unit test cases

Unit Test

  • testValidateColumnNameWithSpecialCharacter
  • testValidateColumnNameWithNumbers
  • testValidateColumnNameWithCapitalLetters
  • testValidateColumnNameWithDash
  • testValidateColumnNameWithUnderscore
  • testValidateColumnNameWithEmoji
  • testValidateColumnNameWithSpace
  • testValidateColumnNameWithJapaneseColumnName
  • testValidateColumnNameWithInvalidColumnName
  • testValidateColumnNameWithChineseColumnName
  • testValidateColumnNameWithValidColumnName
  • testValidateColumnNameWith300Length
  • testValidateColumnNameWith301Length
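
The test names above suggest a validator that accepts non-English letters and rejects names longer than 300 characters. A minimal sketch of that idea follows. The class name, the helper names, and the character set are hypothetical and are not the regex merged in this PR; only the 300-character limit is taken from the test names.

```java
import java.util.Arrays;
import java.util.regex.Pattern;

final class FlexibleColumnNameSketch {
    // Assumed limit, inferred from the 300/301 length test names.
    static final int MAX_LENGTH = 300;

    // Hypothetical character set: letters, marks and numbers in any script,
    // connector punctuation (such as _), dashes and plain spaces.
    private static final Pattern ALLOWED =
        Pattern.compile("[\\p{L}\\p{M}\\p{N}\\p{Pc}\\p{Pd} ]+");

    static boolean isValid(String name) {
        if (name == null || name.isEmpty()) {
            return false;
        }
        // Count code points so a surrogate pair counts as one character.
        if (name.codePointCount(0, name.length()) > MAX_LENGTH) {
            return false;
        }
        return ALLOWED.matcher(name).matches();
    }

    // Builds a name of n ASCII letters for the length checks.
    static String repeatA(int n) {
        char[] chars = new char[n];
        Arrays.fill(chars, 'a');
        return new String(chars);
    }

    public static void main(String[] args) {
        // "\u5217\u540d" is a two-character CJK column name.
        System.out.println(isValid("\u5217\u540d"));
        System.out.println(isValid("first_name"));
        System.out.println(isValid(repeatA(300)));
        System.out.println(isValid(repeatA(301)));
    }
}
```

Keeping the length check separate from the character-class check makes each rule easy to test on its own, which mirrors how the test cases above are split.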

@psainics psainics force-pushed the patch-flexible-column-names branch from 9c5fd2b to e26c028 Compare October 27, 2023 08:27
@vikasrathee-cs vikasrathee-cs self-requested a review November 10, 2023 07:14
@vikasrathee-cs
Collaborator

Resolve conflicts @psainics

@@ -25,7 +25,7 @@
import com.google.cloud.bigquery.JobStatistics;
import com.google.cloud.bigquery.Table;
import com.google.cloud.bigquery.TimePartitioning;
import com.google.cloud.hadoop.io.bigquery.output.BigQueryTableFieldSchema;
//import com.google.cloud.hadoop.io.bigquery.output.BigQueryTableFieldSchema;
Collaborator

remove this line

Collaborator Author

Done !

@@ -344,9 +344,12 @@ public void validate(@Nullable Schema inputSchema, @Nullable Schema outputSchema
String name = field.getName();
// BigQuery column names only allow alphanumeric characters and _
Collaborator

@vikasrathee-cs vikasrathee-cs Nov 10, 2023

remove old doc reference and this old comment

Collaborator Author

This is not old. As per the docs, column names cannot contain special characters; special characters are part of flexible column names, which is still in preview.
I suggest keeping both doc references until the BQ docs merge the two concepts.

Collaborator

In that case, add one comment on top noting that these comments are kept because the feature is in preview, and that they will be removed after GA.

// If the field name is not in english characters, then we will use json format
// We do this as the avro load job in BQ does not support non-english characters in field names for now
String fieldName = field.getName();
if (!Pattern.matches("[\\w]+", fieldName)) {
Collaborator

create a variable for this regex

Collaborator Author

String wordRegex = "[\\w]+";

Collaborator

create a static final variable with name something like COLUMN_NAME_REGEX
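
A minimal sketch of what this suggestion could look like, assuming the check stays a whole-string match against the same pattern. The class and method names are hypothetical, and COLUMN_NAME_REGEX follows the name proposed in the comment above; this is not the code that was merged.

```java
import java.util.regex.Pattern;

final class ColumnNameCheck {
    // Compiled once and shared, instead of recompiling the regex per field.
    private static final Pattern COLUMN_NAME_REGEX = Pattern.compile("[\\w]+");

    // Returns true when the whole field name does not match the pattern,
    // which is the case where the JSON format would be chosen over Avro.
    static boolean requiresJsonFormat(String fieldName) {
        return !COLUMN_NAME_REGEX.matcher(fieldName).matches();
    }

    public static void main(String[] args) {
        System.out.println(requiresJsonFormat("user_id"));
        // "\u540d\u524d" is a two-character Japanese field name.
        System.out.println(requiresJsonFormat("\u540d\u524d"));
    }
}
```

A precompiled Pattern behaves the same as Pattern.matches(regex, input) for a full-string match, so the change would be a refactor rather than a behavior change.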

@psainics psainics force-pushed the patch-flexible-column-names branch from 1b927fc to ba0e4ac Compare November 14, 2023 08:56
// If the field name is not in english characters, then we will use json format
// We do this as the avro load job in BQ does not support non-english characters in field names for now
String fieldName = field.getName();
String wordRegex = "[\\w]+";
Collaborator

Does this cover all the special characters, like underscore or digits, that were previously supported by BQ? If not, use the same regex that was given in the bigquery-connector library.

Collaborator Author

This does not cover everything; it only checks for [a-z A-Z 0-9].
We are negating the match here, so as long as the name has no character outside [a-z A-Z 0-9] we can use the Avro format; otherwise we use the JSON format.

Collaborator

In the case of an underscore, it will also write JSON, which is not required.
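
For reference when reading this thread: in java.util.regex the predefined class \w is [a-zA-Z_0-9] unless the UNICODE_CHARACTER_CLASS flag is set, so underscores and digits do match the pattern under discussion. The sketch below only demonstrates that behavior; the class and method names are hypothetical.

```java
import java.util.regex.Pattern;

final class WordClassDemo {
    // Same pattern string as in the diff above, matched against the whole input.
    static boolean isWordOnly(String s) {
        return Pattern.matches("[\\w]+", s);
    }

    public static void main(String[] args) {
        System.out.println(isWordOnly("abc123"));       // letters and digits
        System.out.println(isWordOnly("first_name"));   // underscore
        System.out.println(isWordOnly("first-name"));   // dash
        System.out.println(isWordOnly("first name"));   // space
        System.out.println(isWordOnly("\u5217\u540d")); // CJK letters
    }
}
```

Under the default flags, the first two names match and the last three do not, so only names with a dash, a space, or non-ASCII letters would fall through to the JSON path in this sketch.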

@psainics psainics force-pushed the patch-flexible-column-names branch 4 times, most recently from 3edc1da to 88bb9ca Compare November 16, 2023 08:50
@psainics psainics self-assigned this Nov 17, 2023
@psainics psainics force-pushed the patch-flexible-column-names branch 4 times, most recently from a5bdff1 to b24053f Compare November 21, 2023 10:09
@psainics psainics force-pushed the patch-flexible-column-names branch 2 times, most recently from 514a93e to b6d212c Compare November 28, 2023 20:27
@psainics psainics force-pushed the patch-flexible-column-names branch from b6d212c to cb76da5 Compare November 29, 2023 05:13
@priyabhatnagar25 priyabhatnagar25 merged commit 34e3956 into develop Nov 29, 2023
2 checks passed
3 participants