Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[BUG] Stage macro fails on ghost record creation in Databricks for binary datatype columns #299

Closed
kafkaguru opened this issue Jan 8, 2025 · 5 comments · Fixed by #301
Closed
Labels
bug Something isn't working

Comments

@kafkaguru
Copy link

Describe the bug

When creating a stage view and setting enable_ghost_records=true, the resulting sql is invalid and fails to run the stage macro.

Environment

  • dbt version: cloud
  • datavault4dbt version: 1.8.1
  • Database/Platform: Databricks

To Reproduce

  1. Create a table with two columns: id: string, data: binary
  2. Insert a row into the table
  3. Create source.yml file and add the table to the file
  4. Create a stg_data.sql file with the following code:
{%- set yaml_metadata -%}
source_model:
    "name": "table"
hashed_columns: 
    HK_ID_H:
        - id
    HD_TABLE:
        is_hashdiff: true
        columns:
            - id
            - data

ldts: "CURRENT_TIMESTAMP()"
rsrc: '!name.table'
{%- endset -%}

{%- set metadata_dict = fromyaml(yaml_metadata) -%}

{{ datavault4dbt.stage(source_model=metadata_dict['source_model'],
                    ldts=metadata_dict['ldts'],
                    rsrc=metadata_dict['rsrc'],
                    hashed_columns=metadata_dict['hashed_columns'],
                    derived_columns=metadata_dict['derived_columns'],
                    missing_columns=none,
                    prejoined_columns=none,
                    enable_ghost_records=true,
                    include_source_columns=true) }}
  1. Run the dbt model (stage macro)
  2. See the run fail with the error:
Database Error in model stg_data (models/staging/stg_data.sql)
  [WRONG_NUM_ARGS.WITHOUT_SUGGESTION] The `cast` requires 0 parameters but the actual number is 1. Please, refer to 'https://spark.apache.org/docs/latest/sql-ref-functions.html' for a fix. SQLSTATE: 42605;

When validating the resulting compiled sql, the sql is invalid due to:

unknown_values AS (
    SELECT
    '(unknown)' as id,
    CAST('') as data
)

Reading the documentation, it seems that we should be able to configure unknown values for a binary data type as follows:

vars:
  datavault4dbt.unknown_value__binary: '1111111111111111111111111111111111111111'
  datavault4dbt.unknown_value_alt__binary: '1111111111111111111111111111111111111111'
  datavault4dbt.error_value__binary: '2222222222222222222222222222222222222222'
  datavault4dbt.error_value_alt__binary: '2222222222222222222222222222222222222222'

(This was also tried with BINARY in all caps)

Excluding binary columns from the input model and setting enable_ghost_records=false both resolve the issue but is not ideal.

Expected behavior

Ghost records are created with unknown values for binary datatypes.

@kafkaguru kafkaguru added the bug Something isn't working label Jan 8, 2025
@tkiehn
Copy link
Collaborator

tkiehn commented Jan 9, 2025

Hi @kafkaguru and thanks for the clear description of the error you encountered!

I could reproduce and also fix it on my side.

If you set your packages.yml to

packages:
  - git: "https://github.com/ScalefreeCOM/datavault4dbt.git"
    revision: 'fix-databricks-ghost-records-binary-defaults'

you could check if it also works for you.

Best Regards
Theo

@kafkaguru
Copy link
Author

kafkaguru commented Jan 9, 2025

Thanks @tkiehn! Is there any chance you could make this change on top of the fix for the binary hash datatype (#295)? It seems if I apply this fix, I'll lose access to the other fix.

@tkiehn
Copy link
Collaborator

tkiehn commented Jan 10, 2025

Hi @kafkaguru,

True, since these fixes were on separate branches they were independent from each other.
The fix for #295 was merged to main in #300 and I have created a Pull Request for this issue here which will be merged soon.

Then you can install from the main branch and should be able to access both fixes.

Later in the day we plan to publish another release, after that you can fall back to the default installation method of specifying a version number.

Edit to add: Both fixes are also now available in the branch fix-databricks-ghost-records-binary-defaults as well!
So you should be able to just run dbt deps --upgrade

@tkiehn
Copy link
Collaborator

tkiehn commented Jan 10, 2025

FYI @kafkaguru: As this is merged into main now the branch fix-databricks-ghost-records-binary-defaults will be deleted.
Therefore you can either switch to main or use the new release v1.9.0 which is aimed for later today.

Thanks again for your contribution!

Best Regards
Theo

@kafkaguru
Copy link
Author

Thanks @tkiehn! Confirmed that the new version fixes both issues.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working
Projects
None yet
Development

Successfully merging a pull request may close this issue.

2 participants