[Bug] adapter deleted most of my bucket. #776

Open
2 tasks done
nbenezri opened this issue Jan 2, 2025 · 6 comments
Labels
pkg:dbt-athena Issue affects dbt-athena type:documentation Improvements or additions to documentation

Comments

@nbenezri

nbenezri commented Jan 2, 2025

Is this a new bug in dbt-athena?

  • I believe this is a new bug in dbt-athena
  • I have searched the existing issues, and I could not find an existing issue for this bug

Current Behavior

I am doing a POC with Athena in which I load data from a Parquet file into Iceberg.
The Parquet table was created outside of dbt with CREATE EXTERNAL TABLE syntax on bucket-a.
In the dbt configuration I set the Iceberg destination to bucket-b, and I don't specify bucket-a anywhere.
After a few test runs (my first time using this), I noticed that most of bucket-a had been deleted. Tracing it back with AWS CloudTrail and Datadog, I found that it was dbt-athena that deleted those files with the DeleteObjects API call.
Since this happened during the initial creation of the repo/POC, I am not sure exactly which configuration led to it, nor do I want to test again, as I am not sure how it happened. Luckily the bucket had versioning. Any idea what in this adapter may cause this?

Expected Behavior

dbt should not touch buckets outside the scope of its models.

Steps To Reproduce

I don't have a way to reproduce.

What I can say is that the model was:
models/project/staging/stg_raw_tab.sql:

{{ config(
    materialized='table',
    table_type='iceberg',
    format='parquet'
) }}
WITH
tab AS (
SELECT id, name 
FROM db_dev.parquet_tab
LIMIT 1
)

SELECT *
FROM tab

and table ddl is:

CREATE external TABLE db_dev.parquet_tab(
id bigint,
name varchar(100)
)
PARTITIONED BY ( 
  `date` string
)
ROW FORMAT SERDE "org.apache.hive.hcatalog.data.JsonSerDe"
WITH SERDEPROPERTIES (
    'ignore.malformed.json' = 'true'
)
LOCATION 's3://<bucket-a>/<path>/'
TBLPROPERTIES (
    'projection.enabled' = 'true',
    'json.max.read.errors' = '100',
    'compression.type' = 'GZIP',
    'projection.enabled' = 'true',
    'projection.date.type' = 'date',
    'projection.date.interval' = '1',
    'projection.date.format' = 'yyyy/MM/dd',
    'timestamp.formats' = "yyyyMMdd'T'HH:mm:ss",
    'projection.date.unit' = 'DAYS',
    'projection.date.range' = '2021/01/01,NOW',
    'storage.location.template' = 's3://<bucket-a>/<path>/${date}/'
);

latest profile

iceberg_dbt:
  target: athena_dev
  outputs:
    athena_dev:
      type: athena
      s3_staging_dir: s3://<bucket-c>/dbt/
      s3_data_dir: s3://<bucket-b>/
      s3_data_naming: table_unique
      s3_tmp_table_dir: s3://<bucket-b>/temp/
      region_name: us-east-1
      schema: db_dev
      database: awsdatacatalog
      threads: 4
      aws_profile_name: default
      work_group: iceberg

Relevant log output

No response

Environment

nir % python3 --version
Python 3.12.8
nir % dbt --version
Core:
  - installed: 1.9.1
  - latest:    1.9.1 - Up to date!

Plugins:
  - glue:     1.9.0 - Up to date!
  - redshift: 1.9.0 - Up to date!
  - postgres: 1.9.0 - Up to date!
  - spark:    1.9.0 - Up to date!
  - athena:   1.8.4 - Update available!

  At least one plugin is out of date with dbt-core.
  You can find instructions for upgrading here:
  https://docs.getdbt.com/docs/installation


nir % sw_vers
ProductName:            macOS
ProductVersion:         15.1.1
BuildVersion:           24B91


Additional Context

No response
@nbenezri nbenezri added type:bug Something isn't working triage:dbt labels Jan 2, 2025
@nbenezri
Author

nbenezri commented Jan 5, 2025

I think I found it. In an earlier dev version, models/project/staging/stg_raw_tab.sql was named parquet_tab, so the model and the external table both existed under the same name in the same Glue database. dbt probably dropped the table and issued a delete-objects S3 API call.

There should be some precautions around it in my opinion:

  1. Something like if the table it is about to drop is not iceberg - don't do delete-objects.
  2. dbt-athena should use IAM role instead of leaning on aws profile.

@amychen1776
Contributor

@nbenezri just to clarify: did dbt delete files from bucket-a that were also related to stg_raw_tab somehow? How is bucket-a related to bucket-b?

@nbenezri
Author

nbenezri commented Jan 7, 2025

There is no relation between the buckets.

stg_raw_tab was first created in Athena as an external table on bucket-a. In dbt-athena I then mistakenly gave the model file the same name, stg_raw_tab (configured in dbt as Iceberg on bucket-b). Behind the scenes, the adapter then dropped the table and issued delete-objects against bucket-a, although it was not an Iceberg table (DROP TABLE would have been enough).
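
A collision like the one described here could, in principle, be caught before a run by comparing dbt model names with the tables already registered in the Glue database. The following is only a minimal sketch under stated assumptions, not part of dbt-athena: `find_name_collisions` is a hypothetical helper, the input dicts are assumed to be shaped like the `TableList` entries of a Glue GetTables response, Iceberg tables are assumed to carry `table_type=ICEBERG` in their Glue parameters, and the bucket names are placeholders.

```python
from typing import Iterable


def find_name_collisions(model_names: Iterable[str], glue_tables: Iterable[dict]) -> list[dict]:
    """Return existing Glue tables whose name matches a dbt model name.

    Hypothetical helper: `glue_tables` is assumed to hold dicts shaped like
    the `TableList` entries of a Glue GetTables response.
    """
    # Compare case-insensitively so "Parquet_Tab" and "parquet_tab" match.
    wanted = {name.lower() for name in model_names}
    collisions = []
    for table in glue_tables:
        if table["Name"].lower() not in wanted:
            continue
        params = table.get("Parameters", {})
        collisions.append({
            "name": table["Name"],
            "location": table.get("StorageDescriptor", {}).get("Location"),
            # Assumption: Iceberg tables are marked with table_type=ICEBERG.
            "is_iceberg": params.get("table_type", "").upper() == "ICEBERG",
        })
    return collisions


# Placeholder data standing in for a real Glue listing.
existing = [
    {"Name": "parquet_tab", "Parameters": {"EXTERNAL": "TRUE"},
     "StorageDescriptor": {"Location": "s3://bucket-a/path/"}},
    {"Name": "other_tab", "Parameters": {"table_type": "ICEBERG"},
     "StorageDescriptor": {"Location": "s3://bucket-b/other_tab/"}},
]
collisions = find_name_collisions(["parquet_tab", "stg_raw_tab"], existing)
for hit in collisions:
    print(f"model name clashes with existing table {hit['name']} at {hit['location']}")
```

A non-empty result for a table that dbt did not create is a signal to rename the model or move it to a different schema before running.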

@nicor88
Contributor

nicor88 commented Jan 7, 2025

There are a few reasons why this could happen, and you are not the first to run into it; I have had many discussions about this with a few users.
If a dbt model has the same name as an existing table in a Glue database, the adapter deletes the S3 objects at the location recorded in the catalog and then attempts to recreate the table from the SQL provided by the user.

Another reason the adapter might delete data from a bucket is when a model is created in the same external location as an existing table. In this case too, the adapter first cleans the target location to avoid issues on creation.
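
The second case, a model whose target location coincides with an existing table's location, can be illustrated with a small overlap check on S3 URIs. This is a hedged sketch and not how the adapter decides anything: `locations_overlap` is a hypothetical helper, and the URIs in the usage lines are placeholders.

```python
from urllib.parse import urlparse


def _split_s3(uri: str) -> tuple[str, str]:
    """Split an S3 URI into (bucket, key prefix without surrounding slashes)."""
    parsed = urlparse(uri)
    if parsed.scheme != "s3":
        raise ValueError(f"not an S3 URI: {uri}")
    # Normalise so that "s3://b/p" and "s3://b/p/" compare equal.
    return parsed.netloc, parsed.path.strip("/")


def locations_overlap(a: str, b: str) -> bool:
    """True if one S3 location equals, or is nested inside, the other."""
    bucket_a, prefix_a = _split_s3(a)
    bucket_b, prefix_b = _split_s3(b)
    if bucket_a != bucket_b:
        return False
    # Compare whole path segments so "tab" is not treated as a parent of "table".
    parts_a = prefix_a.split("/") if prefix_a else []
    parts_b = prefix_b.split("/") if prefix_b else []
    shorter = min(len(parts_a), len(parts_b))
    return parts_a[:shorter] == parts_b[:shorter]


print(locations_overlap("s3://bucket-b/stg_raw_tab/", "s3://bucket-b/stg_raw_tab"))
print(locations_overlap("s3://bucket-a/path/", "s3://bucket-b/path/"))
```

Running a check like this against every existing table location before pointing a model at an external location would flag the cases where cleaning the target would also remove someone else's data.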

IMO neither case is a bug but rather a user misconfiguration, and this behavior must be properly documented.

@amychen1776 I hope this helps you triage this issue; I leave the final decision to you folks at dbt Labs.

@nbenezri note that for both Iceberg and Hive tables, we perform a delete-objects operation and delete the table using the Glue APIs. DROP DDL for Iceberg tables can lead to situations where not all the S3 objects are removed, and the workaround described allows it to work properly in a dbt context.

@amychen1776
Contributor

amychen1776 commented Jan 7, 2025

Thank you @nicor88 for that context! This is super helpful. This to me is expected behavior to maintain dbt's idempotency (not accidentally create duplicate objects). I will look into getting this documented on our docs site.

@amychen1776 amychen1776 added type:documentation Improvements or additions to documentation and removed type:bug Something isn't working labels Jan 7, 2025
@jessedobbelaere
Contributor

jessedobbelaere commented Jan 7, 2025

Chiming in, the reason the table location S3 data needs to be deleted first (in particular for hive tables) is that you receive an Athena error HIVE_PATH_ALREADY_EXISTS if the S3 path contains 1 or more files. There are workarounds like making the table location unique with a uuid.
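
The uuid workaround mentioned above can be sketched as follows. This is an illustration only: `unique_table_location` is a hypothetical helper, the path layout is an assumption rather than the adapter's actual naming scheme, and the bucket name is a placeholder.

```python
import uuid


def unique_table_location(s3_data_dir: str, table: str) -> str:
    """Build a table location that is fresh on every call.

    Because the final path segment is a random UUID, the location should
    never already contain files, which is the condition that triggers
    HIVE_PATH_ALREADY_EXISTS.
    """
    return f"{s3_data_dir.rstrip('/')}/{table}/{uuid.uuid4()}/"


first = unique_table_location("s3://bucket-b/", "stg_raw_tab")
second = unique_table_location("s3://bucket-b/", "stg_raw_tab")
print(first)
print(second)
```

The trade-off is that data from earlier runs is left behind under the old UUID prefixes unless something cleans it up separately.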

Something like if the table it is about to drop is not iceberg - don't do delete-objects.

Like Nico says, it's better to still do a cleanup for Iceberg as well. You can configure native_drop, though, if you want Iceberg to clean up natively instead of the dbt adapter doing it: https://github.com/dbt-labs/dbt-athena/blob/main/dbt-athena/src/dbt/include/athena/macros/materializations/models/table/create_table_as.sql#L40

dbt-athena should use IAM role instead of leaning on aws profile.

The adapter just uses boto3, which resolves credentials through a chain of locations. There's no need to configure aws_access_key_id or aws_profile_name; perhaps it is picking up the default profile from your ~/.aws/? E.g. I run dbt on AWS ECS Fargate and pass a taskRole to the container, and when dbt runs locally I assume a role first and the temporary credentials are stored via AWS SSO, ...
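
To see which identity boto3's credential chain actually resolves to in a given environment, one option is to ask STS. This is a small sketch with hypothetical helper names (`caller_arn`, `default_sts_client`); it only uses `boto3.Session().client("sts")` and the STS `get_caller_identity` call, and it says nothing about how dbt-athena builds its own session.

```python
def caller_arn(sts_client) -> str:
    """Return the ARN of the identity the given STS client authenticates as."""
    return sts_client.get_caller_identity()["Arn"]


def default_sts_client():
    """Create an STS client from boto3's default credential chain."""
    # Imported lazily so caller_arn can be used with any STS-like client.
    import boto3

    # With no arguments, boto3.Session() resolves credentials on its own
    # (environment variables, shared config/credentials files, container or
    # instance role, and so on).
    return boto3.Session().client("sts")


# Requires boto3 and valid AWS credentials, so it is left commented out:
# print(caller_arn(default_sts_client()))
```

If the printed ARN belongs to a broadly privileged user rather than a role scoped to the dbt buckets, that explains why a delete against an unrelated bucket could succeed.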

@mikealfare mikealfare added the pkg:dbt-athena Issue affects dbt-athena label Jan 10, 2025

5 participants