Get intermittent errors when materializing multiple dlt assets to duckdb with concurrency 1 #26848

Open
pwr-philarmstrong opened this issue Jan 6, 2025 · 1 comment
Labels
integration: duckdb (Related to DuckDB integrations)
integration: embedded-elt (Related to dagster-embedded-elt which uses Sling and data Load Tool (dlt))
type: bug (Something isn't working)


pwr-philarmstrong commented Jan 6, 2025

What's the issue?

If I materialize multiple assets using the dlt integration with a duckdb destination, I sometimes get an error when dlt tries to open the duckdb file.

e.g.

dagster._core.errors.DagsterExecutionStepExecutionError: Error occurred while executing op "clan_membership":
The above exception was caused by the following exception:
dlt.pipeline.exceptions.PipelineStepFailed: Pipeline execution failed at stage load when processing package 1736178808.074418 with exception:

<class 'dlt.destinations.exceptions.DestinationConnectionError'>
Connection with DuckDbSqlClient to dataset name dlt_assets_multi failed. Please check if you configured the credentials at all and provided the right credentials values. You can be also denied access or your internet connection may be down. The actual reason given is: IO Error: Cannot open file "c:\dev\dagster_small\dlt_with_dagster_example2\dlt_assets_multi_pipeline.duckdb": The process cannot access the file because it is being used by another process.

What did you expect to happen?

I expect only a single asset to be materialized at a time, and I do not expect an error from duckdb when only one asset is using the file.
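
A possible stopgap while the root cause is unclear is to retry the connection for a short time, since the "being used by another process" error may be transient. The sketch below is only an illustration: `connect_with_retry` and its parameters are hypothetical names, not part of dlt, Dagster, or duckdb, and in the dlt case the connection is opened inside dlt, so the wrapper would have to go around whatever call triggers it.

```python
import time
from typing import Callable, Tuple, Type, TypeVar

T = TypeVar("T")


def connect_with_retry(
    connect: Callable[[], T],
    retry_on: Tuple[Type[BaseException], ...] = (OSError,),
    attempts: int = 5,
    base_delay: float = 0.2,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Call connect() until it succeeds or the attempts are used up.

    Hypothetical helper: waits base_delay, 2*base_delay, 4*base_delay, ...
    between attempts and re-raises the last error if every attempt fails.
    """
    if attempts < 1:
        raise ValueError("attempts must be >= 1")
    for attempt in range(attempts):
        try:
            return connect()
        except retry_on:
            if attempt == attempts - 1:
                raise  # out of attempts: surface the original error
            sleep(base_delay * 2 ** attempt)
    raise AssertionError("unreachable")


# Usage with duckdb (assumes duckdb is installed; duckdb.IOException is the
# exception type shown in the trace and is not a subclass of OSError):
#
#   import duckdb
#   conn = connect_with_retry(
#       lambda: duckdb.connect("dlt_assets_multi_pipeline.duckdb"),
#       retry_on=(duckdb.IOException,),
#   )
```

Retrying would only hide the symptom. It would not explain which process still holds the file.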

How to reproduce?

This repo has some examples of dlt to duckdb:

https://github.com/pwr-philarmstrong/dlt_with_dagster_example2/tree/master

If you try to materialize the asset groups that don't have "file" in the name, you get the error for one or more of the assets.

In this example, only one asset uses dlt_assets_incremental__family_pipeline.duckdb, so the issue is unlikely to be related to concurrency:

dagster._core.errors.DagsterExecutionStepExecutionError: Error occurred while executing op "inc_table_family":
The above exception was caused by the following exception:
dlt.pipeline.exceptions.PipelineStepFailed: Pipeline execution failed at stage load when processing package 1736183868.0893185 with exception:

<class 'dlt.destinations.exceptions.DestinationConnectionError'>
Connection with DuckDbSqlClient to dataset name dlt_assets_incremental failed. Please check if you configured the credentials at all and provided the right credentials values. You can be also denied access or your internet connection may be down. The actual reason given is: IO Error: Cannot open file "c:\dev\dagster_small\dlt_with_dagster_example2\dlt_assets_incremental__family_pipeline.duckdb": The process cannot access the file because it is being used by another process.


Stack Trace:
  File "C:\Users\PhilArmstrong\AppData\Local\pypoetry\Cache\virtualenvs\dataplatform-cCD1-Q8f-py3.12\Lib\site-packages\dagster\_core\execution\plan\utils.py", line 54, in op_execution_error_boundary
    yield
,  File "C:\Users\PhilArmstrong\AppData\Local\pypoetry\Cache\virtualenvs\dataplatform-cCD1-Q8f-py3.12\Lib\site-packages\dagster\_utils\__init__.py", line 490, in iterate_with_context
    next_output = next(iterator)
                  ^^^^^^^^^^^^^^
,  File "C:\dev\dagster_small\dlt_with_dagster_example2\dlt_with_dagster_example\assets\dlt_assets_incremental.py", line 84, in dagster_sql_assets
    yield from dlt.run(context=context,
,  File "C:\Users\PhilArmstrong\AppData\Local\pypoetry\Cache\virtualenvs\dataplatform-cCD1-Q8f-py3.12\Lib\site-packages\dagster_embedded_elt\dlt\dlt_event_iterator.py", line 76, in __next__
    return next(self._inner_iterator)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^
,  File "C:\Users\PhilArmstrong\AppData\Local\pypoetry\Cache\virtualenvs\dataplatform-cCD1-Q8f-py3.12\Lib\site-packages\dagster_embedded_elt\dlt\resource.py", line 286, in _run
    load_info = dlt_pipeline.run(dlt_source, **kwargs)
                ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
,  File "C:\Users\PhilArmstrong\AppData\Local\pypoetry\Cache\virtualenvs\dataplatform-cCD1-Q8f-py3.12\Lib\site-packages\dlt\pipeline\pipeline.py", line 226, in _wrap
    step_info = f(self, *args, **kwargs)
                ^^^^^^^^^^^^^^^^^^^^^^^^
,  File "C:\Users\PhilArmstrong\AppData\Local\pypoetry\Cache\virtualenvs\dataplatform-cCD1-Q8f-py3.12\Lib\site-packages\dlt\pipeline\pipeline.py", line 275, in _wrap
    return f(self, *args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^
,  File "C:\Users\PhilArmstrong\AppData\Local\pypoetry\Cache\virtualenvs\dataplatform-cCD1-Q8f-py3.12\Lib\site-packages\dlt\pipeline\pipeline.py", line 747, in run
    return self.load(destination, dataset_name, credentials=credentials)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
,  File "C:\Users\PhilArmstrong\AppData\Local\pypoetry\Cache\virtualenvs\dataplatform-cCD1-Q8f-py3.12\Lib\site-packages\dlt\pipeline\pipeline.py", line 226, in _wrap
    step_info = f(self, *args, **kwargs)
                ^^^^^^^^^^^^^^^^^^^^^^^^
,  File "C:\Users\PhilArmstrong\AppData\Local\pypoetry\Cache\virtualenvs\dataplatform-cCD1-Q8f-py3.12\Lib\site-packages\dlt\pipeline\pipeline.py", line 166, in _wrap
    return f(self, *args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^
,  File "C:\Users\PhilArmstrong\AppData\Local\pypoetry\Cache\virtualenvs\dataplatform-cCD1-Q8f-py3.12\Lib\site-packages\dlt\pipeline\pipeline.py", line 275, in _wrap
    return f(self, *args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^
,  File "C:\Users\PhilArmstrong\AppData\Local\pypoetry\Cache\virtualenvs\dataplatform-cCD1-Q8f-py3.12\Lib\site-packages\dlt\pipeline\pipeline.py", line 615, in load
    raise PipelineStepFailed(
The above exception was caused by the following exception:
dlt.destinations.exceptions.DestinationConnectionError: Connection with DuckDbSqlClient to dataset name dlt_assets_incremental failed. Please check if you configured the credentials at all and provided the right credentials values. You can be also denied access or your internet connection may be down. The actual reason given is: IO Error: Cannot open file "c:\dev\dagster_small\dlt_with_dagster_example2\dlt_assets_incremental__family_pipeline.duckdb": The process cannot access the file because it is being used by another process.

Stack Trace:
  File "C:\Users\PhilArmstrong\AppData\Local\pypoetry\Cache\virtualenvs\dataplatform-cCD1-Q8f-py3.12\Lib\site-packages\dlt\pipeline\pipeline.py", line 608, in load
    runner.run_pool(load_step.config, load_step)
,  File "C:\Users\PhilArmstrong\AppData\Local\pypoetry\Cache\virtualenvs\dataplatform-cCD1-Q8f-py3.12\Lib\site-packages\dlt\common\runners\pool_runner.py", line 91, in run_pool
    while _run_func():
          ^^^^^^^^^^^
,  File "C:\Users\PhilArmstrong\AppData\Local\pypoetry\Cache\virtualenvs\dataplatform-cCD1-Q8f-py3.12\Lib\site-packages\dlt\common\runners\pool_runner.py", line 84, in _run_func
    run_metrics = run_f.run(cast(TExecutor, pool))
                  ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
,  File "C:\Users\PhilArmstrong\AppData\Local\pypoetry\Cache\virtualenvs\dataplatform-cCD1-Q8f-py3.12\Lib\site-packages\dlt\load\load.py", line 639, in run
    self.load_single_package(load_id, schema)
,  File "C:\Users\PhilArmstrong\AppData\Local\pypoetry\Cache\virtualenvs\dataplatform-cCD1-Q8f-py3.12\Lib\site-packages\dlt\load\load.py", line 608, in load_single_package
    self.complete_package(load_id, schema, False)
,  File "C:\Users\PhilArmstrong\AppData\Local\pypoetry\Cache\virtualenvs\dataplatform-cCD1-Q8f-py3.12\Lib\site-packages\dlt\load\load.py", line 477, in complete_package
    with self.get_destination_client(schema) as job_client:
         ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
,  File "C:\Users\PhilArmstrong\AppData\Local\pypoetry\Cache\virtualenvs\dataplatform-cCD1-Q8f-py3.12\Lib\site-packages\dlt\destinations\job_client_impl.py", line 297, in __enter__
    self.sql_client.open_connection()
,  File "C:\Users\PhilArmstrong\AppData\Local\pypoetry\Cache\virtualenvs\dataplatform-cCD1-Q8f-py3.12\Lib\site-packages\dlt\destinations\sql_client.py", line 441, in _wrap
    raise DestinationConnectionError(type(self).__name__, self.dataset_name, str(ex), ex)
The above exception occurred during handling of the following exception:
duckdb.duckdb.IOException: IO Error: Cannot open file "c:\dev\dagster_small\dlt_with_dagster_example2\dlt_assets_incremental__family_pipeline.duckdb": The process cannot access the file because it is being used by another process.


Stack Trace:
  File "C:\Users\PhilArmstrong\AppData\Local\pypoetry\Cache\virtualenvs\dataplatform-cCD1-Q8f-py3.12\Lib\site-packages\dlt\destinations\sql_client.py", line 439, in _wrap
    return f(self, *args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^
,  File "C:\Users\PhilArmstrong\AppData\Local\pypoetry\Cache\virtualenvs\dataplatform-cCD1-Q8f-py3.12\Lib\site-packages\dlt\destinations\impl\duckdb\sql_client.py", line 75, in open_connection
    self._conn = self.credentials.borrow_conn(read_only=self.credentials.read_only)
                 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
,  File "C:\Users\PhilArmstrong\AppData\Local\pypoetry\Cache\virtualenvs\dataplatform-cCD1-Q8f-py3.12\Lib\site-packages\dlt\destinations\impl\duckdb\configuration.py", line 43, in borrow_conn
    self._conn = duckdb.connect(
                 ^^^^^^^^^^^^^^^

Dagster version

dagster, version 1.9.6

Deployment type

Local

Deployment details

Using a source of mysql+pymysql://[email protected]:4497/Rfam
and a destination of either the local filesystem or duckdb. Each asset script has an output_dir variable for the local filesystem destination.

Additional information

You might also get other errors from dlt, such as access denied and timeouts, but they are not related to the duckdb access issue.


@pwr-philarmstrong pwr-philarmstrong added the type: bug Something isn't working label Jan 6, 2025
@garethbrickman garethbrickman added integration: embedded-elt Related to dagster-embedded-elt which uses Sling and data Load Tool (dlt) integration: duckdb Related to DuckDB integrations labels Jan 6, 2025
@cmpadden cmpadden self-assigned this Jan 6, 2025
@pwr-philarmstrong (Author)

I also got this error on a subsequent run of the assets. This asset was one that had failed with the duckdb error.

inc_table_family
STEP_FAILURE
dagster._core.errors.DagsterExecutionStepExecutionError: Error occurred while executing op "inc_table_family":
The above exception was caused by the following exception:
AttributeError: 'NoneType' object has no attribute 'row_counts'

Stack Trace:
  File "C:\Users\PhilArmstrong\AppData\Local\pypoetry\Cache\virtualenvs\dataplatform-cCD1-Q8f-py3.12\Lib\site-packages\dagster\_core\execution\plan\utils.py", line 54, in op_execution_error_boundary
    yield
,  File "C:\Users\PhilArmstrong\AppData\Local\pypoetry\Cache\virtualenvs\dataplatform-cCD1-Q8f-py3.12\Lib\site-packages\dagster\_utils\__init__.py", line 490, in iterate_with_context
    next_output = next(iterator)
                  ^^^^^^^^^^^^^^
,  File "C:\dev\dagster_small\dlt_with_dagster_example2\dlt_with_dagster_example\assets\dlt_assets_incremental.py", line 84, in dagster_sql_assets
    yield from dlt.run(context=context,
,  File "C:\Users\PhilArmstrong\AppData\Local\pypoetry\Cache\virtualenvs\dataplatform-cCD1-Q8f-py3.12\Lib\site-packages\dagster_embedded_elt\dlt\dlt_event_iterator.py", line 76, in __next__
    return next(self._inner_iterator)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^
,  File "C:\Users\PhilArmstrong\AppData\Local\pypoetry\Cache\virtualenvs\dataplatform-cCD1-Q8f-py3.12\Lib\site-packages\dagster_embedded_elt\dlt\resource.py", line 296, in _run
    metadata = self.extract_resource_metadata(
               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
,  File "C:\Users\PhilArmstrong\AppData\Local\pypoetry\Cache\virtualenvs\dataplatform-cCD1-Q8f-py3.12\Lib\site-packages\dagster_embedded_elt\dlt\resource.py", line 138, in extract_resource_metadata
    rows_loaded = dlt_pipeline.last_trace.last_normalize_info.row_counts.get(
                  ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
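
The last frame dereferences `dlt_pipeline.last_trace.last_normalize_info.row_counts`, and the AttributeError means `last_normalize_info` was None on this run. A None-tolerant lookup could look roughly like the sketch below. It is not the dagster-embedded-elt code: `rows_loaded_or_none` is a hypothetical helper, and the `SimpleNamespace` objects are stand-ins for a dlt pipeline that only mimic the attribute chain seen in the trace.

```python
from types import SimpleNamespace
from typing import Any, Optional


def rows_loaded_or_none(pipeline: Any, table_name: str) -> Optional[int]:
    """Return the row count for table_name, or None if any link is missing.

    Hypothetical helper: walks last_trace -> last_normalize_info -> row_counts
    and stops as soon as one of them is absent or None.
    """
    trace = getattr(pipeline, "last_trace", None)
    normalize_info = getattr(trace, "last_normalize_info", None)
    row_counts = getattr(normalize_info, "row_counts", None)
    if not row_counts:
        return None
    return row_counts.get(table_name)


# Stand-in for the failing run: there is a trace, but no normalize info.
failed_run = SimpleNamespace(last_trace=SimpleNamespace(last_normalize_info=None))

# Stand-in for a run where the normalize step recorded row counts.
good_run = SimpleNamespace(
    last_trace=SimpleNamespace(
        last_normalize_info=SimpleNamespace(row_counts={"family": 7})
    )
)

print(rows_loaded_or_none(failed_run, "family"))  # None
print(rows_loaded_or_none(good_run, "family"))    # 7
```

Whether a missing row count should be reported as absent metadata or as a failure is a design decision for the integration.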
