Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Improve BRAC model performance #82

Closed
njuguna-n opened this issue May 16, 2024 · 31 comments
Closed

Improve BRAC model performance #82

njuguna-n opened this issue May 16, 2024 · 31 comments
Assignees

Comments

@njuguna-n
Copy link
Contributor

njuguna-n commented May 16, 2024

We noticed that dbt run performance becomes increasingly worse as the document count increases. This issue will hold the discussion on what we are trying or have tried to improve the performance.

@njuguna-n njuguna-n converted this from a draft issue May 16, 2024
@andrablaj andrablaj added this to the BRAC Models milestone May 16, 2024
@njuguna-n
Copy link
Contributor Author

Current performance running DBT with the models defined here:
Raw Document count is approximately 14,488,958 documents
Last dbt run

12:08:28  Done. PASS=3 WARN=0 ERROR=0 SKIP=0 TOTAL=3
12:08:36  Running with dbt=1.8.0
12:08:37  Registered adapter: postgres=1.8.0
12:08:38  Found 16 models, 8 data tests, 1 source, 539 macros
12:08:38
12:08:38  Concurrency: 1 threads (target='default')
12:08:38
12:08:38  1 of 3 START sql incremental model v1.couchdb .................................. [RUN]
12:21:19  1 of 3 OK created sql incremental model v1.couchdb ............................. [INSERT 0 1156001 in 760.94s]
12:21:19  2 of 3 START sql incremental model v1.chp_with_branches ........................ [RUN]
12:21:19  2 of 3 OK created sql incremental model v1.chp_with_branches ................... [INSERT 0 0 in 0.26s]
12:21:19  3 of 3 START sql incremental model v1.data_record .............................. [RUN]
12:50:46  3 of 3 OK created sql incremental model v1.data_record ......................... [INSERT 0 930161 in 1766.98s]
12:50:46
12:50:46  Finished running 3 incremental models in 0 hours 42 minutes and 8.40 seconds (2528.40s).

As you can see in the logs above, the insertions to two of the three incremental tables take the longest time. The query plans for the two queries are below.

couchdb

explain SELECT
    doc ->> '_id'::text AS uuid,
    doc ->> 'type'::text AS type,
    doc ->> 'name'::text AS name,
    doc ->> 'contact_type'::text AS contact_type,
    doc ->> 'phone'::text AS phone,
    doc ->> 'alternative_phone'::text AS phone2,
    doc ->> 'date_of_birth'::text AS date_of_birth,
    doc #>> '{contact,_id}'::text[] AS contact_uuid,
    doc #>> '{parent,_id}'::text[] AS parent_uuid,
    doc ->> 'is_active'::text AS active,
    doc ->> 'notes'::text AS notes,
    doc ->> 'reported_date'::text AS reported_date,
    doc ->> 'area'::text AS area,
    doc ->> 'region'::text AS region,
    doc ->> 'contact_id'::text AS contact_id,
    *
FROM v1.medic

  WHERE "@timestamp" >= (select coalesce(max("@timestamp"), '1900-01-01') from "data"."v1"."couchdb");
                                              QUERY PLAN
------------------------------------------------------------------------------------------------------
 Seq Scan on medic  (cost=1060798.57..3467868.47 rows=5312852 width=1492)
   Filter: ("@timestamp" >= $1)
   InitPlan 1 (returns $1)
     ->  Finalize Aggregate  (cost=1060798.56..1060798.57 rows=1 width=8)
           ->  Gather  (cost=1060798.34..1060798.55 rows=2 width=8)
                 Workers Planned: 2
                 ->  Partial Aggregate  (cost=1059798.34..1059798.35 rows=1 width=8)
                       ->  Parallel Seq Scan on couchdb  (cost=0.00..1047998.28 rows=4720028 width=8)
 JIT:
   Functions: 9
   Options: Inlining true, Optimization true, Expressions true, Deforming true
(11 rows)

data_record

explain SELECT
  uuid,
  (doc #>> '{fields,patient_id}')::text AS patient_id,
  (doc #>> '{contact,_id}')::text as contact_uuid,
  (doc #>> '{contact,parent,_id}')::text as contact_parent_uuid,
  to_timestamp((NULLIF(doc ->> 'reported_date'::text, ''::text)::bigint / 1000)::double precision) AS reported,
  doc ->> 'form' as form,
  doc,
  "@timestamp"
FROM "data"."v1"."couchdb"
WHERE
  type = 'data_record'

    AND "@timestamp" >= (select coalesce(max("@timestamp"), '1900-01-01') from "data"."v1"."data_record");
                                               QUERY PLAN
---------------------------------------------------------------------------------------------------------
 Seq Scan on couchdb  (cost=312827.06..1551160.13 rows=2458694 width=736)
   Filter: (("@timestamp" >= $1) AND (type = 'data_record'::text))
   InitPlan 1 (returns $1)
     ->  Finalize Aggregate  (cost=312827.05..312827.06 rows=1 width=8)
           ->  Gather  (cost=312826.83..312827.04 rows=2 width=8)
                 Workers Planned: 2
                 ->  Partial Aggregate  (cost=311826.83..311826.84 rows=1 width=8)
                       ->  Parallel Seq Scan on data_record  (cost=0.00..304341.27 rows=2994227 width=8)
 JIT:
   Functions: 9
   Options: Inlining true, Optimization true, Expressions true, Deforming true
(11 rows)

@njuguna-n
Copy link
Contributor Author

For additional context below are the logs for the dbt run immediately before and after the one posted above.
Before:

11:42:03  Running with dbt=1.8.0
11:42:04  Registered adapter: postgres=1.8.0
11:42:04  Found 16 models, 8 data tests, 1 source, 539 macros
11:42:04
11:42:05  Concurrency: 1 threads (target='default')
11:42:05
11:42:05  1 of 3 START sql incremental model v1.couchdb .................................. [RUN]
11:51:39  1 of 3 OK created sql incremental model v1.couchdb ............................. [INSERT 0 784126 in 574.11s]
11:51:39  2 of 3 START sql incremental model v1.chp_with_branches ........................ [RUN]
11:51:39  2 of 3 OK created sql incremental model v1.chp_with_branches ................... [INSERT 0 0 in 0.60s]
11:51:39  3 of 3 START sql incremental model v1.data_record .............................. [RUN]
12:08:28  3 of 3 OK created sql incremental model v1.data_record ......................... [INSERT 0 615364 in 1008.93s]
12:08:28
12:08:28  Finished running 3 incremental models in 0 hours 26 minutes and 23.85 seconds (1583.85s).

After:

12:50:54  Running with dbt=1.8.0
12:50:55  Registered adapter: postgres=1.8.0
12:50:55  Found 16 models, 8 data tests, 1 source, 539 macros
12:50:55
12:50:56  Concurrency: 1 threads (target='default')
12:50:56
12:50:56  1 of 3 START sql incremental model v1.couchdb .................................. [RUN]
13:12:31  1 of 3 OK created sql incremental model v1.couchdb ............................. [INSERT 0 1782126 in 1295.05s]
13:12:31  2 of 3 START sql incremental model v1.chp_with_branches ........................ [RUN]
13:12:31  2 of 3 OK created sql incremental model v1.chp_with_branches ................... [INSERT 0 0 in 0.32s]
13:12:31  3 of 3 START sql incremental model v1.data_record .............................. [RUN]
14:11:07  3 of 3 OK created sql incremental model v1.data_record ......................... [INSERT 0 1412597 in 3515.48s]
14:11:07
14:11:07  Finished running 3 incremental models in 1 hours 20 minutes and 11.08 seconds (4811.08s).

@witash
Copy link
Contributor

witash commented May 16, 2024

Parallel Seq Scan on couchdb (cost=0.00..1047998.28 rows=4720028 width=8)
Parallel Seq Scan on data_record (cost=0.00..304341.27 rows=2994227 width=8)
y is it not using the indexes on timestamp

@njuguna-n
Copy link
Contributor Author

y is it not using the indexes on timestamp

I honestly don't know. I have tried rewriting the incremental query with no luck

explain select max("@timestamp") from v1.data_record;
                                           QUERY PLAN
-------------------------------------------------------------------------------------------------
 Finalize Aggregate  (cost=322734.15..322734.16 rows=1 width=8)
   ->  Gather  (cost=322733.94..322734.15 rows=2 width=8)
         Workers Planned: 2
         ->  Partial Aggregate  (cost=321733.94..321733.95 rows=1 width=8)
               ->  Parallel Seq Scan on data_record  (cost=0.00..314011.35 rows=3089035 width=8)
 JIT:
   Functions: 5
   Options: Inlining false, Optimization false, Expressions true, Deforming true
(8 rows)
explain select "@timestamp" from v1.data_record order by "@timestamp" desc limit 1;
                                           QUERY PLAN
-------------------------------------------------------------------------------------------------
 Limit  (cost=330456.55..330456.67 rows=1 width=8)
   ->  Gather Merge  (cost=330456.55..1051281.71 rows=6178070 width=8)
         Workers Planned: 2
         ->  Sort  (cost=329456.52..337179.11 rows=3089035 width=8)
               Sort Key: "@timestamp" DESC
               ->  Parallel Seq Scan on data_record  (cost=0.00..314011.35 rows=3089035 width=8)
 JIT:
   Functions: 3
   Options: Inlining false, Optimization false, Expressions true, Deforming true
(9 rows)

@witash
Copy link
Contributor

witash commented May 16, 2024

its because of the block range index for some reason.
locally takes 5 seconds to do a seq scan on couchdb

but if I try with a btree index, it uses that and takes <1 ms


medic=# explain analyze select max("@timestamp") from v1.couchdb;
                                                                  QUERY PLAN                                                                  
----------------------------------------------------------------------------------------------------------------------------------------------
 Finalize Aggregate  (cost=186710.47..186710.48 rows=1 width=8) (actual time=4237.669..4244.168 rows=1 loops=1)
   ->  Gather  (cost=186710.25..186710.46 rows=2 width=8) (actual time=4237.449..4244.145 rows=3 loops=1)
         Workers Planned: 2
         Workers Launched: 2
         ->  Partial Aggregate  (cost=185710.25..185710.26 rows=1 width=8) (actual time=4219.535..4219.537 rows=1 loops=3)
               ->  Parallel Seq Scan on couchdb  (cost=0.00..184159.80 rows=620180 width=8) (actual time=0.207..4041.777 rows=496144 loops=3)
 Planning Time: 0.217 ms
 JIT:
   Functions: 11
   Options: Inlining false, Optimization false, Expressions true, Deforming true
   Timing: Generation 1.497 ms, Inlining 0.000 ms, Optimization 1.290 ms, Emission 18.418 ms, Total 21.205 ms
 Execution Time: 4244.974 ms
(12 rows)

medic=# CREATE INDEX "5a06b9d93ee7c632f55b50cb4e3d9a6f" ON v1.couchdb USING btree ("@timestamp")
;
CREATE INDEX
medic=# explain analyze select max("@timestamp") from v1.couchdb;
                                                                                    QUERY PLAN                                                                                    
----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
 Result  (cost=0.47..0.48 rows=1 width=8) (actual time=0.060..0.061 rows=1 loops=1)
   InitPlan 1 (returns $0)
     ->  Limit  (cost=0.43..0.47 rows=1 width=8) (actual time=0.056..0.057 rows=1 loops=1)
           ->  Index Only Scan Backward using "5a06b9d93ee7c632f55b50cb4e3d9a6f" on couchdb  (cost=0.43..62546.03 rows=1488432 width=8) (actual time=0.054..0.054 rows=1 loops=1)
                 Index Cond: ("@timestamp" IS NOT NULL)
                 Heap Fetches: 1
 Planning Time: 0.300 ms
 Execution Time: 0.081 ms

@lorerod
Copy link
Contributor

lorerod commented May 16, 2024

I'm going to test this locally to try to figure out why the indexes are not working. Which branch of cht-sync are you using? do you create indexes also for v1.medic table in cht-sync init-dbt-resources.sh?

@lorerod
Copy link
Contributor

lorerod commented May 16, 2024

Also, on couhdb model did you tried the explain plan using:

{% if is_incremental() %}
  WHERE "@timestamp" >= {{ max_existing_timestamp('"@timestamp"') }}
{% endif %}

?
Because when using:

WHERE "@timestamp" >= (select coalesce(max("@timestamp"), '1900-01-01') from "data"."v1"."couchdb");

the coalesce, and max is going to force the full scan.

@lorerod
Copy link
Contributor

lorerod commented May 17, 2024

I tested this locally.
Using cht-sync main branch, and editing /postgres/init-dbt-resources.sh to add the creation of indexes:

CREATE INDEX timestamp ON $POSTGRES_SCHEMA.$POSTGRES_TABLE USING BRIN ("@timestamp");
CREATE INDEX timestamp_btree ON $POSTGRES_SCHEMA.$POSTGRES_TABLE USING btree("@timestamp");

And using the following .env:

.env
# (Optional) project wide
COMPOSE_PROJECT_NAME=pipeline

# postgrest and postgresql
POSTGRES_USER=postgres
POSTGRES_PASSWORD=postgres
POSTGRES_DB=data
POSTGRES_SCHEMA=v1
POSTGRES_TABLE=medic # for dbt use only
POSTGRES_HOST=postgres # Your postgres instance IP or endpoint in "prod".

# dbt
DBT_POSTGRES_USER=postgres
DBT_POSTGRES_PASSWORD=postgres
DBT_POSTGRES_SCHEMA=dbt
DBT_POSTGRES_HOST=postgres # Your postgres instance IP or endpoint in "prod".
CHT_PIPELINE_BRANCH_URL="https://github.com/medic/cht-pipeline.git#stripped-out-brac-models"


# couchdb and logstash
COUCHDB_USER=****
COUCHDB_PASSWORD=****
COUCHDB_DBS="medic" # space separated list of databases you want to sync e.g "medic medic_sentinel"
COUCHDB_HOST=brac-clone-for-phil.dev.medicmobile.org
COUCHDB_PORT=443
COUCHDB_SECURE=true
REDIS_HOST=redis
REDIS_PORT=6379
REDIS_KEY=medic
POSTGREST_ENDPOINT=postgrest:3000
REDIS_BATCH_SIZE=100

Connected to the Postgres database and run the following:
For couchdb:

explain SELECT
    doc ->> '_id'::text AS uuid,
    doc ->> 'type'::text AS type,
    doc ->> 'name'::text AS name,
    doc ->> 'contact_type'::text AS contact_type,
    doc ->> 'phone'::text AS phone,
    doc ->> 'alternative_phone'::text AS phone2,
    doc ->> 'date_of_birth'::text AS date_of_birth,
    doc #>> '{contact,_id}'::text[] AS contact_uuid,
    doc #>> '{parent,_id}'::text[] AS parent_uuid,
    doc ->> 'is_active'::text AS active,
    doc ->> 'notes'::text AS notes,
    doc ->> 'reported_date'::text AS reported_date,
    doc ->> 'area'::text AS area,
    doc ->> 'region'::text AS region,
    doc ->> 'contact_id'::text AS contact_id,
    *
FROM v1.medic

  WHERE "@timestamp" >= (select coalesce(max("@timestamp"), '1900-01-01') from "data"."v1"."couchdb");

The output:

Index Scan using timestamp_btree on medic  (cost=0.92..243606.83 rows=1207606 width=1570)
  Index Cond: ("@timestamp" >= $1)
  InitPlan 2 (returns $1)
    ->  Result  (cost=0.48..0.49 rows=1 width=8)
          InitPlan 1 (returns $0)
            ->  Limit  (cost=0.43..0.48 rows=1 width=8)
                  ->  Index Only Scan Backward using "6740ec69d7df950ef91d9ff0c8b5a0d4" on couchdb  (cost=0.43..175921.66 rows=3508293 width=8)
                        Index Cond: ("@timestamp" IS NOT NULL)
JIT:
  Functions: 7
  Options: Inlining false, Optimization false, Expressions true, Deforming true

For data_record:

 explain SELECT
  uuid,
  (doc #>> '{fields,patient_id}')::text AS patient_id,
  (doc #>> '{contact,_id}')::text as contact_uuid,
  (doc #>> '{contact,parent,_id}')::text as contact_parent_uuid,
  to_timestamp((NULLIF(doc ->> 'reported_date'::text, ''::text)::bigint / 1000)::double precision) AS reported,
  doc ->> 'form' as form,
  doc,
  "@timestamp"
FROM "data"."dbt"."couchdb"
WHERE
  type = 'data_record'

    AND "@timestamp" >= (select coalesce(max("@timestamp"), '1900-01-01') from "data"."dbt"."data_record");

The output:

Index Scan using "6740ec69d7df950ef91d9ff0c8b5a0d4" on couchdb  (cost=0.90..191000.46 rows=494053 width=967)
  Index Cond: ("@timestamp" >= $1)
  Filter: (type = 'data_record'::text)
  InitPlan 2 (returns $1)
    ->  Result  (cost=0.46..0.47 rows=1 width=8)
          InitPlan 1 (returns $0)
            ->  Limit  (cost=0.43..0.46 rows=1 width=8)
                  ->  Index Only Scan Backward using e4c8daed4ea6596ed764e3b188633cb9 on data_record  (cost=0.43..50748.41 rows=1511292 width=8)
                        Index Cond: ("@timestamp" IS NOT NULL)
JIT:
  Functions: 9
  Options: Inlining false, Optimization false, Expressions true, Deforming true

If I understand correctly, this is using the indexes.

@lorerod
Copy link
Contributor

lorerod commented May 17, 2024

The time in my local with approximately 4 million docs in db:

2024-05-16 20:40:13 00:40:13  Running with dbt=1.8.0
2024-05-16 20:40:14 00:40:14  Registered adapter: postgres=1.8.0
2024-05-16 20:40:14 00:40:14  Found 16 models, 8 data tests, 1 source, 539 macros
2024-05-16 20:40:14 00:40:14  
2024-05-16 20:40:15 00:40:15  Concurrency: 1 threads (target='default')
2024-05-16 20:40:15 00:40:15  
2024-05-16 20:40:15 00:40:15  1 of 3 START sql incremental model dbt.couchdb ................................. [RUN]
2024-05-16 20:40:55 00:40:55  1 of 3 OK created sql incremental model dbt.couchdb ............................ [INSERT 0 83043 in 40.29s]
2024-05-16 20:40:55 00:40:55  2 of 3 START sql incremental model dbt.chp_with_branches ....................... [RUN]
2024-05-16 20:40:55 00:40:55  2 of 3 OK created sql incremental model dbt.chp_with_branches .................. [INSERT 0 0 in 0.39s]
2024-05-16 20:40:55 00:40:55  3 of 3 START sql incremental model dbt.data_record ............................. [RUN]
2024-05-16 20:42:13 00:42:13  3 of 3 OK created sql incremental model dbt.data_record ........................ [INSERT 0 50199 in 77.75s]
2024-05-16 20:42:13 00:42:13  
2024-05-16 20:42:13 00:42:13  Finished running 3 incremental models in 0 hours 1 minutes and 58.73 seconds (118.73s).

@njuguna-n
Copy link
Contributor Author

@witash changing the type of index does indeed help. The query plan for data_record below now makes use of the index

 Index Scan using "780af2d8426d8a6255945d09757f50ce" on couchdb  (cost=1.04..1606494.98 rows=4590898 width=1038)
   Index Cond: ("@timestamp" >= $1)
   Filter: (type = 'data_record'::text)
   InitPlan 2 (returns $1)
     ->  Result  (cost=0.47..0.48 rows=1 width=8)
           InitPlan 1 (returns $0)
             ->  Limit  (cost=0.43..0.47 rows=1 width=8)
                   ->  Index Only Scan Backward using "85744a0b560b380f4cd6c427709abb83" on data_record  (cost=0.43..311714.10 rows=9913218 width=8)
                         Index Cond: ("@timestamp" IS NOT NULL)
 JIT:
   Functions: 9
   Options: Inlining true, Optimization true, Expressions true, Deforming true
(12 rows)

@njuguna-n
Copy link
Contributor Author

dbt run with about 32,878,580 raw docs took ~3 hours for the last completed run. The time still goes up with the more docs we get so further optimization definitely required.

23:59:23  Running with dbt=1.8.0
23:59:24  Registered adapter: postgres=1.8.0
23:59:25  Found 16 models, 8 data tests, 1 source, 539 macros
23:59:25
23:59:25  Concurrency: 1 threads (target='default')
23:59:25
23:59:25  1 of 3 START sql incremental model v1.couchdb .................................. [RUN]
00:31:34  1 of 3 OK created sql incremental model v1.couchdb ............................. [INSERT 0 2827052 in 1929.07s]
00:31:34  2 of 3 START sql incremental model v1.chp_with_branches ........................ [RUN]
00:33:13  2 of 3 OK created sql incremental model v1.chp_with_branches ................... [INSERT 0 0 in 99.02s]
00:33:13  3 of 3 START sql incremental model v1.data_record .............................. [RUN]
03:12:17  3 of 3 OK created sql incremental model v1.data_record ......................... [INSERT 0 1751765 in 9543.71s]
03:12:17
03:12:17  Finished running 3 incremental models in 3 hours 12 minutes and 52.04 seconds (11572.04s).

@njuguna-n
Copy link
Contributor Author

@lorerod I was getting an error when using the macros for the incremental filter I was getting an error so reverted back to the sub query. I did try not using coalesce and max functions for the sub query but that didn't help. Changing the index type has helped though as mentioned above.

@njuguna-n
Copy link
Contributor Author

The incremental logic can potentially be improved by further filtering by the reported column with something like the query below but the order cannot be guaranteed so it is very buggy.

 AND "@timestamp" >= (select coalesce(max("@timestamp"), '1900-01-01') from {{ this }})
 AND to_timestamp((NULLIF(doc ->> 'reported_date'::text, ''::text)::bigint / 1000)::double precision) >=(select coalesce(max("reported"), '1900-01-01') from {{ this }})

Creating partitioned tables in DBT is not straight forward and there is an open pr to add partitioning support. There seems to be a workaround by creating a function in DBT and have that create the partition tables but I have not yet gotten that to work. Right now I have removed the couchb model and created a new contact model to try and reduce the number of duplicates we store in the base tables.

@njuguna-n
Copy link
Contributor Author

After changing the root table to have an autoincrement primary key below is the current query plan

 Gather  (cost=111901.28..720825.51 rows=9038 width=859)
   Workers Planned: 2
   Params Evaluated: $1
   InitPlan 1 (returns $1)
     ->  Finalize Aggregate  (cost=110901.27..110901.28 rows=1 width=4)
           ->  Gather  (cost=110901.06..110901.27 rows=2 width=4)
                 Workers Planned: 2
                 ->  Partial Aggregate  (cost=109901.06..109901.07 rows=1 width=4)
                       ->  Parallel Seq Scan on data_record  (cost=0.00..107259.05 rows=1056805 width=4)
   ->  Parallel Seq Scan on couchdb  (cost=0.00..608020.43 rows=3766 width=859)
         Filter: ((auto_id > $1) AND ((doc ->> 'type'::text) = 'data_record'::text))
 JIT:
   Functions: 9
   Options: Inlining true, Optimization true, Expressions true, Deforming true
(14 rows)

dbt run takes ~10minutes with ~5.5million docs

09:59:51  1 of 4 START sql incremental model v1.couchdb .................................. [RUN]
10:02:54  1 of 4 OK created sql incremental model v1.couchdb ............................. [INSERT 0 416625 in 182.27s]
10:02:54  2 of 4 START sql incremental model v1.contact .................................. [RUN]
10:04:26  2 of 4 OK created sql incremental model v1.contact ............................. [INSERT 0 46831 in 92.34s]
10:04:26  3 of 4 START sql incremental model v1.data_record .............................. [RUN]
10:09:58  3 of 4 OK created sql incremental model v1.data_record ......................... [INSERT 0 301657 in 331.69s]
10:09:58  4 of 4 START sql incremental model v1.chp_with_branches ........................ [RUN]
10:09:58  4 of 4 OK created sql incremental model v1.chp_with_branches ................... [INSERT 0 0 in 0.15s]
10:09:58
10:09:58  Finished running 4 incremental models in 0 hours 10 minutes and 6.69 seconds (606.69s).

@garethbowen
Copy link
Contributor

@njuguna-n Can you share the full query? Are you querying based on the primary key or the timestamp?

@njuguna-n
Copy link
Contributor Author

@garethbowen the query:

explain SELECT
  doc ->> '_id'::text AS uuid,
  (doc #>> '{fields,patient_id}')::text AS patient_id,
  (doc #>> '{contact,_id}')::text as contact_uuid,
  (doc #>> '{contact,parent,_id}')::text as contact_parent_uuid,
  to_timestamp((NULLIF(doc ->> 'reported_date'::text, ''::text)::bigint / 1000)::double precision) AS reported,
  doc ->> 'form' as form,
  doc,
  auto_id,
  "@timestamp"
FROM "data"."v1"."couchdb"
WHERE doc ->> 'type' = 'data_record'

  AND "auto_id" > (select max("auto_id") from "data"."v1"."data_record");

@derickl
Copy link
Member

derickl commented May 20, 2024

There shouldn't be anything fundamentally different between using an auto_id and a timestamp.

The pattern that @lorerod shows here uses a dbt macro to get the static value to compare against the column on the where clause. The postgres optimiser is able to make effectively use and index than when using a subquery. Unless things have changed over the last 2 or so years.

@lorerod
Copy link
Contributor

lorerod commented May 20, 2024

@lorerod I was getting an error when using the macros for the incremental filter I was getting an error so reverted back to the sub query. I did try not using coalesce and max functions for the sub query but that didn't help. Changing the index type has helped though as mentioned above.

@njuguna-n, what error were you getting?

@njuguna-n
Copy link
Contributor Author

@lorerod the error

Database Error in model couchdb (models/root/couchdb.sql)
  invalid input syntax for type timestamp: "None"
  LINE 34:   WHERE "@timestamp" >= CAST('None' AS TIMESTAMP)

@njuguna-n
Copy link
Contributor Author

@derickl the macros also creates a sub query is that really more efficient than writing the sub query directly?

@lorerod
Copy link
Contributor

lorerod commented May 21, 2024

@lorerod the error

Database Error in model couchdb (models/root/couchdb.sql)
  invalid input syntax for type timestamp: "None"
  LINE 34:   WHERE "@timestamp" >= CAST('None' AS TIMESTAMP)

For me, it was working in this commit.

@njuguna-n
Copy link
Contributor Author

Update: The data_record takes the longest time to update for some reason, for example int hr DBT logs here it takes approximately 54% of the running time yet couchdb has the most records. I tried using the macros @lorerod added just for that table but it still takes too long to run. I also tried adding materialized views for the max timestamp but even with the same query the one for data_record has been running for close to 2 hours now when the other two run in milliseconds.

Even an explain query like explain SELECT coalesce(max("@timestamp"), '1900-01-01') AS max_timestamp FROM "data"."v1"."data_record"; takes too long and I have to cancel it without getting a response yet a similar query explain SELECT coalesce(max("@timestamp"), '1900-01-01') AS max_timestamp FROM "data"."v1"."couchdb"; returns almost immediately. Currently my DB has 44m rows in the medic table, 9.4m in the data_record table, 27m in the couchdb table and 2.7m in the contacts table.

I'll temporarily remove data_record and the form views to benchmark using the macros vs the mat views while I try and figure out what is going on with it.

@derickl
Copy link
Member

derickl commented May 22, 2024

Is the long run-time only affecting the first load or does it also affect the incremental loading?

@derickl
Copy link
Member

derickl commented May 22, 2024

#82 (comment)

None is not a valid type in SQL. NULL however would work.

@njuguna-n
Copy link
Contributor Author

Is the long run-time only affecting the first load or does it also affect the incremental loading?

It also affects incremental loading with runs taking longer with more data.

@njuguna-n
Copy link
Contributor Author

Three incremental tables and two materialized views now take about 9 minutes to refresh. Tomorrow I will test if using the mat views is faster than the macros for the incremental queries.

18:20:34  Running with dbt=1.8.0
18:20:34  Registered adapter: postgres=1.8.0
18:20:35  Found 10 models, 4 data tests, 1 source, 539 macros
18:20:35
18:20:35  Concurrency: 1 threads (target='default')
18:20:35
18:20:35  1 of 5 START sql incremental model v1.couchdb .................................. [RUN]
18:27:42  1 of 5 OK created sql incremental model v1.couchdb ............................. [INSERT 0 1 in 427.47s]
18:27:42  2 of 5 START sql incremental model v1.contact .................................. [RUN]
18:27:42  2 of 5 OK created sql incremental model v1.contact ............................. [INSERT 0 1 in 0.14s]
18:27:42  3 of 5 START sql materialized_view model v1.latest_couchdb_timestamp ........... [RUN]
18:27:42  3 of 5 OK created sql materialized_view model v1.latest_couchdb_timestamp ...... [REFRESH MATERIALIZED VIEW in 0.10s]
18:27:42  4 of 5 START sql incremental model v1.chp_with_branches ........................ [RUN]
18:29:49  4 of 5 OK created sql incremental model v1.chp_with_branches ................... [INSERT 0 0 in 126.68s]
18:29:49  5 of 5 START sql materialized_view model v1.latest_contact_timestamp ........... [RUN]
18:29:49  5 of 5 OK created sql materialized_view model v1.latest_contact_timestamp ...... [REFRESH MATERIALIZED VIEW in 0.06s]
18:29:49
18:29:49  Finished running 3 incremental models, 2 materialized view models in 0 hours 9 minutes and 14.63 seconds (554.63s).

@njuguna-n
Copy link
Contributor Author

Mat views ended up causing a cyclical dependency based on how I used them. I tried using mat views for contactview_person_fields and contactview_chp mainly because I needed to query them to debug why chp_with_branches table had no rows. I restarted the sync process with the idea of getting all data in the BRAC clone copied over to my test DB. I got ~44m docs in the medic table and ~22m docs in the couchdb table because the DBT modelling could not keep up. There was a big jump from runs taking ~4mins to taking ~3hours and the next run after that did not complete so not sure how long it would have taken it how long it had been running. See logs below:

20:36:48  1 of 5 START sql incremental model v1.couchdb .................................. [RUN]
20:39:42  1 of 5 OK created sql incremental model v1.couchdb ............................. [INSERT 0 137751 in 174.52s]
20:39:42  2 of 5 START sql incremental model v1.contact .................................. [RUN]
20:39:54  2 of 5 OK created sql incremental model v1.contact ............................. [INSERT 0 8468 in 12.29s]
20:39:54  3 of 5 START sql materialized_view model v1.contactview_person_fields .......... [RUN]
20:40:14  3 of 5 OK created sql materialized_view model v1.contactview_person_fields ..... [REFRESH MATERIALIZED VIEW in 19.78s]
20:40:14  4 of 5 START sql materialized_view model v1.contactview_chp .................... [RUN]
20:40:14  4 of 5 OK created sql materialized_view model v1.contactview_chp ............... [REFRESH MATERIALIZED VIEW in 0.14s]
20:40:14  5 of 5 START sql incremental model v1.chp_with_branches ........................ [RUN]
20:40:14  5 of 5 OK created sql incremental model v1.chp_with_branches ................... [INSERT 0 0 in 0.07s]
20:40:14
20:40:14  Finished running 3 incremental models, 2 materialized view models in 0 hours 3 minutes and 26.97 seconds (206.97s).
20:40:15
20:40:15  Completed successfully
20:40:15
20:40:15  Done. PASS=5 WARN=0 ERROR=0 SKIP=0 TOTAL=5
20:40:21  Running with dbt=1.8.0
20:40:21  Registered adapter: postgres=1.8.0
20:40:22  Found 8 models, 4 data tests, 1 source, 539 macros
20:40:22
20:40:22  Concurrency: 1 threads (target='default')
20:40:22
20:40:22  1 of 5 START sql incremental model v1.couchdb .................................. [RUN]
20:43:07  1 of 5 OK created sql incremental model v1.couchdb ............................. [INSERT 0 134751 in 165.08s]
20:43:07  2 of 5 START sql incremental model v1.contact .................................. [RUN]
20:43:14  2 of 5 OK created sql incremental model v1.contact ............................. [INSERT 0 8326 in 6.66s]
20:43:14  3 of 5 START sql materialized_view model v1.contactview_person_fields .......... [RUN]
20:44:22  3 of 5 OK created sql materialized_view model v1.contactview_person_fields ..... [REFRESH MATERIALIZED VIEW in 68.33s]
20:44:22  4 of 5 START sql materialized_view model v1.contactview_chp .................... [RUN]
20:44:22  4 of 5 OK created sql materialized_view model v1.contactview_chp ............... [REFRESH MATERIALIZED VIEW in 0.16s]
20:44:22  5 of 5 START sql incremental model v1.chp_with_branches ........................ [RUN]
20:44:22  5 of 5 OK created sql incremental model v1.chp_with_branches ................... [INSERT 0 0 in 0.07s]
20:44:22
20:44:22  Finished running 3 incremental models, 2 materialized view models in 0 hours 4 minutes and 0.51 seconds (240.51s).
20:44:22
20:44:22  Completed successfully
20:44:22
20:44:22  Done. PASS=5 WARN=0 ERROR=0 SKIP=0 TOTAL=5
20:44:29  Running with dbt=1.8.0
20:44:29  Registered adapter: postgres=1.8.0
20:44:30  Found 8 models, 4 data tests, 1 source, 539 macros
20:44:30
20:44:30  Concurrency: 1 threads (target='default')
20:44:30
20:44:30  1 of 5 START sql incremental model v1.couchdb .................................. [RUN]
20:47:26  1 of 5 OK created sql incremental model v1.couchdb ............................. [INSERT 0 188501 in 175.93s]
20:47:26  2 of 5 START sql incremental model v1.contact .................................. [RUN]
20:47:30  2 of 5 OK created sql incremental model v1.contact ............................. [INSERT 0 3860 in 4.31s]
20:47:30  3 of 5 START sql materialized_view model v1.contactview_person_fields .......... [RUN]
20:48:02  3 of 5 OK created sql materialized_view model v1.contactview_person_fields ..... [REFRESH MATERIALIZED VIEW in 31.83s]
20:48:02  4 of 5 START sql materialized_view model v1.contactview_chp .................... [RUN]
20:48:02  4 of 5 OK created sql materialized_view model v1.contactview_chp ............... [REFRESH MATERIALIZED VIEW in 0.15s]
20:48:02  5 of 5 START sql incremental model v1.chp_with_branches ........................ [RUN]
20:48:02  5 of 5 OK created sql incremental model v1.chp_with_branches ................... [INSERT 0 0 in 0.08s]
20:48:02
20:48:02  Finished running 3 incremental models, 2 materialized view models in 0 hours 3 minutes and 32.46 seconds (212.46s).
20:48:02
20:48:02  Completed successfully
20:48:02
20:48:02  Done. PASS=5 WARN=0 ERROR=0 SKIP=0 TOTAL=5
20:48:09  Running with dbt=1.8.0
20:48:09  Registered adapter: postgres=1.8.0
20:48:09  Found 8 models, 4 data tests, 1 source, 539 macros
20:48:09
20:48:09  Concurrency: 1 threads (target='default')
20:48:09
20:48:09  1 of 5 START sql incremental model v1.couchdb .................................. [RUN]
20:50:46  1 of 5 OK created sql incremental model v1.couchdb ............................. [INSERT 0 228376 in 156.43s]
20:50:46  2 of 5 START sql incremental model v1.contact .................................. [RUN]
20:50:52  2 of 5 OK created sql incremental model v1.contact ............................. [INSERT 0 5957 in 5.87s]
20:50:52  3 of 5 START sql materialized_view model v1.contactview_person_fields .......... [RUN]
20:51:23  3 of 5 OK created sql materialized_view model v1.contactview_person_fields ..... [REFRESH MATERIALIZED VIEW in 31.03s]
20:51:23  4 of 5 START sql materialized_view model v1.contactview_chp .................... [RUN]
20:51:23  4 of 5 OK created sql materialized_view model v1.contactview_chp ............... [REFRESH MATERIALIZED VIEW in 0.15s]
20:51:23  5 of 5 START sql incremental model v1.chp_with_branches ........................ [RUN]
20:51:23  5 of 5 OK created sql incremental model v1.chp_with_branches ................... [INSERT 0 0 in 0.07s]
20:51:23
20:51:23  Finished running 3 incremental models, 2 materialized view models in 0 hours 3 minutes and 13.73 seconds (193.73s).
20:51:23
20:51:23  Completed successfully
20:51:23
20:51:23  Done. PASS=5 WARN=0 ERROR=0 SKIP=0 TOTAL=5
20:51:30  Running with dbt=1.8.0
20:51:30  Registered adapter: postgres=1.8.0
20:51:30  Found 8 models, 4 data tests, 1 source, 539 macros
20:51:30
20:51:30  Concurrency: 1 threads (target='default')
20:51:30
20:51:30  1 of 5 START sql incremental model v1.couchdb .................................. [RUN]
20:54:52  1 of 5 OK created sql incremental model v1.couchdb ............................. [INSERT 0 219876 in 201.46s]
20:54:52  2 of 5 START sql incremental model v1.contact .................................. [RUN]
20:54:59  2 of 5 OK created sql incremental model v1.contact ............................. [INSERT 0 7616 in 6.70s]
20:54:59  3 of 5 START sql materialized_view model v1.contactview_person_fields .......... [RUN]
20:55:50  3 of 5 OK created sql materialized_view model v1.contactview_person_fields ..... [REFRESH MATERIALIZED VIEW in 51.51s]
20:55:50  4 of 5 START sql materialized_view model v1.contactview_chp .................... [RUN]
20:55:50  4 of 5 OK created sql materialized_view model v1.contactview_chp ............... [REFRESH MATERIALIZED VIEW in 0.17s]
20:55:50  5 of 5 START sql incremental model v1.chp_with_branches ........................ [RUN]
20:55:50  5 of 5 OK created sql incremental model v1.chp_with_branches ................... [INSERT 0 0 in 0.08s]
20:55:50
20:55:50  Finished running 3 incremental models, 2 materialized view models in 0 hours 4 minutes and 20.08 seconds (260.08s).
20:55:50
20:55:50  Completed successfully
20:55:50
20:55:50  Done. PASS=5 WARN=0 ERROR=0 SKIP=0 TOTAL=5
20:55:57  Running with dbt=1.8.0
20:55:57  Registered adapter: postgres=1.8.0
20:55:58  Found 8 models, 4 data tests, 1 source, 539 macros
20:55:58
20:55:58  Concurrency: 1 threads (target='default')
20:55:58
20:55:58  1 of 5 START sql incremental model v1.couchdb .................................. [RUN]
20:59:39  1 of 5 OK created sql incremental model v1.couchdb ............................. [INSERT 0 309589 in 221.17s]
20:59:39  2 of 5 START sql incremental model v1.contact .................................. [RUN]
21:00:02  2 of 5 OK created sql incremental model v1.contact ............................. [INSERT 0 14759 in 22.53s]
21:00:02  3 of 5 START sql materialized_view model v1.contactview_person_fields .......... [RUN]
21:00:22  3 of 5 OK created sql materialized_view model v1.contactview_person_fields ..... [REFRESH MATERIALIZED VIEW in 20.76s]
21:00:22  4 of 5 START sql materialized_view model v1.contactview_chp .................... [RUN]
21:00:23  4 of 5 OK created sql materialized_view model v1.contactview_chp ............... [REFRESH MATERIALIZED VIEW in 0.20s]
21:00:23  5 of 5 START sql incremental model v1.chp_with_branches ........................ [RUN]
21:00:23  5 of 5 OK created sql incremental model v1.chp_with_branches ................... [INSERT 0 0 in 0.09s]
21:00:23
21:00:23  Finished running 3 incremental models, 2 materialized view models in 0 hours 4 minutes and 24.93 seconds (264.93s).
21:00:23
21:00:23  Completed successfully
21:00:23
21:00:23  Done. PASS=5 WARN=0 ERROR=0 SKIP=0 TOTAL=5
21:00:29  Running with dbt=1.8.0
21:00:30  Registered adapter: postgres=1.8.0
21:00:30  Found 8 models, 4 data tests, 1 source, 539 macros
21:00:30
21:00:30  Concurrency: 1 threads (target='default')
21:00:30
21:00:30  1 of 5 START sql incremental model v1.couchdb .................................. [RUN]
21:04:01  1 of 5 OK created sql incremental model v1.couchdb ............................. [INSERT 0 302251 in 210.95s]
21:04:01  2 of 5 START sql incremental model v1.contact .................................. [RUN]
21:04:17  2 of 5 OK created sql incremental model v1.contact ............................. [INSERT 0 17502 in 16.10s]
21:04:17  3 of 5 START sql materialized_view model v1.contactview_person_fields .......... [RUN]
21:04:47  3 of 5 OK created sql materialized_view model v1.contactview_person_fields ..... [REFRESH MATERIALIZED VIEW in 30.16s]
21:04:47  4 of 5 START sql materialized_view model v1.contactview_chp .................... [RUN]
23:51:09  4 of 5 OK created sql materialized_view model v1.contactview_chp ............... [REFRESH MATERIALIZED VIEW in 9981.99s]
23:51:09  5 of 5 START sql incremental model v1.chp_with_branches ........................ [RUN]
23:51:10  5 of 5 OK created sql incremental model v1.chp_with_branches ................... [INSERT 0 0 in 0.21s]
23:51:10
23:51:10  Finished running 3 incremental models, 2 materialized view models in 2 hours 50 minutes and 39.58 seconds (10239.58s).
23:51:10
23:51:10  Completed successfully
23:51:10
23:51:10  Done. PASS=5 WARN=0 ERROR=0 SKIP=0 TOTAL=5
23:51:16  Running with dbt=1.8.0
23:51:17  Registered adapter: postgres=1.8.0
23:51:17  Found 8 models, 4 data tests, 1 source, 539 macros
23:51:17
23:51:17  Concurrency: 1 threads (target='default')
23:51:17
23:51:17  1 of 5 START sql incremental model v1.couchdb .................................. [RUN]
00:49:03  1 of 5 OK created sql incremental model v1.couchdb ............................. [INSERT 0 11661376 in 3465.96s]
00:49:03  2 of 5 START sql incremental model v1.contact .................................. [RUN]
01:01:16  2 of 5 OK created sql incremental model v1.contact ............................. [INSERT 0 677615 in 733.14s]
01:01:16  3 of 5 START sql materialized_view model v1.contactview_person_fields .......... [RUN]
01:03:59  3 of 5 OK created sql materialized_view model v1.contactview_person_fields ..... [REFRESH MATERIALIZED VIEW in 162.71s]
01:03:59  4 of 5 START sql materialized_view model v1.contactview_chp .................... [RUN]

@njuguna-n
Copy link
Contributor Author

Latest dbt run with about 10m docs synced results below. The good news is that it seems to be holding up better than previous tries and the lambda view makes sure the data is up to date even if the dbt run is delayed. The not so good news is that the time is slowly inching upwards as the DB size grows. The sync rate currently is about 700 docs per second. I'll let it run overnight and check back in the morning.

17:44:39  1 of 3 START sql incremental model v1.stable_couchdb ........................... [RUN]
17:49:28  1 of 3 OK created sql incremental model v1.stable_couchdb ...................... [INSERT 0 626876 in 288.78s]
17:49:28  2 of 3 START sql incremental model v1.contact .................................. [RUN]
17:51:14  2 of 3 OK created sql incremental model v1.contact ............................. [INSERT 0 43395 in 106.38s]
17:51:14  3 of 3 START sql incremental model v1.chp_with_branches ........................ [RUN]
17:51:14  3 of 3 OK created sql incremental model v1.chp_with_branches ................... [INSERT 0 0 in 0.11s]
17:51:14
17:51:14  Finished running 3 incremental models in 0 hours 6 minutes and 35.47 seconds (395.47s).

@njuguna-n
Copy link
Contributor Author

njuguna-n commented May 25, 2024

Using the lambda table and not running DBT in a continuous loop has so far yielded the best results (10 minute interval in this instance). The base table has ~44m docs now and the longest DBT run I see from a cursory look at the logs is 59 minutes and even then the latest data is still available from the lambda. I will cleanup the PR and have it form the basis for other models and hopefully it holds up with more models added.

CC: @derickl

22:17:03  1 of 3 START sql incremental model v1.stable_couchdb ........................... [RUN]
22:30:51  1 of 3 OK created sql incremental model v1.stable_couchdb ...................... [INSERT 0 1999501 in 827.92s]
22:30:51  2 of 3 START sql incremental model v1.contact .................................. [RUN]
22:37:22  2 of 3 OK created sql incremental model v1.contact ............................. [INSERT 0 104485 in 391.64s]
22:37:22  3 of 3 START sql incremental model v1.chp_with_branches ........................ [RUN]
22:40:15  3 of 3 OK created sql incremental model v1.chp_with_branches ................... [INSERT 0 0 in 172.51s]
22:40:15
22:40:15  Finished running 3 incremental models in 0 hours 23 minutes and 12.32 seconds (1392.32s).
22:40:15
22:40:15  Completed successfully
22:40:15
22:40:15  Done. PASS=3 WARN=0 ERROR=0 SKIP=0 TOTAL=3
22:50:18  Running with dbt=1.8.0
22:50:18  Registered adapter: postgres=1.8.0
22:50:19  Found 10 models, 4 data tests, 1 source, 539 macros
22:50:19
22:50:19  Concurrency: 1 threads (target='default')
22:50:19
22:50:19  1 of 3 START sql incremental model v1.stable_couchdb ........................... [RUN]
23:05:23  1 of 3 OK created sql incremental model v1.stable_couchdb ...................... [INSERT 0 1998126 in 904.33s]
23:05:23  2 of 3 START sql incremental model v1.contact .................................. [RUN]
23:12:28  2 of 3 OK created sql incremental model v1.contact ............................. [INSERT 0 124779 in 424.63s]
23:12:28  3 of 3 START sql incremental model v1.chp_with_branches ........................ [RUN]
23:17:20  3 of 3 OK created sql incremental model v1.chp_with_branches ................... [INSERT 0 0 in 291.80s]
23:17:20
23:17:20  Finished running 3 incremental models in 0 hours 27 minutes and 1.00 seconds (1621.00s).
23:17:20
23:17:20  Completed successfully
23:17:20
23:17:20  Done. PASS=3 WARN=0 ERROR=0 SKIP=0 TOTAL=3
23:27:23  Running with dbt=1.8.0
23:27:23  Registered adapter: postgres=1.8.0
23:27:24  Found 10 models, 4 data tests, 1 source, 539 macros
23:27:24
23:27:24  Concurrency: 1 threads (target='default')
23:27:24
23:27:24  1 of 3 START sql incremental model v1.stable_couchdb ........................... [RUN]
23:45:41  1 of 3 OK created sql incremental model v1.stable_couchdb ...................... [INSERT 0 2292751 in 1096.74s]
23:45:41  2 of 3 START sql incremental model v1.contact .................................. [RUN]
23:52:34  2 of 3 OK created sql incremental model v1.contact ............................. [INSERT 0 92145 in 412.98s]
23:52:34  3 of 3 START sql incremental model v1.chp_with_branches ........................ [RUN]
23:56:45  3 of 3 OK created sql incremental model v1.chp_with_branches ................... [INSERT 0 0 in 251.34s]
23:56:45
23:56:45  Finished running 3 incremental models in 0 hours 29 minutes and 21.31 seconds (1761.31s).
23:56:45
23:56:45  Completed successfully
23:56:45
23:56:45  Done. PASS=3 WARN=0 ERROR=0 SKIP=0 TOTAL=3
00:06:48  Running with dbt=1.8.0
00:06:49  Registered adapter: postgres=1.8.0
00:06:49  Found 10 models, 4 data tests, 1 source, 539 macros
00:06:49
00:06:50  Concurrency: 1 threads (target='default')
00:06:50
00:06:50  1 of 3 START sql incremental model v1.stable_couchdb ........................... [RUN]
00:29:57  1 of 3 OK created sql incremental model v1.stable_couchdb ...................... [INSERT 0 2486376 in 1387.79s]
00:29:57  2 of 3 START sql incremental model v1.contact .................................. [RUN]
00:36:43  2 of 3 OK created sql incremental model v1.contact ............................. [INSERT 0 120329 in 405.68s]
00:36:43  3 of 3 START sql incremental model v1.chp_with_branches ........................ [RUN]
00:39:36  3 of 3 OK created sql incremental model v1.chp_with_branches ................... [INSERT 0 0 in 173.23s]
00:39:36
00:39:36  Finished running 3 incremental models in 0 hours 32 minutes and 46.92 seconds (1966.92s).
00:39:36
00:39:36  Completed successfully
00:39:36
00:39:36  Done. PASS=3 WARN=0 ERROR=0 SKIP=0 TOTAL=3
00:49:39  Running with dbt=1.8.0
00:49:40  Registered adapter: postgres=1.8.0
00:49:40  Found 10 models, 4 data tests, 1 source, 539 macros
00:49:40
00:49:40  Concurrency: 1 threads (target='default')
00:49:40
00:49:40  1 of 3 START sql incremental model v1.stable_couchdb ........................... [RUN]
01:19:40  1 of 3 OK created sql incremental model v1.stable_couchdb ...................... [INSERT 0 2746251 in 1799.16s]
01:19:40  2 of 3 START sql incremental model v1.contact .................................. [RUN]
01:29:22  2 of 3 OK created sql incremental model v1.contact ............................. [INSERT 0 105824 in 582.36s]
01:29:22  3 of 3 START sql incremental model v1.chp_with_branches ........................ [RUN]
01:36:29  3 of 3 OK created sql incremental model v1.chp_with_branches ................... [INSERT 0 0 in 427.02s]
01:36:29
01:36:29  Finished running 3 incremental models in 0 hours 46 minutes and 48.79 seconds (2808.79s).
01:36:29
01:36:29  Completed successfully
01:36:29
01:36:29  Done. PASS=3 WARN=0 ERROR=0 SKIP=0 TOTAL=3
01:46:33  Running with dbt=1.8.0
01:46:33  Registered adapter: postgres=1.8.0
01:46:34  Found 10 models, 4 data tests, 1 source, 539 macros
01:46:34
01:46:34  Concurrency: 1 threads (target='default')
01:46:34
01:46:34  1 of 3 START sql incremental model v1.stable_couchdb ........................... [RUN]
02:28:54  1 of 3 OK created sql incremental model v1.stable_couchdb ...................... [INSERT 0 3433711 in 2540.05s]
02:28:54  2 of 3 START sql incremental model v1.contact .................................. [RUN]
02:37:09  2 of 3 OK created sql incremental model v1.contact ............................. [INSERT 0 127192 in 495.37s]
02:37:09  3 of 3 START sql incremental model v1.chp_with_branches ........................ [RUN]
02:40:47  3 of 3 OK created sql incremental model v1.chp_with_branches ................... [INSERT 0 0 in 217.80s]
02:40:47
02:40:47  Finished running 3 incremental models in 0 hours 54 minutes and 13.52 seconds (3253.52s).
02:40:47
02:40:47  Completed successfully
02:40:47
02:40:47  Done. PASS=3 WARN=0 ERROR=0 SKIP=0 TOTAL=3
02:50:51  Running with dbt=1.8.0
02:50:52  Registered adapter: postgres=1.8.0
02:50:52  Found 10 models, 4 data tests, 1 source, 539 macros
02:50:52
02:50:52  Concurrency: 1 threads (target='default')
02:50:52
02:50:52  1 of 3 START sql incremental model v1.stable_couchdb ........................... [RUN]
03:38:17  1 of 3 OK created sql incremental model v1.stable_couchdb ...................... [INSERT 0 3924918 in 2844.03s]
03:38:17  2 of 3 START sql incremental model v1.contact .................................. [RUN]
03:46:42  2 of 3 OK created sql incremental model v1.contact ............................. [INSERT 0 132891 in 505.79s]
03:46:42  3 of 3 START sql incremental model v1.chp_with_branches ........................ [RUN]
03:50:10  3 of 3 OK created sql incremental model v1.chp_with_branches ................... [INSERT 0 0 in 207.66s]
03:50:10
03:50:10  Finished running 3 incremental models in 0 hours 59 minutes and 17.77 seconds (3557.77s).
03:50:10
03:50:10  Completed successfully
03:50:10
03:50:10  Done. PASS=3 WARN=0 ERROR=0 SKIP=0 TOTAL=3
04:00:14  Running with dbt=1.8.0
04:00:14  Registered adapter: postgres=1.8.0
04:00:15  Found 10 models, 4 data tests, 1 source, 539 macros
04:00:15
04:00:15  Concurrency: 1 threads (target='default')
04:00:15
04:00:15  1 of 3 START sql incremental model v1.stable_couchdb ........................... [RUN]
04:28:31  1 of 3 OK created sql incremental model v1.stable_couchdb ...................... [INSERT 0 1966340 in 1696.67s]
04:28:31  2 of 3 START sql incremental model v1.contact .................................. [RUN]
04:37:49  2 of 3 OK created sql incremental model v1.contact ............................. [INSERT 0 1 in 557.94s]
04:37:49  3 of 3 START sql incremental model v1.chp_with_branches ........................ [RUN]
04:43:56  3 of 3 OK created sql incremental model v1.chp_with_branches ................... [INSERT 0 0 in 366.56s]
04:43:56
04:43:56  Finished running 3 incremental models in 0 hours 43 minutes and 41.45 seconds (2621.45s).
04:43:56
04:43:56  Completed successfully
04:43:56
04:43:56  Done. PASS=3 WARN=0 ERROR=0 SKIP=0 TOTAL=3
04:53:59  Running with dbt=1.8.0
04:54:00  Registered adapter: postgres=1.8.0
04:54:00  Found 10 models, 4 data tests, 1 source, 539 macros
04:54:00
04:54:01  Concurrency: 1 threads (target='default')
04:54:01
04:54:01  1 of 3 START sql incremental model v1.stable_couchdb ........................... [RUN]
05:01:01  1 of 3 OK created sql incremental model v1.stable_couchdb ...................... [INSERT 0 1 in 420.55s]
05:01:01  2 of 3 START sql incremental model v1.contact .................................. [RUN]
05:08:02  2 of 3 OK created sql incremental model v1.contact ............................. [INSERT 0 1 in 420.50s]
05:08:02  3 of 3 START sql incremental model v1.chp_with_branches ........................ [RUN]
05:09:48  3 of 3 OK created sql incremental model v1.chp_with_branches ................... [INSERT 0 0 in 105.77s]
05:09:48
05:09:48  Finished running 3 incremental models in 0 hours 15 minutes and 47.06 seconds (947.06s).
05:09:48
05:09:48  Completed successfully
05:09:48
05:09:48  Done. PASS=3 WARN=0 ERROR=0 SKIP=0 TOTAL=3
05:19:51  Running with dbt=1.8.0
05:19:51  Registered adapter: postgres=1.8.0
05:19:52  Found 10 models, 4 data tests, 1 source, 539 macros
05:19:52
05:19:52  Concurrency: 1 threads (target='default')
05:19:52
05:19:52  1 of 3 START sql incremental model v1.stable_couchdb ........................... [RUN]
05:26:52  1 of 3 OK created sql incremental model v1.stable_couchdb ...................... [INSERT 0 1 in 420.05s]
05:26:52  2 of 3 START sql incremental model v1.contact .................................. [RUN]
05:33:53  2 of 3 OK created sql incremental model v1.contact ............................. [INSERT 0 1 in 420.74s]
05:33:53  3 of 3 START sql incremental model v1.chp_with_branches ........................ [RUN]
05:34:37  3 of 3 OK created sql incremental model v1.chp_with_branches ................... [INSERT 0 0 in 43.99s]
05:34:37
05:34:37  Finished running 3 incremental models in 0 hours 14 minutes and 45.02 seconds (885.02s).
05:34:37
05:34:37  Completed successfully

@njuguna-n
Copy link
Contributor Author

PR merged

@github-project-automation github-project-automation bot moved this from In Progress to Done in Product Team Activities May 30, 2024
@njuguna-n
Copy link
Contributor Author

njuguna-n commented May 31, 2024

Summary of the testing tried out in this ticket:

A typical DBT run with 44 million documents already synced takes about 14 minutes. This is when there is no additional data being synced or when the rate of data sync is low. Previous iterations were taking a few hours and many of them did not complete even the initial sync.

All tests were conducted on an EKS instance with the Postgres pod having 6GB in memory and 6 vCPUs resource limits. The DBT pod has 4GB of memory and 4 vCPUs. The test data is from a clone of the production instance of BRAC Uganda. CouchDB records were being copied to Postgres at a rate of approximately 500 documents per second.

The idea of the branch was to restart the modelling process with minimal models and only have a few models that would be required for a specific BRAC dashboard widget to establish a basic pattern for modelling that would work. This is important to note because any production deployment will likely have more models which might impact the performance.

The basic structure recommended right now is to have base models defined as incremental tables. The first draft of recommended base tables can be found here. We also added a lambda view for the main base table to try and ensure that downstream models always have the latest data even when DBT takes long to run. An important note is that a lambda view has the most benefit if it is queried directly by the dashboard or another model that is a view. If a model that creates a table reads from a lambda view then the data is only added to the table once dbt run for that particular model is complete and while there is a small benefit in getting more up to date data it is not worth the extra overhead.

The biggest gains in terms of reducing the time it takes for DBT to complete to run have been caused by:

  • Using a macros in the incremental query for models
  • Adding indexes to columns used in queries
  • Adding an interval between DBT runs. In the test above the interval was 10 minutes but I have had similar results with a 3 minute interval as well.

The next questions we need to answer are:

  • How does this hold up with production-like data being synced
  • How does it hold up with additional models being added.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
No open projects
Archived in project
Development

No branches or pull requests

6 participants