Transferring tables to a data warehouse destination



When delivering data from a data warehouse source to a data warehouse destination, Bobsled mirrors the source tables in the destination, prioritizing correctness and efficiency.

Replication Methods

There are four ways to replicate tables from one data warehouse to another. Which one to choose depends on the ETL/ELT approach in your data warehouse as well as the scale of the data.

Full table replication

Using the full table replication method, Bobsled will extract the full table from the source data warehouse and replace the contents of the corresponding table in the destination.

This pattern is well-suited for tables where the ETL overwrites much of the data on each pipeline run, or where no timestamps are maintained to indicate which records have changed. It is also well suited for small tables whose contents don't change often.
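Conceptually, each run boils down to an atomic swap of the destination table. The following is a minimal SQL sketch with hypothetical staging and destination table names; Bobsled performs the equivalent internally:

    -- Replace the destination table transactionally with the full extract.
    -- Table names are hypothetical; Bobsled manages the staging step itself.
    CREATE OR REPLACE TABLE dest_db.public.customers AS
    SELECT * FROM staging.customers_full_extract;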

Cursor to control full table replication on views in BigQuery

If needed, you can use a 'last_modified_at' field to control data copying from BigQuery views. The field acts as a bookmark for the last time the data changed. To set this up, create a new source share in the BigQuery UI and choose the appropriate column as the 'last_modified_at' bookmark in the view settings. On each sync, Bobsled checks whether the maximum value of the 'last_modified_at' field has increased since the previous sync. This only works for views, not tables. If you don't specify a 'last_modified_at' column, full table replication views are refreshed automatically every 24 hours.
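For illustration, a view of the following shape would work, assuming a hypothetical source table with an updated_at timestamp; the aliased column is the one you would select as the 'last_modified_at' bookmark in the view settings:

    -- BigQuery view exposing a bookmark column for full table replication.
    CREATE OR REPLACE VIEW `my_project.my_dataset.orders_share` AS
    SELECT
      order_id,
      status,
      updated_at AS last_modified_at  -- chosen as the bookmark in the view settings
    FROM `my_project.my_dataset.orders`;

    -- Between syncs, Bobsled effectively checks whether this value has increased:
    SELECT MAX(last_modified_at) FROM `my_project.my_dataset.orders_share`;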

Incremental replication - Append Only

Using the incremental replication - append only method, Bobsled will identify new records based on a "created at" field in the table. This field serves as a bookmark: each time Bobsled runs, new rows with a higher value in that field are identified and transferred to the destination, as sketched below.

This pattern is well-suited for cases where the rows in the table are immutable records and old records are not updated (e.g. event data or logs). 

You will be required to provide Bobsled with one value in your dataset: a created at field.

  • Created at: A key or composite key that Bobsled will use to identify new records. If your dataset does not track when new records are created, consider using a different loading pattern.  
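The bookmark logic is roughly equivalent to the following query. Table and column names are hypothetical, and the timestamp literal stands in for the highest created at value seen on the previous run:

    -- Select only rows created since the last run's bookmark, then append them.
    SELECT *
    FROM my_schema.events
    WHERE created_at > TIMESTAMP '2024-05-16 00:00:00'  -- previous run's MAX(created_at)
    ORDER BY created_at;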

Incremental replication - Update and Append

Using the incremental replication - update and append method, Bobsled will identify newly added or updated records based on an "updated at" field in the table. Bobsled will then replicate those records to the destination warehouse and ensure that only the latest version of each record, identified by its unique identifier, is present in the destination table.

This method is well-suited for datasets where records evolve or move through various states, such as e-commerce orders whose status can change. With each automation, all new data is promoted transactionally, regardless of whether the records were newly appended or updated.

You will be required to provide Bobsled with two values in your dataset: a unique identifier and a last modified at field.

  • Unique identifier: A key or composite key that should be unique and distinctive for each record that will be updated.

  • Last modified at: A key that Bobsled will use to identify the most recent version of each record. If your dataset does not have a last modified value for Bobsled to identify, consider using a different loading pattern.

  • Delete flag: An optional key that Bobsled will use to determine whether a given row should be deleted from the destination table. Bobsled will ensure rows marked for deletion are not added, and will remove any existing row whose unique identifier matches a new record marked for deletion.

  • Created at: An optional key that Bobsled will use to filter down the MERGE statement run in the destination data warehouse, improving load performance by pruning files that contain historical records. Use it when the table is large or updates are very frequent, and only when your created_at field never changes and is not nullable; nulls in the source data can lead to duplicates. When using the created at key, it is suggested to cluster the table by that field for optimal loading performance and efficiency.
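Taken together, these fields imply a MERGE of roughly the following shape. This is a hedged sketch with hypothetical table and column names, not Bobsled's exact statement; the created at bound in the join condition is what enables the file pruning described above:

    -- Promote staged changes so only the latest version of each record survives.
    MERGE INTO dest_db.public.orders AS t
    USING staged_changes AS s
      ON t.order_id = s.order_id          -- unique identifier
     AND t.created_at >= '2024-01-01'     -- optional pruning bound: the batch's minimum created_at
    WHEN MATCHED AND s.is_deleted = TRUE THEN DELETE   -- delete flag
    WHEN MATCHED THEN UPDATE SET
      t.status = s.status,
      t.updated_at = s.updated_at                      -- latest version wins
    WHEN NOT MATCHED AND s.is_deleted = FALSE THEN
      INSERT (order_id, status, created_at, updated_at)
      VALUES (s.order_id, s.status, s.created_at, s.updated_at);

Clustering the destination table by the created at field (for example, ALTER TABLE dest_db.public.orders CLUSTER BY (created_at);) keeps this pruning effective as the table grows.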

Row based replication - Change tracking (Snowflake Source only)

Bobsled will identify newly updated records by querying the Snowflake change stream for newly inserted, updated, or deleted rows. This requires change tracking to be enabled on the table, or on every table referenced within a view.
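In Snowflake, enabling change tracking and reading the change stream look roughly like this (table name hypothetical, and the timestamp literal stands in for the point of the last sync):

    -- Enable change tracking on the source table.
    ALTER TABLE source_db.public.orders SET CHANGE_TRACKING = TRUE;

    -- Query the change stream for inserts, updates, and deletes since the last sync.
    SELECT *
    FROM source_db.public.orders
      CHANGES (INFORMATION => DEFAULT)
      AT (TIMESTAMP => '2024-05-15 00:00:00'::TIMESTAMP_LTZ);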

This method is well-suited for datasets where records evolve or move through various states such as e-commerce orders where the status can change. With each automation, all new data is promoted transactionally regardless of whether the records were newly appended or updated.

Use this method if rows are deleted from your source table and you need to ensure they are deleted downstream.

You will be required to provide Bobsled with one value in your dataset: a unique identifier.

  • Unique identifier: A key or composite key that should be unique and distinctive for each record that will be updated.

  • Created at: An optional key that Bobsled will use to filter down the MERGE statement run in the destination data warehouse, improving load performance by pruning files that contain historical records. Use it when the table is large or updates are very frequent, and only when your created_at field never changes and is not nullable. When using the created at key, it is suggested to cluster the table by that field for optimal loading performance.

Note - if a share is paused for longer than the data retention window, a backfill will be automatically kicked off. We therefore recommend increasing the Time Travel window on tables used in the change tracking replication pattern, to enable pausing when required.
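In Snowflake, this means raising the table's Time Travel retention, for example:

    -- Hypothetical table; 30 days is an illustrative value, not a Bobsled requirement.
    ALTER TABLE source_db.public.orders SET DATA_RETENTION_TIME_IN_DAYS = 30;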

Backfills

Bobsled lets you manually trigger a backfill of a table when you need to restate it. On the next scheduled sync, Bobsled will extract the full table and transactionally replace the existing table in the destination.

Backfills can be manually requested in the UI. Bobsled will also automatically trigger a backfill when required in certain cases (e.g. replication pattern change).

Empty source tables

In Bobsled, if you initiate an operation with empty tables in your data warehouse source, the operation will complete successfully, but no data will be delivered to the corresponding destination tables. This allows operations to proceed without any error handling, even when starting from empty source tables.

