Transferring files to a data warehouse destination
  • 29 Nov 2023
  • 7 Minutes to read


Article Summary

To deliver data from a file storage source to a data warehouse destination, Bobsled turns the underlying files in the selected folders of the source into tables in the data warehouse destination.

File Format

The following file formats are supported:

  • Parquet
  • Delta Lake
  • CSV 
  • JSON 
  • Avro

Table Schemas

Bobsled extracts the schema from the underlying files and uses it to create the tables in the data warehouse destination. Bobsled uses the destination warehouse's standard data types, providing maximum flexibility.

For self-describing file formats (such as Parquet, Delta Lake, and Avro), the schema is read from an arbitrary file in the folder.

For file formats without a defined schema, Bobsled will auto-detect the schema. 

CSV and JSON auto-inference 

Bobsled will automatically infer the schema and types for each column in a CSV or JSON file. Type detection works by taking a sample of the data and attempting to convert the values in each column to a set of candidate types. If a conversion fails, that candidate type is removed from the set of candidates for the column. After all samples have been processed, the remaining candidate type with the highest priority is chosen.
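The elimination process described above can be sketched as follows. This is a minimal illustration, not Bobsled's actual implementation; the candidate types, priority order, and parsing rules here are assumptions chosen for the example.

```python
# Illustrative sketch of candidate-elimination type inference for a CSV column.
# Candidate types and their priority order are assumptions, not Bobsled internals.
from datetime import date

def _is_int(value):
    try:
        int(value)
        return True
    except ValueError:
        return False

def _is_float(value):
    try:
        float(value)
        return True
    except ValueError:
        return False

def _is_date(value):
    try:
        date.fromisoformat(value)
        return True
    except ValueError:
        return False

# Highest-priority (most specific) type first; VARCHAR always succeeds.
CANDIDATES = [
    ("INTEGER", _is_int),
    ("DOUBLE", _is_float),
    ("DATE", _is_date),
    ("VARCHAR", lambda value: True),
]

def infer_column_type(sample):
    """Remove candidate types that fail to parse any sampled value,
    then return the highest-priority survivor."""
    remaining = list(CANDIDATES)
    for value in sample:
        remaining = [(name, check) for name, check in remaining if check(value)]
    return remaining[0][0]
```

For example, `infer_column_type(["1", "2", "3"])` yields `INTEGER`, while a sample containing `"1.5"` falls back to `DOUBLE`, and one containing non-numeric text falls back to `VARCHAR`.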

CSV Loading Options

Bobsled sets warehouse-specific CSV options to load data from CSV files reliably. The options used for each warehouse are listed below:

Snowflake

Option Name | Setting | Definition
ENCODING | UTF8 | Specifies the character set of the source data.
COMPRESSION | AUTO | Specifies the compression algorithm used for data files.
BINARY_FORMAT | HEX | Defines the encoding format for binary string values in the data files.
FIELD_DELIMITER | Auto-detected by Bobsled | Defines the character or string used to separate fields (columns) within each row (line) in the data file.
ESCAPE | NONE | Specifies the escape character used for enclosed field values.
ESCAPE_UNENCLOSED_FIELD | '\\' | Specifies the escape character used for unenclosed field values.
RECORD_DELIMITER | '\n' | One or more characters that separate records in an input file. There is no option to set a custom record delimiter.
ERROR_ON_COLUMN_COUNT_MISMATCH | FALSE | If set to FALSE, an error is not generated and the load continues.
NULL_IF | ['', '\\N', 'NULL'] | Specifies the strings that represent NULL values.
EMPTY_FIELD_AS_NULL | FALSE | When loading data, specifies whether to insert SQL NULL for empty fields in an input file.
FIELD_OPTIONALLY_ENCLOSED_BY | '"' | Character used to enclose strings.
SKIP_HEADER | 1 | Skips the header row containing column names.
TIMESTAMP_FORMAT | AUTO | Defines the format of timestamp values in the data files.
TIME_FORMAT | AUTO | Defines the format of time string values in the data files.
DATE_FORMAT | AUTO | Defines the format of date values in the data files.

BigQuery

Option Name | Setting | Description
encoding | UTF8 | The character encoding of the data.
fieldDelimiter | Auto-detected by Bobsled | Defines the character or string used to separate fields (columns) within each row (line) in the data file.
quote | '"' | The value that is used to quote data sections in a CSV file.
skipLeadingRows | 1 | Skips the header row containing column names.
nullMarker | '' | Specifies a string that represents a null value.
ignoreUnknownValues | TRUE | When set to true, rows with extra column values that do not match the table schema are ignored and are not loaded.
allowJaggedRows | TRUE | Accept rows that are missing trailing optional columns. The missing values are treated as nulls.

Databricks

Option Name | Setting | Description
encoding | UTF8 | The name of the encoding of the CSV files.
delimiter | Auto-detected by Bobsled | The separator string between columns.
escape | '\' | The escape character to use when parsing the data.
mode | PERMISSIVE | Nulls are inserted for fields that could not be parsed correctly.
lineSep | NONE (default) | A string between two consecutive CSV records. The default covers the line separators '\r', '\r\n', and '\n'.
nullValue | '' | String representation of a null value.
quote | '"' | The character used for escaping values where the field delimiter is part of the value.
skipRows | 1 | Skips the header row containing column names.
timestampFormat | yyyy-MM-dd'T'HH:mm:ss[.SSS][XXX] (default) | The format for parsing timestamp strings.
dateFormat | yyyy-MM-dd (default) | The format for parsing date strings.


Table Loading Patterns

There are four different ways to load the folders into tables in the data warehouse. Each pattern controls how the data is loaded and transformed in the warehouse so that it is ready for the consumer.

Append Only

Using the append only loading pattern, Bobsled will load all records in the file(s) in the underlying folder into the destination table. This is the most basic loading option: it simply matches the folder contents with the destination table.

This pattern is well-suited for cases where new files in the underlying folder contain new records. A few examples of a good fit for the “Append only” loading pattern are event data or market price data where new data will be logged.

Update and append

Using the update and append loading pattern, Bobsled will ensure that only the latest version of each record is available in the destination table. To do this, you provide a key that lets Bobsled identify each record's latest state.

This pattern is well-suited for datasets where records evolve or move through various states. A few examples of a good fit for the “Update and Append” loading pattern are datasets of orders that include changes to order status or company reference information where details can change.

This pattern is transactionally managed: for each automation run, all new data is promoted simultaneously, regardless of whether the records were newly appended or updated.

When using update and append, you will be required to provide Bobsled with two values in your dataset: a unique identifier and a last modified at field.

  • Unique identifier: A key or composite key that uniquely identifies each record to be updated.
  • Last modified at: A key or composite key that Bobsled will use to identify the most recent version of each record. If your dataset does not have a last modified value for Bobsled to identify, consider using a different loading pattern.
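The merge semantics described above can be sketched as follows. This is a minimal illustration of the update-and-append behavior, not Bobsled's implementation; the field names `order_id` and `updated_at` are assumptions for the example.

```python
# Illustrative sketch of update-and-append: keep only the latest version of
# each record, identified by a unique key and ordered by a last-modified-at
# field. Field names are example assumptions, not Bobsled internals.
def update_and_append(table, new_records, key="order_id", modified="updated_at"):
    """Merge new_records into table so that each key keeps only its
    most recently modified version."""
    latest = {row[key]: row for row in table}
    for row in new_records:
        current = latest.get(row[key])
        # Replace the existing record only if the incoming one is newer.
        if current is None or row[modified] >= current[modified]:
            latest[row[key]] = row
    return list(latest.values())
```

For example, merging a batch that contains a newer `status` for an existing `order_id` replaces that row, while rows with unseen keys are appended.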

Recordset Overwrite

Using the recordset overwrite loading pattern, Bobsled will ensure that only the latest version of a given set of related records (recordset) is present. This pattern is well-suited for datasets where one event or record is stored on multiple rows and all rows need inserting, updating or deleting at the same time. Bobsled ensures that only the latest version of the recordset is present and queryable. 

This pattern is transactionally managed: for each automation run, all new data is promoted simultaneously, regardless of whether the records were newly added, updated, or deleted.

When using recordset overwrite, you will be required to provide Bobsled with two values in your dataset: a matching key and a last modified at field.

  • Matching key: A key or composite key that is used to relate all records within the same recordset.
  • Last modified at: A key or composite key that Bobsled will use to identify the most recent version of the recordset. If your dataset does not have a last modified value for Bobsled to identify, consider using a different loading pattern.

Notes:

  • If multiple rows are duplicates, with the same matching key(s) and last modified at key(s), all such rows will be present in the destination table.
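The recordset-overwrite behavior can be sketched as follows. This is a minimal illustration under assumed field names (`event_id` as the matching key, `modified_at` as the last-modified-at field), not Bobsled's implementation.

```python
# Illustrative sketch of recordset overwrite: rows sharing a matching key
# form a recordset, and only the recordset version with the newest
# last-modified-at value survives. Field names are example assumptions.
def recordset_overwrite(table, new_records, match="event_id", modified="modified_at"):
    """Replace each whole recordset with its latest version."""
    merged = table + new_records
    # Newest modification time seen per recordset.
    latest = {}
    for row in merged:
        key = row[match]
        if key not in latest or row[modified] > latest[key]:
            latest[key] = row[modified]
    # Keep every row belonging to its recordset's latest version; duplicate
    # rows within that version are kept, as noted above.
    return [row for row in merged if row[modified] == latest[row[match]]]
```

Note that a new version replaces all rows of the old recordset at once, even if the new version has fewer rows than the old one.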

Overwrite

Using the overwrite loading pattern, Bobsled will take the new file(s) in an underlying folder and overwrite the existing data by truncating the destination table, then loading all records in the latest file(s).


This pattern is well-suited for folders where newly written files represent the latest complete data set. This is a common approach when a large portion of the records in the table are altered and overwriting may be more efficient than updating, inserting, and deleting records in the table; or when the data set is sufficiently small that reloading the table is not an expensive operation.


Tables are transactionally updated; there will never be an empty or incomplete destination table during a Bobsled load.


Bobsled supports two methods for identifying the new complete set of data: latest file and completion file.


Latest file


Latest file is a good option for smaller tables where data is written into a single file. The latest file method will find the latest non-hash file in the folder or sub-folders associated with a given table and consider that the full state of the table.
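The latest-file selection can be sketched as follows. This is a minimal illustration under the assumption that hidden and meta files (names starting with `.` or `_`) are skipped; it is not Bobsled's actual file-matching logic.

```python
# Illustrative sketch of the "latest file" method: pick the most recently
# modified data file under a folder and treat it as the full table state.
# Skipping names starting with "." or "_" is an assumption for the example.
from pathlib import Path

def find_latest_file(folder):
    """Return the most recently modified data file under folder, or None."""
    candidates = [
        p for p in Path(folder).rglob("*")
        if p.is_file() and not p.name.startswith((".", "_"))
    ]
    if not candidates:
        return None
    return max(candidates, key=lambda p: p.stat().st_mtime)
```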


Completion file


Completion file is a good option for larger tables that write data in multi-part files. 

To ensure Bobsled is able to identify the right files that represent the latest version of the table, there are a few requirements for how the data must be structured in the source folders:


  • Each set of files representing a version of the table must be in their own folder
  • A meta file must be written to identify when the complete set of data has been written to the folder (e.g. a file named _SUCCESS). It is important to write the completion file after the full set of data has been written so Bobsled moves the full set of data for the given table.

Example folder structure showing a _SUCCESS file


Upon seeing a new or updated meta file (most recent by modified date), Bobsled will replace the contents of the destination table with the records in the file(s) in the same folder as the latest meta file.


As new versions are written to object storage with new completion meta files, Bobsled will continue to update the table with the latest set of data as identified by the latest completion meta file.
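The completion-file method can be sketched as follows. This is a minimal illustration, not Bobsled's implementation; the meta-file name `_SUCCESS` is just the example from the requirements above, and in practice you configure the name your pipeline uses.

```python
# Illustrative sketch of the completion-file method: find the most recently
# modified meta file (e.g. _SUCCESS) under the root folder, then collect the
# data files that sit in the same folder as that meta file.
from pathlib import Path

def files_for_latest_version(root, meta_name="_SUCCESS"):
    """Return the data files in the folder of the newest meta file."""
    metas = [p for p in Path(root).rglob(meta_name) if p.is_file()]
    if not metas:
        return []
    newest = max(metas, key=lambda p: p.stat().st_mtime)
    # All sibling files of the meta file form the latest table version.
    return [
        p for p in newest.parent.iterdir()
        if p.is_file() and p.name != meta_name
    ]
```

Because the meta file is written last, every file returned alongside it belongs to a complete version of the table.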


When using overwrite, you will be required to provide Bobsled with the name of the meta file that your pipeline uses.


  • Name of file: the name of the file for Bobsled to use to identify when a complete version is available

Advanced Settings

Bobsled supports advanced settings for data engineers who want to optimize specific table delivery options.

