Bobsled data types
Overview: Schema Inference and Mapping

Data types are not identical across file formats and databases. When transferring structured data from a source to a destination, Bobsled reads the columns and data types of the source data and maps them accurately to the destination.

When Bobsled reads data from a source, we infer the schema based on the file format and data source:

  • For self-describing file formats such as Parquet, we read the schema directly from the files. The same is true of data from a data-warehouse source.

  • For formats such as CSV and JSON, we auto-infer the schema.

This schema is represented using our internal Bobsled Data Types. They provide an interchange layer between sources and destinations and help providers understand how data will be delivered to downstream destinations.

Providers should therefore familiarize themselves with the Bobsled data types and can use this documentation to understand how Bobsled types drive destination data types.

Figure: Example of Bobsled schema inference and internal mapping.
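
As an illustration of the two inference paths, the sketch below reads a Parquet schema directly from a file and infers a CSV schema from the data. It uses pyarrow purely as an example library and hypothetical file names; it is not Bobsled's internal implementation.

```python
# Illustrative sketch only -- not Bobsled's internal implementation.
# Assumes pyarrow is installed; both file names are hypothetical.
import pyarrow.csv as pv
import pyarrow.parquet as pq

# Self-describing format: the schema is stored in the Parquet file footer.
parquet_schema = pq.read_schema("sample.parquet")
print(parquet_schema)        # e.g. id: int64, created_at: timestamp[us, tz=UTC]

# CSV: there is no embedded schema, so column types are inferred from the data.
csv_table = pv.read_csv("sample.csv")
print(csv_table.schema)      # types guessed by sampling the column values
```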


Bobsled Data Types

Bobsled Data Types include all the well-known SQL Data Types.

Primitive Data Types

| Data Type | Explanation |
| --- | --- |
| BINARY | Variable-length binary data. |
| BOOLEAN | TRUE, FALSE, or NULL. |
| DECIMAL(precision,scale) | Fixed-point decimal number. The precision and scale are limited by the source and destination systems or file formats. See the mapping sections below for details. |
| FLOAT | Double-precision (64-bit) floating-point number. |
| FLOAT<FLOAT4 \| FLOAT8> | Parameterized floating-point number; represents either a 32-bit or 64-bit value. Currently supported only when transferring data from Parquet to Databricks. |
| INTEGER | Integer data type. The range of values depends on the source and destination systems or file formats. |
| INTEGER<TINYINT \| SMALLINT \| INT \| BIGINT> | Parameterized integer data type; represents 1-, 2-, 4-, and 8-byte integers. Currently supported only when transferring data from Parquet to Databricks. |
| STRING | UTF-8 encoded string of varying length. |

Date and Time

| Data Type | Explanation |
| --- | --- |
| DATE | SQL date in the Gregorian calendar. |
| TIME_NTZ | A wall-clock TIME value (e.g. 10:23) irrespective of any time zone. |
| TIME_TZ | A wall-clock TIME value (e.g. 10:23 CET) in a specific time zone. |
| TIMESTAMP_NTZ | A wall-clock date-time value (e.g. 2024-04-01 10:23) irrespective of any time zone. Equivalent to DATETIME in some SQL dialects. |
| TIMESTAMP_TZ | A specific point in time (e.g. 2024-04-01 10:23 UTC). |
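
A quick way to picture the NTZ/TZ distinction is with Python datetimes (an illustration only): an NTZ value behaves like a naive datetime, while a TZ value pins down a specific instant.

```python
# Illustration of the NTZ vs TZ distinction using Python datetimes.
from datetime import datetime, timezone

wall_clock = datetime(2024, 4, 1, 10, 23)                          # like TIMESTAMP_NTZ: no time zone
point_in_time = datetime(2024, 4, 1, 10, 23, tzinfo=timezone.utc)  # like TIMESTAMP_TZ: a specific instant

print(wall_clock.tzinfo)       # None
print(point_in_time.tzinfo)    # UTC
```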

Geospatial Data Types

| Data Type | Explanation |
| --- | --- |
| GEOGRAPHY | GEOGRAPHY SQL data type as per the WGS 84 standard ↗ |
| GEOMETRY | GEOMETRY SQL data type as per the OpenGIS Simple Features Specification (PDF) ↗ |

Complex Data Types

| Data Type | Explanation |
| --- | --- |
| ARRAY<Primitive> | Variable-length list of elements of a primitive data type. All contained elements must be of the same data type. |
| ARRAY<COMPLEX> | COMPLEX is a notional data type indicating that the contained elements are arrays, structs, or maps. |
| JSON | JSON-formatted string. Also used for nested types such as structs and maps. |
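
Conceptually, the schema Bobsled infers is just an ordered list of column names paired with Bobsled types. The sketch below is a hypothetical way to write such a schema down; the Column class and the example columns are illustrative and not part of any Bobsled API.

```python
# Hypothetical illustration of a schema expressed in Bobsled data types.
# The Column class and the example columns are not part of a real Bobsled API.
from dataclasses import dataclass

@dataclass
class Column:
    name: str
    bobsled_type: str  # e.g. "DECIMAL(18,4)", "TIMESTAMP_TZ", "ARRAY<STRING>"

trades_schema = [
    Column("trade_id", "INTEGER"),
    Column("price", "DECIMAL(18,4)"),
    Column("executed_at", "TIMESTAMP_TZ"),
    Column("venue", "STRING"),
    Column("tags", "ARRAY<STRING>"),
]
```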


Mapping into Bobsled Data Types

Snowflake → Bobsled

| Snowflake | Bobsled Data Type |
| --- | --- |
| ARRAY | ARRAY<STRING> |
| VARIANT / OBJECT | JSON |
| BINARY / VARBINARY | BINARY |
| BOOLEAN | BOOLEAN |
| DATE | DATE |
| NUMBER(p,s) | DECIMAL(p,s) |
| INTEGER | DECIMAL(38,0). Snowflake internally stores integers as NUMBER with maximum precision. |
| FLOAT | FLOAT |
| GEOGRAPHY | GEOGRAPHY |
| GEOMETRY | GEOMETRY |
| TIME | TIME_NTZ |
| TIMESTAMP / TIMESTAMP_NTZ | TIMESTAMP_NTZ |
| TIMESTAMP_TZ | TIMESTAMP_TZ |
| VARCHAR | STRING |
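
One way to read this table is as a lookup from the Snowflake type to a Bobsled type, with parameterized types such as NUMBER(p,s) carried over with their precision and scale. The sketch below mirrors the table above for illustration; it is not Bobsled's actual mapping code.

```python
# Illustration of the Snowflake -> Bobsled mapping table above (not Bobsled's code).
import re

SNOWFLAKE_TO_BOBSLED = {
    "ARRAY": "ARRAY<STRING>",
    "VARIANT": "JSON",
    "OBJECT": "JSON",
    "BINARY": "BINARY",
    "VARBINARY": "BINARY",
    "BOOLEAN": "BOOLEAN",
    "DATE": "DATE",
    "INTEGER": "DECIMAL(38,0)",  # Snowflake stores integers as NUMBER with maximum precision
    "FLOAT": "FLOAT",
    "GEOGRAPHY": "GEOGRAPHY",
    "GEOMETRY": "GEOMETRY",
    "TIME": "TIME_NTZ",
    "TIMESTAMP": "TIMESTAMP_NTZ",
    "TIMESTAMP_NTZ": "TIMESTAMP_NTZ",
    "TIMESTAMP_TZ": "TIMESTAMP_TZ",
    "VARCHAR": "STRING",
}

def to_bobsled(snowflake_type: str) -> str:
    # NUMBER(p,s) keeps its precision and scale as DECIMAL(p,s).
    m = re.fullmatch(r"NUMBER\((\d+),\s*(\d+)\)", snowflake_type.upper())
    if m:
        return f"DECIMAL({m.group(1)},{m.group(2)})"
    return SNOWFLAKE_TO_BOBSLED[snowflake_type.upper()]

print(to_bobsled("NUMBER(18,4)"))  # DECIMAL(18,4)
print(to_bobsled("VARIANT"))       # JSON
```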

BigQuery → Bobsled

| BigQuery | Bobsled Type |
| --- | --- |
| ARRAY<Primitive> | ARRAY<Primitive> |
| ARRAY<STRUCT/MAP> | ARRAY<COMPLEX> |
| BYTES | BINARY |
| BOOLEAN | BOOLEAN |
| DATE | DATE |
| NUMERIC(p,s) | DECIMAL(p,s) |
| NUMERIC(38,0) / NUMERIC | DECIMAL(38,0) |
| BIGNUMERIC(p,s) | DECIMAL(p,s) |
| BIGNUMERIC(76,0) / BIGNUMERIC | FLOAT |
| INTEGER | INTEGER |
| FLOAT | FLOAT |
| GEOGRAPHY | GEOGRAPHY |
| JSON | JSON |
| MAP | JSON |
| STRUCT | JSON |
| VARCHAR | STRING |
| TIME | TIME_NTZ |
| DATETIME / TIMESTAMP | TIMESTAMP_NTZ |

Parquet/Delta-Lake → Bobsled

| Parquet | Bobsled Type |
| --- | --- |
| LIST<Primitive> | ARRAY<Primitive> |
| LIST<LIST/STRUCT/MAP> | ARRAY<COMPLEX> |
| BYTE_ARRAY | BINARY |
| BOOLEAN | BOOLEAN |
| DATE | DATE |
| NUMERIC(p,s) | DECIMAL(p,s) |
| TINYINT | INTEGER<TINYINT> |
| SMALLINT / UTINYINT | INTEGER<SMALLINT> |
| INTEGER / USMALLINT | INTEGER<INT> |
| BIGINT / UINTEGER | INTEGER<BIGINT> |
| INTERVAL | Not supported |
| FLOAT | FLOAT<FLOAT4> |
| DOUBLE | FLOAT<FLOAT8> |
| MAP | JSON |
| STRUCT | JSON |
| JSON | JSON |
| VARCHAR | STRING |
| TIME_NTZ (isAdjustedToUTC=false) | TIME_NTZ |
| TIME_TZ (isAdjustedToUTC=true) | TIME_TZ |
| TIMESTAMP_NTZ (isAdjustedToUTC=false) | TIMESTAMP_NTZ |
| TIMESTAMP_TZ (isAdjustedToUTC=true) | TIMESTAMP_TZ |
| UUID | STRING |
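
The TIME/TIMESTAMP rows above hinge on the isAdjustedToUTC flag of the Parquet logical type. The sketch below shows how that distinction surfaces when a Parquet schema is inspected with pyarrow, used here only as an illustration (the file name is hypothetical).

```python
# Illustration: how isAdjustedToUTC shows up when inspecting a Parquet schema
# with pyarrow (not Bobsled's implementation; the file name is hypothetical).
from datetime import datetime, timezone
import pyarrow as pa
import pyarrow.parquet as pq

table = pa.table({
    # Time-zone-aware timestamps are written with isAdjustedToUTC=true.
    "event_time_utc": pa.array([datetime.now(timezone.utc)], type=pa.timestamp("us", tz="UTC")),
    # Naive timestamps are written with isAdjustedToUTC=false.
    "event_time_local": pa.array([datetime(2024, 4, 1, 10, 23)], type=pa.timestamp("us")),
})
pq.write_table(table, "events.parquet")

for field in pq.read_schema("events.parquet"):
    bobsled_type = "TIMESTAMP_TZ" if field.type.tz else "TIMESTAMP_NTZ"
    print(field.name, field.type, "->", bobsled_type)
```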

CSV/JSON → Bobsled

| CSV / JSON | Bobsled Type |
| --- | --- |
| BINARY | BINARY |
| BOOLEAN | BOOLEAN |
| DATE | DATE |
| NUMERIC(p,s) | DECIMAL(p,s) |
| INT64 (and all other signed and unsigned INT types) | INTEGER |
| FLOAT (and other FLOAT types) | FLOAT |
| VARCHAR | STRING |
| TIME_NTZ | TIME_NTZ |
| TIME_TZ | TIME_NTZ |
| TIMESTAMP_NTZ | TIMESTAMP_NTZ |
| TIMESTAMP_TZ | TIMESTAMP_TZ |
| ARRAY<Primitive> | ARRAY<Primitive> |
| ARRAY<COMPLEX> | ARRAY<COMPLEX> |
| JSON | JSON |


Bobsled Types to Destinations Mapping

Bobsled → Snowflake

| Bobsled Type | Snowflake |
| --- | --- |
| ARRAY<Primitive> | ARRAY |
| ARRAY<COMPLEX> | ARRAY |
| BINARY | BINARY |
| BOOLEAN | BOOLEAN |
| DATE | DATE |
| DECIMAL(p,s), p <= 38 AND scale <= 37 | DECIMAL(p,s) |
| DECIMAL(p,s), p > 38 OR scale > 37 | STRING. If the source data contains a DECIMAL that can't be stored as a DECIMAL in Snowflake, we use a STRING. |
| FLOAT | FLOAT |
| GEOGRAPHY | GEOGRAPHY |
| GEOMETRY | GEOMETRY |
| INTEGER | INTEGER / DECIMAL(38,0) |
| JSON | VARIANT |
| STRING | STRING |
| TIME_NTZ | TIME |
| TIME_TZ | TIME |
| TIMESTAMP_NTZ | TIMESTAMP_NTZ |
| TIMESTAMP_TZ | TIMESTAMP_TZ |
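
The two DECIMAL rows encode a simple rule: a Bobsled DECIMAL(p,s) becomes a Snowflake DECIMAL only while p <= 38 and scale <= 37; anything wider is delivered as a STRING. A minimal sketch of that rule, for illustration only:

```python
# Illustrative check of the DECIMAL(p,s) rule for the Snowflake destination.
def snowflake_decimal_target(precision: int, scale: int) -> str:
    if precision <= 38 and scale <= 37:
        return f"DECIMAL({precision},{scale})"
    # DECIMALs that cannot be stored in Snowflake are delivered as STRING.
    return "STRING"

print(snowflake_decimal_target(18, 4))   # DECIMAL(18,4)
print(snowflake_decimal_target(76, 38))  # STRING
```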

Bobsled → BigQuery

| Bobsled Type | BigQuery |
| --- | --- |
| ARRAY<Primitive> | JSON |
| ARRAY<COMPLEX> | JSON |
| BINARY | BYTES |
| BOOLEAN | BOOLEAN |
| DATE | DATE |
| DECIMAL(p,s) | DECIMAL / BIGDECIMAL, depending on BigQuery's precision and scale rules. Learn more ↗ |
| FLOAT | FLOAT |
| GEOGRAPHY | GEOGRAPHY |
| GEOMETRY | BYTES |
| INTEGER | INTEGER |
| JSON | JSON |
| STRING | STRING |
| TIME_NTZ | TIME |
| TIME_TZ | TIME |
| TIMESTAMP_NTZ | TIMESTAMP |
| TIMESTAMP_TZ | TIMESTAMP |

Bobsled → Databricks

| Bobsled Type | Databricks |
| --- | --- |
| ARRAY<Primitive> | ARRAY<Primitive> |
| ARRAY<COMPLEX> | ARRAY<JSON> |
| BINARY | BINARY |
| BOOLEAN | BOOLEAN |
| DATE | DATE |
| DECIMAL(p,s), p <= 38 AND scale <= 37 | DECIMAL(p,s) |
| DECIMAL(p,s), p > 38 OR scale > 37 | STRING. If the source data contains a DECIMAL that can't be stored as a DECIMAL in Databricks, we use a STRING. |
| FLOAT<FLOAT4> | FLOAT |
| FLOAT<FLOAT8> | DOUBLE |
| FLOAT | DOUBLE |
| GEOGRAPHY | BINARY |
| GEOMETRY | BINARY |
| INTEGER<TINYINT> | TINYINT |
| INTEGER<SMALLINT> | SMALLINT |
| INTEGER<INT> | INT |
| INTEGER<BIGINT> | BIGINT |
| INTEGER | BIGINT |
| JSON | STRING |
| STRING | STRING |
| TIME_NTZ (Parquet) | Cannot be mapped. A source Parquet file containing TIME_NTZ cannot be loaded to Databricks. |
| TIME_NTZ (CSV) | STRING |
| TIME_TZ (Parquet) | Cannot be mapped. A source Parquet file containing TIME_TZ cannot be loaded to Databricks. |
| TIME_TZ (CSV) | STRING |
| TIMESTAMP_NTZ | TIMESTAMP_TZ. Currently, we do not support TIMESTAMP_NTZ in Databricks because it can't be shared using Delta Sharing. |
| TIMESTAMP_TZ | TIMESTAMP_TZ |

Bobsled → Parquet

Parquet mappings depend on the source of the data, as we rely on the Parquet writers of the source systems.

| Bobsled Type | Snowflake → Parquet | BigQuery → Parquet |
| --- | --- | --- |
| ARRAY<Primitive> | JSON | ARRAY<Primitive> |
| ARRAY<COMPLEX> | JSON | ARRAY<COMPLEX> |
| BINARY | BYTE_ARRAY | BYTE_ARRAY |
| BOOLEAN | BOOLEAN | BOOLEAN |
| DATE | DATE | DATE |
| DECIMAL(p,s) | DECIMAL(p,s) | DECIMAL(38,9) or DECIMAL(76,38). BigQuery widens the precision/scale to the maximum when unloading to Parquet. |
| FLOAT | FLOAT | FLOAT |
| GEOGRAPHY | BYTE_ARRAY | BYTE_ARRAY |
| GEOMETRY | BYTE_ARRAY | N/A |
| INTEGER | DECIMAL(38,0). Snowflake stores integers as DECIMAL(38,0). | INT64 |
| JSON | STRING | STRING |
| STRING | STRING | STRING |
| TIME_NTZ | TIME_NTZ (isAdjustedToUTC=false) | TIME_NTZ (isAdjustedToUTC=false) |
| TIME_TZ | TIME_TZ (isAdjustedToUTC=true) | TIME_TZ (isAdjustedToUTC=true) |
| TIMESTAMP_NTZ | TIMESTAMP_NTZ (isAdjustedToUTC=false) | TIMESTAMP_NTZ (isAdjustedToUTC=false) |
| TIMESTAMP_TZ | TIMESTAMP_TZ (isAdjustedToUTC=true) | TIMESTAMP_TZ (isAdjustedToUTC=true) |
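
Because the physical Parquet types depend on the upstream writer, consumers can inspect a delivered file to see exactly what was written. A minimal sketch, assuming pyarrow and a hypothetical file path:

```python
# Inspect the schema of a delivered Parquet file (illustrative sketch;
# the path is hypothetical).
import pyarrow.parquet as pq

schema = pq.read_schema("delivered/part-0000.parquet")
for field in schema:
    print(field.name, field.type)
# Per the table above, an INTEGER column typically appears as DECIMAL(38,0)
# when the source is Snowflake and as INT64 when the source is BigQuery.
```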

