Data types
  • 26 Apr 2024
  • 2 Minutes to read
  • Dark
    Light
  • PDF

Data types

  • Dark
    Light
  • PDF

Article Summary

Schema Inference and Mapping

It’s a well known fact that Data Types are not identical across various file formats and databases. When transferring structured data from a source to destination, Bobsled takes care of reading the columns and data types of source data and mapping them accurately to the destination.

When we read the data from a source, we infer the schema based on the file format and data source. For self-describing file formats such as Parquet, we read the schema directly from the files. The same is true of data from a data-warehouse source. For formats like CSVs and JSONs, we auto-infer the schema.

This schema is represented using our internal Bobsled Data Types. These Bobsled data types are designed to provide an interchange layer between various sources and destinations and to help providers better understand the way data will deliver in downstream destinations.

Providers should therefore familiarize themselves with the Bobsled data types and can use this documentation to understand how Bobsled types drive destination data types.

Bobsled Data Types

Bobsled Data Types include all the well known SQL Data Types

Primitive Data Types

Data TypeExplanation
BINARYVariable length Binary Data.
BOOLEANTRUE or FALSE or NULL
DECIMAL(precision,scale)Represents a Fixed Point Decimal Number. The precision and scale are limited by the source and destination systems or file formats. See the sections on Mappings for details.
FLOATDouble Precision (64 bit) Floating Point Number
INTEGERINTEGER data type. Range of value depends on source and destination systems or file formats. 
STRINGUTF-8 encoded String of varying length.

Date and Time

Data TypeExplanation
DATESQL Date in the Gregorian Calendar
TIME_NTZRepresents a wall-clock TIME value (for ex. 10:23) irrespective of any time zone.
TIME_TZRepresents a wall-clock TIME value (for ex. 10:23 CET) in a specific time zone.
TIMESTAMP_NTZRepresents a specific wall-clock date-time value (for ex. 2024-04-01 10:23) irrespective of any time zone. Equivalent to DATETIME in some SQL dialects.
TIMESTAMP_TZRepresents a specific point in time (for ex. 2024-04-01 10:23 UTC)

Geospatial Data Types

Data TypeExplanation
GEOGRAPHYGEOGRAPHY SQL data type as per the WGS 84 standard (https://en.wikipedia.org/wiki/World_Geodetic_System)
GEOMETRYGEOMETRY SQL data type as per https://portal.ogc.org/files/?artifact_id=829

Complex Data Types

Data TypeExplanation
ARRAY<Primitive>Variable length list of elements of simple data types. Containing elements must be of the dame data type.
ARRAY<COMPLEX>COMPLEX is a notional data type to indicate that the contained element is either an array, struct or map.
JSONJSON-formatted String. Also used for nested types such as Structs and Maps


Snowflake → Bobsled

SnowflakeBobsled Data Type
ARRAY/VARIANT/OBJECTJSON
BINARY/VARBINARYBINARY
BOOLEANBOOLEAN
DATEDATE
NUMBER(p,s)DECIMAL(p,s)
INTEGERDECIMAL(38,0). Snowflake internally stores INTEGERS as NUMBERS with maximum precision.
FLOATFLOAT
GEOGRAPHYGEOGRAPHY
GEOMETRYGEOMETRY
TIMETIME_NTZ
TIMESTAMP/TIMESTAMP_NTZTIMESTAMP_NTZ
TIMESTAMP_TZTIMESTAMP_TZ
VARCHARSTRING

BigQuery → Bobsled

BigQueryBobsled Type
ARRAY<Primitive>ARRAY<Primitive>
ARRAY<STRUCT/MAP>ARRAY<COMPLEX>
BYTESBINARY
BOOLEANBOOLEAN
DATEDATE
NUMERIC(p,s)DECIMAL(p,s)
NUMERIC(38,0) / NUMERICDECIMAL(38,0)
BIGNUMERIC(p,s)DECIMAL(p,s)
BIGNUMERIC(76,0)/ BIGNUMERICFLOAT
INTEGERINTEGER
FLOATFLOAT
GEOGRAPHYGEOGRAPHY
JSONJSON
MAPJSON
STRUCTJSON
VARCHARSTRING
TIMETIME_NTZ
DATETIMETIMESTAMP_NTZ
TIMESTAMP_TZTIMESTAMP_TZ

Parquet / Delta-Lake → Bobsled

ParquetBobsled Type
LISTARRAY
BYTES_ARRAYBINARY
BOOLEANBOOLEAN
DATEDATE
NUMERIC(p,s)DECIMAL(p,s)
INT64INTEGER
INTERVALNot Supported
FLOATFLOAT
MAPJSON
STRUCTJSON
VARCHARSTRING
TIME_NTZ (isAdjustedToUTC=false)TIME_NTZ
TIME_TZ (isAdjustedToUTC=true)TIMESTAMP_TZ
TIMESTAMP_NTZ (isAdjustedToUTC=false)TIMESTAMP_NTZ
TIMESTAMP_TZ (isAdjustedToUTC=true)TIMESTAMP_TZ
UUIDSTRING

CSV / JSON → Bobsled

CSVBobsled Type
BINARYBINARY
BOOLEANBOOLEAN
DATEDATE
NUMERIC(p,s)DECIMAL(p,s)
INT64 (and all other signed and unsigned INT types)INTEGER
FLOAT (and other FLOAT types)FLOAT
VARCHARSTRING
TIMENTZTIME_NTZ
TIMETZTIME_NTZ
TIMESTAMPNTZTIMESTAMP_NTZ
TIMESTAMP_TZTIMESTAMP_TZ


Bobsled Types to Destinations Mapping

Bobsled → Snowflake

Bobsled TypeSnowflake
ARRAY<Primitive>VARIANT
ARRAY<COMPLEX>VARIANT
BINARYBINARY
BOOLEANBOOLEAN
DATEDATE
DECIMAL(p,s) p<=38 AND scale<=37DECIMAL(p,s)
DECIMAL(p,s) p>38 OR scale> 37STRING . If the source data contains a DECIMAL that can’t be stored as a DECIMAL in Snowflake, we use a STRING.
FLOATFLOAT
GEOGRAPHYGEOGRAPHY
GEOMETRYGEOMETRY
INTEGERINTEGER / DECIMAL(38,0)
JSONSTRING
STRINGSTRING
TIME_NTZTIME
TIME_TZTIME
TIMESTAMP_NTZTIMESTAMP_NTZ
TIMESTAMP_TZTIMESTAMP_TZ

Bobsled → BigQuery

Bobsled TypeBigQuery
ARRAY<Primitive>JSON
ARRAY<COMPLEX>JSON
BINARYBINARY
BOOLEANBOOLEAN
DATEDATE
DECIMAL(p,s)DECIMAL/BIGDECIMAL (depending on the BigQuery rules described https://cloud.google.com/bigquery/docs/reference/standard-sql/data-types#parameterized_decimal_type)
FLOATFLOAT
GEOGRAPHYGEOGRAPHY
INTEGERINTEGER
JSONJSON
STRINGSTRING
TIME_NTZTIME
TIME_TZTIME
TIMESTAMP_NTZDATETIME
TIMESTAMP_TZTIMESTAMP_TZ

Bobsled → Databricks

Bobsled TypeDatabricks
ARRAY<Primitive>ARRAY<Primitve>
ARRAY<COMPLEX>ARRAY<JSON>
BINARYBINARY
BOOLEANBOOLEAN
DATEDATE
DECIMAL(p,s) p<=38 AND scale<=37DECIMAL(p,s). If the source data contains a DECIMAL that can’t be stored as a DECIMAL in Databricks, we use a STRING.
DECIMAL(p,s) p>38 OR scale> 37STRING
FLOATDOUBLE
GEOGRAPHYBINARY
INTEGERBIGINTEGER
JSONSTRING
STRINGSTRING
TIME_NTZ (parquet)Can not be mapped. A source parquet file containing TIME_NTZ can not be loaded to Databricks.
TIME_NTZ (csv)STRING
TIME_TZ (parquet)Can not be mapped. A source parquet file containing TIME_NTZ can not be loaded to Databricks.
TIME_TZ (csv)STRING
TIMESTAMP_NTZTIMESTAMP_TZ. Currently, we do not support TIMESTAMP_NTZ in Databricks because it can't be shared using Delta Share
TIMESTAMP_TZTIMESTAMP_TZ

Bobsled → Parquet

Parquet Mappings depend on the source of the data as we rely on the parquet writers of the source systems.

Bobsled TypeSnowflake → ParquetBigQuery → Parquet
ARRAY<Primitive>JSONARRAY<Primitive>
ARRAY<COMPLEX>JSONARRAY<COMPLEX>
BINARYBYTE_ARRAYBYTE_ARRAY
BOOLEANBOOLEANBOOLEAN
DATEDATEDATE
DECIMAL(p,s)DECIMAL(p,s)DECIMAL(38,9) or DECIMAL(76,38). BigQuery widens the precision/scale to max when unloading to Parquet.
FLOATFLOATFLOAT
GEOGRAPHYBYTE_ARRAYBYTE_ARRAY
GEOMETRYBYTE_ARRAYNA
INTEGERDECIMAL(38,0). Snowflake stores INTs as DECIMAL(38,0)INT64
JSONSTRINGSTRING
STRINGSTRINGSTRING
TIME_NTZTIME_NTZ (isAdjustedToUTC=false)TIME_NTZ (isAdjustedToUTC=false)
TIME_TZTIME_TZ (isAdjustedToUTC=true)TIME_TZ (isAdjustedToUTC=true)
TIMESTAMP_NTZTIMESTAMP_NTZ (isAdjustedToUTC=false)TIMESTAMP_NTZ (isAdjustedToUTC=false)
TIMESTAMP_TZTIMESTAMP_TZ (isAdjustedToUTC=true)TIMESTAMP_TZ (isAdjustedToUTC=true)

Was this article helpful?