Configure the DB extractor
To configure the DB extractor, you must create a configuration file in YAML format.
The configuration file supports substitution of environment variables, for example in connection strings:
connection-string: DRIVER={Oracle 19.3};DBQ=localhost:1521/XE;UID=SYSTEM;PWD=${DATABASE_PASSWORD}
Implicit substitutions only work for unquoted value strings. For quoted strings, use the !env tag to activate environment substitution:
connection-string: !env 'DRIVER={Oracle 19.3};DBQ=localhost:1521/XE;UID=SYSTEM;PWD=${DATABASE_PASSWORD}'
The configuration file also contains the global parameter version, which holds the version of the configuration schema used in the configuration file. This document describes version 3 of the configuration schema.
You can set up extraction pipelines to use versioned extractor configuration files stored in the cloud.
Logger
The optional logger section sets up logging to a console and files.
Parameter | Description |
---|---|
console | Sets up console logger configuration. See the Console section. |
file | Sets up file logger configuration. See the File section. |
Console
Include the console section to enable logging to standard output, such as a terminal window.
Parameter | Description |
---|---|
level | Select the verbosity level for console logging. Valid options, in decreasing verbosity levels, are DEBUG , INFO , WARNING , ERROR , and CRITICAL . |
File
Include the file section to enable logging to a file. The files are rotated daily.
Parameter | Description |
---|---|
level | Select the verbosity level for file logging. Valid options, in decreasing verbosity levels, are DEBUG , INFO , WARNING , ERROR , and CRITICAL . |
path | Insert the path to the log file. |
retention | Specify the number of days to keep logs for. The default value is 7. |
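For example, a logger section that writes INFO-level messages to the console and DEBUG-level messages to a daily rotated file might look like the sketch below. The path and retention values are illustrative.

```yaml
logger:
  console:
    level: INFO
  file:
    level: DEBUG
    path: logs/db-extractor.log   # illustrative path
    retention: 14                 # keep logs for 14 days instead of the default 7
```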
Cognite
The cognite section describes which CDF project the extractor will load data into and how to connect to the project.
Parameter | Description |
---|---|
project | Insert the CDF project name. This is a required value. |
host | Insert the base URL of the CDF project. The default value is https://api.cognitedata.com. |
api-key | Enter an API key to authenticate to CDF. We've deprecated API-key authentication and strongly encourage customers to migrate to authentication with an IdP. |
idp-authentication | Insert the credentials for authenticating to CDF using an external identity provider. You must enter either an API key or use IdP authentication. |
data-set | Insert an optional data set ID that will be used if you've set the extractor to create missing time series. This value must contain either id or external-id . |
Identity provider (IdP) authentication
The idp-authentication section enables the extractor to authenticate to CDF using an external identity provider, such as Microsoft Entra ID (formerly Azure Active Directory).
Parameter | Description |
---|---|
client-id | Enter the client ID from the IdP. This is a required value. |
secret | Enter the client secret from the IdP. This is a required value. |
scopes | List the scopes. This is a required value. |
resource | Insert the resource to include in token requests. This is an optional field. |
token-url | Insert the URL to fetch tokens from. You must enter either a token URL or an Azure tenant. |
tenant | Enter the Azure tenant. You must enter either a token URL or an Azure tenant. |
min-ttl | Insert the minimum time in seconds a token will be valid. If the cached token expires in less than min_ttl seconds, it will be refreshed. The default value is 30. |
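As a sketch, a cognite section using IdP authentication against Microsoft Entra ID could combine these parameters with environment-variable substitution as shown below. The project name, data set, and scope are illustrative placeholders.

```yaml
cognite:
  project: my-project
  host: https://api.cognitedata.com
  idp-authentication:
    client-id: ${CLIENT_ID}
    secret: ${CLIENT_SECRET}
    scopes:
      - https://api.cognitedata.com/.default   # typical scope for CDF with Entra ID
    tenant: ${AZURE_TENANT_ID}
    min-ttl: 30
  data-set:
    external-id: db-extractor-data   # illustrative; you can use id instead of external-id
```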
Extractor
The optional extractor section contains tuning parameters.
Parameter | Description |
---|---|
mode | Set the execution mode. Options are single or continuous . Use continuous to run the extractor in a continuous mode, executing the queries defined in the queries section. The default value is single . |
upload-queue-size | Enter the size of the upload queue. The default value is 50 000 rows. |
parallelism | Insert the number of parallel queries to run. The default value is 4. |
state-store | The state store configuration. See the State store section. The default value is no state store, and incremental load is deactivated. |
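For instance, to run the extractor continuously with a local state store for incremental load, the extractor section might be configured as in this sketch. All values are illustrative.

```yaml
extractor:
  mode: continuous
  upload-queue-size: 50000
  parallelism: 4
  state-store:
    local:
      path: states.json    # illustrative file path
      save-interval: 30
```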
State store
Include the state store section to save extraction states between runs. Use this if data is loaded incrementally. We support multiple state stores, but you can only configure one at a time.
Parameter | Description |
---|---|
local | Local state store configuration. See the Local section. |
raw | RAW state store configuration. See the RAW section. |
Local
A local state store stores the extraction state in a JSON file on the local machine.
Parameter | Description |
---|---|
path | Insert the file path to a JSON file. |
save-interval | Enter the interval in seconds between each save. The default value is 30. |
RAW
A RAW state store stores the extraction state in a table in CDF RAW.
Parameter | Description |
---|---|
database | Enter the database name in CDF RAW. |
table | Enter the table name in CDF RAW. |
upload-interval | Enter the interval in seconds between each save. The default value is 30. |
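As an alternative to a local JSON file, the state can be kept in CDF RAW. A minimal sketch, with illustrative database and table names:

```yaml
extractor:
  state-store:
    raw:
      database: db-extractor        # illustrative RAW database name
      table: extraction-states      # illustrative RAW table name
      upload-interval: 30
```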
Metrics
The metrics section describes where to send metrics on extractor performance for remote monitoring of the extractor. We recommend sending metrics to a Prometheus pushgateway, but you can also send metrics as time series in the CDF project.
Parameter | Description |
---|---|
push-gateways | List of pushgateway configurations. See the Pushgateways section. |
cognite | Cognite metrics configurations. See the Cognite section. |
Pushgateways
The push-gateways section contains a list of metric destinations, each following this schema:
Parameter | Description |
---|---|
host | Enter the address of the host to push metrics to. This is a required value. |
job-name | Enter the value of the exported_job label to associate metrics with. This separates several deployments on a single pushgateway, and should be unique. This is a required value. |
username | Enter the credentials for the pushgateway. This is a required value. |
password | Enter the credentials for the pushgateway. This is a required value. |
clear-after | Enter the number of seconds to wait before clearing the pushgateway. When this parameter is present, the extractor will stall after the run is complete before deleting all metrics from the pushgateway. The recommended value is at least twice that of the scrape interval on the pushgateway. This is to ensure that the last metrics are gathered before the deletion. |
push-interval | Enter the interval in seconds between each push. The default value is 30. |
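A metrics section pushing to a single Prometheus pushgateway might look like the sketch below. The host, job name, and credentials are placeholders.

```yaml
metrics:
  push-gateways:
    - host: https://pushgateway.my-domain.com   # illustrative pushgateway address
      job-name: db-extractor-site-a             # must be unique per deployment
      username: ${PROMETHEUS_USER}
      password: ${PROMETHEUS_PASSWORD}
      push-interval: 30
      clear-after: 120   # at least twice the pushgateway scrape interval
```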
Databases
The databases section contains a list of database configurations. The schema for a database configuration depends on which database type you are connecting to. These are distinguished by the type parameter. Possible database types include:
- ODBC
- PostgreSQL
- MongoDB
- Azure Cosmos DB
- Snowflake
- Oracle DB
- Local spreadsheet files: Microsoft Excel, OpenDocument sheets and .csv files.
- Amazon Redshift
- Amazon DynamoDB
Navigate to Integrate > Connect to source system > Cognite DB extractor in CDF to see all supported sources and the recommended approach.
This is the schema for ODBC databases:
Parameter | Description |
---|---|
type | Select the type of database connection. Set to ODBC for ODBC databases. |
name | Enter a name for the database that will be used throughout the queries section and for logging. The name must be unique for each database in the configuration file. This is a required value. |
connection-string | Enter the ODBC connection string. See Connect to a database section for details. This is a required value. |
database-type | Enter the database type to connect to. Options are GENERIC , MYSQL , and ORACLE . The default value is GENERIC . |
timeout | Enter the timeout in seconds for the ODBC connection and queries. The default value is no timeout. Some ODBC drivers don't accept either the SQL_ATTR_CONNECTION_TIMEOUT or the SQL_ATTR_QUERY_TIMEOUT option. The extractor will then log an exception with the message Could not set timeout on the ODBC driver - timeouts might not work properly . Extraction will continue regardless, but without timeouts. To avoid this log line, you can disable timeouts for the database causing these problems. |
batch-size | Enter the number of rows to fetch from the database at a time. The default value is 1000. You can decrease this number if the machine with the extractor runs out of memory. Note that this will increase the run time. |
timezone | Specify how the extractor should handle the source timezone. Possible values are local and utc . Default value is local . |
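For example, a databases list with a single ODBC entry reusing the Oracle connection string shown earlier might look like this sketch. The name, driver, and timeout are illustrative, and the type value is written in lowercase here for consistency with the other database types.

```yaml
databases:
  - type: odbc   # listed as ODBC in the table above; lowercase assumed here
    name: oracle-xe
    connection-string: DRIVER={Oracle 19.3};DBQ=localhost:1521/XE;UID=SYSTEM;PWD=${DATABASE_PASSWORD}
    database-type: ORACLE
    timeout: 60
    batch-size: 1000
    timezone: utc
```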
This is the schema for PostgreSQL databases:
Parameter | Description |
---|---|
type | Type of database connection, set to postgres for PostgreSQL databases. |
name | Enter a user-given name of this database that will be used throughout the queries section and for logging. The name must be unique for each database in the configuration file. This is a required value. |
host | Enter the host name or address for the database. |
user | Enter the username for the database. |
password | Enter the password for the database. |
database | Enter the database name to use. The default is to use the user name. |
port | Enter the port for the database. The default value is 5432. |
timeout | Enter the timeout in seconds for the database connection and queries. The default value is no timeout. |
batch-size | Number of rows to fetch from the database at a time. The default value is 1000. You can decrease this number if the machine running the extractor is running out of memory, but this will increase the run time. |
timezone | Specify how the extractor should handle the source timezone. Possible values are local and utc . Default value is local . |
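A corresponding PostgreSQL entry could be sketched as follows. The host, credentials, and database name are placeholders.

```yaml
databases:
  - type: postgres
    name: my-postgres
    host: postgres.my-domain.com   # illustrative host
    user: extractor
    password: ${POSTGRES_PASSWORD}
    database: production
    port: 5432
    timezone: local
```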
This is the schema for MongoDB databases:
Parameter | Description |
---|---|
type | Type of database connection, set to mongodb for MongoDB databases. |
name | Enter a user-given name of this database that will be used throughout the queries section and for logging. The name must be unique for each database in the configuration file. This is a required value. |
uri | Address and authentication data for the database as a Uniform Resource Identifier. You can find URI examples here. |
database | Name of the MongoDB database to use. |
timezone | Specify how the extractor should handle the source timezone. Possible values are local and utc . Default value is local . |
This is the schema for Azure Cosmos DB databases:
Parameter | Description |
---|---|
type | Type of database connection, set to cosmosdb for Cosmos DB databases. |
name | Enter a user-given name of this database that will be used throughout the queries section and for logging. The name must be unique for each database in the configuration file. This is a required value. |
host | Host address for the database. |
key | Authentication key for the database. |
database | Database name to use. |
timezone | Specify how the extractor should handle the source timezone. Possible values are local and utc . Default value is local . |
This is the schema for Snowflake warehouses:
Parameter | Description |
---|---|
type | Type of database connection, set to snowflake for Snowflake data warehouses. |
name | Enter a user-given name of this database that will be used throughout the queries section and for logging. The name must be unique for each database in the configuration file. This is a required value. |
user | User name for the database. |
password | Password for the database. |
account | Snowflake account name. |
organization | Snowflake organization name. |
database | Database name to use. |
schema | Schema to use. |
timeout | Enter the timeout in seconds for the database connection and queries. The default value is no timeout. |
batch-size | Number of rows to fetch from the database at a time. Default is 1000. This can be decreased if the machine running the extractor is running out of memory, but that will increase run time. |
timezone | Specify how the extractor should handle the source timezone. Possible values are local and utc . Default value is local . |
This is the schema for local spreadsheets:
Parameter | Description |
---|---|
type | Type of connection, set to spreadsheet for local spreadsheet files. |
name | Enter a user-given name of this database that will be used throughout the queries section and for logging. The name must be unique for each database in the configuration file. This is a required value. |
path | Path where the spreadsheet file is located. |
timezone | Specify how the extractor should handle the source timezone. Possible values are local and utc . Default value is local . |
This is the schema for Oracle databases:
Parameter | Description |
---|---|
type | Type of database connection, set to oracle for Oracle databases. |
name | Enter a user-given name of this database that will be used throughout the queries section and for logging. The name must be unique for each database in the configuration file. This is a required value. |
host | Enter the host name or address for the database. |
user | Enter the username for the database. |
password | Enter the password for the database. |
port | Enter the port for the database. The default value is 1521. |
timeout | Enter the timeout in seconds for the database connection and queries. The default value is no timeout. |
batch-size | Number of rows to fetch from the database at a time. The default value is 1000. You can decrease this number if the machine running the extractor is running out of memory, but this will increase the run time. |
timezone | Specify how the extractor should handle the source timezone. Possible values are local and utc . Default value is local . |
This is the schema for Amazon DynamoDB:
Parameter | Description |
---|---|
type | Type of database connection, set to dynamodb for Amazon DynamoDB databases. |
name | Enter a user-given name of this database that will be used throughout the queries section and for logging. The name must be unique for each database in the configuration file. This is a required value. |
aws_access_key_id | AWS authentication access key. This is a required value. |
aws_secret_access_key | AWS authentication secret. This is a required value. |
region_name | AWS region where your database is located. Example: us-east-1. This is a required value. |
timezone | Specify how the extractor should handle the source timezone. Possible values are local and utc . Default value is local . |
This is the schema for Amazon Redshift:
Parameter | Description |
---|---|
type | Type of database connection, set to redshift for Amazon Redshift databases. |
name | Enter a user-given name of this database that will be used throughout the queries section and for logging. The name must be unique for each database in the configuration file. This is a required value. |
aws_access_key_id | AWS authentication access key. This is a required value. |
aws_secret_access_key | AWS authentication secret. This is a required value. |
region_name | AWS region where your database is located. Example: us-east-1. This is a required value. |
database | Enter the name of the Redshift database. This is a required value. |
secret_arn | AWS Secret ARN. This is an optional value. |
cluster_identifier | Name of the Redshift cluster to connect to. This parameter is mandatory when connecting to a managed Redshift cluster. |
workgroup_name | Name of the Redshift workgroup to connect to. This parameter is mandatory when connecting to a Redshift Serverless database. |
timezone | Specify how the extractor should handle the source timezone. Possible values are local and utc . Default value is local . |
Queries
The queries section contains a list of queries to run. Each query is configured according to this schema:
Parameter | Description |
---|---|
name | Enter the name of this query that will be used for logging and tagging metrics. The name must be unique for each query in the configuration file. This is a required value. |
database | Enter the name of the database to connect to. This must be one of the database names configured in the databases section. This is a required value. |
primary-key | Insert the format of the row key in CDF RAW. This parameter supports case-sensitive substitutions with values from the table columns. For example, if there's a column called index, setting primary-key: "row_{index}" will result in rows with keys row_0 , row_1 , etc. This is a required value if the destination is a raw type. |
query | Enter a SQL query to run against the database. The query can contain joins as long as the database supports joins. If data is loaded incrementally, the query should contain a WHERE clause using the substitutions {incremental_field} and {start_at}, and the query must be sorted on the incremental field. Example: SELECT * FROM table WHERE {incremental_field} > {start_at} ORDER BY {incremental_field} ASC . This is a required value. |
incremental-field | Insert the table column that holds the incremental field. |
initial-start | Enter the {start_at} value for the first run. Later runs will use the value stored in the state store. |
schedule | Schedule the interval at which the query should be executed against the source database. Use this parameter when the extractor is set to continuous mode. See the Schedule section. |
destination | The destination of the data in CDF. See the Destinations section for the available destination types. This is a required value. |
collection | Specify the collection on which the query will be executed. This parameter is mandatory when connecting to mongodb databases. |
container | Specify the container on which the query will be executed. This parameter is mandatory when connecting to cosmosdb databases. |
sheet | Specify the sheet on which the query will be executed. This parameter is mandatory when connecting to spreadsheet local files. |
parameters | Specify the parameters to be used when querying AWS DynamoDB. This parameter is mandatory when connecting to dynamodb databases. |
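Putting these parameters together, a query that loads rows incrementally from the hypothetical my-postgres database sketched earlier into CDF RAW might look like this. The table and column names are illustrative, and the raw destination type is described in the Destinations section below.

```yaml
queries:
  - name: measurements-to-raw
    database: my-postgres
    primary-key: "row_{id}"           # assumes the source table has an id column
    query: >
      SELECT * FROM measurements
      WHERE {incremental_field} > {start_at}
      ORDER BY {incremental_field} ASC
    incremental-field: updated_at     # illustrative incremental column
    initial-start: 0
    destination:
      type: raw
      database: my-raw-db
      table: measurements
```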
Schedule
Use the schedule subsection to schedule runs when the extractor runs as a service.
Parameter | Description |
---|---|
type | Insert the schedule type. Valid options are cron and interval . cron uses regular cron expressions. interval expects an interval-based schedule. |
expression | Enter the cron or interval expression to trigger the query. For example, 1h repeats the query hourly, and 5m repeats the query every 5 minutes. |
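For example, to run a query every hour when the extractor runs in continuous mode, the schedule subsection could be sketched as:

```yaml
schedule:
  type: interval
  expression: 1h
  # or, equivalently, with a cron expression:
  # type: cron
  # expression: "0 * * * *"
```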
Destinations
The DB extractor supports a number of destinations. These include:
- CDF RAW
- CDF Assets
- CDF Events
- CDF Files
- CDF Time series
- CDF Sequences
RAW
The raw destination writes data to the CDF staging area (RAW). The raw destination requires the primary-key parameter in the query configuration.
Parameter | Description |
---|---|
type | Type of CDF destination, set to raw to write data to RAW. |
database | Enter the CDF RAW database to upload data into. This will be created if it doesn't exist. This is a required value. |
table | Enter the CDF RAW table to upload data into. This will be created if it doesn't exist. This is a required value. |
Time series
The time_series destination inserts the resulting data as data points in time series. The time series destination is configured by setting the type parameter to time_series. No other parameters are required.
To ingest data into a time series, the query must produce columns named:
- externalId
- timestamp
- value
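As an illustration, a query whose result columns are aliased to the required names and sent to the time series destination might be sketched like this. The source table and columns are hypothetical, and the quoted aliases preserve the column-name casing in PostgreSQL.

```yaml
queries:
  - name: sensor-datapoints
    database: my-postgres
    query: >
      SELECT tag AS "externalId",
             reading_time AS "timestamp",
             reading AS "value"
      FROM sensor_readings
      WHERE {incremental_field} > {start_at}
      ORDER BY {incremental_field} ASC
    incremental-field: reading_time
    destination:
      type: time_series
```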
Assets
The assets destination inserts the resulting data as CDF assets. The assets destination is configured by setting the type parameter to assets. No other parameters are required.
To ingest data into assets, the query must produce a column named name. In addition, the columns externalId, parentExternalId, description, and source may be included and will be mapped to the corresponding fields in CDF assets. Any other columns returned by the query will be mapped to key/value pairs in the metadata field for assets.
Events
The events destination inserts the resulting data as CDF events. The events destination is configured by setting the type parameter to events. No other parameters are required.
To ingest data into events, the query must produce a column named externalId. In addition, the columns startTime, endTime, description, source, type, and subType may be included and will be mapped to the corresponding fields in CDF events. Any other columns returned by the query will be mapped to key/value pairs in the metadata field for events.
Files
The files destination inserts the resulting data as CDF files. The files destination is configured by setting the type parameter to files. No other parameters are required.
To ingest data into files, the query must produce columns named:
- name
- externalId
- content

The content column will be treated as binary data and uploaded to CDF files as the content of the file. In addition, the columns source, mimeType, directory, sourceCreatedTime, sourceModifiedTime, and asset_ids may be included and will be mapped to the corresponding fields in CDF files. Any other columns returned by the query will be mapped to key/value pairs in the metadata field for files.
Sequences
The sequence destination writes data to a CDF sequence.
Parameter | Description |
---|---|
type | Type of CDF destination, set to sequence to write data to CDF sequences. |
external-id | The external ID of the sequence to ingest data into. The sequence will be created if it doesn't already exist. |
The column set of the query result will determine the columns of the sequence.
The result must include a column named row_number, which must contain an integer indicating which row of the sequence to ingest the data into.
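For example, a query that builds sequence rows from a hypothetical source table, using the row_number column to place each row, could look like this sketch. The external ID, table, and columns are illustrative.

```yaml
queries:
  - name: calibration-curve
    database: my-postgres
    query: >
      SELECT point_index AS "row_number",
             temperature,
             pressure
      FROM calibration_points
      ORDER BY point_index ASC
    destination:
      type: sequence
      external-id: calibration-curve-01   # created if it doesn't already exist
```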