
Configure the DB extractor

To configure the DB extractor, you must create a configuration file. The file must be in YAML format.

The configuration file allows substitution with environment variables, for example in connection strings:

connection-string: DRIVER={Oracle 19.3};DBQ=localhost:1521/XE;UID=SYSTEM;PWD=${DATABASE_PASSWORD}
Note

Implicit substitutions only work for unquoted value strings. For quoted strings, use the !env tag to activate environment substitution:

connection-string: !env 'DRIVER={Oracle 19.3};DBQ=localhost:1521/XE;UID=SYSTEM;PWD=${DATABASE_PASSWORD}'

The configuration file also contains the global parameter version, which holds the version of the configuration schema used in the configuration file. This document describes version 3 of the configuration schema.

Tip

You can set up extraction pipelines to use versioned extractor configuration files stored in the cloud.

Logger

The optional logger section sets up logging to a console and files.

  • console: Sets up the console logger configuration. See the Console section.
  • file: Sets up the file logger configuration. See the File section.

Console

Include the console section to enable logging to a standard output, such as a terminal window.

  • level: Select the verbosity level for console logging. Valid options, in decreasing order of verbosity, are DEBUG, INFO, WARNING, ERROR, and CRITICAL.

File

Include the file section to enable logging to a file. The files are rotated daily.

  • level: Select the verbosity level for file logging. Valid options, in decreasing order of verbosity, are DEBUG, INFO, WARNING, ERROR, and CRITICAL.
  • path: Insert the path to the log file.
  • retention: Specify the number of days to keep logs. The default value is 7.
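
For illustration, a logger section combining the console and file parameters above might look like this sketch (the log levels and file path are placeholders):

logger:
  console:
    level: INFO
  file:
    level: DEBUG
    path: "logs/extractor.log"   # placeholder path
    retention: 7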

Cognite

The cognite section describes which CDF project the extractor will load data into and how to connect to the project.

  • project: Insert the CDF project name. This is a required value.
  • host: Insert the base URL of the CDF project. The default value is https://api.cognitedata.com.
  • api-key: API-key authentication is deprecated. We strongly encourage customers to migrate to authentication with an IdP.
  • idp-authentication: Insert the credentials for authenticating to CDF using an external identity provider. You must enter either an API key or use IdP authentication.
  • data-set: Insert an optional data set that will be used if you've set the extractor to create missing time series. This value must contain either id or external-id.

Identity provider (IdP) authentication

The idp-authentication section enables the extractor to authenticate to CDF using an external identity provider, such as Microsoft Entra ID (formerly Azure Active Directory).

  • client-id: Enter the client ID from the IdP. This is a required value.
  • secret: Enter the client secret from the IdP. This is a required value.
  • scopes: List the scopes. This is a required value.
  • resource: Insert the resource to include in token requests. This is an optional value.
  • token-url: Insert the URL to fetch tokens from. You must enter either a token URL or an Azure tenant.
  • tenant: Enter the Azure tenant. You must enter either a token URL or an Azure tenant.
  • min-ttl: Insert the minimum time in seconds a token will be valid. If the cached token expires in less than min-ttl seconds, it will be refreshed. The default value is 30.
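
As a sketch, a cognite section authenticating with Microsoft Entra ID could look like this. The project name, tenant, data set, and the CLIENT_ID and CLIENT_SECRET environment variables are placeholders you would replace with your own values:

cognite:
  project: my-project                    # placeholder CDF project name
  host: https://api.cognitedata.com
  idp-authentication:
    client-id: ${CLIENT_ID}              # read from an environment variable
    secret: ${CLIENT_SECRET}
    tenant: my-azure-tenant-id           # placeholder; or use token-url instead
    scopes:
      - https://api.cognitedata.com/.default
  data-set:
    external-id: my-data-set             # placeholder data set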

Extractor

The optional extractor section contains tuning parameters.

  • mode: Set the execution mode. Options are single and continuous. Use continuous to run the extractor continuously, executing the queries defined in the queries section. The default value is single.
  • upload-queue-size: Enter the size of the upload queue. The default value is 50 000 rows.
  • parallelism: Insert the number of parallel queries to run. The default value is 4.
  • state-store: The state store configuration. See the State store section. The default is no state store, in which case incremental load is deactivated.
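
For example, an extractor section tuned for continuous operation might look like this sketch (the values shown are illustrative, not recommendations):

extractor:
  mode: continuous          # run the queries in the queries section continuously
  upload-queue-size: 50000
  parallelism: 4
  # state-store: ...        # see the State store section below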

State store

Include the state store section to save extraction states between runs. Use this if data is loaded incrementally. We support multiple state stores, but you can only configure one at a time.

  • local: Local state store configuration. See the Local section.
  • raw: RAW state store configuration. See the RAW section.

Local

A local state store stores the extraction state in a JSON file on the local machine.

  • path: Insert the file path to a JSON file.
  • save-interval: Enter the interval in seconds between each save. The default value is 30.

RAW

A RAW state store stores the extraction state in a table in CDF RAW.

  • database: Enter the database name in CDF RAW.
  • table: Enter the table name in CDF RAW.
  • upload-interval: Enter the interval in seconds between each save. The default value is 30.
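
The state store is configured as part of the extractor section. A minimal sketch, with placeholder file, database, and table names, could look like this:

extractor:
  state-store:
    local:
      path: "states.json"        # placeholder path for the local JSON state file
      save-interval: 30
    # Alternatively, store states in CDF RAW instead of a local file:
    # raw:
    #   database: extractor-states   # placeholder RAW database
    #   table: db-extractor          # placeholder RAW table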

Metrics

The metrics section describes where to send metrics on extractor performance for remote monitoring of the extractor. We recommend sending metrics to a Prometheus pushgateway, but you can also send metrics as time series in the CDF project.

  • push-gateways: A list of pushgateway configurations. See the Pushgateways section.
  • cognite: Cognite metrics configuration. See the Cognite section.

Pushgateways

The push-gateways section contains a list of metric destinations, each following this schema:

  • host: Enter the address of the host to push metrics to. This is a required value.
  • job-name: Enter the value of the exported_job label to associate metrics with. This separates several deployments on a single pushgateway and should be unique. This is a required value.
  • username: Enter the username for the pushgateway. This is a required value.
  • password: Enter the password for the pushgateway. This is a required value.
  • clear-after: Enter the number of seconds to wait before clearing the pushgateway. When this parameter is present, the extractor will wait after the run is complete before deleting all metrics from the pushgateway. The recommended value is at least twice the scrape interval on the pushgateway, to ensure that the last metrics are gathered before the deletion.
  • push-interval: Enter the interval in seconds between each push. The default value is 30.
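
A metrics section pushing to a single Prometheus pushgateway might look like this sketch. The host, job name, and the PROMETHEUS_USER and PROMETHEUS_PASSWORD environment variables are placeholders:

metrics:
  push-gateways:
    - host: https://prometheus-pushgateway.example.com   # placeholder address
      job-name: db-extractor-prod                        # unique per deployment
      username: ${PROMETHEUS_USER}
      password: ${PROMETHEUS_PASSWORD}
      clear-after: 120        # e.g. twice a 60-second scrape interval
      push-interval: 30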

Databases

The databases section contains a list of database configurations. The schema for database configuration depends on which database type you are connecting to. These are distinguished by the type parameter. Possible database types include:

  • ODBC
  • PostgreSQL
  • MongoDB
  • Azure Cosmos DB
  • Snowflake
  • Oracle DB
  • Local spreadsheet files: Microsoft Excel, OpenDocument sheets and .csv files.
  • Amazon Redshift
  • Amazon DynamoDB

Navigate to Integrate > Connect to source system > Cognite DB extractor in CDF to see all supported sources and the recommended approach.

This is the schema for ODBC databases:

  • type: Select the type of database connection. Set to odbc for ODBC databases.
  • name: Enter a name for the database that will be used throughout the queries section and for logging. The name must be unique for each database in the configuration file. This is a required value.
  • connection-string: Enter the ODBC connection string. See the Connect to a database section for details. This is a required value.
  • database-type: Enter the database type to connect to. Options are GENERIC, MYSQL, and ORACLE. The default value is GENERIC.
  • timeout: Enter the timeout in seconds for the ODBC connection and queries. The default value is no timeout. Some ODBC drivers don't accept either the SQL_ATTR_CONNECTION_TIMEOUT or the SQL_ATTR_QUERY_TIMEOUT option. In that case, the extractor logs an exception with the message Could not set timeout on the ODBC driver - timeouts might not work properly. Extraction continues regardless, but without timeouts. To avoid this log line, you can disable timeouts for the database causing the problem.
  • batch-size: Enter the number of rows to fetch from the database at a time. The default value is 1000. You can decrease this number if the machine running the extractor runs out of memory, but this will increase the run time.
  • timezone: Specify how the extractor should handle the source timezone. Possible values are local and utc. The default value is local.
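
For example, an ODBC database entry reusing the connection string from the beginning of this article might look like this sketch (the name and driver details are placeholders):

databases:
  - type: odbc
    name: oracle-odbc                    # placeholder name, referenced from queries
    connection-string: !env 'DRIVER={Oracle 19.3};DBQ=localhost:1521/XE;UID=SYSTEM;PWD=${DATABASE_PASSWORD}'
    database-type: ORACLE
    timeout: 60
    batch-size: 1000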

This is the schema for PostgreSQL databases:

  • type: Type of database connection. Set to postgres for PostgreSQL databases.
  • name: Enter a name for the database that will be used throughout the queries section and for logging. The name must be unique for each database in the configuration file. This is a required value.
  • host: Enter the host name or address for the database.
  • user: Enter the username for the database.
  • password: Enter the password for the database.
  • database: Enter the database name to use. The default is to use the username.
  • port: Enter the port for the database. The default value is 5432.
  • timeout: Enter the timeout in seconds for the database connection and queries. The default value is no timeout.
  • batch-size: Enter the number of rows to fetch from the database at a time. The default value is 1000. You can decrease this number if the machine running the extractor runs out of memory, but this will increase the run time.
  • timezone: Specify how the extractor should handle the source timezone. Possible values are local and utc. The default value is local.
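
A sketch of a PostgreSQL database entry, with a placeholder host, name, and database and the password taken from an assumed POSTGRES_PASSWORD environment variable:

databases:
  - type: postgres
    name: postgres-prod                  # placeholder name, referenced from queries
    host: pg.example.com
    user: extractor
    password: ${POSTGRES_PASSWORD}
    database: plantdata
    port: 5432
    timeout: 60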

This is the schema for MongoDB databases:

  • type: Type of database connection. Set to mongodb for MongoDB databases.
  • name: Enter a name for the database that will be used throughout the queries section and for logging. The name must be unique for each database in the configuration file. This is a required value.
  • uri: The address and authentication data for the database as a Uniform Resource Identifier (URI). See the MongoDB documentation for URI examples.
  • database: The name of the MongoDB database to use.
  • timezone: Specify how the extractor should handle the source timezone. Possible values are local and utc. The default value is local.
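
A sketch of a MongoDB database entry, with placeholder names and the password taken from an assumed MONGO_PASSWORD environment variable (the !env tag enables substitution in the quoted URI):

databases:
  - type: mongodb
    name: mongo-prod                     # placeholder name, referenced from queries
    uri: !env 'mongodb://extractor:${MONGO_PASSWORD}@mongo.example.com:27017'
    database: plantdata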

This is the schema for Azure Cosmos DB databases:

  • type: Type of database connection. Set to cosmosdb for Cosmos DB databases.
  • name: Enter a name for the database that will be used throughout the queries section and for logging. The name must be unique for each database in the configuration file. This is a required value.
  • host: The host address for the database.
  • key: The authentication key for the database.
  • database: The database name to use.
  • timezone: Specify how the extractor should handle the source timezone. Possible values are local and utc. The default value is local.
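
A sketch of a Cosmos DB database entry, with a placeholder account host and the key taken from an assumed COSMOS_KEY environment variable:

databases:
  - type: cosmosdb
    name: cosmos-prod                    # placeholder name, referenced from queries
    host: https://my-account.documents.azure.com:443/
    key: ${COSMOS_KEY}
    database: plantdata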

This is the schema for Snowflake warehouses:

  • type: Type of database connection. Set to snowflake for Snowflake data warehouses.
  • name: Enter a name for the database that will be used throughout the queries section and for logging. The name must be unique for each database in the configuration file. This is a required value.
  • user: The username for the database.
  • password: The password for the database.
  • account: The Snowflake account.
  • organization: The Snowflake organization name.
  • database: The database name to use.
  • schema: The schema to use.
  • timeout: Enter the timeout in seconds for the database connection and queries. The default value is no timeout.
  • batch-size: Enter the number of rows to fetch from the database at a time. The default value is 1000. You can decrease this number if the machine running the extractor runs out of memory, but this will increase the run time.
  • timezone: Specify how the extractor should handle the source timezone. Possible values are local and utc. The default value is local.
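
A sketch of a Snowflake warehouse entry, with placeholder account, organization, database, and schema values and the password taken from an assumed SNOWFLAKE_PASSWORD environment variable:

databases:
  - type: snowflake
    name: snowflake-prod                 # placeholder name, referenced from queries
    user: extractor
    password: ${SNOWFLAKE_PASSWORD}
    account: my-account                  # placeholder Snowflake account
    organization: my-org                 # placeholder organization
    database: PLANTDATA
    schema: PUBLIC
    batch-size: 1000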

This is the schema for local spreadsheets:

  • type: Type of connection. Set to spreadsheet for local spreadsheet files.
  • name: Enter a name for the database that will be used throughout the queries section and for logging. The name must be unique for each database in the configuration file. This is a required value.
  • path: The path to the spreadsheet file.
  • timezone: Specify how the extractor should handle the source timezone. Possible values are local and utc. The default value is local.
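
A sketch of a local spreadsheet entry (the name and file path are placeholders):

databases:
  - type: spreadsheet
    name: local-sheets                   # placeholder name, referenced from queries
    path: "data/sensor-readings.xlsx"    # placeholder path to an Excel file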

This is the schema for Oracle databases:

  • type: Type of database connection. Set to oracle for Oracle databases.
  • name: Enter a name for the database that will be used throughout the queries section and for logging. The name must be unique for each database in the configuration file. This is a required value.
  • host: Enter the host name or address for the database.
  • user: Enter the username for the database.
  • password: Enter the password for the database.
  • port: Enter the port for the database. The default value is 1521.
  • timeout: Enter the timeout in seconds for the database connection and queries. The default value is no timeout.
  • batch-size: Enter the number of rows to fetch from the database at a time. The default value is 1000. You can decrease this number if the machine running the extractor runs out of memory, but this will increase the run time.
  • timezone: Specify how the extractor should handle the source timezone. Possible values are local and utc. The default value is local.
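
A sketch of an Oracle database entry, with a placeholder host and name and the password taken from an assumed ORACLE_PASSWORD environment variable:

databases:
  - type: oracle
    name: oracle-prod                    # placeholder name, referenced from queries
    host: oracle.example.com
    user: SYSTEM
    password: ${ORACLE_PASSWORD}
    port: 1521
    timeout: 60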

This is the schema for Amazon DynamoDB:

  • type: Type of database connection. Set to dynamodb for Amazon DynamoDB databases.
  • name: Enter a name for the database that will be used throughout the queries section and for logging. The name must be unique for each database in the configuration file. This is a required value.
  • aws_access_key_id: The AWS authentication access key. This is a required value.
  • aws_secret_access_key: The AWS authentication secret. This is a required value.
  • region_name: The AWS region where your database is located. Example: us-east-1. This is a required value.
  • timezone: Specify how the extractor should handle the source timezone. Possible values are local and utc. The default value is local.
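
A sketch of a DynamoDB entry, with the credentials taken from the standard AWS environment variables (assumed to be set) and a placeholder name:

databases:
  - type: dynamodb
    name: dynamo-prod                    # placeholder name, referenced from queries
    aws_access_key_id: ${AWS_ACCESS_KEY_ID}
    aws_secret_access_key: ${AWS_SECRET_ACCESS_KEY}
    region_name: us-east-1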

This is the schema for Amazon Redshift:

  • type: Type of database connection. Set to redshift for Amazon Redshift databases.
  • name: Enter a name for the database that will be used throughout the queries section and for logging. The name must be unique for each database in the configuration file. This is a required value.
  • aws_access_key_id: The AWS authentication access key. This is a required value.
  • aws_secret_access_key: The AWS authentication secret. This is a required value.
  • region_name: The AWS region where your database is located. Example: us-east-1. This is a required value.
  • database: Enter the name of the Redshift database. This is a required value.
  • secret_arn: The AWS Secret ARN. This is an optional value.
  • cluster_identifier: The name of the Redshift cluster to connect to. This parameter is mandatory when connecting to a managed Redshift cluster.
  • workgroup_name: The name of the Redshift workgroup to connect to. This parameter is mandatory when connecting to a Redshift Serverless database.
  • timezone: Specify how the extractor should handle the source timezone. Possible values are local and utc. The default value is local.
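
A sketch of a Redshift entry for a managed cluster, with placeholder names and the credentials taken from the standard AWS environment variables (assumed to be set):

databases:
  - type: redshift
    name: redshift-prod                  # placeholder name, referenced from queries
    aws_access_key_id: ${AWS_ACCESS_KEY_ID}
    aws_secret_access_key: ${AWS_SECRET_ACCESS_KEY}
    region_name: us-east-1
    database: plantdata                  # placeholder database name
    cluster_identifier: my-redshift-cluster   # for a managed cluster
    # workgroup_name: my-workgroup            # use instead for Redshift Serverless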

Queries

The queries section contains a list of queries to run. Each query is configured according to this schema:

  • name: Enter the name of this query, used for logging and tagging metrics. The name must be unique for each query in the configuration file. This is a required value.
  • database: Enter the name of the database to connect to. This must be one of the database names configured in the databases section. This is a required value.
  • primary-key: Insert the format of the row key in CDF RAW. This parameter supports case-sensitive substitutions with values from the table columns. For example, if there's a column called index, setting primary-key: "row_{index}" will result in rows with keys row_0, row_1, etc. This is a required value if the destination is a raw type.
  • query: Enter the SQL query to run against the database. The query can contain joins, as long as the database supports them. If data is loaded incrementally, the query should contain a WHERE clause using the substitutions {incremental_field} and {start_at}, and the results must be sorted on the incremental field. Example: SELECT * FROM table WHERE {incremental_field} > {start_at} ORDER BY {incremental_field} ASC. This is a required value.
  • incremental-field: Insert the table column that holds the incremental field.
  • initial-start: Enter the {start_at} value for the first run. Later runs will use the value stored in the state store.
  • schedule: Set the interval at which the query is executed against the source database. Use this parameter when the extractor runs in continuous mode. See the Schedule section.
  • destination: The destination of the data in CDF. One of several destination types; see the Destinations section. This is a required value.
  • collection: Specify the collection the query will be executed on. This parameter is mandatory when connecting to mongodb databases.
  • container: Specify the container the query will be executed on. This parameter is mandatory when connecting to cosmosdb databases.
  • sheet: Specify the sheet the query will be executed on. This parameter is mandatory when connecting to spreadsheet files.
  • parameters: Specify the parameters to use when querying Amazon DynamoDB. This parameter is mandatory when connecting to dynamodb databases.
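
Putting these parameters together, an incremental query writing to CDF RAW might look like this sketch. The database name postgres-prod, the table sensor_readings, and the id column are hypothetical:

queries:
  - name: sensor-readings
    database: postgres-prod              # must match a name in the databases section
    query: >
      SELECT * FROM sensor_readings
      WHERE {incremental_field} > {start_at}
      ORDER BY {incremental_field} ASC
    incremental-field: id
    initial-start: 0
    primary-key: "row_{id}"              # builds RAW row keys from the id column
    destination:
      type: raw
      database: staging                  # placeholder RAW database
      table: sensor_readings             # placeholder RAW table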

Schedule

Use the schedule subsection to schedule runs when the extractor runs as a service.

  • type: Insert the schedule type. Valid options are cron and interval. cron uses regular cron expressions, while interval expects an interval-based schedule.
  • expression: Enter the cron or interval expression to trigger the query. For example, 1h repeats the query hourly, and 5m repeats the query every 5 minutes.
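
The schedule subsection sits inside a query entry. A sketch of both schedule types, using placeholder query values:

queries:
  - name: sensor-readings
    # ... database, query, destination, etc.
    schedule:
      type: interval
      expression: 5m           # run the query every five minutes
    # Equivalent cron-style schedule:
    # schedule:
    #   type: cron
    #   expression: "*/5 * * * *"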

Destinations

The DB extractor supports a number of destinations. These include:

  • CDF RAW
  • CDF Assets
  • CDF Events
  • CDF Files
  • CDF Time series
  • CDF Sequences

RAW

The raw destination writes data to the CDF staging area (RAW). The raw destination requires the primary-key parameter in the query configuration.

  • type: Type of CDF destination. Set to raw to write data to RAW.
  • database: Enter the CDF RAW database to upload data into. This will be created if it doesn't exist. This is a required value.
  • table: Enter the CDF RAW table to upload data into. This will be created if it doesn't exist. This is a required value.
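
For example, a raw destination inside a query entry might look like this sketch (the database and table names are placeholders):

destination:
  type: raw
  database: staging
  table: sensor_readings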

Time series

The time_series destination inserts the resulting data as data points in time series. The time series destination is configured by setting the type parameter to time_series. No other parameters are required.

To ingest data into a time series, the query must produce columns named

  • externalId
  • timestamp
  • value
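
One way to produce those columns is to alias them in the query. The following sketch assumes a hypothetical postgres-prod database with a temperature_readings table whose tag, reading_time, and reading columns map to the required names:

queries:
  - name: temperature-datapoints
    database: postgres-prod
    query: >
      SELECT tag AS "externalId",
             reading_time AS "timestamp",
             reading AS "value"
      FROM temperature_readings
    destination:
      type: time_series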

Assets

The assets destination inserts the resulting data as CDF assets. The assets destination is configured by setting the type parameter to assets. No other parameters are required.

To ingest data into assets, the query must produce a column named

  • name

In addition, columns named

  • externalId
  • parentExternalId
  • description
  • source

may be included and will be mapped to the corresponding fields in CDF assets. Any other columns returned by the query will be mapped to key/value pairs in the metadata field for assets.
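
As an illustration, the following sketch builds an asset hierarchy from a hypothetical equipment table in the hypothetical postgres-prod database; the site column has no corresponding asset field, so it would end up in metadata:

queries:
  - name: asset-hierarchy
    database: postgres-prod
    query: >
      SELECT tag AS "name",
             tag AS "externalId",
             parent_tag AS "parentExternalId",
             description AS "description",
             site
      FROM equipment
    destination:
      type: assets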

Events

The events destination inserts the resulting data as CDF events. The events destination is configured by setting the type parameter to events. No other parameters are required.

To ingest data into events, the query must produce a column named

  • externalId

In addition, columns named

  • startTime
  • endTime
  • description
  • source
  • type
  • subType

may be included and will be mapped to the corresponding fields in CDF events. Any other columns returned by the query will be mapped to key/value pairs in the metadata field for events.

Files

The files destination inserts the resulting data as CDF files. The files destination is configured by setting the type parameter to files. No other parameters are required.

To ingest data into files, the query must produce columns named

  • name
  • externalId
  • content

content will be treated as binary data and uploaded to CDF files as the content of the file.

In addition, columns named

  • source
  • mimeType
  • directory
  • sourceCreatedTime
  • sourceModifiedTime
  • asset_ids

may be included and will be mapped to the corresponding fields in CDF files. Any other columns returned by the query will be mapped to key/value pairs in the metadata field for files.

Sequences

The sequence destination writes data to a CDF sequence.

  • type: Type of CDF destination. Set to sequence to write data to CDF sequences.
  • external-id: The external ID of the sequence to ingest data into. The sequence will be created if it doesn't already exist.

The column set of the query result will determine the columns of the sequence.

The result must include a column named row_number, containing an integer that indicates which row of the sequence to ingest the row into.
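
For instance, a sequence ingestion query might look like this sketch, assuming a hypothetical pump_curve table in the hypothetical postgres-prod database; the flow and head columns become sequence columns:

queries:
  - name: pump-curve
    database: postgres-prod
    query: >
      SELECT point_no AS "row_number", flow, head
      FROM pump_curve
      ORDER BY point_no ASC
    destination:
      type: sequence
      external-id: pump-curve-001        # placeholder sequence external ID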