Configure the DB extractor
To configure the DB extractor, you must create a configuration file in YAML format.
The configuration file supports substitution of environment variables, for example in connection strings:
connection-string: DRIVER={Oracle 19.3};DBQ=localhost:1521/XE;UID=SYSTEM;PWD=${DATABASE_PASSWORD}
Implicit substitutions only work for unquoted value strings. For quoted strings, use the !env tag to activate environment substitution:
connection-string: !env 'DRIVER={Oracle 19.3};DBQ=localhost:1521/XE;UID=SYSTEM;PWD=${DATABASE_PASSWORD}'
The configuration file also contains the global parameter version, which holds the version of the configuration schema used in the configuration file. This document describes version 3 of the configuration schema.
You can set up extraction pipelines to use versioned extractor configuration files stored in the cloud.
Logger
The optional logger section sets up logging to a console and files.
Parameter | Description |
---|---|
console | Sets up console logger configuration. See the Console section. |
file | Sets up file logger configuration. See the File section. |
Console
Include the console section to enable logging to standard output, such as a terminal window.
Parameter | Description |
---|---|
level | Select the verbosity level for console logging. Valid options, in decreasing verbosity levels, are DEBUG , INFO , WARNING , ERROR , and CRITICAL . |
File
Include the file section to enable logging to a file. The files are rotated daily.
Parameter | Description |
---|---|
level | Select the verbosity level for file logging. Valid options, in decreasing verbosity levels, are DEBUG , INFO , WARNING , ERROR , and CRITICAL . |
path | Insert the path to the log file. |
retention | Specify the number of days to keep logs for. The default value is 7. |
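For example, a logger section that writes INFO-level messages to the console and DEBUG-level messages to a daily rotated file might look like the sketch below. The path and retention values are illustrative.

```yaml
logger:
  console:
    level: INFO
  file:
    level: DEBUG
    path: logs/db-extractor.log   # illustrative path
    retention: 14                 # keep logs for 14 days instead of the default 7
```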
Cognite
The cognite section describes which CDF project the extractor will load data into and how to connect to the project.
Parameter | Description |
---|---|
project | Insert the CDF project name. This is a required value. |
host | Insert the base URL of the CDF project. The default value is https://api.cognitedata.com. |
api-key | Enter an API key to authenticate to CDF. We've deprecated API-key authentication and strongly encourage customers to migrate to authentication with an IdP. |
idp-authentication | Insert the credentials for authenticating to CDF using an external identity provider. You must enter either an API key or use IdP authentication. |
data-set | Insert an optional data set ID that will be used if you've set the extractor to create missing time series. This value must contain either id or external-id . |
Identity provider (IdP) authentication
The idp-authentication section enables the extractor to authenticate to CDF using an external identity provider, such as Microsoft Entra ID (formerly Azure Active Directory).
Parameter | Description |
---|---|
client-id | Enter the client ID from the IdP. This is a required value. |
secret | Enter the client secret from the IdP. This is a required value. |
scopes | List the scopes. This is a required value. |
resource | Insert the resource to include in token requests. This is an optional field. |
token-url | Insert the URL to fetch tokens from. You must enter either a token URL or an Azure tenant. |
tenant | Enter the Azure tenant. You must enter either a token URL or an Azure tenant. |
min-ttl | Insert the minimum time in seconds a token will be valid. If the cached token expires in less than min_ttl seconds, it will be refreshed. The default value is 30. |
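As a sketch, a cognite section using IdP authentication against Microsoft Entra ID could combine these parameters with environment-variable substitution as shown below. The project name, data set, and scope are illustrative placeholders.

```yaml
cognite:
  project: my-project
  host: https://api.cognitedata.com
  idp-authentication:
    client-id: ${CLIENT_ID}
    secret: ${CLIENT_SECRET}
    scopes:
      - https://api.cognitedata.com/.default   # typical scope for CDF with Entra ID
    tenant: ${AZURE_TENANT_ID}
    min-ttl: 30
  data-set:
    external-id: db-extractor-data   # illustrative; you can use id instead of external-id
```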
Extractor
The optional extractor section contains tuning parameters.
Parameter | Description |
---|---|
mode | Set the execution mode. Options are single or continuous . Use continuous to run the extractor in a continuous mode, executing the queries defined in the queries section. The default value is single . |
upload-queue-size | Enter the size of the upload queue. The default value is 50 000 rows. |
parallelism | Insert the number of parallel queries to run. The default value is 4. |
state-store | The state store configuration. See the State store section. The default value is no state store, and incremental load is deactivated. |
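For instance, to run the extractor continuously with a local state store for incremental load, the extractor section might be configured as in this sketch. All values are illustrative.

```yaml
extractor:
  mode: continuous
  upload-queue-size: 50000
  parallelism: 4
  state-store:
    local:
      path: states.json    # illustrative file path
      save-interval: 30
```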
State store
Include the state store section to save extraction states between runs. Use this if data is loaded incrementally. We support multiple state stores, but you can only configure one at a time.
Parameter | Description |
---|---|
local | Local state store configuration. See the Local section. |
raw | RAW state store configuration. See the RAW section. |
Local
A local state store stores the extraction state in a JSON file on the local machine.
Parameter | Description |
---|---|
path | Insert the file path to a JSON file. |
save-interval | Enter the interval in seconds between each save. The default value is 30. |
RAW
A RAW state store stores the extraction state in a table in CDF RAW.
Parameter | Description |
---|---|
database | Enter the database name in CDF RAW. |
table | Enter the table name in CDF RAW. |
upload-interval | Enter the interval in seconds between each save. The default value is 30. |
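As an alternative to a local JSON file, the state can be kept in CDF RAW. A minimal sketch, with illustrative database and table names:

```yaml
extractor:
  state-store:
    raw:
      database: db-extractor        # illustrative RAW database name
      table: extraction-states      # illustrative RAW table name
      upload-interval: 30
```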
Metrics
The metrics section describes where to send metrics on extractor performance for remote monitoring of the extractor. We recommend sending metrics to a Prometheus pushgateway, but you can also send metrics as time series in the CDF project.
Parameter | Description |
---|---|
push-gateways | List of pushgateway configurations. See the Pushgateways section. |
cognite | Cognite metrics configurations. See the Cognite section. |
Pushgateways
The push-gateways section contains a list of metric destinations, each following this schema:
Parameter | Description |
---|---|
host | Enter the address of the host to push metrics to. This is a required value. |
job-name | Enter the value of the exported_job label to associate metrics with. This separates several deployments on a single pushgateway, and should be unique. This is a required value. |
username | Enter the credentials for the pushgateway. This is a required value. |
password | Enter the credentials for the pushgateway. This is a required value. |
clear-after | Enter the number of seconds to wait before clearing the pushgateway. When this parameter is present, the extractor will stall after the run is complete before deleting all metrics from the pushgateway. The recommended value is at least twice that of the scrape interval on the pushgateway. This is to ensure that the last metrics are gathered before the deletion. |
push-interval | Enter the interval in seconds between each push. The default value is 30. |
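A metrics section pushing to a single Prometheus pushgateway might look like the sketch below. The host, job name, and credentials are placeholders.

```yaml
metrics:
  push-gateways:
    - host: https://pushgateway.my-domain.com   # illustrative pushgateway address
      job-name: db-extractor-site-a             # must be unique per deployment
      username: ${PROMETHEUS_USER}
      password: ${PROMETHEUS_PASSWORD}
      push-interval: 30
      clear-after: 120   # at least twice the pushgateway scrape interval
```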
Databases
The databases section contains a list of database configurations. The schema for a database configuration depends on which database type you are connecting to. These are distinguished by the type parameter. Possible database types include:
- ODBC
- PostgreSQL
- MongoDB
- Azure Cosmos DB
- Snowflake
- Oracle DB
- Local spreadsheet files: Microsoft Excel, OpenDocument sheets and .csv files.
- Amazon Redshift
- Amazon DynamoDB
Navigate to Integrate > Connect to source system > Cognite DB extractor in CDF to see all supported sources and the recommended approach.
This is the schema for ODBC databases:
Parameter | Description |
---|---|
type | Select the type of database connection. Set to ODBC for ODBC databases. |
name | Enter a name for the database that will be used throughout the queries section and for logging. The name must be unique for each database in the configuration file. This is a required value. |
connection-string | Enter the ODBC connection string. See Connect to a database section for details. This is a required value. |
database-type | Enter the database type to connect to. Options are GENERIC , MYSQL , and ORACLE . The default value is GENERIC . |
timeout | Enter the timeout in seconds for the ODBC connection and queries. The default value is no timeout. Some ODBC drivers don't accept either the SQL_ATTR_CONNECTION_TIMEOUT or the SQL_ATTR_QUERY_TIMEOUT option. The extractor will then log an exception with the message Could not set timeout on the ODBC driver - timeouts might not work properly . Extraction will continue regardless, but without timeouts. To avoid this log line, you can disable timeouts for the database causing these problems. |
batch-size | Enter the number of rows to fetch from the database at a time. The default value is 1000. You can decrease this number if the machine with the extractor runs out of memory. Note that this will increase the run time. |
timezone | Specify how the extractor should handle the source timezone. Possible values are local and utc . Default value is local . |
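For example, a databases list with a single ODBC entry reusing the Oracle connection string shown earlier might look like this sketch. The name, driver, and timeout are illustrative, and the type value is written in lowercase here for consistency with the other database types.

```yaml
databases:
  - type: odbc   # listed as ODBC in the table above; lowercase assumed here
    name: oracle-xe
    connection-string: DRIVER={Oracle 19.3};DBQ=localhost:1521/XE;UID=SYSTEM;PWD=${DATABASE_PASSWORD}
    database-type: ORACLE
    timeout: 60
    batch-size: 1000
    timezone: utc
```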
This is the schema for PostgreSQL databases:
Parameter | Description |
---|---|
type | Type of database connection, set to postgres for PostgreSQL databases. |
name | Enter a user-given name of this database that will be used throughout the queries section and for logging. The name must be unique for each database in the configuration file. This is a required value. |
host | Enter the host name or address for the database. |
user | Enter the username for the database. |
password | Enter the password for the database. |
database | Enter the database name to use. The default is to use the user name. |
port | Enter the port for the database. The default value is 5432. |
timeout | Enter the timeout in seconds for the database connection and queries. The default value is no timeout. |
batch-size | Number of rows to fetch from the database at a time. The default value is 1000. You can decrease this number if the machine running the extractor is running out of memory, but this will increase the run time. |
timezone | Specify how the extractor should handle the source timezone. Possible values are local and utc . Default value is local . |
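A corresponding PostgreSQL entry could be sketched as follows. The host, credentials, and database name are placeholders.

```yaml
databases:
  - type: postgres
    name: my-postgres
    host: postgres.my-domain.com   # illustrative host
    user: extractor
    password: ${POSTGRES_PASSWORD}
    database: production
    port: 5432
    timezone: local
```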
This is the schema for MongoDB databases:
Parameter | Description |
---|---|
type | Type of database connection, set to mongodb for MongoDB databases. |
name | Enter a user-given name of this database that will be used throughout the queries section and for logging. The name must be unique for each database in the configuration file. This is a required value. |
uri | Address and authentication data for the database as a Uniform Resource Identifier. You can find URI examples here. |
database | Name of the MongoDB database to use. |
timezone | Specify how the extractor should handle the source timezone. Possible values are local and utc . Default value is local . |
This is the schema for Azure Cosmos DB databases:
Parameter | Description |
---|---|
type | Type of database connection, set to cosmosdb for Cosmos DB databases. |
name | Enter a user-given name of this database that will be used throughout the queries section and for logging. The name must be unique for each database in the configuration file. This is a required value. |
host | Host address for the database. |
key | Authentication key for the database. |
database | Database name to use. |
timezone | Specify how the extractor should handle the source timezone. Possible values are local and utc . Default value is local . |
This is the schema for Snowflake warehouses:
Parameter | Description |
---|---|
type | Type of database connection, set to snowflake for Snowflake data warehouses. |
name | Enter a user-given name of this database that will be used throughout the queries section and for logging. The name must be unique for each database in the configuration file. This is a required value. |
user | User name for the database. |
password | Password for the database. |
account | Snowflake account name. |
organization | Snowflake organization name. |
database | Database name to use. |
schema | Schema to use. |
timeout | Enter the timeout in seconds for the database connection and queries. The default value is no timeout. |
batch-size | Number of rows to fetch from the database at a time. Default is 1000. This can be decreased if the machine running the extractor is running out of memory, but that will increase run time. |
timezone | Specify how the extractor should handle the source timezone. Possible values are local and utc . Default value is local . |
This is the schema for local spreadsheets:
Parameter | Description |
---|---|
type | Type of connection, set to spreadsheet for local spreadsheet files. |
name | Enter a user-given name of this database that will be used throughout the queries section and for logging. The name must be unique for each database in the configuration file. This is a required value. |
path | Path where the spreadsheet file is located. |
timezone | Specify how the extractor should handle the source timezone. Possible values are local and utc . Default value is local . |
This is the schema for Oracle databases:
Parameter | Description |
---|---|
type | Type of database connection, set to oracle for Oracle databases. |
name | Enter a user-given name of this database that will be used throughout the queries section and for logging. The name must be unique for each database in the configuration file. This is a required value. |
host | Enter the host name or address for the database. |
user | Enter the username for the database. |
password | Enter the password for the database. |
port | Enter the port for the database. The default value is 1521. |
timeout | Enter the timeout in seconds for the database connection and queries. The default value is no timeout. |
batch-size | Number of rows to fetch from the database at a time. The default value is 1000. You can decrease this number if the machine running the extractor is running out of memory, but this will increase the run time. |
timezone | Specify how the extractor should handle the source timezone. Possible values are local and utc . Default value is local . |
This is the schema for Amazon DynamoDB:
Parameter | Description |
---|---|
type | Type of database connection, set to dynamodb for Amazon DynamoDB databases. |
name | Enter a user-given name of this database that will be used throughout the queries section and for logging. The name must be unique for each database in the configuration file. This is a required value. |
aws_access_key_id | AWS authentication access key. This is a required value. |
aws_secret_access_key | AWS authentication secret. This is a required value. |
region_name | AWS region where your database is located. Example: us-east-1. This is a required value. |
timezone | Specify how the extractor should handle the source timezone. Possible values are local and utc . Default value is local . |
This is the schema for Amazon Redshift:
Parameter | Description |
---|---|
type | Type of database connection, set to redshift for Amazon Redshift databases. |
name | Enter a user-given name of this database that will be used throughout the queries section and for logging. The name must be unique for each database in the configuration file. This is a required value. |
aws_access_key_id | AWS authentication access key. This is a required value. |
aws_secret_access_key | AWS authentication secret. This is a required value. |
region_name | AWS region where your database is located. Example: us-east-1. This is a required value. |
database | Enter the name of the Redshift database. This is a required value. |
secret_arn | AWS Secret ARN. This is an optional value. |
cluster_identifier | Name of the Redshift cluster to connect to. This parameter is mandatory when connecting to a managed Redshift cluster. |
workgroup_name | Name of the Redshift workgroup to connect to. This parameter is mandatory when connecting to a Redshift Serverless database. |
timezone | Specify how the extractor should handle the source timezone. Possible values are local and utc . Default value is local . |
Queries
The queries section contains a list of queries to run. Each query is configured according to this schema:
Parameter | Description |
---|---|
name | Enter the name of this query that will be used for logging and tagging metrics. The name must be unique for each query in the configuration file. This is a required value. |
database | Enter the name of the database to connect to. This must be one of the database names configured in the databases section. This is a required value. |
primary-key | Insert the format of the row key in CDF RAW. This parameter supports case-sensitive substitutions with values from the table columns. For example, if there's a column called index, setting primary-key: "row_{index}" will result in rows with keys row_0 , row_1 , etc. This is a required value if the destination is a raw type. |
query | Enter a SQL query to run against the database. The query can contain joins as long as the database supports joins. If data is loaded incrementally, the query should contain a WHERE clause using the substitutions {incremental_field} and {start_at}, and the query must be sorted on the incremental field. Example: SELECT * FROM table WHERE {incremental_field} > {start_at} ORDER BY {incremental_field} ASC . This is a required value. |
incremental-field | Insert the table column that holds the incremental field. |
initial-start | Enter the {start_at} value for the first run. Later runs will use the value stored in the state store. |
schedule | Schedule the interval at which the query should be executed against the source database. Use this parameter when the extractor is set to continuous mode. See the Schedule section. |
destination | The destination of the data in CDF. See the Destinations section for the available destination types. This is a required value. |
collection | Specify the collection on which the query will be executed. This parameter is mandatory when connecting to mongodb databases. |
container | Specify the container on which the query will be executed. This parameter is mandatory when connecting to cosmosdb databases. |
sheet | Specify the sheet on which the query will be executed. This parameter is mandatory when connecting to spreadsheet local files. |
parameters | Specify the parameters to be used when querying AWS DynamoDB. This parameter is mandatory when connecting to dynamodb databases. |
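Putting these parameters together, a query that loads rows incrementally from the hypothetical my-postgres database sketched earlier into CDF RAW might look like this. The table and column names are illustrative, and the raw destination type is described in the Destinations section below.

```yaml
queries:
  - name: measurements-to-raw
    database: my-postgres
    primary-key: "row_{id}"           # assumes the source table has an id column
    query: >
      SELECT * FROM measurements
      WHERE {incremental_field} > {start_at}
      ORDER BY {incremental_field} ASC
    incremental-field: updated_at     # illustrative incremental column
    initial-start: 0
    destination:
      type: raw
      database: my-raw-db
      table: measurements
```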
Schedule
Use the schedule subsection to schedule runs when the extractor runs as a service.
Parameter | Description |
---|---|
type | Insert the schedule type. Valid options are cron and interval . cron uses regular cron expressions. interval expects an interval-based schedule. |
expression | Enter the cron or interval expression to trigger the query. For example, 1h repeats the query hourly, and 5m repeats the query every 5 minutes. |
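For example, to run a query every hour when the extractor runs in continuous mode, the schedule subsection could be sketched as:

```yaml
schedule:
  type: interval
  expression: 1h
  # or, equivalently, with a cron expression:
  # type: cron
  # expression: "0 * * * *"
```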
Destinations
The DB extractor supports a number of destinations. These include:
- CDF RAW
- CDF Assets
- CDF Events
- CDF Files
- CDF Time series
- CDF Sequences
RAW
The raw destination writes data to the CDF staging area (RAW). The raw destination requires the primary-key parameter in the query configuration.
Parameter | Description |
---|---|
type | Type of CDF destination, set to raw to write data to RAW. |
database | Enter the CDF RAW database to upload data into. This will be created if it doesn't exist. This is a required value. |
table | Enter the CDF RAW table to upload data into. This will be created if it doesn't exist. This is a required value. |
Time series
The time_series destination inserts the resulting data as data points in time series. The time series destination is configured by setting the type parameter to time_series. No other parameters are required.
To ingest data into a time series, the query must produce columns named:
- externalId
- timestamp
- value
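As an illustration, a query whose result columns are aliased to the required names and sent to the time series destination might be sketched like this. The source table and columns are hypothetical, and the quoted aliases preserve the column-name casing in PostgreSQL.

```yaml
queries:
  - name: sensor-datapoints
    database: my-postgres
    query: >
      SELECT tag AS "externalId",
             reading_time AS "timestamp",
             reading AS "value"
      FROM sensor_readings
      WHERE {incremental_field} > {start_at}
      ORDER BY {incremental_field} ASC
    incremental-field: reading_time
    destination:
      type: time_series
```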
Assets
The assets destination inserts the resulting data as CDF assets. The assets destination is configured by setting the type parameter to assets. No other parameters are required.
To ingest data into assets, the query must produce a column named name. In addition, the columns externalId, parentExternalId, description, and source may be included and will be mapped to the corresponding fields in CDF assets. Any other columns returned by the query will be mapped to key/value pairs in the metadata field for assets.
Events
The events destination inserts the resulting data as CDF events. The events destination is configured by setting the type parameter to events. No other parameters are required.
To ingest data into events, the query must produce a column named externalId. In addition, the columns startTime, endTime, description, source, type, and subType may be included and will be mapped to the corresponding fields in CDF events. Any other columns returned by the query will be mapped to key/value pairs in the metadata field for events.
Files
The files destination inserts the resulting data as CDF files. The files destination is configured by setting the type parameter to files. No other parameters are required.
To ingest data into files, the query must produce columns named:
- name
- externalId
- content

The content column will be treated as binary data and uploaded to CDF files as the content of the file. In addition, the columns source, mimeType, directory, sourceCreatedTime, sourceModifiedTime, and asset_ids may be included and will be mapped to the corresponding fields in CDF files. Any other columns returned by the query will be mapped to key/value pairs in the metadata field for files.
Sequences
The sequence destination writes data to a CDF sequence.
Parameter | Description |
---|---|
type | Type of CDF destination, set to sequence to write data to CDF sequences. |
external-id | The external ID of the sequence to ingest data into. The sequence will be created if it doesn't already exist. |
The column set of the query result will determine the columns of the sequence.
The result must include a column named row_number, which must contain an integer indicating which row of the sequence to ingest the data into.
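For example, a query that builds sequence rows from a hypothetical source table, using the row_number column to place each row, could look like this sketch. The external ID, table, and columns are illustrative.

```yaml
queries:
  - name: calibration-curve
    database: my-postgres
    query: >
      SELECT point_index AS "row_number",
             temperature,
             pressure
      FROM calibration_points
      ORDER BY point_index ASC
    destination:
      type: sequence
      external-id: calibration-curve-01   # created if it doesn't already exist
```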