Configure the File Extractor
To configure the File Extractor, create a configuration file in YAML format.
The configuration file allows substitutions with environment variables:

```yaml
config-parameter: ${CONFIG_VALUE}
```
Implicit substitutions only work for unquoted value strings. For quoted strings, use the `!env` tag to activate environment substitution:

```yaml
config-parameter: !env 'PARAM=SYSTEM;CONFIG=${CONFIG_VALUE}'
```
The configuration file also contains the global parameter `version`, which holds the version of the configuration schema used in the configuration file. This document describes version 3 of the configuration schema.
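For orientation, a minimal version 3 configuration might be laid out as in the following sketch. All values are placeholders, and every section is described in detail below:

```yaml
version: 3

logger:
  console:
    level: INFO

cognite:
  project: my-project                # placeholder CDF project name
  host: https://api.cognitedata.com

files:
  type: local                        # one of the source types described below
  path: /data/files                  # placeholder path
```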
You can set up extraction pipelines to use versioned extractor configuration files stored in the cloud.
Logger
The optional `logger` section sets up logging to a console and files.
Parameter | Description |
---|---|
console | Sets up console logger configuration. See the Console section. |
file | Sets up file logger configuration. See the File section. |
Console
Include the `console` section to enable logging to standard output, such as a terminal window.
Parameter | Description |
---|---|
level | Select the verbosity level for console logging. Valid options, in decreasing order of verbosity, are `DEBUG`, `INFO`, `WARNING`, `ERROR`, and `CRITICAL`. |
File
Include the `file` section to enable logging to a file. The files are rotated daily.
Parameter | Description |
---|---|
level | Select the verbosity level for file logging. Valid options, in decreasing order of verbosity, are `DEBUG`, `INFO`, `WARNING`, `ERROR`, and `CRITICAL`. |
path | Insert the path to the log file. |
retention | Specify the number of days to keep logs for. The default value is 7. |
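As an example, a `logger` section that writes INFO-level messages to the console and DEBUG-level messages to a daily-rotated file might look like this sketch (the path is a placeholder):

```yaml
logger:
  console:
    level: INFO
  file:
    level: DEBUG
    path: /var/log/file-extractor.log  # placeholder log file path
    retention: 7                       # keep logs for 7 days (the default)
```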
Cognite
The `cognite` section describes which CDF project the extractor will load data into and how to connect to the project.
Parameter | Description |
---|---|
project | Insert the CDF project name. This is a required value. |
host | Insert the base URL of the CDF project. The default value is https://api.cognitedata.com. |
api-key | Insert the API key for authentication. We've deprecated API-key authentication and strongly encourage customers to migrate to IdP authentication. |
idp-authentication | Insert the credentials for authenticating to CDF using an external identity provider. You must use either an API key or IdP authentication. |
data-set | Insert an optional data set ID that will be used if you've set the extractor to create missing time series. This value must contain either `id` or `external-id`. |
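A `cognite` section might look like the following sketch; the project name and data set ID are placeholders, and the `idp-authentication` subsection is shown in the next section:

```yaml
cognite:
  project: my-project                # placeholder CDF project name
  host: https://api.cognitedata.com  # the default host
  data-set:
    id: 1234567890123456             # placeholder data set ID
  # idp-authentication: see the example in the next section
```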
Identity provider (IdP) authentication
The `idp-authentication` section enables the extractor to authenticate to CDF using an external identity provider, such as Azure AD.
Parameter | Description |
---|---|
client-id | Enter the client ID from the IdP. This is a required value. |
secret | Enter the client secret from the IdP. This is a required value. |
scopes | List the scopes. This is a required value. |
resource | Insert the resource to send with token requests. This is an optional field. |
token-url | Insert the URL to fetch tokens from. You must enter either a token URL or an Azure tenant. |
tenant | Enter the Azure tenant. You must enter either a token URL or an Azure tenant. |
min-ttl | Insert the minimum time in seconds a token will be valid. If the cached token expires in less than `min-ttl` seconds, it will be refreshed. The default value is 30. |
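For illustration, an `idp-authentication` section for Azure AD might look like this sketch. The tenant ID is a placeholder, the credentials are injected through environment variables, and the scope shown assumes the default CDF host:

```yaml
cognite:
  project: my-project
  idp-authentication:
    client-id: ${CLIENT_ID}      # placeholder client ID from the IdP
    secret: ${CLIENT_SECRET}     # placeholder client secret
    tenant: 00000000-0000-0000-0000-000000000000  # placeholder Azure tenant ID
    scopes:
      - https://api.cognitedata.com/.default
    min-ttl: 30                  # refresh tokens expiring within 30 seconds
```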
Extractor
The optional `extractor` section contains tuning parameters.
Parameter | Description |
---|---|
errors_threshold | Enter the number of retries the extractor should perform when a file extraction fails. The default value is 5. |
parallelism | Insert the number of parallel queries to run. The default value is 4. |
state-store | Set up a state store to save extraction states between runs. By default, no state store is configured and incremental load is deactivated. See the State store section. |
schedule | Schedule the interval at which file extraction should be executed. Use this parameter when the extractor runs in continuous mode. See the Schedule section. |
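For example, an `extractor` section that retries failed extractions up to 10 times and runs 8 parallel queries might look like this sketch:

```yaml
extractor:
  errors_threshold: 10   # retries per failed file extraction (default 5)
  parallelism: 8         # number of parallel queries (default 4)
```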
Schedule
Use the `schedule` subsection to schedule runs when the extractor runs as a service.
Parameter | Description |
---|---|
type | Insert the schedule type. Valid options are `cron` and `interval`. `cron` uses regular cron expressions, while `interval` expects an interval-based schedule. |
expression | Enter the cron or interval expression to trigger the query. For example, `1h` repeats the query hourly, and `5m` repeats the query every 5 minutes. |
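For instance, a schedule that triggers extraction every hour might look like this sketch:

```yaml
extractor:
  schedule:
    type: interval
    expression: 1h          # run every hour
    # alternatively, with type: cron
    # expression: 0 * * * * # top of every hour
```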
State store
Use the `state-store` subsection to save extraction states between runs. Use this if data is loaded incrementally. We support multiple state stores, but you can only configure one at a time.
Parameter | Description |
---|---|
local | Local state store configuration. See the Local section. |
raw | RAW state store configuration. See the RAW section. |
Local
Use the `local` section to store the extraction state in a JSON file on the local machine.
Parameter | Description |
---|---|
path | Insert the file path to a JSON file. |
save-interval | Enter the interval in seconds between each save. The default value is 30 seconds. |
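A local state store that saves to a JSON file every 30 seconds might look like this sketch, assuming the state store is configured under the `extractor` section as described above; the path is a placeholder:

```yaml
extractor:
  state-store:
    local:
      path: /var/lib/file-extractor/state.json  # placeholder state file path
      save-interval: 30                          # the default
```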
RAW
Use the `raw` section to store the extraction state in a table in the CDF staging area.
Parameter | Description |
---|---|
database | Enter the database name in the CDF staging area. |
table | Enter the table name in the CDF staging area. |
upload-interval | Enter the interval in seconds between each save. The default value is 30 seconds. |
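A RAW state store writing to a table in the CDF staging area might look like this sketch; the database and table names are placeholders:

```yaml
extractor:
  state-store:
    raw:
      database: file-extractor   # placeholder database name
      table: extraction-state    # placeholder table name
      upload-interval: 30        # the default
```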
Files
The `files` section contains the configuration needed to connect to the file source. The schema for the file configuration depends on which file source you're connecting to. The sources are distinguished by the `type` parameter. Possible file source types include:
- Azure Blob Storage
- FTP / FTPS
- Google Cloud Storage
- Local files
- Amazon S3
- Samba / SMB
- SFTP
- SharePoint Online
Navigate to Integrate > Connect to source system > Cognite File Extractor in CDF to see all supported sources and the recommended approach.
This is the schema for the Azure Blob Storage source:
Parameter | Description |
---|---|
type | Type of file source, set to `azure_blob_storage` for Azure Blob Storage files. |
connection_string | Connection string needed to connect to Azure Blob Storage. This is a mandatory field. |
containers | List of Azure Blob containers. This is an optional field. |
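A `files` section for Azure Blob Storage might look like this sketch; the connection string is injected through an environment variable, and the container names are placeholders:

```yaml
files:
  type: azure_blob_storage
  connection_string: ${AZURE_STORAGE_CONNECTION_STRING}  # placeholder
  containers:
    - documents   # placeholder container names
    - reports
```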
This is the schema for the FTP/FTPS source:
Parameter | Description |
---|---|
type | Type of file source, set to `ftp` for the FTP or FTPS source. |
base-url | Enter the base URL for the FTP server. This is a mandatory field. |
port | Enter the port for the FTP server. This is an optional field. |
client-login | Enter the FTP username. This is a mandatory field. |
client-password | Enter the FTP password. This is a mandatory field. |
main-folder | Enter the root directory where the extractor will start extraction. This is an optional field. |
with-subfolders | Set to `true` to let the extractor traverse into subfolders and retrieve the files they contain. Possible values are `true` or `false`. The default value is `false`. This is an optional field. |
use-ssl | When set to `true`, the extractor connects to the source using SSL (FTPS). Possible values are `true` or `false`. The default value is `false`. This is an optional field. |
certificate-file-path | Enter the path to the certificate file. This is an optional field. |
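An FTPS configuration might look like this sketch; the server address, credentials, and folder are placeholders:

```yaml
files:
  type: ftp
  base-url: ftp.example.com         # placeholder server address
  port: 21
  client-login: ${FTP_USER}         # placeholder credentials
  client-password: ${FTP_PASSWORD}
  main-folder: /outgoing            # placeholder root directory
  with-subfolders: true
  use-ssl: true                     # connect using FTPS
```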
This is the schema for the Google Cloud Storage source:
Parameter | Description |
---|---|
type | Type of file source, set to `gcp_cloud_storage` for the Google Cloud Storage source. |
google-application-credentials | Enter the Google Cloud Platform service account credentials (encoded in base64 format). This is a mandatory field. |
bucket | Enter the name of the bucket where the files are located. This is a mandatory field. |
folders | Enter the list of folders where the files are located. This is a mandatory field. |
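A Google Cloud Storage configuration might look like this sketch; the credentials, bucket, and folder names are placeholders:

```yaml
files:
  type: gcp_cloud_storage
  google-application-credentials: ${GCP_CREDENTIALS_BASE64}  # base64-encoded service account key (placeholder)
  bucket: my-bucket    # placeholder bucket name
  folders:
    - incoming         # placeholder folder names
    - archive
```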
This is the schema for the local files source:
Parameter | Description |
---|---|
type | Type of file source, set to `local` for local files. |
path | Enter the path (absolute or relative) where the local files are located. This is a mandatory parameter. |
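A local files configuration is the simplest case; the path below is a placeholder:

```yaml
files:
  type: local
  path: /data/files   # placeholder; absolute or relative path
```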
This is the schema for the Amazon S3 source:
Parameter | Description |
---|---|
type | Type of file source, set to `aws_s3` for the Amazon S3 source. |
aws_access_key_id | Enter the AWS Access Key ID. This is a mandatory parameter. |
aws_secret_access_key | Enter the AWS Secret Access Key. This is a mandatory field. |
bucket | Enter the name of the bucket where the files are located. This is a mandatory field. |
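An Amazon S3 configuration might look like this sketch; the credentials are injected through environment variables, and the bucket name is a placeholder:

```yaml
files:
  type: aws_s3
  aws_access_key_id: ${AWS_ACCESS_KEY_ID}          # placeholder credentials
  aws_secret_access_key: ${AWS_SECRET_ACCESS_KEY}
  bucket: my-bucket                                # placeholder bucket name
```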
This is the schema for the Samba / SMB source:
Parameter | Description |
---|---|
type | Type of file source, set to `smb` for the Samba source. |
server | Enter the Samba server address. This is a mandatory field. |
share_path | Enter the Samba server share path. This is a mandatory field. |
username | Enter the Samba server username. This is a mandatory field. |
password | Enter the Samba server password. This is a mandatory field. |
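A Samba / SMB configuration might look like this sketch; the server address, share path, and credentials are placeholders:

```yaml
files:
  type: smb
  server: fileserver.example.com   # placeholder server address
  share_path: /shared/documents    # placeholder share path
  username: ${SMB_USER}            # placeholder credentials
  password: ${SMB_PASSWORD}
```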
This is the schema for the SFTP source:
Parameter | Description |
---|---|
type | Type of file source, set to `sftp` for the SFTP source. |
base-url | Enter the base URL for the SFTP server. This is a mandatory field. |
port | Enter the port for the SFTP server. This is an optional field. |
client-login | Enter the SFTP username. This is a mandatory field. |
client-password | Enter the SFTP password. This is a mandatory field. |
main-folder | Enter the root directory where the extractor will start extraction. This is an optional field. |
with-subfolders | Set to `true` to let the extractor traverse into subfolders and retrieve the files they contain. Possible values are `true` or `false`. The default value is `false`. This is an optional field. |
certificate-file-path | Enter the path to the certificate file. This is an optional field. |
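An SFTP configuration might look like this sketch; the server address, credentials, and folder are placeholders:

```yaml
files:
  type: sftp
  base-url: sftp.example.com        # placeholder server address
  port: 22
  client-login: ${SFTP_USER}        # placeholder credentials
  client-password: ${SFTP_PASSWORD}
  main-folder: /outgoing            # placeholder root directory
  with-subfolders: true
```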
This is the schema for the SharePoint Online source:
Parameter | Description |
---|---|
type | Type of file source, set to `sharepoint_online` for the SharePoint Online source. |
client-id | Enter the App registration client ID. This is a mandatory field. |
client-secret | Enter the App registration secret. This is a mandatory field. |
tenant-id | Enter the Azure tenant related to the App registration. This is a mandatory field. |
base-url | Enter the SharePoint Online base URL. This is a mandatory field. |
site | Enter the SharePoint site where the document library is located. This is a mandatory field. |
document-library | Enter the SharePoint document library where the files are located. This is a mandatory field. |
with-subfolders | Set to `true` to let the extractor traverse into subfolders and retrieve the files they contain. Possible values are `true` or `false`. The default value is `false`. This is an optional field. |
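A SharePoint Online configuration might look like this sketch; the tenant ID, base URL, site, and document library are placeholders, and the credentials are injected through environment variables:

```yaml
files:
  type: sharepoint_online
  client-id: ${SP_CLIENT_ID}         # placeholder App registration credentials
  client-secret: ${SP_CLIENT_SECRET}
  tenant-id: 00000000-0000-0000-0000-000000000000  # placeholder Azure tenant ID
  base-url: https://contoso.sharepoint.com         # placeholder base URL
  site: engineering                  # placeholder site name
  document-library: Documents        # placeholder document library
  with-subfolders: true
```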