
Configure the PI extractor

To configure the PI extractor, you must create a configuration file. This file must be in YAML format. The configuration file is split into sections, each represented by a top-level entry in the YAML format. Subsections are nested under a section in the YAML format.

You can use either of the sample configuration files included with the installer, complete or minimal, as a starting point for your configuration settings:

  • config.default.yml - This file contains all configuration options and descriptions.

  • config.minimal.yml - This file contains a minimum configuration and no descriptions.

Naming the configuration file

You must name the configuration file config.yml.

Tip

You can set up extraction pipelines to use versioned extractor configuration files stored in the cloud.

Before you start

  • Optionally, copy one of the sample files in the config directory and rename it to config.yml.
  • The config.minimal.yml file doesn't include a metrics section. Copy this section from the example below if the extractor is required to send metrics to a Prometheus Pushgateway.
  • Set up an extraction pipeline and note the external ID.

Minimal YAML configuration file

The YAML settings below form a valid PI extractor 2.1 configuration. The values wrapped in ${} are replaced with environment variables with that name. For example, ${COGNITE_PROJECT} will be replaced with the value of the environment variable called COGNITE_PROJECT.

The configuration file has a global parameter version, which holds the version of the configuration schema used in the configuration file. This document describes version 3 of the configuration schema.

version: 3

cognite:
  project: '${COGNITE_PROJECT}'
  idp-authentication:
    tenant: ${COGNITE_TENANT_ID}
    client-id: ${COGNITE_CLIENT_ID}
    secret: ${COGNITE_CLIENT_SECRET}
    scopes:
      - ${COGNITE_SCOPE}

time-series:
  external-id-prefix: 'pi:'

pi:
  host: ${PI_HOST}
  username: ${PI_USER}
  password: ${PI_PASSWORD}

state-store:
  database: LiteDb
  location: 'state.db'

logger:
  file:
    level: 'information'
    path: 'logs/log.log'

metrics:
  push-gateways:
    - host: 'https://prometheus-push.cognite.ai/'
      username: ${PROMETHEUS_USER}
      password: ${PROMETHEUS_PASSWORD}
      job: ${PROMETHEUS_JOB}

where:

  • version is the version of the configuration schema. Use version 3 to be compatible with the Cognite PI extractor 2.1.

  • cognite and pi configure how the extractor reads the authentication details for CDF and PI, respectively, from environment variables. Since no host is specified in the cognite section, the extractor uses the default value, https://api.cognitedata.com, and assumes that the PI server uses Windows authentication.

  • time-series configures the extractor to create time series in CDF where the external IDs will be prefixed with pi:. You can also use the data-set-id configuration to add all time series created by the extractor to a particular data set.

  • state-store configures the extractor to save the extraction state locally using a LiteDB database file named state.db.

  • logger configures the extractor to log at information level and write log messages to the file logs/log.log. By default, new files are created daily and retained for 31 days. The date is appended to the file name.

  • metrics points to the Prometheus Pushgateway hosted by CDF. It assumes that a user has already been created. For the Pushgateway hosted by CDF, the job name (${PROMETHEUS_JOB}) must start with the username followed by '-'.

Configure the PI extractor

Cognite

Include the cognite section to configure which CDF project the extractor will load data into and how to connect to the project. This section is mandatory and should always contain the project and authentication configuration.

  • project - Insert the CDF project name you want to ingest data into. This is a required value.
  • api-key - We've deprecated API-key authentication and strongly encourage customers to migrate to authentication with an IdP.
  • host - Insert the base URL of the CDF project. The default value is https://api.cognitedata.com.
  • idp-authentication - Insert the credentials for authenticating to CDF using an external identity provider. You must enter either an API key or use IdP authentication.
  • extraction-pipeline - Insert the external ID of an extraction pipeline in CDF. You should create the extraction pipeline before you configure this section.
  • metadata-targets - Configuration of the targets for metadata, meaning assets, time series metadata, and relationships.
  • metadata-targets/clean - Configuration for enabling writing to clean. Options:
      • timeseries - Set to false to disable writing to CDF time series. Metadata includes the name, description, and unit fields. The default value is true.
  • metadata-targets/raw - Configuration for writing to CDF RAW. Options:
      • database - The RAW database to write to. This is required.
      • timeseries-table - Name of the RAW table to write time series to; setting this enables writing metadata to RAW. Metadata includes the name, description, and unit fields, which are added as columns.
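
For illustration, this is a sketch of a cognite section that enables both metadata targets. The extraction pipeline ID and RAW names are hypothetical, and the nested external-id form of the extraction-pipeline entry is an assumption based on the convention used by Cognite's .NET extractors:

cognite:
  project: '${COGNITE_PROJECT}'
  host: 'https://api.cognitedata.com'
  extraction-pipeline:
    # Hypothetical external ID; create the pipeline in CDF first.
    external-id: 'pi-extraction-pipeline'
  metadata-targets:
    clean:
      timeseries: true
    raw:
      # Hypothetical RAW database and table names.
      database: 'pi'
      timeseries-table: 'timeseries-metadata'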

Identity provider (IdP) authentication

Include the idp-authentication section to enable the extractor to authenticate to CDF using an external identity provider, such as Azure AD.

  • client-id - Enter the client ID from the IdP. This is a required value.
  • secret - Enter the client secret from the IdP. This is a required value.
  • scopes - List the scopes. This is a required value.
  • tenant - Enter the Azure tenant. This is a required value.
  • authority - Insert the base URL of the authority. The default value is https://login.microsoftonline.com/.
  • min-ttl - Insert the minimum time in seconds a token will be valid. The cached token is refreshed if it expires in less than min-ttl seconds. The default value is 30.
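
For example, this is the subsection from the minimal configuration above, extended with an explicit min-ttl (the value 60 is illustrative):

cognite:
  idp-authentication:
    tenant: ${COGNITE_TENANT_ID}
    client-id: ${COGNITE_CLIENT_ID}
    secret: ${COGNITE_CLIENT_SECRET}
    scopes:
      - ${COGNITE_SCOPE}
    min-ttl: 60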

CDF Retry policy

Include the cdf-retries subsection to configure the automatic retry policy used for requests to CDF. This subsection is optional. If you don't enter any values, the extractor uses the default values.

  • timeout - Specify the timeout in milliseconds for each individual try. The default value is 80000.
  • max-retries - Enter the maximum number of retries. If you enter a negative value, the extractor keeps retrying. The default value is 5.
  • max-delay - Enter the maximum delay in milliseconds between each try. The base delay is calculated as 125*2^retry ms. If this value is less than 0, there is no maximum. 0 means that there is never any delay. The default value is 5000.
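
A sketch of the subsection with the documented defaults written out explicitly:

cognite:
  cdf-retries:
    timeout: 80000
    max-retries: 5
    max-delay: 5000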

CDF Chunking

Include the cdf-chunking subsection to set up the rules you can configure to optimize the number of requests against CDF endpoints. This subsection is optional. If you don't enter any values, the extractor uses the default values based on CDF's current limits.

Note

If you increase the default values, requests may fail due to limits in CDF.

  • time-series - Enter the maximum number of time series per get/create time series request. The default value is 1000.
  • data-point-time-series - Enter the maximum number of time series per data point create request. The default value is 10000.
  • data-points - Enter the maximum number of ingested data points within a single request to CDF. The default value is 100000.
  • data-point-delete - Enter the maximum number of ranges per delete data points request. The default value is 1000.
  • data-point-list - Enter the maximum number of time series per data point read request, used when getting the first point in a time series. The default value is 100.
  • data-point-latest - Enter the maximum number of time series per data point read latest request, used when getting the last point in a time series. The default value is 100.
  • raw-rows - Enter the maximum number of rows per request to CDF RAW. This is used with the raw state store. The default value is 10000.
  • events - Enter the maximum number of events per get/create events request. This is used for logging extractor incidents as CDF events. The default value is 1000.
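
A sketch of the subsection with the documented defaults written out; raise these only with CDF's limits in mind:

cognite:
  cdf-chunking:
    time-series: 1000
    data-point-time-series: 10000
    data-points: 100000
    data-point-delete: 1000
    data-point-list: 100
    data-point-latest: 100
    raw-rows: 10000
    events: 1000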

CDF Throttling

Include the cdf-throttling subsection to configure how requests to CDF should be throttled. This subsection is optional. If you don't enter any values, the extractor uses the default values.

  • time-series - Enter the maximum number of parallel requests per time series operation. The default value is 20.
  • data-points - Enter the maximum number of parallel requests per data points operation. The default value is 10.
  • raw - Enter the maximum number of parallel requests per raw operation. The default value is 10.
  • ranges - Enter the maximum number of parallel requests per first/last data point operation. The default value is 20.
  • events - Enter the maximum number of parallel requests per events operation. The default value is 20.
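
A sketch of the subsection with the documented defaults written out explicitly:

cognite:
  cdf-throttling:
    time-series: 20
    data-points: 10
    raw: 10
    ranges: 20
    events: 20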

SDK Logging

Include the sdk-logging subsection to enable or disable log output from the Cognite SDK. The extractor is built using Cognite's .NET SDK. If you enable this logging, the extractor displays any logs produced by the SDK. This configuration is optional. To disable logging from the SDK, exclude this section or set disable to true.

  • disable - Set to true to disable logging from the SDK. The default value is false.
  • level - Enter the minimum level of logging: one of trace, debug, information, warning, error, critical, none. The default value is debug.
  • format - Select the format of the log message. The default value is CDF ({Message}): {HttpMethod} {Url} - {Elapsed} ms.

If this section is configured, the additional logs are mainly details about requests to CDF, using the default format:

CDF ({Message}): {HttpMethod} {Url} - {Elapsed} ms
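
For instance, this sketch enables SDK logging at the default level:

cognite:
  sdk-logging:
    disable: false
    level: debug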

Time series

Include the time-series section for configurations related to the time series ingested by the extractor. This section is optional. If you don't enter any values, the extractor uses the default values.

  • external-id-prefix - Enter the external ID prefix to identify the time series in CDF. Leave empty for no prefix. The external ID in CDF is this optional prefix followed by the PI Point name or PI Point ID. The default value is "".
  • external-id-source - Enter the source of the external ID, either Name or SourceId. By default, the PI Point name is used as the source. If SourceId is selected, the PI Point ID is used instead. The default value is name.
  • data-set-id - Specify the data set to assign to all time series controlled by the extractor, both new time series and current time series sourced by the extractor. If you don't configure this, the extractor doesn't change the current time series' data set. The data set is specified with the internal ID (long). The default value is null/empty.

This is an example time series configuration:

time-series:
  external-id-prefix: 'pi:'
  external-id-source: 'SourceId'
  data-set-id: 1234567890123456

This configures the extractor to create time series in CDF, as shown below. The time series' external ID is created by appending the PI Point ID 12345 to the configured prefix pi::

{
  "externalId": "pi:12345",
  "name": "PI-Point-Name",
  "isString": false,
  "isStep": false,
  "dataSetId": 1234567890123456
}

Events

Include the events section for configurations related to events in CDF. This section is optional. If configured and store-extractor-events-interval is greater than zero, the PI extractor creates events for extractor reconnection and PI data pipe loss incidents. Reconnection and data loss may cause the extractor to miss some historical data point updates. Logging these as events in CDF provides a way of inspecting data quality. When combined with the PI replace utility, you can use these events to correct data inconsistencies in CDF.

  • source - Events have this value as the event source in CDF. The default value is "".
  • external-id-prefix - Insert an external ID prefix to identify the events in CDF. Leave empty for no prefix. The default value is "".
  • data-set-id - Select a data set to assign to all events created by the extractor. We recommend using the same data set ID used in the Time series section. The default value is null/empty.
  • store-extractor-events-interval - Store extractor events in CDF with this interval, given in seconds. If this value is less than zero, no events are created in CDF. The default value is -1.

This is an example events configuration:

events:
  source: 'PiExtractor'
  external-id-prefix: 'pi-events:'
  data-set-id: 1234567890123456
  store-extractor-events-interval: 10

With this configuration, the extractor sends data loss/reconnection events to CDF every 10 seconds. The created events are similar to the one below. The external ID is created by concatenating source + event date and time + event count, and prefixing it with the configured pi-events: prefix. The type and subtype describe the incident and the cause of the event. This example is a data loss incident caused by a PI data pipe overflow:

{
  "externalId": "pi-events:PiExtractor-2020-08-04 01:01:21.395(0)",
  "startTime": 1596502850668,
  "endTime": 1596502880393,
  "type": "DataLoss",
  "subtype": "DataPipeOverflow",
  "source": "PiExtractor"
}

PI

Connect the extractor to a particular PI server or a PI collective. If you configure the extractor with a PI collective, the extractor transparently maintains a connection to one of the active servers in the collective. With the default settings, the extractor authenticates to the PI host through Active Directory, using the service account under which the Windows service runs, provided that account is authorized.

  • host - Enter the hostname or IP address of the PI server, or the PI collective name. If the host is a collective member name, the extractor connects to that member. The default value is null/empty. This is a required value.
  • username - Enter the username to use for authentication. Leave this empty if the Windows service account under which the extractor runs is authorized in Active Directory to access the PI host. The default value is null/empty.
  • password - Enter the password for the user-provided username, if any. The default value is null/empty.
  • native-authentication - Determines whether the extractor uses native PI authentication or Windows authentication. The default value is false, which indicates Windows authentication.
  • parallelism - Insert the number of parallel requests to the PI server. If backfill-parallelism is defined, this excludes backfill requests. The default value is 1.
  • backfill-parallelism - Insert the number of parallel requests to the PI server for backfills. This allows backfills to be throttled separately. The default value is 0.
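
For illustration, a pi section using explicit credentials and separate backfill throttling; the parallelism values are arbitrary examples:

pi:
  host: ${PI_HOST}
  username: ${PI_USER}
  password: ${PI_PASSWORD}
  native-authentication: false
  # Illustrative values: 4 parallel requests, 2 of them reserved for backfill.
  parallelism: 4
  backfill-parallelism: 2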

If you've configured a PI host to use a PI Server or Collective server, the extractor connects to that server and logs the server version and the OSIsoft AF SDK version. If you've configured a host as a collective member, the log output shows whether it's a primary or secondary member in the collective.

Metrics

These are configurations related to sending extractor metrics to Prometheus. This section is optional, but we recommend using a Pushgateway to collect extractor metrics to monitor the extractor's health and performance.

Server

Include the server subsection to define an HTTP server embedded in the extractor from which Prometheus can scrape metrics. This subsection is optional.

  • host - Enter the host to start the embedded metrics server on, for Prometheus to scrape. Example: localhost, without scheme and port. The default value is null/empty.
  • port - Enter the server port. The default value is null/empty.
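
A sketch of an embedded metrics server; the port number is an arbitrary example:

metrics:
  server:
    host: 'localhost'
    port: 9000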

Pushgateways

Include the push-gateways subsection to describe an array of Prometheus pushgateway destinations to which the extractor will push metrics. This subsection is optional.

  • host - Insert the absolute URL of the Pushgateway host. Example: http://localhost:9091. If you're using Cognite's Pushgateway, this is https://prometheus-push.cognite.ai/. The default value is null/empty.
  • job - Enter the job name. If you're using Cognite's Pushgateway, this starts with the username followed by -. For instance, if the username is cog, a valid job name would be cog-pi-extractor. Otherwise, this can be any job name. The default value is null/empty.
  • username - Enter the user name in the Pushgateway. The default value is null/empty.
  • password - Enter the password. The default value is null/empty.

If you configure this section, the extractor pushes metrics that can be displayed, for instance, in Grafana. Create Grafana dashboards with the extractor metrics using Prometheus as the data source.

State storage

Include the state-store section to configure the extractor to save the extraction state periodically, so that extraction resumes faster on the next run. This section is optional. If it isn't present, or if database is set to none, the extraction state is restored by querying the timestamps of the first and the last data points for each time series.

Note

Querying the first and last data points in each time series may take more time than using a state store.

  • database - Select which type of database to use: None, Raw, or LiteDb. Raw saves the extraction state in CDF RAW. LiteDb uses LiteDB to create a local database file with the state. The default value is None.
  • location - Insert the path to the .db file used by the state storage, or the name of the CDF RAW table. The default value is null/empty.
  • interval - Enter the time in seconds between each write to the extraction state store. If zero or less, the state is not saved periodically. The default value is 10.
  • time-series-table-name - Enter the collection name in LiteDb or the table in CDF RAW to store the extracted time series ranges. The default value is ranges.
  • extractor-table-name - Enter the collection name in LiteDb or the table in CDF RAW to store the extractor state: last heartbeat timestamp and last metadata update timestamp. The default value is extractor.

If CDF RAW is used, you can see the extracted states under Manage staged data in CDF.
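
The minimal configuration above uses a local LiteDb file. As a sketch of a RAW-backed alternative, with the documented default table names spelled out (the location name is hypothetical):

state-store:
  database: Raw
  # Hypothetical name in CDF RAW, as described for location above.
  location: 'pi-extractor-state'
  interval: 10
  time-series-table-name: 'ranges'
  extractor-table-name: 'extractor'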

Extractor

The extractor section contains various configurations for the operation of the extractor itself. If you want to extract a subset of the PI time series, specify the list of time series to be extracted in this section. This is how the list is created:

1. If include-tags, include-prefixes, and include-patterns are not empty, start with the union of these three. Otherwise, start with all points.
2. Remove all points as specified by exclude-tags, exclude-prefixes, and exclude-patterns.

  • time-series-update-interval - Set an interval in seconds between each time series update, discovering new time series in PI and synchronizing metadata from PI to CDF. The default value is 86400 (24 hours).
  • include-tags - List the tags to include to retrieve time series with a name matching exactly this tag (case sensitive).
  • include-prefixes - List the prefixes to include to retrieve time series with names starting with this prefix.
  • include-patterns - List the patterns to include to retrieve time series with names containing the pattern.
  • exclude-tags - List the tags to exclude.
  • exclude-prefixes - List the prefixes to exclude.
  • exclude-patterns - List the patterns to exclude.
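
For illustration, an extractor section that starts from the union of the include lists and then removes anything matching an exclude rule; the tag names and patterns are hypothetical:

extractor:
  time-series-update-interval: 86400
  include-tags:
    - 'SINUSOID'
  include-prefixes:
    - 'PUMP_'
  exclude-patterns:
    - 'TEST'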

Delete time series

Include the deleted-time-series subsection to configure how the extractor handles time series that exist in CDF but not in PI. This subsection is optional, and the default behavior is none (do nothing).

Note

This only affects time series with the same data set ID and external ID prefix as the time series configured in the extractor.

To find time series that exist in CDF but not in PI, the extractor:

  • Lists all time series in CDF that have the configured external ID prefix and data set ID.
  • Filters the time series using the include/exclude rules defined in the extractor section.
  • Matches the result against the time series obtained from the PI server after filtering these using the include/exclude rules.

  • behavior - Select the action taken by the extractor: none, flag, or delete. Setting this to flag performs a soft deletion: the time series are flagged as deleted but not removed from the destination. Setting it to delete deletes the time series from CDF. The default value is none, which indicates ignore. If you set this to delete, the time series in CDF that can't be found in PI are permanently deleted from CDF.
  • flag-name - If you've set behavior to flag, use this string as the flag name to mark the time series as deleted. In CDF, metadata is added to the time series with this as the key and the current time as the value. The default value is deletedByExtractor.

Warning

If you don't configure an external-id-prefix, the extractor may find time series in CDF with no external ID. Even if these time series belong to the same data set, they can't have been created by the extractor, as the extractor always assigns an external ID when creating new time series. In this case, the time series won't be deleted.
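
A sketch of a soft-delete setup; nesting under extractor is an assumption based on this being described as a subsection of it:

extractor:
  deleted-time-series:
    # Flag matching time series in CDF instead of deleting them.
    behavior: flag
    flag-name: deletedByExtractor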

Backfill

Include the backfill section to configure how the extractor fills in historical data backward in time with respect to the first data point in CDF. The backfill process completes when all the data points in the PI Data Archive have been sent to CDF, or, if the to parameter is set, when the extractor reaches the target timestamp for all time series.

  • skip - Set to true to disable backfill. The default value is false.
  • step-size-hours - The step, in whole number of hours. Set to 0 to freely backfill all time series. Each backfill iteration backfills all time series to the next step before stepping further backward. The default value is 168.
  • to - The target CDF timestamp at which to stop the backfill. The backfill may overshoot this timestamp; the overshoot is smaller with a smaller data point chunk size. The default value is 0 (backfill all time series to epoch).
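
For illustration, a backfill that stops at a target timestamp; the to value is an arbitrary example, assuming the usual CDF convention of milliseconds since epoch (as in the event timestamps above):

backfill:
  skip: false
  step-size-hours: 168
  # Hypothetical target: 2020-01-01T00:00:00Z, in ms since epoch.
  to: 1577836800000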

Frontfill

Include the frontfill section to configure how the extractor fills in historical data forward in time with respect to the last data point in CDF. At startup, the extractor fills the gap between the last data point in CDF and the last data point in PI by querying the archived data in the PI Data Archive. After that, the extractor only receives data streamed through the PI Data Pipe. These are real-time changes made to the time series in PI before archiving.

Note

When data points are archived in PI, they may be subject to compression, reducing the total number of data points in a time series. Therefore, the backfill and frontfill tasks receive data points after compression, while the streaming task receives data points before compression. Learn more about compression in this video.

  • skip - Set to true to disable frontfill and streaming. The default value is false.
  • streaming-interval - Insert the interval in seconds between each call to the PI Data Pipe to fetch new data. The default value is 1. If you set this parameter to a high value, there is a higher chance of a client outbound queue overflow in the PI server. Overflow may cause loss of data points in CDF.
  • delete-data-points - If set to true, the corresponding data points are deleted in CDF when data point deletion events are received in the PI Data Pipe. The default value is false. Enabling this parameter may increase the streaming latency of the extractor, since the extractor verifies the data point deletion before proceeding with the insertions.
  • use-data-pipe - Older PI servers may not support data pipes. If that's the case, set this value to false to disable data pipe streaming. The frontfill task will then run continuously, frequently querying the PI Data Archive for new data points. The default value is true.
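
A sketch of the section with the documented defaults written out, which polls the PI Data Pipe every second:

frontfill:
  skip: false
  streaming-interval: 1
  delete-data-points: false
  use-data-pipe: true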

Logger

Include the logger section to set up logging to a console and files.

File

Include the file subsection to enable logging to a file. This subsection is optional. If level is invalid, logging to file is disabled.

  • level - Select the verbosity level for file logging. Valid options, in decreasing level of verbosity, are verbose, debug, information, warning, error, fatal. The default value is null/empty.
  • path - Insert the path relative to the install directory, or to the current working directory when using multiple services. Example: logs/log.txt creates log files in the logs folder with log as the prefix, a date as the suffix, and txt as the extension. The default value is null/empty.
  • retention-limit - Specify the maximum number of log files retained in the log folder. The default value is 31.
  • rolling-interval - Specify a rolling interval for log files: either day or hour. The default value is day.

Console

Include the console subsection to enable logging events to standard output, such as a terminal window. This subsection is optional. If level is invalid, console logging is disabled.

If the extractor is run as a service, logging to a console will have no effect.

  • level - Select the verbosity level for console logging. Valid options, in decreasing level of verbosity, are verbose, debug, information, warning, error, fatal. The default value is null/empty.
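
For example, a logger section that writes information-level messages to the console and debug-level messages to daily log files; the values are illustrative:

logger:
  console:
    level: 'information'
  file:
    level: 'debug'
    path: 'logs/log.txt'
    retention-limit: 31
    rolling-interval: 'day'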