All notable changes to this project will be documented in this file.
The format is based on Keep a Changelog, and this project adheres to Semantic Versioning.
Changes are grouped as follows:
- Added for new features.
- Changed for changes in existing functionality.
- Deprecated for soon-to-be removed features.
- Removed for now removed features.
- Fixed for any bug fixes.
- Security in case of vulnerabilities.
- Loosened dependency restrictions on `0.*` packages
- Added a `ssl_verify` argument to file uploaders to either use custom CA bundles or to disable SSL verification completely.
- Added a feature to report file upload errors
- Improved logging when uploading files
- Handling of files with size 0
- Fix issue caused by attempting to update file mimeType on AWS clusters.
- Revert change in 7.5.2, the root cause is a different issue.
- Avoid trying to update mime type when updating empty files.
- The file upload queue will now use `file_transfer_timeout` from the passed `CogniteClient`.
- File processing utils
- `CastableInt` class that represents an integer to be used in config schema definitions. The difference from using `int` is that a field of this type in the yaml file can be either a string or a number, while a field of type `int` must be a number in yaml.
- `PortNumber` class that represents a valid port number to be used in config schema definitions. Just like `CastableInt`, it can be a string or a number in the yaml file. This allows, for example, setting a port number using an environment variable.
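
The string-or-number behaviour described for `CastableInt` and `PortNumber` can be illustrated with a minimal sketch. The class names `CastableIntSketch` and `PortNumberSketch` and the accepted port range of 0 to 65535 are assumptions made for this illustration; this is not the library's implementation.

```python
class CastableIntSketch(int):
    """Hypothetical stand-in: an int that also accepts numeric strings."""

    def __new__(cls, value):
        # Reject bools and other types so only strings and integers are accepted.
        if isinstance(value, bool) or not isinstance(value, (str, int)):
            raise ValueError(f"Expected a string or an integer, got {type(value).__name__}")
        # int("8080") -> 8080; a non-numeric string raises ValueError here.
        return super().__new__(cls, int(value))


class PortNumberSketch(CastableIntSketch):
    """Hypothetical stand-in: a castable int restricted to an assumed port range."""

    def __new__(cls, value):
        port = super().__new__(cls, value)
        if not 0 <= port <= 65535:
            raise ValueError(f"Port number must be between 0 and 65535, got {int(port)}")
        return port
```

With this sketch, a value read from an environment variable (always a string) and a literal number in the yaml file both end up as the same integer.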
- Fix file upload when private link is used
- Configuration for ignore regexp pattern
- Fix file metadata update
- Remove warnings from data models file uploads
- Updated cognite SDK version.
- Regression: Reverting change related to file_meta parameter in IOUploadQueue
- Added support for AWS file upload
- Updated cognite sdk version
- Upload to Core DM/Classic file.
- Use httpx to upload files to CDF instead of the python SDK. May improve performance on windows.
- Add additional validation to cognite config before creating a cognite client, to provide better error messages when configuration is obviously wrong.
- Produce a config error when missing token-url and tenant, instead of eventually producing an `OAuth 2 MUST utilize https` error when getting the token.
- Reformat log messages to not have newlines
- Fixed using the keyvault tag in remote config.
- Fixed an issue with the `retry` decorator where functions would not be called at all if the cancellation token was set. This resulted in errors with, for example, upload queues.
- An upload queue for data model instances.
- A new type of state store that stores hashes of ingested items. This can be used to detect changed RAW rows or data model instances.
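
The idea behind a hash-based state store can be sketched as follows. `HashStateSketch` and its method names are hypothetical and chosen only for this illustration; the library's actual class and API may differ.

```python
import hashlib
import json


class HashStateSketch:
    """Hypothetical in-memory store that remembers a hash per ingested item."""

    def __init__(self):
        self._hashes = {}

    @staticmethod
    def _digest(item):
        # Serialize with sorted keys so equal dicts always hash to the same value.
        canonical = json.dumps(item, sort_keys=True, default=str)
        return hashlib.sha256(canonical.encode("utf-8")).hexdigest()

    def has_changed(self, external_id, item):
        # True for unseen items and for items whose content differs from last time.
        return self._hashes.get(external_id) != self._digest(item)

    def set_state(self, external_id, item):
        self._hashes[external_id] = self._digest(item)
```

An extractor could call `has_changed` before uploading a RAW row or data model instance and skip the upload when nothing has changed.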
- Update cognite-sdk version to 7.43.3
- Fixed an issue preventing retries in file uploads from working properly
- File external ID when logging failed file uploads
- Fixed a race condition in state stores and uploaders where a shutdown could result in corrupted state stores.
- Update type hints for the time series upload queue to allow status codes
- `cognite_exceptions()` did not properly retry file uploads
- Enhancement of `7.0.5`: more use cases covered (to avoid repeatedly fetching a new token).
- When using remote config, the full local `idp-authentication` is now injected (some fields were missing).
- The file upload queue is now able to stream files larger than 5GiB.
- The background thread `ConfigReloader` now caches the `CogniteClient` to avoid repeatedly fetching a new token.
- Max parallelism in the file upload queue can now properly be set to values larger than the `max_workers` in the `ClientConfig` object.
- Storing states with the state store will lock the state store. This fixes an issue where iterating through a changing dict could cause issues.
- Fix file size upper limit.
- Support for files without content.
- Ensure that `CancellationToken.wait(timeout)` only waits for at most `timeout`, even if it is notified in that time.
- The file upload queues have changed behaviour.
  - Instead of waiting to upload until a set of conditions is met, it starts uploading immediately.
  - The `upload()` method now acts more like a `join`, waiting for all the uploads in the queue to complete before returning.
  - A call to `add_to_upload_queue` when the queue is full will hang until the queue is no longer full before returning, instead of triggering an upload and hanging until everything is uploaded.
  - The queues now require to be set up with a max size. The max upload latency is removed.

  As long as you use the queue as a context (i.e., using `with FileUploadQueue(...) as queue:`) you should not have to change anything in your code. The behaviour of the queue will change, and it will most likely be much faster, but it will not require any changes from you as a user of the queue.
- `threading.Event` has been replaced globally with `CancellationToken`. The interfaces are mostly compatible, though `CancellationToken` does not have a `clear` method. The compatibility layer is deprecated.
  - Replace calls to `is_set` with the property `is_cancelled`.
  - Replace calls to `set` with the method `cancel`.
  - All methods which took `threading.Event` now take `CancellationToken`. You can use `create_child_token` to create a token that can be canceled without affecting its parent token; this is useful for creating stoppable sub-modules that are stopped if a parent module is stopped. Notably, calling `stop` on an upload queue no longer stops the parent extractor; this was never intended behavior.
- The deprecated `middleware` module has been removed.
- `set_event_on_interrupt` has been replaced with `CancellationToken.cancel_on_interrupt`.
- You can now use `Path` as a type in your config files.
- `CancellationToken` as a better abstraction for cancellation than `threading.Event`.
To migrate from version `6.*` to `7`, you need to update how you interact with cancellation tokens. The type has now changed from `Event` to `CancellationToken`, so make sure to update all of your type hints etc. There is a compatibility layer for the `CancellationToken` class, so that it has the same methods as an `Event` (except for `clear()`), which means it should act as a drop-in replacement for now. This compatibility layer is deprecated, and will be removed in version `8`.
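
The parent and child token semantics can be illustrated with a small self-contained sketch. `CancellationTokenSketch` is a hypothetical class written for this example; it mirrors the `is_cancelled`, `cancel` and `create_child_token` names mentioned in this changelog but is not the library's implementation.

```python
import threading


class CancellationTokenSketch:
    """Hypothetical token: cancelling a parent cancels its children, not vice versa."""

    def __init__(self):
        self._event = threading.Event()
        self._children = []
        self._lock = threading.Lock()

    @property
    def is_cancelled(self):
        return self._event.is_set()

    def cancel(self):
        self._event.set()
        with self._lock:
            children = list(self._children)
        # Propagate downwards only; the parent is never touched.
        for child in children:
            child.cancel()

    def create_child_token(self):
        child = CancellationTokenSketch()
        with self._lock:
            self._children.append(child)
        # A child of an already cancelled parent starts out cancelled.
        if self.is_cancelled:
            child.cancel()
        return child

    def wait(self, timeout=None):
        # Returns True if the token was cancelled within the timeout.
        return self._event.wait(timeout)
```

In this sketch, stopping a sub-module cancels only its own child token, while cancelling the extractor-level token stops every sub-module.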
If you are using file upload queues, read the entry in the Changed section. You will most likely not need to change your code, but how the queue behaves has changed for this version.
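
The changed queue behaviour (start uploading immediately, block when the queue is full, and let `upload()` act like a join) can be approximated with the standard library. `BoundedUploadQueueSketch`, its constructor parameters and the `upload_fn` callback are assumptions made for this illustration, not the library's API.

```python
import threading
from concurrent.futures import ThreadPoolExecutor, wait


class BoundedUploadQueueSketch:
    """Hypothetical queue that uploads eagerly and blocks producers when full."""

    def __init__(self, upload_fn, max_queue_size, parallelism=4):
        self._upload_fn = upload_fn
        self._slots = threading.BoundedSemaphore(max_queue_size)
        self._pool = ThreadPoolExecutor(max_workers=parallelism)
        self._futures = []
        self._lock = threading.Lock()

    def _run(self, item):
        try:
            self._upload_fn(item)
        finally:
            # Free the slot so a blocked producer can continue.
            self._slots.release()

    def add_to_upload_queue(self, item):
        # Blocks while max_queue_size items are queued or in flight.
        self._slots.acquire()
        try:
            future = self._pool.submit(self._run, item)
        except BaseException:
            self._slots.release()
            raise
        with self._lock:
            self._futures.append(future)

    def upload(self):
        # Acts like a join: wait for everything submitted so far.
        with self._lock:
            pending, self._futures = self._futures, []
        wait(pending)
        for future in pending:
            future.result()  # re-raise any upload error

    def __enter__(self):
        return self

    def __exit__(self, exc_type, exc_value, traceback):
        self.upload()
        self._pool.shutdown(wait=True)
```

Used as a context manager, the sketch uploads items as they are added and only waits for completion when the `with` block exits, which matches the usage pattern described above.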
- File upload queues now reuse a single thread pool across runs instead of creating a new one each time `upload()` is called.
- Option to specify retry exceptions as a dictionary instead of a tuple. Values should be a callable determining whether a specific exception object should be retried or not. Example:

      from cognite.extractorutils.util import retry

      @retry(
          exceptions={ValueError: lambda x: "Invalid" not in str(x)}
      )
      def func() -> None:
          value = some_function()

          if value is None:
              raise ValueError("Could not retrieve value")

          if not_valid(value):
              raise ValueError(f"Invalid value: {value}")
- Templates for common retry scenarios. For example, if you're using the `requests` library, you can do `retry(exceptions = request_exceptions())`
- Default parameters in `retry` have changed to be less aggressive. Retries will apply backoff by default, and give up after 10 retries.
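
To make the dictionary form and the backoff behaviour concrete, here is a minimal self-contained sketch of such a decorator. `retry_sketch` and its parameters (`max_retries`, `delay`, `backoff`, `max_delay`, `sleep`) are hypothetical; only the limit of 10 retries comes from the entry above, and the other defaults are assumptions rather than the library's values.

```python
import time
from functools import wraps


def retry_sketch(exceptions, max_retries=10, delay=0.5, backoff=2.0, max_delay=60.0, sleep=time.sleep):
    """Hypothetical retry decorator; `exceptions` maps exception types to predicates."""

    def decorator(func):
        @wraps(func)
        def wrapper(*args, **kwargs):
            wait_time = delay
            for attempt in range(max_retries + 1):
                try:
                    return func(*args, **kwargs)
                except tuple(exceptions) as exc:
                    # Find the predicate registered for this exception's type.
                    should_retry = next(
                        check for exc_type, check in exceptions.items() if isinstance(exc, exc_type)
                    )
                    if attempt == max_retries or not should_retry(exc):
                        raise
                    sleep(wait_time)
                    # Exponential backoff, capped at max_delay.
                    wait_time = min(wait_time * backoff, max_delay)

        return wrapper

    return decorator
```

The predicate lets the caller retry transient failures while failing fast on errors that another attempt cannot fix, such as invalid values.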
- Aliases for keyvault config to align with dotnet utils
- Improved the state store retry behavior to handle both fundamental and wrapped network connection errors.
- Added support to retrieve secrets from Azure Keyvault.
- Added an optional `security-categories` attribute to the `cognite` config section.
- Fixed a type hint in the `post_upload_function` for upload queues.
- Added `IOFileUploadQueue` as a base class of both `FileUploadQueue` and `BytesUploadQueue`. This is an upload queue for functions that produce `BinaryIO` to CDF Files.
- Correctly handle equality comparison of `TimeIntervalConfig` objects.
- Added ability to specify dataset under which metrics timeseries are created
- Improved the state store retry behavior to handle connection errors
- Fixed iter method on the state store to return an iterator
- `cognite-sdk` to `v7`
- Added iter method on the state store to return the keys of the local state dict
- Added `load_yaml_dict` to `configtools.loaders`.
- Fixed getting the config `type` when `!env` was used in the config file.
- Added len method on the state store to return the length of the local state dict
- Fix on find_dotenv call
- Update cognite-sdk version to 6.24.0
- Fixed the type hint for the `retry` decorator. The list of exception types must be given as a tuple, not an arbitrary iterable.
- Fixed retries for sequence upload queue.
- Sequence upload queue reported number of distinct sequences it had rows for, not the number of rows. That is now changed to number of rows.
- When the sequence upload queue uploaded, it always reported 0 rows uploaded because of a bug in the logging.
- Latency metrics for upload queues.
- Added support for queuing assets upload
- Timestamps before 1970 are no longer filtered out, to align with changes to the timeseries API.
- The event upload queue now upserts events. If creating an event fails due to the event already existing, it will be updated instead.
- Support for `connection` parameters
- Upload queue size limit now triggers an upload when the size has reached the limit, not when it exceeded the limit.
- Legacy authentication through API keys has been removed throughout the code base.
- A few deprecated modules (`authentication`, `prometheus_logging`) have been deleted.
- `uploader` and `configtools` have been changed from one module to a package of multiple modules. The content has been re-exported to preserve compatibility, so you can still do

      from cognite.extractorutils.configtools import load_yaml, TimeIntervalConfig
      from cognite.extractorutils.uploader import TimeSeriesUploadQueue

  But now, you can also import from the submodules directly:

      from cognite.extractorutils.configtools.elements import TimeIntervalConfig
      from cognite.extractorutils.configtools.loaders import load_yaml
      from cognite.extractorutils.uploader.time_series import TimeSeriesUploadQueue

  This has first and foremost been done to improve the codebase and make it easier to continue to develop.
- Updated the version of the Cognite SDK to version 6. Refer to the changelog and migration guide for the SDK for details on the changes it entails for users.
- Several small single-function modules have been removed and their content has been moved to the catch-all `util` module. This includes:
  - The `add_extraction_pipeline` decorator from the `extraction_pipelines` module
  - The `throttled_loop` generator from the `throttle` module
  - The `retry` decorator from the `retry` module
- Support for `audience` parameter in `idp-authentication`
The deletion of API keys and the legacy OAuth2 implementation should not affect your extractors or your usage of the utils unless you were depending on the old OAuth implementation directly and not through `configtools` or the base classes.
To update to version 5 of extractor-utils, you need to

- Change where you import a few things.
  - Change from `from cognite.extractorutils.extraction_pipelines import add_extraction_pipeline` to `from cognite.extractorutils.util import add_extraction_pipeline`
  - Change from `from cognite.extractorutils.throttle import throttled_loop` to `from cognite.extractorutils.util import throttled_loop`
  - Change from `from cognite.extractorutils.retry import retry` to `from cognite.extractorutils.util import retry`
- Consult the migration guide for the Cognite SDK version 6 for details on the changes it entails for users.
The changes in this version are only breaking for your usage of the utils. Any extractor you have written will not be affected by the changes, meaning you do not need to bump the major version for your extractors.
- Default size of RAW queues is now 50 000 for `UploaderExtractor`s as well.
- `FileSizeConfig` class, similar to `TimeIntervalConfig`, that allows human readable formats such as `1 kb` or `3.7 mib`, as well as bytes directly. It then computes properties for bytes, kilobytes, etc.
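
A rough sketch of how such human readable sizes could be parsed is shown below. `FileSizeSketch`, the unit table, and the choice of 1000-based `kb` versus 1024-based `kib` are assumptions for this illustration and may not match how `FileSizeConfig` interprets units.

```python
import re

# Assumed unit table: decimal units are powers of 1000, binary units powers of 1024.
_UNITS = {
    "b": 1,
    "kb": 1000, "mb": 1000**2, "gb": 1000**3, "tb": 1000**4,
    "kib": 1024, "mib": 1024**2, "gib": 1024**3, "tib": 1024**4,
}


class FileSizeSketch:
    """Hypothetical parser for sizes such as "1 kb", "3.7 mib" or a plain byte count."""

    def __init__(self, expression):
        self.bytes = self._parse(expression)

    @staticmethod
    def _parse(expression):
        if isinstance(expression, int):
            return expression
        match = re.fullmatch(r"\s*(\d+(?:\.\d+)?)\s*([a-zA-Z]*)\s*", str(expression))
        if match is None:
            raise ValueError(f"Invalid file size: {expression!r}")
        number, unit = match.groups()
        unit = unit.lower() or "b"  # no unit means bytes
        if unit not in _UNITS:
            raise ValueError(f"Unknown unit: {unit!r}")
        return int(float(number) * _UNITS[unit])

    @property
    def kilobytes(self):
        return self.bytes / 1000

    @property
    def kibibytes(self):
        return self.bytes / 1024
```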
- Fixed a bug in the state store when decimal valued incremental fields are used.
- Add support for certificate authentication
- Change minimum cognite-sdk version to 5.8
- Update cognite-sdk version to 5.1.1
- Fixed a bug in the error handling for reporting heartbeats.
- Allow indexing state stores. You can now use the indexing notation to access values in a state store like you would do e.g. in a dictionary. Examples:

      states = LocalStateStore(...)

      # You can now set states like so:
      states["id"] = (None, 5)

      # Getting current states:
      low, high = states["another_id"]

      # You can also check if an entry has a stored state with the 'in' operator:
      if "new_id" not in states:
          # do something you only do the first time you process an item
          ...
- Fixed a typo throughout the library, `cancelation_token` is now called `cancellation_token` everywhere. If you ever e.g. specify a cancellation token in a keyword argument, make sure to update the spelling.
- Allow any interval to be configured as a string in addition to integers, so e.g. upload intervals can be configured as `upload-interval: 2m`, with `s`/`m`/`h`/`d` being valid units, and `s` being implied when units are missing (to preserve backwards compatibility with old config files). If your extractor reads from these config fields you need to update to read the `seconds` (or the computed `minutes`, `hours` or `days`) attribute instead of the fields directly, for example:

      RawUploadQueue(
          cognite_client,
          max_upload_interval=config.upload_interval,
      )

  must be changed to

      RawUploadQueue(
          cognite_client,
          max_upload_interval=config.upload_interval.seconds,
      )
In order to update from version 3 to version 4, you need to:

- Change `cancelation_token` to `cancellation_token` every time you pass one as a keyword argument (or read from an attribute)
- Access the `seconds` attribute from any time values you read from the default config (such as `upload_interval`s or `timeout`s)
- Consider updating any time value you have in your own config parameters to be of `TimeIntervalConfig` instead of `int` to allow users the option of configuring time values in a human-readable format.
A type checker, like mypy, will be able to detect any breaking changes this update introduces. We highly recommend scanning your code with a type checker.
Updating from version 3 to 4 should not introduce breaking changes to your extractor.
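
The interval format described for version 4 (an integer with an optional `s`/`m`/`h`/`d` suffix, where seconds are implied when the unit is missing) can be sketched as below. `TimeIntervalSketch` is a hypothetical class written for this example and is not the library's `TimeIntervalConfig`.

```python
import re

_SECONDS_PER_UNIT = {"s": 1, "m": 60, "h": 3600, "d": 86400}


class TimeIntervalSketch:
    """Hypothetical parser for intervals such as "2m", "1h" or a plain 30."""

    def __init__(self, expression):
        match = re.fullmatch(r"\s*(\d+)\s*([smhd]?)\s*", str(expression).lower())
        if match is None:
            raise ValueError(f"Invalid time interval: {expression!r}")
        value, unit = match.groups()
        # A missing unit means seconds, which keeps old integer configs working.
        self.seconds = int(value) * _SECONDS_PER_UNIT[unit or "s"]

    @property
    def minutes(self):
        return self.seconds / 60

    @property
    def hours(self):
        return self.seconds / 3600

    @property
    def days(self):
        return self.seconds / 86400
```

Code that previously passed the raw integer along would, under this sketch, pass the `seconds` attribute instead.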
- Decorator which adds extraction pipeline functionality
- Correctly catch errors when reloading remote configuration files
- Support running extractor utils inside Cognite Functions
- Support for exposing prometheus metrics on a local port instead of pushing to pushgateway
- Remove old version warning from the cognite SDK.
- Correctly do not request the experimental SDK when using remote configuration files
- `SequenceUploadQueue`'s `create_missing` functionality can now be used to set Name and Description values on newly created sequences.
- Remote configuration files are now fully released and supported without using the experimental SDK.
- Dataset and linked asset information are correctly set on created sequences.
- Type hint for sequence column definitions updated to be more consistent.
- Option to set data set id when creating missing time series in the time series upload queue.
- Update cognite-sdk to version 4.0.1, which removes the support for reserved environment variables such as `COGNITE_API_KEY` and `COGNITE_CLIENT_ID`.
- Python 3.7 support
Preview release of remote configuration of extractors. This allows users of your extractors to configure their extractors from CDF.
In this section we will go through how you can start using remote configuration in your extractors.
If you have based your extractor on the `Extractor` base class, remote configuration is already implemented for your extractor without any need for further changes to your code.
To include the feature, you must update `cognite-extractor-utils` to `2.3.0-beta1`, and add a dependency to `cognite-sdk-experimental` version `>=0.76.0`.
Otherwise, you must use the new `ConfigResolver` class in the `configtools` module:

    from cognite.extractorutils.configtools import ConfigResolver

    # With automatic CLI (i.e., read config from command line args):
    resolver = ConfigResolver.from_cli(
        name="my extractor",
        description="Short description of my extractor",
        version="1.2.3",
        config_type=MyConfig,
    )
    config: MyConfig = resolver.config

    # With path to a yaml file:
    resolver = ConfigResolver(
        config_path="/path/to/config.yaml",
        config_type=MyConfig,
    )
    config: MyConfig = resolver.config

The resolver will automatically fetch configuration from CDF if remote configuration is used, otherwise it will return the same as `load_yaml`.
When using the base class, you have the option to automatically detect new config revisions, and do one of several predefined actions (keep in mind that this is not exclusive to remote configs; if the extractor is running with a local configuration that changes, it will do the same action). You specify which with a `reload_config_action` enum. The enum can be one of the following values:

- `DO_NOTHING` which is the default
- `REPLACE_ATTRIBUTE` which will replace the `config` attribute on the object (keep in mind that if you are using the `run_handle` instead of subclassing, this will have no effect). Also be aware that anything that is set up based on the config (upload queues, cognite client objects, loggers, connections to source, etc) will not change in this case.
- `SHUTDOWN` will set the `cancelation_token` event, and wait for the extractor to shut down. It is then intended that the service layer running the extractor (i.e., windows services, systemd, docker, etc) will be configured to always restart the service if it shuts down. This is the recommended approach for reloading configs, as it is always guaranteed that everything will be re-initialized according to the new configuration.
- `CALLBACK` is similar to `REPLACE_ATTRIBUTE` with one difference. After replacing the `config` attribute on the extractor object, it will call the `reload_config_callback` method, which you will have to override in your subclass. This method should then do any necessary cleanup or re-initialization needed for your particular extractor.

To enable detection of config changes, set the `reload_config_action` argument to the `Extractor` constructor to your chosen action:

    # Using run handle:
    with Extractor(
        name="my_extractor",
        description="Short description of my extractor",
        config_class=MyConfig,
        version="1.2.3",
        run_handle=run_extractor,
        reload_config_action=ReloadConfigAction.SHUTDOWN,
    ) as extractor:
        extractor.run()

    # Using subclass:
    class MyExtractor(Extractor):
        def __init__(self):
            super().__init__(
                name="my_extractor",
                description="Short description of my extractor",
                config_class=MyConfig,
                version="1.2.3",
                reload_config_action=ReloadConfigAction.SHUTDOWN,
            )

The extractor will then periodically check if the config file has changed. The default interval is 5 minutes; you can change this by setting the `reload_config_interval` attribute. As with any other interval in extractor-utils, the unit is seconds.
When using remote configuration, you will still need to configure the extractor with some minimal parameters, namely a CDF project, credentials for that project, and an extraction pipeline ID to fetch configs from. There is also a new global config field named `type` which is either `remote` or `local` (which is the default).
An example for this minimal config follows:

    type: remote

    cognite:
        # Read these from environment variables
        host: ${COGNITE_BASE_URL}
        project: ${COGNITE_PROJECT}

        idp-authentication:
            token-url: ${COGNITE_TOKEN_URL}
            client-id: ${COGNITE_CLIENT_ID}
            secret: ${COGNITE_CLIENT_SECRET}
            scopes:
                - ${COGNITE_BASE_URL}/.default

        extraction-pipeline:
            external-id: my-extraction-pipeline

The config file stored in CDF should omit all of these fields, as they will be overwritten by the `ConfigResolver` to be the values given here. In other words, for most extractor deployments, you should be able to leave out the `cognite` field in the config stored in CDF.
- `.env` files will now be loaded if present at runtime
- Check that a configured extraction pipeline actually exists, and report an appropriate error if not.
- A few type hints in retry module were more restrictive than needed (such as requiring `int` when `float` would work).
- Gracefully handle wrongful data in state stores. If JSON parsing fails, use an empty state store as default.
- Exception messages for `InvalidConfigError`s have been improved, and when using the extractor base class it will print them in a formatted way instead of dumping a stack trace on invalid configs.
- Use Optional with defaults in code instead of dataclass defaults in `UploaderExtractorConfig`, as this allows non-default config sections in subclasses.
- Include defaults for queue sizes in `UploaderExtractor`
- Allow wider ranges for certain dependencies
  - Allow `>=3.7.4, <5` for typing-extensions
  - Allow `>=5.3.0, <7` for PyYAML
- `uploader_extractor` and `uploader_types` modules used to create extractors writing to events, timeseries, or raw, by calling a common method. This is primarily used for the utils extensions, but can be useful for very simple extractors in general.
- Fixed a signature bug in `SequenceUploadQueue`'s `__enter__` method preventing it from being used as a context.
- Fixed an issue where the base class would not always load a default `LocalStateStore` if requested
- A `get_current_state_store` class method on the Extractor class which returns the most recent state store loaded
- Fixed retries to not block the GIL and respect the cancellation token
- A `get_current_config` class method on the Extractor class which returns the most recent config file read
- Option to not handle `SIGINT`s gracefully
- Configurable heartbeat interval
- Use `cognite-sdk-core` as base instead of `cognite-sdk`
- Update Arrow to `>1.0.0`
To update your projects to 2.0.0 there are two things to consider:

- If you are specifying a version of the Cognite SDK to use in your dependency list, you must now specify `cognite-sdk-core` instead. Otherwise you might install both versions. E.g. if your `pyproject.toml` looks like this:

      cognite-sdk = "^2.32.0"
      cognite-extractor-utils = "^1.3.3"

  You must also change the `cognite-sdk` dependency when updating `extractor-utils`:

      cognite-sdk-core = "^2.32.0"
      cognite-extractor-utils = "^2.0.0"

- If your extractor is using the included `Arrow` package, there are a few breaking changes (most notably that the `timestamp` attribute is renamed to `int_timestamp`). Consult their migration guide to make sure you are in compliance.

- Allow environment variable substitution in bool type config fields without breaking generic environment variable substitution
- Reverts 1.6.0 as it broke generic environment variable substitution
- Allow environment variable substitution in bool type config fields
- Make Cognite SDK timeout configurable
- Add a base class for extractors to remove a lot of boilerplate code necessary for startup/shutdown, initialization etc.
- Allow using `Enum`s in config classes
- An `ensure_assets` function similar to `ensure_timeseries`
- A `BytesUploadQueue` that takes byte arrays directly instead of file paths
- A `throttled_loop` generator that iterates at most every `n` seconds
- An `EitherIdConfig` to configure e.g. data sets with either an id or external id in the same field.
- Never call `/login/status` as the endpoint is deprecated
- Include missing classes and modules in docs
- Using `dataset-id` or `dataset-external-id` fields, use the new `EitherIdConfig` instead.
- Add a base class for extractors to remove a lot of boilerplate code necessary for startup/shutdown, initialization etc.
- Fix an issue in `SequenceUploadQueue.add_to_upload_queue` when adding multiple rows to the same id
- Changed bucket sizes for observed times in uploader metrics to be more suited for expected values.
- Option to provide additional custom args to token fetching (via the `token_custom_args` arg to the `CogniteClient` constructor)
- Use token fetching from Cognite SDK instead of our own implementation
- `Authenticator` class
- Add config parameter to enable metrics of log messages. It reveals how many logging events happened per logger and log-level.
- Reduce cardinality (by reducing label count) on autogenerated metrics. E.g. don't label `TIMESERIES_UPLOADER_POINTS_WRITTEN` with which time series it's writing to.
- Add option to specify external IDs for data sets
- Fix labels for generated Prometheus
- Add a cancellation token to all data uploaders and metrics pushers
- Add various automatic Prometheus metrics for data uploaders
- Fix URL creation in OAuth flow
- Fix a pool issue in FileUploadQueue
- General OIDC token support
- Add option to authenticate to CDF with AAD
- Add upload queues for sequences and files
- TimeSeriesUploadQueue can now auto-create string time series
- Add option to create missing time series from upload queue
- Fixed an issue in EitherId where the `repr` method didn't return a string (as would be expected).
- Add py.typed file so mypy knows that package is typed
- Don't require auth for prometheus push gateways as push gateways can be configured to allow unauthorized access.
- Fixed a bug where the state store config would not allow raw state stores without explicit `null` on local
- An outside_state method to test if a new proposed state in state stores is covered or not
- A general SIGINT handler
- Several minor additions to configtools: Defaults in StateStoreConfig, optional dataset ID in CogniteConfig, option to have version as int or None
- Add a metrics factory that caches instances
- A concurrency issue with TimeSeriesUploadQueue where uploads could fail if points were added at the very start of the upload call
- Fix documentation build
Release the first stable version. Open source the library under the Apache 2.0 license