Add documentation for Settings and Constants management #1521

Open: wants to merge 4 commits into base: main
276 changes: 276 additions & 0 deletions docs/docs/settings.rst
Original file line number Diff line number Diff line change
@@ -0,0 +1,276 @@
.. _settings:

=====================================
Library Settings and Constants
Review comment (Member):
Will any user need to access constants? Judging from the list, they seem to be internal only.
These could be part of the code documentation. Putting them here complicates the explanation for most users.

=====================================

This guide explains the rationale behind the :class:`Settings <settings_utils.Settings>` and :class:`Constants <settings_utils.Constants>` system, how to extend and configure them, and how to use them effectively in your application.

All the settings can be easily accessed with:

.. code-block:: python

    import unitxt

    print(unitxt.settings.default_verbosity)  # Output: "info"

All the settings can be easily modified with:

.. code-block:: python

    import unitxt

    unitxt.settings.default_verbosity = "debug"

Or through environment variables:

.. code-block:: bash

    export UNITXT_DEFAULT_VERBOSITY="debug"
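
As a rough illustration of the naming convention behind the environment override above (the ``UNITXT_`` prefix followed by the upper-cased setting name), the sketch below derives the variable name and looks up an override. The helpers ``env_var_name`` and ``read_override`` are hypothetical and are not part of unitxt; the library's own lookup may differ, for example in how it converts strings to typed values.

```python
import os

ENV_PREFIX = "UNITXT_"  # prefix used by the environment variable shown above


def env_var_name(setting_name: str) -> str:
    """Return the environment variable name for a setting (assumed convention)."""
    return ENV_PREFIX + setting_name.upper()


def read_override(setting_name: str, default, environ=os.environ):
    """Return the raw environment value if present, else the default (hypothetical helper)."""
    return environ.get(env_var_name(setting_name), default)


# A plain dict stands in for the process environment in this demo.
fake_env = {"UNITXT_DEFAULT_VERBOSITY": "debug"}
print(env_var_name("default_verbosity"))  # prints UNITXT_DEFAULT_VERBOSITY
print(read_override("default_verbosity", "info", environ=fake_env))  # prints debug
```

Note that values read from the environment are strings, so a real implementation would still need to convert them to the setting's declared type.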

Rationale
=========

Managing application-wide configuration and constants can be challenging, especially in larger systems. The :class:`Settings <settings_utils.Settings>` and :class:`Constants <settings_utils.Constants>` classes provide a centralized, thread-safe, and type-safe way to manage these configurations.

- **Settings**: Designed for mutable configurations that can be customized dynamically, with optional type enforcement and environment variable overrides.
- **Constants**: Designed for immutable values that remain consistent throughout the application lifecycle.

By centralizing these configurations, you can:

- Ensure consistency across your application.
- Simplify debugging and testing.
- Enable dynamic configuration using environment variables or runtime contexts.

Adding New Settings
Review comment (Member):
This is not for users, only for contributors. I don't think we should document it in this tutorial, only in the code (or at most as a final "for developers" step).

===================

To add a new setting, follow these steps:

1. Open the :class:`Settings <settings_utils.Settings>` initialization block in the :mod:`settings_utils` module.
2. Add a new setting key as a ``(type, default_value)`` tuple, which enforces the setting's type and provides its default value.

.. code-block:: python

    settings.new_feature_enabled = (bool, False)  # Adding a new boolean setting.

Guidelines:

- Use a clear and descriptive name for the setting.
- Always specify the type as one of `int`, `float`, or `bool`.
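
To make the ``(type, default_value)`` convention more concrete, here is a minimal, self-contained sketch of how such a tuple could register a type that later assignments are checked against. ``TypedSettings`` is a hypothetical stand-in written for this example; it is not the library's ``Settings`` class, and raising ``TypeError`` on a mismatch is an assumption.

```python
class TypedSettings:
    """Hypothetical stand-in for a typed settings object (not unitxt's Settings)."""

    ALLOWED_TYPES = (int, float, bool)

    def __init__(self):
        # Bypass our own __setattr__ while creating the internal storage.
        object.__setattr__(self, "_values", {})
        object.__setattr__(self, "_types", {})

    def __setattr__(self, key, value):
        if isinstance(value, tuple) and len(value) == 2 and isinstance(value[0], type):
            # A (type, default) tuple declares the setting's type and default.
            declared_type, value = value
            if declared_type not in self.ALLOWED_TYPES:
                raise ValueError(f"Unsupported type for '{key}': {declared_type}")
            self._types[key] = declared_type
        elif key in self._types and not isinstance(value, self._types[key]):
            # Later assignments must match the declared type (assumed behavior).
            raise TypeError(f"'{key}' expects {self._types[key].__name__}")
        self._values[key] = value

    def __getattr__(self, key):
        # Only called when normal attribute lookup fails.
        try:
            return self._values[key]
        except KeyError:
            raise AttributeError(key) from None


settings = TypedSettings()
settings.new_feature_enabled = (bool, False)  # declare type and default
settings.new_feature_enabled = True  # accepted: matches the declared type
print(settings.new_feature_enabled)  # prints True
```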

Adding New Constants
Review comment (Member):
Moreover here, constants are not to be added by users.

====================

To add a new constant:

1. Open the :class:`Constants <settings_utils.Constants>` initialization block in the :mod:`settings_utils` module.
2. Assign the new constant key its value.

.. code-block:: python

    constants.new_constant = "new_value"  # Adding a new constant.

Guidelines:

- Constants should represent fixed, immutable values.
- Use clear and descriptive names that indicate their purpose.
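
The idea of a value that is assigned once and then stays fixed can be sketched with a small write-once container. ``FrozenConstants`` below is a hypothetical illustration, not the library's ``Constants`` class; in particular, raising ``AttributeError`` on reassignment is an assumption made for this example.

```python
class FrozenConstants:
    """Hypothetical write-once container (not unitxt's Constants class)."""

    def __init__(self):
        # Bypass our own __setattr__ while creating the internal storage.
        object.__setattr__(self, "_values", {})

    def __setattr__(self, key, value):
        if key in self._values:
            raise AttributeError(f"Constant '{key}' is already set and cannot be modified.")
        self._values[key] = value

    def __getattr__(self, key):
        # Only called when normal attribute lookup fails.
        try:
            return self._values[key]
        except KeyError:
            raise AttributeError(key) from None


constants = FrozenConstants()
constants.new_constant = "new_value"  # first assignment succeeds

try:
    constants.new_constant = "other_value"  # second assignment is rejected
except AttributeError as error:
    print(error)

print(constants.new_constant)  # prints new_value
```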

Using Settings Context
======================

The :class:`Settings <settings_utils.Settings>` class provides a `context` manager to temporarily override settings within a specific block of code. After exiting the block, the settings revert to their original values.
Review comment (Member):
This is important.


Example:

.. code-block:: python

    from unitxt import settings

    print(settings.default_verbosity)  # Output: "info"

    with settings.context(default_verbosity="debug"):
        print(settings.default_verbosity)  # Output: "debug"

    print(settings.default_verbosity)  # Output: "info"

This feature is useful for scenarios like testing or running specific tasks with modified configurations.
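
For readers curious how a temporary override of this kind can work, the following is a minimal sketch built on ``contextlib.contextmanager``. The ``override`` function and the ``demo`` object are hypothetical and only model the save, set, restore pattern; they make no claim about how ``settings.context`` is actually implemented.

```python
from contextlib import contextmanager
from types import SimpleNamespace


@contextmanager
def override(obj, **overrides):
    """Temporarily set attributes on obj and restore the originals on exit."""
    originals = {name: getattr(obj, name) for name in overrides}
    for name, value in overrides.items():
        setattr(obj, name, value)
    try:
        yield obj
    finally:
        # Restore even if the block raised an exception.
        for name, value in originals.items():
            setattr(obj, name, value)


demo = SimpleNamespace(default_verbosity="info")  # stand-in for a settings object

with override(demo, default_verbosity="debug"):
    inside = demo.default_verbosity

after = demo.default_verbosity
print(inside, after)  # prints debug info
```

Restoring in a ``finally`` clause is what makes the pattern safe for tests: the original value comes back even when the block fails.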

List of Settings
================

Below is the list of available settings, their types, default values, corresponding environment variable names, and descriptions:

.. list-table::
   :header-rows: 1

   * - Setting
     - Type
     - Default Value
     - Environment Variable
     - Description
   * - allow_unverified_code
     - bool
     - False
     - UNITXT_ALLOW_UNVERIFIED_CODE
     - Enables or disables execution of unverified code.
Review comment (Member), suggested change:
Original: - Enables or disables execution of unverified code.
Suggested: - Enables or disables execution of unverified code. Unverified code includes executable code from HF datasets and calls to ExecuteExpressions or other operators that run user code. This ensures only trusted code is executed.

   * - use_only_local_catalogs
     - bool
     - False
     - UNITXT_USE_ONLY_LOCAL_CATALOGS
     - Restricts operations to use only local catalogs.
Review comment (Member), suggested change:
Original: - Restricts operations to use only local catalogs.
Suggested: - Restricts loading of artifacts to only use local catalogs on local filesystems (and not remote GitHub repos).

   * - global_loader_limit
     - int
     - None
     - UNITXT_GLOBAL_LOADER_LIMIT
     - Sets a limit on the number of global data loaders.
Review comment (Member):
Is it the default value for "loader_limit"?

   * - num_resamples_for_instance_metrics
     - int
     - 1000
     - UNITXT_NUM_RESAMPLES_FOR_INSTANCE_METRICS
     - Number of resamples used for calculating instance-level metrics.
   * - num_resamples_for_global_metrics
     - int
     - 100
     - UNITXT_NUM_RESAMPLES_FOR_GLOBAL_METRICS
     - Number of resamples used for calculating global metrics.
   * - max_log_message_size
     - int
     - 100000
     - UNITXT_MAX_LOG_MESSAGE_SIZE
     - Maximum size allowed for log messages.
   * - catalogs
     - None
     - None
     - UNITXT_CATALOGS
     - Specifies the catalogs configuration.
Review comment (Member):
Not clear.

   * - artifactories
     - None
     - None
     - UNITXT_ARTIFACTORIES
     - Defines the artifact storage configuration.
Review comment (Member):
Also not clear.

   * - default_recipe
     - str
     - "dataset_recipe"
     - UNITXT_DEFAULT_RECIPE
     - Specifies the default recipe for datasets.
Review comment (Member):
Is it needed? What can it be set to?

   * - default_verbosity
     - str
     - "info"
     - UNITXT_DEFAULT_VERBOSITY
     - Sets the default verbosity level for logging.
   * - use_eager_execution
     - bool
     - False
     - UNITXT_USE_EAGER_EXECUTION
     - Enables eager execution for tasks.
Review comment (Member):
Describe what it is.

   * - remote_metrics
     - list
     - []
     - UNITXT_REMOTE_METRICS
     - Defines a list of configurations for remote metrics.
Review comment (Member):
This is not really checked. Should we keep it?

   * - test_card_disable
     - bool
     - False
     - UNITXT_TEST_CARD_DISABLE
     - Disables the use of test cards when enabled.
Review comment (Member):
Why use it?

   * - test_metric_disable
     - bool
     - False
     - UNITXT_TEST_METRIC_DISABLE
     - Disables the use of test metrics when enabled.
Review comment (Member):
Why use it?

   * - metrics_master_key_token
     - None
     - None
     - UNITXT_METRICS_MASTER_KEY_TOKEN
     - Specifies the master token for accessing metrics.
   * - seed
     - int
     - 42
     - UNITXT_SEED
     - Default seed value for random operations.
   * - skip_artifacts_prepare_and_verify
     - bool
     - False
     - UNITXT_SKIP_ARTIFACTS_PREPARE_AND_VERIFY
     - Skips preparation and verification of artifacts.
Review comment (Member):
Why use it?

   * - data_classification_policy
     - None
     - None
     - UNITXT_DATA_CLASSIFICATION_POLICY
     - Specifies the policy for data classification.
   * - mock_inference_mode
     - bool
     - False
     - UNITXT_MOCK_INFERENCE_MODE
     - Enables mock inference mode for testing.
   * - disable_hf_datasets_cache
     - bool
     - True
     - UNITXT_DISABLE_HF_DATASETS_CACHE
     - Disables caching for Hugging Face datasets.
Review comment (Member):
This is an important one. Need to describe the behavior, why caching is disabled by default and what changing means.

   * - loader_cache_size
     - int
     - 1
     - UNITXT_LOADER_CACHE_SIZE
     - Sets the cache size for data loaders.
Review comment (Member):
When is it used?

   * - task_data_as_text
     - bool
     - True
     - UNITXT_TASK_DATA_AS_TEXT
     - Enables representation of task data as plain text.
Review comment (Member):
Why set it?

   * - default_provider
     - str
     - "watsonx"
     - UNITXT_DEFAULT_PROVIDER
     - Specifies the default provider for tasks.
Review comment (Member), suggested change:
Original: - Specifies the default provider for tasks.
Suggested: - Defines the default provider used by CrossProviderInferenceEngine. Used to change the platform (OpenAI, HF, Watson) used for inference calls and LLM as Judges without changing code.

   * - default_format
     - None
     - None
     - UNITXT_DEFAULT_FORMAT
     - Defines the default format for data processing.
Review comment (Member):
This is important.


List of Constants
=================

Below is the list of available constants and their values:

.. list-table::
   :header-rows: 1

   * - Constant
     - Value
   * - dataset_file
     - Path to `dataset.py`.
   * - metric_file
     - Path to `metric.py`.
   * - local_catalog_path
     - Path to the local catalog directory.
   * - package_dir
     - Directory of the installed package.
   * - default_catalog_path
     - Default catalog directory path.
   * - dataset_url
     - URL for dataset resources.
   * - metric_url
     - URL for metric resources.
   * - version
     - Current version of the application.
   * - catalog_hierarchy_sep
     - Separator for catalog hierarchy levels.
   * - env_local_catalogs_paths_sep
     - Separator for local catalog paths in environment variables.
   * - non_registered_files
     - List of files excluded from registration.
   * - codebase_url
     - URL of the codebase repository.
   * - website_url
     - Official website URL.
   * - inference_stream
     - Name of the inference stream constant.
   * - instance_stream
     - Name of the instance stream constant.
   * - image_tag
     - Default image tag for operations.
   * - demos_pool_field
     - Field name for demos pool.

Conclusion
==========

The `Settings` and `Constants` system provides a robust and flexible way to manage your application's configuration and constants. By following the guidelines above, you can extend and use these classes effectively in your application.
1 change: 1 addition & 0 deletions docs/docs/tutorials.rst
@@ -31,4 +31,5 @@ Tutorials ✨
tags_and_descriptions
types_and_serializers
contributors_guide
settings

82 changes: 82 additions & 0 deletions src/unitxt/settings_utils.py
@@ -1,3 +1,85 @@
"""Library Settings and Constants.

This module provides a mechanism for managing application-wide configuration and immutable constants. It includes the `Settings` and `Constants` classes, which are implemented as singletons so that a single shared instance is used across the application. It also defines utility functions to access these objects and configure application behavior.

### Key Components:

1. **Settings Class**:
- A singleton class for managing mutable configuration settings.
- Supports type enforcement for settings to ensure correct usage.
- Allows dynamic modification of settings using a context manager for temporary changes.
- Retrieves environment variable overrides for settings, enabling external customization.

#### Available Settings:
- `allow_unverified_code` (bool, default: False): Whether to allow unverified code execution.
- `use_only_local_catalogs` (bool, default: False): Restrict operations to local catalogs only.
- `global_loader_limit` (int, default: None): Limit for global data loaders.
- `num_resamples_for_instance_metrics` (int, default: 1000): Number of resamples for instance-level metrics.
- `num_resamples_for_global_metrics` (int, default: 100): Number of resamples for global metrics.
- `max_log_message_size` (int, default: 100000): Maximum size of log messages.
- `catalogs` (default: None): List of catalog configurations.
- `artifactories` (default: None): Artifact storage configurations.
- `default_recipe` (str, default: "dataset_recipe"): Default recipe for dataset operations.
- `default_verbosity` (str, default: "info"): Default verbosity level for logging.
- `use_eager_execution` (bool, default: False): Enable eager execution for tasks.
- `remote_metrics` (list, default: []): List of remote metrics configurations.
- `test_card_disable` (bool, default: False): Disable test cards if set to True.
- `test_metric_disable` (bool, default: False): Disable test metrics if set to True.
- `metrics_master_key_token` (default: None): Master token for metrics.
- `seed` (int, default: 42): Default seed for random operations.
- `skip_artifacts_prepare_and_verify` (bool, default: False): Skip artifact preparation and verification.
- `data_classification_policy` (default: None): Policy for data classification.
- `mock_inference_mode` (bool, default: False): Enable mock inference mode.
- `disable_hf_datasets_cache` (bool, default: True): Disable caching for Hugging Face datasets.
- `loader_cache_size` (int, default: 1): Cache size for data loaders.
- `task_data_as_text` (bool, default: True): Represent task data as text.
- `default_provider` (str, default: "watsonx"): Default service provider.
- `default_format` (default: None): Default format for data processing.

#### Usage:
- Access settings using the `get_settings()` function.
- Modify settings temporarily using the `context` method:
```python
settings = get_settings()
with settings.context(default_verbosity="debug"):
    ...  # Code within this block uses "debug" verbosity.
```

2. **Constants Class**:
- A singleton class for managing immutable constants used across the application.
- Constants cannot be modified once set.
- Provides centralized access to paths, URLs, and other fixed application parameters.

#### Available Constants:
- `dataset_file`: Path to the dataset file.
- `metric_file`: Path to the metric file.
- `local_catalog_path`: Path to the local catalog directory.
- `package_dir`: Directory of the installed package.
- `default_catalog_path`: Default catalog directory path.
- `dataset_url`: URL for dataset resources.
- `metric_url`: URL for metric resources.
- `version`: Current version of the application.
- `catalog_hierarchy_sep`: Separator for catalog hierarchy levels.
- `env_local_catalogs_paths_sep`: Separator for local catalog paths in environment variables.
- `non_registered_files`: List of files excluded from registration.
- `codebase_url`: URL of the codebase repository.
- `website_url`: Official website URL.
- `inference_stream`: Name of the inference stream constant.
- `instance_stream`: Name of the instance stream constant.
- `image_tag`: Default image tag for operations.
- `demos_pool_field`: Field name for demos pool.

#### Usage:
- Access constants using the `get_constants()` function:
```python
constants = get_constants()
print(constants.dataset_file)
```

3. **Helper Functions**:
- `get_settings()`: Returns the singleton `Settings` instance.
- `get_constants()`: Returns the singleton `Constants` instance.
"""
import importlib.metadata
import importlib.util
import os