components text_generation_datapreprocess
github-actions[bot] edited this page Jan 22, 2025 · 51 revisions
Component to preprocess data for text generation task
Version: 0.0.67
View in Studio: https://ml.azure.com/registries/azureml/components/text_generation_datapreprocess/version/0.0.67
Text Generation task arguments
Name | Description | Type | Default | Optional | Enum |
---|---|---|---|---|---|
text_key | Key for the text in an example. Format your data keeping in mind that text is concatenated with ground_truth during finetuning in the form text + ground_truth. For example, "text"="knock knock\n", "ground_truth"="who's there" is treated as "knock knock\nwho's there". | string | | False | |
ground_truth_key | Key for ground_truth in an example. A separate column is used for ground_truth to enable use cases such as summarization, translation and question_answering, which can be repurposed as text generation where both text and ground_truth are needed. This separation is also useful for calculating metrics. For example, "text"="Summarize this dialog:\n{input_dialogue}\nSummary:\n", "ground_truth"="{summary of the dialogue}". | string | | True | |
batch_size | Number of examples to batch before calling the tokenization function. | integer | 1000 | True | |
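The concatenation and batching described in the table above can be sketched as follows. This is an illustrative sketch only, not the component's actual code: the helper names `build_training_text` and `batched` are hypothetical, and it assumes no separator is inserted between text and ground_truth, as the description states.

```python
from typing import Dict, Iterator, List


def build_training_text(example: Dict[str, str], text_key: str, ground_truth_key: str) -> str:
    """Join prompt and target as text + ground_truth, with nothing added in between.

    Hypothetical helper: any separator (newline, space) must already be part of the data.
    """
    return example[text_key] + example[ground_truth_key]


def batched(examples: List[Dict[str, str]], batch_size: int = 1000) -> Iterator[List[Dict[str, str]]]:
    """Yield consecutive slices of at most batch_size examples (hypothetical helper)."""
    for start in range(0, len(examples), batch_size):
        yield examples[start:start + batch_size]


# The example from the text_key description.
example = {"text": "knock knock\n", "ground_truth": "who's there"}
joined = build_training_text(example, text_key="text", ground_truth_key="ground_truth")

# Five examples with batch_size=2 give batches of 2, 2 and 1.
batch_sizes = [len(batch) for batch in batched([example] * 5, batch_size=2)]
```

Because nothing is inserted between the two fields, a trailing newline or space that should separate prompt and answer has to be included in the text column itself.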
Tokenization params
Name | Description | Type | Default | Optional | Enum |
---|---|---|---|---|---|
pad_to_max_length | If set to True, the returned sequences are padded according to the model's padding side and padding index, up to max_seq_length. If no max_seq_length is specified, padding is done up to the model's max length. | string | false | True | ['true', 'false'] |
max_seq_length | Default is -1, which means padding is done up to the model's max length. Otherwise, sequences are padded to max_seq_length. | integer | -1 | True | |
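How the two tokenization parameters interact can be sketched in plain Python. This is a sketch under stated assumptions, not the component's implementation: `resolve_pad_length` and `pad_sequence` are hypothetical names, `model_max_length`, `pad_id` and `padding_side` stand in for values that would come from the model's tokenizer, and truncation of over-long sequences is not modelled.

```python
from typing import List


def resolve_pad_length(max_seq_length: int, model_max_length: int) -> int:
    """Return the padding target: -1 falls back to the model's max length."""
    return model_max_length if max_seq_length == -1 else max_seq_length


def pad_sequence(
    token_ids: List[int],
    pad_to_max_length: str,   # the component takes the strings 'true' / 'false'
    max_seq_length: int,
    model_max_length: int,    # assumed to come from the tokenizer
    pad_id: int = 0,          # assumed padding index
    padding_side: str = "right",  # assumed padding side
) -> List[int]:
    """Pad token_ids up to the resolved length when padding is enabled."""
    if pad_to_max_length.lower() != "true":
        return list(token_ids)
    target = resolve_pad_length(max_seq_length, model_max_length)
    padding = [pad_id] * max(0, target - len(token_ids))
    if padding_side == "right":
        return list(token_ids) + padding
    return padding + list(token_ids)


padded_explicit = pad_sequence([5, 6, 7], "true", max_seq_length=5, model_max_length=8)
padded_default = pad_sequence([5, 6, 7], "true", max_seq_length=-1, model_max_length=8)
unpadded = pad_sequence([5, 6, 7], "false", max_seq_length=5, model_max_length=8)
```

With the defaults (pad_to_max_length set to false), sequences are left at their own length, which is what `unpadded` shows.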
Inputs
Name | Description | Type | Default | Optional | Enum |
---|---|---|---|---|---|
train_file_path | Path to the registered training data asset. The supported data formats are jsonl, json, csv, tsv and parquet. | uri_file | | True | |
validation_file_path | Path to the registered validation data asset. The supported data formats are jsonl, json, csv, tsv and parquet. | uri_file | | True | |
test_file_path | Path to the registered test data asset. The supported data formats are jsonl, json, csv, tsv and parquet. | uri_file | | True | |
train_mltable_path | Path to the registered training data asset in mltable format. | mltable | | True | |
validation_mltable_path | Path to the registered validation data asset in mltable format. | mltable | | True | |
test_mltable_path | Path to the registered test data asset in mltable format. | mltable | | True | |
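As an illustration of one of the supported file formats, the sketch below writes and reads back a small jsonl file whose columns match text_key="text" and ground_truth_key="ground_truth". The records, the file name `train.jsonl` and the helpers `write_jsonl` and `read_jsonl` are made up for this example; registering the file as a data asset is not shown.

```python
import json
import tempfile
from pathlib import Path
from typing import Dict, List

# Made-up sample records; the column names are whatever you pass as text_key / ground_truth_key.
records: List[Dict[str, str]] = [
    {"text": "Summarize this dialog:\nA: Hi\nB: Hello\nSummary:\n", "ground_truth": "Two people greet each other."},
    {"text": "knock knock\n", "ground_truth": "who's there"},
]


def write_jsonl(rows: List[Dict[str, str]], path: Path) -> None:
    """Write one JSON object per line."""
    with path.open("w", encoding="utf-8") as handle:
        for row in rows:
            handle.write(json.dumps(row, ensure_ascii=False) + "\n")


def read_jsonl(path: Path) -> List[Dict[str, str]]:
    """Read a jsonl file back into a list of dicts, skipping blank lines."""
    with path.open("r", encoding="utf-8") as handle:
        return [json.loads(line) for line in handle if line.strip()]


with tempfile.TemporaryDirectory() as tmp_dir:
    train_path = Path(tmp_dir) / "train.jsonl"
    write_jsonl(records, train_path)
    loaded = read_jsonl(train_path)
```

Note that json.dumps escapes the embedded newlines, so each record stays on a single physical line, as the jsonl format requires.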
Dataset parameters
Name | Description | Type | Default | Optional | Enum |
---|---|---|---|---|---|
model_selector_output | Output folder of the model selector, containing model metadata such as config, checkpoints and tokenizer config. | uri_folder | | False | |
Validation parameters
Name | Description | Type | Default | Optional | Enum |
---|---|---|---|---|---|
system_properties | Validation parameters propagated from pipeline. | string | | True | |
Outputs
Name | Description | Type |
---|---|---|
output_dir | Folder containing the tokenized output of the train, validation and test data, along with the tokenizer files used to tokenize the data. | uri_folder |
Environment
azureml://registries/azureml/environments/acft-hf-nlp-gpu/versions/80