URGENT: Training with multiple datasets for text detector #14580
To address your queries about modifying the configuration for multiple datasets and dynamically generating datasets during training, let me provide detailed guidance.

**1. How to modify the config file to use multiple datasets during training?**

In PaddleOCR, you can train with multiple datasets by specifying them in the `label_file_list` of the `Train.dataset` section and weighting them with `ratio_list`. Changes to your config file:
Example:

```yaml
Train:
  dataset:
    name: SimpleDataSet
    data_dir: /home/jovyan/01_Paddle
    label_file_list:
      - /home/jovyan/01_Paddle/dataset1_labels.txt
      - /home/jovyan/01_Paddle/dataset2_labels.txt
      - /home/jovyan/01_Paddle/dataset3_labels.txt
    ratio_list:
      - 0.5  # 50% from dataset1
      - 0.3  # 30% from dataset2
      - 0.2  # 20% from dataset3
    transforms:
      - DecodeImage:
          img_mode: BGR
          channel_first: false
      - DetLabelEncode: null
      - IaaAugment:
          augmenter_args:
            - type: Fliplr
              args:
                p: 0.5
            - type: Affine
              args:
                rotate:
                  - -10
                  - 10
            - type: Resize
              args:
                size:
                  - 0.8
                  - 1.5
      - EastRandomCropData:
          size:
            - 960
            - 960
          max_tries: 50
          keep_ratio: true
      - MakeBorderMap:
          shrink_ratio: 0.4
          thresh_min: 0.3
          thresh_max: 0.7
      - MakeShrinkMap:
          shrink_ratio: 0.4
          min_text_size: 8
      - NormalizeImage:
          scale: 1./255.
          mean:
            - 0.485
            - 0.456
            - 0.406
          std:
            - 0.229
            - 0.224
            - 0.225
          order: hwc
      - ToCHWImage: null
      - KeepKeys:
          keep_keys:
            - image
            - threshold_map
            - threshold_mask
            - shrink_map
            - shrink_mask
  loader:
    shuffle: true
    drop_last: false
    batch_size_per_card: 16
    num_workers: 4
```

In this setup, `label_file_list` enumerates the annotation files and `ratio_list` sets the fraction of samples drawn from each file per epoch: 50% from dataset1, 30% from dataset2, and 20% from dataset3.
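For reference, each line in a label file follows the standard `SimpleDataSet` detection annotation format: an image path (relative to `data_dir`), a tab, and a JSON list of text regions. The file name and coordinates below are illustrative only:

```
dataset1/img_61.jpg	[{"transcription": "MASA", "points": [[310, 104], [416, 141], [418, 216], [312, 179]]}]
```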
**2. How to dynamically generate datasets during training (e.g., 2000 new samples per epoch)?**

To achieve dynamic dataset generation, especially for data augmentation during training, you can use custom data pipeline scripts or leverage the built-in augmentation transforms. Because random augmentation parameters are re-sampled every time an image is loaded, each epoch effectively sees a freshly generated variant of every base sample.
Example `transforms` pipeline with random augmentation:

```yaml
transforms:
  - DecodeImage:
      img_mode: BGR
      channel_first: false
  - DetLabelEncode: null
  - IaaAugment:
      augmenter_args:
        - type: Fliplr
          args:
            p: 0.5
        - type: Affine
          args:
            rotate:
              - -10
              - 10
        - type: Resize
          args:
            size:
              - 0.5
              - 2.0
  - EastRandomCropData:
      size:
        - 960
        - 960
      max_tries: 50
      keep_ratio: true
  - MakeBorderMap:
      shrink_ratio: 0.4
      thresh_min: 0.3
      thresh_max: 0.7
  - MakeShrinkMap:
      shrink_ratio: 0.4
      min_text_size: 8
  - NormalizeImage:
      scale: 1./255.
      mean:
        - 0.485
        - 0.456
        - 0.406
      std:
        - 0.229
        - 0.224
        - 0.225
      order: hwc
  - ToCHWImage: null
  - KeepKeys:
      keep_keys:
        - image
        - threshold_map
        - threshold_mask
        - shrink_map
        - shrink_mask
```
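The reason this pipeline yields "new" data every epoch is that the augmentation parameters (flip decision, rotation angle, resize scale) are re-sampled on every sample access. A toy sketch of that mechanism, where `random_rotate` is a hypothetical stand-in and not a PaddleOCR API:

```python
import random

def random_rotate(sample):
    """Toy stand-in for the Affine transform: re-samples the angle per call."""
    out = dict(sample)
    out["angle"] = random.uniform(-10, 10)  # mirrors rotate: [-10, 10] above
    return out

base = {"image": "img_61.jpg"}
# The same base sample produces a different variant on every access,
# so each epoch effectively trains on freshly generated data.
print([round(random_rotate(base)["angle"], 2) for _ in range(3)])
```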
Example of a Python dataset class:

```python
import random

from paddle.io import Dataset


class DynamicTextDataset(Dataset):
    """Serves a fixed number of randomly augmented samples per epoch."""

    def __init__(self, base_data, augmentations, num_samples):
        super().__init__()
        self.base_data = base_data          # list of raw (image, label) samples
        self.augmentations = augmentations  # callables applied to each sample
        self.num_samples = num_samples      # virtual dataset size per epoch

    def __len__(self):
        return self.num_samples

    def __getitem__(self, idx):
        # Select a random base sample
        sample = random.choice(self.base_data)
        # Apply augmentations to synthesize a new variant on the fly
        for aug in self.augmentations:
            sample = aug(sample)
        return sample
```

You can integrate this dataset class into your training pipeline by replacing the default dataset class used by the loader.
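A rough usage sketch follows. The `load_base_samples` helper and the augmentation callables are hypothetical placeholders you would implement yourself; `num_samples=2000` matches the 2000-samples-per-epoch goal from the question:

```python
from paddle.io import DataLoader

# Hypothetical: load your annotated base samples and define augmentation
# callables, each taking a sample dict and returning a new one.
base_data = load_base_samples("/home/jovyan/01_Paddle/dataset1_labels.txt")
augmentations = [random_flip, random_rotate, random_resize]

# 2000 freshly augmented samples per epoch.
train_dataset = DynamicTextDataset(base_data, augmentations, num_samples=2000)
train_loader = DataLoader(train_dataset, batch_size=16, shuffle=True, num_workers=4)

for batch in train_loader:
    ...  # forward/backward pass with the detector model
```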
Let me know if you need further clarification!

*Response generated by feifei-bot | chatgpt-4o-latest*
Dear Paddle Community,

I am currently fine-tuning a text detector and have used data augmentation during training, but I have not been able to get good results with either DB or DB++. I currently have 2000 data points for training the text detector.

Questions:

1. How do I modify the config file to use multiple datasets during training?
2. How can I dynamically generate datasets during training (e.g., 2000 new samples per epoch)?

I have shared my current config file below for perusal. Thanks for reading this!

I humbly request guidance from @GreatV @WenmuZhou @LDOUBLEV @MissPenguin @tink2123 @UserWangZz and others.
```yaml
Global:
  debug: false
  use_gpu: true
  epoch_num: 500
  log_smooth_window: 20
  print_batch_step: 10
  save_model_dir: /home/jovyan/01_Paddle/04_Paddle_Models/03_detector_pretrained/01_DB_Default_Det/v32_train
  save_epoch_step: 100
  eval_batch_step:
  cal_metric_during_train: false
  pretrained_model: /home/jovyan/01_Paddle/04_Paddle_Models/03_detector_pretrained/01_DB_Default_Det/en_PP-OCRv3_det_distill_train/student.pdparams
  checkpoints: null
  save_inference_dir: null
  use_visualdl: false
  infer_img: null
  save_res_path: null
  distributed: true

Architecture:
  freeze_params: true
  model_type: det
  algorithm: DB
  Transform: null
  Backbone:
    name: MobileNetV3
    scale: 0.5
    model_name: large
    disable_se: true
  Neck:
    name: RSEFPN
    out_channels: 96
    shortcut: true
  Head:
    name: DBHead
    k: 50

Loss:
  name: DBLoss
  balance_loss: true
  main_loss_type: DiceLoss
  alpha: 5
  beta: 10
  ohem_ratio: 3

Optimizer:
  name: Adam
  beta1: 0.9
  beta2: 0.999
  lr:
    name: Cosine
    learning_rate: 0.0005
    warmup_epoch: 5
  regularizer:
    name: L2
    factor: 0.0001

PostProcess:
  name: DBPostProcess
  thresh: 0.3
  box_thresh: 0.6
  max_candidates: 1000
  unclip_ratio: 1.5

Metric:
  name: DetMetric
  main_indicator: hmean

Train:
  dataset:
    name: SimpleDataSet
    data_dir: /home/jovyan/01_Paddle
    label_file_list:
    ratio_list:
    transforms:
      - DecodeImage:
          img_mode: BGR
          channel_first: false
      - IaaAugment:
          augmenter_args:
            - type: Fliplr
              args:
                p: 0.5
            - type: Affine
              args:
                rotate:
            - type: Resize
              args:
                size:
      - EastRandomCropData:
          size:
          max_tries: 50
          keep_ratio: true
      - MakeBorderMap:
          shrink_ratio: 0.4
          thresh_min: 0.3
          thresh_max: 0.7
      - MakeShrinkMap:
          shrink_ratio: 0.4
          min_text_size: 8
      - NormalizeImage:
          scale: 1./255.
          mean:
          std:
          order: hwc
      - KeepKeys:
          keep_keys:
  loader:
    shuffle: true
    drop_last: false
    batch_size_per_card: 16
    num_workers: 4

Eval:
  dataset:
    name: SimpleDataSet
    data_dir: /home/jovyan/01_Paddle
    label_file_list:
    transforms:
      - DecodeImage:
          img_mode: BGR
          channel_first: false
      - NormalizeImage:
          scale: 1./255.
          mean:
          std:
          order: hwc
      - KeepKeys:
          keep_keys:
  loader:
    shuffle: false
    drop_last: false
    batch_size_per_card: 1
    num_workers: 2

profiler_options: null
```