
Could you share the script to train the model? #44

Open
xianruizhong opened this issue Jul 2, 2023 · 2 comments

Comments

@xianruizhong

As the title says. Thanks!

@yongyaoduan

I am also interested in the training script for the model mentioned by @xianruizhong. Could you kindly share it, if available? Your assistance would be greatly appreciated.

Thank you!

@sachaRfd

For future runs, something simple like this would work :)

from transformers import AlbertForMaskedLM, AlbertConfig, Trainer, TrainingArguments
from datasets import load_dataset
from transformers import DataCollatorForLanguageModeling
from rxnmapper.tokenization_smiles import SmilesTokenizer


def tokenize_function(examples):
    # Tokenize the reaction SMILES in the "rxn" column of the dataset
    return tokenizer(examples["rxn"])


# Tokenizer setup
vocab_path = "PATH/TO/vocab.txt"
tokenizer = SmilesTokenizer(vocab_path)


# Dataset setup
dataset_path = "PATH/TO/data.csv"
dataset = load_dataset(
    "csv",
    data_files=dataset_path,
)

# Tokenize the dataset
dataset = dataset.map(
    tokenize_function,
    batched=True,
    num_proc=24,
)

# Model setup
alberta_config_path = "PATH/TO/config.json"
model_config = AlbertConfig.from_pretrained(alberta_config_path)
model = AlbertForMaskedLM(model_config)

# Data collator for MLM, using a mask probability of 15%
data_collator = DataCollatorForLanguageModeling(
    tokenizer=tokenizer, mlm=True, mlm_probability=0.15
)

# Training arguments setup
training_args = TrainingArguments(
    output_dir="models/alberta",
    learning_rate=2e-4,
    # num_train_epochs=3,
    max_steps=1_500_000,
    weight_decay=0.001,
    logging_steps=100,
    eval_strategy="no",
    save_strategy="steps",
    save_steps=1_000,
    save_total_limit=2,
    save_only_model=True,
    per_device_train_batch_size=16,
    gradient_accumulation_steps=4,
    logging_dir="logs",
    logging_first_step=True,
    report_to="wandb",
    run_name="alberta_test_run",
)


# Trainer setup
trainer = Trainer(
    model=model,
    args=training_args,
    data_collator=data_collator,
    train_dataset=dataset["train"],
)

# Train the model
trainer.train()
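
A couple of assumptions to make the snippet above easier to reproduce: the CSV is expected to have a column named "rxn" containing reaction SMILES (that is the column tokenize_function reads), and vocab.txt / config.json would typically be the ones shipped with rxnmapper. Once training has produced a checkpoint, you can reload it with plain transformers calls for a quick masked-token sanity check. A minimal sketch, where the checkpoint path and the toy reaction SMILES are illustrative placeholders, not something from the repo:

import torch
from transformers import AlbertForMaskedLM
from rxnmapper.tokenization_smiles import SmilesTokenizer

# Hypothetical checkpoint written by the Trainer above; replace with a real
# directory, e.g. the latest models/alberta/checkpoint-XXXX.
checkpoint_dir = "models/alberta/checkpoint-1000"

tokenizer = SmilesTokenizer("PATH/TO/vocab.txt")
model = AlbertForMaskedLM.from_pretrained(checkpoint_dir)
model.eval()

# Mask one token of a toy reaction SMILES and ask the model to recover it.
rxn = "CC(=O)O.OCC>>CC(=O)OCC"
encoded = tokenizer(rxn, return_tensors="pt")
input_ids = encoded["input_ids"].clone()
masked_position = 3  # arbitrary position inside the sequence
input_ids[0, masked_position] = tokenizer.mask_token_id

with torch.no_grad():
    logits = model(input_ids=input_ids, attention_mask=encoded["attention_mask"]).logits

predicted_id = logits[0, masked_position].argmax(-1).item()
print("predicted token:", tokenizer.convert_ids_to_tokens([predicted_id])[0])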
