Baal in Production | Classification | NLP | Hugging Face #242
In my current use case I am trying to implement something similar to https://baal.readthedocs.io/en/latest/notebooks/baal_prod_cls/ for NLP with Hugging Face. Currently I have unlabelled data, and the idea is to introduce a human oracle in a production setting who can label data in batches, so that we can save and use the updated model we get after active learning. I am basing this implementation on https://github.com/baal-org/baal/blob/master/experiments/nlp_bert_mcdropout.py and the notebook linked above. My understanding from these tutorials is as follows:
# Run MC-Dropout inference on the unlabelled pool (15 stochastic iterations)
predictions = baal_model.predict_on_dataset(active_set.pool, iterations=15)
# Pick the top n most uncertain samples
top_uncertainty = heuristic(predictions)[:n]
# Map pool indices to indices in the original dataset
oracle_indices = active_set._pool_to_oracle_index(top_uncertainty)
# Get labels from the human oracle
labels = [get_label(idx) for idx in oracle_indices]
# Label them
active_set.label(top_uncertainty, labels)
I just wanted to confirm that I am going in the right direction. A few other questions:
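For anyone following along, the ranking and index-mapping steps in the snippet above can be sketched without baal. This is only an illustration under stated assumptions: `predictive_entropy`, `rank_most_uncertain` and `ToyActiveSet` are hypothetical names, entropy of the mean prediction is just one possible uncertainty score, and none of this is baal's actual implementation.

```python
import math
from typing import Dict, List, Sequence


def predictive_entropy(mc_probs: Sequence[Sequence[float]]) -> float:
    """Entropy of the class distribution averaged over stochastic passes."""
    n_iter = len(mc_probs)
    n_cls = len(mc_probs[0])
    mean = [sum(p[c] for p in mc_probs) / n_iter for c in range(n_cls)]
    return -sum(p * math.log(p) for p in mean if p > 0)


def rank_most_uncertain(pool_probs: Sequence[Sequence[Sequence[float]]], n: int) -> List[int]:
    """Return the pool indices of the n samples with the highest entropy."""
    scores = [predictive_entropy(p) for p in pool_probs]
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return order[:n]


class ToyActiveSet:
    """Hypothetical stand-in for a labelled/unlabelled split (not baal's class)."""

    def __init__(self, size: int) -> None:
        self.size = size
        self.labels: Dict[int, str] = {}  # index in the full dataset -> label

    def pool_indices(self) -> List[int]:
        # Full-dataset indices that are still unlabelled, in order.
        return [i for i in range(self.size) if i not in self.labels]

    def pool_to_oracle_index(self, pool_idx: Sequence[int]) -> List[int]:
        # Positions in the pool shift as items get labelled, so map them back.
        pool = self.pool_indices()
        return [pool[i] for i in pool_idx]

    def label(self, pool_idx: Sequence[int], labels: Sequence[str]) -> None:
        for idx, lab in zip(self.pool_to_oracle_index(pool_idx), labels):
            self.labels[idx] = lab
```

The point of the mapping step is that position 0 in the pool is not item 0 of the full dataset once some items have been labelled, so the human oracle has to be shown the mapped index.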
At this point, in my current PR #245, taking inputs from the human oracle is pretty basic. I am not sure about the direction of baal, but something along the following lines might help generate a lot of interest. Of course this is just a suggestion and would need significant effort. Basically, a minimal UI in which we can see where we are going after each active learning step. The initial tab/page of the UI could start very basic, with a few data sources such as S3, Azure SQL or the file system from which data is loaded, plus other inputs like the model, problem type and heuristic. The next tab/page would be where the user gets into labelling via active learning. I read your discussion about Label Studio. I have very limited experience with it, but based on a quick look it does not seem to offer all the metrics while you are labelling. If you do plan on doing something along these lines, do count me in @Dref360
Hello!
The code sample you've shown looks great!
For your questions:
5000 / query_size
retraining. I have seen cases where we stop when the model stops improving. Unfortunately, we don't have an implementation for that. Do you think this would be valuable? If so, we should open an issue.
ActiveLearningDataset manages the split between labelled and unlabelled.
baal_model.unpatch() to get the original model.
I hope I answered your questions!
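Since the reply mentions stopping "when the model stops improving" and notes that baal has no implementation for it, here is a minimal sketch of what such a criterion could look like. It is not a baal feature: `PatienceStopper` and `simulate` are hypothetical names, the validation accuracies are made-up inputs, and capping the loop at `budget // query_size` steps is my assumption about what the "5000 / query_size" figure refers to.

```python
from typing import Sequence


class PatienceStopper:
    """Signal a stop once the metric has not improved for `patience` steps."""

    def __init__(self, patience: int = 3, min_delta: float = 0.0) -> None:
        self.patience = patience
        self.min_delta = min_delta
        self.best = float("-inf")
        self.stale_steps = 0

    def should_stop(self, metric: float) -> bool:
        if metric > self.best + self.min_delta:
            # New best: remember it and reset the counter.
            self.best = metric
            self.stale_steps = 0
        else:
            self.stale_steps += 1
        return self.stale_steps >= self.patience


def simulate(accuracies: Sequence[float], max_steps: int, stopper: PatienceStopper) -> int:
    """Replay one validation accuracy per active learning step; return steps run."""
    steps = 0
    for acc in accuracies[:max_steps]:
        steps += 1  # one step = query, label, retrain, evaluate
        if stopper.should_stop(acc):
            break
    return steps


# Assumed labelling budget of 5000 items with 100 items queried per step.
max_steps = 5000 // 100
history = [0.60, 0.70, 0.71, 0.71, 0.70, 0.71, 0.80]
steps_run = simulate(history, max_steps, PatienceStopper(patience=3))
```

In a real loop, the accuracy would come from evaluating on a held-out set after each retraining, and `should_stop` would be checked before asking the oracle for the next batch.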