Baal in Production | Classification | NLP | Hugging Face #242

nitish1295 · 2022-11-21T09:46:55Z

nitish1295
Nov 21, 2022

In my current use case I am trying to implement something similar to https://baal.readthedocs.io/en/latest/notebooks/baal_prod_cls/ for NLP with Hugging Face.

I currently have unlabelled data. The idea is to introduce a human oracle in a production setting who labels data in batches, so that we can save and use the updated model we get after active learning.

I am basing this implementation on https://github.com/baal-org/baal/blob/master/experiments/nlp_bert_mcdropout.py and the production notebook linked above.

My understanding from the above-mentioned tutorials is as follows:

1. https://github.com/baal-org/baal/blob/master/experiments/nlp_bert_mcdropout.py#L51-L55 will need to change to active_set.can_label = True, and instead of label_randomly I'll have to provide indexes and labels from the human oracle for both the initial train and validation sets.
2. Now that active_set.can_label = True, https://github.com/baal-org/baal/blob/master/experiments/nlp_bert_mcdropout.py#L130 will not work. Here I will have to compute predictions using MC Dropout and then select the most uncertain samples using whichever heuristic I specify. Something like the following:

  
  # Run MC Dropout inference over the unlabelled pool
  predictions = baal_model.predict_on_dataset(active_set.pool, iterations=15)

  # Pick the top n most uncertain pool items
  top_uncertainty = heuristic(predictions)[:n]

  # Map pool-relative indices to indices in the full dataset
  oracle_indices = active_set._pool_to_oracle_index(top_uncertainty)
  labels = [get_label(idx) for idx in oracle_indices]  # Get labels from the human oracle
  active_set.label(top_uncertainty, labels)  # Label them
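To make the index bookkeeping in the snippet above concrete, here is a toy, library-free sketch of the pool-to-full-dataset mapping. `ToyActivePool` and its methods are hypothetical names made up for illustration; this is not baal's implementation, only a model of the behaviour I am assuming (pool indices are positions among the still-unlabelled items).

```python
from typing import Dict, List


class ToyActivePool:
    """Toy stand-in for an active learning dataset (NOT baal's implementation)."""

    def __init__(self, n_items: int) -> None:
        # labelled[i] is True once item i of the full dataset has a label.
        self.labelled: List[bool] = [False] * n_items
        self.labels: Dict[int, int] = {}

    def pool_to_oracle_index(self, pool_indices: List[int]) -> List[int]:
        # Positions of all still-unlabelled items in the full dataset.
        unlabelled = [i for i, done in enumerate(self.labelled) if not done]
        return [unlabelled[p] for p in pool_indices]

    def label(self, pool_indices: List[int], values: List[int]) -> None:
        # Resolve every pool index BEFORE mutating, because the pool
        # shifts as soon as items leave it.
        oracle_indices = self.pool_to_oracle_index(pool_indices)
        for oracle_idx, value in zip(oracle_indices, values):
            self.labelled[oracle_idx] = True
            self.labels[oracle_idx] = value

    @property
    def n_labelled(self) -> int:
        return sum(self.labelled)

    @property
    def n_unlabelled(self) -> int:
        return len(self.labelled) - self.n_labelled


toy = ToyActivePool(6)
toy.label([0, 2], [1, 0])                  # pool positions 0 and 2 are dataset items 0 and 2
first_round = sorted(toy.labels)           # dataset indices that now carry a label
mapped = toy.pool_to_oracle_index([0, 1])  # pool is now items [1, 3, 4, 5]
```

After the first call, pool position 0 no longer refers to dataset item 0, which is why the labels have to be fetched with the full-dataset indices while the labelling call itself takes pool positions.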

Just wanted to confirm/check if I am going in the right direction?

A few other questions:

1. Is there any implementation of a stopping criterion in baal to stop active learning after a certain number of iterations, based on some heuristic? I can't seem to find any.
2. Just confirming: _pool_to_oracle_index returns indexes that you can use to pick out points from the main dataset (i.e. train set + pool set), correct?
3. active_set.label(idx, labels) will ensure that once I label points they are no longer considered part of the pool and are considered part of train, correct? I am just checking that I do not have to manually remove them from the pool and add them to train.
4. What are your views on saving the model after n active learning loops have been run? Suppose I have run 10 loops of active learning and now want to use the model that was trained on this data. Do I just call trainer.save_model("path_to_save")? Wouldn't that save the model with dropout enabled? Maybe I can disable dropout later when I load the model from the checkpoint, but is there a better way to do this?

Answered by Dref360

Nov 21, 2022

Dref360 · 2022-11-21T15:50:57Z

Dref360
Nov 21, 2022
Maintainer

Hello!

The code sample you've shown looks great!

For your questions:

1. Usually we have a budget. Let's say we can label 5000 items: we then stop after 5000 / query_size retraining rounds. I have also seen cases where we stop when the model stops improving. Unfortunately, we don't have an implementation for that. Do you think this would be valuable? If so, we should open an issue.
2. Yes, the "oracle" index is based on the full dataset.
3. Yup, you don't have to do anything; ActiveLearningDataset manages the split between labelled and unlabelled.
4. Yes, this would save it with Dropout always activated. You can save the result of baal_model.unpatch() to get the original model.
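As a rough illustration of the budget rule in the first answer, the sketch below computes how many query/retrain rounds a labelling budget allows. The function names and the query size of 100 are assumptions for the example, not part of baal; the active learning step itself is left out.

```python
import math
from typing import List


def rounds_for_budget(label_budget: int, query_size: int) -> int:
    """Number of query/retrain rounds needed to spend the whole labelling budget."""
    if label_budget < 0 or query_size <= 0:
        raise ValueError("label_budget must be >= 0 and query_size must be > 0")
    # Round up so that a final, smaller batch still uses the remaining budget.
    return math.ceil(label_budget / query_size)


def run_with_budget(label_budget: int, query_size: int) -> List[int]:
    """Return the batch size used in each round (the AL step itself is omitted)."""
    batches: List[int] = []
    remaining = label_budget
    while remaining > 0:
        batch = min(query_size, remaining)
        # Here one would rank the pool, ask the oracle for `batch` labels, and retrain.
        batches.append(batch)
        remaining -= batch
    return batches


n_rounds = rounds_for_budget(5000, 100)  # budget from the answer, assumed query size
batches = run_with_budget(250, 100)      # a budget that does not divide evenly
```

When the budget is not a multiple of the query size, the last round simply labels fewer items.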

I hope I answered your questions!

2 replies

nitish1295 Nov 21, 2022
Author

Thanks for your input! I might create a complete notebook for this, which you could include in your docs. I understand if you feel it would be redundant.

Regarding the stopping criterion, I am more inclined towards the "model stops improving" category. In our current active learning problems we have been trying to figure out a good stopping criterion, since in some cases adding new data points eventually doesn't do much, and in some cases it is detrimental. (Disclaimer: so far we have only seen this with pure uncertainty sampling that does not care about diversity in the samples; I can't say anything about BatchBALD or BALD yet.)

Also, most active learning libraries do not offer any stopping criterion, but in a production setting we often encounter the "when to stop" problem.
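To make the "model stops improving" idea concrete, here is a minimal patience-based sketch. It is one simple possibility, not something baal provides and not the method from any particular paper; `PatienceStopper`, `patience`, `min_delta`, and the accuracy values are all made up for illustration.

```python
from typing import List, Optional


class PatienceStopper:
    """Stop when the validation metric has not improved for `patience` rounds."""

    def __init__(self, patience: int = 3, min_delta: float = 0.0) -> None:
        self.patience = patience
        self.min_delta = min_delta
        self.best: Optional[float] = None
        self.stale_rounds = 0

    def should_stop(self, metric: float) -> bool:
        # Higher is assumed to be better (e.g. accuracy or F1).
        if self.best is None or metric > self.best + self.min_delta:
            self.best = metric
            self.stale_rounds = 0
        else:
            self.stale_rounds += 1
        return self.stale_rounds >= self.patience


def rounds_until_stop(metrics: List[float], stopper: PatienceStopper) -> int:
    """Return how many rounds run before the stopper fires (all of them if it never does)."""
    for round_idx, metric in enumerate(metrics, start=1):
        if stopper.should_stop(metric):
            return round_idx
    return len(metrics)


# Made-up validation accuracies, one per active learning round.
history = [0.70, 0.78, 0.81, 0.81, 0.80, 0.81, 0.82]
stopped_at = rounds_until_stop(history, PatienceStopper(patience=3, min_delta=0.005))
```

In a real loop, `should_stop` would be called once per round with the metric on a fixed validation set; the drawback is that a late improvement (like the last value above) is never seen once the stopper has fired.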

A paper I was looking at for reference: https://arxiv.org/pdf/2104.01836.pdf

Dref360 Nov 21, 2022
Maintainer

Oh yes, a notebook would be fantastic and much appreciated!

Right this is definitely a feature we should offer. I opened #243 to track it.

nitish1295 · 2022-12-06T11:13:46Z

nitish1295
Dec 6, 2022
Author

At this point, in my current PR #245, taking input from the human oracle is pretty basic.

I am not sure about the direction of baal, but something along the following lines might generate a lot of interest. Of course this is just a suggestion and would need significant effort. Basically, a minimal UI in which we can see where we stand after each active learning step. The initial tab/page could start very basic: a few data sources (such as S3, Azure SQL or the file system) from which data is loaded, plus other inputs such as the model, problem type and heuristic. The next tab/page would be where the user gets into labeling via active learning.

I read your discussion about Label Studio. I have very limited experience with it, but from a quick look it does not seem to offer all the metrics etc. while you are labeling.

If you do plan on doing something along these lines, do count me in @Dref360.

3 replies

Dref360 Dec 8, 2022
Maintainer

cc @GeorgePearse as he mentioned something similar not too long ago.

I think the idea of a "dashboard" makes a ton of sense and is definitely something we want to do.
I previously worked on a rough dashboard in React: baal-dashboard.

Would be great to jam on how we can build this in a reliable way.
From what I understand we need:

Labelling interface
Active learning loop
Storing/retrieving logs.

Do you think we could combine Label Studio + Baal + MLFlow to do this?

nitish1295 Dec 9, 2022
Author

I will look more into Label Studio. It does look like they offer a lot of useful features, but I still did not find two key things: any examples/templates for displaying statistics/metrics while labeling, and how (or whether) we can save the model at a certain iteration of the AL loop.
I was leaning more towards something we build ourselves (such as baal-dashboard): slightly more custom, which gives us more freedom to move in different directions.
I was thinking Flutter + Baal + MLFlow. Flutter seems to give you a lot of freedom in terms of platforms as well as in the different approaches you might want to take to build this.

Let me know what you think about this. I understand if we do not have the time/resources to work on this, but a custom approach seems more useful to me.

Also, let me know if we should move this to a separate discussion before we dive deeper into this.

FYI, we can learn Flutter as we go along.

Dref360 Dec 11, 2022
Maintainer

Yes let's move it to its own discussion. We can also create a channel on Slack!

Opened #247
