Need a CFG file that works for spaCy v3 with web_trf to train a custom NER #10064
-
What is the correct way to create a .cfg file for spaCy v3 with the web_trf model to train a custom NER? I am using the .cfg shown below but getting 0 accuracy after several epochs:

```
=========================== Initializing pipeline ===========================
============================= Training pipeline =============================
/usr/local/lib/python3.7/dist-packages/torch/autocast_mode.py:141: UserWarning: User provided device_type of 'cuda', but CUDA is not available. Disabling
```

The relevant sections of the .cfg (values omitted in the paste):

```ini
[nlp]
[components]
[components.ner]
[components.ner.model]
[components.ner.model.tok2vec]
[components.tok2vec]
[components.tok2vec.model]
[components.tok2vec.model.embed]
[components.tok2vec.model.encode]
```
-
You're using a tok2vec and a transformer in the same pipeline, which is not necessary and is probably causing weird things to happen. You should have either a transformer or a tok2vec in a pipeline, not both. You can source the transformer from the pretrained pipeline (see the docs on sourcing components for how to do that), though I would recommend just training from scratch using a GPU config from the quickstart. Also, it looks like you either do not have a GPU or do not have it configured correctly; note that training transformers on CPU is possible but extremely slow and not recommended.
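A rough sketch of the shape such a config can take, assuming the transformer is sourced from en_core_web_trf and only a new NER component is added on top (architecture names and hyperparameters follow the quickstart GPU template and may need adjusting for your setup):

```ini
[nlp]
lang = "en"
pipeline = ["transformer","ner"]

[components]

# Reuse the pretrained transformer instead of defining a new one
[components.transformer]
source = "en_core_web_trf"

[components.ner]
factory = "ner"

[components.ner.model]
@architectures = "spacy.TransitionBasedParser.v2"
state_type = "ner"
extra_state_tokens = false
hidden_width = 64
maxout_pieces = 2
use_upper = false
nO = null

# The NER listens to the shared transformer; no separate tok2vec component
[components.ner.model.tok2vec]
@architectures = "spacy-transformers.TransformerListener.v1"
grad_factor = 1.0
upstream = "*"

[components.ner.model.tok2vec.pooling]
@layers = "reduce_mean.v1"
```

For a known-good starting point, `python -m spacy init config config.cfg --lang en --pipeline ner --gpu` generates the full transformer config from the quickstart. As a quick GPU sanity check, `python -c "import spacy; print(spacy.prefer_gpu())"` should print True; the torch warning in the log above says CUDA is not available, which means training is falling back to CPU.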
-
Update: even after installing spacy-lookups-data, the number of tokens recognized as the new labels is still zero. Interestingly, the new labels appear in the eval output, but no tokens are recognized as such. The input samples do have a sufficient number of custom entities.
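One way to check that the annotations actually reach training is `python -m spacy debug data config.cfg`, which reports per-label entity counts. Alternatively, a minimal sketch that counts gold entity labels directly in the binary corpus (the "train.spacy" path is a placeholder for your own file):

```python
from collections import Counter

import spacy
from spacy.tokens import DocBin

nlp = spacy.blank("en")  # only the vocab is needed for inspection

# Load the serialized training corpus ("train.spacy" is a placeholder path)
doc_bin = DocBin().from_disk("train.spacy")

# Count how many gold entity spans each label actually has
label_counts = Counter(
    ent.label_
    for doc in doc_bin.get_docs(nlp.vocab)
    for ent in doc.ents
)
print(label_counts)
```

If the new labels show zero counts here, the problem is in the data conversion rather than in the config.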
-
Your part of speech and dependency data look like they're just invalid and you won't be able to train a useful model. I see that in your real config you are trying to train these components - do you actually want to train these, or do you just want to use the pretrained models?
>>> NO, I do not want to train POS/DEP. Where in the CFG does it indicate that I am doing this? How do I ensure that I use the pretrained models?
To clear things up a little I have two questions:
1. Can you give an example of a sentence with your entities in it?
>>> Will do. I have to anonymize it first.
2. What is the ultimate goal here? You want to have en_core_web_trf + NER for your custom entities, right?
>>> YES.
Are you actually going to use the tagger, parser, lemmatizer etc.?
>>> NO. (Those components are frozen in the CFG; see the sketch below.) I only want to use trf + NER for two new entities.
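For reference, a hedged sketch of what that freezing could look like, assuming the extra components are sourced from en_core_web_trf (component names are taken from that pipeline's standard layout; only the tagger and parser sections are shown in full):

```ini
[nlp]
lang = "en"
pipeline = ["transformer","tagger","parser","attribute_ruler","lemmatizer","ner"]

[components.tagger]
source = "en_core_web_trf"

[components.parser]
source = "en_core_web_trf"

[components.attribute_ruler]
source = "en_core_web_trf"

[components.lemmatizer]
source = "en_core_web_trf"

[training]
# Frozen components keep their pretrained weights; only transformer + ner update
frozen_components = ["tagger","parser","attribute_ruler","lemmatizer"]
```

One caveat: if the shared transformer is updated while training the new NER, the frozen components that listen to it can degrade. Sourcing them with `replace_listeners = ["model.tok2vec"]` (so each keeps its own copy of the embedding layer), or simply training a separate transformer + NER pipeline, are the usual ways around that.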
________________________________
From: polm
Sent: Sunday, February 6, 2022 1:41 AM
Subject: Re: [explosion/spaCy] Need a CFG file that works for spacy v3 with webtrf to train a custom ner (Discussion #10064)
Thanks, that's very helpful.
Couple of things that jump out at me from that output:
It looks like your training and dev data have a lot of overlap. That's fine for getting started but given your data volume they should be completely separate to test your model properly.
✘ 461 invalid whitespace entity spans
This is not a huge number given the size of your training data, but it is unusually large, and these kinds of errors are not normal. You have included spaces before or after entities in the entity annotations; those annotations will be basically unusable. You should look at why that is happening.
For the "Low number of examples" warnings, the model is probably going to forget all those labels because you have basically no data for them.
Your part of speech and dependency data look like they're just invalid and you won't be able to train a useful model. I see that in your real config you are trying to train these components - do you actually want to train these, or do you just want to use the pretrained models?
To clear things up a little I have two questions:
1. Can you give an example of a sentence with your entities in it?
2. What is the ultimate goal here? You want to have en_core_web_trf + NER for your custom entities, right? Are you actually going to use the tagger, parser, lemmatizer etc.?
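Regarding the invalid whitespace entity spans flagged in the message above: a minimal sketch of how one might detect and trim them in raw character-offset annotations before converting to .spacy format (the (text, start, end) representation here is an assumption; adapt it to however your annotations are stored):

```python
def strip_entity_whitespace(text, start, end):
    """Shrink a character span so it excludes leading/trailing whitespace."""
    while start < end and text[start].isspace():
        start += 1
    while end > start and text[end - 1].isspace():
        end -= 1
    return start, end

# Example: an annotation that accidentally includes a trailing space
text = "Acme Corp announced earnings."
start, end = 0, 10            # covers "Acme Corp " -- note the trailing space
print(repr(text[start:end]))  # 'Acme Corp '
start, end = strip_entity_whitespace(text, start, end)
print(repr(text[start:end]))  # 'Acme Corp'
```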