Extra forms parsing #29

Open: wants to merge 8 commits into base: main

Conversation

gpolydatas

Forms 3 and 4 are now being parsed. The changes are pushed to the form parser, and the template has been added to the JSON.

@eloukas
Collaborator

eloukas commented Dec 24, 2024

@gpolydatas

A couple of things to address first:

  1. Right now, both extract_items.py and item_lists.py contain the item lists for the new forms. Please keep them only in item_lists.py.
  2. extract_items.py does not show any other changes in the PR right now.
  3. Also, at a later stage, for your contribution to be merged, we will need some screenshots of the code working, a description of the changes, and some unit tests for the code. Let's focus on the first two (2) for now so that we are not overwhelmed.

You can see the file differences (old vs. new version) yourself at https://github.com/nlpaueb/edgar-crawler/pull/29/files
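
To illustrate point 1, here is a minimal sketch of keeping the item lists in a single place and looking them up from there. The name `ITEM_LISTS`, the helper `get_item_list`, and the placeholder values are assumptions for illustration only, not the actual contents of item_lists.py.

```python
# Hypothetical stand-in for item_lists.py: the dictionary name, keys, and
# placeholder values are illustrative, not the repository's actual contents.
ITEM_LISTS = {
    "10-K": ["1", "1A", "1B", "2", "3"],  # shortened for illustration
    "3": ["placeholder_section"],
    "4": ["placeholder_section"],
}


def get_item_list(filing_type: str) -> list[str]:
    """Return the item list for a filing type from the single shared mapping."""
    try:
        return ITEM_LISTS[filing_type]
    except KeyError:
        raise ValueError(
            f"No item list defined for filing type {filing_type!r}"
        ) from None


# In extract_items.py the mapping would be imported rather than redefined, e.g.
#     from item_lists import ITEM_LISTS   (assumed name)
print(get_item_list("4"))
```

With this layout, adding a new form means touching only the one mapping, and the diff for extract_items.py shows just the parsing logic.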

@gpolydatas
Author

I'm not really sure what's happening here.
The code is substantially longer than the 800 lines that appear in the main branch of your repo:
image

The code definitely has additions, since those 2 functions did not exist before:
image
image

The item lists also do not exist anywhere else, apart from being imported at the top:
image
This is where I see those changes/additions:
https://github.com/gpolydatas/edgar-crawler/blob/extra_forms_parsing/extract_items.py

@eloukas
Collaborator

eloukas commented Dec 28, 2024

@gpolydatas I can now see some changes, nice!

  1. Could you please give us some screenshots and markdown showing the expected outcome of the code when it is run?
  2. From a quick skim of the code, it looks like you do the extraction based on the XML/XBRL tags of the filings. If that's correct, which filing years have you tested on? I am asking because, if I remember correctly, XBRL filings appeared after 2010, and inline XBRL filings started appearing (mainly) after 2018. If so, sadly, this XML-based extraction will not work for earlier filings. And, as you might see, parsing for the other filing types (10-K, 8-K, 10-Q) is based on regular expressions over strings, which works both for modern filings (even if they have inline XBRL) and for older filings (pure .txt files).
  3. Based on the previous point, could you also describe at a high level what changes you made in the code? This helps open-source maintainers get a better sense of what you did without having to read the code of each PR in detail.
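
To make the contrast in point 2 concrete, here is a rough sketch of string-based extraction that does not depend on XML/XBRL tags. It is not the project's actual implementation: the pattern `ITEM_HEADING` and the helpers `strip_tags` and `split_items` are hypothetical names, and the sample texts are made up.

```python
import re

# Illustrative heading pattern (an assumption, not the project's regex):
# matches lines such as "ITEM 1." or "Item 1A:" at the start of a line.
ITEM_HEADING = re.compile(
    r"^\s*item\s+(\d{1,2}[a-c]?)\s*[.:\-]",
    re.IGNORECASE | re.MULTILINE,
)


def strip_tags(raw: str) -> str:
    """Crudely replace markup tags with spaces so HTML and plain text look alike."""
    return re.sub(r"<[^>]+>", " ", raw)


def split_items(text: str) -> dict[str, str]:
    """Split text into sections keyed by item number, based on headings only."""
    matches = list(ITEM_HEADING.finditer(text))
    sections = {}
    for i, match in enumerate(matches):
        # A section runs from the end of its heading to the start of the next one.
        end = matches[i + 1].start() if i + 1 < len(matches) else len(text)
        # If a heading repeats (e.g. in a table of contents), the last one wins.
        sections[match.group(1).upper()] = text[match.end():end].strip()
    return sections


# Made-up samples: an older plain-text filing and a newer HTML-style one.
old_txt = "ITEM 1. Business\nWe make widgets.\nITEM 1A. Risk Factors\nRisks exist.\n"
new_html = (
    "<div><span>Item 1.</span> Business</div>\n"
    "<p>We make widgets.</p>\n"
    "<div>Item 1A. Risk Factors</div>"
)

plain_sections = split_items(old_txt)
html_sections = split_items(strip_tags(new_html))
print(sorted(plain_sections), sorted(html_sections))
```

Because the headings are matched as plain strings after the markup is stripped, the same function handles both samples, which is the property an XML-tag-based parser loses on filings that predate XBRL.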
