Duplicates in NPO dataset #3

Open
cmainov opened this issue Aug 2, 2022 · 10 comments
Comments

@cmainov
Collaborator

cmainov commented Aug 2, 2022

Hey Jesse,

I am in the process of creating a rodeo dataset that will be used for the spatial grids (detailing distances between NPOs and board members). I've come across an interesting finding: roughly 7,000 EINs are duplicated in the NPO IRS 1023-EZ dataset. Importantly, some of those duplicates have different geocoded locations (see attached image). I am trying to understand this discrepancy and wanted to reach out for your insights, since you have worked with these data extensively. Could these reflect NPOs changing addresses over the years?
Screen Shot 2022-08-02 at 2 17 47 PM
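
A minimal sketch of the check described above, written in Python with pandas purely for illustration (the thread does not show the code that was actually used). The column names `EIN`, `lat`, and `lon` and the function name are assumptions, not the dataset's real names. It lists every EIN that appears more than once and flags whether its rows carry more than one distinct coordinate pair.

```python
import pandas as pd

def flag_duplicate_eins(df, id_col="EIN", coord_cols=("lat", "lon")):
    """Summarize IDs that occur more than once and whether their coordinates differ.

    Column names are hypothetical placeholders; adjust them to the real dataset.
    """
    # Keep every row whose ID occurs at least twice.
    dup = df[df.duplicated(subset=id_col, keep=False)]
    # Number of rows per duplicated ID.
    n_rows = dup.groupby(id_col).size().rename("n_rows")
    # Number of distinct coordinate pairs per duplicated ID.
    n_locations = (
        dup.drop_duplicates([id_col, *coord_cols])
        .groupby(id_col)
        .size()
        .rename("n_locations")
    )
    summary = n_rows.to_frame().join(n_locations).reset_index()
    summary["coords_differ"] = summary["n_locations"] > 1
    return summary
```

Reading EINs as strings rather than integers is the safer choice here, since EINs can begin with a zero.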

@cmainov
Collaborator Author

cmainov commented Aug 2, 2022

For example, see rows 5 and 6 of the output in the image (I forgot to mention that above).

@cmainov
Collaborator Author

cmainov commented Aug 2, 2022

Screen Shot 2022-08-02 at 2 24 20 PM

Here is a more detailed look at the data showing the differences across years that I mentioned.

@cmainov
Collaborator Author

cmainov commented Aug 2, 2022

Screen Shot 2022-08-02 at 2 41 16 PM

In this output, we have some duplicated EINs with different coordinates that were filed in the same year. Could these reflect organizations that have more than one central location? I would love to get your take on this.

@lecy
Member

lecy commented Aug 2, 2022

We need to check back to the raw data - do we actually have the same organization filing twice?

It could be one of two things. Perhaps the geocoding code identified an ambiguous address, so it returned a couple of results? Or it was a non-match at one step, so it got passed to another step (match by PO box or by ZIP code only) and somehow was added back to the sample twice.

More likely, if the EIN appears twice in the raw data with two different addresses, it was a resubmission of the application. Usually it is to submit clarifying information, or perhaps they lost their status for failure to file their annual 990 and had to reapply (though that would typically be a couple of years apart, not the same year).

My guess is that they submitted an update to their application with a new address.

@lecy
Member

lecy commented Aug 2, 2022

EIN 11111111 seems dubious, though! I would check that one first.

@lecy
Member

lecy commented Aug 2, 2022

For these cases they are likely applying for reinstatement of nonprofit status after failing to file the 990 in a timely manner and having their status revoked (there is a lag between the first and second ruledate):

#3 (comment)
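
One way to look at that lag, sketched in Python with pandas under stated assumptions: the column `ruledate` is named in this thread, but its storage format is not, so the sketch assumes it can be parsed by `pd.to_datetime`. The `EIN` column name and the function name are likewise illustrative. For each EIN with more than one filing, it reports the number of days between the earliest and latest ruledate.

```python
import pandas as pd

def ruledate_lag(df, id_col="EIN", date_col="ruledate"):
    """Days between the first and last ruledate for IDs that filed more than once.

    Assumes date_col holds values that pd.to_datetime can parse.
    """
    parsed = df.assign(**{date_col: pd.to_datetime(df[date_col])})
    out = parsed.groupby(id_col)[date_col].agg(
        first_ruledate="min", last_ruledate="max", n_filings="count"
    )
    out["lag_days"] = (out["last_ruledate"] - out["first_ruledate"]).dt.days
    # Only IDs with repeated filings are of interest here.
    return out[out["n_filings"] > 1].reset_index()
```

A lag of several years would be consistent with the reinstatement explanation, while a lag of days or weeks would point toward a resubmission of the same application.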

@cmainov
Collaborator Author

cmainov commented Aug 3, 2022

Thank you for following up

@cmainov
Collaborator Author

cmainov commented Aug 3, 2022

I agree about EIN 1111111. I will take a look. I think this will need to be addressed: if we are counting the same organization twice, it will skew the results. Just taking note of it for now.

@cmainov
Collaborator Author

cmainov commented Aug 3, 2022

Screen Shot 2022-08-03 at 9 49 32 AM

Here are the three instances of EIN == 1111111. They are all different organizations in different states.

@cmainov
Collaborator Author

cmainov commented Aug 3, 2022

Screen Shot 2022-08-03 at 10 02 17 AM

SOLVED: Hey Jesse, I found another identifying column in the dataset, Case.Number. (I'm guessing every organization that files is given one at the time of filing, but maybe you can fill me in on this.) Only 110 duplicates remain when I use this column as the row identifier. Upon checking them, the coordinates are the same for the duplicate rows, which tells me this is the scenario you posited in your response: namely, organizations that likely had to submit more information at a later time.

Just to have it documented: I believe we should use Case.Number as the row identifier as we progress into other areas of analysis and data management.
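
For the record, a rough sketch of that check in Python with pandas. `Case.Number` is the column named above; the coordinate column names `lat` and `lon` and the function name are assumptions for illustration. It keeps one row per Case.Number and separately returns any case numbers whose duplicate rows disagree on coordinates, so those can be inspected before anything is dropped.

```python
import pandas as pd

def dedupe_by_case_number(df, case_col="Case.Number", coord_cols=("lat", "lon")):
    """Return (deduplicated frame, list of case numbers with conflicting coordinates).

    Coordinate column names are hypothetical placeholders.
    """
    # Rows whose case number occurs more than once.
    dup = df[df.duplicated(subset=case_col, keep=False)]
    # Count distinct coordinate pairs within each duplicated case number.
    n_locations = (
        dup.drop_duplicates([case_col, *coord_cols]).groupby(case_col).size()
    )
    conflicting = n_locations[n_locations > 1].index.tolist()
    # Keep the first row for each case number.
    deduped = df.drop_duplicates(subset=case_col, keep="first")
    return deduped, conflicting
```

An empty `conflicting` list would match the observation above that the duplicate rows share the same coordinates.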

2 participants