
Facing Issue while constructing the KG. #3

Open
bphariharan opened this issue Dec 24, 2024 · 9 comments

Comments

@bphariharan commented Dec 24, 2024

Hello Team,

We've gone through the repository source code and we really like it. We tried executing your construct_kg.ipynb, but we encountered an issue.

ValueError: You are trying to merge on float64 and object columns for key 'x_id'. If you wish to proceed you should use pd.concat

At this line of construct_kg.ipynb:
edge_df = new_df.merge(new_node_df.loc[new_node_df['node_source']=='PHECODE'], left_on='x_id', right_on='node_id', how='left')

We're also trying to load new_homo_hg_hms.pt from the Harvard Dataverse provided here, which gave us the error UnpicklingError: invalid load key, '?'. It would be great if you could suggest a possible solution and/or workaround so that we can continue to use it.

I tried loading the .pt file using:

torch.load(<file_path>)

and:

with open("C:/Users/admin/Downloads/new_homo_hg_hms.pt", 'rb') as file:
    data = pickle.Unpickler(file, fix_imports=True, encoding='ASCII', errors='strict', buffers=None).load()

I'm also attaching the necessary screenshots for your reference. Do let me know if you need any other information.

  1. construct_kg.ipynb error:
     [screenshot]

  2. Unpickling error:
     [screenshot]

Thanks & Regards,

@bphariharan (Author)

Hey @ruthjohnson95,
Hope you're doing well.
I would like to know if there's any way you can help us with the issue above.

Thanks & Regards,
Hariharan

@ruthjohnson95 (Collaborator)

Hi @bphariharan, thank you for bringing this to our attention. It's at the top of our list of items to address once the holiday break is over. I appreciate your patience!

@ruthjohnson95 (Collaborator)

Accidentally closed this; reopening.

@ruthjohnson95 (Collaborator)

For the unpickling error, the file is a DGL graph object, so it needs to be read using dgl.load_graphs(...). An example is below:

import dgl

# load_graphs returns (graph_list, label_dict); [0][0] takes the first graph.
homo_hg = dgl.load_graphs("/n/home01/ruthjohnson/kg_paper/construct_kg/phekg/new_homo_hg_hms.pt")
homo_hg = homo_hg[0][0]

Also see https://docs.dgl.ai/en/1.1.x/generated/dgl.graph.html
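
As a quick check that the file deserialized correctly, a minimal sketch (assuming homo_hg was loaded as in the snippet above):

# Print a summary of the loaded graph (node/edge counts and feature schemes).
print(homo_hg)
print(homo_hg.num_nodes(), homo_hg.num_edges())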

@ruthjohnson95 (Collaborator)

For the first error, it looks like new_df automatically casts the x_id column to float. We can just cast it to string type, and that should address the error.

new_df['x_id'] = new_df['x_id'].astype(str)
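
For context, a minimal sketch of how the cast fits with the merge line from the notebook (frame and column names are taken from the snippets above; the surrounding notebook setup is assumed):

# Cast the merge key to string so both sides of the join have matching dtypes.
new_df['x_id'] = new_df['x_id'].astype(str)

# The notebook's merge should then run without the float64/object mismatch.
edge_df = new_df.merge(
    new_node_df.loc[new_node_df['node_source'] == 'PHECODE'],
    left_on='x_id',
    right_on='node_id',
    how='left',
)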

Let us know if this resolves the errors. We are glad you are enjoying the software.

@bphariharan (Author) commented Jan 8, 2025

> For the first error, it looks like new_df automatically casts the x_id column to float. We can just cast it to string type, and that should address the error.
>
> new_df['x_id'] = new_df['x_id'].astype(str)
>
> Let us know if this resolves the errors. We are glad you are enjoying the software.

Hi @ruthjohnson95,
Thanks for the reply.

The unpickling error has been resolved by loading the graph with the dgl.load_graphs() method. But upon trying the fix for the first error (the type mismatch), I came across another issue.

[screenshots]

Apart from this, I have a few more queries:

  1. What tool/software did you use to visualize the graph? I've been trying to visualize it with tools such as networkx, Gephi, Plotly and Matplotlib, but because the graph is huge (close to 67k nodes and 1.3 million edges) they break after a few hours of execution. I feel this is necessary for me to validate the neighboring nodes I retrieve by providing the index of a node.
  2. While looking into the execution of construct_kg.ipynb, I also found that there's a mismatch between the CSV file versions used in the code and those given in the repository. I understand why the terminologies are provided with the _filtered suffix, but for the other CSVs, especially phecode_definitions1.1.csv, I'm not quite sure why the older version is provided in the repository.
  3. In the new_node_map_df.tab provided here, what is the meaning of the node_id column?
  4. Lastly, would it be possible for you to tell us the embedding technique used, as it will help us perform queries on the graph?

Thanks & Regards,
Hariharan B P

@bphariharan (Author)

Hey @ruthjohnson95,

Please let me know if you have any updates for us.

@ruthjohnson95 (Collaborator)

Hi @bphariharan,

Can you provide a screenshot of the dataframe after you do new_df['x_id'] = new_df['x_id'].astype(str)?

  1. Because the graph is so large, I haven't found software that can visualize the full network. To inspect a node's neighbors, you can retrieve its edges in DGL; see the sketch after this list.

  2. The 1.1 version is used to maintain continuity, since we originally constructed the KG using this version last year. Also see issue #1 (PheMAP files not exist).

  3. node_id refers to the clinical code (often numeric) that corresponds to each clinical vocabulary term. For example, the node id for the ICD9 code "Diabetes without complications" is 250.0

  4. We've provided all methodological details in the preprint here: https://www.medrxiv.org/content/10.1101/2024.12.03.24318322v2. Please feel free to reach out with any specific questions you might have.
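
For item 1, a minimal sketch of that neighbor lookup (assuming homo_hg is the graph loaded earlier with dgl.load_graphs, and node_idx is a hypothetical node index you want to inspect):

# homo_hg is assumed to be the DGLGraph loaded via dgl.load_graphs(...)[0][0].
node_idx = 0  # hypothetical node index to inspect

# Neighbors reachable via outgoing / incoming edges of node_idx.
out_neighbors = homo_hg.successors(node_idx)
in_neighbors = homo_hg.predecessors(node_idx)

# The corresponding edges as (source, destination) node ID tensors.
src, dst = homo_hg.out_edges(node_idx)

print(out_neighbors)
print(in_neighbors)

The returned tensors hold graph node indices, which you can then map back to entries in the node map file if needed.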

@bphariharan (Author)

Hi @ruthjohnson95,

Thank you for the reply.

Sure, here are the screenshots of edge_df, new_df and new_node_df:

  1. edge_df
     [screenshot]

  2. new_df
     [screenshot]

  3. new_node_df
     [screenshot]

> node_id refers to the clinical code (often numeric) that corresponds to each clinical vocabulary term. For example, the node id for the ICD9 code "Diabetes without complications" is 250.0

While going through the CSV provided, I found some exceptional cases where the CPT codes are more than 5 digits. Assuming these are also in the list of nodes, is there any reason for these exceptions?

Thanks & Regards,
Hariharan B P
