Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

convert_graph_formats() breaks for bipartite networkx graphs, where the nodes are neither strings nor ints #241

Open
shashank025 opened this issue Jun 1, 2024 · 4 comments

Comments

@shashank025
Copy link

shashank025 commented Jun 1, 2024

Describe the bug

When the input graph for the method cdlib.utils.convert_graph_formats() is a networkx graph where the node type is neither a string not an int (but is hashable as networkx expects), if the graph is bipartite, this method fails when converting to igraph format with the following error when attempting to populate the gi.vs["type"] value:

Traceback (most recent call last):
  File "/Users/shashankr/projects/cdlib/test.py", line 29, in <module>
    convert_graph_formats(G, ig.Graph)
  File "/Users/shashankr/projects/cdlib/cdlib/utils.py", line 197, in convert_graph_formats
    return __from_nx_to_igraph(graph, directed)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/shashankr/projects/cdlib/cdlib/utils.py", line 141, in __from_nx_to_igraph
    if type(name) == int else a_r[int(name.replace("\\", ""))]
                                  ^^^^^^^^^^^^^^^^^^^^^^^^^^^
ValueError: invalid literal for int() with base 10: '<__main__.Node object at 0x100e5b560>'

To Reproduce
Steps to reproduce the behavior:

  • CDlib version: 0.4.0
  • Operating System: MacOs Sonoma 14.5 (23F79)
  • Python version: 3.12
  • Version(s) of CDlib required libraries: Not sure

Here's a script (test.py) to repro the issue:

import networkx as nx
import igraph as ig

from cdlib import algorithms
from cdlib.utils import convert_graph_formats

class Node:
  def __init__(self, id, type):
    self.id = id
    self.type = type

  def __hash__(self):
    return hash(self.id)

  def __eq__(self, other):
    return self.id == other.id

john = Node('John Travolta', 'actor')
nick = Node('Nick Cage', 'actor')
face_off = Node('Face Off', 'movie')

G = nx.Graph()
G.add_node(john)
G.add_node(nick)
G.add_edge(john, face_off, label='ACTED_IN')
G.add_edge(nick, face_off, label='ACTED_IN')

convert_graph_formats(G, ig.Graph)

Simply run with:

python test.py

Expected behavior

I'd expect that call to convert_graph_formats() to work.

Screenshots
NA

Additional context

I also have a fix that I'm workin on! I'll produce a PR shortly.

@shashank025
Copy link
Author

shashank025 commented Jun 1, 2024

The root cause is that the logic inside the bipartite case branch was treating nodes that are just hashable (but neither strings nor ints) as strings. The fix is to properly handle such nodes by maintaining a mapping from strings to corresponding nodes:

diff --git a/cdlib/utils.py b/cdlib/utils.py
index 17a6e03..9ec917b 100644
--- a/cdlib/utils.py
+++ b/cdlib/utils.py
@@ -132,8 +132,9 @@ def __from_nx_to_igraph(g: object, directed: bool = None) -> object:
             gi.add_edges([("\\" + str(u), "\\" + str(v)) for (u, v) in g.edges()])

     if bipartite.is_bipartite(g) and not skip_bipartite:
+        convert = {str(x):x for x in g.nodes()}
         gi.vs["type"] = [
-            a_r[name] if type(name) == int else a_r[int(name.replace("\\", ""))]
+            a_r[name] if type(name) == int else a_r[convert[name.replace("\\", "")]]
             for name in gi.vs["name"]
         ]

@GiulioRossetti
Copy link
Owner

Hi,
I'll look into it - but I cannot guarantee you that support for generic node object descriptors will be integrated.

The main reason for the actual nodetype constraints is that cdlib accommodates several graph representations (networkx, graph_tool, igraph, ad-hoc ones...).

Allowing generic objects as nodes makes it complicated to handle some cross-representation transformations (and could affect downstream visualization/evaluation tasks requiring printable node labels - we cannot assume that all objects will came with well defined and unique string reprs).

Moreover, using complex objects as nodes usually generate a relevant memory footprint: if such information is not needed during computation (rarely the case in CD) it is indeed better to keep it separated from the graph topology (and maybe use it for post-process analysis).

@shashank025
Copy link
Author

shashank025 commented Jun 2, 2024

Thank you!

I understand that cdlib needs to work with diverse graph representations, etc, and I agree that working with complex objects can have adverse memory impact, but real world applications using cdlib will often already deal with complex node objects. As a library, cdlib will be more "ergonomic" if it can consume such objects as is.

The good thing is this should be relatively easy for cdlib to do, by taking advantage of the fact that node objects are hashable, by simply invoking id(node) on the object, which results in integer node values. Applications can later "recover" the original node object via:

recover = { id(node): node for node in graph.nodes() }

I already see similar code in cdlib:

convert = { str(node): node for node in graph.nodes() }

which makes me suspect the approach I'm suggesting should be easy to apply.

Let me know what you think! I'm actually willing to help with this as well.

@YogevHend
Copy link

YogevHend commented Dec 24, 2024

There is a similar issue when using other methods such as walktrap, when returning the NodeClustering object, we create a Clustering object which attempts to cast back whatever has been the node back from string into integer.
in here

Whenever non-integer nodes are used this causes ValueError: invalid literal for int() with base 10
to be thrown.

It might be a good idea to not allow use of non-integer nodes if you do not support them since it requires a bit of looking into before you realize that this is the root cause.

A suggestion for a solution might be to implement a dictionary that maps hashes of nodes to themselves and use the hashes instead of the actual nodes.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

3 participants