completed configurable backends

maniospas committed Jun 10, 2024
1 parent d329967 commit 7b2cc72
Showing 21 changed files with 168 additions and 92 deletions.
34 changes: 12 additions & 22 deletions docs/userguide/convergence.md → docs/advanced/ranking.md
@@ -1,18 +1,15 @@
# Demo
# Node Ranking

As a quick start, let us construct a graph
and a set of nodes. The graph's class can be
imported either from the `networkx` library or from
`pygrank` itself. The two are in large part interoperable
and both can be parsed by our algorithms.
But our implementation is tailored to graph signal
processing needs and thus tends to be faster and consume
only a fraction of the memory.
Here we will see how an appropriate convergence manager
can be used to speed up a node ranking process, where
nodes obtain ordinal values 1,2,3,... based on their
importance in the graph structure (1 is the most
important node). For starters, let us construct some data to test with:

```python
from pygrank import Graph
import pygrank as pg

graph = Graph()
graph = pg.Graph()
graph.add_edge("A", "B")
graph.add_edge("B", "C")
graph.add_edge("C", "D")
@@ -24,16 +21,8 @@ seeds = {"A", "B"}
```

We now run a personalized PageRank
to score the structural relatedness of graph nodes to the ones of the given set.
First, let us import the library:

```python
import pygrank as pg
```

For instructional purposes,
we experiment with (personalized) *PageRank*
and make it output the node order of ranks.
to score the structural relatedness of graph nodes to the ones of the given set
and apply a postprocessor that ranks nodes based on their score:

```python
ranker = pg.PageRank(alpha=0.85, tol=1.E-6, normalization="auto") >> pg.Ordinals()
@@ -61,6 +50,7 @@ print(ordinals["B"], ordinals["D"], ordinals["E"])
# 3.0 5.0 4.0
```

Close to the previous results at a fraction of the time! For large graphs,
This is close to the previous results at a fraction of the time!
For large graphs,
most ordinals would be near the ideal ones. Note that convergence time
does not take into account the time needed to preprocess graphs.
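
As a rough sketch of this speedup, a convergence manager can simply cap the
number of iterations; the `ConvergenceManager` arguments below mirror those in
this commit's `examples/playground/run_backend.py`, and the resulting ordinals
may differ slightly from fully converged ones.

```python
import pygrank as pg

# Stop after a fixed number of iterations instead of waiting for
# numerical convergence; node ordinals typically stabilize much
# earlier than exact scores do.
ranker = pg.PageRank(
    alpha=0.85,
    convergence=pg.ConvergenceManager(max_iters=38, error_type="iters"),
) >> pg.Ordinals()
```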
4 changes: 2 additions & 2 deletions docs/userguide/quickstart.md
@@ -1,7 +1,7 @@
# Quickstart

## 1. Install and import
Install the library using `pip install pygrank` and import it. Construct a node ranking algorithm from a graph filter by incrementally applying postprocessors using >>. There are many components and parameters available. You can use [autotuning](autotuning.md) to find good configurations.
Install the library using `pip install pygrank` and import it. Construct a node ranking algorithm from a graph filter by incrementally applying postprocessors using `>>`. There are many components and parameters available. Use [autotuning](autotuning.md) to find good configurations.

```python
import pygrank as pg
@@ -28,4 +28,4 @@ Evaluate the scores using a stochastic generalization of the unsupervised conduc
measure = pg.Conductance() # an evaluation measure
pg.benchmark_print_line("My conductance", measure(scores)) # pretty
print("Cite this algorithm as:", hk5_advanced.cite())
```~~
```
51 changes: 39 additions & 12 deletions docs/userguide/setup.md
@@ -7,6 +7,17 @@ and install or upgrade to the latest version of `pygrank` with:
pip install --upgrade pygrank
```

## Creating graphs

When working on practical problems,
use `networkx` to construct graphs
by adding edges between Python objects.
`pygrank` also provides its own `pygrank.Graph` class
that implements a subset of `networkx.Graph` operations
with several optimizations; it tends to be faster when
constructing large graphs and consumes
only a fraction of the memory.
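
As a minimal sketch of this interoperability, both classes expose the same
edge-construction calls, and either graph can be passed to `pygrank` algorithms:

```python
import networkx as nx
import pygrank as pg

# networkx graph built from arbitrary hashable Python objects
nx_graph = nx.Graph()
nx_graph.add_edge("A", "B")

# pygrank's lighter-weight equivalent, preferable for large graphs
pg_graph = pg.Graph()
pg_graph.add_edge("A", "B")
```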

## Backends

Several popular computational backends are supported.
@@ -41,12 +52,13 @@ The same message points to a configuration file stored under *home/.pygrank*.
In addition to automatically downloaded content, there is a JSON configuration
file specifying the default backend to be set upon first import and the option
to silence the reminder message. The configuration looks like this and can either be
edited directly or programmatically set with `pg.set_backend_preference(name, reminder=True)`):
edited directly or programmatically set with `pg.set_backend_preference(mod_name, remind_where_to_find=True, **init)`:

```json
{
    "backend": "numpy",
    "reminder": "true"
    "reminder": "true",
    "init": {}
}
```
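
As a sketch grounded in this commit's `pygrank/core/backend/__init__.py`, the
preference can also be written programmatically; extra keyword arguments are
stored under `"init"` and forwarded to the backend when it is first loaded:

```python
import pygrank as pg

# Persist "pytorch" as the default backend, silence the reminder message,
# and store device="cuda" under "init" in the ~/.pygrank configuration.
pg.set_backend_preference("pytorch", remind_where_to_find=False, device="cuda")
```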

@@ -64,28 +76,26 @@ necessarily be the fastest option for dense or very sparse graphs.

### <span class="component">tensorflow</span>
<b class="parameters">About</b><br>Performs computations within the `tensorflow` execution environment.
The latter is an open-source platform for machine learning developed by the Google Brain team.
It allows for efficient computation across multiple CPUs and GPUs, making it suitable for
performant large-scale data processing and deep learning applications.
The latter is an open-source platform for machine learning developed by the Google Brain team.
There
are two modes in which this backend can be executed: `"dense"` (default) and `"sparse"`.
The mode may be provided as additional arguments to the
`pg.set_backend("tensorflow", mode=...)` call.
`pg.set_backend("tensorflow", mode="dense" device="auto")` call.
In dense mode, the tensorflow backend attempts to store graphs in dense square
matrices that take full advantage of tensorflow's parallelization.
If there is not enough memory to allocate a dense adjacency matrix,
the backend generates a sparse version and creates a warning.
<br>
The backend's initialization also accepts a device string or object to
which computations should be internally transferred. This needs to
be one among tensorflow's available devices.
<br>
<b class="parameters">Installation</b><br> `pip install tensorflow[and-cuda]`<br>On Windows install WSL2 (Windows Subsystem for Linux) first.<br>
<b class="parameters">Links</b><br> [tensorflow](https://www.tensorflow.org/install)


### <span class="component">pytorch</span>
<b class="parameters">About</b><br>Performs computations within the `pytorch` execution environment.
The latter is an open-source platform for machine learning developed by Meta's AI Research lab.
It is known for its flexibility, ease of use, and dynamic computation graph, which makes it popular
in research and production.
The latter is an open-source platform for machine learning developed by Meta's AI Research lab.
Similarly to `"tensorflow"`,
there are two modes in which this backend can be executed: `"dense"` (default) and `"sparse"`.
The mode may be provided as additional arguments to the
@@ -94,9 +104,26 @@ In dense mode, the pytorch backend attempts to store graphs in dense square
matrices that take full advantage of pytorch's parallelization.
If there is not enough memory to allocate a dense adjacency matrix,
the backend generates a sparse version and creates a warning.
The backend's initialization also accepts a device string or object to
which computations should be internally transferred. This needs to
be one among pytorch's available devices (typically `"cuda"` or `"cpu"`).
<br>
<b class="parameters">Installation</b><br> For full installation instructions visit pytorch's website in the links below.<br>
<b class="parameters">Links</b><br> [pytorch](https://pytorch.org/get-started/locally)

### <span class="component">torch_sparse</span>
<b class="parameters">About</b><br>Performs computations within the `pytorch` execution environment,
but contrary to the `"pytorch"` backend uses the sparse computations of the `torch_sparse` library.
The latter is an open-source extension library of optimized sparse matrix operations for pytorch.
Unlike `"tensorflow"` and `"pytorch"`, this backend does not expose a mode option.
The backend's initialization only accepts a device string or object to
which computations should be internally transferred. This needs to
be one among pytorch's available devices (typically `"cuda"` or `"cpu"`).
!!! info
    `"torch_sparse"` is much more computationally efficient than `"pytorch"`
    for computations with sparse data structures.

<b class="parameters">Installation</b><br> For full installation instructions visit pytorch's website in the links below.<br>
<b class="parameters">Links</b><br> [pytorch](https://pytorch.org/get-started/locally) <br>
[torch_sparse](https://github.com/rusty1s/pytorch_sparse)
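
A usage sketch with the scoped backend switch that this commit's
`examples/playground/run_backend.py` also uses; the graph and seed set here are
placeholders:

```python
import pygrank as pg

# Temporarily run all computations on torch_sparse, restoring the
# previous backend when the block exits.
with pg.Backend("torch_sparse", device="cuda"):
    graph = pg.Graph()
    graph.add_edge("A", "B")
    scores = pg.PageRank()(graph, {"A"})
```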
11 changes: 8 additions & 3 deletions examples/run_backend.py → examples/playground/run_backend.py
@@ -1,17 +1,22 @@
import pygrank as pg
import torch

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")


with pg.Backend("torch_sparse", device=device):
    _, graph, community = next(pg.load_datasets_one_community(["amazon"]))
    ppr = pg.PageRank(alpha=0.9, normalization="symmetric", assume_immutability=True,
                      convergence=pg.ConvergenceManager(max_iters=38, error_type="iters"))
    ppr = pg.PageRank(
        alpha=0.9,
        normalization="symmetric",
        assume_immutability=True,
        convergence=pg.ConvergenceManager(max_iters=38, error_type="iters"),
    )
    ppr.preprocessor(graph)
    signal = pg.to_signal(graph, {node: 1.0 for node in community})
    torch.cuda.synchronize()  # correct timing
    scores = ppr(signal)
    print(ppr.convergence)
    print(scores["B00005MHUG"])  # 0.00508212111890316
    print(scores["B00006RGI2"])  # 0.70645672082901
    print(scores["0006497993"])  # 0.19633759558200836
    print(scores["0006497993"])  # 0.19633759558200836
2 changes: 2 additions & 0 deletions mkdocs.yml
@@ -13,8 +13,10 @@ nav:
    - 'userguide/autotuning.md'
    - 'userguide/preprocessing.md'
    - Applications:
        - 'advanced/ranking.md'
        - 'advanced/community.md'
        - 'advanced/gnn.md'
        - 'advanced/fairness.md'
    - R&D:
        - 'tips/citations.md'
        - 'tips/big.md'
2 changes: 1 addition & 1 deletion pygrank/algorithms/convergence.py
@@ -141,7 +141,7 @@ def start(self, restart_timer: bool = True):

def has_converged(self, new_ranks: BackendPrimitive) -> bool:
    # TODO: convert to any backend
    new_ranks = np.array(new_ranks).squeeze()
    new_ranks = backend.to_numpy(new_ranks).squeeze()
    self.accumulated_ranks = (
        self.accumulated_ranks * self.iteration + new_ranks
    ) / (self.iteration + 1)
Expand Down
12 changes: 6 additions & 6 deletions pygrank/core/backend/__init__.py
@@ -110,8 +110,8 @@ def converted(*args, **kwargs):
                return converted

            setattr(thismod, api, converter(mod.__dict__[api]))
        else:  # pragma: no cover
            raise Exception("Missing implementation for " + str(api))
        #else: # pragma: no cover
        # raise Exception("Missing implementation for " + str(api))
    return mod.backend_init(*args, **kwargs)


@@ -157,9 +157,9 @@ def get_backend_preference(): # pragma: no cover
return {"mod_name": mod_name, **init_parameters}


def set_backend_preference(mod_name: str ,
remind_where_to_find: bool = True,
**kwargs): # pragma: no cover
def set_backend_preference(
mod_name: str, remind_where_to_find: bool = True, **kwargs
): # pragma: no cover
default_dir = os.path.join(os.path.expanduser("~"), ".pygrank")
if not os.path.exists(default_dir):
os.makedirs(default_dir)
@@ -169,7 +169,7 @@ def set_backend_preference(mod_name: str ,
        {
            "backend": mod_name.lower(),
            "reminder": str(remind_where_to_find).lower(),
            "init": {str(k): str(v) for k, v in kwargs.items()}
            "init": {str(k): str(v) for k, v in kwargs.items()},
        },
        config_file,
    )
5 changes: 3 additions & 2 deletions pygrank/core/backend/ddask.py
@@ -18,7 +18,7 @@ def backend_init(*args, splits: int = 8, client=None, **kwargs):
    if client is None:
        client = dsk.distributed.Client(*args, **kwargs)
        __pygrank_dask_config["client"] = client
    else:
    elif client is not None:
        __pygrank_dask_config["client"] = client
    return __pygrank_dask_config["client"]

@@ -117,7 +117,8 @@ def multiply_and_collect(signal, split):

    # Use Dask to parallelize the multiplication
    futures = [
        __pygrank_dask_config["client"].submit(multiply_and_collect, signal, split) for split in M_splits
        __pygrank_dask_config["client"].submit(multiply_and_collect, signal, split)
        for split in M_splits
    ]
    results = __pygrank_dask_config["client"].gather(futures)

2 changes: 1 addition & 1 deletion pygrank/core/backend/numpy.py
@@ -46,7 +46,7 @@ def to_array(obj, copy_array=False):
    if obj.__class__.__module__ == "tensorflow.python.framework.ops":
        return obj.numpy()
    if obj.__class__.__module__ == "torch":
        return obj.detach().numpy()
        return obj.detach().cpu().numpy()
    return np.array(obj)


41 changes: 29 additions & 12 deletions pygrank/core/backend/pytorch.py
@@ -34,10 +34,15 @@ def diag(x, offset=0):
def backend_init(mode="dense", device=None):
    __pygrank_torch_config["mode"] = mode
    if device is not None and device == "auto":
        if not isinstance(__pygrank_torch_config["device"], str) or __pygrank_torch_config["device"] != "auto":
        if (
            not isinstance(__pygrank_torch_config["device"], str)
            or __pygrank_torch_config["device"] != "auto"
        ):
            return
        device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
        warnings.warn(f"[pygrank.backend.pytorch] Automatically detected device to run on {device}: {torch.cuda.get_device_name(device)}")
        warnings.warn(
            f"[pygrank.backend.pytorch] Automatically detected device to run on {device}: {torch.cuda.get_device_name(device)}"
        )
    if device is not None and isinstance(device, str):
        device = torch.device(device)
    __pygrank_torch_config["device"] = device
@@ -94,14 +99,22 @@ def scipy_sparse_to_backend(M):
        return torch.FloatTensor(M.todense()).to(__pygrank_torch_config["device"])
    except MemoryError:
        warnings.warn(
            f"[pygrank.backend.pytorch] Not enough memory to convert a scipy sparse matrix with shape {M.shape} to a numpy dense matrix before moving it to your device.\nWill create a torch.sparse_coo_tensor instead.\nAdd the option mode=\"sparse\" to the backend's initialization to hide this message,\nbut prefer switching to the torch_sparse backend for a performant implementation.")
            f"[pygrank.backend.pytorch] Not enough memory to convert a scipy sparse matrix with shape {M.shape} "
            f"to a numpy dense matrix before moving it to your device.\nWill create a torch.sparse_coo_tensor instead."
            f'\nAdd the option mode="sparse" to the backend\'s initialization to hide this message,'
            f"\nbut prefer switching to the torch_sparse backend for a performant implementation."
        )

    coo = M.tocoo()
    return torch.sparse_coo_tensor(
        torch.LongTensor(np.vstack((coo.col, coo.row))),
        torch.FloatTensor(coo.data),
        coo.shape,
    ).coalesce().to(__pygrank_torch_config["device"])
    return (
        torch.sparse_coo_tensor(
            torch.LongTensor(np.vstack((coo.col, coo.row))),
            torch.FloatTensor(coo.data),
            coo.shape,
        )
        .coalesce()
        .to(__pygrank_torch_config["device"])
    )


def to_array(obj, copy_array=False):
@@ -111,12 +124,16 @@ def to_array(obj, copy_array=False):
            return torch.clone(obj).to(__pygrank_torch_config["device"])
        return obj.to(__pygrank_torch_config["device"])
    return torch.ravel(obj).to(__pygrank_torch_config["device"])
    return torch.ravel(torch.FloatTensor(np.array([v for v in obj], dtype=np.float32))).to(__pygrank_torch_config["device"])
    return torch.ravel(
        torch.FloatTensor(np.array([v for v in obj], dtype=np.float32))
    ).to(__pygrank_torch_config["device"])


def to_primitive(obj):
    if isinstance(obj, float):
        return torch.tensor(obj, dtype=torch.float32).to(__pygrank_torch_config["device"])
        return torch.tensor(obj, dtype=torch.float32).to(
            __pygrank_torch_config["device"]
        )
    return torch.FloatTensor(obj).to(__pygrank_torch_config["device"])


@@ -132,9 +149,9 @@ def self_normalize(obj):


def conv(signal, M):
    #if M.is_sparse:
    # if M.is_sparse:
    return torch.mv(M, signal)
    #return M@signal.reshape((-1,1))
    # return M@signal.reshape((-1,1))


def length(x):
10 changes: 10 additions & 0 deletions pygrank/core/backend/specification.py
@@ -146,3 +146,13 @@ def epsilon() -> float: # pragma: no cover

def shape0(M: BackendPrimitive) -> int: # pragma: no cover
    pass


def to_numpy(obj):
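    """Convert tensorflow or torch tensors, or any array-like object, to a numpy array."""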
    import numpy as np

    if obj.__class__.__module__ == "tensorflow.python.framework.ops":
        return obj.numpy()
    if obj.__class__.__module__ == "torch":
        return obj.detach().cpu().numpy()
    return np.array(obj)