PecanPy package

Command line interface

Command line utility for PecanPy.

This is the command line interface for the pecanpy package.

Examples

Run PecanPy in command line using PreComp mode to embed the karate network:

$ pecanpy --input demo/karate.edg --ouptut demo/karate.emb --mode PreComp

Checkout the full list of parameters by:

$ pecanpy --help
pecanpy.cli.parse_args()[source]

Parse node2vec arguments.

pecanpy.cli.check_mode(g, args)[source]

Check mode selection.

Give recommendation to user for pecanpy mode based on graph size and density.

pecanpy.cli.main()[source]

Pipeline for representational learning for all nodes in a graph.

Graph Data Structures

Lite graph objects used by pecanpy.

class pecanpy.graph.BaseGraph[source]

Bases: object

Base Graph object.

Handles node id and provides general properties including num_nodes, and density. The num_edges property is to be specified by the derived graph objects.

property nodes: List[str]

Return the list of node IDs.

property num_nodes: int

Return the number of nodes in the graph.

property num_edges: int

Return the number of edges in the graph.

property density: float

Return the edge density of the graph.

set_node_ids(node_ids: Sequence[str] | None, implicit_ids: bool = False, num_nodes: int | None = None)[source]

Update ID list and mapping.

Set _node_ids given the input node IDs and also set the corresponding _node_idmap based on it, which maps from node ID to the index.

Parameters:
  • node_ids (list of str, optional) – List of node IDs to use. If not available, will implicitly set node IDs to the canonical ordering of nodes with a warning message, which is suppressed if implicit_ids is set to True.

  • implicit_ids (bool) – Implicitly set the node IDs to the canonical node ordering. If set to False and node IDs are not available, it will also set implicit node IDs, but with a warning message. The warning message can be suppressed if implicit_ids is set to True as a confirmation of the behavior.

  • num_nodes (int, optional) – Number of nodes, used when try to set implicit node IDs.

get_has_nbrs()[source]

Abstract method to be specified by derived classes.

get_move_forward()[source]

Abstract method to be specified by derived classes.

class pecanpy.graph.AdjlstGraph[source]

Bases: BaseGraph

Adjacency list Graph object used for reading/writing edge list files.

Sparse Graph object that stores graph as adjacency list.

Note

AdjlstGraph is only used for reading/writing edge list files and do not support random walk computations since Numba njit do not work with Python data structures like list and dict.

Examples

Read .edg file and create SparseGraph object using .read_edg method.

>>> from pecanpy.graph import AdjlstGraph
>>>
>>> # initialize SparseGraph object
>>> g = AdjlstGraph()
>>>
>>> # read graph from edgelist
>>> g.read(path_to_edg_file, weighted=True, directed=False)
>>>
>>> indptr, indices, data = g.to_csr()  # convert to csr
>>>
>>> dense_mat = g.to_dense()  # convert to dense adjacency matrix
>>>
>>> g.save(edg_outpath)  # save the graph to an edge list file
property edges_iter: Iterator[Tuple[int, int, float]]

Return an iterator that iterates over all edges.

property edges: List[Tuple[int, int, float]]

Return a list of triples (head, tail, weight) representing edges.

property num_edges

Return the number of edges in the graph.

get_node_idx(node_id: str) int[source]

Get index of the node and create new node when necessary.

add_node(node_id: str)[source]

Create a new node.

Add a new node to the graph if not already existing, by updating the ID list, ID map, and the adjacency list data. Otherwise pass through without further actions.

Note

Does not raise error even if the node alrealy exists.

add_edge(id1: str, id2: str, weight: float = 1.0, directed: bool = False)[source]

Add an edge to the graph.

Note

Non-positive edges are ignored.

Parameters:
  • id1 (str) – first node id.

  • id2 (str) – second node id.

  • weight (float) – the edge weight, default is 1.0

  • directed (bool) – whether the edge is directed or not.

read(path: str, weighted: bool, directed: bool, delimiter: str = '\t')[source]

Read an edgelist file and create sparse graph.

Note

Implicitly discard zero weighted edges; if the same edge is defined multiple times with different edge weights, then the last specified weight will be used (warning for such behavior will be printed).

Parameters:
  • path (str) – path to edgelist file, where the file is tab separated and contains 2 or 3 columns depending on whether the input graph is weighted, where the the first column contains the source nodes and the second column contains the destination nodes that interact with the corresponding source nodes.

  • weighted (bool) – whether the graph is weighted. If unweighted, only two columns are expected in the edgelist file, and the edge weights are implicitly set to 1 for all interactions. If weighted, a third column encoding the weight of the interaction in numeric value is expected.

  • directed (bool) – whether the graph is directed, if undirected, the edge connecting from destination node to source node is created with same edge weight from source node to destination node.

  • delimiter (str) – delimiter of the edge list file, default is tab.

save(path: str, unweighted: bool = False, delimiter: str = '\t')[source]

Save AdjLst as an .edg edge list file.

Parameters:
  • unweighted (bool) – If set to True, only write two columns, corresponding to the head and tail nodes of the edges, and ignore the edge weights (default: False).

  • delimiter (str) – Delimiter for separating fields.

to_csr() float32]][source]

Construct compressed sparse row matrix.

to_dense() Any][source]

Construct dense adjacency matrix.

Note

This method does not return a DenseGraph object, but instead returns a dense adjacency matrix as NDArray, where the index is the same as that of nodes.

Returns:

Full adjacency matrix as 2d numpy array.

Return type:

NDArray

classmethod from_mat(adj_mat: ~nptyping.ndarray.NDArray[~nptyping.base_meta_classes.Shape[*, *], ~typing.Any], node_ids: ~typing.List[str], **kwargs)[source]

Construct graph using adjacency matrix and node IDs.

Parameters:
  • adj_mat (NDArray) – 2D numpy array of adjacency matrix

  • node_ids (list of str) – node ID list

Returns:

An adjacency graph object representing the adjacency matrix.

class pecanpy.graph.SparseGraph[source]

Bases: BaseGraph

Sparse Graph object that stores graph as adjacency list.

Examples

Read .edg file and create SparseGraph object using .read_edg method.

>>> from pecanpy.graph import SparseGraph
>>>
>>> # initialize SparseGraph object
>>> g = SparseGraph()
>>>
>>> # read graph from edgelist
>>> g.read_edg(path_to_edg_file, weighted=True, directed=False)
>>>
>>> # save the csr graph as npz file to be used later
>>> g.save(npz_outpath)
property num_edges: int

Return the number of edges in the graph.

read_edg(path: str, weighted: bool, directed: bool, delimiter: str = '\t')[source]

Create CSR sparse graph from edge list.

First create AdjlstGraph by reading the edge list file, and then convert to SparseGraph via to_csr.

Parameters:
  • path (str) – path to edgelist file.

  • weighted (bool) – whether the graph is weighted.

  • directed (bool) – whether the graph is directed.

  • delimiter (str) – delimiter used between node IDs.

read_npz(path: str, weighted: bool, implicit_ids: bool = False)[source]

Directly read a CSR sparse graph.

Note

To generate a CSR file compatible with PecanPy, first load the graph

as a sparse graph using the SparseGraph (with csr=True). Then save the sparse graph to a csr file using the save method from SparseGraph. The saved .npz file can then be loaded directly by SparseGraph later.

Parameters:
  • path (str) – path to the csr file, which is an npz file with four arrays with keys ‘IDs’, ‘data’, ‘indptr’, ‘indices’, which correspond to the node IDs, the edge weights, the offset array for each node, and the indices of the edges.

  • weighted (bool) – whether the graph is weighted, if unweighted, all edge weights will be converted to 1.

  • directed (bool) – not used, for compatibility with SparseGraph.

  • implicit_ids (bool) – Implicitly set the node IDs to the canonical node ordering from the CSR graph. If unset and the IDs field is not found in the input CSR graph, a warning message will be displayed on screen. The missing IDs field can happen, for example, when the user uses the CSR graph prepared by scipy.sparse.csr.

save(path: str)[source]

Save CSR as .csr.npz file.

classmethod from_adjlst_graph(adjlst_graph, **kwargs)[source]

Construct csr graph from adjacency list graph.

Parameters:

adjlst_graph (pecanpy.graph.AdjlstGraph) – Adjacency list graph to be converted.

classmethod from_mat(adj_mat: ~nptyping.ndarray.NDArray[~nptyping.base_meta_classes.Shape[*, *], ~typing.Any], node_ids: ~typing.List[str], **kwargs)[source]

Construct csr graph using adjacency matrix and node IDs.

Note

Only consider positive valued edges.

Parameters:
  • adj_mat (NDArray) – 2D numpy array of adjacency matrix

  • node_ids (list of str) – node ID list

class pecanpy.graph.DenseGraph[source]

Bases: BaseGraph

Dense Graph object that stores graph as array.

Examples

Read .npz files and create DenseGraph object using read_npz

>>> from pecanpy.graph import DenseGraph
>>>
>>> g = DenseGraph() # initialize DenseGraph object
>>>
>>> g.read_npz(paht_to_npz_file, weighted=True, directed=False)

Read .edg files and create DenseGraph object using read_edg

>>> from pecanpy.graph import DenseGraph
>>>
>>> # initialize DenseGraph object
>>> g = DenseGraph()
>>>
>>> # read graph from edgelist
>>> g.read_edg(path_to_edg_file, weighted=True, directed=False)
>>>
>>> # save the dense graph as npz file to be used later
>>> g.save(npz_outpath)
property num_edges: int

Return the number of edges in the graph.

property data: Any] | None

Return the adjacency matrix.

property nonzero: bool_] | None

Return the nonzero mask for the adjacency matrix.

read_npz(path: str, weighted: bool, implicit_ids: bool = False)[source]

Read .npz file and create dense graph.

Parameters:
  • path (str) – path to .npz file.

  • weighted (bool) – whether the graph is weighted, if unweighted, all none zero weights will be converted to 1.

  • implicit_ids (bool) – Implicitly set the node IDs to the canonical ordering from the dense adjacency matrix object. If unset and the IDs field is not found in the object, a warning message will be displayed on screen. This warning message can be suppressed if implicit_ids is set to True as a confirmation of the behavior.

read_edg(path: str, weighted: bool, directed: bool, delimiter: str = '\t')[source]

Read an edgelist file and construct dense graph.

save(path: str)[source]

Save dense graph as .dense.npz file.

classmethod from_adjlst_graph(adjlst_graph, **kwargs)[source]

Construct dense graph from adjacency list graph.

Parameters:

adjlst_graph (pecanpy.graph.AdjlstGraph) – Adjacency list graph to be converted.

classmethod from_mat(adj_mat: ~nptyping.ndarray.NDArray[~nptyping.base_meta_classes.Shape[*, *], ~typing.Any], node_ids: ~typing.List[str], **kwargs)[source]

Construct dense graph using adjacency matrix and node IDs.

Parameters:
  • adj_mat (NDArray) – 2D numpy array of adjacency matrix

  • node_ids (list of str) – node ID list

Node2vec implementations

Different strategies for generating node2vec walks.

class pecanpy.pecanpy.Base(p: float = 1, q: float = 1, workers: int = 1, verbose: bool = False, extend: bool = False, gamma: float = 0, random_state: int | None = None)[source]

Bases: BaseGraph

Base node2vec object.

This base object provides the skeleton for the node2vec walk algorithm, which consists of the simulate_walks method that generate node2vec random walks. In contrast to the original Python implementation of node2vec, it is parallelized where each process generates walks independently.

Parameters:
  • p (float) – return parameter, value less than 1 encourages returning back to previous vertex, and discourage for value grater than 1 (default: 1).

  • q (float) – in-out parameter, value less than 1 encourages walks to go “outward”, and value greater than 1 encourage walking within a localized neighborhood (default: 1)

  • workers (int) – number of threads to be spawned for running node2vec including walk generation and word2vec embedding (default: 1)

  • verbose (bool) – show progress bar for walk generation.

  • extend (bool) – use node2vec+ extension if set to True (default: False).

  • gamma (float) – Multiplication factor for the std term of edge weights added to the average edge weights as the noisy edge threshold, only used by node2vec+ (default: 0)

  • random_state (int, optional) – Random seed for generating random walks. Note that to fully ensure reproducibility, use single thread (i.e., workers=1), and potentially need to set the Python environment variable PYTHONHASHSEED to match the random_state (default: None).

Note

The preprocess_transition_probs is required for implementations that precomputes and stores 2nd order transition probabilities.

Examples

Generate node2vec embeddings

>>> from pecanpy import pecanpy as node2vec
>>>
>>> # initialize node2vec object, similarly for SparseOTF and DenseOTF
>>> g = node2vec.PreComp(p=0.5, q=1, workers=4, verbose=True)
>>> # alternatively, can specify ``extend=True`` for using node2vec+
>>>
>>> # load graph from edgelist file
>>> g.read_edg(path_to_edg_file, weighted=True, directed=False)
>>> # precompute and save 2nd order transition probs (for PreComp only)
>>> g.preprocess_transition_probs()
>>>
>>> # generate random walks, which could then be used to train w2v
>>> walks = g.simulate_walks(num_walks=10, walk_length=80)
>>>
>>> # alternatively, generate the embeddings directly using ``embed``
>>> emd = g.embed()
simulate_walks(num_walks: int, walk_length: int) List[List[str]][source]

Generate walks starting from each nodes num_walks time.

Note

This is the master process that spawns worker processes, where the worker function node2vec_walks genearte a single random walk starting from a vertex of the graph.

Parameters:
  • num_walks (int) – number of walks starting from each node.

  • walks_length (int) – length of walk.

setup_get_normalized_probs()[source]

Transition probability computation setup.

This function performs necessary preprocessing of computing the average edge weights array, which is used later by the transition probability computation function get_extended_normalized_probs, if node2vec+ is used. Otherwise, returns the normal transition function get_noramlized_probs with a trivial placeholder for average edge weights array noise_thresholds.

preprocess_transition_probs()[source]

Null default preprocess method.

embed(dim: int = 128, num_walks: int = 10, walk_length: int = 80, window_size: int = 10, epochs: int = 1, verbose: bool = False) float32][source]

Generate embeddings.

This is a shortcut function that combines simulate_walks with Word2Vec to generate the node2vec embedding.

Note

The resulting embeddings are aligned with the graph, i.e., the index of embeddings is the same as that for the graph.

Parameters:
  • dim (int) – dimension of the final embedding, default is 128

  • num_walks (int) – number of random walks generated using each node as the seed node, default is 10

  • walk_length (int) – length of the random walks, default is 80

  • window_size (int) – context window sized for training the Word2Vec model, default is 10

  • epochs (int) – number of epochs for training Word2Vec, default is 1

  • verbose (bool) – print time usage for random walk generation and skip-gram training if set to True

Returns:

The embedding matrix, each row is a node embedding

vector. The index is the same as that for the graph.

Return type:

Embeddings

class pecanpy.pecanpy.FirstOrderUnweighted(*args, **kwargs)[source]

Bases: Base, SparseRWGraph

Directly sample edges for first order random walks.

get_move_forward()[source]

Wrap move_forward.

class pecanpy.pecanpy.PreCompFirstOrder(*args, **kwargs)[source]

Bases: Base, SparseRWGraph

Precompute transition probabilities for first order random walks.

get_move_forward()[source]

Wrap move_forward.

preprocess_transition_probs()[source]

Precompute and store first order transition probabilities.

class pecanpy.pecanpy.PreComp(*args, **kwargs)[source]

Bases: Base, SparseRWGraph

Precompute transition probabilities.

This implementation precomputes and stores 2nd order transition probabilities first and uses read off transition probabilities during the process of random walk. The graph type used is SparseRWGraph.

Note

Need to call preprocess_transition_probs() first before generating walks.

get_move_forward()[source]

Wrap move_forward.

This function returns a numba.njit compiled function that takes current vertex index (and the previous vertex index if available) and returns the next vertex index by sampling from a discrete random distribution based on the transition probabilities that are read off the precomputed transition probabilities table.

Note

The returned function is used by the simulate_walks method.

preprocess_transition_probs()[source]

Precompute and store 2nd order transition probabilities.

Each node contains n ** 2 number of 2nd order transition probabilities, where n is the number of neighbors of that specific node, since one can pick any one of its neighbors as the previous node and / or the next node. For each second order transition probability of a node, set up the alias draw table to be used during random walk.

Note

Uses uint64 instead of uint32 for tracking alias_indptr to prevent overflowing since the 2nd order transition probs grows much faster than the first order transition probs, which is the same as the total number of edges in the graph.

class pecanpy.pecanpy.SparseOTF(*args, **kwargs)[source]

Bases: Base, SparseRWGraph

Sparse graph transition on the fly.

This implementation does NOT precompute transition probabilities in advance but instead calculates them on-the-fly during the process of random walk. The graph type used is SparseRWGraph.

get_move_forward()[source]

Wrap move_forward.

This function returns a numba.njit compiled function that takes current vertex index (and the previous vertex index if available) and returns the next vertex index by sampling from a discrete random distribution based on the transition probabilities that are calculated on-the-fly.

Note

The returned function is used by the simulate_walks method.

class pecanpy.pecanpy.DenseOTF(*args, **kwargs)[source]

Bases: Base, DenseRWGraph

Dense graph transition on the fly.

This implementation does NOT precompute transition probabilities in advance but instead calculates them on-the-fly during the process of random walk. The graph type used is DenseRWGraph.

get_move_forward()[source]

Wrap move_forward.

This function returns a numba.njit compiled function that takes current vertex index (and the previous vertex index if available) and returns the next vertex index by sampling from a discrete random distribution based on the transition probabilities that are calculated on-the-fly.

Note

The returned function is used by the simulate_walks method.

pecanpy.pecanpy.alias_setup(probs)[source]

Construct alias lookup table.

This code is modified from the blog post here: https://lips.cs.princeton.edu/the-alias-method-efficient-sampling-with-many-discrete-outcomes/ , where you can find more details about how the method works. In general, the alias method improves the time complexity of sampling from a discrete random distribution to O(1) if the alias table is setup in advance.

Parameters:

probs (list(float32)) – normalized transition probabilities array, could be in either list or NDArray, of float32 values.

pecanpy.pecanpy.alias_draw(j, q)[source]

Draw sample from a non-uniform discrete distribution using alias sampling.