PecanPy package
Command line interface
Command line utility for PecanPy.
This is the command line interface for the pecanpy
package.
Examples
Run PecanPy in command line using PreComp
mode to embed the karate network:
$ pecanpy --input demo/karate.edg --ouptut demo/karate.emb --mode PreComp
Checkout the full list of parameters by:
$ pecanpy --help
Graph Data Structures
Lite graph objects used by pecanpy.
- class pecanpy.graph.BaseGraph[source]
Bases:
object
Base Graph object.
Handles node id and provides general properties including num_nodes, and density. The num_edges property is to be specified by the derived graph objects.
- set_node_ids(node_ids: Sequence[str] | None, implicit_ids: bool = False, num_nodes: int | None = None)[source]
Update ID list and mapping.
Set _node_ids given the input node IDs and also set the corresponding _node_idmap based on it, which maps from node ID to the index.
- Parameters:
node_ids (
list
ofstr
, optional) – List of node IDs to use. If not available, will implicitly set node IDs to the canonical ordering of nodes with a warning message, which is suppressed if implicit_ids is set to True.implicit_ids (bool) – Implicitly set the node IDs to the canonical node ordering. If set to False and node IDs are not available, it will also set implicit node IDs, but with a warning message. The warning message can be suppressed if implicit_ids is set to True as a confirmation of the behavior.
num_nodes (int, optional) – Number of nodes, used when try to set implicit node IDs.
- class pecanpy.graph.AdjlstGraph[source]
Bases:
BaseGraph
Adjacency list Graph object used for reading/writing edge list files.
Sparse Graph object that stores graph as adjacency list.
Note
AdjlstGraph is only used for reading/writing edge list files and do not support random walk computations since Numba njit do not work with Python data structures like list and dict.
Examples
Read
.edg
file and createSparseGraph
object using.read_edg
method.>>> from pecanpy.graph import AdjlstGraph >>> >>> # initialize SparseGraph object >>> g = AdjlstGraph() >>> >>> # read graph from edgelist >>> g.read(path_to_edg_file, weighted=True, directed=False) >>> >>> indptr, indices, data = g.to_csr() # convert to csr >>> >>> dense_mat = g.to_dense() # convert to dense adjacency matrix >>> >>> g.save(edg_outpath) # save the graph to an edge list file
- property edges_iter: Iterator[Tuple[int, int, float]]
Return an iterator that iterates over all edges.
- property edges: List[Tuple[int, int, float]]
Return a list of triples (head, tail, weight) representing edges.
- property num_edges
Return the number of edges in the graph.
- add_node(node_id: str)[source]
Create a new node.
Add a new node to the graph if not already existing, by updating the ID list, ID map, and the adjacency list data. Otherwise pass through without further actions.
Note
Does not raise error even if the node alrealy exists.
- add_edge(id1: str, id2: str, weight: float = 1.0, directed: bool = False)[source]
Add an edge to the graph.
Note
Non-positive edges are ignored.
- read(path: str, weighted: bool, directed: bool, delimiter: str = '\t')[source]
Read an edgelist file and create sparse graph.
Note
Implicitly discard zero weighted edges; if the same edge is defined multiple times with different edge weights, then the last specified weight will be used (warning for such behavior will be printed).
- Parameters:
path (str) – path to edgelist file, where the file is tab separated and contains 2 or 3 columns depending on whether the input graph is weighted, where the the first column contains the source nodes and the second column contains the destination nodes that interact with the corresponding source nodes.
weighted (bool) – whether the graph is weighted. If unweighted, only two columns are expected in the edgelist file, and the edge weights are implicitly set to 1 for all interactions. If weighted, a third column encoding the weight of the interaction in numeric value is expected.
directed (bool) – whether the graph is directed, if undirected, the edge connecting from destination node to source node is created with same edge weight from source node to destination node.
delimiter (str) – delimiter of the edge list file, default is tab.
- save(path: str, unweighted: bool = False, delimiter: str = '\t')[source]
Save AdjLst as an
.edg
edge list file.
- to_dense() Any] [source]
Construct dense adjacency matrix.
Note
This method does not return a DenseGraph object, but instead returns a dense adjacency matrix as NDArray, where the index is the same as that of
nodes
.- Returns:
Full adjacency matrix as 2d numpy array.
- Return type:
NDArray
- classmethod from_mat(adj_mat: ~nptyping.ndarray.NDArray[~nptyping.base_meta_classes.Shape[*, *], ~typing.Any], node_ids: ~typing.List[str], **kwargs)[source]
Construct graph using adjacency matrix and node IDs.
- Parameters:
adj_mat (NDArray) – 2D numpy array of adjacency matrix
node_ids (
list
of str) – node ID list
- Returns:
An adjacency graph object representing the adjacency matrix.
- class pecanpy.graph.SparseGraph[source]
Bases:
BaseGraph
Sparse Graph object that stores graph as adjacency list.
Examples
Read
.edg
file and createSparseGraph
object using.read_edg
method.>>> from pecanpy.graph import SparseGraph >>> >>> # initialize SparseGraph object >>> g = SparseGraph() >>> >>> # read graph from edgelist >>> g.read_edg(path_to_edg_file, weighted=True, directed=False) >>> >>> # save the csr graph as npz file to be used later >>> g.save(npz_outpath)
- read_edg(path: str, weighted: bool, directed: bool, delimiter: str = '\t')[source]
Create CSR sparse graph from edge list.
First create
AdjlstGraph
by reading the edge list file, and then convert toSparseGraph
viato_csr
.
- read_npz(path: str, weighted: bool, implicit_ids: bool = False)[source]
Directly read a CSR sparse graph.
Note
- To generate a CSR file compatible with PecanPy, first load the graph
as a sparse graph using the SparseGraph (with
csr=True
). Then save the sparse graph to a csr file using thesave
method fromSparseGraph
. The saved.npz
file can then be loaded directly bySparseGraph
later.
- Parameters:
path (str) – path to the csr file, which is an npz file with four arrays with keys ‘IDs’, ‘data’, ‘indptr’, ‘indices’, which correspond to the node IDs, the edge weights, the offset array for each node, and the indices of the edges.
weighted (bool) – whether the graph is weighted, if unweighted, all edge weights will be converted to 1.
directed (bool) – not used, for compatibility with
SparseGraph
.implicit_ids (bool) – Implicitly set the node IDs to the canonical node ordering from the CSR graph. If unset and the IDs field is not found in the input CSR graph, a warning message will be displayed on screen. The missing IDs field can happen, for example, when the user uses the CSR graph prepared by scipy.sparse.csr.
- classmethod from_adjlst_graph(adjlst_graph, **kwargs)[source]
Construct csr graph from adjacency list graph.
- Parameters:
adjlst_graph (
pecanpy.graph.AdjlstGraph
) – Adjacency list graph to be converted.
- classmethod from_mat(adj_mat: ~nptyping.ndarray.NDArray[~nptyping.base_meta_classes.Shape[*, *], ~typing.Any], node_ids: ~typing.List[str], **kwargs)[source]
Construct csr graph using adjacency matrix and node IDs.
Note
Only consider positive valued edges.
- Parameters:
adj_mat (NDArray) – 2D numpy array of adjacency matrix
node_ids (
list
of str) – node ID list
- class pecanpy.graph.DenseGraph[source]
Bases:
BaseGraph
Dense Graph object that stores graph as array.
Examples
Read
.npz
files and createDenseGraph
object usingread_npz
>>> from pecanpy.graph import DenseGraph >>> >>> g = DenseGraph() # initialize DenseGraph object >>> >>> g.read_npz(paht_to_npz_file, weighted=True, directed=False)
Read
.edg
files and createDenseGraph
object usingread_edg
>>> from pecanpy.graph import DenseGraph >>> >>> # initialize DenseGraph object >>> g = DenseGraph() >>> >>> # read graph from edgelist >>> g.read_edg(path_to_edg_file, weighted=True, directed=False) >>> >>> # save the dense graph as npz file to be used later >>> g.save(npz_outpath)
- property data: Any] | None
Return the adjacency matrix.
- property nonzero: bool_] | None
Return the nonzero mask for the adjacency matrix.
- read_npz(path: str, weighted: bool, implicit_ids: bool = False)[source]
Read
.npz
file and create dense graph.- Parameters:
path (str) – path to
.npz
file.weighted (bool) – whether the graph is weighted, if unweighted, all none zero weights will be converted to 1.
implicit_ids (bool) – Implicitly set the node IDs to the canonical ordering from the dense adjacency matrix object. If unset and the IDs field is not found in the object, a warning message will be displayed on screen. This warning message can be suppressed if implicit_ids is set to True as a confirmation of the behavior.
- read_edg(path: str, weighted: bool, directed: bool, delimiter: str = '\t')[source]
Read an edgelist file and construct dense graph.
- classmethod from_adjlst_graph(adjlst_graph, **kwargs)[source]
Construct dense graph from adjacency list graph.
- Parameters:
adjlst_graph (
pecanpy.graph.AdjlstGraph
) – Adjacency list graph to be converted.
- classmethod from_mat(adj_mat: ~nptyping.ndarray.NDArray[~nptyping.base_meta_classes.Shape[*, *], ~typing.Any], node_ids: ~typing.List[str], **kwargs)[source]
Construct dense graph using adjacency matrix and node IDs.
- Parameters:
adj_mat (NDArray) – 2D numpy array of adjacency matrix
node_ids (
list
of str) – node ID list
Node2vec implementations
Different strategies for generating node2vec walks.
- class pecanpy.pecanpy.Base(p: float = 1, q: float = 1, workers: int = 1, verbose: bool = False, extend: bool = False, gamma: float = 0, random_state: int | None = None)[source]
Bases:
BaseGraph
Base node2vec object.
This base object provides the skeleton for the node2vec walk algorithm, which consists of the
simulate_walks
method that generate node2vec random walks. In contrast to the original Python implementation of node2vec, it is parallelized where each process generates walks independently.- Parameters:
p (float) – return parameter, value less than 1 encourages returning back to previous vertex, and discourage for value grater than 1 (default: 1).
q (float) – in-out parameter, value less than 1 encourages walks to go “outward”, and value greater than 1 encourage walking within a localized neighborhood (default: 1)
workers (int) – number of threads to be spawned for running node2vec including walk generation and word2vec embedding (default: 1)
verbose (bool) – show progress bar for walk generation.
extend (bool) – use node2vec+ extension if set to
True
(default:False
).gamma (float) – Multiplication factor for the std term of edge weights added to the average edge weights as the noisy edge threshold, only used by node2vec+ (default: 0)
random_state (int, optional) – Random seed for generating random walks. Note that to fully ensure reproducibility, use single thread (i.e., workers=1), and potentially need to set the Python environment variable
PYTHONHASHSEED
to match the random_state (default:None
).
Note
The
preprocess_transition_probs
is required for implementations that precomputes and stores 2nd order transition probabilities.Examples
Generate node2vec embeddings
>>> from pecanpy import pecanpy as node2vec >>> >>> # initialize node2vec object, similarly for SparseOTF and DenseOTF >>> g = node2vec.PreComp(p=0.5, q=1, workers=4, verbose=True) >>> # alternatively, can specify ``extend=True`` for using node2vec+ >>> >>> # load graph from edgelist file >>> g.read_edg(path_to_edg_file, weighted=True, directed=False) >>> # precompute and save 2nd order transition probs (for PreComp only) >>> g.preprocess_transition_probs() >>> >>> # generate random walks, which could then be used to train w2v >>> walks = g.simulate_walks(num_walks=10, walk_length=80) >>> >>> # alternatively, generate the embeddings directly using ``embed`` >>> emd = g.embed()
- simulate_walks(num_walks: int, walk_length: int) List[List[str]] [source]
Generate walks starting from each nodes
num_walks
time.Note
This is the master process that spawns worker processes, where the worker function
node2vec_walks
genearte a single random walk starting from a vertex of the graph.
- setup_get_normalized_probs()[source]
Transition probability computation setup.
This function performs necessary preprocessing of computing the average edge weights array, which is used later by the transition probability computation function
get_extended_normalized_probs
, if node2vec+ is used. Otherwise, returns the normal transition functionget_noramlized_probs
with a trivial placeholder for average edge weights arraynoise_thresholds
.
- embed(dim: int = 128, num_walks: int = 10, walk_length: int = 80, window_size: int = 10, epochs: int = 1, verbose: bool = False) float32] [source]
Generate embeddings.
This is a shortcut function that combines
simulate_walks
withWord2Vec
to generate the node2vec embedding.Note
The resulting embeddings are aligned with the graph, i.e., the index of embeddings is the same as that for the graph.
- Parameters:
dim (int) – dimension of the final embedding, default is 128
num_walks (int) – number of random walks generated using each node as the seed node, default is 10
walk_length (int) – length of the random walks, default is 80
window_size (int) – context window sized for training the
Word2Vec
model, default is 10epochs (int) – number of epochs for training
Word2Vec
, default is 1verbose (bool) – print time usage for random walk generation and skip-gram training if set to True
- Returns:
- The embedding matrix, each row is a node embedding
vector. The index is the same as that for the graph.
- Return type:
Embeddings
- class pecanpy.pecanpy.FirstOrderUnweighted(*args, **kwargs)[source]
Bases:
Base
,SparseRWGraph
Directly sample edges for first order random walks.
- class pecanpy.pecanpy.PreCompFirstOrder(*args, **kwargs)[source]
Bases:
Base
,SparseRWGraph
Precompute transition probabilities for first order random walks.
- class pecanpy.pecanpy.PreComp(*args, **kwargs)[source]
Bases:
Base
,SparseRWGraph
Precompute transition probabilities.
This implementation precomputes and stores 2nd order transition probabilities first and uses read off transition probabilities during the process of random walk. The graph type used is
SparseRWGraph
.Note
Need to call
preprocess_transition_probs()
first before generating walks.- get_move_forward()[source]
Wrap
move_forward
.This function returns a
numba.njit
compiled function that takes current vertex index (and the previous vertex index if available) and returns the next vertex index by sampling from a discrete random distribution based on the transition probabilities that are read off the precomputed transition probabilities table.Note
The returned function is used by the
simulate_walks
method.
- preprocess_transition_probs()[source]
Precompute and store 2nd order transition probabilities.
Each node contains n ** 2 number of 2nd order transition probabilities, where n is the number of neighbors of that specific node, since one can pick any one of its neighbors as the previous node and / or the next node. For each second order transition probability of a node, set up the alias draw table to be used during random walk.
Note
Uses uint64 instead of uint32 for tracking alias_indptr to prevent overflowing since the 2nd order transition probs grows much faster than the first order transition probs, which is the same as the total number of edges in the graph.
- class pecanpy.pecanpy.SparseOTF(*args, **kwargs)[source]
Bases:
Base
,SparseRWGraph
Sparse graph transition on the fly.
This implementation does NOT precompute transition probabilities in advance but instead calculates them on-the-fly during the process of random walk. The graph type used is
SparseRWGraph
.- get_move_forward()[source]
Wrap
move_forward
.This function returns a
numba.njit
compiled function that takes current vertex index (and the previous vertex index if available) and returns the next vertex index by sampling from a discrete random distribution based on the transition probabilities that are calculated on-the-fly.Note
The returned function is used by the
simulate_walks
method.
- class pecanpy.pecanpy.DenseOTF(*args, **kwargs)[source]
Bases:
Base
,DenseRWGraph
Dense graph transition on the fly.
This implementation does NOT precompute transition probabilities in advance but instead calculates them on-the-fly during the process of random walk. The graph type used is
DenseRWGraph
.- get_move_forward()[source]
Wrap
move_forward
.This function returns a
numba.njit
compiled function that takes current vertex index (and the previous vertex index if available) and returns the next vertex index by sampling from a discrete random distribution based on the transition probabilities that are calculated on-the-fly.Note
The returned function is used by the
simulate_walks
method.
- pecanpy.pecanpy.alias_setup(probs)[source]
Construct alias lookup table.
This code is modified from the blog post here: https://lips.cs.princeton.edu/the-alias-method-efficient-sampling-with-many-discrete-outcomes/ , where you can find more details about how the method works. In general, the alias method improves the time complexity of sampling from a discrete random distribution to O(1) if the alias table is setup in advance.
- Parameters:
probs (list(float32)) – normalized transition probabilities array, could be in either list or NDArray, of float32 values.