PecanPy package
Command line interface
Command line utility for PecanPy.
This is the command line interface for the pecanpy
package.
Examples
Run PecanPy in command line using PreComp
mode to embed the karate network:
$ pecanpy --input demo/karate.edg --ouptut demo/karate.emb --mode PreComp
Checkout the full list of parameters by:
$ pecanpy --help
Graph Data Structures
Lite graph objects used by pecanpy.
- class pecanpy.graph.BaseGraph[source]
Bases:
object
Base Graph object.
Handles node id and provides general properties including num_nodes, and density. The num_edges property is to be specified by the derived graph objects.
Initialize ID list and ID map.
- set_node_ids(node_ids: Optional[Sequence[str]], implicit_ids: bool = False, num_nodes: Optional[int] = None)[source]
Update ID list and mapping.
Set _node_ids given the input node IDs and also set the corresponding _node_idmap based on it, which maps from node ID to the index.
- Parameters
node_ids (
list
ofstr
, optional) – List of node IDs to use. If not available, will implicitly set node IDs to the canonical ordering of nodes with a warning message, which is suppressed if implicit_ids is set to True.implicit_ids (bool) – Implicitly set the node IDs to the canonical node ordering. If set to False and node IDs are not available, it will also set implicit node IDs, but with a warning message. The warning message can be suppressed if implicit_ids is set to True as a confirmation of the behavior.
num_nodes (int, optional) – Number of nodes, used when try to set implicit node IDs.
- class pecanpy.graph.AdjlstGraph[source]
Bases:
BaseGraph
Adjacency list Graph object used for reading/writing edge list files.
Sparse Graph object that stores graph as adjacency list.
Note
AdjlstGraph is only used for reading/writing edge list files and do not support random walk computations since Numba njit do not work with Python data structures like list and dict.
Examples
Read
.edg
file and createSparseGraph
object using.read_edg
method.>>> from pecanpy.graph import AdjlstGraph >>> >>> # initialize SparseGraph object >>> g = AdjlstGraph() >>> >>> # read graph from edgelist >>> g.read(path_to_edg_file, weighted=True, directed=False) >>> >>> indptr, indices, data = g.to_csr() # convert to csr >>> >>> dense_mat = g.to_dense() # convert to dense adjacency matrix >>> >>> g.save(edg_outpath) # save the graph to an edge list file
Initialize AdjlstGraph object.
- property edges_iter: Iterator[Tuple[int, int, float]]
Return an iterator that iterates over all edges.
- property edges: List[Tuple[int, int, float]]
Return a list of triples (head, tail, weight) representing edges.
- property num_edges
Return the number of edges in the graph.
- add_node(node_id: str)[source]
Create a new node.
Add a new node to the graph if not already exsitsed, by updating the ID list, ID map, and the adjacency list data. Otherwise pass through without further actions.
Note
Does not raise error even if the node alrealy exists.
- add_edge(id1: str, id2: str, weight: float = 1.0, directed: bool = False)[source]
Add an edge to the graph.
Note
Non-positive edges are ignored.
- read(path: str, weighted: bool, directed: bool, delimiter: str = '\t')[source]
Read an edgelist file and create sparse graph.
Note
Implicitly discard zero weighted edges; if the same edge is defined multiple times with different edge weights, then the last specified weight will be used (warning for such behavior will be printed).
- Parameters
path (str) – path to edgelist file, where the file is tab seperated and contains 2 or 3 columns depending on whether the input graph is weighted, where the the first column contains the source nodes and the second column contains the destination nodes that interact with the corresponding source nodes.
weighted (bool) – whether the graph is weighted. If unweighted, only two columns are expected in the edgelist file, and the edge weights are implicitely set to 1 for all interactions. If weighted, a third column encoding the weight of the interaction in numeric value is expected.
directed (bool) – whether the graph is directed, if undirected, the edge connecting from destination node to source node is created with same edge weight from source node to destination node.
delimiter (str) – delimiter of the edge list file, default is tab.
- save(path: str, unweighted: bool = False, delimiter: str = '\t')[source]
Save AdjLst as an
.edg
edge list file.
- to_dense() Any] [source]
Construct dense adjacency matrix.
Note
This method does not return DenseGraph object, but instead return dense adjacency matrix as NDArray, the index is the same as that of
nodes
.- Returns
Full adjacency matrix as 2d numpy array.
- Return type
NDArray
- classmethod from_mat(adj_mat: ~nptyping.base_meta_classes.NDArray[~nptyping.base_meta_classes.Shape[*, *], ~typing.Any], node_ids: ~typing.List[str], **kwargs)[source]
Construct graph using adjacency matrix and node IDs.
- Parameters
adj_mat (NDArray) – 2D numpy array of adjacency matrix
node_ids (
list
of str) – node ID list
- Returns
An adjacency graph object representing the adjacency matrix.
- class pecanpy.graph.SparseGraph[source]
Bases:
BaseGraph
Sparse Graph object that stores graph as adjacency list.
Examples
Read
.edg
file and createSparseGraph
object using.read_edg
method.>>> from pecanpy.graph import SparseGraph >>> >>> # initialize SparseGraph object >>> g = SparseGraph() >>> >>> # read graph from edgelist >>> g.read_edg(path_to_edg_file, weighted=True, directed=False) >>> >>> # save the csr graph as npz file to be used later >>> g.save(npz_outpath)
Initialize SparseGraph object.
- read_edg(path: str, weighted: bool, directed: bool, delimiter: str = '\t')[source]
Create CSR sparse graph from edge list.
First create
AdjlstGraph
by reading the edge list file, and then convert toSparseGraph
viato_csr
.
- read_npz(path: str, weighted: bool, implicit_ids: bool = False)[source]
Directly read a CSR sparse graph.
Note
- To generate a CSR file compatible with PecanPy, first load the graph
as a sparse graph using the SparseGraph (with
csr=True
). Then save the sparse graph to a csr file using thesave
method fromSparseGraph
. The saved.npz
file can then be loaded directly bySparseGraph
later.
- Parameters
path (str) – path to the csr file, which is an npz file with four arrays with keys ‘IDs’, ‘data’, ‘indptr’, ‘indices’, which correspond to the node IDs, the edge weights, the offset array for each node, and the indices of the edges.
weighted (bool) – whether the graph is weighted, if unweighted, all edge weights will be converted to 1.
directed (bool) – not used, for compatibility with
SparseGraph
.implicit_ids (bool) – Implicitly set the node IDs to the canonical node ordering from the CSR graph. If unset and the IDs field is not found in the input CSR graph, a warning message will be displayed on screen. The missing IDs field can happen, for example, when the user uses the CSR graph prepared by scipy.sparse.csr.
- classmethod from_adjlst_graph(adjlst_graph, **kwargs)[source]
Construct csr graph from adjacency list graph.
- Parameters
adjlst_graph (
pecanpy.graph.AdjlstGraph
) – Adjacency list graph to be converted.
- classmethod from_mat(adj_mat: ~nptyping.base_meta_classes.NDArray[~nptyping.base_meta_classes.Shape[*, *], ~typing.Any], node_ids: ~typing.List[str], **kwargs)[source]
Construct csr graph using adjacency matrix and node IDs.
Note
Only consider positive valued edges.
- Parameters
adj_mat (NDArray) – 2D numpy array of adjacency matrix
node_ids (
list
of str) – node ID list
- class pecanpy.graph.DenseGraph[source]
Bases:
BaseGraph
Dense Graph object that stores graph as array.
Examples
Read
.npz
files and createDenseGraph
object usingread_npz
>>> from pecanpy.graph import DenseGraph >>> >>> g = DenseGraph() # initialize DenseGraph object >>> >>> g.read_npz(paht_to_npz_file, weighted=True, directed=False)
Read
.edg
files and createDenseGraph
object usingread_edg
>>> from pecanpy.graph import DenseGraph >>> >>> # initialize DenseGraph object >>> g = DenseGraph() >>> >>> # read graph from edgelist >>> g.read_edg(path_to_edg_file, weighted=True, directed=False) >>> >>> # save the dense graph as npz file to be used later >>> g.save(npz_outpath)
Initialize DenseGraph object.
- property data: Any]]
Return the adjacency matrix.
- property nonzero: bool_]]
Return the nonzero mask for the adjacency matrix.
- read_npz(path: str, weighted: bool, implicit_ids: bool = False)[source]
Read
.npz
file and create dense graph.- Parameters
path (str) – path to
.npz
file.weighted (bool) – whether the graph is weighted, if unweighted, all none zero weights will be converted to 1.
implicit_ids (bool) – Implicitly set the node IDs to the canonical ordering from the dense adjacency matrix object. If unset and the IDs field is not found in the object, a warning message will be displayed on screen. This warning message can be suppressed if implicit_ids is set to True as a confirmation of the behavior.
- read_edg(path: str, weighted: bool, directed: bool, delimiter: str = '\t')[source]
Read an edgelist file and construct dense graph.
- classmethod from_adjlst_graph(adjlst_graph, **kwargs)[source]
Construct dense graph from adjacency list graph.
- Parameters
adjlst_graph (
pecanpy.graph.AdjlstGraph
) – Adjacency list graph to be converted.
- classmethod from_mat(adj_mat: ~nptyping.base_meta_classes.NDArray[~nptyping.base_meta_classes.Shape[*, *], ~typing.Any], node_ids: ~typing.List[str], **kwargs)[source]
Construct dense graph using adjacency matrix and node IDs.
- Parameters
adj_mat (NDArray) – 2D numpy array of adjacency matrix
node_ids (
list
of str) – node ID list
Node2vec implementations
Different strategies for generating node2vec walks.
- class pecanpy.pecanpy.Base(p: float = 1, q: float = 1, workers: int = 1, verbose: bool = False, extend: bool = False, gamma: float = 0, random_state: Optional[int] = None)[source]
Bases:
BaseGraph
Base node2vec object.
This base object provides the skeleton for the node2vec walk algorithm, which consists of the
simulate_walks
method that generate node2vec random walks. In contrast to the original Python implementaion of node2vec, it is prallelized where each process generate walks independently.Note
The
preprocess_transition_probs
is required for implenetations that precomputes and store 2nd order transition probabilities.Examples
Generate node2vec embeddings
>>> from pecanpy import pecanpy as node2vec >>> >>> # initialize node2vec object, similarly for SparseOTF and DenseOTF >>> g = node2vec.PreComp(p=0.5, q=1, workers=4, verbose=True) >>> # alternatively, can specify ``extend=True`` for using node2vec+ >>> >>> # load graph from edgelist file >>> g.read_edg(path_to_edg_file, weighted=True, directed=False) >>> # precompute and save 2nd order transition probs (for PreComp only) >>> g.preprocess_transition_probs() >>> >>> # generate random walks, which could then be used to train w2v >>> walks = g.simulate_walks(num_walks=10, walk_length=80) >>> >>> # alternatively, generate the embeddings directly using ``embed`` >>> emd = g.embed()
Initializ node2vec base class.
- Parameters
p (float) – return parameter, value less than 1 encourages returning back to previous vertex, and discourage for value grater than 1 (default: 1).
q (float) – in-out parameter, value less than 1 encourages walks to go “outward”, and value greater than 1 encourage walking within a localized neighborhood (default: 1)
workers (int) – number of threads to be spawned for runing node2vec including walk generation and word2vec embedding (default: 1)
verbose (bool) – show progress bar for walk generation.
extend (bool) – use node2vec+ extension if set to
True
(default:False
).gamma (float) – Multiplication factor for the std term of edge weights added to the average edge weights as the noisy edge threashold, only used by node2vec+ (default: 0)
random_state (int, optional) – Random seed for generating random walks. Note that to fully ensure reproducibility, use single thread (i.e., workers=1), and potentially need to set the Python environment variable
PYTHONHASHSEED
to match the random_state (default:None
).
- simulate_walks(num_walks: int, walk_length: int) List[List[str]] [source]
Generate walks starting from each nodes
num_walks
time.Note
This is the master process that spawns worker processes, where the worker function
node2vec_walks
genearte a single random walk starting from a vertex of the graph.
- setup_get_normalized_probs()[source]
Transition probability computation setup.
This is function performs necessary preprocessing of computing the average edge weights array, which is used later by the transition probability computation function
get_extended_normalized_probs
, if node2vec+ is used. Otherwise, return the normal transition functionget_noramlized_probs
with a trivial placeholder for average edge weights arraynoise_thresholds
.
- embed(dim: int = 128, num_walks: int = 10, walk_length: int = 80, window_size: int = 10, epochs: int = 1, verbose: bool = False) float32] [source]
Generate embeddings.
This is a shortcut function that combines
simulate_walks
withWord2Vec
to generate the node2vec embedding.Note
The resulting embeddings are aligned with the graph, i.e., the index of embeddings is the same as that for the graph.
- Parameters
dim (int) – dimension of the final embedding, default is 128
num_walks (int) – number of random walks generated using each node as the seed node, default is 10
walk_length (int) – length of the random walks, default is 80
window_size (int) – context window sized for training the
Word2Vec
model, default is 10epochs (int) – number of epochs for training
Word2Vec
, default is 1verbose (bool) – print time usage for random walk generation and skip-gram training if set to True
- Returns
- The embedding matrix, each row is a node embedding
vector. The index is the same as that for the graph.
- Return type
Embeddings
- class pecanpy.pecanpy.FirstOrderUnweighted(*args, **kwargs)[source]
Bases:
Base
,SparseRWGraph
Directly sample edges for first order random walks.
Initialize FirstOrderUnweighted mode.
- class pecanpy.pecanpy.PreCompFirstOrder(*args, **kwargs)[source]
Bases:
Base
,SparseRWGraph
Precompute transition probabilities for first order random walks.
Initialize PreCompFirstOrder mode.
- class pecanpy.pecanpy.PreComp(*args, **kwargs)[source]
Bases:
Base
,SparseRWGraph
Precompute transition probabilites.
This implementation precomputes and store 2nd order transition probabilites first and uses read off transition probabilities during the process of random walk. The graph type used is
SparseRWGraph
.Note
Need to call
preprocess_transition_probs()
first before generating walks.Initialize PreComp mode node2vec.
- get_move_forward()[source]
Wrap
move_forward
.This function returns a
numba.njit
compiled function that takes current vertex index (and the previous vertex index if available) and return the next vertex index by sampling from a discrete random distribution based on the transition probabilities that are read off the precomputed transition probabilities table.Note
The returned function is used by the
simulate_walks
method.
- preprocess_transition_probs()[source]
Precompute and store 2nd order transition probabilities.
Each node contains n ** 2 number of 2nd order transition probabilities, where n is the number of neigbors of that specific nodes, since one can pick any one of its neighbors as the previous node and / or the next node. For each second order transition probability of a node, set up the alias draw table to be used during random walk.
Note
Uses uint64 instaed of uint32 for tracking alias_indptr to prevent overflowing since the 2nd order transition probs grows much faster than the first order transition probs, which is the same as the total number of edges in the graph.
- class pecanpy.pecanpy.SparseOTF(*args, **kwargs)[source]
Bases:
Base
,SparseRWGraph
Sparse graph transition on the fly.
This implementation do NOT precompute transition probabilities in advance but instead calculate them on-the-fly during the process of random walk. The graph type used is
SparseRWGraph
.Initialize PreComp mode node2vec.
- get_move_forward()[source]
Wrap
move_forward
.This function returns a
numba.njit
compiled function that takes current vertex index (and the previous vertex index if available) and return the next vertex index by sampling from a discrete random distribution based on the transition probabilities that are calculated on-the-fly.Note
The returned function is used by the
simulate_walks
method.
- class pecanpy.pecanpy.DenseOTF(*args, **kwargs)[source]
Bases:
Base
,DenseRWGraph
Dense graph transition on the fly.
This implementation do NOT precompute transition probabilities in advance but instead calculate them on-the-fly during the process of random walk. The graph type used is
DenseRWGraph
.Initialize DenseOTF mode node2vec.
- get_move_forward()[source]
Wrap
move_forward
.This function returns a
numba.njit
compiled function that takes current vertex index (and the previous vertex index if available) and return the next vertex index by sampling from a discrete random distribution based on the transition probabilities that are calculated on-the-fly.Note
The returned function is used by the
simulate_walks
method.
- pecanpy.pecanpy.alias_setup(probs)[source]
Construct alias lookup table.
This code is modified from the blog post here: https://lips.cs.princeton.edu/the-alias-method-efficient-sampling-with-many-discrete-outcomes/ , where you can find more details about how the method work. In general, the alias method improves the time complexity of sampling from a discrete random distribution to O(1) if the alias table is setup in advance.
- Parameters
probs (list(float32)) – normalized transition probabilities array, could be in either list or NDArray, of float32 values.