Before April 2019, the OmniPath webpage (http://omnipathdb.org/) hosted a few tutorials for pypath. However, over the past years we developed pypath a lot, and especially recently a number of important points in the interface changed (although we tried to keep compatibility as much as possible). This new comprehensive tutorial replaced the previous tutorials in April 2019 and was updated in August 2019.
pypath provides an easy way to build the OmniPath network as described in our paper. The first time this will take a long time (more than an hour, see the comments in the code below), because all data will be downloaded from the original providers. Next time pypath will use the data from its cache directory, so the network will build much faster. If you want to load it even faster, you can save it into a pickle dump.
from pypath import main
from pypath import settings
pa = main.PyPath()
#pa.load_omnipath() # This is commented out because it takes > 1h
# to run it for the first time due to the vast
# amount of data download.
# Once you populated the cache it still takes
# approx. 30 min to build the entire OmniPath
# as the process consists of quite some data
# processing. If you dump it in a pickle, you
# can load the network in < 1 min
You can find the predefined formats in the pypath.data_formats module. For example, to load one resource from there, let's say Signor:
from pypath import main
from pypath import data_formats
pa = main.PyPath()
pa.load_resources({'signor': data_formats.pathway['signor']})
Or to load all activity flow resources with literature references:
from pypath import main
from pypath import data_formats
pa = main.PyPath()
pa.init_network(data_formats.pathway)
Or to load all activity flow resources, including the ones without literature references:
pa = main.PyPath()
pa.init_network(data_formats.pathway_all)
Here we show how to build a network from your own files. The advantage of building a network with pypath is that you don't need to worry about merging redundant elements, or about different formats and identifiers. Let's say you have two files with network data:
network1.csv
entrezA,entrezB,effect
1950,1956,inhibition
5290,207,stimulation
207,2932,inhibition
1956,5290,stimulation
network2.sif
EGF + EGFR
EGFR + PIK3CA
EGFR + SOS1
PIK3CA + RAC1
RAC1 + MAP3K1
SOS1 + HRAS
HRAS + MAP3K1
PIK3CA + AKT1
AKT1 - GSK3B
Note: you need to create these files in order to load them.
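One way to create the two example files is to write them from Python. This is only a convenience sketch using the standard library; the file names and contents are the ones listed above, and the files are written to the current working directory.

```python
from pathlib import Path

# Contents copied from the listings above.
network1 = """entrezA,entrezB,effect
1950,1956,inhibition
5290,207,stimulation
207,2932,inhibition
1956,5290,stimulation
"""

network2 = """EGF + EGFR
EGFR + PIK3CA
EGFR + SOS1
PIK3CA + RAC1
RAC1 + MAP3K1
SOS1 + HRAS
HRAS + MAP3K1
PIK3CA + AKT1
AKT1 - GSK3B
"""

# Write both files into the current working directory.
Path('network1.csv').write_text(network1)
Path('network2.sif').write_text(network2)
```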
from pypath import main
import pypath.input_formats as input_formats
# settings for the comma separated file with Entrez Gene IDs
input1 = input_formats.ReadSettings(
    name = 'egf1',
    input = 'network1.csv',
    header = True,
    separator = ',',
    id_col_a = 0,
    id_col_b = 1,
    id_type_a = 'entrez',
    id_type_b = 'entrez',
    sign = (2, 'stimulation', 'inhibition'),
    ncbi_tax_id = 9606,
)
# settings for the space separated file with Gene Symbols
input2 = input_formats.ReadSettings(
    name = 'egf2',
    input = 'network2.sif',
    separator = ' ',
    id_col_a = 0,
    id_col_b = 2,
    id_type_a = 'genesymbol',
    id_type_b = 'genesymbol',
    sign = (1, '+', '-'),
    ncbi_tax_id = 9606,
)
inputs = {
    'egf1': input1,
    'egf2': input2,
}
pa = main.PyPath()
pa.reload()
pa.init_network(lst = inputs)
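To illustrate what merging redundant elements across formats and identifiers means, here is a minimal sketch in plain Python that does not use pypath at all. The Entrez to Gene Symbol lookup is a hand-written dictionary for this example only, and keying the merged interactions by ordered pairs is a simplification chosen here; pypath uses its own mapping tables and data structures.

```python
from collections import defaultdict

# Hand-written lookup for this example only; pypath uses its own
# mapping tables for identifier translation.
ENTREZ_TO_SYMBOL = {
    '1950': 'EGF', '1956': 'EGFR', '5290': 'PIK3CA',
    '207': 'AKT1', '2932': 'GSK3B',
}

# Rows of network1.csv (Entrez Gene IDs, effect in the third column).
csv_rows = [
    ('1950', '1956', 'inhibition'),
    ('5290', '207', 'stimulation'),
    ('207', '2932', 'inhibition'),
    ('1956', '5290', 'stimulation'),
]

# Rows of network2.sif (Gene Symbols, sign in the middle column).
sif_rows = [
    ('EGF', '+', 'EGFR'), ('EGFR', '+', 'PIK3CA'), ('EGFR', '+', 'SOS1'),
    ('PIK3CA', '+', 'RAC1'), ('RAC1', '+', 'MAP3K1'), ('SOS1', '+', 'HRAS'),
    ('HRAS', '+', 'MAP3K1'), ('PIK3CA', '+', 'AKT1'), ('AKT1', '-', 'GSK3B'),
]

# (source, target) -> names of the inputs and the signs they report
merged = defaultdict(lambda: {'sources': set(), 'signs': set()})

for a, b, effect in csv_rows:
    key = (ENTREZ_TO_SYMBOL[a], ENTREZ_TO_SYMBOL[b])
    merged[key]['sources'].add('egf1')
    merged[key]['signs'].add(effect)

for a, sign, b in sif_rows:
    merged[(a, b)]['sources'].add('egf2')
    merged[(a, b)]['signs'].add('stimulation' if sign == '+' else 'inhibition')

# All four interactions of the first file also occur in the second,
# so the merged network has as many interactions as network2.sif.
print(len(merged))
print(merged[('EGF', 'EGFR')])
```

Note that the two files disagree about the sign of the EGF-EGFR interaction; the sketch keeps both signs together with the inputs reporting them instead of choosing one.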
Here we use the network created above (because it is of a reasonable size, unlike the networks we could get from most of the network databases). Igraph has excellent plotting capabilities built on top of the cairo library.
import igraph
plot = igraph.plot(pa.graph, target = 'egf_network.png',
    edge_width = 0.3, edge_color = '#777777',
    vertex_color = '#97BE73', vertex_frame_width = 0,
    vertex_size = 70.0, vertex_label_size = 15,
    vertex_label_color = '#FFFFFF',
    # due to a bug in either igraph or IPython,
    # vertex labels are not visible on inline plots:
    inline = False, margin = 120)
from IPython.display import Image
Image(filename='egf_network.png')
For this you will need the PyPath class from the pypath.main module, which takes care of building and querying the network. You also need the pypath.data_formats module, where you find a number of predefined input settings organized in larger categories (e.g. activity flow, enzyme-substrate, transcriptional regulation, etc). These input settings tell pypath how to download and process the data.
from pypath import main
from pypath import data_formats
For example, data_formats.pathway is a collection of databases which fit into the activity flow concept, i.e. one protein either stimulates or inhibits the other. It is a dictionary with names as keys and the input settings as values:
data_formats.pathway
You can pass such a dictionary to the init_network method of the PyPath object. It will then download the data from the original sources, translate the identifiers and merge the networks. Pypath stores all downloaded data in a cache, by default ~/.pypath/cache in your home directory. For this reason, loading a resource for the first time might take long, but next time it will be faster as the data will be fetched from the cache. First create a pypath.main.PyPath object, then build the network:
pa = main.PyPath()
pa.init_network(data_formats.pathway)
You can add more resource sets in a similar way:
pa.load_resources(data_formats.ptm)
To load a single resource, simply create a one-element dict:
pa.load_resources({'matrixdb': data_formats.interaction['matrixdb']})
You can find all the pre-defined datasets in the pypath.data_formats module. As we already mentioned above, the pathway dataset contains the literature curated activity flow resources. This was the original focus of pypath and OmniPath; however, since then we have added a great variety of other kinds of resource definitions. Here we give an overview of these.
data_formats.pathway: activity flow networks with literature references
data_formats.activity_flow: synonym for pathway
data_formats.pathway_noref: activity flow networks without literature references
data_formats.pathway_all: all activity flow data
data_formats.ptm: enzyme-substrate interaction networks with literature references
data_formats.enzyme_substrate: synonym for ptm
data_formats.ptm_noref: enzyme-substrate networks without literature references
data_formats.ptm_all: all enzyme-substrate data
data_formats.interaction: undirected interactions from both literature curated and high-throughput collections (e.g. IntAct, BioGRID)
data_formats.interaction_misc: undirected, high-scale interaction networks without the constraint of having any literature reference (e.g. the unbiased human interactome screen from the Vidal lab)
data_formats.transcription_onebyone: transcriptional regulation databases (TF-target interactions) with all databases downloaded directly and processed by pypath
data_formats.transcription: transcriptional regulation only from the DoRothEA data
data_formats.mirna_target: miRNA-mRNA interactions from literature curated resources
data_formats.tf_mirna: transcriptional regulation of miRNA from literature curated resources
data_formats.lncrna_protein: lncRNA-protein interactions from literature curated datasets
data_formats.ligand_receptor: ligand-receptor interactions from both literature curated and other kinds of resources
data_formats.pathwaycommons: the PathwayCommons database
data_formats.reaction: process description databases; not guaranteed to work at this moment
data_formats.reaction_misc: alternative definitions to load process description databases; not guaranteed to work at this moment
data_formats.small_molecule_protein: signaling interactions between small molecules and proteins
To see the list of the resources in a dataset, you can check the dict keys or the name attribute of each element:
data_formats.pathway.keys()
[resource.name for resource in data_formats.pathway.values()]
Once you have built a network you can use it for various purposes and write your own scripts for further processing or analysis. The network is represented by an igraph object (igraph.org):
pa.graph
Number of edges and nodes:
pa.ecount, pa.vcount
You can access the edge and vertex sequences in the es and vs attributes; you can iterate over these or index them by integers. The edge and vertex attributes are accessed by string keys. E.g. get the sources of edge 81:
pa.graph.es[81]['sources']
By default the igraph object is undirected, but it carries all direction information in Python objects assigned to each edge. Pypath can convert it to a directed igraph object, but you still need the Direction objects to have the signs, as igraph has no signed network representation. Certain methods need the directed igraph object and they will automatically create it, but you can also create it manually:
pa.get_directed()
You find the directed network in the pa.dgraph attribute:
pa.dgraph
Now let's take a look at the pypath.main.Direction objects, which contain details about directions and signs. First, as an example, select a random edge:
edge = pa.graph.es[3241]
The Direction object is in the dirs edge attribute:
d = edge['dirs']
It has a method to print its content in a human-readable way:
print(pa.graph.es[3241]['dirs'])
From this we see that the databases phosphoELM and Signor agree that protein P17252 has an effect on Q15139, and Signor in addition tells us this effect is stimulatory. In your scripts you can query the Direction objects in a number of ways. Each Direction object calls the two possible directions either straight or reverse:
d.straight
d.reverse
It can tell you if one of these directions is supported by any of the network resources:
d.get_dir(d.straight)
Or it can return those resources:
d.get_dir(d.straight, sources = True)
The opposite direction is not supported by any resource:
d.get_dir(d.reverse, sources = True)
The signs can be queried in a similar way. The returned pair of boolean values tells whether the interaction in this direction is stimulatory or inhibitory, respectively.
d.get_sign(d.straight)
Or you can ask whether it is inhibition:
d.is_inhibition(d.straight)
Or if the interaction is directed at all:
d.is_directed()
Sometimes resources don't agree: for example, one says an interaction is inhibition while according to others it is stimulation, or one says A affects B and another resource says the other way around. Here we preserve all this potentially contradicting information in the Direction object, and in the end you decide what to do with it depending on your purpose. If you want to get rid of the ambiguity, there is a method to get a consensus direction and sign, which returns the attributes most resources agree on:
d.consensus_edges()
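The idea of a consensus can be illustrated with a simple majority vote. The sketch below is a toy model in plain Python, not the implementation behind consensus_edges; the resource names and the tie-breaking rule (the first value encountered wins) are assumptions made for this example.

```python
from collections import Counter

# Toy evidence for a single interaction: each resource reports a
# direction and, optionally, a sign. Resource names are placeholders.
evidence = [
    ('resource_1', ('A', 'B'), 'stimulation'),
    ('resource_2', ('A', 'B'), 'stimulation'),
    ('resource_3', ('A', 'B'), 'inhibition'),
    ('resource_4', ('B', 'A'), None),
]

def consensus(evidence):
    """Return the direction and the sign supported by the most resources."""
    directions = Counter(direction for _, direction, _ in evidence)
    top_direction = directions.most_common(1)[0][0]
    # Count signs only among the records supporting the winning direction.
    signs = Counter(
        sign for _, direction, sign in evidence
        if direction == top_direction and sign is not None
    )
    top_sign = signs.most_common(1)[0][0] if signs else None
    return top_direction, top_sign

print(consensus(evidence))
```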
In igraph the vertices are numbered, but this numbering can change at certain operations. Instead we can use the vertex attributes. In PyPath, for proteins the name attribute is the UniProt ID by default and the label is the Gene Symbol.
pa.graph.vs['name'][:5]
pa.graph.vs['label'][:5]
The PyPath object offers a number of helper methods to access the nodes by their names. For example, uniprot or up returns the igraph.Vertex for a UniProt ID:
type(pa.up('P00533'))
Similarly, genesymbol or gs for Gene Symbols:
type(pa.gs('ESR1'))
Each of these has a "plural" version:
len(list(pa.gss(['MTOR', 'ATG16L2', 'ULK1'])))
And a generic method where you can mix UniProts and Gene Symbols:
len(list(pa.proteins(['MTOR', 'P00533'])))
Above you could see how to query the directions and names of individual edges and nodes. Building on top of these, other methods give a way to query causality, i.e. which proteins are affected by another one, and which others are its regulators. The example below returns the nodes PIK3CA is stimulated by; the gs prefix tells that we query by Gene Symbol:
pa.gs_stimulated_by('PIK3CA')
It returns a so-called _NamedVertexSeq object, from which you can get a series of igraph.Vertex objects, Gene Symbols or UniProt IDs:
list(pa.gs_stimulated_by('PIK3CA').gs())[:5]
list(pa.gs_stimulated_by('PIK3CA').up())[:5]
Note that the names of these methods are a bit counterintuitive: for example, gs_stimulates returns the genes stimulated by PIK3CA:
list(pa.gs_stimulates('PIK3CA').gs())[:5]
'PIK3CA' in set(pa.affected_by('AKT1').gs())
There are many similar methods: inhibited_by returns negative regulators; affected_by does not consider +/- signs; without the gs_ and up_ prefixes you can provide either of these identifiers; neighbors does not consider the direction. At the end, .gs() converts the result to Gene Symbols, .up() to UniProts, .ids() to vertex IDs, and by default it yields igraph.Vertex objects:
list(pa.neighbors('AKT1').ids())[:5]
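Because the naming is easy to mix up, here is a small sketch of the semantics on a toy edge list. The functions below are plain Python stand-ins written for this example; they only mimic the meaning of the method names described above and are not the pypath implementation.

```python
# Toy signed, directed edges: (source, target, sign).
edges = [
    ('EGFR', 'PIK3CA', '+'),
    ('PIK3CA', 'AKT1', '+'),
    ('PIK3CA', 'RAC1', '+'),
    ('AKT1', 'GSK3B', '-'),
]

def stimulates(protein):
    """Targets that `protein` stimulates."""
    return {t for s, t, sign in edges if s == protein and sign == '+'}

def stimulated_by(protein):
    """Regulators that stimulate `protein`."""
    return {s for s, t, sign in edges if t == protein and sign == '+'}

def affected_by(protein):
    """Regulators of `protein`, regardless of the sign."""
    return {s for s, t, _ in edges if t == protein}

def neighbors(protein):
    """All partners of `protein`, ignoring direction and sign."""
    downstream = {t for s, t, _ in edges if s == protein}
    upstream = {s for s, t, _ in edges if t == protein}
    return downstream | upstream

print(stimulates('PIK3CA'), stimulated_by('PIK3CA'))
print(affected_by('GSK3B'), neighbors('AKT1'))
```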
Finally, the neighborhood methods return the indirect neighborhood within a custom number of steps (note that the size of the neighborhood increases rapidly with the number of steps):
print(list(pa.neighborhood('ATG3', 1).gs()))
print(list(pa.neighborhood('ATG3', 2).gs()))
len(list(pa.neighborhood('ATG3', 3).gs()))
len(list(pa.neighborhood('ATG3', 4).gs()))
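The growth of the neighborhood with the number of steps can be seen even on a tiny network. The following is a minimal breadth-first search sketch on a toy adjacency dict; it is independent of pypath and igraph, and excluding the starting node from the result is a choice made for this example.

```python
from collections import deque

# Toy undirected network as an adjacency dict.
graph = {
    'EGF': {'EGFR'},
    'EGFR': {'EGF', 'PIK3CA', 'SOS1'},
    'PIK3CA': {'EGFR', 'RAC1', 'AKT1'},
    'SOS1': {'EGFR', 'HRAS'},
    'RAC1': {'PIK3CA', 'MAP3K1'},
    'AKT1': {'PIK3CA', 'GSK3B'},
    'HRAS': {'SOS1', 'MAP3K1'},
    'MAP3K1': {'RAC1', 'HRAS'},
    'GSK3B': {'AKT1'},
}

def neighborhood(graph, start, steps):
    """Nodes reachable from `start` within `steps` edges, excluding `start`."""
    distance = {start: 0}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        # Do not expand nodes that are already `steps` edges away.
        if distance[node] == steps:
            continue
        for partner in graph[node]:
            if partner not in distance:
                distance[partner] = distance[node] + 1
                queue.append(partner)
    return set(distance) - {start}

for steps in (1, 2, 3, 4):
    print(steps, len(neighborhood(graph, 'EGF', steps)))
```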
Just like nodes, edges can also be accessed by identifiers like Gene Symbols. get_edge returns an igraph.Edge if the edge exists, otherwise None.
type(pa.get_edge('EGF', 'EGFR'))
type(pa.get_edge('EGF', 'P00533'))
type(pa.get_edge('EGF', 'AKT1'))
print(pa.get_edge('EGF', 'EGFR')['dirs'])
Select a random edge; in its references attribute you find a list of references:
edge = pa.get_edge( 'MAP1LC3B', 'SQSTM1')
edge['references']
Each reference has a PubMed ID:
edge['references'][0].pmid
edge['references'][0].open()
These 3 references come from 3 different databases, but there must be 2 overlaps between them:
edge['refs_by_source']
The pypath.mapping module is for ID translation. Most of the time you can simply call the map_name function:
from pypath import mapping
mapping.map_name('P00533', 'uniprot', 'genesymbol')
mapping.map_name('8408', 'entrez', 'uniprot')
A number of mapping tables are predefined and loaded automatically. However, the module does not translate in 2 steps if no direct translation table is available. For example, you can translate from Entrez to Gene Symbol this way:
mapping.map_names(
mapping.map_name('8408', 'entrez', 'uniprot'),
'uniprot',
'genesymbol',
)
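The logic of such a two-step translation can be sketched with plain dictionaries. The tables and helper functions below are toy stand-ins defined for this example and are not pypath data structures; each ID maps to a set so that ambiguous translations can be represented.

```python
# Toy translation tables; each ID maps to a set to allow ambiguity.
# These dicts are illustrative and are not pypath data structures.
entrez_to_uniprot = {'1956': {'P00533'}}
uniprot_to_genesymbol = {'P00533': {'EGFR'}}

def translate(name, table):
    """Translate one ID; unknown IDs give an empty set."""
    return set(table.get(name, ()))

def translate_many(names, table):
    """Translate several IDs and return the union of the results."""
    result = set()
    for name in names:
        result |= translate(name, table)
    return result

# Step 1: Entrez -> UniProt, step 2: UniProt -> Gene Symbol.
uniprots = translate('1956', entrez_to_uniprot)
symbols = translate_many(uniprots, uniprot_to_genesymbol)
print(symbols)
```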
By default the map_name function returns a set because it accounts for ambiguous mapping. However, most often the ID translation is unambiguous and you want to retrieve only one ID. The map_name0 function returns a string; in case of ambiguity it returns a random element from the resulting set:
mapping.map_name0('GABARAPL3', 'genesymbol', 'uniprot')
The pypath.ptm module builds a database of enzyme-substrate interactions.
from pypath import ptm
ptm_db = ptm.get_db()
Here you get a dictionary with pairs of UniProt IDs as keys and lists of special objects representing enzyme-substrate interactions as values:
print(ptm_db.enz_sub[('Q13177', 'P01236')][0])
Alternatively the enzyme-substrate interactions can be assigned to network edges:
pa.load_ptms2()
print(pa.graph.es['ptm'][444][0])
This module provides various annotations about the function and location of the proteins.
from pypath import annot
a = annot.get_db()
OmniPath contains annotations from 27 resources. These provide various information about the characteristics of the proteins, e.g. their localization or function. The AnnotationTable object loads all annotations by default; optionally you can limit this to certain resources. For example, if you only want to load the pathway membership annotations from SIGNOR, SignaLink, NetPath and KEGG, you can provide the names of the appropriate classes:
pathways = annot.AnnotationTable(
protein_sources = (
'SignalinkPathways',
'KeggPathways',
'NetpathPathways',
'SignorPathways',
)
)
The AnnotationTable object provides methods to query all resources together, or to build a boolean array out of them. To see all annotations of one protein:
pathways.all_annotations('P00533')
pathways.create_dataframe = True
pathways.make_dataframe()
pathways.df[:10]
The AnnotationTable object contains the resource specific annotation objects:
a.annots
For each of these you can query the names of the fields, their possible values and the set of proteins annotated with any combination of the values:
matrisome = a.annots['Matrisome']
matrisome.get_names()
matrisome.get_values('subclass')
matrisome.get_subset(subclass = 'Collagens')
pypath does not combine the annotations in the annot module: exactly what goes in comes out. For example, the WNT pathway from Signor and SignaLink won't be merged automatically. However, with the pypath.annot.CustomAnnotation class anyone can do it. For inter-cellular communication categories, the pypath.intercell module combines the data from all the relevant resources and creates categories based on a combination of evidences.
from pypath import intercell
i = intercell.get_db() # this takes quite some time
# unless you load annotations from a pickle cache
i
i.make_df()
i.df[:10]
i.class_names
i.children['receptor']
i.counts()
i.classes_by_entity('P00533')
i.class_labels
list(i.classes['adhesion'])[:10]
pypath.go is an almost standalone module for managing the Gene Ontology tree and annotations. The main objects here are GeneOntology and GOAnnotation. The former represents the ontology tree, i.e. terms and their relationships, the latter their assignment to gene products. Both provide many versatile methods for querying.
from pypath import go
goa = go.GOAnnotation()
goa.ontology # the GeneOntology object
goa # the GOAnnotation object
Among many others, the most versatile method is select, which is able to select the annotated gene products by various expressions built from GO terms or IDs. It understands AND, OR, NOT and parentheses.
query = """(cell surface OR
external side of plasma membrane OR
extracellular region) AND
(regulation of transmembrane transporter activity OR
channel regulator activity)"""
result = goa.select(query)
print(list(result)[:7])
goa.ontology.get_all_descendants('GO:0005576')
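Conceptually, such a query corresponds to set operations on the sets of gene products annotated with each term: OR is a union, AND is an intersection and NOT is a difference. The sketch below shows this with toy data in plain Python; the protein names are placeholders, and neither the term lookup nor the ontology tree used by pypath.go is modeled here.

```python
# Toy annotations: GO term name -> set of annotated proteins.
# Protein names are placeholders, not real annotation data.
annotated = {
    'cell surface': {'P1', 'P2'},
    'extracellular region': {'P2', 'P3', 'P4'},
    'channel regulator activity': {'P2', 'P4', 'P5'},
    'nucleus': {'P4'},
}

# (cell surface OR extracellular region) AND channel regulator activity
location = annotated['cell surface'] | annotated['extracellular region']
selected = location & annotated['channel regulator activity']

# ... AND NOT nucleus
selected_not_nuclear = selected - annotated['nucleus']

print(sorted(selected), sorted(selected_not_nuclear))
```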
The pypath.complex module builds a non-redundant list of complexes from 10 original resources. Complexes are unique considering their set of components, and optionally carry stoichiometry information.
from pypath import complex
complexdb = complex.get_db()
complexdb.update_index()
complexdb
To retrieve all complexes containing a specific protein, here MTOR:
complexdb.proteins['P42345']
Note that some of the complexes have human-readable names; these are preferred for printing if available from any of the databases. Otherwise the complexes are labelled by COMPLEX:list-of-components.
Take a closer look at one complex object. The hash of the complex is equivalent to the string representation below, where the UniProt IDs are unique and alphabetically sorted. Hence you can look up complexes using strings as keys, even though the dict keys are in fact pypath.intera.Complex objects:
cplex = complexdb.complexes['COMPLEX:P42345-Q13451']
cplex.components # stoichiometry
cplex.sources # resources
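How an object key can be found by a string is a general Python mechanism: a dict lookup succeeds if the hashes match and the keys compare equal. The class below is a minimal stand-in written for this example to show the mechanism; it is not the pypath.intera.Complex class, and its details are assumptions.

```python
class ToyComplex:
    """Minimal stand-in for a complex; not the pypath.intera.Complex class."""

    def __init__(self, components):
        # Unique components, sorted alphabetically.
        self.components = tuple(sorted(set(components)))

    def __str__(self):
        return 'COMPLEX:' + '-'.join(self.components)

    def __hash__(self):
        # Same hash as the string representation.
        return hash(str(self))

    def __eq__(self, other):
        # Equal to any object with the same string representation.
        return str(self) == str(other)

cplex = ToyComplex(['Q13451', 'P42345'])
complexes = {cplex: cplex}

# A plain string finds the entry stored under the object key.
found = complexes['COMPLEX:P42345-Q13451']
print(found is cplex)
```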
The large datasets above are compiled from many resources. Even if these are already available in the cache, the data processing often takes longer than convenient, e.g. a few minutes. Most of the data integration objects in pypath provide methods to save and load their contents as pickle dumps.
# for `pypath.main.PyPath` objects:
pa.save_network('mynetwork.pickle') # save
pa.init_network(pfile = 'mynetwork.pickle') # load
# for `pypath.annot.AnnotationTable` objects:
a.save_to_pickle('myannots.pickle')
a = annot.AnnotationTable(pickle_file = 'myannots.pickle')
# for `pypath.complex.ComplexAggregator` objects:
complexdb.save_to_pickle('mycomplexes.pickle')
complexdb = complex.ComplexAggregator(pickle_file = 'mycomplexes.pickle')
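For readers unfamiliar with pickle dumps, the underlying idea is the standard library round trip shown below. This is a generic sketch with a small dictionary standing in for an expensive-to-build database; the file name is arbitrary and the pypath methods above may do more than this.

```python
import os
import pickle
import tempfile

# Stand-in for an expensive-to-build database.
database = {'P00533': {'label': 'EGFR'}, 'P42345': {'label': 'MTOR'}}

# Use a temporary directory so the example leaves the working directory clean.
path = os.path.join(tempfile.mkdtemp(), 'mydatabase.pickle')

# Save: serialize the object into a binary file.
with open(path, 'wb') as fp:
    pickle.dump(database, fp)

# Load: restore an equal object without rebuilding it.
with open(path, 'rb') as fp:
    restored = pickle.load(fp)

print(restored == database)
```

Keep in mind that pickle files should only be loaded from trusted sources, as unpickling can execute arbitrary code.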
The original implementation of the network in pypath is based on igraph. Work is ongoing to provide a new and more flexible network builder which will produce a pandas.DataFrame and make pypath independent of igraph. As a temporary solution you can easily convert the network to a pandas.DataFrame using the pypath.network module.
from pypath import main
from pypath import data_formats
from pypath import network
pa = main.PyPath()
pa.init_network(data_formats.pathway_all)
net = network.Network.from_igraph(pa)
net.records[:10]
pypath now has an improved logger. All modules send messages to a log file named by default after the session ID (a 5-character random string). The default path to the log file is ./pypath_log/pypath-xxxxx.log where xxxxx is the session ID. When you import pypath, the welcome message tells you the session ID and the log file location.
import pypath
By default this is also the only message pypath prints directly to the console; all other messages go only to the log. Here is how you can access the session ID and the logger:
pypath.session_mod.session
pypath.session_mod.session.log.fname
pypath.session_mod.session.label
From your scripts and apps you can also easily send messages to the logfile:
pypath.session_mod.session.log.msg('Greetings from the pypath tutorial notebook! :)')
with open(pypath.session_mod.session.log.fname, 'r') as fp:
    messages = fp.read().split('\n')
print('\n'.join(messages))
If you create a class inheriting from pypath.session_mod.Logger, it will be automatically connected to the session logger:
class ChildOfLogger(pypath.session_mod.Logger):
    def __init__(self):
        pypath.session_mod.Logger.__init__(self, name = 'child')
    def say_something(self):
        self._log('Have a nice day! :D')
col = ChildOfLogger()
col.say_something()
with open(pypath.session_mod.session.log.fname, 'r') as fp:
    messages = fp.read().split('\n')
print('\n'.join(messages))
Note that the log messages are flushed every 2 seconds by default, but their timestamps always refer to the exact time the message was sent. A second stamp shows the name of the sending submodule or class.
Finally, see a log from a real pypath session:
from pypath import main
from pypath import data_formats
pa = main.PyPath()
pa.init_network(data_formats.pathway)
with open(pypath.session_mod.session.log.fname, 'r') as fp:
    messages = fp.read().split('\n')
print('\n'.join(messages[-20:]))
Biological Expression Language (BEL, https://bel-commons.scai.fraunhofer.de/) is a versatile description language for capturing relationships between various biological entities, spanning a wide range of levels of biological organization. pypath has a dedicated module to convert the network and the enzyme-substrate interactions to BEL format:
from pypath import main
from pypath import data_formats
from pypath import bel
pa = main.PyPath()
pa.init_network(data_formats.pathway)
You can provide one or more resources to the Bel class. Currently supported resources are pypath.main.PyPath and pypath.ptm.PtmAggregator.
b = bel.Bel(resource = pa)
From the resources we compile a BELGraph object, which provides a Python interface for various operations; you can also export the data in BEL format:
b.main()
b.bel_graph
b.bel_graph.summarize()
b.export_relationships('omnipath_pathways.bel')
with open('omnipath_pathways.bel', 'r') as fp:
    bel_str = fp.read()
print(bel_str[:333])
CellPhoneDB is a statistical method and a database for inferring inter-cellular communication pathways between specific cell types from single-cell data. OmniPath/pypath uses CellPhoneDB as a resource for interaction, protein complex and annotation data. Apart from this, pypath is able to export its data in the appropriate format to provide input for the CellPhoneDB Python module. For this you can use the pypath.cellphonedb module:
from pypath import cellphonedb
from pypath import settings
settings.setup(network_expand_complexes = False)
Here you can provide parameters for the network or pass an already built network. You can also provide the datasets as pickles to make them load really fast. Otherwise this step will take quite long.
c = cellphonedb.CellPhoneDB()
You can access each of the CellPhoneDB input files as a pandas.DataFrame, and they have also been exported to csv files. For example, interaction_input.csv contains interactions from all the resources used for building the network (here Signor, SignaLink, etc.):
c.interaction_dataframe[:10]
The proteins and complexes are annotated (transmembrane, peripheral, secreted, etc.) using data from the pypath.intercell module (identical to the http://omnipathdb.org/intercell query of the web service):
c.protein_dataframe[:10]