The pypath tutorial collection

Before April 2019 we had a few pypath tutorials on the OmniPath webpage (http://omnipathdb.org/). However, over the past years we have developed pypath a lot, and recently a number of important points in the interface changed (although we tried to keep compatibility as much as possible). This new, comprehensive tutorial replaced the previous ones in April 2019 and was updated in August 2019.

Table of contents

1: Quick start – How do I build OmniPath data with pypath?
2: Quick start – I just want a network quickly and play around with pypath
3: Quick start – How do I build networks from any data with pypath?
- 3a: Defining input formats
- 3b: Creating PyPath object and loading the 2 test files
4: Plotting the network with igraph
5: Building networks
- 5a: Which network datasets are pre-defined in pypath?
6: How to access the network
7: Directions and signs
8: Accessing nodes in the network
9: Querying relationships with or without causality
10: Accessing edges by identifiers
11: Literature references
12: Translating identifiers
13: Enzyme-substrate interactions
14: Annotations
15: Inter-cellular signaling roles
16: Gene Ontology
17: Protein complexes
18: Saving datasets as pickles
19: Network in pandas.DataFrame
20: Log messages and sessions
21: BEL export
22: CellPhoneDB export

1: Quick start – How do I build OmniPath data with pypath?

pypath provides an easy way to build the OmniPath network as it has been described in our paper. The first time this will take several minutes, because all data will be downloaded from the original providers. The next time pypath will use the data from its cache directory, so the network will build much faster. If you want to load it even faster, you can save it into a pickle dump.

from pypath import main
from pypath import settings

pa = main.PyPath()
#pa.load_omnipath() # This is commented out because it takes > 1h 
                    # to run it for the first time due to the vast
                    # amount of data download.
                    # Once you populated the cache it still takes
                    # approx. 30 min to build the entire OmniPath
                    # as the process consists of quite some data
                    # processing. If you dump it in a pickle, you
                    # can load the network in < 1 min
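
The caching idea behind the pickle dump can be illustrated without pypath. The following is a minimal sketch using only the standard library; build_network is a hypothetical stand-in for the expensive build step and the file name is arbitrary. pypath's own way of saving datasets as pickles is covered in a later section and is not shown here.

```python
import os
import pickle
import tempfile


def build_network():
    # Hypothetical stand-in for an expensive build step;
    # in practice this would be the object built by pypath.
    return {'nodes': ['EGF', 'EGFR'], 'edges': [('EGF', 'EGFR')]}


def load_or_build(path):
    # Reuse the pickle if it exists, otherwise build and save.
    if os.path.exists(path):
        with open(path, 'rb') as fp:
            return pickle.load(fp)
    network = build_network()
    with open(path, 'wb') as fp:
        pickle.dump(network, fp)
    return network


pickle_path = os.path.join(tempfile.mkdtemp(), 'network.pickle')
first = load_or_build(pickle_path)   # builds the object and writes the pickle
second = load_or_build(pickle_path)  # reads the pickle back from disk
```

Only load pickle files you created yourself, as unpickling can execute arbitrary code.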

2: Quick start – I just want a network quickly and play around with pypath

You can find the predefined formats in the pypath.data_formats module. For example, to load one resource from there, let's say Signor:

from pypath import main
from pypath import data_formats
pa = main.PyPath()
pa.load_resources({'signor': data_formats.pathway['signor']})

Or to load all activity flow resources with literature references:

from pypath import main
from pypath import data_formats

pa = main.PyPath()
pa.init_network(data_formats.pathway)

Or to load all activity flow resources, including the ones without literature references:

pa = main.PyPath()
pa.init_network(data_formats.pathway_all)

3: Quick start – How do I build networks from any data with pypath?

Here we show how to build a network from your own files. The advantage of building a network with pypath is that you don't need to worry about merging redundant elements, nor about different formats and identifiers. Let's say you have two files with network data:

network1.csv

entrezA,entrezB,effect
1950,1956,inhibition
5290,207,stimulation
207,2932,inhibition
1956,5290,stimulation

network2.sif

EGF + EGFR
EGFR + PIK3CA
EGFR + SOS1
PIK3CA + RAC1
RAC1 + MAP3K1
SOS1 + HRAS
HRAS + MAP3K1
PIK3CA + AKT1
AKT1 - GSK3B

Note: you need to create these files in order to load them.
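
One way to create the two files is to write them from Python. This is just a convenience sketch that writes the contents listed above into the current working directory; any text editor works equally well.

```python
# Contents of the two example files, exactly as listed above.
network1_csv = '\n'.join([
    'entrezA,entrezB,effect',
    '1950,1956,inhibition',
    '5290,207,stimulation',
    '207,2932,inhibition',
    '1956,5290,stimulation',
]) + '\n'

network2_sif = '\n'.join([
    'EGF + EGFR',
    'EGFR + PIK3CA',
    'EGFR + SOS1',
    'PIK3CA + RAC1',
    'RAC1 + MAP3K1',
    'SOS1 + HRAS',
    'HRAS + MAP3K1',
    'PIK3CA + AKT1',
    'AKT1 - GSK3B',
]) + '\n'

# Write both files into the current working directory.
for filename, content in (
    ('network1.csv', network1_csv),
    ('network2.sif', network2_sif),
):
    with open(filename, 'w') as fp:
        fp.write(content)
```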

3a: Defining input formats

# main is needed in step 3b to create the PyPath object
from pypath import main
import pypath.input_formats as input_formats

input1 = input_formats.ReadSettings(
    name = 'egf1',
    input = 'network1.csv',
    header = True,
    separator = ',',
    id_col_a = 0,
    id_col_b = 1,
    id_type_a = 'entrez',
    id_type_b = 'entrez',
    sign = (2, 'stimulation', 'inhibition'),
    ncbi_tax_id = 9606,
)

input2 = input_formats.ReadSettings(
    name = 'egf2',
    input = 'network2.sif',
    separator = ' ',
    id_col_a = 0,
    id_col_b = 2,
    id_type_a = 'genesymbol',
    id_type_b = 'genesymbol',
    sign = (1, '+', '-'),
    ncbi_tax_id = 9606,
)
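
Judging from the two definitions above, the sign argument is a tuple of (column index, value meaning stimulation, value meaning inhibition). The sketch below shows how one line of each file could be interpreted under that assumption. It is plain Python for illustration only, not pypath's actual reader, and parse_line is a hypothetical helper.

```python
def parse_line(line, separator, id_col_a, id_col_b, sign):
    # sign = (column index, label for stimulation, label for inhibition)
    sign_col, positive, negative = sign
    fields = line.strip().split(separator)
    effect = fields[sign_col]
    if effect == positive:
        effect_sign = 1
    elif effect == negative:
        effect_sign = -1
    else:
        effect_sign = 0  # unknown or unsigned effect
    return fields[id_col_a], fields[id_col_b], effect_sign


# One line from network1.csv, read with the settings of input1
csv_record = parse_line(
    '5290,207,stimulation', ',', 0, 1, (2, 'stimulation', 'inhibition')
)

# One line from network2.sif, read with the settings of input2
sif_record = parse_line('AKT1 - GSK3B', ' ', 0, 2, (1, '+', '-'))
```

Here csv_record holds two Entrez IDs and a positive sign, while sif_record holds two Gene Symbols and a negative sign. Translating both kinds of identifiers to a common type is the part pypath does for you.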

3b: Creating PyPath object and loading the 2 test files

inputs = {
    'egf1': input1,
    'egf2': input2
}

pa = main.PyPath()
pa.reload()
pa.init_network(lst = inputs)

4: Plotting the network with igraph

Here we use the network created above, because it has a reasonable size, unlike the networks we could get from most of the network databases. igraph has excellent plotting capabilities built on top of the cairo library.

import igraph
plot = igraph.plot(pa.graph, target = 'egf_network.png',
            edge_width = 0.3, edge_color = '#777777',
            vertex_color = '#97BE73', vertex_frame_width = 0,
            vertex_size = 70.0, vertex_label_size = 15,
            vertex_label_color = '#FFFFFF',
            # due to a bug in either igraph or IPython, 
            # vertex labels are not visible on inline plots:
            inline = False, margin = 120)
from IPython.display import Image
Image(filename='egf_network.png')

5: Building networks

For this you will need the PyPath class from the pypath.main module, which takes care of building and querying the network. You also need the pypath.data_formats module, where you find a number of predefined input settings organized in larger categories (e.g. activity flow, enzyme-substrate, transcriptional regulation, etc). These input settings tell pypath how to download and process the data.

from pypath import main
from pypath import data_formats

For example data_formats.pathway is a collection of databases which fit into the activity flow concept, i.e. one protein either stimulates or inhibits the other. It is a dictionary with names as keys and the input settings as values:

data_formats.pathway

{'trip': <pypath.input_formats.ReadSettings at 0x6da2497bc940>,
 'spike': <pypath.input_formats.ReadSettings at 0x6da2497bc9b0>,
 'signalink3': <pypath.input_formats.ReadSettings at 0x6da2497bc9e8>,
 'guide2pharma': <pypath.input_formats.ReadSettings at 0x6da2497bca20>,
 'ca1': <pypath.input_formats.ReadSettings at 0x6da2497bca58>,
 'arn': <pypath.input_formats.ReadSettings at 0x6da2497bcac8>,
 'nrf2': <pypath.input_formats.ReadSettings at 0x6da2497bcb00>,
 'macrophage': <pypath.input_formats.ReadSettings at 0x6da2497bca90>,
 'death': <pypath.input_formats.ReadSettings at 0x6da2497bcb38>,
 'pdz': <pypath.input_formats.ReadSettings at 0x6da2497bcb70>,
 'signor': <pypath.input_formats.ReadSettings at 0x6da2497bcba8>,
 'adhesome': <pypath.input_formats.ReadSettings at 0x6da2497bcbe0>,
 'hpmr': <pypath.input_formats.ReadSettings at 0x6da2497c0908>,
 'cellphonedb': <pypath.input_formats.ReadSettings at 0x6da2497c09e8>,
 'ramilowski2015': <pypath.input_formats.ReadSettings at 0x6da2497c0ac8>}

You can pass such a dictionary to the init_network method of the PyPath object. It will then download the data from the original sources, translate the identifiers and merge the networks. Pypath stores all downloaded data in a cache, by default ~/.pypath/cache in your user's home directory. For this reason, loading a resource for the first time might take long, but the next time it will be faster as the data will be fetched from the cache. First create a pypath.main.PyPath object, then build the network:

pa = main.PyPath()
pa.init_network(data_formats.pathway)

You can add more resource sets in a similar way:

pa.load_resources(data_formats.ptm)

To load one single resource simply create a one element dict:

pa.load_resources({'matrixdb': data_formats.interaction['matrixdb']})

5a: Which network datasets are pre-defined in pypath?

You can find all the pre-defined datasets in the pypath.data_formats module. As we already mentioned above, the pathway dataset contains the literature curated activity flow resources. This was the original focus of pypath and OmniPath; since then, however, we have added a great variety of other kinds of resource definitions. Here we give an overview of these.

data_formats.pathway: activity flow networks with literature references
data_formats.activity_flow: synonym for pathway
data_formats.pathway_noref: activity flow networks without literature references
data_formats.pathway_all: all activity flow data
data_formats.ptm: enzyme-substrate interaction networks with literature references
data_formats.enzyme_substrate: synonym for ptm
data_formats.ptm_noref: enzyme-substrate networks without literature references
data_formats.ptm_all: all enzyme-substrate data
data_formats.interaction: undirected interactions from both literature curated and high-throughput collections (e.g. IntAct, BioGRID)
data_formats.interaction_misc: undirected, high-scale interaction networks without the constraint of having any literature reference (e.g. the unbiased human interactome screen from the Vidal lab)
data_formats.transcription_onebyone: transcriptional regulation databases (TF-target interactions) with all databases downloaded directly and processed by pypath
data_formats.transcription: transcriptional regulation only from the DoRothEA data
data_formats.mirna_target: miRNA-mRNA interactions from literature curated resources
data_formats.tf_mirna: transcriptional regulation of miRNA from literature curated resources
data_formats.lncrna_protein: lncRNA-protein interactions from literature curated datasets
data_formats.ligand_receptor: ligand-receptor interactions from both literature curated and other kinds of resources
data_formats.pathwaycommons: the PathwayCommons database
data_formats.reaction: process description databases; not guaranteed to work at this moment
data_formats.reaction_misc: alternative definitions to load process description databases; not guaranteed to work at this moment
data_formats.small_molecule_protein: signaling interactions between small molecules and proteins

To see the list of the resources in a dataset, you can check the dict keys or the name attribute of each element:

data_formats.pathway.keys()

dict_keys(['trip', 'spike', 'signalink3', 'guide2pharma', 'ca1', 'arn', 'nrf2', 'macrophage', 'death', 'pdz', 'signor', 'adhesome', 'hpmr', 'cellphonedb', 'ramilowski2015'])

[resource.name for resource in data_formats.pathway.values()]

['TRIP',
 'SPIKE',
 'SignaLink3',
 'Guide2Pharma',
 'CA1',
 'ARN',
 'NRF2ome',
 'Macrophage',
 'DeathDomain',
 'PDZBase',
 'Signor',
 'Adhesome',
 'HPMR',
 'CellPhoneDB',
 'Ramilowski2015']

6: How to access the network

Once you have built a network you can use it for various purposes and write your own scripts for further processing or analysis. The network is represented by an igraph object (igraph.org):

pa.graph

<igraph.Graph at 0x6ee60f2c7318>

Number of edges and nodes:

pa.ecount, pa.vcount

(22101, 5184)

You can access the edge and vertex sequences in the es and vs attributes; you can iterate over these or index them by integers. The edge and vertex attributes are accessed by string keys. E.g. get the sources of edge 81:

pa.graph.es[81]['sources']

{'SPIKE', 'SignaLink3'}

7: Directions and signs

By default the igraph object is undirected but it carries all direction information in Python objects assigned to each edge. Pypath can convert it to a directed igraph object, but you still need the Direction objects to have the signs, as igraph has no signed network representation. Certain methods need the directed igraph object and they will automatically create it, but you can create it manually:

pa.get_directed()

You find the directed network in the pa.dgraph attribute:

pa.dgraph

<igraph.Graph at 0x6ee649d04318>

Now let's take a look at the pypath.main.Direction objects, which contain details about directions and signs. First, as an example, select a random edge:

edge = pa.graph.es[3241]

The Direction object is in the dirs edge attribute:

d = edge['dirs']

It has a method to print its content in a human readable way:

print(pa.graph.es[3241]['dirs'])

Directions and signs of interaction between Q13489 and Q13546

	Q13489 ===> Q13546 :: SPIKE, SignaLink3
	Q13489 <=== Q13546 :: SignaLink3
	Q13489 =+=> Q13546 :: SPIKE

From this we see that SPIKE and SignaLink3 agree that protein Q13489 has an effect on Q13546, and SPIKE in addition tells us this effect is stimulatory. In your scripts you can query the Direction objects in a number of ways. Each Direction object calls the two possible directions either straight or reverse:

d.straight

('Q13489', 'Q13546')

d.reverse

('Q13546', 'Q13489')

It can tell you if one of these directions is supported by any of the network resources:

d.get_dir(d.straight)

True

Or it can return those resources:

d.get_dir(d.straight, sources = True)

{'SPIKE', 'SignaLink3'}

The opposite direction is supported only by SignaLink3:

d.get_dir(d.reverse, sources = True)

{'SignaLink3'}

The signs can be queried in a similar way. The returned pair of boolean values tells whether the interaction in this direction is stimulatory or inhibitory, respectively.

d.get_sign(d.straight)

[True, False]

Or you can ask whether it is inhibition:

d.is_inhibition(d.straight)

False

Or if the interaction is directed at all:

d.is_directed()

True

Sometimes resources don't agree: for example, one says an interaction is an inhibition while according to others it is a stimulation; or one says A affects B and another resource says it is the other way around. Here we preserve all this potentially contradictory information in the Direction object, and in the end you decide what to do with it depending on your purpose. If you want to get rid of the ambiguity, there is a method to get a consensus direction and sign, which returns the attributes that the most resources agree on:

d.consensus_edges()

[['Q13489', 'Q13546', 'directed', 'positive']]
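
The idea of a consensus can be sketched as a simple majority vote over the supporting resources. The code below is a plain-Python illustration of that idea, not pypath's implementation: the evidence dictionary and the consensus function are hypothetical, and the toy data only mirrors the printed Direction example above. Ties are resolved arbitrarily in this sketch.

```python
# Toy evidence table: which resources support each statement about
# one interaction, keyed by (source, target) direction.
evidence = {
    ('Q13489', 'Q13546'): {
        'directed': {'SPIKE', 'SignaLink3'},
        'positive': {'SPIKE'},
        'negative': set(),
    },
    ('Q13546', 'Q13489'): {
        'directed': {'SignaLink3'},
        'positive': set(),
        'negative': set(),
    },
}


def consensus(evidence):
    # Pick the direction backed by the most resources...
    best = max(evidence, key=lambda d: len(evidence[d]['directed']))
    n_pos = len(evidence[best]['positive'])
    n_neg = len(evidence[best]['negative'])
    # ...then the sign backed by the most resources in that direction.
    if n_pos > n_neg:
        sign = 'positive'
    elif n_neg > n_pos:
        sign = 'negative'
    else:
        sign = 'unknown'
    return [best[0], best[1], 'directed', sign]


result = consensus(evidence)
```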

8: Accessing nodes in the network

In igraph the vertices are numbered, but this numbering can change at certain operations. Instead, we can use the vertex attributes. In PyPath, for proteins the name attribute is the UniProt ID by default and the label is the Gene Symbol.

pa.graph.vs['name'][:5]

['P63000', 'O00161', 'Q9GZU1', 'Q96H20', 'Q9NWB7']

pa.graph.vs['label'][:5]

['RAC1', 'SNAP23', 'MCOLN1', 'SNF8', 'IFT57']

The PyPath object offers a number of helper methods to access the nodes by their names. For example, uniprot or up returns the igraph.Vertex for a UniProt ID:

type(pa.up('P00533'))

igraph.Vertex

Similarly genesymbol or gs for Gene Symbols:

type(pa.gs('ESR1'))

igraph.Vertex

Each of these has a "plural" version:

len(list(pa.gss(['MTOR', 'ATG16L2', 'ULK1'])))

3

And a generic method where you can mix UniProts and Gene Symbols:

len(list(pa.proteins(['MTOR', 'P00533'])))

2

9: Querying relationships with or without causality

Above you could see how to query the directions and names of individual edges and nodes. Building on top of these, other methods give a way to query causality, i.e. which proteins are affected by another one, and which others are its regulators. The example below returns the nodes PIK3CA is stimulated by; the gs prefix tells that we query by Gene Symbol:

pa.gs_stimulated_by('PIK3CA')

<pypath.main._NamedVertexSeq at 0x6ee604b0a8c8>

It returns a so-called _NamedVertexSeq object, from which you can get a series of igraph.Vertex objects, Gene Symbols or UniProt IDs:

list(pa.gs_stimulated_by('PIK3CA').gs())[:5]

['NTRK1', 'SRC', 'GAB1', 'PTPN11', 'NRAS']

list(pa.gs_stimulated_by('PIK3CA').up())[:5]

['P04629', 'P12931', 'Q13480', 'Q06124', 'P01111']

Note that the names of these methods are a bit counterintuitive: for example, gs_stimulates returns the genes stimulated by PIK3CA:

list(pa.gs_stimulates('PIK3CA').gs())[:5]

['MTOR', 'AKT1']

'PIK3CA' in set(pa.affected_by('AKT1').gs())

True
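
To make the naming easier to follow, here is a toy model of the three kinds of queries on a small list of signed, directed edges. It is a plain-Python sketch for illustration; the functions are hypothetical stand-ins and not the pypath methods themselves. The edges are taken from the examples in this tutorial.

```python
# Toy signed, directed edges: (source, target, sign)
edges = [
    ('SRC', 'PIK3CA', '+'),
    ('PIK3CA', 'AKT1', '+'),
    ('PIK3CA', 'MTOR', '+'),
    ('AKT1', 'GSK3B', '-'),
]


def stimulated_by(node):
    # Upstream: the nodes that `node` is stimulated by
    return {s for s, t, sign in edges if t == node and sign == '+'}


def stimulates(node):
    # Downstream: the nodes that `node` stimulates
    return {t for s, t, sign in edges if s == node and sign == '+'}


def affected_by(node):
    # Upstream regulators of `node`, regardless of the sign
    return {s for s, t, sign in edges if t == node}


upstream = stimulated_by('PIK3CA')
downstream = stimulates('PIK3CA')
regulators_of_gsk3b = affected_by('GSK3B')
```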

There are many similar methods: inhibited_by returns negative regulators; affected_by does not consider +/- signs; without the gs_ and up_ prefixes you can provide either of these identifiers; neighbors does not consider the direction. At the end, .gs() converts the result to Gene Symbols, .up() to UniProt IDs and .ids() to vertex IDs; by default it yields igraph.Vertex objects:

list(pa.neighbors('AKT1').ids())[:5]

[0, 32, 38, 50, 69]

Finally, the neighborhood method returns the indirect neighborhood in a custom number of steps (note that the size of the neighborhood increases rapidly with the number of steps):

print(list(pa.neighborhood('ATG3', 1).gs()))

['ATG3', 'GABARAP', 'ATG5', 'GABARAPL2', 'ATG12', 'ATG7', 'CFLAR', 'MAP1LC3B', 'MAP1LC3A', 'TP63']

print(list(pa.neighborhood('ATG3', 2).gs()))

['ATG3', 'GABARAP', 'ATG5', 'GABARAPL2', 'ATG12', 'ATG7', 'CFLAR', 'MAP1LC3B', 'MAP1LC3A', 'TP63', 'TRPV1', 'CLTC', 'FNBP1', 'NBR1', 'BNIP3L', 'ATG13', 'SQSTM1', 'RB1CC1', 'FYCO1', 'ATG4B', 'ULK1', 'ULK2', 'DVL2', 'OPTN', 'IFIH1', 'BCL2L1', 'ATF4', 'TP73', 'WDFY3', 'CAPN2', 'FADD', 'CAPN1', 'ATG10', 'DDX58', 'DDIT3', 'MAVS', 'ATG16L1', 'ATG16L2', 'TECPR1', 'PPHLN1', 'COX5B', 'UBA5', 'NEK9', 'ATG4A', 'BNIP3', 'NIPSNAP2', 'EP300', 'FOXO1', 'HSF1', 'TAX1BP3', 'ITCH', 'RIPK1', 'FAS', 'NFKB1', 'PRKCB', 'RIPK2', 'TRAF2', 'AR', 'CASP8', 'AKT1', 'MAP3K14', 'CASP10', 'PRKACA', 'MAP1B', 'EGR1', 'MAPK8', 'KEAP1', 'ZKSCAN3', 'TFEB', 'P27791', 'TBC1D5', 'E2F1', 'MAP1A', 'RAB3GAP1', 'HNRNPAB', 'FBXW7', 'ATM', 'TP53', 'MDM2', 'RPS6KB1', 'CDK2', 'IKBKB', 'ATG9A', 'BECN1']

len(list(pa.neighborhood('ATG3', 3).gs()))

1735

len(list(pa.neighborhood('ATG3', 4).gs()))

5344
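
The growth of the neighborhood with the number of steps follows from how a breadth-first expansion works. The sketch below shows this on a toy adjacency dictionary; the neighborhood function here is a hypothetical plain-Python helper and the node names are placeholders, so it only illustrates the principle rather than what pypath or igraph do internally.

```python
def neighborhood(adjacency, start, order):
    # Breadth-first expansion: after `order` rounds, `seen` holds
    # every node reachable from `start` in at most `order` steps.
    seen = {start}
    frontier = {start}
    for _ in range(order):
        frontier = {
            neighbor
            for node in frontier
            for neighbor in adjacency.get(node, ())
        } - seen
        seen |= frontier
    return seen


# Toy undirected network given as an adjacency dictionary
adjacency = {
    'A': ['B', 'C'],
    'B': ['A', 'D'],
    'C': ['A', 'E'],
    'D': ['B', 'F'],
    'E': ['C'],
    'F': ['D'],
}

# Neighborhood sizes for 0, 1, 2 and 3 steps from node 'A'
sizes = [len(neighborhood(adjacency, 'A', k)) for k in range(4)]
```

As in the real network, the start node itself is part of the result, and each extra step adds all not yet visited neighbors of the previous layer.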

10: Accessing edges by identifiers

Just like nodes, edges can also be accessed by identifiers like Gene Symbols. get_edge returns an igraph.Edge if the edge exists, otherwise None.

type(pa.get_edge('EGF', 'EGFR'))

igraph.Edge

type(pa.get_edge('EGF', 'P00533'))

igraph.Edge

type(pa.get_edge('EGF', 'AKT1'))

NoneType

print(pa.get_edge('EGF', 'EGFR')['dirs'])

Directions and signs of interaction between P00533 and P01133

	P00533 <=== P01133 :: SPIKE, HPMR, SignaLink3
	P00533 <=+= P01133 :: SPIKE, SignaLink3

11: Literature references

Select a random edge and in the references attribute you find a list of references:

edge = pa.get_edge( 'MAP1LC3B', 'SQSTM1')
edge['references']

[<pypath.refs.Reference at 0x6ee605f6dd98>,
 <pypath.refs.Reference at 0x6ee605f6dd68>]

Each reference has a PubMed ID:

edge['references'][0].pmid

'17580304'

edge['references'][0].open()

The 2 references come from 2 different databases; since the per-database sets below contain 3 entries in total, one reference must be shared between them:

edge['refs_by_source']

{'NRF2ome': {<pypath.refs.Reference at 0x6ee605f6dd98>},
 'ELM': {<pypath.refs.Reference at 0x6ee5fdc8cd98>,
  <pypath.refs.Reference at 0x6ee605f6dd68>}}
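
The overlap between per-database reference sets can be counted with plain set arithmetic. This is a minimal sketch with placeholder labels (ref_a and ref_b are not real PubMed IDs), assuming that two references compare equal when they denote the same paper.

```python
# Placeholder labels stand in for reference objects
refs_by_source = {
    'NRF2ome': {'ref_a'},
    'ELM': {'ref_a', 'ref_b'},
}

# Unique references across all databases
unique_refs = set().union(*refs_by_source.values())

# Entries counted once per database that reports them
total_entries = sum(len(refs) for refs in refs_by_source.values())

# Number of entries that repeat a reference already seen elsewhere
shared = total_entries - len(unique_refs)
```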

12: Translating identifiers

The pypath.mapping module is for ID translation. Most of the time you can simply call the map_name method:

from pypath import mapping
mapping.map_name('P00533', 'uniprot', 'genesymbol')

{'EGFR'}

mapping.map_name('8408', 'entrez', 'uniprot')

{'O75385'}

A number of mapping tables are predefined and loaded automatically. However, pypath does not translate in 2 steps if no direct translation table is available. For example, you can translate Entrez to Gene Symbol via UniProt this way:

mapping.map_names(
    mapping.map_name('8408', 'entrez', 'uniprot'),
    'uniprot',
    'genesymbol',
)

{'ULK1'}

By default the map_name function returns a set because it accounts for ambiguous mapping. However, most often the ID translation is unambiguous and you want to retrieve only one ID. The map_name0 function returns a string; in case of ambiguity it returns a random element from the resulting set:

mapping.map_name0('GABARAPL3', 'genesymbol', 'uniprot')

'Q9BY60'
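
The set-based translation, the two-step chaining and the single-value shortcut described in this section can be mimicked with plain dictionaries. The functions below are toy re-implementations with hypothetical names (toy_map_name and so on), and the tables contain only identifiers that appear in this tutorial; none of this is pypath code.

```python
# Toy translation tables, keyed by (source ID type, target ID type)
tables = {
    ('entrez', 'uniprot'): {'8408': {'O75385'}},
    ('uniprot', 'genesymbol'): {'O75385': {'ULK1'}, 'P00533': {'EGFR'}},
}


def toy_map_name(name, id_type, target_id_type):
    # A missing table or an unknown name gives an empty set
    table = tables.get((id_type, target_id_type), {})
    return set(table.get(name, set()))


def toy_map_names(names, id_type, target_id_type):
    # Union of the translations of every input name
    result = set()
    for name in names:
        result |= toy_map_name(name, id_type, target_id_type)
    return result


def toy_map_name0(name, id_type, target_id_type):
    # A single arbitrary element, or None if nothing maps
    return next(iter(toy_map_name(name, id_type, target_id_type)), None)


# No direct Entrez to Gene Symbol table exists in this toy setup
direct = toy_map_name('8408', 'entrez', 'genesymbol')

# Going through UniProt works
two_step = toy_map_names(
    toy_map_name('8408', 'entrez', 'uniprot'),
    'uniprot',
    'genesymbol',
)

single = toy_map_name0('P00533', 'uniprot', 'genesymbol')
```

Working with sets in the intermediate step means that an ambiguous first translation is carried through to the second one instead of being silently dropped.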

13: Enzyme-substrate interactions

The pypath.ptm module builds a database of enzyme-substrate interactions.

from pypath import ptm
ptm_db = ptm.get_db()

Here you get a dictionary with pairs of UniProt IDs as keys and lists of special objects representing enzyme-substrate interactions as values:

print(ptm_db.enz_sub[('Q13177', 'P01236')][0])

Alternatively the enzyme-substrate interactions can be assigned to network edges:

pa.load_ptms2()

print(pa.graph.es['ptm'][444][0])

14: Annotations

This module provides various annotations about the function and location of the proteins.

from pypath import annot
a = annot.get_db()

OmniPath contains annotations from 27 resources. These provide various information about the characteristics of the proteins, e.g. their localization or function. The AnnotationTable object loads all annotations by default; optionally you can limit this to certain resources. For example, if you only want to load the pathway membership annotations from SIGNOR, SignaLink, NetPath and KEGG, you can provide the names of the appropriate classes:

pathways = annot.AnnotationTable(
    protein_sources = (
        'SignalinkPathways',
        'KeggPathways',
        'NetpathPathways',
        'SignorPathways',
    )
)

The AnnotationTable object provides methods to query all resources together, or build a boolean array out of them. To see all annotations of one protein:

pathways.all_annotations('P00533')

[SignalinkPathway(pathway='TNF/Apoptosis', core=True),
 SignalinkPathway(pathway='RTK', core=True),
 SignalinkPathway(pathway='WNT', core=True),
 SignalinkPathway(pathway='IIP', core=True),
 KeggPathway(pathway='Proteoglycans in cancer'),
 KeggPathway(pathway='Pathways in cancer'),
 KeggPathway(pathway='Pancreatic cancer'),
 KeggPathway(pathway='Central carbon metabolism in cancer'),
 KeggPathway(pathway='Phospholipase D signaling pathway'),
 KeggPathway(pathway='Human cytomegalovirus infection'),
 KeggPathway(pathway='Oxytocin signaling pathway'),
 KeggPathway(pathway='Hepatocellular carcinoma'),
 KeggPathway(pathway='Bladder cancer'),
 KeggPathway(pathway='Endocrine resistance'),
 KeggPathway(pathway='Prostate cancer'),
 KeggPathway(pathway='Estrogen signaling pathway'),
 KeggPathway(pathway='Breast cancer'),
 KeggPathway(pathway='ErbB signaling pathway'),
 KeggPathway(pathway='Non-small cell lung cancer'),
 KeggPathway(pathway='FoxO signaling pathway'),
 KeggPathway(pathway='Glioma'),
 KeggPathway(pathway='Parathyroid hormone synthesis, secretion and action'),
 KeggPathway(pathway='Regulation of actin cytoskeleton'),
 KeggPathway(pathway='Human papillomavirus infection'),
 KeggPathway(pathway='GnRH signaling pathway'),
 KeggPathway(pathway='Relaxin signaling pathway'),
 KeggPathway(pathway='Adherens junction'),
 KeggPathway(pathway='EGFR tyrosine kinase inhibitor resistance'),
 KeggPathway(pathway='Colorectal cancer'),
 KeggPathway(pathway='HIF-1 signaling pathway'),
 KeggPathway(pathway='Hepatitis C'),
 KeggPathway(pathway='Choline metabolism in cancer'),
 KeggPathway(pathway='Epithelial cell signaling in Helicobacter pylori infection'),
 KeggPathway(pathway='Melanoma'),
 KeggPathway(pathway='Endocytosis'),
 KeggPathway(pathway='Focal adhesion'),
 KeggPathway(pathway='Cushing syndrome'),
 KeggPathway(pathway='Calcium signaling pathway'),
 KeggPathway(pathway='Endometrial cancer'),
 KeggPathway(pathway='Gap junction'),
 NetpathPathway(pathway='Leptin'),
 NetpathPathway(pathway='Epidermal growth factor receptor (EGFR)'),
 NetpathPathway(pathway='Gastrin'),
 NetpathPathway(pathway='Receptor activator of nuclear factor kappa-B ligand (RANKL)'),
 NetpathPathway(pathway='Prolactin'),
 NetpathPathway(pathway='Follicle-stimulating hormone (FSH)'),
 NetpathPathway(pathway='Advanced glycation end-products (AGE/RAGE)'),
 NetpathPathway(pathway='Tumor necrosis factor (TNF) alpha'),
 NetpathPathway(pathway='Androgen receptor (AR)'),
 NetpathPathway(pathway='Alpha6 Beta4 Integrin'),
 SignorPathway(pathway='EGFR'),
 SignorPathway(pathway='Glioblastoma Multiforme'),
 SignorPathway(pathway='PI3K/AKT')]

pathways.create_dataframe = True
pathways.make_dataframe()

pathways.df[:10]

The AnnotationTable object contains the resource specific annotation objects:

a.annots

{'CPAD': <pypath.annot.Cpad at 0x68fbbbff5dd0>,
 'DisGeNet': <pypath.annot.Disgenet at 0x68fb8e004a10>,
 'SignaLink3': <pypath.annot.SignalinkPathways at 0x68fb8e8975d0>,
 'CancerGeneCensus': <pypath.annot.CancerGeneCensus at 0x68fb9b855810>,
 'Matrisome': <pypath.annot.Matrisome at 0x68fb8e853310>,
 'KEGG': <pypath.annot.KeggPathways at 0x68fb9c004fd0>,
 'Integrins': <pypath.annot.Integrins at 0x68fb9a2903d0>,
 'Ramilowski_location': <pypath.annot.Ramilowski2015Location at 0x68fb9a4e6110>,
 'Signor': <pypath.annot.SignorPathways at 0x68fb92faf690>,
 'CancerSEA': <pypath.annot.Cancersea at 0x68fb91f236d0>,
 'CSPA': <pypath.annot.CellSurfaceProteinAtlas at 0x68fbadcf5790>,
 'Membranome': <pypath.annot.Membranome at 0x68fbb8013d10>,
 'Guide2Pharma': <pypath.annot.GuideToPharmacology at 0x68fba9a143d0>,
 'OPM': <pypath.annot.Opm at 0x68fb92f40cd0>,
 'Kirouac2010': <pypath.annot.Kirouac2010 at 0x68fbada4c990>,
 'Zhong2015': <pypath.annot.Zhong2015 at 0x68fbcb89c050>,
 'HPA': <pypath.annot.HumanProteinAtlas at 0x68fbaded3810>,
 'TopDB': <pypath.annot.Topdb at 0x68fbae51e1d0>,
 'Kinases': <pypath.annot.Kinases at 0x68fbc97a8b90>,
 'TFcensus': <pypath.annot.Tfcensus at 0x68fbc2262190>,
 'Adhesome': <pypath.annot.Adhesome at 0x68fbc8506f10>,
 'Ramilowski2015': <pypath.annot.Ramilowski2015 at 0x68fbc227ad90>,
 'Phosphatome': <pypath.annot.Phosphatome at 0x68fbc21cbad0>,
 'Vesiclepedia': <pypath.annot.Vesiclepedia at 0x68fb8e22aad0>,
 'NetPath': <pypath.annot.NetpathPathways at 0x68fb9bec5750>,
 'HGNC': <pypath.annot.Hgnc at 0x68fb96c93d50>,
 'HPMR': <pypath.annot.HumanPlasmaMembraneReceptome at 0x68fbe932b490>,
 'DGIdb': <pypath.annot.Dgidb at 0x68fbc2fc3150>,
 'Exocarta': <pypath.annot.Exocarta at 0x68fbc26a8150>,
 'CellPhoneDB': <pypath.annot.CellPhoneDB at 0x68fbc26a87d0>,
 'Locate': <pypath.annot.Locate at 0x68fb9fff2e90>,
 'Surfaceome': <pypath.annot.Surfaceome at 0x68fb9b13b110>,
 'MatrixDB': <pypath.annot.Matrixdb at 0x68fba4fa0e10>,
 'GO_Intercell': <pypath.annot.GOIntercell at 0x68fba55ae3d0>,
 'HPMR_complex': <pypath.annot.HpmrComplex at 0x68fbc26f35d0>,
 'CORUM_Funcat': <pypath.annot.CorumFuncat at 0x68fb95d2a1d0>,
 'CORUM_GO': <pypath.annot.CorumGO at 0x68fbc832add0>,
 'CellPhoneDB_complex': <pypath.annot.CellPhoneDBComplex at 0x68fbc832edd0>}

For each of these you can query the names of the fields, their possible values and the set of proteins annotated with any combination of the values:

matrisome = a.annots['Matrisome']

matrisome.get_names()

('mainclass', 'subclass', 'subsubclass')

matrisome.get_values('subclass')

{'Collagens',
 'ECM Glycoproteins',
 'ECM Regulators',
 'ECM-affiliated Proteins',
 'Proteoglycans',
 'Secreted Factors',
 'n/a'}

matrisome.get_subset(subclass = 'Collagens')

{'A2A2Y8',
 'A2A352',
 'A2AAS7',
 'A6NCT7',
 'A6NDR9',
 'A6NEQ6',
 'A6NMZ7',
 'A6PVD9',
 'A8MWQ5',
 'A8MXH5',
 'A8TX70',
 'B1AKJ1',
 'B1AKJ3',
 'B4DZ39',
 'B7ZBI4',
 'B7ZBI5',
 'C9JBL3',
 'C9JH44',
 'C9JMN2',
 'C9JNG9',
 'C9JPW4',
 'C9JTN9',
 Complex Collagen type I homotrimer: COMPLEX:P02452,
 Complex HT_DM_Cluster278: COMPLEX:P02452-P02462-P08572-P29400-P53420-Q01955-Q02388-Q14031-Q17RW2-Q8NFW1,
 Complex Collagen type I trimer: COMPLEX:P02452-P08123,
 Complex Collagen type II trimer: COMPLEX:P02458,
 Complex Collagen type XI trimer variant 1: COMPLEX:P02458-P12107-P13942,
 Complex: COMPLEX:P02458-P20908-P25067-P29400,
 Complex: COMPLEX:P02458-P25067-P29400,
 Complex Collagen type III trimer: COMPLEX:P02461,
 Complex: COMPLEX:P02462,
 Complex Collagen type IV trimer variant 1: COMPLEX:P02462-P08572,
 Complex Collagen type XI trimer variant 2: COMPLEX:P05997-P12107,
 Complex Collagen type XI trimer variant 3: COMPLEX:P05997-P12107-P20908,
 Complex Collagen type V trimer variant 1: COMPLEX:P05997-P20908,
 Complex Collagen type V trimer variant 2: COMPLEX:P05997-P20908-P25940,
 Complex: COMPLEX:P08572,
 Complex: COMPLEX:P12109-P12110,
 Complex Collagen type VI trimer: COMPLEX:P12109-P12110-P12111,
 Complex Collagen type IX trimer: COMPLEX:P20849-Q14050-Q14055,
 Complex Collagen type V trimer variant 3: COMPLEX:P20908,
 Complex: COMPLEX:P20908-P25067,
 Complex Collagen type VIII trimer variant 3: COMPLEX:P25067,
 Complex Collagen type VIII trimer variant 1: COMPLEX:P25067-P27658,
 Complex: COMPLEX:P25067-P29400,
 Complex Collagen type VIII trimer variant 2: COMPLEX:P27658,
 Complex Collagen type IV trimer variant 3: COMPLEX:P29400-P53420-Q01955,
 Complex Collagen type IV trimer variant 2: COMPLEX:P29400-Q14031,
 Complex Collagen type XV trimer: COMPLEX:P39059,
 Complex Collagen type XVIII trimer: COMPLEX:P39060,
 Complex: COMPLEX:P53420,
 Complex: COMPLEX:Q01955,
 Complex Collagen type VII trimer: COMPLEX:Q02388,
 Complex Collagen type X trimer: COMPLEX:Q03692,
 Complex Collagen type XIV trimer: COMPLEX:Q05707,
 Complex Collagen type XVI trimer: COMPLEX:Q07092,
 Complex Collagen type XIX trimer: COMPLEX:Q14993,
 Complex Collagen type XXIV trimer: COMPLEX:Q17RW2,
 Complex Collagen type XXVIII trimer: COMPLEX:Q2UY09,
 Complex Collagen type XIII trimer: COMPLEX:Q5TAT6,
 Complex Collagen type XXIII trimer: COMPLEX:Q86Y22,
 Complex Collagen type XXVII trimer: COMPLEX:Q8IZC6,
 Complex Collagen type XXII trimer: COMPLEX:Q8NFW1,
 Complex Collagen type XXVI trimer: COMPLEX:Q96A83,
 Complex Collagen type XXI trimer: COMPLEX:Q96P44,
 Complex Collagen type XII trimer: COMPLEX:Q99715,
 Complex Collagen type XXV trimer, variant 2: COMPLEX:Q9BXS0,
 Complex Collagen type XX trimer: COMPLEX:Q9P218,
 Complex Collagen type XVII trimer: COMPLEX:Q9UMD9,
 'D6R8Y2',
 'D6RGG3',
 'E7ENL6',
 'E7ENY8',
 'E7ES46',
 'E7ES47',
 'E7ES49',
 'E7ES50',
 'E7ES51',
 'E7ES55',
 'E7ES56',
 'E7EX21',
 'E9PAL5',
 'E9PCV6',
 'E9PEG9',
 'E9PNK8',
 'E9PNV9',
 'E9PP49',
 'F5GZK2',
 'F5H3Q5',
 'F5H5K0',
 'F5H851',
 'F8W6Y7',
 'F8W8G8',
 'F8WDM8',
 'G5E987',
 'H0Y393',
 'H0Y3B3',
 'H0Y3B5',
 'H0Y3M9',
 'H0Y409',
 'H0Y420',
 'H0Y4C9',
 'H0Y4P7',
 'H0Y5N9',
 'H0Y935',
 'H0Y940',
 'H0Y991',
 'H0Y998',
 'H0Y9H0',
 'H0Y9R8',
 'H0Y9T2',
 'H0YA33',
 'H0YAE1',
 'H0YAX7',
 'H0YBB2',
 'H0YCZ7',
 'H0YD40',
 'H0YDH6',
 'H0YHM5',
 'H0YHM9',
 'H0YIS1',
 'H7BXM4',
 'H7BXV5',
 'H7BY82',
 'H7BY97',
 'H7BYT9',
 'H7BZB6',
 'H7BZL8',
 'H7BZU0',
 'H7C0M5',
 'H7C381',
 'H7C3F0',
 'H7C3P2',
 'H7C435',
 'H7C457',
 'I3L392',
 'I3L3H7',
 'J3KNM7',
 'J3QT75',
 'J3QT83',
 'P02452',
 'P02458',
 'P02461',
 'P02462',
 'P05997',
 'P08123',
 'P08572',
 'P12107',
 'P12109',
 'P12110',
 'P12111',
 'P13942',
 'P20849',
 'P20908',
 'P25067',
 'P25940',
 'P27658',
 'P29400',
 'P39059',
 'P39060',
 'P53420',
 'Q01955',
 'Q02388',
 'Q03692',
 'Q05707',
 'Q07092',
 'Q14031',
 'Q14050',
 'Q14055',
 'Q14993',
 'Q17RW2',
 'Q2UY09',
 'Q4G0W3',
 'Q4VXW1',
 'Q4VXY6',
 'Q5JVU1',
 'Q5QPC7',
 'Q5QPC8',
 'Q5T1U7',
 'Q5TAT6',
 'Q86Y22',
 'Q8IZC6',
 'Q8NFW1',
 'Q96A83',
 'Q96P44',
 'Q99715',
 'Q9BXS0',
 'Q9P218',
 'Q9UMD9'}
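
The three kinds of queries shown above (field names, possible values, subsets by field value) map naturally onto named tuples. The following sketch shows one way such a store could work. It is an assumption-laden illustration, not pypath's implementation: the helper functions are hypothetical, the field names and the subclass labels come from the outputs above, and everything else in the toy records ('main_1', 'HYPO_1') is a placeholder.

```python
from collections import namedtuple

MatrisomeRecord = namedtuple(
    'MatrisomeRecord',
    ['mainclass', 'subclass', 'subsubclass'],
)

# Toy annotation store: protein ID -> set of records.
# 'main_1' and 'HYPO_1' are placeholders, not real data.
annotations = {
    'P02452': {MatrisomeRecord('main_1', 'Collagens', 'n/a')},
    'P02458': {MatrisomeRecord('main_1', 'Collagens', 'n/a')},
    'HYPO_1': {MatrisomeRecord('main_1', 'Proteoglycans', 'n/a')},
}


def field_names():
    return MatrisomeRecord._fields


def field_values(field):
    # All values of one field across every record
    return {
        getattr(record, field)
        for records in annotations.values()
        for record in records
    }


def subset(**criteria):
    # Proteins having at least one record matching all criteria
    return {
        protein
        for protein, records in annotations.items()
        if any(
            all(getattr(r, f) == v for f, v in criteria.items())
            for r in records
        )
    }


collagens = subset(subclass='Collagens')
```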

15: Inter-cellular signaling roles

pypath does not combine the annotations in the annot module: exactly what goes in comes out. For example, the WNT pathway from Signor and SignaLink won't be merged automatically. However, with the pypath.annot.CustomAnnotation class anyone can do it. For inter-cellular communication categories, the pypath.intercell module combines the data from all the relevant resources and creates categories based on a combination of evidence.

from pypath import intercell

i = intercell.get_db() # this takes quite some time
                    # unless you load annotations from a pickle cache

i

<pypath.intercell.IntercellAnnotation at 0x666c56b9ef90>

i.make_df()

i.df[:10]

i.class_names

{'adhesion',
 'cell_surface',
 'chemokine_ligands_hgnc',
 'ecm',
 'endogenous_ligands_hgnc',
 'extracellular',
 'extracellular_enzyme',
 'extracellular_peptidase',
 'gap_junction',
 'growth_factor_binder',
 'growth_factor_regulator',
 'interleukin_receptors_hgnc',
 'interleukins_hgnc',
 'intracellular',
 'ligand',
 'receptor',
 'secreted',
 'surface_enzyme',
 'surface_ligand',
 'tight_junction',
 'transmembrane',
 'transporter'}

i.children['receptor']

{'receptor',
 'receptor_cellphonedb',
 'receptor_dgidb',
 'receptor_go',
 'receptor_guide2pharma',
 'receptor_hgnc',
 'receptor_hpmr',
 'receptor_kirouac',
 'receptor_ramilowski',
 'receptor_surfaceome'}
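
To illustrate how resource-specific classes can relate to a parent class, the sketch below merges toy member sets by taking their union. The names toy_classes, toy_children and merge_children are hypothetical, the assignment of IDs to resources is made up, and the plain union is an assumption: pypath.intercell may combine its sources in more elaborate ways.

```python
# Toy data: resource-specific classes mapped to sets of UniProt IDs.
# The assignment of IDs to resources is invented for illustration.
toy_classes = {
    'receptor_hpmr': {'P00533', 'P04629'},
    'receptor_go': {'P00533', 'P35916'},
    'receptor_cellphonedb': {'P33032'},
}

# Parent class -> names of its resource-specific child classes.
toy_children = {
    'receptor': {'receptor_hpmr', 'receptor_go', 'receptor_cellphonedb'},
}


def merge_children(parent, children, classes):
    """Return the union of the member sets of all children of `parent`."""
    merged = set()
    for child in children[parent]:
        merged |= classes.get(child, set())
    return merged


toy_receptor = merge_children('receptor', toy_children, toy_classes)
print(sorted(toy_receptor))
```

A union keeps every protein supported by at least one resource; a stricter consensus could require support from several resources instead.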

i.counts()

{'receptor_cellphonedb': 860,
 'receptor_surfaceome': 1563,
 'receptor_go': 594,
 'receptor_hpmr': 1276,
 'receptor_ramilowski': 979,
 'receptor_kirouac': 127,
 'receptor_guide2pharma': 393,
 'interleukin_receptors_hgnc': 78,
 'receptor_hgnc': 78,
 'receptor_dgidb': 979,
 'receptor': 2517,
 'ecm_matrixdb': 656,
 'cell_surface_surfaceome': 3544,
 'cell_surface_go': 907,
 'cell_surface_hpmr': 1276,
 'cell_surface_membranome': 2435,
 'cell_surface_cspa': 2213,
 'cell_surface_cellphonedb': 1159,
 'cell_surface_dgidb': 1384,
 'cell_surface': 6098,
 'ecm_matrisome': 1466,
 'ecm_go': 343,
 'ecm': 1912,
 'ligand_cellphonedb': 848,
 'ligand_go': 468,
 'ligand_hpmr': 393,
 'ligand_ramilowski': 976,
 'ligand_kirouac': 267,
 'ligand_guide2pharma': 441,
 'interleukins_hgnc': 64,
 'endogenous_ligands_hgnc': 367,
 'chemokine_ligands_hgnc': 65,
 'ligand_hgnc': 429,
 'ligand_dgidb': 403,
 'ligand': 1495,
 'intracellular_locate': 15474,
 'intracellular_comppi': 0,
 'intracellular_go': 14982,
 'intracellular': 22020,
 'secreted_locate': 1299,
 'extracellular_locate': 2250,
 'extracellular_surfaceome': 3544,
 'extracellular_matrixdb': 4279,
 'extracellular_membranome': 2435,
 'extracellular_cspa': 2213,
 'extracellular_hpmr': 1669,
 'extracellular_cellphonedb': 848,
 'extracellular': 9246,
 'extracellular_comppi': 0,
 'transmembrane_cellphonedb': 1147,
 'transmembrane_go': 5432,
 'transmembrane_opm': 170,
 'transmembrane_locate': 3098,
 'transmembrane_topdb': 1275,
 'transmembrane': 7022,
 'adhesion_cellphonedb': 0,
 'adhesion_go': 1099,
 'adhesion_matrisome': 109,
 'adhesion_hgnc': 232,
 'adhesion_integrins': 63,
 'adhesion_zhong2015': 676,
 'adhesion_adhesome': 398,
 'adhesion': 1774,
 'surface_enzyme_go': 583,
 'surface_enzyme_surfaceome': 131,
 'surface_enzyme': 630,
 'surface_ligand_go': 120,
 'surface_ligand_cellphonedb': 299,
 'surface_ligand': 384,
 'transporter_surfaceome': 523,
 'transporter_go': 499,
 'transporter_dgidb': 2119,
 'transporter': 2156,
 'extracellular_enzyme': 2597,
 'extracellular_peptidase': 577,
 'growth_factor_binder': 67,
 'growth_factor_regulator': 125,
 'secreted_matrisome': 805,
 'secreted_cellphonedb': 848,
 'secreted': 2362,
 'gap_junction': 33,
 'tight_junction': 130}

i.classes_by_entity('P00533')

{'adhesion',
 'adhesion_go',
 'cell_surface',
 'cell_surface_cellphonedb',
 'cell_surface_cspa',
 'cell_surface_dgidb',
 'cell_surface_go',
 'cell_surface_hpmr',
 'cell_surface_membranome',
 'cell_surface_surfaceome',
 'extracellular',
 'extracellular_cellphonedb',
 'extracellular_cspa',
 'extracellular_enzyme',
 'extracellular_hpmr',
 'extracellular_locate',
 'extracellular_matrixdb',
 'extracellular_membranome',
 'extracellular_surfaceome',
 'growth_factor_binder',
 'intracellular',
 'intracellular_go',
 'intracellular_locate',
 'ligand',
 'ligand_cellphonedb',
 'receptor',
 'receptor_cellphonedb',
 'receptor_go',
 'receptor_guide2pharma',
 'receptor_hpmr',
 'receptor_kirouac',
 'receptor_ramilowski',
 'receptor_surfaceome',
 'secreted',
 'secreted_cellphonedb',
 'secreted_locate',
 'transmembrane',
 'transmembrane_cellphonedb',
 'transmembrane_go',
 'transmembrane_opm',
 'transmembrane_topdb'}
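
A lookup by entity like the one above can be thought of as the inverse of the class-to-members mapping. The following is a minimal sketch with toy data; invert_classes and the toy dictionaries are hypothetical names, and this is not necessarily how classes_by_entity is implemented in pypath.

```python
from collections import defaultdict

# Toy class -> members mapping; the memberships are invented.
toy_members = {
    'receptor': {'P00533', 'P35916'},
    'transmembrane': {'P00533'},
    'ligand': {'P01133'},
}


def invert_classes(classes):
    """Build an entity -> set of class names mapping."""
    by_entity = defaultdict(set)
    for name, members in classes.items():
        for entity in members:
            by_entity[entity].add(name)
    return dict(by_entity)


toy_by_entity = invert_classes(toy_members)
print(sorted(toy_by_entity['P00533']))
```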

i.class_labels

{'receptor_cellphonedb': 'Receptor',
 'receptor_surfaceome': 'Receptor',
 'receptor_go': 'Receptor',
 'receptor_hpmr': 'Receptor',
 'receptor_ramilowski': 'Receptor',
 'receptor_kirouac': 'Receptor',
 'receptor_guide2pharma': 'Receptor',
 'interleukin_receptors_hgnc': 'Interleukin receptors (HGNC)',
 'receptor_hgnc': 'Receptor',
 'receptor_dgidb': 'Receptor',
 'receptor': 'Receptor',
 'ecm_matrixdb': 'Extracellular matrix',
 'cell_surface_surfaceome': 'Cell surface',
 'cell_surface_go': 'Cell surface',
 'cell_surface_hpmr': 'Cell surface',
 'cell_surface_membranome': 'Cell surface',
 'cell_surface_cspa': 'Cell surface',
 'cell_surface_cellphonedb': 'Cell surface',
 'cell_surface_dgidb': 'Cell surface',
 'cell_surface': 'Cell surface',
 'ecm_matrisome': 'Extracellular matrix',
 'ecm_go': 'Extracellular matrix',
 'ecm': 'Extracellular matrix',
 'ligand_cellphonedb': 'Ligand',
 'ligand_go': 'Ligand',
 'ligand_hpmr': 'Ligand',
 'ligand_ramilowski': 'Ligand',
 'ligand_kirouac': 'Ligand',
 'ligand_guide2pharma': 'Ligand',
 'interleukins_hgnc': 'Interleukins (HGNC)',
 'endogenous_ligands_hgnc': 'Endogenous ligands (HGNC)',
 'chemokine_ligands_hgnc': 'Chemokine ligands (HGNC)',
 'ligand_hgnc': 'Ligand',
 'ligand_dgidb': 'Ligand',
 'ligand': 'Ligand',
 'intracellular_locate': 'Intracellular',
 'intracellular_comppi': 'Intracellular',
 'intracellular_go': 'Intracellular',
 'intracellular': 'Intracellular',
 'secreted_locate': 'Secreted',
 'extracellular_locate': 'Extracellular',
 'extracellular_surfaceome': 'Extracellular',
 'extracellular_matrixdb': 'Extracellular',
 'extracellular_membranome': 'Extracellular',
 'extracellular_cspa': 'Extracellular',
 'extracellular_hpmr': 'Extracellular',
 'extracellular_cellphonedb': 'Extracellular',
 'extracellular': 'Extracellular',
 'extracellular_comppi': 'Extracellular',
 'transmembrane_cellphonedb': 'Transmembrane',
 'transmembrane_go': 'Transmembrane',
 'transmembrane_opm': 'Transmembrane',
 'transmembrane_locate': 'Transmembrane',
 'transmembrane_topdb': 'Transmembrane',
 'transmembrane': 'Transmembrane',
 'adhesion_cellphonedb': 'Adhesion',
 'adhesion_go': 'Adhesion',
 'adhesion_matrisome': 'Adhesion',
 'adhesion_hgnc': 'Adhesion',
 'adhesion_integrins': 'Adhesion',
 'adhesion_zhong2015': 'Adhesion',
 'adhesion_adhesome': 'Adhesion',
 'adhesion': 'Adhesion',
 'surface_enzyme_go': 'Surface enzyme',
 'surface_enzyme_surfaceome': 'Surface enzyme',
 'surface_enzyme': 'Surface enzyme',
 'surface_ligand_go': 'Surface ligand',
 'surface_ligand_cellphonedb': 'Surface ligand',
 'surface_ligand': 'Surface ligand',
 'transporter_surfaceome': 'Transporter',
 'transporter_go': 'Transporter',
 'transporter_dgidb': 'Transporter',
 'transporter': 'Transporter',
 'extracellular_enzyme': 'Extracellular enzyme',
 'extracellular_peptidase': 'Extracellular peptidase',
 'growth_factor_binder': 'Growth factor binder',
 'growth_factor_regulator': 'Growth factor regulator',
 'secreted_matrisome': 'Secreted',
 'secreted_cellphonedb': 'Secreted',
 'secreted': 'Secreted',
 'gap_junction': 'Gap junction',
 'tight_junction': 'Tight junction'}

list(i.classes['adhesion'])[:10]

['Q9HBL0',
 Complex: COMPLEX:P05556-P08195-P08648-P23229,
 'O75509',
 'P13747',
 'Q9ULB1',
 Complex: COMPLEX:O94813-Q9Y6N7,
 'Q07092',
 'Q8NFY4',
 'P25105',
 'P35222']

16: Gene Ontology ¶

pypath.go is an almost standalone module for managing the Gene Ontology tree and annotations. The main objects here are GeneOntology and GOAnnotation. The former represents the ontology tree, i.e. the terms and their relationships; the latter represents the assignment of terms to gene products. Both provide many versatile methods for querying.

from pypath import go
goa = go.GOAnnotation()

goa.ontology # the GeneOntology object

<pypath.go.GeneOntology at 0x6ad3e1951cd0>

goa # the GOAnnotation object

<pypath.go.GOAnnotation at 0x6ad3e1999610>

Among many others, the most versatile method is select, which selects the annotated gene products by expressions built from GO terms or IDs. It understands AND, OR, NOT and parentheses.

query = """(cell surface OR
        external side of plasma membrane OR
        extracellular region) AND
        (regulation of transmembrane transporter activity OR
        channel regulator activity)"""
result = goa.select(query)
print(list(result)[:7])

['P80108', 'Q16623', 'Q07699', 'Q92913', 'Q8NBP7', 'Q9UKS6', 'Q9UEU0']
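
Conceptually, such a query corresponds to set operations on the gene products annotated with each term: OR is a union, AND is an intersection and NOT is a difference from all annotated products. The sketch below evaluates the same expression on toy data; toy_annotated and its single-letter members are hypothetical, and this only illustrates the logic, not the parser behind goa.select.

```python
# Toy annotation: GO term name -> set of gene products (placeholders).
toy_annotated = {
    'cell surface': {'A', 'B'},
    'external side of plasma membrane': {'B', 'C'},
    'extracellular region': {'D'},
    'regulation of transmembrane transporter activity': {'B', 'D', 'E'},
    'channel regulator activity': {'C', 'F'},
}

# (cell surface OR external side of plasma membrane OR extracellular region)
toy_location = (
    toy_annotated['cell surface']
    | toy_annotated['external side of plasma membrane']
    | toy_annotated['extracellular region']
)

# (regulation of transmembrane transporter activity OR channel regulator activity)
toy_function = (
    toy_annotated['regulation of transmembrane transporter activity']
    | toy_annotated['channel regulator activity']
)

# The two parenthesised groups are joined by AND.
toy_result = toy_location & toy_function

# NOT is a difference from the set of all annotated gene products.
toy_all = set().union(*toy_annotated.values())
toy_not_surface = toy_all - toy_annotated['cell surface']

print(sorted(toy_result))
```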

goa.ontology.get_all_descendants('GO:0005576')

{'GO:0001507',
 'GO:0001527',
 'GO:0003351',
 'GO:0003355',
 'GO:0005201',
 'GO:0005576',
 'GO:0005577',
 'GO:0005582',
 'GO:0005583',
 'GO:0005584',
 'GO:0005585',
 'GO:0005586',
 'GO:0005587',
 'GO:0005588',
 'GO:0005590',
 'GO:0005591',
 'GO:0005592',
 'GO:0005595',
 'GO:0005596',
 'GO:0005599',
 'GO:0005601',
 'GO:0005602',
 'GO:0005604',
 'GO:0005606',
 'GO:0005607',
 'GO:0005608',
 'GO:0005609',
 'GO:0005610',
 'GO:0005611',
 'GO:0005612',
 'GO:0005614',
 'GO:0005615',
 'GO:0005616',
 'GO:0006858',
 'GO:0006859',
 'GO:0006860',
 'GO:0009519',
 'GO:0010367',
 'GO:0016914',
 'GO:0016942',
 'GO:0020003',
 'GO:0020004',
 'GO:0020005',
 'GO:0020006',
 'GO:0030020',
 'GO:0030021',
 'GO:0030023',
 'GO:0030197',
 'GO:0030345',
 'GO:0030934',
 'GO:0030935',
 'GO:0030938',
 'GO:0031012',
 'GO:0031395',
 'GO:0032311',
 'GO:0032579',
 'GO:0033165',
 'GO:0033166',
 'GO:0034358',
 'GO:0034359',
 'GO:0034360',
 'GO:0034361',
 'GO:0034362',
 'GO:0034363',
 'GO:0034364',
 'GO:0034365',
 'GO:0034366',
 'GO:0034385',
 'GO:0035182',
 'GO:0035183',
 'GO:0035323',
 'GO:0035324',
 'GO:0035581',
 'GO:0035582',
 'GO:0035583',
 'GO:0036117',
 'GO:0038098',
 'GO:0038101',
 'GO:0038105',
 'GO:0042567',
 'GO:0042568',
 'GO:0042571',
 'GO:0042627',
 'GO:0043083',
 'GO:0043230',
 'GO:0043245',
 'GO:0043256',
 'GO:0043257',
 'GO:0043258',
 'GO:0043259',
 'GO:0043260',
 'GO:0043261',
 'GO:0043263',
 'GO:0043264',
 'GO:0043509',
 'GO:0043510',
 'GO:0043511',
 'GO:0043512',
 'GO:0043513',
 'GO:0043514',
 'GO:0043655',
 'GO:0044420',
 'GO:0044421',
 'GO:0045171',
 'GO:0045172',
 'GO:0048046',
 'GO:0048180',
 'GO:0048183',
 'GO:0055039',
 'GO:0060102',
 'GO:0060103',
 'GO:0060104',
 'GO:0060105',
 'GO:0060106',
 'GO:0060107',
 'GO:0060108',
 'GO:0060109',
 'GO:0060110',
 'GO:0060111',
 'GO:0060287',
 'GO:0061696',
 'GO:0061701',
 'GO:0061800',
 'GO:0061801',
 'GO:0062023',
 'GO:0062039',
 'GO:0062040',
 'GO:0065010',
 'GO:0070062',
 'GO:0070289',
 'GO:0070505',
 'GO:0070645',
 'GO:0070701',
 'GO:0070702',
 'GO:0070703',
 'GO:0070743',
 'GO:0070744',
 'GO:0070745',
 'GO:0071736',
 'GO:0071739',
 'GO:0071743',
 'GO:0071746',
 'GO:0071748',
 'GO:0071749',
 'GO:0071750',
 'GO:0071751',
 'GO:0071752',
 'GO:0071754',
 'GO:0071756',
 'GO:0071757',
 'GO:0071914',
 'GO:0071953',
 'GO:0072534',
 'GO:0072562',
 'GO:0072563',
 'GO:0085026',
 'GO:0085036',
 'GO:0085040',
 'GO:0090658',
 'GO:0090660',
 'GO:0090733',
 'GO:0097058',
 'GO:0097059',
 'GO:0097189',
 'GO:0097311',
 'GO:0097312',
 'GO:0097313',
 'GO:0097579',
 'GO:0097619',
 'GO:0097691',
 'GO:0098549',
 'GO:0098595',
 'GO:0098642',
 'GO:0098643',
 'GO:0098644',
 'GO:0098645',
 'GO:0098646',
 'GO:0098648',
 'GO:0098651',
 'GO:0098652',
 'GO:0098774',
 'GO:0098875',
 'GO:0098965',
 'GO:0098966',
 'GO:0099126',
 'GO:0099535',
 'GO:0099544',
 'GO:0120197',
 'GO:0150043',
 'GO:1900115',
 'GO:1900116',
 'GO:1903561',
 'GO:1990318',
 'GO:1990323',
 'GO:1990324',
 'GO:1990325',
 'GO:1990326',
 'GO:1990338',
 'GO:1990339',
 'GO:1990340',
 'GO:1990341',
 'GO:1990377',
 'GO:1990562',
 'GO:1990563',
 'GO:1990742',
 'GO:1990971',
 'GO:1990972'}
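
Collecting all descendants of a term amounts to a traversal of the ontology tree. Note that in the output above the queried term itself (GO:0005576) is part of the result. Here is a minimal sketch on a toy tree; toy_go_children and all_descendants are hypothetical names, and the parent-child links are simplified assumptions rather than the actual ontology.

```python
# Toy ontology: term -> direct children (links simplified for illustration).
toy_go_children = {
    'GO:0005576': {'GO:0005615', 'GO:0031012'},
    'GO:0031012': {'GO:0062023'},
    'GO:0005615': set(),
    'GO:0062023': set(),
}


def all_descendants(term, children):
    """Return `term` and all terms reachable from it via child links."""
    found = {term}
    stack = [term]
    while stack:
        current = stack.pop()
        for child in children.get(current, ()):
            if child not in found:
                found.add(child)
                stack.append(child)
    return found


toy_descendants = all_descendants('GO:0005576', toy_go_children)
print(sorted(toy_descendants))
```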

17: Protein complexes ¶

The pypath.complex module builds a non-redundant list of complexes from 10 original resources. Complexes are unique considering their set of components, and optionally carry stoichiometry information.

from pypath import complex
complexdb = complex.get_db()
complexdb.update_index()

complexdb

<pypath.complex.ComplexAggregator at 0x6ad441788e50>

To retrieve all complexes containing a specific protein, here MTOR:

complexdb.proteins['P42345']

{Complex: COMPLEX:O00141-O15530-O75879-P23443-P34931-P42345-Q6R327-Q8N122-Q9BPZ7-Q9BVC4-Q9H672,
 Complex: COMPLEX:O00141-O15530-P07900-P23443-P31749-P31751-P42345-P78527-Q05513-Q05655-Q6R327-Q8N122-Q9BPZ7-Q9BVC4,
 Complex: COMPLEX:O00141-O15530-P0CG47-P0CG48-P23443-P42345-Q15118-Q6R327-Q8N122-Q96BR1-Q9BPZ7-Q9BVC4,
 Complex: COMPLEX:O00141-O15530-P23443-P42345-Q15118-Q6R327-Q8N122-Q96BR1-Q96J02-Q9BPZ7-Q9BVC4,
 Complex: COMPLEX:O00141-O75879-P0CG48-P23443-P34931-P42345-P62753-Q6R327-Q8N122-Q9BPZ7-Q9BVC4-Q9NY26,
 Complex: COMPLEX:O00141-P0CG48-P23443-P36894-P42345-P62942-P68106-Q15427-Q6R327-Q8N122-Q9BPZ7-Q9BVC4,
 Complex: COMPLEX:O00141-P0CG48-P23443-P42345-P46781-P62753-Q6R327-Q8N122-Q96KQ7-Q9BPZ7-Q9BVC4-Q9NY26,
 Complex: COMPLEX:O00141-P0CG48-P23443-P42345-P62753-P62942-Q6R327-Q8N122-Q9BPZ7-Q9BVC4-Q9NY26,
 Complex: COMPLEX:O00141-P0CG48-P23443-P42345-P62753-Q15172-Q6R327-Q8IW41-Q9BPZ7-Q9BVC4-Q9H672,
 Complex: COMPLEX:O00141-P0CG48-P23443-P42345-P62753-Q6R327-Q70Z35-Q8N122-Q8TCU6-Q9BPZ7-Q9BVC4-Q9NY26,
 Complex: COMPLEX:O00141-P0CG48-P23443-P42345-Q13393-Q15382-Q6R327-Q8N122-Q9BPZ7-Q9BVC4-Q9NY26,
 Complex: COMPLEX:O00141-P0CG48-P23443-P42345-Q5VT52-Q6R327-Q8N122-Q9BPZ7-Q9BVC4-Q9NY26-Q9UBS3,
 Complex: COMPLEX:O00141-P23443-P42345-Q6R327-Q7L523-Q8N122-Q9BPZ7-Q9BVC4-Q9HB90-Q9NY26,
 Complex: COMPLEX:O00303-O15371-O15372-O75821-P06730-P23443-P42345-P55884-Q13542-Q6R327-Q7L2H7-Q8N122-Q9BVC4-Q9UBQ5-Q9Y262,
 Complex: COMPLEX:O00303-O15372-O75821-P23443-P42345-P55884-P62753-Q6R327-Q7L2H7-Q8N122-Q9BVC4-Q9UBQ5-Q9Y262,
 Complex phosphatidylinositol 3-kinase complex: COMPLEX:O00329-O00443-O00459-O00750-O75747-P42336-P42338-P42345-P48736-Q8NEB9-Q8WYR1-Q92569,
 Complex: COMPLEX:O15350-O43156-O95619-P04637-P42345-Q71UI9-Q92993-Q9H0E9-Q9NPF5-Q9UBU8-Q9Y230-Q9Y265-Q9Y4A5,
 Complex Yy1-Ppargc1a-Frap1 complex: COMPLEX:O15391-P25490-P42345-Q9UBK2,
 Complex: COMPLEX:O15530-O95782-P07900-P23443-P31749-P31751-P42345-P78527-Q04759-Q05513-Q05655-Q9Y243,
 Complex: COMPLEX:O15530-P06730-P23443-P42345-Q13542-Q6R327-Q8N122-Q9BVC4-Q9UBS0,
 Complex: COMPLEX:O15530-P06730-P23443-P42345-Q13542-Q6R327-Q8N122-Q9BVC4-Q9Y243,
 Complex: COMPLEX:O15530-P0CG48-P23443-P42345-P52736-Q6R327-Q8N122-Q96BR1-Q9BPZ7-Q9BVC4-Q9NY26,
 Complex: COMPLEX:O15530-P23443-P42345-P62753-Q6R327-Q8N122-Q96BR1-Q9BPZ7-Q9BVC4-Q9NY26-Q9UBS0,
 Complex: COMPLEX:O15530-P23443-P42345-Q6R327-Q8N122-Q96BR1-Q96J02-Q9BPZ7-Q9BVC4-Q9NY26,
 Complex: COMPLEX:O15530-P23443-P42345-Q6R327-Q8N122-Q96BR1-Q9BPZ7-Q9BVC4-Q9HBY8-Q9NY26,
 Complex: COMPLEX:O43156-O75925-P36508-P42345-Q9BVM2-Q9H6T3-Q9Y230-Q9Y265,
 Complex: COMPLEX:O43156-O95467-P42345-P61254-Q9BVM2-Q9H6T3-Q9Y230-Q9Y265,
 Complex: COMPLEX:O43156-O95619-P42345-Q71UI9-Q92993-Q9H0E9-Q9H6T3-Q9NPF5-Q9UBU8-Q9Y230-Q9Y265-Q9Y4A5,
 Complex: COMPLEX:O43156-O95831-P42345-Q6NUQ4-Q9BVM2-Q9H6T3-Q9Y230-Q9Y265,
 Complex: COMPLEX:O43156-P11766-P42345-Q6NXG1-Q96CM8-Q9BVM2-Q9H6T3-Q9Y230-Q9Y265,
 Complex: COMPLEX:O43156-P11766-P42345-Q96CM8-Q9BTY7-Q9BVM2-Q9H6T3-Q9Y230-Q9Y265,
 Complex: COMPLEX:O43156-P19388-P42345-P46934-Q6ZTN6-Q9BVM2-Q9H6T3-Q9Y230-Q9Y265-Q9Y4A5,
 Complex: COMPLEX:O43156-P20226-P36508-P42345-Q9BVM2-Q9H6T3-Q9Y230-Q9Y265,
 Complex: COMPLEX:O43156-P25490-P42345-Q6PI98-Q8NBZ0-Q9H6T3-Q9H981-Q9H9F9-Q9Y230-Q9Y265-Q9Y5K5,
 Complex: COMPLEX:O43156-P30533-P42345-Q14677-Q9H6T3-Q9Y230-Q9Y265,
 Complex: COMPLEX:O43156-P42345-P54278-Q6NUQ4-Q9BVM2-Q9H6T3-Q9Y230-Q9Y265,
 Complex: COMPLEX:O43156-P42345-P63104-Q6NXG1-Q9BVM2-Q9H6T3-Q9NX40-Q9Y230-Q9Y265,
 Complex: COMPLEX:O43156-P42345-Q14677-Q9H6T3-Q9Y230-Q9Y265-Q9Y4A5,
 Complex: COMPLEX:O43156-P42345-Q15029-Q96MX6-Q9H6T3-Q9Y230-Q9Y265,
 Complex: COMPLEX:P05387-P0CG48-P18124-P18621-P42345-P47914-P61254-P62750-P62899-Q02543-Q02878-Q70Z35-Q9NY93-Q9Y3U8,
 Complex: COMPLEX:P06730-P23443-P42345-P55884-Q13542-Q6R327-Q8N122-Q9BVC4,
 Complex: COMPLEX:P06730-P23443-P42345-Q13542-Q15208-Q6R327-Q8N122-Q9BVC4,
 Complex: COMPLEX:P0CG48-P23443-P42345-P62753-Q6R327-Q8N122-Q96BR1-Q96J02-Q9BPZ7-Q9BVC4-Q9NY26,
 Complex: COMPLEX:P15056-P23443-P42345-P49815-P62258-P62834-P63104-Q15382-Q6R327-Q8N122-Q9BVC4,
 Complex: COMPLEX:P23443-P42345,
 Complex: COMPLEX:P42345,
 Complex TORC2 complex: COMPLEX:P42345-P62750-Q6R327,
 Complex FKBP12-FK506 complex: COMPLEX:P42345-P62942,
 Complex: COMPLEX:P42345-P83436-Q14746-Q8WTW3-Q96JB2-Q96MW5-Q9H9E3-Q9UP83-Q9Y2V7,
 Complex: COMPLEX:P42345-Q00688,
 Complex: COMPLEX:P42345-Q02790,
 Complex: COMPLEX:P42345-Q13451,
 Complex: COMPLEX:P42345-Q13535-Q92616-Q9UIA9,
 Complex: COMPLEX:P42345-Q13535-Q96QU8,
 Complex: COMPLEX:P42345-Q13535-Q96QU8-Q9UIA9,
 Complex: COMPLEX:P42345-Q13541-Q15382-Q8N122-Q9BVC4,
 Complex: COMPLEX:P42345-Q13541-Q8N122-Q9BVC4,
 Complex TORC 2 complex: COMPLEX:P42345-Q3KP44-Q6R327-Q9BPZ7-Q9BVC4,
 Complex NSC: COMPLEX:P42345-Q6R327-Q8N122-Q9BVC4,
 Complex mTORC2 complex: COMPLEX:P42345-Q6R327-Q9BPZ7-Q9BVC4,
 Complex mTOR complex (MTOR, RICTOR, MLST8): COMPLEX:P42345-Q6R327-Q9BVC4,
 Complex mTOR complex (MTOR, RAPTOR): COMPLEX:P42345-Q8N122,
 Complex mTORC1: COMPLEX:P42345-Q8N122-Q8TB45-Q96B36-Q9BVC4,
 Complex TORC1 complex: COMPLEX:P42345-Q8N122-Q96B36,
 Complex mTOR complex (MTOR, RAPTOR, MLST8): COMPLEX:P42345-Q8N122-Q9BVC4,
 Complex: COMPLEX:P42345-Q96B36-Q9BVC4,
 Complex: COMPLEX:P42345-Q9BVC4,
 Complex: COMPLEX:P42345-Q9BVC4-Q9UJ68}

Note that some of the complexes have human-readable names. These are preferred for printing if available from any of the databases; otherwise the complexes are labelled by COMPLEX:list-of-components.

Take a closer look at one complex object. The hash of the complex is equivalent to the hash of its string representation below, where the UniProt IDs are unique and alphabetically sorted. Hence you can look up complexes using strings as keys, even though the dict keys are in fact pypath.intera.Complex objects:

cplex = complexdb.complexes['COMPLEX:P42345-Q13451']

cplex.components # stoichiometry

{'Q13451': 2, 'P42345': 2}

cplex.sources # resources

{'PDB'}
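
The string-keyed lookup described above can be reproduced in a few lines. The sketch below defines a hypothetical ToyComplex class whose hash and equality are derived from the sorted component IDs; it only illustrates the idea and is not the implementation of pypath.intera.Complex.

```python
class ToyComplex:
    """Toy complex identified by its sorted, unique component IDs."""

    def __init__(self, components):
        # components: dict of UniProt ID -> stoichiometry
        self.components = dict(components)

    def __str__(self):
        # Dict keys are unique, sorting makes the label canonical.
        return 'COMPLEX:' + '-'.join(sorted(self.components))

    def __hash__(self):
        # Same hash as the string label, so both find the same dict slot.
        return hash(str(self))

    def __eq__(self, other):
        # Comparing string forms lets a plain string match the object.
        return str(self) == str(other)


toy_cplex = ToyComplex({'Q13451': 2, 'P42345': 2})
toy_complexes = {toy_cplex: toy_cplex}

# The key is a ToyComplex object, yet a string finds it.
toy_found = toy_complexes['COMPLEX:P42345-Q13451']
print(toy_found, toy_found.components)
```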

18: Saving datasets as pickles ¶

The large datasets above are compiled from many resources. Even if these are already available in the cache, the data processing often takes longer than convenient, e.g. a few minutes. Most of the data integration objects in pypath provide methods to save and load their contents as pickle dumps.

# for `pypath.main.PyPath` objects:
pa.save_network('mynetwork.pickle') # save
pa.init_network(pfile = 'mynetwork.pickle') # load
# for `pypath.annot.AnnotationTable` objects:
a.save_to_pickle('myannots.pickle')
a = annot.AnnotationTable(pickle_file = 'myannots.pickle')
# for `pypath.complex.ComplexAggregator` objects:
complexdb.save_to_pickle('mycomplexes.pickle')
complexdb = complex.ComplexAggregator(pickle_file = 'mycomplexes.pickle')
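
Under the hood, saving to and loading from a pickle dump is a round trip through Python's standard pickle module. The sketch below shows this round trip on a toy object; toy_dataset and the temporary file path are assumptions for illustration, and the pypath methods above may store additional state.

```python
import os
import pickle
import tempfile

# Toy dataset standing in for a large, slowly built object.
toy_dataset = {'COMPLEX:P42345-Q13451': {'Q13451': 2, 'P42345': 2}}

# Write into a fresh temporary directory to avoid overwriting anything.
toy_path = os.path.join(tempfile.mkdtemp(), 'toy_dataset.pickle')

with open(toy_path, 'wb') as fp:
    pickle.dump(toy_dataset, fp)  # save

with open(toy_path, 'rb') as fp:
    toy_restored = pickle.load(fp)  # load

print(toy_restored == toy_dataset)
```

Only load pickle files from sources you trust, because unpickling can execute arbitrary code.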

19: Network in pandas.DataFrame ¶

The original implementation of the network in pypath is based on igraph. Work is ongoing to provide a new and more flexible network builder which will produce a pandas.DataFrame and make pypath independent of igraph. As a temporary solution, you can easily convert the network to a pandas.DataFrame using the pypath.network module.

from pypath import main
from pypath import data_formats
from pypath import network

pa = main.PyPath()
pa.init_network(data_formats.pathway_all)

net = network.Network.from_igraph(pa)

net.records[:10]

20: Log messages and sessions ¶

pypath now has an improved logger. All modules send messages to a log file which is named by default after the session ID (a random string of 5 characters). The default path to the log file is ./pypath_log/pypath-xxxxx.log, where xxxxx is the session ID. When you import pypath, the welcome message tells you the session ID and the log file location.

import pypath

By default this is the only message pypath prints directly to the console; all other messages go only to the log. Here is how you can access the session ID and the logger:

pypath.session_mod.session

pypath.session_mod.session.log.fname

pypath.session_mod.session.label

From your scripts and apps you can also easily send messages to the logfile:

pypath.session_mod.session.log.msg('Greetings from the pypath tutorial notebook! :)')

with open(pypath.session_mod.session.log.fname, 'r') as fp:
    messages = fp.read().split('\n')

print('\n'.join(messages))

If you create a class inheriting from pypath.session_mod.Logger it will be automatically connected to the session logger:

class ChildOfLogger(pypath.session_mod.Logger):
    
    def __init__(self):
        
        pypath.session_mod.Logger.__init__(self, name = 'child')
    
    def say_something(self):
        
        self._log('Have a nice day! :D')


col = ChildOfLogger()
col.say_something()

with open(pypath.session_mod.session.log.fname, 'r') as fp:
    messages = fp.read().split('\n')

print('\n'.join(messages))

Note that the log messages are flushed every 2 seconds by default, but their timestamps always refer to the exact time the message was sent. A second stamp shows the name of the sending submodule or class.
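
For comparison, a similar layout (a log file named after a random 5-character session ID, one stamp for the time and one for the sender) can be sketched with Python's standard logging module. All names below (toy_session_id, toy_logger, the file name pattern) are hypothetical, and unlike pypath's logger this one writes each message immediately instead of flushing periodically.

```python
import logging
import os
import random
import string
import tempfile

# A random 5-character session ID, similar in spirit to pypath's.
toy_session_id = ''.join(
    random.choices(string.ascii_lowercase + string.digits, k=5)
)
toy_log_path = os.path.join(
    tempfile.mkdtemp(), 'toy-%s.log' % toy_session_id
)

# The first stamp is the time of sending, the second names the sender.
toy_handler = logging.FileHandler(toy_log_path)
toy_handler.setFormatter(
    logging.Formatter('[%(asctime)s] [%(name)s] %(message)s')
)

toy_logger = logging.getLogger('toy.child')
toy_logger.setLevel(logging.INFO)
toy_logger.addHandler(toy_handler)
toy_logger.propagate = False  # keep the message off the console

toy_logger.info('Have a nice day! :D')
toy_handler.close()

with open(toy_log_path, 'r') as fp:
    toy_messages = fp.read().splitlines()

print(toy_messages[-1])
```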

Finally see a log from a real pypath session:

from pypath import main
from pypath import data_formats
pa = main.PyPath()
pa.init_network(data_formats.pathway)

with open(pypath.session_mod.session.log.fname, 'r') as fp:
    messages = fp.read().split('\n')

print('\n'.join(messages[-20:]))

21: BEL export ¶

Biological Expression Language (BEL, https://bel-commons.scai.fraunhofer.de/) is a versatile description language for capturing relationships between various biological entities across a wide range of levels of biological organization. pypath has a dedicated module to convert the network and the enzyme-substrate interactions to BEL format:

from pypath import main
from pypath import data_formats
from pypath import bel

pa = main.PyPath()
pa.init_network(data_formats.pathway)

You can provide one or more resources to the Bel class. The currently supported resources are pypath.main.PyPath and pypath.ptm.PtmAggregator.

b = bel.Bel(resource = pa)

From the resources we compile a BELGraph object which provides a Python interface for various operations and you can also export the data in BEL format:

b.main()

<pypath.bel.Bel at 0x6ad3b70cc1d0>

b.bel_graph

<pybel.struct.graph.BELGraph at 0x6ad3b70cc790>

b.bel_graph.summarize()

OmniPath vNone
Number of Nodes: 4927
Number of Edges: 70528
Number of Citations: 11930
Number of Authors: 0
Network Density: 2.91E-03
Number of Components: 84
Number of Warnings: 0

b.export_relationships('omnipath_pathways.bel')

with open('omnipath_pathways.bel', 'r') as fp:
    bel_str = fp.read()

print(bel_str[:333])

Subject	Predicate	Object
P17612	directlyDecreases	P20020
P17612	directlyDecreases	P20020
P17612	directlyIncreases	P20020
P17612	directlyIncreases	P20020
P17612	directlyDecreases	Q14643
P17612	directlyDecreases	Q14643
P17612	directlyDecreases	Q14643
P17612	directlyDecreases	Q14643
P17612	directlyIncreases	Q14643
P17612	directlyIncre
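
The exported file shown above is a tab-separated table with Subject, Predicate and Object columns, and it contains repeated rows. The sketch below parses the complete rows printed above with the standard csv module, counts the predicates and removes duplicates; the variable names are hypothetical, and the embedded sample stands in for reading omnipath_pathways.bel.

```python
import csv
import io
from collections import Counter

# The complete rows from the printed excerpt (truncated last row omitted).
toy_bel = (
    'Subject\tPredicate\tObject\n'
    'P17612\tdirectlyDecreases\tP20020\n'
    'P17612\tdirectlyDecreases\tP20020\n'
    'P17612\tdirectlyIncreases\tP20020\n'
    'P17612\tdirectlyIncreases\tP20020\n'
    'P17612\tdirectlyDecreases\tQ14643\n'
    'P17612\tdirectlyDecreases\tQ14643\n'
    'P17612\tdirectlyDecreases\tQ14643\n'
    'P17612\tdirectlyDecreases\tQ14643\n'
    'P17612\tdirectlyIncreases\tQ14643\n'
)

reader = csv.DictReader(io.StringIO(toy_bel), delimiter='\t')
toy_rows = [
    (row['Subject'], row['Predicate'], row['Object'])
    for row in reader
]

# How often each predicate occurs, duplicates included.
toy_predicates = Counter(predicate for _, predicate, _ in toy_rows)

# Unique relationships only.
toy_unique = sorted(set(toy_rows))

print(toy_predicates)
print(len(toy_unique))
```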

22: CellPhoneDB export ¶

CellPhoneDB is a statistical method and a database for inferring inter-cellular communication pathways between specific cell types from single-cell data. OmniPath/pypath uses CellPhoneDB as a resource for interaction, protein complex and annotation data. Apart from this, pypath is able to export its data in the appropriate format to provide input for the CellPhoneDB Python module. For this you can use the pypath.cellphonedb module:

from pypath import cellphonedb
from pypath import settings

settings.setup(network_expand_complexes = False)

Here you can provide parameters for the network or provide an already built network. You can also provide the datasets as pickles to make them load really fast. Otherwise this step will take quite long.

c = cellphonedb.CellPhoneDB()

You can access each of the CellPhoneDB input files as a pandas.DataFrame, and they have also been exported to csv files. For example, interaction_input.csv contains interactions from all the resources used for building the network (here Signor, SignaLink, etc.):

c.interaction_dataframe[:10]

The proteins and complexes are annotated (transmembrane, peripheral, secreted, etc.) using data from the pypath.intercell module (identical to the http://omnipathdb.org/intercell query of the web service):

c.protein_dataframe[:10]

	id_cp_interaction	partner_a	partner_b	protein_name_a	protein_name_b	annotation_strategy	source
0	CPI-000001	P17612	P20020	KAPCA_HUMAN	AT2B1_HUMAN	CA1,KEGG,OmniPath,Wang	PMID: ,PMID: 9824678
1	CPI-000001	P20020	P17612	AT2B1_HUMAN	KAPCA_HUMAN	KEGG,OmniPath,Wang
2	CPI-000002	P0DP25	P20020	CALM3_HUMAN	AT2B1_HUMAN	CA1,OmniPath	PMID: 6455424
3	CPI-000003	P20020	Q13507	AT2B1_HUMAN	TRPC3_HUMAN	OmniPath,TRIP	PMID: 16887806,PMID: 18205297
4	CPI-000004	Q13976	P20020	KGP1_HUMAN	AT2B1_HUMAN	KEGG,OmniPath
5	CPI-000005	P20020	Q13976	AT2B1_HUMAN	KGP1_HUMAN	KEGG,OmniPath
6	CPI-000006	P0DP24	P20020	CALM2_HUMAN	AT2B1_HUMAN	CA1,OmniPath	PMID: 6455424
7	CPI-000007	P0DP23	P20020	CALM1_HUMAN	AT2B1_HUMAN	CA1,OmniPath,Wang	PMID: 6455424
8	CPI-000008	P20020	P0DP23	AT2B1_HUMAN	CALM1_HUMAN	OmniPath,Wang
9	CPI-000009	Q96QT4	P35579	TRPM7_HUMAN	MYH9_HUMAN	Adhesome,OmniPath,TRIP	PMID: 16407977,PMID: 18394644,PMID: 18675813

	SignaLink3	SignaLink3__	SignaLink3__BCR	SignaLink3__GPCR	SignaLink3__HH	SignaLink3__HIPPO	SignaLink3__Hedgehog	SignaLink3__Hippo	SignaLink3__IIP	SignaLink3__JAK/STAT	...	CellPhoneDB_complex__Cytokine receptor IL3 family	CellPhoneDB_complex__Cytokine receptor IL6 family	CellPhoneDB_complex__Cytokine receptor IL6 family, IL12 subfamily	CellPhoneDB_complex__Cytokine receptor family	CellPhoneDB_complex__Human IgG receptor	CellPhoneDB_complex__Receptor	CellPhoneDB_complex__T cell receptor add	CellPhoneDB_complex__TGFBeta_receptor_add	CellPhoneDB_complex__growth factor receptor	CellPhoneDB_complex__hematopoyetic receptor
A0A024RBG1	False	False	False	False	False	False	False	False	False	False	...	False	False	False	False	False	False	False	False	False	False
A0A075B6H9	False	False	False	False	False	False	False	False	False	False	...	False	False	False	False	False	False	False	False	False	False
A0A075B6I0	False	False	False	False	False	False	False	False	False	False	...	False	False	False	False	False	False	False	False	False	False
A0A075B6I1	False	False	False	False	False	False	False	False	False	False	...	False	False	False	False	False	False	False	False	False	False
A0A075B6I4	False	False	False	False	False	False	False	False	False	False	...	False	False	False	False	False	False	False	False	False	False
A0A075B6I9	False	False	False	False	False	False	False	False	False	False	...	False	False	False	False	False	False	False	False	False	False
A0A075B6J1	False	False	False	False	False	False	False	False	False	False	...	False	False	False	False	False	False	False	False	False	False
A0A075B6J6	False	False	False	False	False	False	False	False	False	False	...	False	False	False	False	False	False	False	False	False	False
A0A075B6J9	False	False	False	False	False	False	False	False	False	False	...	False	False	False	False	False	False	False	False	False	False
A0A075B6K0	False	False	False	False	False	False	False	False	False	False	...	False	False	False	False	False	False	False	False	False	False

	category	uniprot	genesymbol	entity_type
0	receptor_cellphonedb	COMPLEX:Q5KU26	COLEC12	complex
1	receptor_cellphonedb	COMPLEX:Q15223	NECTIN1	complex
2	receptor_cellphonedb	P33032	MC5R	protein
3	receptor_cellphonedb	Q13467	FZD5	protein
4	receptor_cellphonedb	P30495	HLA-B	protein
5	receptor_cellphonedb	P35916	FLT4	protein
6	receptor_cellphonedb	COMPLEX:P04629	NTRK1	complex
7	receptor_cellphonedb	Q9UKP6	UTS2R	protein
8	receptor_cellphonedb	COMPLEX:P27037-P36896-Q13705-Q8NER5	ACVR1B-ACVR1C-ACVR2A-ACVR2B	complex
9	receptor_cellphonedb	COMPLEX:Q9NZQ7	CD274	complex

	uniprot	protein_name	transmembrane	peripheral	secreted	receptor	integrin
0	P55087	AQP4_HUMAN	True	True	False	False	False
1	O43184	ADA12_HUMAN	True	True	True	False	False
2	P24001	IL32_HUMAN	True	False	False	False	False
3	Q92956	TNR14_HUMAN	True	True	False	True	False
4	P54284	CACB3_HUMAN	False	False	False	False	False
5	O60542	PSPN_HUMAN	False	False	True	False	False
6	P48426	PI42A_HUMAN	False	False	True	False	False
7	P81172	HEPC_HUMAN	False	False	True	False	False
8	Q9HD43	PTPRH_HUMAN	True	True	False	True	False
9	P01130	LDLR_HUMAN	True	True	False	True	False