networkx + Cytoscape: building and visualizing network graphs

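In this note:

Basic structure of a network graph

Building the network structure with networkx, using a correlation table as the example

Visualizing the network in Cytoscape

co-XXX network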

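Basic structure of a network graph

A network consists of nodes and the edges connecting them, and it reflects the relationships between the elements the nodes stand for. A network graph gives an intuitive view of how those elements relate to one another.

Nodes:
Node size can represent the number of samples an element contains, the magnitude of a value, and so on;
Node color and shape can represent the element's categorical attributes.
Edges:
Edge width can represent the strength of the association between two elements (for example, the magnitude of a correlation coefficient);
Edge color can represent the direction of the association (for example, the sign of a correlation coefficient), or any category you define yourself;
Edges can carry arrows, which also indicate direction;
Edges can be straight or curved; curved edges let several edges between the same two nodes avoid overlapping.
All of these properties should be chosen according to the scientific question at hand.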
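
As a rough illustration of the mapping described above, the sketch below derives drawing parameters from node and edge attributes. The graph, the attribute names (samples, group, r), the scaling factors and the color choices are all made up for this example.

```python
import networkx as nx

# Toy graph; node names, sample counts and groups are invented for illustration.
G = nx.Graph()
G.add_node("a", samples=30, group="t1")
G.add_node("b", samples=10, group="t1")
G.add_node("c", samples=20, group="t2")
G.add_edge("a", "b", r=0.8)
G.add_edge("b", "c", r=-0.6)

group_colors = {"t1": "yellow", "t2": "orange"}

# Node size from a numeric attribute, node color from a categorical one.
node_size = [G.nodes[n]["samples"] * 20 for n in G.nodes()]
node_color = [group_colors[G.nodes[n]["group"]] for n in G.nodes()]

# Edge width from |r|, edge color from the sign of r.
edge_width = [abs(G[u][v]["r"]) * 5 for u, v in G.edges()]
edge_color = ["red" if G[u][v]["r"] > 0 else "blue" for u, v in G.edges()]

print(node_size, node_color, edge_width, edge_color)
```

These four lists can be passed to nx.draw as node_size, node_color, width and edge_color respectively.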

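Building a network with networkx, using a correlation table as the example

networkx is a Python module. We use it to define the nodes and edges that make up the network, along with their attributes.
Install networkx: $ pip install networkx

Here is a basic example to get a quick feel for it: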

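import networkx as nx
import matplotlib.pyplot as plt
G = nx.Graph()   # create an empty network object

G.add_node('a', group='t1', your_group='your_group1')  # add a node; group is a category attribute, and you can define any attributes you like per node
G.add_node('b', group='t1')
G.add_node('c', group='t2')
print(G.nodes(data=True))
# Nodes are kept in insertion order; each node's attributes are stored in a dict.
# [('a', {'group': 't1', 'your_group': 'your_group1'}), ('b', {'group': 't1'}), ('c', {'group': 't2'})]

G.add_edge('a', 'b', weight=0.5, graphics={'fill': '#CD5C5C'})  # add an edge between a and b with weight 0.5 and fill color #CD5C5C
print(G.edges(data=True))
# [('a', 'b', {'weight': 0.5, 'graphics': {'fill': '#CD5C5C'}})]
# Note how the attributes are stored: node and edge attributes can be set in bulk with a for loop.

node_color = ['yellow', 'yellow', 'orange']
nx.draw(G, node_color=node_color, node_size=200, with_labels=True)
plt.show()
# networkx can be combined with matplotlib to visualize the graph directly in Python

nx.write_gml(G, 'XXX.gml')
# The network structure can be saved as .gml (among other formats) and used as input to Cytoscape
# Also: underscores '_' can make write_gml fail. As far as I can tell, the restriction is on attribute
# names (keys such as your_group) in older networkx releases; if you get an invalid-key error, rename the attribute without '_'.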
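
The comment above mentions setting attributes in bulk with a for loop. A minimal sketch of that idea follows; the node_groups lookup table and the toy weights are assumptions made for this example, and the two colors are the ones used elsewhere in this note.

```python
import networkx as nx

G = nx.Graph()
G.add_nodes_from(["a", "b", "c"])
G.add_edges_from([("a", "b", {"weight": 0.5}), ("b", "c", {"weight": -0.3})])

# Hypothetical lookup table: node name -> group label.
node_groups = {"a": "t1", "b": "t1", "c": "t2"}

# Set a node attribute on every node in a loop.
for node, group in node_groups.items():
    G.nodes[node]["group"] = group

# Set an edge attribute on every edge, based on the sign of its weight.
# G.edges(data=True) yields the live attribute dicts, so the change is stored in G.
for u, v, data in G.edges(data=True):
    data["graphics"] = {"fill": "#CD5C5C" if data["weight"] > 0 else "#4682B4"}

print(G.nodes(data=True))
print(G.edges(data=True))
```

In networkx 2.x and later, nx.set_node_attributes(G, node_groups, "group") should do the same for the node part in a single call.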

The following uses the wine recognition dataset from sklearn as the example.
First build the correlation matrix:

import networkx as nx
import matplotlib.pyplot as plt
from sklearn import datasets
import pandas as pd
from scipy.stats import pearsonr  # returns the Pearson correlation coefficient and its p-value

wine_data = datasets.load_wine()
wine_df = pd.DataFrame(data=wine_data['data'], columns=wine_data['feature_names'])
wine_df.head() # rows are samples, columns are features
#    alcohol  malic_acid   ash  alcalinity_of_ash  magnesium  total_phenols  flavanoids  nonflavanoid_phenols  proanthocyanins  color_intensity   hue  od280/od315_of_diluted_wines  proline
# 0    14.23        1.71  2.43               15.6      127.0           2.80        3.06                  0.28             2.29             5.64  1.04                          3.92   1065.0
# 1    13.20        1.78  2.14               11.2      100.0           2.65        2.76                  0.26             1.28             4.38  1.05                          3.40   1050.0
# 2    13.16        2.36  2.67               18.6      101.0           2.80        3.24                  0.30             2.81             5.68  1.03                          3.17   1185.0
# 3    14.37        1.95  2.50               16.8      113.0           3.85        3.49                  0.24             2.18             7.80  0.86                          3.45   1480.0
# 4    13.24        2.59  2.87               21.0      118.0           2.80        2.69                  0.39             1.82             4.32  1.04                          2.93    735.0

cols = wine_df.columns
corr = pd.DataFrame(index=cols, columns=cols, dtype=float)    # correlation coefficients
corr_p = pd.DataFrame(index=cols, columns=cols, dtype=float)  # p-values
for i in cols:
    for j in cols:
        r, p = pearsonr(wine_df[i], wine_df[j])  # call pearsonr once per pair
        corr.loc[i, j] = r
        corr_p.loc[i, j] = p
corr.head() # a symmetric table, wine_df.columns by wine_df.columns

temp = corr_p.values < 0.05
temp.sum()   # 135: quite a few entries have p < 0.05
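
The double loop can probably be replaced by pandas' built-in DataFrame.corr. The sketch below uses a small synthetic table (made-up data standing in for wine_df) so that it runs without sklearn. One caveat: when a callable is passed as method, pandas fills the diagonal with 1.0, so the p-value table will not match the loop version on the diagonal and the count of p < 0.05 entries will differ.

```python
import numpy as np
import pandas as pd
from scipy.stats import pearsonr

# Small synthetic table (made-up data) standing in for wine_df.
rng = np.random.default_rng(0)
df = pd.DataFrame(rng.normal(size=(50, 3)), columns=["x", "y", "z"])
df["z"] = df["x"] * 0.8 + df["z"] * 0.2  # make x and z correlated

# Coefficients: pandas computes the full Pearson matrix in one call.
corr = df.corr(method="pearson")

# p-values: DataFrame.corr also accepts a callable taking two 1-D arrays.
# pandas sets the diagonal to 1.0 in this case, whatever the callable returns.
corr_p = df.corr(method=lambda a, b: pearsonr(a, b)[1])

print(corr.round(2))
print(corr_p.round(4))
```

Because the diagonal p-values come out as 1.0, self-correlations drop out of any p < 0.05 filter automatically.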

Then build the network structure:

def dataframe_to_tp(corr, row, corr_pvalue):
    '''
    :param corr: corr df
    :param row: a row label such as "alcohol" (a single feature/OTU index, for use in a for loop)
    :param corr_pvalue: corr_p df
    :return: row_list, e.g. [('alcohol', 'ash', cor('alcohol', 'ash'))]
    Rather verbose; if you know a more elegant way, please tell me.
    '''
    row_list = []
    for col in corr.columns:
        if col == row:  # skip self-correlations
            continue
        # keep pairs that are significant (p < 0.05) and reasonably strong (|r| > 0.5)
        if corr_pvalue.loc[row, col] < 0.05 and abs(corr.loc[row, col]) > 0.5:
            row_list.append((row, col, corr.loc[row, col]))
    return row_list

print(dataframe_to_tp(corr, 'alcohol', corr_p))
# [('alcohol', 'color_intensity', 0.5463641950837036), ('alcohol', 'proline', 0.6437200371782135)]

raw_edges_list = [] # [(node1, node2, weight), (...), ()]
for row in corr.index.values:
    raw_edges_list += dataframe_to_tp(corr, row, corr_p)
# At this point raw_edges_list contains duplicate edges, i.e. both (node1, node2, XX) and (node2, node1, XX).
# Everything from here to edge_weight_final only removes those duplicates... fairly ugly; if you know a better way, please tell me...

edge_dict = {}
for tp in raw_edges_list:
    edge_dict[(tp[0], tp[1])] = tp[2]

# Treat each edge as an unordered pair (a set) and keep each pair once.
not_d_edge = []
for i in [set(i) for i in edge_dict.keys()]:
    if i not in not_d_edge:
        not_d_edge.append(i)

# Both orderings are keys of edge_dict, so the lookup works whichever order tuple(edge) yields.
edge_weight_final = []
for edge in not_d_edge:
    final_edge = tuple(edge) + (edge_dict[tuple(edge)],)
    edge_weight_final.append(final_edge)

print(edge_weight_final) # take a look
# [('alcohol', 'color_intensity', 0.5463641950837036),
# ('alcohol', 'proline', 0.6437200371782135),
# ('hue', 'malic_acid', -0.5612956886649448),
# ('total_phenols', 'flavanoids', 0.8645635000951146),
# ('proanthocyanins', 'total_phenols', 0.6124130837800361),
# ('od280/od315_of_diluted_wines', 'total_phenols', 0.699949364791186),
# ('nonflavanoid_phenols', 'flavanoids', -0.5378996119051982),
# ('proanthocyanins', 'flavanoids', 0.6526917686075154),
# ('hue', 'flavanoids', 0.5434785664899897),
# ('od280/od315_of_diluted_wines', 'flavanoids', 0.7871939018669516),
# ('nonflavanoid_phenols', 'od280/od315_of_diluted_wines', -0.5032695960789115),
# ('od280/od315_of_diluted_wines', 'proanthocyanins', 0.5190670956825231),
# ('hue', 'color_intensity', -0.5218131932287576),
# ('od280/od315_of_diluted_wines', 'hue', 0.5654682931826592)]
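
One possible way to avoid the de-duplication step altogether is to visit each unordered pair of columns only once with itertools.combinations. This is a sketch, not the method used above: the function name significant_edges, its parameters alpha and min_abs_r, and the toy tables are all made up for illustration.

```python
from itertools import combinations
import pandas as pd

# Made-up symmetric tables standing in for corr and corr_p.
names = ["a", "b", "c"]
corr = pd.DataFrame([[1.0, 0.7, -0.6], [0.7, 1.0, 0.2], [-0.6, 0.2, 1.0]],
                    index=names, columns=names)
corr_p = pd.DataFrame([[0.0, 0.01, 0.03], [0.01, 0.0, 0.40], [0.03, 0.40, 0.0]],
                      index=names, columns=names)

def significant_edges(corr, corr_p, alpha=0.05, min_abs_r=0.5):
    """Return (node1, node2, r) once per unordered pair that passes both thresholds."""
    edges = []
    for u, v in combinations(corr.columns, 2):  # each pair once, never (u, u)
        r = float(corr.loc[u, v])
        p = float(corr_p.loc[u, v])
        if p < alpha and abs(r) > min_abs_r:
            edges.append((u, v, r))
    return edges

print(significant_edges(corr, corr_p))
```

Since combinations never yields (u, u) or both (u, v) and (v, u), the result can be used directly as the weighted edge list.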



# Use the finished tuple list to build the network:
G = nx.Graph()
# G.add_weighted_edges_from(edge_weight_final) also works, but it keeps the signed weight and sets no color
for tp in edge_weight_final:
    node1 = tp[0]
    node2 = tp[1]
    weight = tp[2]
    if weight > 0:   # give positive and negative correlations different colors
        G.add_edge(node1, node2, weight=abs(weight), graphics={'fill':'#CD5C5C'})
    else:
        G.add_edge(node1, node2, weight=abs(weight), graphics={'fill':'#4682B4'})

# Draw
edge_width = [G[u][v]['weight']*10 for u,v in G.edges()]
color = [G[u][v]['graphics']['fill'] for u,v in G.edges()]
nx.draw(G, width=edge_width, pos=nx.circular_layout(G), edge_color=color, with_labels=True)
plt.show()
# see the figure below

nx.write_gml(G, 'XXX/wine.gml')
# wine.gml now holds the network structure; next, import it into Cytoscape for visualization

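Edge width shows the magnitude of the correlation coefficient (not very noticeable here, because the coefficients are all similar); red is positive correlation and blue is negative. networkx can certainly visualize a network well on its own, but Cytoscape is more full-featured.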
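
Before moving on to Cytoscape, it can be worth checking that the edge attributes survive the trip through the .gml file. The sketch below writes a two-edge toy graph to a temporary file and reads it back; the file name check.gml and the rounded weights are placeholders chosen for this example.

```python
import os
import tempfile
import networkx as nx

# Two toy edges in the same style as the wine network above (weights rounded by hand).
G = nx.Graph()
G.add_edge("alcohol", "proline", weight=0.64, graphics={"fill": "#CD5C5C"})
G.add_edge("hue", "ash", weight=0.56, graphics={"fill": "#4682B4"})

# Write to a temporary file (the path is arbitrary) and read it back.
path = os.path.join(tempfile.mkdtemp(), "check.gml")
nx.write_gml(G, path)
H = nx.read_gml(path)

# The nested graphics dict and the weight should come back unchanged.
for u, v, data in H.edges(data=True):
    print(u, v, data["weight"], data["graphics"]["fill"])
```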

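Visualizing the network in Cytoscape

Cytoscape is free software that runs on Windows, macOS and Linux. Downloading and installing it comes with Java requirements; see the download guide on the official site.

Once Cytoscape is installed, open it and import the .gml network structure built above:

This gives: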

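Setting the layout: a layout uses an algorithm to rearrange the network into a suitable shape. Below are the results of two different layouts. Note that the edge colors we set in Python show up here.
The Style tab in the Control Panel sets font size, node border shape and color; if the nodes carry group information, they can be styled in bulk by group. I suggest clicking through the options yourself; it is easy to pick up.
The Table Panel contains the node table, edge table and network table.
There is a very detailed Chinese-language guide to Cytoscape that you can refer to.

To set edge width according to weight, use the Edge tab under Style.

Style also lets you set the overall appearance of the network:

Export as PDF: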

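co-XXX network

co-abundance/co-occurrence networks seem to be popular these days. If you have found many species that differ between groups and want to see whether those species are related to one another, you can draw a co-XXX network. See, for example, the figures in these two papers: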

2017 Nat Med. Metformin alters the gut microbiome of individuals with treatment-naive type 2 diabetes, contributing to the therapeutic effects of the drug

2019 Gut. Tracing the accumulation of in vivo human oral microbiota elucidates microbial community dynamics at the gateway to the GI tract

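Note that figures like these draw two (or more) networks merged together. Besides straight lines, pairs of nodes are also joined by curves (to avoid overlapping the straight line), which means there can be more than one edge between two nodes. The example code above still applies, but you need to use MultiGraph().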

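import networkx as nx
Gm = nx.MultiGraph()
# 'node1', 'node2', the weights and the origin attribute are placeholders; fill them in from your own tables
Gm.add_edge('node1', 'node2', weight=0.6, origin='corr1')  # XXX from corr_1
Gm.add_edge('node1', 'node2', weight=0.7, origin='corr2')  # XXX from corr_2, a parallel edge between the same nodes
# the details are much the same as in the example code above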
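
To make the idea concrete, here is a small sketch that merges two weighted edge lists into one MultiGraph. The edge lists, the origin attribute and the use of the table name as the edge key are assumptions made for this example; in practice the lists would come from two correlation tables processed as above.

```python
import networkx as nx

# Made-up edge lists standing in for the results of two correlation tables.
edges_1 = [("a", "b", 0.8), ("b", "c", -0.6)]
edges_2 = [("a", "b", 0.7), ("a", "c", 0.9)]

Gm = nx.MultiGraph()
for origin, edges in (("corr1", edges_1), ("corr2", edges_2)):
    for u, v, r in edges:
        # key=origin keeps one edge per table between the same pair of nodes
        Gm.add_edge(u, v, key=origin, weight=abs(r), origin=origin,
                    graphics={"fill": "#CD5C5C" if r > 0 else "#4682B4"})

# a and b are linked in both tables, so they end up with two parallel edges.
print(Gm.number_of_edges("a", "b"))
for u, v, key, data in Gm.edges(keys=True, data=True):
    print(u, v, key, data["weight"], data["graphics"]["fill"])
```

Keeping the table name on each edge makes it possible to style the two networks differently later, for example straight edges for one and bent edges for the other.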

In Cytoscape, to bend an edge, go to the Edge tab under Style, click the Properties expand button and find Bend; you can then set the edge curvature by following the prompts.

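Finally, off topic: a plug for Atom, a text editor. On Windows you can throw away Notepad++. And I feel that unless you need very advanced features, you can drop PyCharm too... Code in R, Python, JS and other languages all goes in, and you can download a huge number of packages to cover all sorts of needs. The interface looks like this:

Line-by-line output, like IPython:

Plots can be rendered too, and strings that name a color are displayed in that color (so cool!! Look at this cute interface! Love it, as someone who cares about looks!! (´Д` ))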

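References:
networkx official tutorial
Details of networkx draw
A very detailed Chinese-language guide to Cytoscape
Official Cytoscape tutorials
An introduction to Atom from a Zhihu column
Official Atom user guide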