As was discussed in the post: “What is a Graph“; a graph is a mathematical representation of a network. So what does that have to do with Autism? I won’t go into great detail because the main topic here is network theory, and ultimately how these models have numerous applications (including grouping associated words with data). Briefly though, DNA, as many things are, is a network. This particular network contains groups of interconnected nucleotides which consist of three parts: a phosphate group, a sugar group and one of four types of nitrogen bases. A DNA strand is a linked chain of nucleotides, with a phosphate and sugar group alternating. Adenine (A), Thymine (T), Guanine (G) and Cytosine (C) are the four types of nitrogen bases found in nucleotides. The sequence of these nitrogen bases determines the biological instructions in a DNA strand, such as autism, alzheimer’s, and other genetic anomalies. In this case, the vertices are the nucleotides, and the edges represent their links. This post demonstrates how network theory can be applied to genetic analysis to help determine the genetic codes responsible for autism.
Case Study: Discovering New Genes that Cause Autism using Network Theory
This case study is part of the MIT “Data Science: Data to Insights” curriculum online course, and is presented by Professor Caroline Uhler. The R code associated with this case study can be downloaded from Github. A dataset from BioGRID, SFARI, and Gene ID table are provided for the study as well.
Introduction:
The genetic basis of Autism is poorly understood, even with several genes being identified it is believed there are still more. The purpose of this study is to use network analysis methods to identify candidate genes for additional study.
The network used for this analysis is called protein-protein interaction networks. In these networks, the nodes are genes, and an edge between gene A and gene B means that the protein produced by gene A interacts with the protein produced by gene B. Unfortunately, the process of identifying these interactions is problematic with many Type 1 and Type 2 errors. There is also a bias error since much of the research focuses on proteins that are known to be associated with a particular disease. Therefore, the resulting networks have many edges connecting highly studied genes, with many missing edges yet to be discovered.
There is a list of genes known to cause autism and it is available through the gene autism database, SFARI, and the protein-protein interaction networks are available from BioGRID. The question becomes: “how do we suggest other candidate genes that can then be experimentally validated with respect to their relationship with autism?” Interestingly enough, the process that will be followed here is an extension of the process used to identify mafia (La Cosa Nostra) members by their known associations (cell phone records possibly). Just as exists in the protein-protein interaction networks, acquaintance networks are very noisy, meaning not everyone that makes or receives a call from a mafia member is a mafia member. Just as it is with the genetics data and the SFARI data, there are known genes for autism, and these known members get highlighted. In all of the noisy data, it is reasonable to assume that some of the acquaintances are members of the mafia, or candidate genes for study. Therefore, we are interested in “central nodes in the network that lie on paths that connect the already known members” (mafia, or candidate genes). These shortest paths are called geodesic paths.
In this problem, we have a list of genes that are known to be associated with Autism, and we need what are called molar sub-graphs that connect the known and suspect nodes. This problem has a name: “minimum-weight Steiner tree.” The first step is to compute the geodesic distances in the subset. Next, we build a minimum spanning tree. With this subnetwork identified, we can subtract it from the original, subtract the known genes, and we are left with the candidate genes. The resultant network is called the interactome.
To verify the results, subsets of nodes are chosen at random from the network, and build the interactome of these subsets. The connectivity of a network is determined by the diameter of the network, or the average geodesic distances. The highly connected nodes are considered important nodes, because it is reasonable to assume that a node is important if it is central to the flow of information through the network. The basis of the conclusion is that a node’s importance is proportional to the number of geodesic paths that go through that node, and these nodes are identified by applying the “betweenness centrality” measure.
Data Analysis and Preparation:
The data is loaded into R with the following commands:
library(igraph) set.seed(449) biogrid <- read.delim("./data/BIOGRID.txt",stringsAsFactors = F) names(biogrid) attach(biogrid) HSnet <- graph.data.frame( data.frame(Entrez.Gene.Interactor.A,Entrez.Gene.Interactor.B), directed=F)
Once the data is loaded into R, we need a way of representing the graph. In the earlier post on graphs, we discussed the various types of graphs. One question you might ask is how do you represent a graph. Plotting a large network literally looks like a large ball of yarn as shown in Figure 1; its not of much use. Visually finding the influential nodes in this network would be impossible.
Rather than plotting the graph as we have done in Figure 1, we can represent a network as an adjacency matrix. This is a 2 dimensional square array where the rows and columns are indexed by the vertices in the graph as shown below. The entries of the matrix show the number of edges between the two nodes. The values in the matrix can be weighted too. Here is a subset of the dataset(A[1:15, 1:15]) in an adjacency matrix:
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 1 6416 1 . . . . . . . 1 . . . . 2 . 2 84665 . . . . . . . . . . . . . . . 3 90 . . 1 . . . . . . . . . . . . 4 2624 . . . . . . . . . . . 1 . . . 5 6118 . . . . . . . 1 1 . . . . . . 6 375 . . . . . 1 1 . . . . . . . . 7 377 . . . . . 1 . . . . . . . . . 8 54464 . . . . 1 . . . . . . . . . . 9 351 1 . . . 1 . . . 1 . . . . . . 10 333 . . . . . . . . . . . . . . . 11 10370 . . . . . . . . . . . 8 . . . 12 2033 . . . 1 . . . . . . 8 17 . . . 13 338 . . . . . . . . . . . . 1 . . 14 409 2 . . . . . . . . . . . . 1 . 15 1436 . . . . . . . . . . . . . . .
From the adjacency matrix above, you can see that node 6416 has one link to itself, one link to node 351, and two links with node 409. Notice node 2033 (row 12) has 17 links to itself, and 8 links with node 10370.
After looking at this graph in Figure 2, it is easy to see that there are multiple edges and self-loops and we need to remove them for this application. In particular, nodes with a high degree; defined as the number of edges associated with a node. For the study, this is more of a housekeeping exercise and it is assumed that these genes are necessary to keep a cell alive and are not specific to a particular disease. The simplified graph is depicted below:After the data has been simplified, a table of data with gene scores for being associated with autism is loaded. The genes with significant scores are identified. The list used in this case study for significant scores are: c(“3″,”1S”,”1″,”2S”,”2″,”3S”). In the R code, the protein interaction network HSnet, which was created from the BioGRID data above, is used to determine the genes that are present in the network and known to cause autism.
The steiner_tree function is then called:
steiner_tree <- function(terminals, graph){ A = shortest.paths(graph, terminals, terminals) diag(A) = Inf inds <- arrayInd(which.min(A), dim(A)) nodes = get.shortest.paths(graph, terminals[inds[1,1]], terminals[inds[1,2]])$vpath[[1]] free_terminals = terminals[!terminals %in% nodes] for (i in 1:length(free_terminals)){ A = shortest.paths(graph, nodes, free_terminals) inds <- arrayInd(which.min(A), dim(A)) new_nodes = get.shortest.paths(graph, nodes[inds[1,1]], free_terminals[inds[1,2]])$vpath[[1]] free_terminals = free_terminals[!free_terminals %in% new_nodes] nodes = unique(c(nodes, new_nodes)) } steinertree = induced.subgraph(graph, nodes) V(steinertree)$name = sort(nodes) return(steinertree) }
The input into this function is the graph of vectors called “terminals” containing the indices of seed nodes. There are 95 nodes from the BioGRID/HSnetN data which is the overlap between genes significant for autism and vertices in the Steiner Tree. The genes with significant scores are assigned the color red. The plotted graph is provided in Figure 4:
Last part of the analysis requires the Social Network Analysis (sna) package which is loaded and imported. The sna
package is used to calculate the shortest paths, neighbors, and efficiency
of the network. The efficiency function takes one or more graphs (dat
) and returns the Krackhardt efficiency scores for the graphs selected by g
, which are index values for the graphs to be utilized; by default, all graphs are selected. The scores are diameters of the network: the average geodesic distance between any two nodes as discussed above. The total number of nodes is 17796. The code generates N randomly chosen subnetworks, the interactomes as discussed above. For this case study the value of N is set to 50. The larger N is, the more computing resources and time required.
The efficiency scores are returned for the N random graphs, and the efficiency score and connectivity scores for the autism interactome; the “betweeness centrality scores” for each node. Once identified, we identify only those nodes that are NOT already known to be significant. The result is 49 genes that are not already known to be significant and are now identified as genes that are potentially associated with autism.
Conclusion:
Using network theory, and the analogous problem of identifying mafia members by known members, and their associations, we have successfully applied the concepts of network theory in this case study and identified 49 candidate genes as being associated with the brain disorder called autism.
So where else might this model be applied? What other use cases exist for this process?
Leave a Reply
Your email is safe with us.