Please use this identifier to cite or link to this item:

`http://hdl.handle.net/10397/83610`

Title: | Discovering patterns in complex networks with applications to link analysis and clustering |

Authors: | Hu, Lun |

Degree: | Ph.D. |

Issue Date: | 2015 |

Abstract: | A network consists of a set of objects and the connections between them, and a complex network is a network with a non-trivial topology. A computational technique that can discover interesting patterns in complex networks has applications in a variety of research areas. For example, it can be used to discover protein complexes in protein-protein interaction networks, or to identify online user communities in social networks. Networks can be represented as graphs, with vertices representing objects and edges representing the connections between them. Hence, graph mining techniques have been used to discover patterns in networks. For many of these techniques to work effectively, patterns are required to have specific topological properties in terms of density, maximal k-cliques, or betweenness centrality. However, the attributes associated with the objects in a complex network are usually ignored, or treated separately, during the graph mining process. According to empirical studies on complex networks, associations are believed to exist between the attributes of objects and the links between objects, and these associations may provide valuable information for discovering interesting graph patterns. In this regard, we propose in this thesis a technique that discovers associative patterns in complex networks by taking into consideration the associations between attribute and topology information during the pattern discovery process. This technique works with what are called attributed graphs (AGs). Associated with each vertex in such a graph is a set of attributes, each of which can take more than one value.
To discover associative patterns is to discover regularities between the attribute and topology information of AGs. A simple but feasible way to represent them is to make use of pairs of attribute values that are observed significantly often in connected vertices of the given AG.
That is to say, if the respective attribute values of two connected vertices co-occur significantly more often than expected, the co-occurrence of the two attribute values is an associative pattern of interest. To decide this for two attribute values, we use statistical analysis to determine whether the conditional probability of one attribute value given the other is significantly higher than the a priori probability of that attribute value occurring irrespective of other attribute values. If the difference is found to be statistically significant, the co-occurrence of the two values constitutes an associative pattern. Once such an associative pattern is identified, we further use an information-theoretic measure to indicate how significant the pattern is. Hence, for two inter-connected objects represented as two vertices connected by a link in an AG, the association between them can be determined from the number and significance of the associative patterns found between them. The proposed technique can therefore discover associative patterns in AGs based on both topological and attribute information. A Degree of Association (DOA) measure is then introduced to compute the association between vertices based on the number and significance of the associative patterns found in their attributes. The introduction of associative patterns allows us to exploit the knowledge contained in AGs efficiently and to tackle a variety of graph mining problems. For performance evaluation, we have applied the technique to problems in link analysis and graph clustering.
For link analysis, associative patterns have been used to predict protein-protein interactions (PPIs) in PPI networks (PPINs), with the protein sequences serving as the attributes of the proteins in the network. An algorithm, VLASPD, has been developed based on the proposed technique; it considers variable-length segments of each pair of interacting protein sequences to determine the association relationships that exist between the proteins. Unlike other sequence-based approaches, VLASPD is able to discover patterns in interacting proteins by considering associations between variable-length segments. As a result, it can use such patterns to predict more accurately whether two proteins interact with each other. We have tested VLASPD on different real data sets, and the experimental results show that VLASPD can predict PPIs accurately and is a promising approach to PPI prediction.
For AG clustering, we first propose a fuzzy clustering approach, FC-AG, which combines the topology and attribute information of AGs through the DOA measure. The adoption of fuzzy clustering allows FC-AG to identify clusters in a natural manner. However, since there are also applications in which the number of clusters is unknown, we further develop an unsupervised clustering algorithm, MCL-AG, which identifies clusters through a Markov clustering process. Integrated with the DOA measure, MCL-AG is able to discover dense graph clusters consisting of vertices whose attribute values are significantly more closely associated with each other. However, the experimental results of MCL-AG show that vertices in the same cluster need not be similar over all attributes. Therefore, if unsupervised clustering can rely on the attributes that are more similar while ignoring those that are less similar, clusters can be identified more accurately and efficiently.
To this end, we propose an algorithm, CAP-AG, that takes attribute preferences into account during clustering. To evaluate the performance of FC-AG, MCL-AG and CAP-AG, we have applied them to several practical problems, including document classification, social community identification and the prediction of protein complexes. The experimental results show that all three approaches perform promisingly. |
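
The associative-pattern idea described in the abstract can be illustrated with a minimal sketch. The abstract does not say which statistical test or information-theoretic measure the thesis uses, so this sketch assumes an adjusted-residual test for significance and a log-ratio weight, log(P(b|a) / P(b)), for pattern strength. It also assumes that the degree of association between two vertices is simply the sum of the weights of the patterns found between their attribute sets. The names `associative_patterns`, `degree_of_association` and `z_threshold` are hypothetical and do not come from the thesis.

```python
import math
from collections import Counter
from itertools import product


def associative_patterns(edges, attrs, z_threshold=1.96):
    """Return {(a, b): weight} for attribute-value pairs that co-occur on
    connected vertices significantly more often than chance would predict.

    Assumptions (not taken from the thesis): significance is judged with an
    adjusted residual, and the weight is log(P(b|a) / P(b)).
    """
    pair_counts = Counter()
    left_counts = Counter()
    right_counts = Counter()
    total = 0
    for u, v in edges:
        # Count each undirected edge in both directions so that the table is
        # symmetric. This doubles the sample size, a simplification that
        # overstates significance on small graphs.
        for x, y in ((u, v), (v, u)):
            for a, b in product(attrs[x], attrs[y]):
                pair_counts[(a, b)] += 1
                left_counts[a] += 1
                right_counts[b] += 1
                total += 1

    patterns = {}
    for (a, b), observed in pair_counts.items():
        p_a = left_counts[a] / total
        p_b = right_counts[b] / total
        expected = total * p_a * p_b
        variance = expected * (1 - p_a) * (1 - p_b)
        if variance <= 0:
            continue  # a value present on every edge carries no information
        z = (observed - expected) / math.sqrt(variance)
        if z > z_threshold:
            # observed / expected equals P(b|a) / P(b)
            patterns[(a, b)] = math.log(observed / expected)
    return patterns


def degree_of_association(u, v, attrs, patterns):
    """Sum the weights of all associative patterns between u and v (assumed form)."""
    return sum(patterns.get((a, b), 0.0) for a, b in product(attrs[u], attrs[v]))
```

Here `edges` is a list of vertex pairs and `attrs` maps each vertex to its set of attribute values. Vertex pairs whose attribute values match many significant patterns receive a high score, which is the kind of quantity that link prediction or clustering could then build on.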

Subjects: | Neural networks (Computer science); Cluster analysis; System analysis; Hong Kong Polytechnic University -- Dissertations |

Pages: | xvi, 136 pages : illustrations ; 30 cm |

Appears in Collections: | Thesis |

###### Access

View full-text via https://theses.lib.polyu.edu.hk/handle/200/8081
