Random matrix improved community detection in realistic networks

Friday, June 3, 2016 - 11:30 to 12:00
Télécom Paris Tech Room B310

Community detection in large dimensional networks is a subject of heavy investigation these days with applications in social, technological, biological networks. Most real world networks are usually: (i) sparse in the sense that the degree of each node vanish with the network size and (ii) heterogeneous in the sense that the node degrees are highly varying. Spectral clustering, which classifies vertices according to eigenvectors corresponding to outlying eigenvalues of some matrix representation of the network, is one of the prominent methods to perform community detection tasks. A currently breakthrough spectral method is based on a new operator coming from statistical physics, the Bethe Hessian (BH) matrix. The latter and also most existing spectral methods are motivated by the underlying assumption of a stochastic block model (SBM) and by Newman's definition of the graph modularity matrix (derived from the adjacency matrix). As the SBM does not allow for degree heterogeneity inside clusters, we work on the degree-corrected stochastic block model (DCSBM), which generalises the SBM for graphs with heterogeneous degree distributions. In addition to studying realistic degree corrected dense structured-graph models, our investigation is motivated by the lack of unification of the spectral methods for community detection in the literature, based on various normalizations of the adjacency/modularity matrices. In this talk, we study a novel \alpha-regularization of the modularity matrix. By leveraging tools from random matrix theory, we study the eigenstructure of this matrix and the consequences to community detection. In particular, (a) We provide a consistent estimator for the choice of inducing the most favorable community detection in the hard regime where the clusters are not easily separable. (b) We further prove that spectral clustering ought to be performed on a regularization of the dominant eigenvectors (rather than on the eigenvectors themselves) to compensate for biases due to degree heterogeneity. Our approach largely outperforms the state of the art spectral methods, on synthetic graphs generated from the DCSBM with highly varying node degrees. In particular, the BH approach creates additional artificial clusters which are due to biases induced by the degree heterogeneity and this completely alter the classification performance in those scenarios. Futhermore, although based on dense graph models, our clustering method largely outperforms the BH method on some real world benchmarks and has competitive performances on others.