Chun Li*, Jialing Zhao, Changzhong Wang and Yuhua Yao
The rapid growth in available protein sequence data creates an urgent need for novel computational algorithms to analyze and compare these sequences. In this paper, based on two physicochemical properties of amino acids, a protein primary sequence is converted into a three-letter sequence, from which a simple graph (one without loops or multiple edges) and its geometric line adjacency matrix are obtained. A generalized PseAAC (pseudo amino acid composition) model is then constructed to characterize a protein sequence numerically. Using this mathematical descriptor, we compare similarities among the β-globin proteins of 17 species and among 72 coronavirus spike proteins. Furthermore, with an SVM (support vector machine) as the classifier, we conduct a series of DNA-binding protein identification experiments on three datasets. The comparison results show that the proposed method is competitive and outperforms some existing methods.
Adjacency matrix, generalized PseAAC, graph, identification of DNA-binding proteins, phylogenetic analysis.
Department of Mathematics, Bohai University, Jinzhou 121013; Department of Mathematics, Bohai University, Jinzhou 121013; Department of Mathematics, Bohai University, Jinzhou 121013; School of Mathematics and Statistics, Hainan Normal University, Haikou 571158