Hierarchical protein classification based on gene ontology and decision trees
Date Issued
2010-09
Author(s)
Mirceva, Georgina
Abstract
Proteins are the most important parts of the cell; therefore, knowing their
exact function is of great significance. However, the function of a large number
of proteins is still unknown. In addition, biologists today insist on a hierarchical
organization of the living world, and thus of protein databases as well. Many
protein classification algorithms have been proposed for determining protein
function, but only a few of them take these hierarchical structures into
consideration. The Gene Ontology (GO) is a protein and gene database structured
as a controlled hierarchical vocabulary of terms that describe protein functions.
This paper introduces a new hierarchical multi-label protein classifier that uses
the relationships among the GO terms. First, protein descriptors are extracted
from the structural coordinates stored in Protein Data Bank (PDB) files.
Then, a modified C4.5 algorithm is applied to select the most appropriate
descriptor features for protein classification based on the GO hierarchy. An
evaluation of this approach is presented, and the results show that the
hierarchical structure of GO is important for improving classification
accuracy at higher levels.
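
For readers unfamiliar with C4.5, the following is a minimal sketch of the standard gain-ratio criterion that C4.5 uses to rank candidate splits on a numeric feature. It is not the modified algorithm from the paper, whose details are not given in this abstract; the function names `entropy` and `gain_ratio` and the toy descriptor values are assumptions made purely for illustration.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((n / total) * math.log2(n / total)
                for n in Counter(labels).values())


def gain_ratio(values, labels, threshold):
    """Gain ratio of the binary split `value <= threshold`.

    This is the textbook C4.5 criterion: information gain divided by
    the split information. It returns 0.0 for a split that leaves one
    side empty.
    """
    left = [y for x, y in zip(values, labels) if x <= threshold]
    right = [y for x, y in zip(values, labels) if x > threshold]
    if not left or not right:
        return 0.0
    total = len(labels)
    weights = [len(left) / total, len(right) / total]
    info_gain = entropy(labels) - (
        weights[0] * entropy(left) + weights[1] * entropy(right)
    )
    split_info = -sum(w * math.log2(w) for w in weights)
    return info_gain / split_info


# Hypothetical descriptor values for four proteins and their class labels.
descriptor = [1.0, 2.0, 3.0, 4.0]
classes = ["a", "a", "b", "b"]
best = max([1.0, 2.0, 3.0], key=lambda t: gain_ratio(descriptor, classes, t))
```

In this toy example the threshold that separates the two classes cleanly gets the highest gain ratio, which is how a descriptor feature that discriminates well would be preferred during tree construction.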
Subjects
File(s)
Name
DraskoNakikICTIn.pdf
Size
7.31 MB
Format
Adobe PDF
Checksum
(MD5):e68f94e9699b3f3ec0392c6582dc5c05
