Lailan Sahrina Hasibuan
Department of Computer Science, Bogor Agricultural University

Published : 2 Documents

Improving DNA Barcode-based Fish Identification System on Imbalanced Data using SMOTE
Kusuma, Wisnu Ananta; Noviana, Nurdevi; Hasibuan, Lailan Sahrina; Nurilmala, Mala
TELKOMNIKA (Telecommunication Computing Electronics and Control) Vol 15, No 3: September 2017
Publisher : Universitas Ahmad Dahlan

Imbalanced data is a common problem in classification and identification. It arises when the number of instances in one class far exceeds that in the others. In previous research, our DNA barcode-based identification system for tuna and mackerel was developed on an imbalanced dataset: tuna and mackerel samples far outnumbered those of other fish. The accuracy of the classification model was therefore probably biased. This research employed the Synthetic Minority Oversampling Technique (SMOTE) to produce a balanced dataset. We used k-mer frequencies from DNA barcode sequences as features and a Support Vector Machine (SVM) as the classification method, with trinucleotides (3-mers) and tetranucleotides (4-mers) as the k-mers. The training dataset was taken from the Barcode of Life Database (BOLD). To evaluate the model, we compared the accuracy of models trained with and without SMOTE in classifying DNA barcode sequences obtained from the Department of Aquatic Product Technology, Bogor Agricultural University. At the species level, the model using SMOTE was 7% and 13% more accurate than the non-SMOTE model for trinucleotides (3-mers) and tetranucleotides (4-mers), respectively. SMOTE, as a data balancing technique, is expected to increase the accuracy of DNA barcode-based fish classification systems, particularly at the species level, where identification is difficult.
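The feature extraction and oversampling steps described in this abstract can be sketched roughly as follows. This is a minimal illustration and not the authors' implementation: the function names `kmer_frequencies` and `smote_sample`, the choice to skip windows containing ambiguous bases, and the neighbour count are all assumptions made for the example. The SVM training step is omitted.

```python
import random
from itertools import product


def kmer_frequencies(sequence, k):
    """Return relative frequencies of all 4**k DNA k-mers in a fixed order."""
    kmers = ["".join(p) for p in product("ACGT", repeat=k)]
    counts = dict.fromkeys(kmers, 0)
    seq = sequence.upper()
    total = 0
    for i in range(len(seq) - k + 1):
        window = seq[i:i + k]
        # Assumption: windows with ambiguous bases (e.g. N) are skipped.
        if window in counts:
            counts[window] += 1
            total += 1
    if total == 0:
        return [0.0] * len(kmers)
    return [counts[m] / total for m in kmers]


def smote_sample(minority, k_neighbors=2, rng=None):
    """Create one synthetic minority sample by interpolating between a
    randomly chosen minority point and one of its nearest minority neighbours."""
    rng = rng or random.Random(0)
    base = rng.choice(minority)
    others = [x for x in minority if x is not base]
    # Sort the remaining minority points by squared Euclidean distance to base.
    others.sort(key=lambda x: sum((a - b) ** 2 for a, b in zip(base, x)))
    neighbor = rng.choice(others[:k_neighbors])
    gap = rng.random()  # position along the segment from base to neighbor
    return [a + gap * (b - a) for a, b in zip(base, neighbor)]


# Hypothetical usage: 3-mer features for three made-up minority-class barcodes.
minority_features = [
    kmer_frequencies(s, 3)
    for s in ("ACGTACGTAC", "ACGTTCGTAC", "ACGAACGTAC")
]
synthetic = smote_sample(minority_features)
```

Each synthetic vector lies on the line segment between two real minority samples, which is how SMOTE balances classes without duplicating existing samples.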
Evaluation of F-Measure and Feature Analysis of C5.0 Implementation on Single Nucleotide Polymorphism Calling
Hasibuan, Lailan Sahrina; Nabila, Sita; Hudachair, Nurul; Istiadi, Muhammad Abrar
Indonesian Journal of Artificial Intelligence and Data Mining Vol 1, No 1 (2018): March
Publisher : UIN Sultan Syarif Kasim Riau

Data in molecular biology have grown rapidly since Next-Generation Sequencing (NGS), a high-throughput DNA sequencing technology, was introduced in 2000. A Single Nucleotide Polymorphism (SNP) is a DNA-based marker that can be used to identify organisms specifically. SNPs are commonly exploited to optimize parent selection when producing high-quality seed in plant breeding. This paper discusses SNP calling on NGS data of cultivated soybean (Glycine max (L.) Merr.) using C5.0, an improved version of the rule-based algorithm C4.5. The evaluation showed that C5.0 outperforms CART, another rule-based algorithm, on F-measure: 0.63 for C5.0 versus 0.58 for CART. In addition, C5.0 is robust to imbalanced training data up to a ratio of 1:17, but its performance suffers on large training datasets. C5.0’s performance may be improved by applying bagging or another ensemble technique, just as CART was improved by applying bagging to the final decision. Choosing appropriate features to represent SNP candidates is also important. Based on the information gain of C5.0, this paper recommends error probability, homopolymer left, mismatch alt, and mean nearby qual as features for SNP calling.
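The two quantities this abstract relies on, F-measure and information gain, can be computed as in the sketch below. It is illustrative only: the function names, the confusion-matrix counts, and the toy feature values are assumptions, and the information gain shown is the textbook definition for a categorical feature, not necessarily the exact computation performed inside C5.0.

```python
from math import log2


def f_measure(tp, fp, fn, beta=1.0):
    """F-measure from true positives, false positives and false negatives."""
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    b2 = beta * beta
    return (1 + b2) * precision * recall / (b2 * precision + recall)


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    n = len(labels)
    counts = {}
    for y in labels:
        counts[y] = counts.get(y, 0) + 1
    return -sum(c / n * log2(c / n) for c in counts.values())


def information_gain(feature_values, labels):
    """Reduction in label entropy after splitting on a categorical feature."""
    n = len(labels)
    groups = {}
    for v, y in zip(feature_values, labels):
        groups.setdefault(v, []).append(y)
    remainder = sum(len(g) / n * entropy(g) for g in groups.values())
    return entropy(labels) - remainder


# Hypothetical usage with made-up counts and a made-up binary feature.
score = f_measure(tp=6, fp=2, fn=2)
gain = information_gain(["high", "high", "low", "low"], [1, 1, 0, 0])
```

Ranking candidate features by a score such as `information_gain` is one way to arrive at a short list of recommended features like the one given in the abstract.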