doi: 10.15389/agrobiology.2017.1.63eng

UDC 633.491:004.65:[631.5+632.9

Acknowledgements:
Supported by Budget Project of Institute of Cytology and Genetics SB RAS for Potato Program.

 

DEVELOPMENT OF METHODS FOR AUTOMATIC EXTRACTION OF KNOWLEDGE FROM TEXTS OF SCIENTIFIC PUBLICATIONS FOR THE CREATION OF A KNOWLEDGE BASE SOLANUM TUBEROSUM

O.V. Saik1, P.S. Demenkov1, 2, T.V. Ivanisenko1, N.A. Kolchanov1,
V.A. Ivanisenko1

1Federal Research Center Institute of Cytology and Genetics SB RAS, Federal Agency of Scientific Organizations, 10, prosp. Akademika Lavrent’eva, Novosibirsk, 630090 Russia, e-mail saik@bionet.nsc.ru, demps@bionet.nsc.ru, itv@bionet.nsc.ru, kol@bionet.nsc.ru, salix@bionet.nsc.ru;
2Novosibirsk State University, 2, ul. Pirogova, Novosibirsk, 630090 Russia

ORCID:
Demenkov P.S. orcid.org 0000-0001-9433-8341
Ivanisenko T.V. orcid.org 0000-0002-0005-9155
Kolchanov N.A. orcid.org 0000-0001-6800-8787
Ivanisenko V.A. orcid.org 0000-0002-1859-4631

Received November 30, 2016

 

Hundreds of scientific journals currently publish research results in various fields of plant biology and agrobiology, and hundreds of thousands of international patents contain information on agricultural biotechnology. The number of articles and patents is growing exponentially. For example, more than 1.5 million publications are devoted to Solanum tuberosum, one of the most important crops in the world. Analysis of such a huge number of experimental facts presented in text sources (scientific publications and patents) requires automated methods for knowledge extraction (text-mining). Intelligent automatic text analysis techniques are already widely used in biology and medicine to extract information about the properties and functions of molecular genetic objects. Unlike search engines such as Google, Yandex and others, which search documents by keywords, text-mining methods aim to extract the knowledge presented in the documents automatically and to integrate and formalize it according to a defined ontology. Known systems for intelligent knowledge extraction from scientific publications include STRING, LMMA, ConReg, GeneMania and others. We previously developed ANDSystem, the first system in Russia for automatic intelligent knowledge extraction in biomedicine. ANDSystem contains more than 10 million facts about molecular genetic interactions extracted from more than 25 million scientific publications. For knowledge extraction, ANDSystem uses specially developed semantic and linguistic rules that recognize, in natural language texts, interactions between biological objects such as proteins, genes, metabolites, drugs, microRNA, biological processes, diseases and others.
However, the development of methods for automatic knowledge extraction from texts in plant biology, agrobiology and agrobiotechnology remains an unsolved and highly relevant problem. The aim of this work was to adapt the automatic knowledge extraction methods implemented in ANDSystem to the field of crop production and, on this basis, to create the SOLANUM TUBEROSUM knowledge base, containing information on the genetics, markers, breeding and selection of potato, its pathogens and pests, storage and processing technologies, and other topics. The knowledge base ontology contains dictionaries corresponding to more than 20 types of objects, including molecular genetic objects (proteins, genes, metabolites, microRNA, biological processes, biomarkers, etc.), potato varieties and their phenotypic traits, diseases and pests of potato, biotic and abiotic environmental factors, biotechnologies for the cultivation, processing and storage of potato, and others. The ontology also contains more than 25 types of interactions that describe relationships between these objects, including molecular interactions, regulatory events and associative links. More than 5 thousand semantic templates were created to extract information about the interactions. The accuracy and recall of knowledge extraction by the developed method were assessed by expert manual analysis of a text corpus and exceeded 65% and 70%, respectively. The full-scale version of the SOLANUM TUBEROSUM knowledge base will be created on the basis of the developed approaches.

Keywords: Solanum tuberosum, ANDSystem, text-mining, database, automatic knowledge extraction from texts.
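
The dictionary-plus-template scheme and the expert evaluation described in the abstract can be illustrated with a minimal Python sketch. It is not ANDSystem's implementation: the dictionary entries, the three regex templates, the name GENE1, the toy sentences and the "expert" gold set are all invented for illustration, and the accuracy reported in the abstract is read here as precision (an assumption).

```python
import re
from typing import NamedTuple

# Hypothetical mini-dictionary: object name -> object type.
# (The abstract reports dictionaries for more than 20 object types.)
DICTIONARY = {
    "potato virus Y": "pathogen",
    "defence response": "biological process",
    "cold stress": "abiotic factor",
    "tuber sweetening": "phenotypic trait",
    "GENE1": "gene",  # placeholder name, not a real gene
}

# Hypothetical semantic templates: a pattern with two object slots
# and the interaction type it signals.
TEMPLATES = [
    (r"{A} (?:up-?regulates|activates|induces) {B}", "upregulation"),
    (r"{A} (?:down-?regulates|inhibits|suppresses) {B}", "downregulation"),
    (r"{A} is associated with {B}", "association"),
]


class Fact(NamedTuple):
    source: str
    interaction: str
    target: str


# Lower-cased name -> canonical dictionary spelling.
CANONICAL = {name.lower(): name for name in DICTIONARY}
# Alternation of all dictionary names, longest first so longer names win.
NAMES = "|".join(
    re.escape(name) for name in sorted(DICTIONARY, key=len, reverse=True)
)


def extract_facts(sentence: str) -> set:
    """Return the facts that any template finds in one sentence."""
    facts = set()
    for pattern, interaction in TEMPLATES:
        regex = pattern.replace("{A}", f"(?P<a>{NAMES})")
        regex = regex.replace("{B}", f"(?P<b>{NAMES})")
        for m in re.finditer(regex, sentence, flags=re.IGNORECASE):
            facts.add(Fact(CANONICAL[m.group("a").lower()],
                           interaction,
                           CANONICAL[m.group("b").lower()]))
    return facts


def precision_recall(extracted: set, gold: set) -> tuple:
    """Compare extracted facts with manually annotated (gold) facts."""
    true_positives = len(extracted & gold)
    precision = true_positives / len(extracted) if extracted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    return precision, recall


# Invented toy sentences, not taken from any real corpus.
SENTENCES = [
    "Potato virus Y suppresses defence response in infected plants.",
    "Cold stress is associated with tuber sweetening.",
    "It is unclear whether GENE1 activates defence response.",  # speculative
    "Defence response is weakened by cold stress.",  # passive, no template
]

# Invented "expert" annotation of the same sentences.
GOLD = {
    Fact("potato virus Y", "downregulation", "defence response"),
    Fact("cold stress", "association", "tuber sweetening"),
    Fact("cold stress", "downregulation", "defence response"),
}

if __name__ == "__main__":
    extracted = set().union(*(extract_facts(s) for s in SENTENCES))
    precision, recall = precision_recall(extracted, GOLD)
    print(f"precision={precision:.2f} recall={recall:.2f}")
```

In this toy run the speculative third sentence yields a false positive and the passive fourth sentence is missed, which is the kind of error that an expert check of the corpus is meant to quantify.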

 


 

