Bio4j: A pioneer graph based database for the integration of biological Big Data
Pablo Pareja1, Eduardo Pareja-Tobes1, Marina Manrique1, Eduardo Pareja1, Raquel Tobes1
1: Oh no sequences! group. Era7 Bioinformatics, Pza Campoverde 3, Atico. Granada, Spain. ppareja@era7.com

The main aims of the Bio4j open source project are to integrate commonly used biological data resources (UniProt, UniRef, GenBank, Gene Ontology, RefSeq, InterPro...) into a property graph data model, and to build corresponding fully cloud-enabled aggregated data distributions. To integrate the current contents (UniProt, Gene Ontology, UniRef (50, 90, 100), and RefSeq) we have used the Neo4j graph database as the backend, an open source, pure-Java, transactional graph database engine. We have also developed a simple data model (the API GitHub repository can be found here: https://github.com/pablopareja/Bio4jModel) which integrates the above sources into a property graph while staying as semantically close as possible to the original data. The Bio4j database currently includes more than 500,000,000 relationships and 50,000,000 nodes and keeps growing every day, setting a precedent for bioinformatics Big Data modelled as a graph.

We developed an AWS CloudFormation template which automates the steps needed for cloud deployment. This template creates an EC2 instance from the AWS Linux AMI (with a user-configurable instance type) and EBS volumes containing the data and all required libraries. Apart from this AMI, deployment scripts are also available that work on any Linux instance with Java installed.

There are several entry points to the database, based on both exact and full-text indexes, allowing the user to start traversals at almost any point of the graph. Almost all the attributes of the modelled entities are stored as node or relationship properties. The only exception is the RefSeq sequences, which are stored as independent S3 files.
These files are easily accessible by their Genome Element ID and can even be queried directly for a specific range of positions. They are stored this way so that the database is not overloaded with static content that does not involve any interconnectivity.

What do you get?
- Graph database query capabilities; Neo4j in particular offers great expressive power in its query language (see http://arxiv.org/abs/1004.1001).
- Integration of commonly used biological data sets in a single database. Complex queries can be expressed programmatically as fairly simple graph traversals.

URL for the overall project web site: http://www.bio4j.com
The particular Open Source License being used: AGPLv3
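
The idea of entering a property graph through an exact index and then traversing from that point can be illustrated with a minimal in-memory sketch. This is not the Bio4j or Neo4j API: the class PropertyGraph, the node ids, the accession value and the relationship types PROTEIN_GO and IS_A are all hypothetical names chosen for illustration only.

```python
from collections import defaultdict, deque


class PropertyGraph:
    """Toy property graph: nodes and relationships both carry properties."""

    def __init__(self):
        self.nodes = {}                      # node id -> property dict
        self.out_edges = defaultdict(list)   # node id -> [(type, target, props)]
        self.exact_index = {}                # (property name, value) -> node id

    def add_node(self, node_id, index_on=None, **props):
        """Store a node and optionally index it on one of its properties."""
        self.nodes[node_id] = props
        if index_on is not None:
            self.exact_index[(index_on, props[index_on])] = node_id
        return node_id

    def add_relationship(self, source, rel_type, target, **props):
        """Store a directed, typed relationship with its own properties."""
        self.out_edges[source].append((rel_type, target, props))

    def lookup(self, key, value):
        """Exact-match index lookup; returns a node id or None."""
        return self.exact_index.get((key, value))

    def traverse(self, start, rel_types):
        """Breadth-first traversal that follows only the given relationship
        types. Returns the reached node ids in visit order, excluding start."""
        seen = {start}
        queue = deque([start])
        reached = []
        while queue:
            current = queue.popleft()
            for rel_type, target, _props in self.out_edges[current]:
                if rel_type in rel_types and target not in seen:
                    seen.add(target)
                    reached.append(target)
                    queue.append(target)
        return reached


def build_example_graph():
    """Build a three-node example (all values are made-up placeholders)."""
    g = PropertyGraph()
    g.add_node("p1", index_on="accession", kind="protein", accession="P12345")
    g.add_node("go_child", kind="go_term", name="example child term")
    g.add_node("go_root", kind="go_term", name="example root term")
    g.add_relationship("p1", "PROTEIN_GO", "go_child", evidence="placeholder")
    g.add_relationship("go_child", "IS_A", "go_root")
    return g


if __name__ == "__main__":
    graph = build_example_graph()
    entry = graph.lookup("accession", "P12345")   # index-based entry point
    for node_id in graph.traverse(entry, {"PROTEIN_GO", "IS_A"}):
        print(node_id, graph.nodes[node_id])
```

In this sketch the index only locates the starting node; everything else is reached by following relationships, which is why a query such as "all ontology terms above the terms annotated to this protein" reduces to a short traversal.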
- by Pablo Pareja Tobes
Bioinformatics researcher/consultant/developer of Oh no sequences! (Era7 Bioinformatics)
Currently working as Bioinformatics consultant/developer/researcher at http://www.ohnosequences.com
Amateur musician, traveller, and always eager to learn more about languages, plants... too many things to do and too little time for them!