I work on the CCDH Tools and Data Quality team at RENCI. We aim to develop software tools that allow biomedical data to be validated against the CRDC-H data model for storing cancer-related biomedical data, to publish and transform the data model into different formats as needed, and to fulfill any other software development tasks needed to complete this project.
- Software: csv2caDSR§ (Jul 2020 to Dec 2020): A tool for harmonizing biomedical data using the Cancer Data Standards Registry and Repository (caDSR) as a source of validation information.
- Technologies used: Scala, caDSR, PFB, CEDAR Workbench.
- Provides the following features:
- Harmonize input data against the caDSR.
- Store harmonization information (such as mappings from values in data files to concepts in vocabularies) in a JSON format that can be converted into other mapping formats if needed.
- Export harmonized data in a variety of formats, such as the Portable Format for Biomedical Data (PFB) and the CEDAR Instance format.
- Source code available at https://github.com/cancerDHC/csv2caDSR.
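The value-to-concept harmonization described above can be sketched roughly as follows. This is a minimal Python illustration only: the JSON field names, the data element ID and the concept codes are hypothetical stand-ins, not csv2caDSR's actual schema or output.

```python
import json

# Hypothetical harmonization table: raw values from an input CSV column mapped
# to permissible-value concepts drawn from a caDSR data element. All field
# names and identifiers here are illustrative.
harmonization = {
    "column": "sex",
    "caDSR_cde": "0000000",  # placeholder public ID of a data element
    "mappings": {
        "M": {"permissibleValue": "Male", "conceptCode": "C20197"},
        "F": {"permissibleValue": "Female", "conceptCode": "C16576"},
    },
}

def harmonize(value, table):
    """Return the harmonized permissible value for a raw input value, or None."""
    mapping = table["mappings"].get(value)
    return mapping["permissibleValue"] if mapping else None

print(harmonize("M", harmonization))             # Male
print(json.dumps(harmonization["mappings"]["F"]))
```

Because the mapping table is stored as plain JSON, it can be version-controlled alongside the data and converted into other mapping formats when needed.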
- Software: UMLS-RRF-Scala§ (Mar 2020 to Nov 2020): A set of tools for identifying mappings from terms in one vocabulary to terms in others, intended to provide mappings from closed-source vocabularies to open-access vocabularies.
- Technologies used: Scala, UMLS, SSSOM.
- Provides the following features:
- Extract individual mappings from the Unified Medical Language System (UMLS) Metathesaurus as SSSOM files.
- Use web services (such as BioPortal and the EMBL-EBI Ontology Lookup Service) to find additional mappings between vocabularies.
- Source code available at https://github.com/cancerDHC/umls-rrf-scala.
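The SSSOM export step can be illustrated with a minimal sketch. Only the column headers below follow the SSSOM standard; the mapping row and its CURIEs are invented examples, not actual UMLS output.

```python
import csv
import io

# Emit mappings as a minimal SSSOM TSV. The four columns are standard SSSOM
# slots; the data row is an illustrative example only.
mappings = [
    ("MESH:D001921", "skos:exactMatch", "NCIT:C12439", "semapv:MappingReview"),
]

buf = io.StringIO()
writer = csv.writer(buf, delimiter="\t", lineterminator="\n")
writer.writerow(["subject_id", "predicate_id", "object_id", "mapping_justification"])
writer.writerows(mappings)
print(buf.getvalue())
```

Storing mappings in this tabular, standardized form is what lets downstream tools combine mappings extracted from the UMLS with those found via web services such as BioPortal.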
While working as a Semantic Web technologist at RENCI, I developed a number of ad-hoc software tools to meet the needs of other RENCI teams.
- Software: Omnicorp and OmniCORD§ (Oct 2019 to Nov 2020): I built upon the existing Omnicorp tool to improve the RDF data it produced, for example by adding authorship information. I also wrote OmniCORD, a variant of Omnicorp for extracting entities from the COVID-19 Open Research Dataset (CORD-19).
- Technologies used: Scala, RDF, SHACL.
- Provides the following features:
- Extract terminologies from the MEDLINE/PubMed Baseline Repository and export them as RDF.
- Extract terminologies from the COVID-19 Open Research Dataset (CORD-19) and export them as RDF.
- Validate produced RDF information against SHACL shapes using SHACLI.
- Source code available at https://github.com/NCATS-Gamma/omnicorp.
- Software: SHACLI§ (Oct 2019 to May 2020): A command-line interface (CLI) for validating RDF data against Shapes Constraint Language (SHACL) shapes with improved error messages. This was part of a project to simplify data model creation, so that a single document could be used both to produce documentation and to validate input data.
- Technologies used: Scala, Apache Maven, Coursier, SHACL, RDF, web APIs.
- Provides the following features:
- Validate RDF data against SHACL shapes.
- Provides improved error messages compared to other SHACL validators.
- Published to the Maven repository, so that it is easy to install via Coursier or other Maven tools.
- Source code available at https://github.com/gaurav/shacli.
- §Citation: Daniel Korn, Tesia Bobrowski, Michael Li, Yaphet Kebede, Patrick Wang, Phillips Owen, Gaurav Vaidya, Eugene Muratov, Rada Chirkova, Chris Bizon, Alexander Tropsha (November 11, 2020) COVID-KOP: integrating emerging COVID-19 data with the ROBOKOP database. Bioinformatics.
In my PhD dissertation, I developed methods for quantifying the rate of change in taxonomic checklists and the effect such changes can have on the interpretation of biodiversity data.
- Software: SciNames§ (Jan 2017 to May 2019): A graphical user interface (GUI) for processing changes among taxonomic checklists. SciNames imports multiple taxonomic checklists into a single XML file, and then generates lists of changes between those checklists. Users can then use the GUI to annotate these changes, recording why each change had occurred and what sort of change it was. SciNames can then calculate some statistics on the changes.
- Technologies used: Java, Java Swing, XML.
- Provides the following features:
- Convert taxonomic checklists into XML representations that store both multiple taxonomic checklists and the annotated changes between them.
- Provides a graphical user interface (GUI) showing changes between checklists and allowing those changes to be modified or annotated.
- Calculates measures of stability across a series of checklists over time and exports data in formats amenable to rendering as graphs.
- Source code available at https://github.com/gaurav/scinames.
- §Citation: Gaurav Vaidya (January 19, 2018) Taxonomic Checklists as Biodiversity Data: How Series of Checklists can Provide Information on Synonymy, Circumscription Change and Taxonomic Discovery. Ph.D. Dissertation.
- §Citation: Gaurav Vaidya, Denis Lepage, Robert Guralnick (April 19, 2018) The tempo and mode of the taxonomic correction process: How taxonomists have corrected and recorrected North American bird species over the last 127 years. PLOS ONE 13(4):e0195736.
- §Citation: Denis Lepage, Gaurav Vaidya, Robert Guralnick (June 25, 2014) Avibase – a database system for managing and organizing taxonomic concepts. ZooKeys 420:117–135.
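One simple example of the kind of checklist-stability measure SciNames calculates is the Jaccard similarity between the name sets of successive checklists. The sketch below is an illustrative stand-in written for this description, not SciNames' actual statistics code, and the species lists are toy data.

```python
# Jaccard similarity of two name sets: |intersection| / |union|.
def jaccard(a, b):
    a, b = set(a), set(b)
    return len(a & b) / len(a | b) if (a | b) else 1.0

# Two toy checklists of bird names from successive years.
checklists = [
    {"Anas platyrhynchos", "Anas rubripes", "Anas acuta"},
    {"Anas platyrhynchos", "Anas rubripes", "Spatula clypeata"},
]

for earlier, later in zip(checklists, checklists[1:]):
    print(f"stability: {jaccard(earlier, later):.2f}")   # 0.50
```

A value of 1.0 would mean no names changed between checklists; lower values indicate more turnover, whether from synonymy, lumping and splitting, or new discoveries.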
The Phyloreferencing project aims to digitize definitions of clades (groups of related taxa that are a key concept in evolutionary biology) as semantically rich OWL ontologies, which are both human-readable and computer-interpretable.
- Software: Phyx.js§ (from Jan 2019): A software library for reading and validating Phyx files for storing clade definitions, which can convert them into OWL ontologies for publication and reasoning.
- Technologies used: JavaScript, NPM, Web Ontology Language, JSON, JSON-LD.
- Provides the following features:
- Read and validate Phyx files.
- Convert Phyx files into OWL ontologies for publication and reasoning.
- Includes tests of phyloreference resolution on every possible tree topology with up to 6 leaf nodes.
- Source code available at https://github.com/phyloref/phyx.js.
- Software: Phyloref Ontology§ (Jun 2013 to Nov 2020): An ontology containing terms used to concisely and precisely define phyloreferences.
- Technologies used: Web Ontology Language, ontologies.
- My contribution here was mainly to suggest some additional terms.
- Software: Clade Ontology§ (Jun 2017 to May 2020): An ontology containing phyloreferences translated from published clade definitions.
- Technologies used: JavaScript, Web Ontology Language, ontologies.
- Provides the following features:
- A Node.js-based workflow for converting folders of Phyx files into a single large OWL ontology.
- Includes tools for testing resolution of all included Phyx files.
- Software: JPhyloRef§ (Sep 2017 to May 2020): A Java-based command line tool for reasoning over and testing ontologies that include phyloreferences.
- Technologies used: Java, Web Ontology Language, JSON-LD, Software testing, web APIs.
- Provides the following features:
- Reason over ontologies containing phyloreferences and phylogenies to report on which nodes each phyloreference resolves to.
- Test whether ontologies containing phyloreferences and reference phylogenies resolve as expected.
- Can run as a web server, providing a web API used as a backend by Klados and other services.
- Source code available at https://github.com/phyloref/jphyloref.
- Software: Open Tree Resolver§ (Feb 2019 to Sep 2020): A single-page application for testing phyloreference resolution on the Open Tree of Life.
- Technologies used: JavaScript, Vue.js, web APIs.
- Provides the following features:
- Upload OWL ontologies representing phyloreferences as JSON-LD files.
- Reason over the phyloreferences to resolve them on the relevant section of the Synthetic Tree as downloaded via Open Tree of Life APIs.
- Software: Klados§ (Oct 2017 to Aug 2020): A single-page application for curating, testing and resolving phyloreferences.
- Technologies used: JavaScript, Vue.js, JSON, web APIs.
- Provides the following features:
- Provides a graphical user interface for reading and writing Phyx files describing phyloreferences.
- Allows users to add and visualize phylogenies, and test resolution of phyloreferences on those phylogenies.
- Export Phyx files as OWL ontologies.
Some small software projects I worked on during my PhD.
- Software: BibURI§ (Oct 2013 to Jan 2014): A Ruby Gem to extract BibTeX information from a number of bibliographic resources, such as DOIs and COinS webpages. I wrote this at the first TaxonWorks hackathon.
I led three lab sections of eighteen students each, taking my students on a tour of the tree of life while reinforcing concepts in evolutionary biology, ecology, anatomy and physiology.
I led three lab sections of eighteen students each, teaching the philosophical underpinnings of science through hands-on experiments in cellular and molecular biology.
I led three lab sections of twenty students each, teaching evolutionary biology through R-based statistics and modeling labs, measurements and phylogenetics.
A project organized by the Biodiversity Heritage Library (BHL) to identify and annotate hundreds of thousands of illustrations from the documents in this digital library.
- §Link: Art of Life Schema (Apr 2015): I worked with my PhD advisor and some librarians at the BHL to develop a data schema for annotating biological illustrations in a way that would make them useful for biodiversity researchers.
An NSF-funded project to synthesize different kinds of biodiversity data — from occurrences to rangemaps — into a single, easy-to-use tool. I worked on the Map of Life project, usually during summer holidays, under the supervision of my PhD advisor, Rob Guralnick.
- Software: Vernacular Names§ (Feb 2014 to Jul 2015): A web application for managing the synthesis and verification of vernacular name information for Map of Life.
- Technologies used: Python, PostgreSQL.
- Provides the following features:
- Lists all vernacular names across multiple languages for each taxonomic name.
- Includes scripts for importing particular vernacular name databases into this system.
- Allow curators to improve vernacular names.
- Produce reports on vernacular name coverage across important languages.
- Use regular expressions to make multiple simultaneous changes.
- Source code available at https://github.com/MapofLife/vernacular-names.
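The regex-driven bulk-editing feature can be sketched as follows. This is a simplified illustration written for this description; the pattern and the name data are made up, and the real tool applied such edits against its PostgreSQL store.

```python
import re

# Apply one regex substitution across many vernacular names at once,
# mirroring the curator feature described above.
names = ["mallard duck", "wood duck", "harlequin duck"]
pattern, replacement = re.compile(r"\bduck\b"), "Duck"

updated = [pattern.sub(replacement, n) for n in names]
print(updated)   # ['mallard Duck', 'wood Duck', 'harlequin Duck']
```

Letting curators express a single pattern rather than editing names one by one made large, consistent cleanups practical across tens of thousands of names.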
- Software: TaxRefine§ (Jun 2013 to May 2014): Provides an OpenRefine reconciliation service API for matching taxonomic names against several services.
- Technologies used: Perl, OpenRefine, web APIs.
- Provides the following features:
- Queries multiple taxonomic name resolution services to provide several possible resolutions to a user.
- Source code available at https://github.com/gaurav/taxrefine.
- §Link: Validating scientific names with the GBIF Checklist Bank (Jul 2013): A blog post describing TaxRefine.
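An OpenRefine reconciliation service answers a batch of queries with ranked candidate matches. The sketch below shows the general shape of such a response in Python; the candidate identifier and score are made up, and a real service like TaxRefine would query taxonomic name resolvers rather than echo the input.

```python
import json

# A batch of reconciliation queries, keyed q0, q1, ... as OpenRefine sends them.
queries = {"q0": {"query": "Felis silvestris"}}

def reconcile(queries):
    """Return one ranked candidate list per query, in the reconciliation
    API's response shape. Candidates here are hard-coded placeholders."""
    results = {}
    for key, q in queries.items():
        results[key] = {"result": [
            {"id": "example:12345",   # placeholder identifier
             "name": q["query"],
             "score": 100,
             "match": True},
        ]}
    return results

print(json.dumps(reconcile(queries), indent=2))
```

Because OpenRefine only needs this JSON contract, one small service can fan a user's column of names out to several resolution backends and present all the candidates side by side.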
For a Google Summer of Code project in summer 2014, I worked on extending the DBpedia Extraction Framework to be able to extract information from the Wikimedia Commons database dumps and make them available as RDF.
- Software: DBpedia Extraction Framework§ (May 2014 to Aug 2014): I made some improvements to the DBpedia Extraction Framework to allow it to extract information from the Wikimedia Commons.
- Technologies used: Scala, RDF, MediaWiki, Wikimedia Commons.
- Provides the following features:
- Added support for extracting metadata regarding media files in the Wikimedia Commons, including links to the raw files themselves.
- Added support for identifying templates that indicated open access licenses so that these could be included in the RDF generated.
- Added support for extracting annotations for parts of an image and exporting those in RDF.
- Source code available at https://github.com/dbpedia/extraction-framework/commits?author=gaurav.
- §Citation: Gaurav Vaidya, Dimitris Kontokostas, Magnus Knuth, Jens Lehmann, Sebastian Hellmann (October 24, 2015) DBpedia Commons: Structured Multimedia Metadata from the Wikimedia Commons. The Semantic Web - ISWC 2015: Lecture Notes in Computer Science 9367:281-289.
- §Link: Project description on Google Summer of Code's website
- §Link: Project report on the Wikimedia Commons
This project began as an idea that my collaborators (Andrea Thomer and Rob Guralnick) discussed regarding the annotation of historical field notebooks to extract the biodiversity data found in them. I suggested that we use Wikisource, a sister project of Wikipedia, to crowdsource the annotation process and that we develop the software tools needed to carry out such annotations of Wikisource content. We published our findings as a Darwin Core Archive and in the journal ZooKeys in 2012.
- §Citation: Andrea Thomer, Gaurav Vaidya, Robert Guralnick, David Bloom, Laura Russell (July 20, 2012) From documents to datasets: A MediaWiki-based method of annotating and extracting species observations in century-old field notebooks. ZooKeys 209:235-253.
- Software: The Junius Henderson Field Notebook Project source code§ (Jan 2012 to Mar 2012): I wrote a series of scripts for extracting biodiversity data annotated on Wikisource into the Darwin Core biodiversity data-sharing format.
- Technologies used: Perl, Wikisource, MediaWiki, Darwin Core.
- Provides the following features:
- Download annotated text as XML from Wikisource via its APIs.
- Extract MediaWiki templates that correspond to biodiversity observations and record the information as Darwin Core.
- Calculate summary statistics on each notebook, including information on the editors who worked on the greatest number of files.
- Source code available at https://github.com/gaurav/henderson.
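The template-extraction step can be sketched as follows. The template name and its fields are hypothetical examples invented for this illustration (the project's actual Wikisource templates differed), and the original scripts were written in Perl rather than Python.

```python
import re

# Transcribed notebook text containing a hypothetical MediaWiki-style
# annotation template marking a species observation.
text = ("Saw two magpies near the creek. "
        "{{Observation|species=Pica hudsonia|date=1905-03-12}}")

# Pull the template's fields out and record them as Darwin Core terms.
match = re.search(r"\{\{Observation\|species=([^|}]+)\|date=([^|}]+)\}\}", text)
if match:
    record = {
        "dwc:scientificName": match.group(1),
        "dwc:eventDate": match.group(2),
    }
    print(record)
```

Mapping each template field onto a standard Darwin Core term is what let the crowdsourced annotations be republished as a Darwin Core Archive usable by biodiversity databases.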
A start-up focused on providing services around optical character recognition (OCR) technology.
- Software: OCR Terminal§ (Feb 2008 to Dec 2010):
OCR Terminal was an online optical character recognition (OCR) service: it read text from uploaded images and provided the results in an editable format such as Microsoft Word, Adobe PDF or plain text. Between mid-2008 and 2011, tens of thousands of user accounts were created and over 100,000 documents were processed on the website. Apart from the website itself, the service featured a simple API which could be used to submit documents for processing programmatically.
I was the lead developer of OCR Terminal from project inception. I wrote all of OCR Terminal's underlying code, first as a Perl/CGI application and later as a Perl/Catalyst application. I am particularly proud of having designed the public API, which was used by our own desktop client, several in-house tools, and several of our clients for both bulk processing and as a backend for their own software.
I was also OCR Terminal's main server administrator, responsible for maintaining all the servers and backend components, which taught me about server monitoring with tools such as Munin. From early 2009, OCR Terminal was hosted on the Amazon EC2 cloud, giving me experience with setting up, bundling and managing EC2 instances.
- Technologies used: Perl, Amazon Web Services, web APIs.
- Provides the following features:
- A web application allowing users to register an account, OCR a small number of documents for free, and then pay to OCR additional images.
- Included a job management system, allowing jobs uploaded to OCR Terminal to be processed by the ABBYY OCR Engine we used for OCR.
- Included a web API, allowing customers to batch-submit multiple jobs for processing.
- Source code available at https://github.com/gaurav/ocrterminal.
Genetic analysis software at the time was designed to compare a single genetic or protein sequence across different taxa in comparative analyses. In the Evolutionary Biology lab, we would often perform multi-gene analyses, which required aligned genetic sequences to be concatenated together. Modifying such datasets after concatenation could be problematic, especially if one of the constituent sequences turned out to be contamination. Sequence Matrix was intended to simplify the concatenation process while preserving gene boundaries, so that sequences could be unconcatenated if necessary, and included some tools for detecting and removing contamination from multi-gene, multi-taxon datasets.
- Software: SequenceMatrix§ (Aug 2006 to Jun 2015): Sequence Matrix facilitates the assembly of phylogenetic data matrices with multiple genes. Files for individual genes are dragged and dropped into a window and the sequences are concatenated. A table provides an overview of how much sequence information is available for the different genes and species. The user can ask Sequence Matrix to generate a wide variety of character and taxon sets (e.g. a taxon set of all species that have more than a specified number of genes or base pairs). The concatenated sequences can be exported in NEXUS or TNT format, and individual sequences can be excluded from export.
- Technologies used: Java, Apache Ant, Java Swing.
- Provides the following features:
- Concatenating sequences from multiple genes for multiple taxa.
- Records sequence start and end points so that they can be un-concatenated later if needed.
- Includes tools for identifying unusual sequences and suppressing them on export.
- Source code available at http://github.com/gaurav/taxondna.
- Released under the GNU General Public License, version 2.0 or later.
- §Citation: Gaurav Vaidya, David J. Lohman, Rudolf Meier (March 8, 2011) SequenceMatrix: concatenation software for the fast assembly of multi‐gene datasets with character set and codon information. Cladistics 27(2):171–180.
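The core concatenation idea, joining per-gene alignments per taxon while recording each gene's character positions so the matrix can later be split apart again, can be sketched as follows. This is a simplified stand-in for Sequence Matrix's Java implementation; the gene names and sequences are toy data, and the printed boundary lines loosely follow the NEXUS "charset" convention.

```python
# Toy per-gene alignments: gene -> {taxon -> aligned sequence}.
genes = {
    "COI": {"Taxon_A": "ATGC", "Taxon_B": "ATGG"},
    "16S": {"Taxon_A": "GGCC", "Taxon_B": "GGCA"},
}

concatenated, charsets, pos = {}, {}, 0
for gene, seqs in genes.items():
    length = len(next(iter(seqs.values())))
    charsets[gene] = (pos + 1, pos + length)        # 1-based, inclusive range
    for taxon, seq in seqs.items():
        concatenated[taxon] = concatenated.get(taxon, "") + seq
    pos += length

print(concatenated["Taxon_A"])                       # ATGCGGCC
for gene, (start, end) in charsets.items():
    print(f"charset {gene} = {start}-{end};")        # e.g. charset COI = 1-4;
```

A production implementation also has to pad taxa that are missing a gene with gap characters so every concatenated row stays the same length; retaining the recorded boundaries is what makes later un-concatenation possible.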
This project came about in response to the needs of a series of projects in the Evolutionary Biology lab investigating whether genetic distance methods could be used to correctly identify the species from which a genetic sequence had been obtained. The tool we built allowed us to compare several different approaches, and then to replicate these analyses on larger datasets than we had originally envisioned.
- Software: Species Identifier§ (Aug 2006 to Jun 2015): Species Identifier provides a set of tools for exploring intra- and interspecific genetic distances, matching sequences, and clustering sequences based on pairwise distances. It helps determine whether two sequences are likely conspecific based on pairwise distances, and can calculate pairwise distances for large datasets. It was designed to provide the analyses presented in Meier et al, 2006 (Best match, Best close match and All Species Barcodes).
- Technologies used: Java, Apache Ant.
- Provides the following features:
- Replicate the techniques used in Meier et al. 2006 on new datasets.
- Source code available at http://github.com/gaurav/taxondna.
- Released under the GNU General Public License, version 2.0 or later.
- §Citation: Rudolf Meier, Kwong Shiyang, Gaurav Vaidya, Peter K. L. Ng (October 1, 2006) DNA Barcoding and Taxonomy in Diptera: A Tale of High Intraspecific Variability and Low Identification Success. Systematic Biology 55(5):715–728.
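The "best match" idea from Meier et al. 2006, identifying a query sequence by the species of its closest reference sequence under uncorrected pairwise ("p") distance, can be sketched as follows. The sequences and species are toy examples, and this stands in for Species Identifier's Java implementation, which also supports thresholds ("best close match") and clustering.

```python
def p_distance(a, b):
    """Uncorrected p-distance: proportion of differing sites, ignoring gaps."""
    pairs = [(x, y) for x, y in zip(a, b) if x != "-" and y != "-"]
    return sum(x != y for x, y in pairs) / len(pairs)

# Toy reference database: aligned sequence -> species it was sequenced from.
reference = {
    "ATGCATGC": "Drosophila melanogaster",
    "ATGCATGG": "Drosophila melanogaster",
    "TTGCAAGC": "Drosophila simulans",
}

def best_match(query):
    """Return the reference sequence with the smallest p-distance to the query."""
    return min(reference, key=lambda ref: p_distance(query, ref))

query = "ATGCATGA"
print(reference[best_match(query)])   # Drosophila melanogaster
```

In the "best close match" variant, an identification is only accepted if the smallest distance also falls below a threshold derived from intraspecific variability, which is why the tool computes distance distributions for whole datasets.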