I work on several projects to facilitate the re-use of biomedical data by other scientists, with a particular focus on using Semantic Web technologies to make scientific data findable, accessible, interoperable, and reusable. To achieve this, I work with tools that use standards such as RDF and JSON-LD to store and transmit semantically rich data, use ontologies written in OWL to reason over data, and use descriptions in SHACL and ShEx to validate data and document data models. Where existing tools are unavailable or inadequate, I write software to fill the gap. My current projects focus on genetic, disease and other biomedical data, but I look forward to working with other kinds of data in the future. I work primarily in Scala to build these tools.
- §Project: Center for Cancer Data Harmonization (CCDH) (Oct 2019 to Dec 2020): I worked on the CCDH Tools and Data Quality team at RENCI. We aimed to develop software tools to validate biomedical data against the CRDC-H data model for cancer-related data, to publish and transform that data model into different formats as needed, and to carry out any other software development tasks needed to complete the project.
- Software: csv2caDSR (Jul 2020 to Dec 2020): A tool for harmonizing biomedical data using the Cancer Data Standards Registry and Repository (caDSR) as a source of validation information.
- Software: UMLS-RRF-Scala (Mar 2020 to Nov 2020): A set of tools for identifying mappings from terms in one vocabulary to others, intended to be used to provide mappings from closed-source vocabularies to open-access vocabularies.
- §Project: Small projects at RENCI (Oct 2019 to Dec 2020): While working as a Semantic Web technologist at RENCI, I developed a number of ad-hoc software tools to meet various needs from other RENCI teams.
- Software: SHACLI (Oct 2019 to May 2020): A command-line interface (CLI) for the Shapes Constraint Language (SHACL) to validate RDF data against SHACL shapes with improved error messages. This was part of a project to simplify data model creation, so that a single document could be used both to produce documentation and to validate input data.
- Software: Omnicorp and OmniCORD (Oct 2019 to Nov 2020): I built upon the existing Omnicorp tool to improve the RDF data it produces, for example by adding authorship information. I also wrote OmniCORD, a variant of Omnicorp for extracting entities from the COVID-19 Open Research Dataset (CORD-19).
- §Citation: Daniel Korn, Tesia Bobrowski, Michael Li, Yaphet Kebede, Patrick Wang, Phillips Owen, Gaurav Vaidya, Eugene Muratov, Rada Chirkova, Chris Bizon, Alexander Tropsha (November 11, 2020) COVID-KOP: integrating emerging COVID-19 data with the ROBOKOP database. Bioinformatics.
- §Link: My page on the RENCI website
I worked full-time as the lead software developer on the Phyloreferencing project. Our goal was to build an ontology of definitions for groups of related biological organisms, as well as the software and ontological infrastructure needed to create, edit, organize and test these definitions. We built a demonstration website that allows users to resolve these definitions against any evolutionary hypothesis. I built these tools using JavaScript in Node.js, Vue CLI, Java and Python.
- §Project: The Phyloreferencing Project (Jan 2018 to Oct 2019): The Phyloreferencing project aims to digitize definitions of clades (groups of related taxa that are a key concept in evolutionary biology) as semantically rich OWL ontologies, which are both human-readable and computer-interpretable.
- Software: Phyx.js (from Jan 2019): A software library for reading and validating Phyx files, which store clade definitions, and for converting them into OWL ontologies for publication and reasoning.
- Software: JPhyloRef (Sep 2017 to May 2020): A Java-based command line tool for reasoning over and testing ontologies that include phyloreferences.
- Software: Klados (Oct 2017 to Aug 2020): A single-page application for curating, testing and resolving phyloreferences.
- Software: Open Tree Resolver (Feb 2019 to Sep 2020): A single-page application for testing phyloreference resolution on the Open Tree of Life.
- Software: Phyloref Ontology (Jun 2013 to Nov 2020): An ontology containing terms used to concisely and precisely define phyloreferences.
- Software: Clade Ontology (Jun 2017 to May 2020): An ontology containing phyloreferences translated from published clade definitions.
- §Citation: Brian Stucky, James Balhoff, Narayani Barve, Vijay Barve, Laura Brenskelle, Matthew Brush, Gregory Dahlem, James Gilbert, Akito Kawahara, Oliver Keller, Andrea Lucky, Peter Mayhew, David Plotkin, Katja Seltmann, Elijah Talamas, Gaurav Vaidya, Ramona Walls, Matt Yoder, Guanyang Zhang, Rob Guralnick (March 13, 2019) Developing a vocabulary and ontology for modeling insect natural history data: example data, use cases, and competency questions. Biodiversity Data Journal 7.
- §Citation: Gaurav Vaidya, Hilmar Lapp, Nico Cellinese (October 2, 2020) Enabling Machines to Integrate Biodiversity Data with Evolutionary Knowledge. Biodiversity Information Science and Standards 4.
- §Citation: Gaurav Vaidya, Guanyang Zhang, Hilmar Lapp, Nico Cellinese (May 21, 2018) All the Clades in the World: Building a Semantically-Rich and Testable Ontology of Phylogenetic Clade Definitions. Biodiversity Information Science and Standards 2:e25776.
I went into graduate school with two goals: (1) to become a scientific software developer, and (2) to spend a period of time doing a focussed study on the informatics of taxonomic names. Taxonomic names are one of the oldest information management schemes in science, and have had to change dramatically over the last 285 years as biologists' understanding of species and evolution has changed. I wondered what we could learn about this old-yet-new information management system. Apart from the projects listed here that were directly relevant to my PhD, I also spent my time in graduate school learning about several other tools and techniques in a variety of other roles, listed elsewhere.
- §Project: PhD dissertation (Aug 2011 to Dec 2020): In my PhD dissertation, I developed methods for quantifying the rate of change among taxonomic checklists, and the effect such changes can have on the interpretation of biodiversity data.
- Software: SciNames (Jan 2017 to May 2019): A graphical user interface (GUI) for processing changes among taxonomic checklists. SciNames imports multiple taxonomic checklists into a single XML file, and then generates lists of changes between those checklists. Users can then use the GUI to annotate these changes, recording why each change had occurred and what sort of change it was. SciNames can then calculate some statistics on the changes.
- §Citation: Gaurav Vaidya (January 19, 2018) Taxonomic Checklists as Biodiversity Data: How Series of Checklists can Provide Information on Synonymy, Circumscription Change and Taxonomic Discovery. Ph.D. Dissertation.
- §Citation: Gaurav Vaidya, Denis Lepage, Robert Guralnick (April 19, 2018) The tempo and mode of the taxonomic correction process: How taxonomists have corrected and recorrected North American bird species over the last 127 years. PLOS ONE 13(4):e0195736.
- §Citation: Denis Lepage, Gaurav Vaidya, Robert Guralnick (June 25, 2014) Avibase – a database system for managing and organizing taxonomic concepts. ZooKeys 420:117–135.
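The core operation SciNames performs — detecting changes between successive versions of a taxonomic checklist — can be sketched briefly. This is an illustrative Python sketch, not SciNames itself (which is a Java GUI); the bird names and checklist contents are invented for the example.

```python
# Comparing two versions of a taxonomic checklist to list additions and
# deletions -- the raw material that a curator then annotates as synonymy,
# genus transfers, lumps or splits.

def compare_checklists(earlier, later):
    """Return names added in, and removed from, the later checklist."""
    earlier, later = set(earlier), set(later)
    return {
        "added": sorted(later - earlier),
        "removed": sorted(earlier - later),
        "unchanged": len(earlier & later),
    }

# Toy checklists: a genus transfer appears as one removal plus one addition.
checklist_1998 = {"Anas platyrhynchos", "Anas rubripes", "Oporornis tolmiei"}
checklist_2011 = {"Anas platyrhynchos", "Anas rubripes", "Geothlypis tolmiei"}

changes = compare_checklists(checklist_1998, checklist_2011)
print(changes["added"])    # ['Geothlypis tolmiei']
print(changes["removed"])  # ['Oporornis tolmiei']
```

Pairing up such additions and removals (and deciding whether each pair is a rename, a lump, or a genuinely new taxon) is the part that needs human annotation, which is why SciNames is built around a GUI.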
- §Project: Small projects during my PhD (Jan 2011 to Jan 2017): Some small software projects I worked on during my PhD.
- Software: BibURI (Oct 2013 to Jan 2014): A Ruby Gem to extract BibTeX information from a number of bibliographic resources, such as DOIs and COinS webpages. I wrote this at the first TaxonWorks hackathon.
- §Citation: Arlin Stoltzfus, Hilmar Lapp, Naim Matasci, Helena Deus, Brian Sidlauskas, Christian M Zmasek, Gaurav Vaidya, Enrico Pontelli, Karen Cranston, Rutger Vos, Campbell O Webb, Luke J Harmon, Megan Pirrung, Brian O'Meara, Matthew W Pennell, Siavash Mirarab, Michael S Rosenberg, James P Balhoff, Holly M Bik, Tracy A Heath, Peter E Midford, Joseph W Brown, Emily Jane McTavish, Jeet Sukumaran, Mark Westneat, Michael E Alfaro, Aaron Steele, Greg Jordan (May 13, 2013) Phylotastic! Making tree-of-life knowledge accessible, reusable and convenient. BMC Bioinformatics 14(1):158.
- §Citation: Andrea Thomer, Gaurav Vaidya, Robert Guralnick, David Bloom, Laura Russell (July 20, 2012) From documents to datasets: A MediaWiki-based method of annotating and extracting species observations in century-old field notebooks. ZooKeys 209:235-253.
While a graduate student, I taught three Evolutionary Biology and General Biology labs, in which I led classes of up to eighteen undergraduate students through hands-on exercises to build their understanding of biology and evolution. CU Boulder asks students to evaluate their instructors on a scale from 1 (worst) to 6 (best); my reviews increased from 4.8-5.5 in my first semester to 5.3-5.6 in my last semester.
- §Project: General Biology Labs 2 (EBIO 1240) (Jan 2016 to May 2016): I led three lab sections of eighteen students each, taking my students on a tour of the tree of life while reinforcing concepts in evolutionary biology, ecology, anatomy and physiology.
- §Project: General Biology Labs 1 (EBIO 1230) (Aug 2015 to Dec 2015): I led three lab sections of eighteen students each, teaching the philosophical underpinnings of science through hands-on experiments in cellular and molecular biology.
- §Project: Evolutionary Biology (EBIO 3080) (Jan 2015 to May 2015): I led three lab sections of twenty students each, teaching evolutionary biology through R-based statistics and modeling labs, measurements and phylogenetics.
During the first four years of my PhD, I worked on the Map of Life and VertNet projects, where I developed a web application for synthesizing and managing vernacular names extracted from multiple sources, a web API for efficiently searching a large database of taxonomic names, and a Python tool to identify errors in species names (now deprecated).
- §Project: Art of Life (May 2012 to Apr 2015): A project organized by the Biodiversity Heritage Library (BHL) to identify and annotate hundreds of thousands of illustrations from the documents in this digital library.
- §Link: Art of Life Schema (Apr 2015): I worked with my PhD advisor and some librarians at the BHL to develop a data schema for annotating biological illustrations in a way that would make them useful for biodiversity researchers.
- §Project: Map of Life (Jan 2011 to Jan 2015): An NSF-funded project to synthesize different kinds of biodiversity data — from occurrences to rangemaps — into a single, easy-to-use tool. I worked on the Map of Life project, usually during summer holidays, under the supervision of my PhD advisor, Rob Guralnick.
- Software: Vernacular Names (Feb 2014 to Jul 2015): A web application for managing the synthesis and verification of vernacular name information for Map of Life.
- Software: TaxRefine (Jun 2013 to May 2014): Provides an OpenRefine reconciliation service API for matching taxonomic names against several services.
- §Link: Validating scientific names with the GBIF Checklist Bank (Jul 2013): A blog post describing TaxRefine.
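A reconciliation service like TaxRefine answers batched name queries in the JSON shape that OpenRefine expects: each query key maps to a list of scored candidate matches. The sketch below shows that response format with a toy in-memory lookup table standing in for TaxRefine's real backend (the GBIF Checklist Bank); the `KNOWN_NAMES` table and its identifier are hypothetical.

```python
# A minimal sketch of the OpenRefine reconciliation response format.
# The matching logic is a stand-in: a real service would query an
# external name index and return ranked, scored candidates.
import json

KNOWN_NAMES = {  # hypothetical lookup table standing in for a real backend
    "Panthera leo": "example:12345",
}

def reconcile(queries_json):
    """Answer a batch of reconciliation queries with scored candidates."""
    queries = json.loads(queries_json)
    response = {}
    for key, q in queries.items():
        name = q["query"]
        if name in KNOWN_NAMES:
            result = [{"id": KNOWN_NAMES[name], "name": name,
                       "score": 100, "match": True}]
        else:
            result = []
        response[key] = {"result": result}
    return response

resp = reconcile(json.dumps({"q0": {"query": "Panthera leo"}}))
print(resp["q0"]["result"][0]["id"])  # example:12345
```

Because the protocol is just JSON over HTTP, OpenRefine can reconcile a column of messy taxonomic names against any service that speaks this format.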
- §Project: The Junius Henderson Field Notebook Project (Nov 2011 to Jul 2012): This project began as an idea that my collaborators (Andrea Thomer and Rob Guralnick) discussed regarding the annotation of historical field notebooks to extract the biodiversity data found in them. I suggested that we use Wikisource, a sister project of Wikipedia, to crowdsource the annotation process, and I developed the software tools needed to carry out such annotations of Wikisource content. We published our findings as a Darwin Core Archive and in the journal ZooKeys in 2012.
- §Citation: Andrea Thomer, Gaurav Vaidya, Robert Guralnick, David Bloom, Laura Russell (July 20, 2012) From documents to datasets: A MediaWiki-based method of annotating and extracting species observations in century-old field notebooks. ZooKeys 209:235-253.
- Software: The Junius Henderson Field Notebook Project source code (Jan 2012 to Mar 2012): I wrote a series of scripts for extracting biodiversity data annotated on Wikisource into the Darwin Core biodiversity data sharing format.
I extended DBpedia's fact extraction software to support extracting facts in RDF from the Wikimedia Commons, an online repository that then contained around 25 million media files across a number of formats and licenses.
- §Project: DBpedia Commons at Google Summer of Code (May 2014 to Aug 2014): For a Google Summer of Code project in summer 2014, I worked on extending the DBpedia Extraction Framework to be able to extract information from the Wikimedia Commons database dumps and make them available as RDF.
- Software: DBpedia Extraction Framework (May 2014 to Aug 2014): I made some improvements to the DBpedia Extraction Framework to allow it to extract information from the Wikimedia Commons.
- §Citation: Gaurav Vaidya, Dimitris Kontokostas, Magnus Knuth, Jens Lehmann, Sebastian Hellmann (October 24, 2015) DBpedia Commons: Structured Multimedia Metadata from the Wikimedia Commons. The Semantic Web - ISWC 2015: Lecture Notes in Computer Science 9367:281-289.
- §Link: Project description on Google Summer of Code's website
- §Link: Project report on the Wikimedia Commons
This fellowship allowed me to work exclusively on my PhD for one semester with a mentor at the National Evolutionary Synthesis Center (NESCent) in Durham, North Carolina. My mentor at NESCent, Hilmar Lapp, would later hire me to work on the Phyloreferencing project.
I was the primary software developer, responsible for developing new web applications from prototype to final deployment, including OCR Terminal, my company's flagship product. I also managed our computer systems, both on-site and on the Amazon EC2 cloud platform, across multiple operating systems.
- §Project: OCR Terminal (Jan 2008 to Jan 2011): A start-up that focussed on providing services around optical character recognition (OCR) technology.
- Software: OCR Terminal (Feb 2008 to Dec 2010):
OCR Terminal was an online optical character recognition (OCR) service: it read text from uploaded images and provided the recognized text in an editable format such as Microsoft Word, Adobe PDF or plain text. Between mid-2008 and 2011, tens of thousands of user accounts were created and over 100,000 documents were processed on the website. Apart from the website itself, the service offered a simple API that could be used to submit documents for processing programmatically.
I was the lead developer of OCR Terminal from project inception. I wrote all of OCR Terminal's underlying code, first as a Perl/CGI application and later as a Perl/Catalyst application. I am particularly proud of designing the public API, which was used by our own desktop client, several in-house tools, and several of our clients, both for bulk processing and as a backend for their own software.
I was also OCR Terminal's main server administrator, responsible for maintaining all the servers and backend components, which taught me about server monitoring with tools such as Munin. From early 2009, OCR Terminal was hosted on the Amazon EC2 cloud, giving me experience with setting up, bundling and managing EC2 instances.
- §Link: Description of OCR Terminal on ABBYY's website
I helped manage computer-related infrastructure, from sending computers for servicing to installing scientific software on both local hardware and remote computing clusters. I also finished work on several scientific tools, which I have documented under my educational history below.
- §Project: SequenceMatrix (Aug 2006 to Nov 2010): Genetic analysis software at this time was designed to compare a single genetic or protein sequence across different taxa to perform comparative analyses. In the Evolutionary Biology lab, we would often perform multi-gene analyses, requiring aligned genetic sequences to be concatenated together. Modifying such datasets after concatenation could be problematic, especially if one of the constituent sequences turned out to be contamination. Sequence Matrix was intended to simplify concatenation while preserving gene boundaries, so that datasets could be unconcatenated if necessary, and included some tools for detecting and removing contamination from multi-gene, multi-taxon datasets.
- Software: SequenceMatrix (Aug 2006 to Jun 2015): Sequence Matrix facilitates the assembly of phylogenetic data matrices with multiple genes. Files for individual genes are dragged and dropped into a window and the sequences are concatenated. A table provides an overview of how much sequence information is available for the different genes and species. The user can ask Sequence Matrix to generate a wide variety of character and taxon sets (e.g. a taxon set with all species that have more than a specified number of genes or basepairs). The concatenated sequences can be exported in NEXUS or TNT format, and individual sequences can be excluded from export.
- §Citation: Gaurav Vaidya, David J. Lohman, Rudolf Meier (March 8, 2011) SequenceMatrix: concatenation software for the fast assembly of multi‐gene datasets with character set and codon information. Cladistics 27(2):171–180.
- §Citation: Shiyang Kwong, Amrita Srivathsan, Gaurav Vaidya, Rudolf Meier (December 11, 2011) Is the COI barcoding gene involved in speciation through intergenomic conflict?. Molecular Phylogenetics and Evolution 62(3):1009-1012.
- §Link: My description on our lab website
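The bookkeeping at the heart of Sequence Matrix — concatenating per-gene alignments while recording each gene's coordinates as a NEXUS-style character set, and padding taxa that lack a gene — can be sketched compactly. This is an illustrative Python sketch under invented toy data, not the real tool, which is a Java GUI with drag-and-drop input.

```python
# Concatenate per-gene alignments into one matrix, padding missing taxa
# with '?' and recording each gene's 1-based coordinate range so the
# matrix can later be split back apart at gene boundaries.

def concatenate(genes, taxa):
    """Return (concatenated sequences per taxon, charset ranges per gene)."""
    sequences = {taxon: "" for taxon in taxa}
    charsets, start = {}, 1
    for gene, alignment in genes.items():
        length = len(next(iter(alignment.values())))  # aligned, equal length
        for taxon in taxa:
            sequences[taxon] += alignment.get(taxon, "?" * length)
        charsets[gene] = (start, start + length - 1)
        start += length
    return sequences, charsets

genes = {  # toy alignments
    "COI": {"Drosophila": "ACGT", "Musca": "ACGA"},
    "16S": {"Drosophila": "TTAA"},          # missing for Musca
}
seqs, charsets = concatenate(genes, ["Drosophila", "Musca"])
print(seqs["Musca"])       # ACGA????
print(charsets["16S"])     # (5, 8)
```

Keeping the charset ranges alongside the matrix is what makes later "unconcatenation" possible: a suspect gene can be sliced out of every taxon by its recorded coordinates.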
- §Project: Species Identifier (Jan 2003 to Jan 2006): This project grew out of a series of projects in the Evolutionary Biology lab investigating whether genetic distance methods could be used to correctly identify the species from which a genetic sequence had been obtained. The tool we built allowed us to compare several different approaches, and then to replicate these analyses on larger datasets than we had originally envisioned.
- Software: Species Identifier (Aug 2006 to Jun 2015): Species Identifier provides a set of tools for exploring intra- and interspecific genetic distances, matching sequences, and clustering sequences based on pairwise distances. It helps determine whether two sequences are likely conspecific based on pairwise distances, and can calculate pairwise distances for large datasets. It was designed to provide the analyses presented in Meier et al. (2006): Best match, Best close match and All Species Barcodes.
- §Citation: Rudolf Meier, Kwong Shiyang, Gaurav Vaidya, Peter K. L. Ng (October 1, 2006) DNA Barcoding and Taxonomy in Diptera: A Tale of High Intraspecific Variability and Low Identification Success. Systematic Biology 55(5):715–728.
- §Citation: Torsten Dikow, Rudolf Meier, Gaurav G. Vaidya, Jason G. H. Londt (March 25, 2009) Chapter Twelve. Biodiversity Research Based On Taxonomic Revisions — A Tale Of Unrealized Opportunities. Diptera Diversity: Status, Challenges and Tools :323-346.
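The simplest of the identification criteria mentioned above, "best match", assigns a query sequence the species of its closest reference by uncorrected pairwise ("p") distance. The sketch below illustrates that idea in Python with invented toy sequences; Species Identifier itself is a Java tool with more options (best close match, distance thresholds, ambiguous-base handling).

```python
# "Best match" identification: compute the uncorrected p-distance from a
# query to every reference sequence and return the species of the closest.

def p_distance(a, b):
    """Proportion of differing sites between two aligned sequences."""
    assert len(a) == len(b), "sequences must be aligned to equal length"
    return sum(1 for x, y in zip(a, b) if x != y) / len(a)

def best_match(query, references):
    """Return the species label of the reference closest to the query."""
    return min(references, key=lambda ref: p_distance(query, ref[1]))[0]

references = [  # (species, aligned sequence) -- toy data
    ("Musca domestica", "ACGTACGT"),
    ("Drosophila melanogaster", "ACGTTCGG"),
]
print(best_match("ACGTACGA", references))  # Musca domestica
```

"Best close match" refines this by additionally requiring the winning distance to fall below a threshold, so that queries with no close relative in the reference set are reported as unidentifiable rather than forced onto the nearest species.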