The Common Metadata Elements for Cataloging Biomedical Datasets outlines a proposed set of core, minimal metadata elements that can be used to describe biomedical datasets, such as those resulting from research funded by the National Institutes of Health. It can inform efforts to better catalog or index such data to improve discoverability. The proposed metadata elements are based on an analysis of the metadata schemas used in a set of NIH-supported data sharing repositories. Common elements from these data repositories were identified, mapped to existing data-specific metadata standards from existing multidisciplinary data repositories, DataCite and Dryad, and compared with metadata used in MEDLINE records to establish a sustainable and integrated metadata schema.


Created in 2013

Common Metadata Elements for Cataloging Biomedical Datasets.

Related Databases (37)
GenBank is the NIH genetic sequence database, an annotated collection of all publicly available DNA sequences. The complete release notes for the current version of GenBank are available on the NCBI ftp site. A new release is made every two months. GenBank growth statistics for both the traditional GenBank divisions and the WGS division are available from each release. GenBank is part of the International Nucleotide Sequence Database Collaboration (INSDC), which comprises the DNA DataBank of Japan (DDBJ), the European Molecular Biology Laboratory (EMBL), and GenBank at the NCBI. These three organizations exchange data on a daily basis.

Genetic, genomic and molecular information pertaining to the model organism Drosophila melanogaster and related sequences. This database also contains information relating to human disease models in Drosophila, the use of transgenic constructs containing sequence from other organisms in Drosophila, and information on where to buy Drosophila strains and constructs.

Mouse Phenome Database
Characterizations of hundreds of strains of laboratory mice to facilitate translational discoveries and to assist in selection of strains for experimental studies. Data sets are voluntarily contributed by researchers or retrieved by us from public sources. MPD has three major types of strain-centric data sets: phenotype strain surveys, SNP and variation data, and gene expression strain surveys.

WormBase is an international consortium of biologists and computer scientists dedicated to providing the research community with accurate, current, accessible information concerning the genetics, genomics and biology of C. elegans and related nematodes.

Eukaryotic Pathogen, Vector and Host Informatics Resource
The Eukaryotic Pathogen, Vector and Host Informatics Resource (VEuPathDB) focuses on eukaryotic pathogens and invertebrate vectors of infectious diseases, encompassing data from prior resources devoted to parasitic species (EuPathDB), fungi (FungiDB) and vector species (VectorBase). While each of the taxonomic groups within this resource is supported by a taxon-specific database built upon the same infrastructure, the EuPathDB portal offers an entry point to all of these resources, and the opportunity to leverage orthology for searches across genera.

NeuroMorpho.Org is a centrally curated inventory of 3D digitally reconstructed neurons associated with peer-reviewed publications. The goal of NeuroMorpho.Org is to provide dense coverage of available reconstruction data for the neuroscience community.

Rat Genome Database
The Rat Genome Database stores genetic, genomic, phenotype, and disease data generated from rat research. It provides access to corresponding data for eight other species, allowing cross-species comparison. Data curation is performed both manually and via an automated pipeline, giving RGD users integrated access to a wide variety of data to support their research.

Mouse Genome Database - a Mouse Genome Informatics (MGI) Resource
MGI is the international database resource for the laboratory mouse, providing integrated genetic, genomic, and biological data to facilitate the study of human health and disease. Data includes gene characterization, nomenclature, mapping, gene homologies among mammals, sequence links, phenotypes, allelic variants and mutants, and strain data.

Database of Single Nucleotide Polymorphism
dbSNP contains human single nucleotide variations, microsatellites, and small-scale insertions and deletions along with publication, population frequency, molecular consequence, and genomic and RefSeq mapping information for both common variations and clinical mutations.

Gene Expression Omnibus
The Gene Expression Omnibus (GEO) is a public repository that archives and freely distributes microarray, next-generation sequencing, and other forms of high-throughput functional genomic data submitted by the scientific community. In addition to data storage, a collection of web-based interfaces and applications are available to help users query and download the studies and gene expression patterns stored in GEO.

PubChem is organized as three linked databases within the NCBI's Entrez information retrieval system. These are PubChem Substance, PubChem Compound, and PubChem BioAssay. PubChem also provides a fast chemical structure similarity search tool. More information about using each component database may be found using the links in the homepage.

PubMed is a search engine for biomedical literature, provided as a service of the U.S. National Library of Medicine. It includes more than 25 million citations for biomedical literature from MEDLINE, life science journals, and online books. Citations may include links to full-text content from PubMed Central and publisher web sites.

Database of Genotypes and Phenotypes
The Database of Genotypes and Phenotypes (dbGaP) archives and distributes the results of studies that have investigated the interaction of genotype and phenotype. Such studies include genome-wide association studies, medical sequencing, molecular diagnostic assays, as well as association between genotype and non-clinical traits.

Database of genomic structural VARiation
dbVar is a database of human genomic structural variation where users can search, view, and download data from submitted studies. dbVar stopped supporting data from non-human organisms in 2017; however, existing non-human data remains available. In keeping with the common definition of structural variation, most variants are larger than 50 base pairs in length, although a handful of smaller variants may also be found. dbVar provides access to the raw data whenever available, as well as links to additional resources, from both NCBI and elsewhere. It can accept diverse types of events, including inversions, insertions and translocations. Additionally, both germline and somatic variants are accepted.

Dryad is an open-source, community-led data curation, publishing, and preservation platform for CC0 publicly available research data. Dryad has a long-term data preservation strategy, and is a Core Trust Seal Certified Merritt repository with storage in US and EU at the San Diego Supercomputing Center, DANS, and Zenodo. While data is undergoing peer review, it is embargoed if the related journal requires or allows this. Dryad is an independent non-profit that works directly with: researchers to publish datasets utilising best practices for discovery and reuse; publishers to support the integration of data availability statements and data citations into their workflows; and institutions to enable scalable campus support for research data management best practices at low cost. Costs are covered by institutional, publisher, and funder members; otherwise, authors pay a one-time fee of $120 to cover the cost of curation and preservation. Dryad also receives direct funder support through grants.

ClinicalTrials.gov is a registry and results database of publicly and privately supported clinical studies of human participants conducted around the world.

Biologic Specimen and Data Repository Information Coordinating Center
The goal of Biologic Specimen and Data Repository Information Coordinating Center (BioLINCC) is to facilitate and coordinate the existing activities of the NHLBI Biorepository and the Data Repository and to expand their scope and usability to the scientific community through a single web-based user interface. BioLINCC provides a wealth of information on historical NHLBI clinical and epidemiologic studies which have data or biospecimens in the NHLBI repositories, and includes study summaries, references, and study operational documents.

Integrating Data for Analysis, Anonymization, and Sharing (iDASH)
Integrating Data for Analysis, Anonymization and SHaring (iDASH) is one of the National Centers for Biomedical Computing (NCBC) under the NIH Roadmap for Bioinformatics and Computational Biology. Founded in 2010, the iDASH center is hosted on the campus of the University of California, San Diego and addresses fundamental challenges to research progress and enables global collaborations anywhere and anytime. Driving biological projects motivate, inform, and support tool development in iDASH. iDASH collaborates with other NCBCs and disseminates tools via annual workshops, presentations at major conferences, and scientific publications. iDASH offers a secure cyberinfrastructure and tools to support a privacy-preserving data repository and open source software. iDASH also is active in research and training in its mission area.

The Cardiovascular Research Grid
The CardioVascular Research Grid (CVRG) project is creating an infrastructure for sharing cardiovascular data and data analysis tools. CVRG tools are developed using the Software as a Service model, allowing users to access tools through their browser, thus eliminating the need to install and maintain complex software.

GeneNetwork is a group of linked data sets and tools used to study complex networks of genes, molecules, and higher order gene function and phenotypes. GeneNetwork combines more than 25 years of legacy data generated by hundreds of scientists together with sequence data (SNPs) and massive transcriptome data sets (expression or eQTL data sets). GeneNetwork was created in 1994 as The Portable Dictionary of the Mouse Genome, and became WebQTL in 2001. In 2005 it was renamed GeneNetwork.

BEI Resource Repository
BEI Resources provides reagents, tools and information for studying Category A, B, and C priority pathogens, emerging infectious disease agents, non-pathogenic microbes and other microbiological materials of relevance to the research community.

Neuroimaging Informatics Tools and Resources Collaboratory Resources Registry
Neuroimaging Informatics Tools and Resources Collaboratory Resources Registry (NITRC-R) describes software tools and resources, vocabularies, test data, and databases. It is intended to extend the impact and longevity of previously funded neuroimaging informatics contributions. NITRC-R gives researchers access to tools and resources, categorization and organization of existing tools and resources, facilitation of interactions between researchers and developers, and promotion of best practices through enhanced documentation and tutorials. NITRC’s scientific focus includes: MR, PET/SPECT, CT, EEG/MEG, optical imaging, clinical neuroimaging, computational neuroscience, and imaging genomics software tools, data, and computational resources.

National Addiction & HIV Data Archive Program
NAHDAP acquires, preserves and disseminates data relevant to drug addiction and HIV research. The scope of the data housed at NAHDAP covers a wide range of legal and illicit drugs (alcohol, tobacco, marijuana, cocaine, synthetic drugs, and others) and the trajectories, patterns, and consequences of drug use as well as related predictors and outcomes.

Neuroscience Information Framework (NIF)
NIF maintains the largest searchable collection of neuroscience data, the largest catalog of biomedical resources, and the largest ontology for neuroscience on the web.

The PhysioNet Resource is intended to stimulate current research and new investigations in the study of complex biomedical and physiologic signals. It offers free web access to large collections of recorded physiologic signals and related open-source software. Data includes well-characterized digital recordings of physiologic signals, time series, and related data for use by the biomedical research community. PhysioNet includes collections of cardiopulmonary, neural, and other biomedical signals from healthy subjects and patients with a variety of conditions with major public health implications, including sudden cardiac death, congestive heart failure, epilepsy, gait disorders, sleep apnea, and aging.

Transporter Classification Database
This freely accessible database details a comprehensive IUBMB-approved classification system for membrane transport proteins known as the Transporter Classification (TC) system. The TC system is analogous to the Enzyme Commission (EC) system for classification of enzymes, except that it incorporates both functional and phylogenetic information for organisms of all types. As of Oct. 1, 2020, TCDB consists of 20,653 proteins classified in 15,528 non-redundant transport systems with 1,567 tabulated 3D structures, 18,336 reference citations describing 1,536 transporter families, of which 26% are members of 82 recognized superfamilies. Overall, this is an increase of over 50% since the last published update of the database in 2016. The most recent update of the database contents and features includes (1) adoption of a chemical ontology for substrates of transporters, (2) inclusion of new superfamilies, (3) a domain-based characterization of transporter families (tcDoms) for the identification of new members as well as functional and evolutionary relationships between families, (4) development of novel software to facilitate curation and use of the database, (5) addition of new subclasses of transport systems including 11 novel types of channels and 3 types of group translocators, and (6) the inclusion of many man-made (artificial) transmembrane pores/channels and carriers.

Quantitative Trait Loci Archive
This site provides access to raw data from various QTL (quantitative trait loci) studies using rodent inbred line crosses. Data are available in the .csv format used by R/qtl and pseudomarker programs. In some cases analysis scripts and/or results are posted to accompany the data.

Biological General Repository for Interaction Datasets
The Biological General Repository for Interaction Datasets (BioGRID) is a public database that archives and disseminates genetic and protein interaction data from model organisms and humans. BioGRID currently holds over 1,740,000 interactions curated from both high-throughput datasets and individual focused studies, as derived from over 70,000 publications in the primary literature. Complete coverage of the entire literature is maintained for budding yeast (S. cerevisiae), fission yeast (S. pombe) and thale cress (A. thaliana), and efforts to expand curation across multiple metazoan species are underway. All data are freely provided via our search index and available for download in many standardized formats.

Biospecimens/Biorepositories: Rare Disease Hub (RD-HUB)
The Biospecimens/Biorepositories Website: Rare Disease-HUB (RD-HUB) contains a searchable database of biospecimens collected, stored, and distributed by biorepositories in the United States and around the globe. RD-HUB is designed to help interested parties and investigators search for, locate, and identify the biospecimens needed for their research and to facilitate collaboration and sharing of material and data among investigators across the globe.

The Cancer Imaging Archive
The Cancer Imaging Archive (TCIA) is a service which de-identifies and hosts medical images of cancer for public download. The data are organized as “collections”, typically patients’ imaging related by a common disease (e.g. lung cancer), image modality or type (MRI, CT, digital histopathology, etc.) or research focus. DICOM is the primary file format used by TCIA for radiology imaging. Supporting data related to the images such as patient outcomes, treatment details, genomics and expert analyses are also provided when available.

Xenopus laevis and tropicalis biology and genomics resource
Xenbase is the model organism database for Xenopus laevis and X. (Silurana) tropicalis which was created to improve knowledge of developmental and disease processes. Through curation and automated data provisioning from various sources, Xenbase aims to integrate the body of knowledge on Xenopus genomics and biology together with the visualization of biologically-significant interactions.

Database of Interacting Proteins
The Database of Interacting Proteins (DIP) stores experimentally determined interactions between proteins. It combines information from a variety of sources to create a single, consistent set of protein-protein interactions. The data stored within the DIP database were curated, both manually by expert curators and automatically using computational approaches that utilize the knowledge about the protein-protein interaction networks extracted from the core DIP data.

VectorBase is a web-accessible data repository for information about invertebrate vectors of human pathogens. VectorBase annotates and maintains vector genomes (as well as a number of non-vector genomes for comparative analysis) providing an integrated resource for the research community. VectorBase contains genome information for organisms such as Anopheles gambiae, a vector for the Plasmodium protozoan agent causing malaria, and Aedes aegypti, a vector for the flaviviral agents causing Yellow fever and Dengue fever. Hosted data range from genome assemblies with annotated gene features, transcript and protein expression data to population genetics including variation and insecticide-resistance phenotypes.

Nuclear Receptor Signaling Atlas
The mission of NURSA is to accrue, develop, and communicate information that advances our understanding of the roles of nuclear receptors (NRs) and coregulators in human physiology and disease.

Worldwide Protein Data Bank
The Worldwide PDB (wwPDB) organization manages the PDB archive and ensures that the PDB is freely and publicly available to the global community. The mission of the wwPDB is to maintain a single Protein Data Bank Archive of macromolecular structural data that is freely and publicly available to the global community. The wwPDB is composed of the RCSB PDB, PDBe, PDBj and BMRB.

Inter-university Consortium for Political and Social Research
ICPSR is a data archive of behavioral and social science research data. An international consortium of more than 750 academic institutions and research organizations, ICPSR provides leadership and training in data access, curation, and methods of analysis for the social science research community. ICPSR is a CoreTrustSeal core certified repository and was a 2019 United States National Medal for Museum and Library Service recipient, the nation’s highest honor given to museums and libraries that make significant and exceptional contributions to their communities.

DataCite Repository
DataCite is a leading global non-profit organisation that provides persistent identifiers (DOIs) for research data. Their goal is to help the research community locate, identify, and cite research data with confidence. They support the creation and allocation of DOIs and accompanying metadata. They provide services that support the enhanced search and discovery of research content. They also promote data citation and advocacy through community-building efforts and responsive communication and outreach materials. DataCite gathers metadata for each DOI assigned to an object. The metadata is used for a large index of research data that can be queried directly to find data, obtain stats and explore connections. All the metadata is free to access and review. To showcase and expose the metadata gathered, DataCite provides an integrated search interface, where it is possible to search, filter and extract all the details from a collection of millions of records.

