Metadata & Data Flow

To validate metadata prior to submitting massively parallel sequencing data to the NCBI Short Read Archive or GEO, go here and click on the "EDACC" item under Tools in the top left corner of the page.

For more details on metadata definitions and data flow, check out the EDACC Recommendations document. The following is an excerpt from the document.

Metadata
EDACC recommends the use of core SRA XML elements and additional attributes defined for purposes of Epigenomics Roadmap XML metadata submissions as described below.

Core SRA XML elements are described here:

http://www.ncbi.nlm.nih.gov/Traces/sra/sra.cgi?cmd=show&f=rfc&m=doc&s=rfc

Additional attributes are listed in Appendix 1 of the EDACC Recommendations document.

Data Flow
The following data model (Fig. 1) will be used by EDACC to capture the data from REMCs and generate data and metadata for submission to the GEO/SRA archives at the NCBI. The arrows indicate "foreign key" references. Specifically, Results refer to a specific Run, an event that denotes production of a specific unit of data (such as the production of DNA sequencing reads from a single lane of an Illumina sequencer, or hybridization of a sample to a specific array), in the context of a specific Experiment on a specific Sample in the context of a specific Study.
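The chain of references described above can be sketched in Python. This is a minimal illustration, not the EDACC schema: the class and field names are hypothetical, and the strictly linear chain (Result to Run to Experiment to Sample to Study) is one reading of the paragraph above.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Study:
    study_id: str  # hypothetical identifier field


@dataclass(frozen=True)
class Sample:
    sample_id: str
    study: Study  # "foreign key" reference to the Study


@dataclass(frozen=True)
class Experiment:
    experiment_id: str
    sample: Sample  # "foreign key" reference to the Sample


@dataclass(frozen=True)
class Run:
    run_id: str
    experiment: Experiment  # "foreign key" reference to the Experiment


@dataclass(frozen=True)
class Result:
    result_id: str
    level: int  # Level 0-4
    run: Run  # "foreign key" reference to the Run


def lineage(result: Result) -> List[str]:
    """Follow the references from a Result back to its Study."""
    run = result.run
    experiment = run.experiment
    sample = experiment.sample
    return [
        result.result_id,
        run.run_id,
        experiment.experiment_id,
        sample.sample_id,
        sample.study.study_id,
    ]
```

Because every object holds a reference to its parent, any Result can be traced back to the Study that produced it without a separate lookup table.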

Data and Metadata for Run, Experiment, Sample, and Study
Metadata for Run, Experiment, Sample, and Study contain a reference to "EDACC Recommendation version <X>", where X denotes the version number of the EDACC Recommendation document describing the semantics of Data elements.
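As a rough sketch of how such a version reference might be carried in XML metadata, the Python snippet below attaches it to a sample record as a TAG/VALUE attribute. The element names, the tag name "EDACC Recommendation version", and the alias are assumptions made for illustration; the authoritative element and attribute names are those in the core SRA XML documentation and Appendix 1 of the EDACC Recommendations document.

```python
import xml.etree.ElementTree as ET


def build_sample_xml(alias: str, recommendation_version: str) -> str:
    """Build a small sample record that carries a version reference.

    Element and tag names here are illustrative assumptions, not a
    validated SRA XML document.
    """
    sample = ET.Element("SAMPLE", alias=alias)
    attributes = ET.SubElement(sample, "SAMPLE_ATTRIBUTES")
    attribute = ET.SubElement(attributes, "SAMPLE_ATTRIBUTE")
    ET.SubElement(attribute, "TAG").text = "EDACC Recommendation version"
    ET.SubElement(attribute, "VALUE").text = recommendation_version
    # encoding="unicode" returns a str without an XML declaration
    return ET.tostring(sample, encoding="unicode")


xml_text = build_sample_xml("example-sample", "X")
```

The same pattern would apply to Run, Experiment, and Study records, each carrying its own version attribute.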

Results: Data and Metadata Levels 0-4
Metadata for Level 0-4 Results consists of the following elements:

  1. Reference to "EDACC Recommendation version <X>", where X denotes the version number of the EDACC Recommendation document describing the semantics of Data elements.
  2. When specific formats (BED, Wiggle, Genboree LFF, or others) are used, the Metadata also describes how the Data elements described in this document are encoded in the specific formats. Some format-specific encoding methods will be described in this document, in which case a reference to this document may suffice.
  3. Additional information about the method/program used to obtain the Results and other information as described in sections below
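The three elements listed above lend themselves to a simple completeness check. The Python sketch below is hypothetical: the dictionary keys (`recommendation_version`, `format`, `encoding_description`, `method`) are invented for illustration and are not field names defined by EDACC.

```python
from typing import Dict, List


def check_result_metadata(meta: Dict[str, str]) -> List[str]:
    """Return a list of problems found in Level 0-4 Results metadata.

    An empty list means all three elements are present. Key names are
    hypothetical.
    """
    problems = []
    # Element 1: reference to the EDACC Recommendation version
    if not meta.get("recommendation_version"):
        problems.append("missing reference to the EDACC Recommendation version")
    # Element 2: if a specific format (BED, Wiggle, LFF, ...) is used, the
    # encoding of Data elements must be described or referenced
    if meta.get("format") and not meta.get("encoding_description"):
        problems.append("format given without a description of the encoding")
    # Element 3: information about the method/program used
    if not meta.get("method"):
        problems.append("missing information about the method/program used")
    return problems
```

For example, a record that names a BED file but gives no encoding description and no method would come back with two problems, while a record with all four keys filled in returns an empty list.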

Level 0-4 Data correspond to various stages of processing, as illustrated in Table 2.

          "chip" Data              "seq" Data
Level 0   image                    reads
Level 1   extracted features       anchor reads
Level 2   normalized intensities   read density map
Level 3   epigenomic state of an individual sample
Level 4   integrated analysis involving multiple samples

The issues of data flow between REMCs, EDACC, and GEO/SRA at the NCBI are addressed in the Data flow section below.

Level 0-1 Data is stored in Genboree transiently and is archived at GEO/SRA.
Level 2-4 Data is stored in Genboree permanently.
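The Level table and the two storage statements above can be summarized in a small lookup. This is only an illustrative Python sketch; the names `Level`, `DESCRIPTIONS`, and `storage_policy` are hypothetical and do not correspond to any Genboree API.

```python
from enum import IntEnum


class Level(IntEnum):
    L0 = 0
    L1 = 1
    L2 = 2
    L3 = 3
    L4 = 4


# (description for "chip" Data, description for "seq" Data), per Table 2
DESCRIPTIONS = {
    Level.L0: ("image", "reads"),
    Level.L1: ("extracted features", "anchor reads"),
    Level.L2: ("normalized intensities", "read density map"),
    Level.L3: ("epigenomic state of an individual sample",) * 2,
    Level.L4: ("integrated analysis involving multiple samples",) * 2,
}

TRANSIENT = "stored in Genboree transiently, archived at GEO/SRA"
PERMANENT = "stored in Genboree permanently"


def storage_policy(level: int) -> str:
    """Map a data Level (0-4) to the storage statement given above."""
    lvl = Level(level)  # raises ValueError for anything outside 0-4
    return TRANSIENT if lvl <= Level.L1 else PERMANENT
```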

Level 1 "seq" data may be derived using the Pash program, which is specifically designed for high-volume anchoring.

Assay-specific algorithms are applied to derive Level 2 and 3 data from Level 1 data. In contrast to Level 2 data, which is dependent on particular methodology or technology, Level 3 assertions are to a significant degree meaningful outside of specific methodology or technology.

Level 2 Data is verifiable by performing technical or biological replicates using the same methodology at the same or a different Center. Confidence intervals and false discovery rates at Level 2 refer to replication of experiments on the same sample using the same method/technology.

Level 3 data can be validated using a different methodology that provides information about the same epigenomic state in the same biological sample. Confidence intervals and false discovery rates at Level 3 refer to assertions about epigenomic state, such as methylation or histone modification. Level 3 data includes results of integrative analysis of the results obtained by performing various genome-wide assays on the same sample.

Level 4 data is obtained by an integrative analysis of data from multiple Samples. One example of Level 4 data is the set of significant differences detected by comparing a genome-wide H3K4me3 signal between two different biological samples. Level 4 data may also be derived using publicly available annotation tracks/genomic databases. There is typically a wide variety of integrated analyses that can lead to Level 4 data.

Data Flow

All the data will eventually be stored in GEO/SRA archives. NCBI will annotate and accession all submissions and Human Epigenome Atlas releases. The data flow is illustrated in the following figure:

For each Run a Genboree Project will be created and used to track the status of the Run. Multiple Projects will be grouped in various ways into higher-level projects for the purpose of integrated analyses producing Level 4 data and other results.

In addition to performing the editorial process, EDACC may curate old submissions, for example by bringing the metadata up to a higher level of quality and by bringing in other data sets (such as ENCODE or 1000 Genomes) for the purpose of integrated analyses. The data enhancements will also generally be deposited to GEO/SRA as new versions.

Submissions from REMCs to EDACC
The EDACC uses a web interface for submissions similar to the one used by the ENCODE DCC. EDACC hosts a separate instance of the interface, which performs data file validation according to the current EDACC Recommendations upon submission. One advantage of reusing the ENCODE DCC web interface is that several of the REMCs have experience with it, and the validation code can be customized to fit the EDACC Recommendations.

Submissions from EDACC to NCBI
EDACC uploads the data to NCBI via the Aspera system. The data includes Levels 0-4. EDACC uses SRA-XML for metadata transfer. NCBI extracts the needed information, distributes it as needed between GEO and SRA, and provides accessions associated with each submission to EDACC. Level 3 or 4 data can be added as a later step to existing records as required.