Biological data

Biological data are data or measurements collected from biological sources. They are often stored or exchanged in digital form, commonly in files or databases. Examples include DNA base-pair sequences and population data used in ecology.

Data File Formats

Each file format was designed with specific needs and outputs in mind.

AB1 – In DNA sequencing, chromatogram files used by instruments from Applied Biosystems
ACE – A sequence assembly format
BAM – Binary (compressed) Alignment/Map format based on SAM – Sequence Alignment/Map format
BED – The Browser Extensible Data format is used for describing genes and other features of DNA sequences
CAF – Common Assembly Format for sequence assembly
EMBL – The flatfile format used by the EMBL to represent database records for nucleotide and peptide sequences from EMBL databases
FASTA – The FASTA file format, for sequence data. Sometimes also given as FNA or FAA (FASTA Nucleic Acid or FASTA Amino Acid).
FASTQ – The FASTQ file format, for sequence data with quality. Sometimes also given as QUAL.
GenBank – The flatfile format used by the NCBI to represent database records for nucleotide and peptide sequences from the GenBank and RefSeq databases
GFF – The General Feature Format is used for describing genes and other features of DNA, RNA and protein sequences
GTF – The Gene Transfer Format is used to hold information about gene structure.
NEXUS – The Nexus file encodes mixed information about genetic sequence data in a block structured format.
NWK – The Newick tree format represents graph-theoretical trees with edge lengths using parentheses and commas. It is commonly used to hold phylogenetic trees.
PDB – Structures of biomolecules deposited in the Protein Data Bank. Also used for exchanging protein and nucleic acid structures.
PHD – Phred output, from the basecalling software Phred
SAM – Sequence Alignment/Map format, a text format for sequence reads aligned to a reference; used, for example, to release alignment results of the 1000 Genomes Project.
SCF – Staden chromatogram files used to store data from DNA sequencing
SBML – The Systems Biology Markup Language is used to store biochemical network computational models
SFF – Standard Flowgram Format
Stockholm – The Stockholm format for representing multiple sequence alignments
Swiss-Prot – The flatfile format used to represent database records for protein sequences from the Swiss-Prot database
VCF – Variant Call Format, a text format developed for the 1000 Genomes Project that lists sequence variants and their annotations.
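
As a rough illustration of how two of the simpler text formats listed above are laid out, the following Python sketch reads FASTA and FASTQ records using only the standard library. The function names read_fasta and read_fastq and the sample records are hypothetical. The quality offset of 33 and the strict four-line FASTQ layout are assumptions, not something the descriptions above specify. Real files are usually better handled with an established parsing library.

```python
from io import StringIO


def read_fasta(handle):
    """Yield (header, sequence) pairs from FASTA-formatted text."""
    header, chunks = None, []
    for line in handle:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            # A new header line closes the previous record, if any.
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)  # sequence may span several lines
    if header is not None:
        yield header, "".join(chunks)


def read_fastq(handle, offset=33):
    """Yield (name, sequence, qualities) from four-line FASTQ records.

    Assumes each record occupies exactly four lines and that quality
    characters are encoded with the given ASCII offset (33 is assumed).
    """
    while True:
        name = handle.readline().strip()
        if not name:
            break  # end of input (or a blank line) stops the reader
        seq = handle.readline().strip()
        handle.readline()  # '+' separator line, ignored
        qual = handle.readline().strip()
        if not name.startswith("@") or len(seq) != len(qual):
            raise ValueError(f"malformed FASTQ record: {name!r}")
        yield name[1:], seq, [ord(c) - offset for c in qual]


# Made-up example data, for illustration only.
fasta_text = ">seq1 example\nACGT\nTTGA\n>seq2\nGGCC\n"
fastq_text = "@read1\nACGT\n+\nII5!\n"

fasta_records = list(read_fasta(StringIO(fasta_text)))
fastq_records = list(read_fastq(StringIO(fastq_text)))
print(fasta_records)
print(fastq_records)
```

Both readers are generators, so a large file can be processed one record at a time by passing an open file object in place of the StringIO wrapper.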

Biological Data Sharing

Genomics data sharing
TransPLANT data
