Accepted sequencing file formats

Raw data files

Please provide individual unprocessed raw data files for each sample, in FASTQ or BAM format, and prepare your files according to ENA specifications. This is a rapidly developing field, so please check the specifications every time you submit a new experiment. Data files that do not satisfy ENA's requirements will not be accepted.

FASTQ specifications

Each file must be compressed by gzip or bzip2.
Submit individual files per sample and lane (if applicable). Do not bundle multiple FASTQ files into one archive, and do not split a file into smaller chunks.
Multiplexed libraries should be demultiplexed into separate files.
Technical adapter sequences are not allowed. However, do not remove entire sequence reads or trim reads by quality score.
For paired-end experiments, the mate pairs must be split into two separate files (one file for the forward reads, one for the reverse reads). The two files must be named with the same root and end with extensions such as _1.fq.gz and _2.fq.gz. Examples of naming styles supported by the ENA:
- sampleA_R1.fq.gz / sampleA_R2.fq.gz
- sampleA_1.fq.gz / sampleA_2.fq.gz
- sampleA_F.fq.gz / sampleA_R.fq.gz
Check ENA specifications for additional information about the accepted FASTQ format.
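
Two of the rules above (compression and paired-end naming) can be checked locally before upload. The following is a minimal Python sketch, not an official Annotare or ENA tool. The function names compression_of and unpaired_files are hypothetical, and the filename pattern is an assumption that covers only the three naming styles listed above, with .fq or .fastq followed by .gz or .bz2. The ENA specifications remain the authoritative reference.

```python
import re

# Leading bytes that identify gzip and bzip2 files.
GZIP_MAGIC = b"\x1f\x8b"
BZIP2_MAGIC = b"BZh"

# Assumed pattern: <root>_<tag>.<fq|fastq>.<gz|bz2>, where tag is one of the
# mate labels from the naming styles listed above.
MATE_PATTERN = re.compile(
    r"^(?P<root>.+)_(?P<tag>R1|R2|1|2|F|R)(?P<ext>\.(?:fq|fastq)\.(?:gz|bz2))$"
)
PARTNER = {"R1": "R2", "R2": "R1", "1": "2", "2": "1", "F": "R", "R": "F"}


def compression_of(path):
    """Return 'gzip', 'bzip2' or None, judged from the file's first bytes."""
    with open(path, "rb") as handle:
        head = handle.read(3)
    if head.startswith(GZIP_MAGIC):
        return "gzip"
    if head.startswith(BZIP2_MAGIC):
        return "bzip2"
    return None


def unpaired_files(filenames):
    """Return the paired-end style names whose mate file is not in the list."""
    names = set(filenames)
    missing = []
    for name in sorted(names):
        match = MATE_PATTERN.match(name)
        if match is None:
            continue  # name does not follow a recognised paired-end style
        mate = f"{match['root']}_{PARTNER[match['tag']]}{match['ext']}"
        if mate not in names:
            missing.append(name)
    return missing
```

For example, unpaired_files(["sampleA_1.fq.gz", "sampleA_2.fq.gz", "sampleB_1.fq.gz"]) returns ["sampleB_1.fq.gz"], because no sampleB_2.fq.gz is present. Checking the leading bytes rather than the file extension catches files that were renamed without actually being compressed.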

BAM specifications

Each file must contain all reads from the sequencing machine, and all reads must be unaligned, because we expect the BAM file to be used to regenerate all the sequencing reads.
The Phred quality score for each base should be included in the file.
If you have data from paired-end sequencing libraries, include the data for both mate reads of each sequencing run in a single BAM file.
Check ENA specifications for additional information about the accepted BAM format.

To ensure your BAM files contain unaligned reads, you can run the following commands:

samtools view -c -F 4 bam_file (counts how many reads are aligned and should return 0)
samtools view -c -f 4 bam_file (counts how many reads are unaligned and should return at least 1)
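
When many BAM files need checking, the two counts above can be wrapped in a small script. This is a minimal Python sketch that assumes samtools is installed and on the PATH. The function names count_reads and is_unaligned_bam and the run parameter are hypothetical, introduced only for illustration. The script calls exactly the two commands shown above and applies the same pass criteria (0 aligned reads, at least 1 unaligned read).

```python
import subprocess


def count_reads(bam_path, flag_option, run=subprocess.run):
    """Run `samtools view -c <flag_option> 4 <bam_path>` and return the count.

    flag_option is "-F" (reads without the unmapped flag, i.e. aligned reads)
    or "-f" (reads with the unmapped flag, i.e. unaligned reads).
    """
    result = run(
        ["samtools", "view", "-c", flag_option, "4", str(bam_path)],
        capture_output=True,
        text=True,
        check=True,  # raise CalledProcessError if samtools reports a failure
    )
    return int(result.stdout.strip())


def is_unaligned_bam(bam_path, run=subprocess.run):
    """Return True if the file has no aligned reads and at least one unaligned read."""
    aligned = count_reads(bam_path, "-F", run)
    unaligned = count_reads(bam_path, "-f", run)
    return aligned == 0 and unaligned >= 1
```

The run parameter exists only so that the command runner can be replaced in tests. In normal use, call is_unaligned_bam("my_reads.bam") with a real file path.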

If your BAM files contain mapped reads, please either create unmapped BAM files or send us the original read files (e.g. fastq.gz files) as raw data files. BAM files containing mapped reads can be included in your submission as processed files, as long as they satisfy ENA's specifications and the reference genome used for alignment has been accessioned in the International Nucleotide Sequence Database Collaboration (INSDC, involving DDBJ, ENA, and GenBank).

Single cell raw data files

If you are submitting single-cell sequencing data, please check the single-cell raw data file requirements, as a few special rules apply to certain types of single-cell experiments.

Processed data files

We accept all commonly used processed sequencing data or analysis files. There is no need to compress these files, either individually or as a bundle. Upload them in Annotare and assign them to your samples in the same way as you would for raw files, choosing "Processed" or "Processed Matrix" as the file type.

Matrix files

Data analysis commonly produces data matrices, e.g. a table with FPKM values, raw count values or output from differential expression analysis, with genes in rows and samples in columns.
Please save any matrix files in tab-delimited text format (not Excel) and use the file extension .txt. Also make sure that the sample names in your matrix file match the sample names used in Annotare.
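
A quick local check can catch mismatched sample names before upload. The following minimal Python sketch is not part of Annotare. The function name check_matrix and the layout it expects are assumptions: the first row is a header, its first cell labels the gene column, and the remaining cells are sample names.

```python
import csv
import io


def check_matrix(text, annotare_sample_names):
    """Compare the header of a tab-delimited matrix with the expected sample names.

    Returns a dict listing samples found only in the matrix, samples found
    only in the expected list, and the line numbers of rows whose column
    count differs from the header.
    """
    rows = list(csv.reader(io.StringIO(text), delimiter="\t"))
    if not rows:
        raise ValueError("matrix is empty")
    header = rows[0]
    if len(header) < 2:
        # A single column usually means the file is not tab-delimited.
        raise ValueError("header has fewer than two tab-separated columns")
    samples = header[1:]
    expected = set(annotare_sample_names)
    ragged = [
        line_no
        for line_no, row in enumerate(rows[1:], start=2)
        if len(row) != len(header)
    ]
    return {
        "not_in_annotare": [s for s in samples if s not in expected],
        "not_in_matrix": sorted(expected - set(samples)),
        "ragged_lines": ragged,
    }
```

To use it on a file, pass the file contents, for example check_matrix(open("counts.txt", encoding="utf-8").read(), ["sample1", "sample2"]), where counts.txt and the sample names are placeholders. Three empty lists in the result mean the header matches the expected names and every row has the same number of columns as the header.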

Alignment files

We also accept BAM alignment files, BED/bigWig files, and any other commonly used alignment data formats.

Other accepted processed formats

As this is a rapidly evolving field, we also welcome other types of processed sequencing data. Ideally, the file formats should be standard in the field and non-proprietary.