Accepted processed microarray file formats

Important notes on file preparation:

  1. Do not compress your microarray data files.
  2. Use only alphanumeric characters [A-Z, a-z, 0-9], underscores [_] and dots [.] in file names, with no whitespace, brackets, or other punctuation or symbols.
  3. Save any spreadsheet/matrix file in tab-delimited text (*.txt) format, not Excel format (*.xls, *.xlsx). If you're unfamiliar with this format, please see the OpenOffice Calc or Microsoft Excel guide.
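
The naming rule in note 2 can be checked before upload. The following is a minimal sketch in Python; the function name is_valid_file_name is hypothetical and is not part of any submission tool.

```python
import re

# Characters allowed by note 2: letters, digits, underscores and dots only.
_ALLOWED = re.compile(r"[A-Za-z0-9_.]+")


def is_valid_file_name(name: str) -> bool:
    """Return True if the whole name consists only of permitted characters."""
    return _ALLOWED.fullmatch(name) is not None


print(is_valid_file_name("sample_1_normalised.txt"))    # True
print(is_valid_file_name("sample 1 (normalised).txt"))  # False: spaces and brackets
```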

 


What are "processed" files?

Processed files are generated from raw files by procedures such as background correction, normalisation, and further statistical analyses (e.g. calculating fold-changes and associated p-values). We accept either "native" processed files from microarray scanner software (e.g. ".chp" files from Affymetrix scanners, output files from GenomeStudio software for Illumina BeadChips), or two-dimensional spreadsheet files in tab-delimited text (.txt) format. In a spreadsheet file, the probes/probe sets/gene names are in rows, and data from one or more hybridisations are in columns. We accept processed files in the following scenarios:

  • one processed file per hybridisation, i.e. you have a series of processed files;
  • one spreadsheet ("matrix") file containing normalised data from all hybridisations;
  • several spreadsheet ("matrix") files containing normalised data from different stages of data processing, e.g. one file containing normalised probe intensities and another containing fold-change data summarised at the gene level.

 


What should a processed text file look like?

In the two-dimensional table, you should have probes/genes in rows and samples/data in columns:

  • Probes/genes in rows: Where possible, use official probe names/identifiers as row headers, matching those in the array design file, so that each row of data can be mapped to the correct probe. Put the probe identifiers in the first column under the heading Reporter Identifier (for probes) or CompositeSequence Identifier (for a "composite" collation of probes, the most common example being Affymetrix probe sets). If probe identifiers are not available, try to use proper gene symbols or other identifiers (e.g. GenBank cDNA accession, UniProt protein accession).
  • Samples/data in columns: Where possible, label each data column with the same sample names as you declare on the Annotare forms, or use terms as they appear in your manuscript. This allows each column of data to be mapped to the correct sample(s).

A processed .txt file containing data from one single hybridisation should look like this:

Reporter Identifier    sample 1 normalised intensity    sample 1 background
probe_name_1           233.5                            69.1
probe_name_2           129.4                            27.6

And here is an example where gene names are used as row headings:

Human HGNC gene name    sample 1 normalised intensity    sample 1 background
CDKN2A                  233.5                            69.1
BRCA2                   129.4                            27.6
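
A per-hybridisation file like the ones above can be produced with any tool that writes tab-delimited text. As a rough sketch, assuming the data are already in memory, Python's standard csv module can do this; the helper name write_processed_table and the output file name mentioned in the comment are hypothetical.

```python
import csv
import io


def write_processed_table(handle, header, rows):
    """Write one header line and the data rows as tab-delimited text."""
    writer = csv.writer(handle, delimiter="\t", lineterminator="\n")
    writer.writerow(header)
    writer.writerows(rows)


# Example values copied from the table above.
header = ["Reporter Identifier", "sample 1 normalised intensity", "sample 1 background"]
rows = [
    ["probe_name_1", 233.5, 69.1],
    ["probe_name_2", 129.4, 27.6],
]

# An in-memory buffer stands in for a real file here; for a real file one
# would use something like open("sample_1_processed.txt", "w", newline="").
buffer = io.StringIO()
write_processed_table(buffer, header, rows)
print(buffer.getvalue())
```

Column headings may contain spaces, as in the example; only the tab character separates one column from the next.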

 

Processed "matrices" summarising data from multiple hybridisations should look like the following examples. As with per-hybridisation processed files, if probe identifiers are not available, try to use proper gene symbols or other identifiers (e.g. GenBank cDNA accession, UniProt protein accession).

Matrix of normalised values per sample:

Reporter Identifier    sample 1 normalised    sample 2 normalised    sample 3 normalised    sample 4 normalised
probe_name_1           26.9                   44.3                   62.3                   58.5
probe_name_2           22.9                   43.7                   58.2                   67.4

 

GenBank accession    sample 1 normalised    sample 2 normalised    sample 3 normalised    sample 4 normalised
BC000578             26.9                   44.3                   62.3                   58.5
M31642               22.9                   43.7                   58.2                   67.4
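
Before submitting a matrix, it can help to confirm that the file really is a rectangular tab-delimited table. The sketch below is only a sanity check under two assumptions of ours (every row has as many columns as the header, and the first column is never empty); it is not an official validator, and the function name check_matrix is hypothetical.

```python
import csv
import io


def check_matrix(handle):
    """Return a list of layout problems found in a tab-delimited matrix."""
    reader = csv.reader(handle, delimiter="\t")
    header = next(reader, None)
    if header is None:
        return ["file is empty"]
    problems = []
    # Data rows start on line 2, directly after the header line.
    for line_no, row in enumerate(reader, start=2):
        if len(row) != len(header):
            problems.append(
                f"line {line_no}: expected {len(header)} columns, found {len(row)}"
            )
        elif not row[0].strip():
            problems.append(f"line {line_no}: missing identifier in first column")
    return problems


good = (
    "Reporter Identifier\tsample 1 normalised\tsample 2 normalised\n"
    "probe_name_1\t26.9\t44.3\n"
    "probe_name_2\t22.9\t43.7\n"
)
bad = (
    "Reporter Identifier\tsample 1 normalised\tsample 2 normalised\n"
    "probe_name_1\t26.9\n"
)

print(check_matrix(io.StringIO(good)))
print(check_matrix(io.StringIO(bad)))
```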

 

Matrix of summarised values (one column of data maps to multiple samples):

Reporter Identifier    drug A treated average    drug B treated average    untreated control average
probe_name_1           44.6                      89.3                      290.15
probe_name_2           98.3                      36.7                      100.52
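
To illustrate how one summary column can map to several samples, here is a minimal sketch that averages per-sample values within groups. The grouping of samples into "treated" and "untreated control" is invented for illustration, the function name summarise is hypothetical, and a plain arithmetic mean is assumed; your own analysis may summarise the data differently.

```python
from statistics import mean


def summarise(matrix, groups):
    """Average each probe's values over the samples in each group.

    matrix: {probe identifier: {sample name: value}}
    groups: {summary column heading: [sample names]}
    """
    return {
        probe: {
            label: mean(values[sample] for sample in samples)
            for label, samples in groups.items()
        }
        for probe, values in matrix.items()
    }


# Per-sample values taken from the normalised matrix example.
matrix = {
    "probe_name_1": {"sample 1": 26.9, "sample 2": 44.3, "sample 3": 62.3, "sample 4": 58.5},
    "probe_name_2": {"sample 1": 22.9, "sample 2": 43.7, "sample 3": 58.2, "sample 4": 67.4},
}

# Hypothetical assignment of samples to experimental groups.
groups = {
    "treated average": ["sample 1", "sample 2"],
    "untreated control average": ["sample 3", "sample 4"],
}

summary = summarise(matrix, groups)
for probe, averages in summary.items():
    print(probe, averages)
```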

 


Processed matrix files in strict MAGE-TAB format (for advanced users)

For submitters who are familiar with the MAGE-TAB specification, we also accept matrix files in strict MAGE-TAB format. This format allows each data point in the file (in a given row and a given column) to be mapped, both programmatically and in a human-readable way, to a particular assay in the experiment and to a particular probe/probe set in the array design file. See the guide on the strict matrix format for more information.