Accepted raw microarray file formats

Important notes on file preparation:

  1. Do not compress your microarray data files.
  2. Submit raw or raw matrix file(s) for every sample/hybridisation of your experiment.
  3. Make sure the file names contain only alphanumeric characters [A-Z, a-z, 0-9], underscores [_] and dots [.], with no whitespace, brackets, or other punctuation or symbols.
  4. Any spreadsheet/matrix file should be saved in tab-delimited text (*.txt) format, not Excel format (*.xls, *.xlsx).
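
The naming rules above can be checked before upload with a short script. The following is a minimal sketch, not part of the submission system: the function name `check_file_name` and the lists of compressed-file and Excel suffixes are assumptions chosen for illustration, and the suffix lists are not exhaustive.

```python
import re

# Allowed characters per the notes above: A-Z, a-z, 0-9, underscore and dot.
_ALLOWED_NAME = re.compile(r"[A-Za-z0-9_.]+")

# Hypothetical suffix lists for illustration only (not an official list).
_COMPRESSED_SUFFIXES = (".zip", ".gz", ".bz2", ".tgz", ".7z")
_EXCEL_SUFFIXES = (".xls", ".xlsx")


def check_file_name(name):
    """Return a list of problems found in a data file name (empty if none)."""
    problems = []
    if not _ALLOWED_NAME.fullmatch(name):
        problems.append("name contains characters other than A-Z, a-z, 0-9, '_' and '.'")
    lower = name.lower()
    if lower.endswith(_COMPRESSED_SUFFIXES):
        problems.append("file appears to be compressed")
    if lower.endswith(_EXCEL_SUFFIXES):
        problems.append("Excel format; save as tab-delimited text instead")
    return problems


if __name__ == "__main__":
    for candidate in ["sample_01.CEL", "sample 01 (rep).txt", "matrix.xlsx"]:
        print(candidate, check_file_name(candidate))
```

The check only looks at the file name; it does not inspect the file content, so a compressed file with a misleading extension would not be caught.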

 


What are "raw" files?

  • Raw (recommended): The "native" files generated by the microarray scanner software. Make sure you do not change/edit the native files in any way, and submit one raw file per hybridisation assay. (One assay can consist of just one channel, as in Affymetrix experiments, or two channels, as in spotted arrays often with red and green channels from two different dyes/fluorophores.)

    Commercial microarray manufacturers have developed different raw data file formats over the years. If you are unsure about whether your raw files are in an accepted format, please check the list below.

  • Raw Matrix: A raw file in tab-delimited text (.txt) format that contains data from more than one hybridisation assay (probes in rows and assays in columns). The format requirements are strict (except for Illumina GenomeStudio data files). See matrix guidelines and examples.

 


Accepted formats by platform

The ArrayExpress submission system recognises the platform of a raw data file from the column headings in the file's header:


Affymetrix

Our submission system recognises .CEL and .EXP files using both the old GDAC formats and the newer GCOS/XDA formats.

Agilent

A file containing these headings is recognised as an Agilent format file:

Row Col PositionX PositionY

Illumina

Illumina raw data files are usually in either plain-text or binary format. Plain-text files are generated by the Illumina GenomeStudio software. The binary "IDAT" files (IDAT stands for "intensity data file") are generated by the scanner and can be parsed using R/Bioconductor packages such as illuminaio. IDAT is the preferred file format because it is binary and contains all the information required to analyse the data. In contrast, plain-text files can lack information such as which probes are the control probes; this is sometimes provided in a separate file, but not always, and the heterogeneity of raw data file formats makes systematic analysis of the data difficult. Another disadvantage of plain-text files is that they are susceptible to human-introduced errors, as it is easy to open the file in a text editor or spreadsheet program and accidentally change its content.

If you are submitting a GenomeStudio text file, below is an example of the expected column headings:

PROBE_ID Assay_Name_1.QT1 Assay_Name_1.QT2 Assay_Name_2.QT1 Assay_Name_2.QT2

PROBE_IDs always have the format "ILMN_123456". Assay Names are automatically generated by Annotare and can be found in the SDRF Preview tab while you are preparing your submission. QT stands for quantitation type, i.e. the type of measurement recorded in the column, e.g. AVGSignal. You can have as many quantitation types as required. Order the columns by sample names, then by quantitation types.
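
The column ordering described above (all quantitation types for the first assay, then all for the second, and so on) can be generated programmatically. This is a minimal sketch: the function name `genomestudio_header` is hypothetical, the `Assay.QT` naming simply follows the example heading line above, and the tab separator is assumed from the tab-delimited text requirement.

```python
def genomestudio_header(assay_names, quantitation_types):
    """Build the expected header line: PROBE_ID, then one column per
    (assay, quantitation type) pair, grouped by assay."""
    columns = ["PROBE_ID"]
    for assay in assay_names:
        for qt in quantitation_types:
            # "Assay.QT" naming follows the example heading line above.
            columns.append(f"{assay}.{qt}")
    # Tab-delimited, as assumed for text data files.
    return "\t".join(columns)


if __name__ == "__main__":
    header = genomestudio_header(
        ["Assay_Name_1", "Assay_Name_2"],  # placeholder names; real ones come from Annotare
        ["QT1", "QT2"],                    # placeholder quantitation types
    )
    print(header)
```

Comparing the generated line with the first line of a GenomeStudio export is a quick way to spot missing or misordered columns before submission.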

GenePix

GenePix format files (usually with file extension .gpr or .txt) are recognised using the following column headings:

Block Column Row X Y

NimbleScan

NimbleScan files (Feature, Probe and Pair) all contain the following headings:

PROBE_ID X Y

ScanAlyze

The following column headings are recognised as being from a ScanAlyze format file:

GRID COL ROW LEFT TOP RIGHT BOT

ScanArray/QuantArray

ScanArray Express files are recognised from the following headings:

Array Column Array Row Spot Column Spot Row X Y

while the older QuantArray format has these headings:

Array Column Array Row Column Row

ArrayVision

The following column headings are recognised as indicating an ArrayVision format file:

Primary Secondary

Newer "lg2" ArrayVision files are identified by the following column headings:

Spot labels

Spotfinder

Spotfinder files are recognised by the following column headings:

MC MR SC SR

BlueFuse

A file containing the following headings is recognised as a BlueFuse file:

COL ROW SUBGRIDCOL SUBGRIDROW

UCSF Spot

UCSF Spot files are recognised by the following column headings:

Arr-colx Arr-coly Spot-colx Spot-coly

Applied Biosystems

Files generated by Applied Biosystems software have the following headings:

Probe_ID Gene_ID

CodeLink Expression Analysis files are identified using the following:

Logical_row Logical_col Center_X Center_Y

ImaGene

ImaGene files are recognised using the following columns:

Meta Column Meta Row Column Row Field Gene ID

The ImaGene 3.0 format is also supported:

Meta_col Meta_row Sub_col Sub_row Name Selected

CSIRO Spot

CSIRO Spot files contain the following columns:

grid_c grid_r spot_c spot_r indexs

Generic (for spotted arrays, non-platform specific)

If your raw data file contains BlockColumn/BlockRow/Column/Row fields denoting probe location on a spotted array, you can use this generic format with four columns followed by columns of data:

MetaColumn MetaRow Column Row
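
To see which of the platform formats listed above a file's header would match, the headings can be compared against each platform's set of required columns. The following is a minimal sketch and not the submission system's actual algorithm: the function name `guess_platform`, the tab-delimited assumption, the stripping of double quotes around headings, and the 200-line search window are all assumptions. Only platforms whose headings are unambiguous single words in the lists above are included.

```python
from itertools import islice

# Required headings per platform, copied from the lists above.
SIGNATURES = {
    "Agilent": {"Row", "Col", "PositionX", "PositionY"},
    "GenePix": {"Block", "Column", "Row", "X", "Y"},
    "ScanAlyze": {"GRID", "COL", "ROW", "LEFT", "TOP", "RIGHT", "BOT"},
    "Spotfinder": {"MC", "MR", "SC", "SR"},
    "BlueFuse": {"COL", "ROW", "SUBGRIDCOL", "SUBGRIDROW"},
    "UCSF Spot": {"Arr-colx", "Arr-coly", "Spot-colx", "Spot-coly"},
    "CodeLink": {"Logical_row", "Logical_col", "Center_X", "Center_Y"},
    "CSIRO Spot": {"grid_c", "grid_r", "spot_c", "spot_r", "indexs"},
}


def guess_platform(lines, max_lines=200):
    """Return the first platform whose required headings all appear on one
    of the first `max_lines` lines, or None if no signature matches."""
    for line in islice(lines, max_lines):
        # Assumes tab-delimited text; surrounding quotes are stripped as a precaution.
        fields = {f.strip().strip('"') for f in line.rstrip("\r\n").split("\t")}
        for platform, required in SIGNATURES.items():
            if required <= fields:
                return platform
    return None


if __name__ == "__main__":
    sample = ["FEATURES\tFeatureNum\tRow\tCol\tPositionX\tPositionY\n"]
    print(guess_platform(sample))
```

Because `guess_platform` accepts any iterable of lines, an open text file can be passed directly, for example `guess_platform(open("my_array.txt"))` with a file name of your own.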