Accepted raw microarray file formats

Important notes on file preparation:

Do not compress your microarray data files.
Submit raw or raw matrix file(s) for every sample/hybridisation of your experiment.
Make sure file names consist only of alphanumeric characters [A-Z, a-z, 0-9], underscores [_] and dots [.], with no whitespace, brackets, or other punctuation or symbols.
Any spreadsheet/matrix file should be saved in tab-delimited text (*.txt) format, not in Excel format (*.xls, *.xlsx).
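
The naming and format rules above can be checked before upload. The following is a minimal sketch, not part of Annotare: the function name `check_file_name` and the sample file names are hypothetical, and the pattern is simply a direct reading of the character rule stated above.

```python
import re

# Direct reading of the rule above: letters, digits, underscores and dots only.
VALID_NAME = re.compile(r"[A-Za-z0-9_.]+")
EXCEL_EXTENSIONS = (".xls", ".xlsx")


def check_file_name(name: str) -> list[str]:
    """Return a list of problems found in a data file name (empty if none)."""
    problems = []
    if not VALID_NAME.fullmatch(name):
        problems.append("contains characters other than A-Z, a-z, 0-9, '_' and '.'")
    if name.lower().endswith(EXCEL_EXTENSIONS):
        problems.append("Excel format; save as tab-delimited text (.txt) instead")
    return problems


# Hypothetical file names, for illustration only.
for name in ["sample_01.CEL", "sample 01 (rep2).txt", "matrix.xlsx"]:
    print(name, "->", check_file_name(name) or "OK")
```

The check only looks at the name; it does not inspect the file content or detect compression.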

What are "raw" files?

Raw (recommended): The "native" files generated by the microarray scanner software. Make sure you do not change/edit the native files in any way, and submit one raw file per hybridisation assay. (One assay can consist of just one channel, as in Affymetrix experiments, or two channels, as in spotted arrays often with red and green channels from two different dyes/fluorophores.)

Commercial microarray manufacturers have developed different raw data file formats over the years. If you are unsure about whether your raw files are in an accepted format, please check the list below.

Raw Matrix: A raw file in tab-delimited text (.txt) format that contains data from more than one hybridisation assay (probes in rows and assays in columns). The format requirements are strict (except for Illumina GenomeStudio data files). See matrix guidelines and examples.
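
As a quick sanity check on a matrix file, one can confirm that every row has the same number of tab-separated fields. This is only a sketch of that single check; it does not implement the strict requirements in the matrix guidelines, and the function name `column_counts` and the example column names are hypothetical.

```python
def column_counts(text: str) -> set[int]:
    """Return the distinct numbers of tab-separated fields across non-empty lines."""
    return {len(line.split("\t")) for line in text.splitlines() if line.strip()}


# Hypothetical matrix: one probe column followed by one column per assay.
matrix = (
    "Probe\tAssay_1\tAssay_2\n"
    "probe_a\t101.5\t98.2\n"
    "probe_b\t250.0\t245.7\n"
)

counts = column_counts(matrix)
if len(counts) == 1:
    print("consistent: every row has", counts.pop(), "columns")
else:
    print("inconsistent column counts:", sorted(counts))
```

A file that yields more than one count was probably not saved as tab-delimited text, or has rows with missing values.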

Accepted formats by platform

The ArrayExpress submission system recognises the platform of a raw data file from the column headings in the file's header:

Common platforms:

Affymetrix
Agilent
Illumina
GenePix
NimbleScan

Others:

ScanAlyze
ScanArray
QuantArray
ArrayVision
Spotfinder
BlueFuse
UCSF Spot
Applied Biosystems
CodeLink
ImaGene
CSIRO Spot
Generic (for spotted arrays, non-platform specific)
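
To illustrate the idea of heading-based recognition, here is a rough sketch that matches a header line against some of the heading sets listed in the platform sections of this page. It is not the submission system's actual code: the names `SIGNATURES` and `guess_platform`, the exact case-sensitive matching, the assumption of a tab-delimited header line, and the rule of preferring the largest matching set are all assumptions made for this example.

```python
# Heading sets copied from the platform sections of this page (a subset only).
SIGNATURES = {
    "Agilent": {"Row", "Col", "PositionX", "PositionY"},
    "GenePix": {"Block", "Column", "Row"},
    "ScanAlyze": {"GRID", "COL", "ROW", "LEFT", "TOP", "RIGHT", "BOT"},
    "ScanArray Express": {"Array Column", "Array Row", "Spot Column", "Spot Row"},
    "QuantArray": {"Array Column", "Array Row", "Column", "Row"},
    "BlueFuse": {"COL", "ROW", "SUBGRIDCOL", "SUBGRIDROW"},
    "UCSF Spot": {"Arr-colx", "Arr-coly", "Spot-colx", "Spot-coly"},
    "CodeLink": {"Logical_row", "Logical_col", "Center_X", "Center_Y"},
}


def guess_platform(header_line: str):
    """Return the platform whose headings all occur in the header, or None."""
    headings = set(header_line.rstrip("\r\n").split("\t"))
    matches = [name for name, sig in SIGNATURES.items() if sig <= headings]
    # Assumption: if several platforms match, prefer the most specific one,
    # i.e. the one with the largest set of required headings.
    return max(matches, key=lambda name: len(SIGNATURES[name]), default=None)


# Hypothetical header lines; "Signal" stands in for any data column.
print(guess_platform("Array Column\tArray Row\tSpot Column\tSpot Row\tSignal"))
print(guess_platform("Block\tColumn\tRow\tName\tSignal"))
print(guess_platform("foo\tbar"))
```

Running a check like this on your own files can show quickly whether the header contains the headings expected for your platform.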

Affymetrix

Our submission system recognises .CEL and .EXP files in both the old GDAC formats and the newer GCOS/XDA formats.

Agilent

A file containing these headings is recognised as an Agilent format file:

Row

Col

PositionX

PositionY

Illumina

Illumina raw data files are usually in either plain-text or binary format. Plain-text files are generated by the Illumina GenomeStudio software. The binary "IDAT" files ("intensity data" files) are generated by the scanner and can be parsed using R/Bioconductor packages such as illuminaio.

IDAT is the preferred file format because it is a binary format that contains all the information required to analyse the data. In contrast, plain-text files can lack information such as which probes are the control probes; this is sometimes provided in a separate file, but not always, and the heterogeneity of raw data file formats makes systematic analysis of the data difficult. Plain-text files are also susceptible to human-introduced errors, as it is easy to open the file in a text editor or spreadsheet program and accidentally change its content.

If you are submitting a GenomeStudio text file, below is an example of the expected column headings:

PROBE_ID

Assay_Name_1.QT1

Assay_Name_1.QT2

Assay_Name_2.QT1

Assay_Name_2.QT2

PROBE_IDs are always in the format of "ILMN_123456". Assay Names are automatically generated by Annotare and can be found in the SDRF Preview tab while you are preparing your submission. QT stands for quantitation type, i.e. the type of measurement recorded in the column, e.g. AVGSignal. You can have as many quantitation types as required. Order the columns by sample names, then by quantitation types.
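
The column ordering described above (by assay, then by quantitation type) can be sketched as follows. This is an illustration only: the function name `genomestudio_header` is hypothetical, joining the headings with tabs assumes a tab-delimited text file, and the probe ID pattern assumes that "ILMN_" is followed by one or more digits.

```python
import re

# Assumption: "ILMN_" followed by one or more digits, as in "ILMN_123456".
PROBE_ID_PATTERN = re.compile(r"ILMN_\d+")


def genomestudio_header(assay_names, quantitation_types):
    """Build a header line: PROBE_ID, then each assay's quantitation types in turn."""
    columns = ["PROBE_ID"]
    for assay in assay_names:
        for qt in quantitation_types:
            columns.append(f"{assay}.{qt}")
    return "\t".join(columns)


def is_probe_id(value: str) -> bool:
    """Check that a value looks like an Illumina probe ID."""
    return PROBE_ID_PATTERN.fullmatch(value) is not None


# Placeholder assay names and quantitation types, matching the example above.
header = genomestudio_header(["Assay_Name_1", "Assay_Name_2"], ["QT1", "QT2"])
print(header.split("\t"))
print(is_probe_id("ILMN_123456"), is_probe_id("probe_1"))
```

In a real submission, the assay names would be the ones shown in the SDRF Preview tab, and QT1/QT2 would be replaced by the actual quantitation types in your file.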

GenePix

GenePix format files (usually with file extension .gpr or .txt) are recognised using the following column headings:

Block

Column

Row

NimbleScan

NimbleScan files (Feature, Probe and Pair) all contain the following headings:

PROBE_ID

ScanAlyze

The following column headings are recognised as being from a ScanAlyze format file:

GRID

COL

ROW

LEFT

TOP

RIGHT

BOT

ScanArray/QuantArray

ScanArray Express files are recognised from the following headings:

Array Column

Array Row

Spot Column

Spot Row

while the older QuantArray format has these headings:

Array Column

Array Row

Column

Row

ArrayVision

The following column headings are recognised as indicating an ArrayVision format file:

Primary

Secondary

Newer "lg2" ArrayVision files are identified by the following column headings:

Spot labels

Spotfinder

Spotfinder files are recognised by the following column headings:

BlueFuse

A file containing the following headings is recognised as a BlueFuse file:

COL

ROW

SUBGRIDCOL

SUBGRIDROW

UCSF Spot

UCSF Spot files are recognised by the following column headings:

Arr-colx

Arr-coly

Spot-colx

Spot-coly

Applied Biosystems

Files generated by Applied Biosystems software have the following headings:

Probe_ID

Gene_ID

CodeLink

CodeLink Expression Analysis files are identified using the following column headings:

Logical_row

Logical_col

Center_X

Center_Y

ImaGene

ImaGene files are recognised using the following columns:

Meta Column

Meta Row

Column

Row

Field

Gene ID

The ImaGene 3.0 format is also supported:

Meta_col

Meta_row

Sub_col

Sub_row

Name

Selected

CSIRO Spot

CSIRO Spot files contain the following columns:

grid_c

grid_r

spot_c

spot_r

indexs

Generic (for spotted arrays, non-platform specific)

If your raw data file contains BlockColumn/BlockRow/Column/Row fields denoting probe location on a spotted array, you can use this generic format with four columns followed by columns of data:

MetaColumn

MetaRow

Column

Row
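
A file can be checked against this generic layout with a few lines of code. The sketch below assumes that the four location columns must come first, in the order listed above, and that the header line is tab-delimited; the function name `is_generic_header` and the data column names are hypothetical.

```python
GENERIC_LOCATION_COLUMNS = ["MetaColumn", "MetaRow", "Column", "Row"]


def is_generic_header(header_line: str) -> bool:
    """True if the header starts with the four location columns and has data columns after them."""
    headings = header_line.rstrip("\r\n").split("\t")
    return headings[:4] == GENERIC_LOCATION_COLUMNS and len(headings) > 4


# Hypothetical headers; "Signal_Cy3" and "Signal_Cy5" stand in for data columns.
print(is_generic_header("MetaColumn\tMetaRow\tColumn\tRow\tSignal_Cy3\tSignal_Cy5"))
print(is_generic_header("MetaColumn\tMetaRow\tColumn\tRow"))
```

The second header is rejected because it has the location columns but no data columns after them.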