Important notes on file preparation:
Files should be in text (*.txt) format and not Excel format (*.xls, *.xlsx).
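As a rough illustration of the text-versus-Excel rule above, the following sketch flags files that look like Excel workbooks before upload. The function name `looks_like_excel` is hypothetical and this is not how the submission system itself checks files; it only relies on the file extension and on the well-known leading bytes of the two Excel container formats.

```python
from pathlib import PurePath

# Leading bytes of the two Excel container formats.
OLE2_MAGIC = bytes.fromhex("D0CF11E0A1B11AE1")  # legacy .xls (OLE2 compound file)
ZIP_MAGIC = b"PK\x03\x04"  # .xlsx is a ZIP archive (so is any other ZIP file)

EXCEL_EXTENSIONS = {".xls", ".xlsx"}


def looks_like_excel(filename: str, head: bytes) -> bool:
    """Return True if the name or the leading bytes suggest an Excel workbook.

    `head` is the first few bytes of the file, e.g. open(path, "rb").read(8).
    """
    if PurePath(filename).suffix.lower() in EXCEL_EXTENSIONS:
        return True
    # A workbook renamed to .txt still starts with one of these signatures.
    return head.startswith(OLE2_MAGIC) or head.startswith(ZIP_MAGIC)
```

Because the ZIP signature is shared by every ZIP archive, a positive result here means "not plain text", which is enough for this purpose.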
The ArrayExpress submission system recognises the platform of a raw data file from the column headings in the file's header.
Our submission system recognises .CEL and .EXP files using both the old GDAC formats and the newer GCOS/XDA formats.
A file containing these headings is recognised as an Agilent format file:
Row | Col | PositionX | PositionY |
Illumina raw data files are usually in either plain text or binary format. Plain text files are generated by the Illumina GenomeStudio software. The binary "IDAT" files (IDAT stands for "intensity data file") are generated by the scanner and can be parsed using R/Bioconductor packages such as illuminaio. IDAT is the preferred file format, as it is a binary format containing all the information required to analyse the data. In contrast, plain-text files can be missing information such as which probes are the control probes; this is sometimes provided in a separate file, but not always, and heterogeneity in raw data file formats makes systematic analysis of data difficult. Another disadvantage of plain-text files is that they are susceptible to human-introduced errors, as it is easy for someone to open the file in a text editor or spreadsheet program and accidentally change its content. If you're submitting a GenomeStudio text file, below is an example of the expected column headings:
PROBE_ID | Assay_Name_1.QT1 | Assay_Name_1.QT2 | Assay_Name_2.QT1 | Assay_Name_2.QT2 |
PROBE_IDs are always in the format "ILMN_123456". Assay Names are automatically generated by Annotare and can be found in the SDRF Preview tab while you are preparing your submission. QT stands for quantitation type, i.e. the type of measurement recorded in the column, e.g. AVGSignal.
You can have as many quantitation types as required. Order the columns by sample names, then by quantitation types.
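The GenomeStudio layout described above can be checked before submission with a small script. The sketch below is only an illustration: the function names are hypothetical, it assumes the assay name and quantitation type are separated by the last "." in each heading (as in "Assay_Name_1.QT1"), and it assumes a PROBE_ID is "ILMN_" followed by digits.

```python
import re

# Assumption: a probe identifier is "ILMN_" followed by one or more digits.
PROBE_ID_PATTERN = re.compile(r"ILMN_\d+")


def is_valid_probe_id(value):
    """Return True if the value has the form ILMN_<digits>."""
    return PROBE_ID_PATTERN.fullmatch(value) is not None


def check_genomestudio_header(headings):
    """Return a list of problems found in a GenomeStudio-style header row."""
    problems = []
    if not headings or headings[0] != "PROBE_ID":
        problems.append("first column must be PROBE_ID")

    qts_by_assay = {}  # assay name -> quantitation types in column order
    previous = None    # assay name of the preceding data column
    for heading in headings[1:]:
        assay, dot, qt = heading.rpartition(".")
        if not dot or not assay or not qt:
            problems.append(f"column {heading!r} is not in 'AssayName.QT' form")
            continue
        # Columns must be ordered by sample first, so an assay seen earlier
        # may only reappear directly after one of its own columns.
        if assay in qts_by_assay and previous != assay:
            problems.append(f"columns for assay {assay!r} are not adjacent")
        qts_by_assay.setdefault(assay, []).append(qt)
        previous = assay

    # Within each sample, quantitation types should follow the same order.
    if len({tuple(qts) for qts in qts_by_assay.values()}) > 1:
        problems.append("quantitation types differ between assays")
    return problems
```

An empty list means the header row follows the expected layout; for a tab-delimited file the headings could be obtained with `first_line.rstrip("\n").split("\t")`.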
GenePix format files (usually with file extension .gpr or .txt) are recognised using the following column headings:
Block | Column | Row | X | Y |
NimbleScan files (Feature, Probe and Pair) all contain the following headings:
PROBE_ID | X | Y |
The following column headings are recognised as being from a ScanAlyze format file:
GRID | COL | ROW | LEFT | TOP | RIGHT | BOT |
ScanArray Express files are recognised from the following headings:
Array Column | Array Row | Spot Column | Spot Row | X | Y |
The older QuantArray format has these headings:
Array Column | Array Row | Column | Row |
The following column headings are recognised as indicating an ArrayVision format file:
Primary | Secondary |
Newer "lg2" ArrayVision files are identified by the following column headings:
Spot labels |
Spotfinder files are recognised by the following column headings:
MC | MR | SC | SR |
A file containing the following headings is recognised as a BlueFuse file:
COL | ROW | SUBGRIDCOL | SUBGRIDROW |
UCSF Spot files are recognised by the following column headings:
Arr-colx | Arr-coly | Spot-colx | Spot-coly |
Files generated by Applied Biosystems software have the following headings:
Probe_ID | Gene_ID |
CodeLink Expression Analysis files are identified using the following:
Logical_row | Logical_col | Center_X | Center_Y |
ImaGene files are recognised using the following columns:
Meta Column | Meta Row | Column | Row | Field | Gene ID |
The ImaGene 3.0 format is also supported:
Meta_col | Meta_row | Sub_col | Sub_row | Name | Selected |
CSIRO Spot files contain the following columns:
grid_c | grid_r | spot_c | spot_r | indexs |
If your raw data file contains BlockColumn/BlockRow/Column/Row fields denoting probe location on a spotted array, you can use this generic format with four columns followed by columns of data:
MetaColumn | MetaRow | Column | Row |
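To see how heading-based recognition of this kind might work, here is a minimal sketch that matches the column headings of a file against the heading sets listed above. It is not the submission system's actual code: the names `SIGNATURES` and `detect_formats` are hypothetical, matching is assumed to be exact and case-sensitive, and formats without a fixed heading set (Illumina GenomeStudio text files, binary files such as .CEL or IDAT) are left out.

```python
# Heading sets copied from the lists above.
SIGNATURES = {
    "Agilent": {"Row", "Col", "PositionX", "PositionY"},
    "GenePix": {"Block", "Column", "Row", "X", "Y"},
    "NimbleScan": {"PROBE_ID", "X", "Y"},
    "ScanAlyze": {"GRID", "COL", "ROW", "LEFT", "TOP", "RIGHT", "BOT"},
    "ScanArray Express": {"Array Column", "Array Row", "Spot Column", "Spot Row", "X", "Y"},
    "QuantArray": {"Array Column", "Array Row", "Column", "Row"},
    "ArrayVision": {"Primary", "Secondary"},
    "ArrayVision lg2": {"Spot labels"},
    "Spotfinder": {"MC", "MR", "SC", "SR"},
    "BlueFuse": {"COL", "ROW", "SUBGRIDCOL", "SUBGRIDROW"},
    "UCSF Spot": {"Arr-colx", "Arr-coly", "Spot-colx", "Spot-coly"},
    "Applied Biosystems": {"Probe_ID", "Gene_ID"},
    "CodeLink": {"Logical_row", "Logical_col", "Center_X", "Center_Y"},
    "ImaGene": {"Meta Column", "Meta Row", "Column", "Row", "Field", "Gene ID"},
    "ImaGene 3.0": {"Meta_col", "Meta_row", "Sub_col", "Sub_row", "Name", "Selected"},
    "CSIRO Spot": {"grid_c", "grid_r", "spot_c", "spot_r", "indexs"},
    "Generic": {"MetaColumn", "MetaRow", "Column", "Row"},
}


def detect_formats(headings):
    """Return the formats whose headings are all present, most specific first."""
    present = {heading.strip() for heading in headings}
    matches = [name for name, required in SIGNATURES.items() if required <= present]
    # A larger heading set is a more specific match, so list it first.
    return sorted(matches, key=lambda name: len(SIGNATURES[name]), reverse=True)
```

For example, `detect_formats(["Block", "Column", "Row", "X", "Y", "Name"])` matches only the GenePix heading set, while a header containing none of the listed sets yields an empty list. Extra data columns do not affect the result, since only the presence of the required headings is tested.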