This guide explains the SVAP input and output data formats.
-
GFF
-
GFF stands for General Feature Format.
"GFF is a format for describing genes and other features associated with DNA,
RNA and Protein sequences." -- from the Sanger Center's webpage.
GFF has two versions: version 1 and version 2.
SVAP supports only version 2 of the GFF format. To learn more about GFF, please refer to the
Sanger Center GFF Specification.
The SVAP-defined tags include tid, ide, tis, org, chr, polya_signal,
polya_num, clone, mgc, s_site, and e_site.
NOTE: In a GFF file supported by SVAP, the
fields <seqname> <source> <feature> <start> <end>
<score> <strand> <frame> and [attribute] must be separated by
tabs, and tag-value pairs must be separated by
; (semicolon). Each tag and its value must be separated by a blank space.
Here are two examples for reference: example 1,
example 2.
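The separator rules in the note above can be sketched in a few lines of Python. This is only an illustration, not SVAP's own parser: the function name parse_gff_line and the sample line (including its tid and chr values) are invented for this sketch, and only the separators come from the note.

```python
# Minimal sketch of the separator rules described above (not SVAP's own code).
GFF_COLUMNS = ["seqname", "source", "feature", "start", "end",
               "score", "strand", "frame"]


def parse_gff_line(line):
    """Split one GFF version 2 line into its fixed fields and tag-value pairs."""
    parts = line.rstrip("\r\n").split("\t")      # fields are tab-separated
    if len(parts) < len(GFF_COLUMNS):
        raise ValueError("expected at least 8 tab-separated fields")
    record = dict(zip(GFF_COLUMNS, parts[:8]))
    record["start"] = int(record["start"])
    record["end"] = int(record["end"])

    attributes = {}
    if len(parts) > 8:
        for pair in parts[8].split(";"):         # pairs are ';'-separated
            pair = pair.strip()
            if not pair:
                continue
            tag, _, value = pair.partition(" ")  # tag and value: blank space
            attributes[tag] = value.strip()
    record["attributes"] = attributes
    return record


# Hypothetical line: the values are made up purely for illustration.
SAMPLE_GFF_LINE = "chrX\tSVAP\texon\t100\t200\t.\t-\t.\ttid T1; chr chrX"
sample_record = parse_gff_line(SAMPLE_GFF_LINE)
```

Keeping the attribute column as a separate dictionary makes it easy to look up SVAP-defined tags such as tid or chr by name.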
-
PSL
-
SVAP supports two kinds of PSL: one is the output format of the
UCSC blat
program, and the other is the data dumped from the UCSC
genome browser.
An example of each is available: blat results psl example,
genome browser table psl example.
-
FASTA
-
SVAP can read standard FASTA format data; the ID lines should not contain blank spaces.
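A quick way to check a FASTA file against this rule before submitting it is sketched below. The helper name fasta_ids_with_blanks is made up for this illustration and is not part of SVAP; it assumes that the "ID line" is the line starting with ">".

```python
def fasta_ids_with_blanks(lines):
    """Return the ID lines (without the leading '>') that contain whitespace."""
    offending = []
    for line in lines:
        if line.startswith(">"):
            header = line[1:].rstrip("\r\n")
            if any(ch.isspace() for ch in header):
                offending.append(header)
    return offending


# Hypothetical records, used only to exercise the check.
sample_fasta = [">seq1\n", "ACGT\n", ">seq 2\n", "GGCC\n"]
bad_ids = fasta_ids_with_blanks(sample_fasta)
```

An empty result means every ID line is free of blank spaces.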
-
ECS
-
ECS is a data format defined by SVAP that serves a similar function to GFF.
Each KEY-VALUE pair of ECS occupies a single line, and lines are separated by the Unix
line separator "\n" (for the SVAP web tools, it is
also fine if you are using Windows). Each KEY and its VALUE are separated
by a blank space. Entries of ECS are separated by //
on a line of its own. The standard ECS KEYs are ID, EXONS,
STRAND, EST, CHR, ANNO_STRAND,
SPLICESITES, CLONE, ORGAN, TISSUE,
and IDENTITY. The required fields are ID, EXONS,
STRAND, EST and CHR.
Here is an example for reference:
example.
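The layout described above (one KEY-VALUE pair per line, entries closed by a // line) can be read with a short Python sketch. The names parse_ecs and missing_required are hypothetical, and the sample values are invented: the guide does not spell out the syntax of each VALUE, so they are placeholders only.

```python
REQUIRED_KEYS = ("ID", "EXONS", "STRAND", "EST", "CHR")


def parse_ecs(text):
    """Split ECS text into a list of {KEY: VALUE} dictionaries."""
    entries = []
    current = {}
    for raw in text.splitlines():        # accepts "\n" and "\r\n" endings
        line = raw.strip()
        if not line:
            continue
        if line == "//":                 # a line holding only // ends an entry
            if current:
                entries.append(current)
            current = {}
            continue
        parts = line.split(None, 1)      # KEY, blank space, VALUE
        current[parts[0]] = parts[1].strip() if len(parts) > 1 else ""
    if current:                          # last entry without a closing //
        entries.append(current)
    return entries


def missing_required(entry):
    """List the required KEYs that an entry lacks."""
    return [key for key in REQUIRED_KEYS if key not in entry]


# Placeholder entries: the VALUE syntax shown here is invented, not SVAP's.
SAMPLE_ECS = (
    "ID demo_1\nEXONS 100-200,300-400\nSTRAND 1\nEST EST_A\nCHR chrX\n//\n"
    "ID demo_2\nSTRAND -1\n//\n"
)
sample_entries = parse_ecs(SAMPLE_ECS)
```

Running missing_required on each parsed entry gives a simple pre-submission check for the five required fields.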
-
ALT
-
ALT is the format of SVAP assembly results. Lines are separated by the Unix line separator
"\n". Each line of an ALT file describes one assembled isoform.
The fields of a line are separated by tabs. Let's look at an example
(this is a fake isoform, used only to explain the ALT data format):
alt_example
The 1st field isoform.m.0.18 is the isoform's ID.
The .m. indicates
that the isoform is transcribed from the minus DNA strand.
The 2nd field
15926994,15929394,15930986,15931639,15936755,15938740 holds
the start coordinates of the isoform's exons on the chromosome. The six numbers separated by
five commas indicate that the isoform has six exons.
Accordingly, the 3rd field
15929329,15929626,15931508,15934285,15936989,15938878
holds the end coordinates of the isoform's exons on the chromosome.
The 4th field
-1 is the isoform's strand: 1 means plus, -1 means minus.
The 5th field AATAAA is the reliable polyA signal of
the isoform.
The 6th field 44 is the polyA tail length
of the isoform.
And what about the 7th field GTAG,GTAG,,ATAC, ?
It may look strange, but it lists the standard splice donor-acceptor sites
of the isoform's introns. Donor-acceptor pairs are separated by commas. Since
this isoform has six exons, it has five introns. The 7th field
of this record therefore means that introns 1, 2 and 4 are standard spliced, with donor-acceptor pairs
GTAG, GTAG and ATAC, while introns 3 and 5 are non-standard spliced and are left empty.
The 8th field 0.773,0.75,0.0455,0.273,0.0682,0.0682
gives the EST density of each
exon.
The 9th field
BI482228,BI578494,NM_132878 lists the evidence sequences of this
isoform.
The 10th field embryo/2 is the expression pattern of
the isoform by organ. In this example, it means that 2 of the isoform's evidence sequences are
from embryo.
Similarly, the 11th field embryo/2 is the expression pattern of
the isoform by tissue.
The 12th field chrX is the chromosome the isoform belongs to.
NOTE: Some fields may be left blank for lack of information.
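The twelve fields described above can be unpacked with a short Python sketch. It is not SVAP code: the field names in ALT_FIELDS and the functions parse_alt_line and standard_introns are invented for this illustration, and EXAMPLE_ALT_LINE is simply the fake isoform's field values from the walkthrough joined by tabs. The organ and tissue fields are kept as raw text, because the guide only shows the single-value form embryo/2.

```python
# Hypothetical field names for the 12 tab-separated ALT columns.
ALT_FIELDS = [
    "isoform_id", "exon_starts", "exon_ends", "strand", "polya_signal",
    "polya_length", "splice_sites", "est_density", "evidence",
    "organ_expression", "tissue_expression", "chromosome",
]


def _split(value):
    """Split a comma-separated field; a blank field gives an empty list."""
    return value.split(",") if value else []


def parse_alt_line(line):
    """Turn one ALT line into a dictionary; blank fields are tolerated."""
    fields = line.rstrip("\r\n").split("\t")
    fields += [""] * (len(ALT_FIELDS) - len(fields))   # pad missing fields
    rec = dict(zip(ALT_FIELDS, fields))
    rec["exon_starts"] = [int(x) for x in _split(rec["exon_starts"])]
    rec["exon_ends"] = [int(x) for x in _split(rec["exon_ends"])]
    rec["strand"] = int(rec["strand"]) if rec["strand"] else None
    rec["polya_length"] = int(rec["polya_length"]) if rec["polya_length"] else None
    rec["est_density"] = [float(x) for x in _split(rec["est_density"])]
    rec["evidence"] = _split(rec["evidence"])
    # One splice-site slot per intron; an empty slot marks a non-standard intron.
    n_introns = max(len(rec["exon_starts"]) - 1, 0)
    sites = _split(rec["splice_sites"])
    sites += [""] * (n_introns - len(sites))
    rec["splice_sites"] = sites[:n_introns]
    return rec


def standard_introns(rec):
    """Return the 1-based numbers of introns with a standard splice pair."""
    return [i + 1 for i, pair in enumerate(rec["splice_sites"]) if pair]


# The fake isoform from the walkthrough, rebuilt from its field values.
EXAMPLE_ALT_LINE = "\t".join([
    "isoform.m.0.18",
    "15926994,15929394,15930986,15931639,15936755,15938740",
    "15929329,15929626,15931508,15934285,15936989,15938878",
    "-1", "AATAAA", "44", "GTAG,GTAG,,ATAC,",
    "0.773,0.75,0.0455,0.273,0.0682,0.0682",
    "BI482228,BI578494,NM_132878",
    "embryo/2", "embryo/2", "chrX",
])
example_record = parse_alt_line(EXAMPLE_ALT_LINE)
```

For the example record, standard_introns(example_record) yields introns 1, 2 and 4, matching the reading of the 7th field given above.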