Splice Variant Analysis Platform
This guide explains the SVAP input and output data formats.


term GFF
explain GFF is the abbreviation of General Feature Format. "GFF is a format for describing genes and other features associated with DNA, RNA and Protein sequences." -- from Sanger Center's webpage.
GFF has two versions, version 1 and version 2. SVAP supports only the version 2 GFF format. To learn more about GFF, please refer to the Sanger Center GFF Specification. SVAP-defined tags include tid, ide, tis, org, chr, polya_signal, polya_num, clone, mgc, s_site, and e_site.

NOTES: In a SVAP-supported GFF file, the fields <seqname> <source> <feature> <start> <end> <score> <strand> <frame> and [attribute] must be separated by tabs, and tag-value pairs must be separated by ; (semicolon). A tag and its value must be separated by a blank space. Here are two examples for reference: example 1, example 2.
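
The separator rules in the note above can be illustrated with a minimal Python sketch. It is not SVAP's own parser: the function name parse_gff_line and the sample line (including its tid and org values) are made up for illustration, and the sketch assumes only the tab, semicolon and blank-space rules stated in the note.

```python
GFF_COLUMNS = ["seqname", "source", "feature", "start", "end",
               "score", "strand", "frame"]


def parse_gff_line(line):
    """Split one tab-separated GFF version 2 line into a dict (hypothetical helper)."""
    cols = line.rstrip("\r\n").split("\t")
    if len(cols) < len(GFF_COLUMNS):
        raise ValueError("expected at least 8 tab-separated fields")
    record = dict(zip(GFF_COLUMNS, cols[:8]))
    record["start"] = int(record["start"])
    record["end"] = int(record["end"])

    # The optional 9th column holds tag-value pairs separated by semicolons;
    # within a pair, the tag and its value are separated by a blank space.
    attributes = {}
    if len(cols) > 8:
        for pair in cols[8].split(";"):
            parts = pair.split(None, 1)
            if not parts:
                continue  # skip empty pieces, e.g. after a trailing semicolon
            attributes[parts[0]] = parts[1].strip() if len(parts) > 1 else ""
    record["attributes"] = attributes
    return record


# Made-up line; the tag values are placeholders, not real SVAP data.
sample = "chrX\tSVAP\texon\t100\t200\t.\t-\t.\ttid T1; org embryo"
record = parse_gff_line(sample)
print(record["attributes"])  # {'tid': 'T1', 'org': 'embryo'}
```

A line whose columns are separated by spaces instead of tabs raises ValueError here, which is the kind of file the note warns against.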

term PSL
explain SVAP supports two kinds of PSL: the results format of the UCSC blat program, and data dumped from the UCSC genome browser. An example of each is available: blat results psl example, genome browser table psl example.
term FASTA
explain SVAP can read standard FASTA format data. The ID lines should not contain blank spaces.
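
As a quick pre-submission check for the ID-line rule, the following Python sketch lists ID lines that contain whitespace. The function name fasta_ids_with_spaces and the sample sequences are hypothetical, and the sketch assumes that an ID line is any line starting with ">".

```python
def fasta_ids_with_spaces(text):
    """Return (line_number, id_line) for every ID line that contains whitespace."""
    bad = []
    for number, line in enumerate(text.splitlines(), start=1):
        if line.startswith(">") and any(ch.isspace() for ch in line):
            bad.append((number, line))
    return bad


# Made-up FASTA text: the second ID line carries a description after a space.
sample = ">seq1\nACGT\n>seq2 embryo clone\nGGCC\n"
print(fasta_ids_with_spaces(sample))  # [(3, '>seq2 embryo clone')]
```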
term ECS
explain ECS is a data format defined by SVAP that serves a similar purpose to GFF. Each KEY-VALUE pair of ECS occupies a single line, and lines are separated by the Unix line separator "\n" (for the SVAP web tools, files created on Windows are also accepted). A KEY and its VALUE are separated by a blank space. Entries are separated by a line containing only //. The standard KEYs of ECS are ID, EXONS, STRAND, EST, CHR, ANNO_STRAND, SPLICESITES, CLONE, ORGAN, TISSUE, and IDENTITY. The required fields are ID, EXONS, STRAND, EST and CHR. Here is an example for reference: example.
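
The ECS layout described above can be read with a short Python sketch like the one below. It is an illustration, not SVAP's implementation: the names parse_ecs and missing_keys are hypothetical, and the sample values are placeholders because the syntax of each VALUE is not specified in this guide. Values are therefore kept as raw strings.

```python
REQUIRED_KEYS = ("ID", "EXONS", "STRAND", "EST", "CHR")


def parse_ecs(text):
    """Split ECS text into a list of {KEY: VALUE} dicts, one per entry."""
    entries = []
    current = {}
    # splitlines() accepts both Unix ("\n") and Windows ("\r\n") line endings.
    for raw in text.splitlines():
        line = raw.strip()
        if not line:
            continue
        if line == "//":  # a line holding only // closes the current entry
            if current:
                entries.append(current)
            current = {}
            continue
        key, _, value = line.partition(" ")  # KEY and VALUE split at the first space
        current[key] = value.strip()
    if current:  # tolerate a final entry without a closing //
        entries.append(current)
    return entries


def missing_keys(entry):
    """List the required KEYs that an entry lacks."""
    return [key for key in REQUIRED_KEYS if key not in entry]


# Placeholder values only; they do not reflect real ECS value syntax.
sample = "ID e1\nEXONS x\nSTRAND s\nEST t\nCHR chrX\n//\nID e2\nCHR chr1\n//\n"
for entry in parse_ecs(sample):
    print(entry["ID"], missing_keys(entry))
# e1 []
# e2 ['EXONS', 'STRAND', 'EST']
```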
term ALT
explain ALT is the format of SVAP assembly results. Lines are separated by the Unix line separator "\n". Each line of an ALT file describes one assembled isoform. The fields of a line are separated by tabs. Let's look at an example (this is a fake isoform, used only to explain the ALT data format): alt_example

The 1st field, isoform.m.0.18, is the isoform's ID. The .m. indicates that the isoform is transcribed from the minus DNA strand.

The 2nd field, 15926994,15929394,15930986,15931639,15936755,15938740, gives the start coordinates of the isoform's exons on the chromosome. The six numbers separated by five commas indicate that the isoform has six exons.

Correspondingly, the 3rd field, 15929329,15929626,15931508,15934285,15936989,15938878, gives the end coordinates of the isoform's exons on the chromosome.

The 4th field, -1, is the isoform's strand: 1 means plus, -1 means minus.

The 5th field, AATAAA, is the reliable polyA signal of the isoform.

The 6th field, 44, is the polyA tail length of the isoform.

And what about the 7th field, GTAG,GTAG,,ATAC, ? It looks strange at first. It lists the standard splice donor-acceptor sites of the isoform's introns, with one comma-separated entry per intron. Since this isoform has six exons, it has five introns. The 7th field of this record therefore means that introns 1, 2 and 4 are spliced at standard sites, with donor-acceptor pairs GTAG, GTAG and ATAC, while introns 3 and 5, whose entries are empty, are spliced at non-standard sites.

The 8th field, 0.773,0.75,0.0455,0.273,0.0682,0.0682, gives the EST density of each exon.

The 9th field, BI482228,BI578494,NM_132878, lists the evidence sequences of this isoform.

The 10th field, embryo/2, is the expression pattern of the isoform by organ. In this example, it means that 2 of the isoform's evidence sequences are from embryo.

Similarly, the 11th field, embryo/2, is the expression pattern of the isoform by tissue.

The 12th field, chrX, is the chromosome the isoform belongs to.

NOTES: Some fields may be left blank for lack of information.
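
To tie the twelve fields together, here is a minimal Python sketch that parses one ALT line. It is an illustration under stated assumptions, not SVAP's own code: the dictionary key names, parse_alt_line and standard_introns are hypothetical, and the sample line is rebuilt from the field values quoted in the walkthrough above rather than copied from a real file. Blank fields are kept as empty values, in line with the note above.

```python
ALT_FIELDS = [
    "id", "exon_starts", "exon_ends", "strand", "polya_signal",
    "polya_length", "splice_sites", "est_density", "evidence",
    "organ", "tissue", "chromosome",
]


def parse_alt_line(line):
    """Parse one tab-separated ALT line into a dict; blank fields stay empty."""
    cols = line.rstrip("\r\n").split("\t")
    cols += [""] * (len(ALT_FIELDS) - len(cols))  # pad missing trailing fields
    rec = dict(zip(ALT_FIELDS, cols))
    rec["exon_starts"] = [int(x) for x in rec["exon_starts"].split(",") if x]
    rec["exon_ends"] = [int(x) for x in rec["exon_ends"].split(",") if x]
    rec["strand"] = int(rec["strand"]) if rec["strand"] else None
    rec["polya_length"] = int(rec["polya_length"]) if rec["polya_length"] else None
    # Keep empty entries: an empty string marks a non-standard intron.
    rec["splice_sites"] = rec["splice_sites"].split(",") if rec["splice_sites"] else []
    rec["est_density"] = [float(x) for x in rec["est_density"].split(",") if x]
    rec["evidence"] = [x for x in rec["evidence"].split(",") if x]
    return rec


def standard_introns(rec):
    """Return the 1-based numbers of introns that list a donor-acceptor pair."""
    return [i for i, site in enumerate(rec["splice_sites"], start=1) if site]


# Sample line rebuilt from the field values described in this guide.
example = "\t".join([
    "isoform.m.0.18",
    "15926994,15929394,15930986,15931639,15936755,15938740",
    "15929329,15929626,15931508,15934285,15936989,15938878",
    "-1", "AATAAA", "44", "GTAG,GTAG,,ATAC,",
    "0.773,0.75,0.0455,0.273,0.0682,0.0682",
    "BI482228,BI578494,NM_132878",
    "embryo/2", "embryo/2", "chrX",
])
rec = parse_alt_line(example)
print(len(rec["exon_starts"]))  # 6
print(standard_introns(rec))    # [1, 2, 4]
```

The line is stripped of line endings only, not of tabs, so that blank trailing fields are not lost.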