title   Daphnia pulex Gene Set 2.0 beta3
source  dpulex_aug26_mixin19h, aka best3
date    2010.04.07
author  Don Gilbert, gilbertd at indiana edu

Best3/beta3 (dpulex_aug26_mixin19h):
  47712 mRNA in mixin19h set, no alt-transcripts yet
  43059 mRNA have evidence (homology, EST, paralogy or tile expression)
   4653 mRNA have no evidence, but protein >= 40aa minimum
   5230 mRNA are non-coding by weak cds criteria (protein <40aa or <30% coding)
  22128 have protein homology (e-value <= 1e-5)
  18509 have EST evidence
  11384 have differential expression
   6047 have tar-gene expression (outside JGI set)
   9687 have protein paralogy only (>=80% identity)

Notable stats for Genes2010 vs JGI_V11 genes:
  -- Genes2 recovers 82% of protein homology, versus only 42% for JGI
     - Genes2 has 44% best matches among arthropods to HUMAN genes,
       vs 21% for 2007 Daphnia genes
  -- Genes2 recovers 90% of EST, versus 75% for JGI
  -- Genes2 recovers 66% (5Mb) of 7.6 Mb tile expression not in JGI set (tar_genes)
  JGI_V11 is official gene set 1, 2007 for Daphnia pulex

Map views are at
  http://server7.wfleabase.org/cgi-bin/gbrowse/daphnia_pulex/
  Tracks
    Prediction/Genes 2010
  see also Evidence tracks
    Protein_Analysis/Arthropod genes
    EST assembly (Dpulex) and Dmagna EST 09
    DGC Tile expression

Preliminary best3/beta3 gene files are at
  http://server7.wfleabase.org/prerelease4/gene-predictions/
  daphnia_genes2010_beta3.gff.gz            : gene locations, annotated
  daphnia_genes2010_beta3.aa.gz             : protein sequence
  daphnia_genes2010_beta3.tr.gz             : transcript dna
  daphnia_genes2010_beta3.cds.gz            : protein coding seq dna, excluding ncRNA genes
  daphnia_genes2010_beta3.annotation.txt.gz : gene annotation table (from .gff)

#........... beta3.annotation.txt .......................
Table of annotations per gene, including the following fields.
  PredictID
  AAsize
  Flags
  Homolog
  Paralog
  Dappu1   : Daphnia pulex gene set 1 ID(s)
  DiffX    : Differential expression summary scores
  Location : Genome location
  Name     : consensus gene name from arthropod/uniprot descriptions
  UnipAC UnipDE UnipGO UnipKW : Uniprot reference annotation from 6 model organisms,
             where AC=accession, DE=description, GO=Gene ontology (goslim_generic),
             KW=Uniprot keywords
  Uniprot fields are inferred from annotations for Arthropod consensus gene ortholog (ARP2)

  PredictID  hxNCBI_GNO_338014
  AAsize     1298
  Flags      Homolog,Paralog,EST,Expressed
  Homolog    tribolium_TC002175/1203
  Paralog    Omcl463,9
  Dappu1     JGI_V11_231978,JGI_V11_39825
  DiffX      none
  Location   scaffold_1:198888-206910:+
  Name       atrial natriuretic peptide receptor/ARP2_G205
  UnipAC     Q07553
  UnipDE     EC=4.6.1.2,Flags: Precursor,Guanylate cyclase 32E
  UnipGO     GO:0000166/F:nucleotide binding,GO:0003824/F:catalytic activity,GO:0004672/F:protein kinase activity,...
  UnipKW     Disulfide bond,GTP-binding,Glycoprotein,Lyase,Membrane,Nucleotide-binding,Receptor,Signal,Transmembrane,..

#--------------------------------------------------------------------
Daphnia pulex gene model quality, all scaffolds

Evidence        Nevd    Statistic  jgiv11  gnomon  aug25   best2   best3
======== Exon Sensitivity ========
est_dpulex      64302   poverlap   0.877   0.893   0.973   0.973   0.973
est_dpulex      64302   poverbase  0.731   0.732   0.898   0.887   0.892
est_dpulex      64302   overlaps   56622   57626   62847   62771   62768
est_dmagna      89949   poverlap   0.816   0.848   0.958   0.945   0.944
est_dmagna      89949   poverbase  0.783   0.829   0.931   0.913   0.915
est_dmagna      89949   overlaps   73875   76632   86527   85302   85198
protein_arp2    181577  poverlap   0.659   0.761   0.864   0.914   0.906
protein_arp2    181577  poverbase  0.473   0.579   0.800   0.841   0.822
protein_arp2    181577  overlaps   120824  139364  158556  167812  166250
tar_genes       34850   poverlap   0.000   0.143   0.712   0.595   0.594
tar_genes       34850   poverbase  0.000   0.129   0.778   0.651   0.659
tar_genes       34850   overlaps   0       5146    25480   21320   21266
  (tar_genes = 7.6 Megabases of tile expression not in JGI genes)
======== Exon Specificity ========
all_evd_specif  145342  poverlap   0.614   0.630   0.616   0.573   0.567
all_evd_specif  145342  poverbase  0.566   0.543   0.375   0.408   0.411
all_evd_specif  145342  overlaps   90173   98410   107499  122782  119855

======== Gene model Accuracy ========
All scaffolds                                jgiv11  gnomon  aug25   best2   best3
protein_arp2        11126   found gene       9476    10459   10567   11033   10989
protein_arp2        11126   CDS match        0.610   0.712   0.818   0.873   0.861
homology-human              found gene       7036    7217    7188    7499    7481
                            ave bitscore     367     373     345     376     373
      % best match of 14 arthropods          --      21      --      --      44
homology-tribolium          found gene       7036    7140    7268    7564    7565
                            ave bitscore     409     420     389     427     424
-------------------
total coding bases                           30Mb    36Mb    41Mb    ~48Mb   48Mb
total genes count                            31K     37K     36K     ~45K    48K
---------------------------------------------------------------------------

Gene models tested are D. pulex JGI V11 (official release 1 from 2007),
D. pulex NCBI Gnomon (2007), and several augustus runs
(Aug25, Aug25r, Aug26u, Aug21, 2010).

"best3" is selected from same 5 sources as best2, keeping
  12593 AUG25, 12134 NCBI_GNO, 9102 AUG26u, 7764 AUG26re, 6467 JGI models.
  best3 is a bit less sensitive to homology than best2, a side effect
  of model corrections.

"best2" is selected from 5 sources, keeping
  12531 AUG25, 10392 NCBI_GNO, 9616 AUG26u, 8060 AUG26re, 5561 JGI models.

"best1" is selected from 3 predictors (Augustus25, JGI, Gnomon)
  to maximize evidence scores. This improves overall quality.
  The best1 sources are 25030 AUG25, 10688 JGI, 9760 NCBI_GNO.

Human gene homology
  A. Percent best match to Human proteins
     Daphnia Best3 : 44%, vs 15% for Ixodes, 10% for Tribolium, 2% for DrosMel
     Daphnia 2007  : 21%, vs 16% for Ixodes, 13% for Tribolium, 4% for DrosMel
  B. Alignment to Human proteins
     Daphnia Best3 : 191, vs 188 for Daphnia07, 187 for Tribolium, 175 for DrosMel
  C. Percent of Human proteins found (of 20276)
     Daphnia Best3 : 70.7%, vs 68.6% for Daphnia07, 67.9% for Tribolium, 66.3% for DrosMel

Differential expression in gene sets (gene counts for t-stat >= 2)

Genes with any DE on CDS, all treatments
DE     jgi     gno     aug25   best2   best3   Change (b3-jgi)
cad+   148     130     138     146     146     -2
cad-   84      56      82      77      76      -10
cha+   438     389     541     541     562     +130
cha-   567     530     554     554     579     +10
met+   696     574     1415    1363    1376    +650
met-   2223    2064    2167    2295    2363    +100
sexf   4196    4089    4193    4587    4662    +450
sexm   2894    2569    3498    3361    3454    +550
nul    23226   30273   26889   37071   38779

Genes with DE on CDS, single effects only
DE     jgi     gno     aug25   best2   best3   Change (b3-jgi)
cad+   44      43      46      48      48      +4
cad-   50      32      47      44      44      -6
cha+   171     157     236     242     255     +80
cha-   352     334     355     348     366     +10
sexf   4103    4013    4104    4495    4563    +450
sexm   2456    2177    3030    2892    2978    +500
...
mix    554     487     573     580     595     +40
nul    23210   30223   27357   37511   39211
  (mixed metal ignored here)

Evidence used for Daphnia Genes2 modeling
------------------------------------------
Evidence includes (1,2) EST assemblies for Daphnia pulex and D. magna,
(3) the complete protein sets from six closest arthropod genomes
(aphid, apis, crab, ixodes, pediculus, tribolium), (4) genome tiling
expression, (5) introns from RNA-seq and ESTs, (6) tandem duplicate
gene boundaries. Genes2 are selected from six predictors (4 Augustus runs,
JGI 1.1 combined set, and NCBI Gnomon) to maximize evidence scores.

Basic Recipe for Daphnia Genes2 Prediction
------------------------------------------
Given a genome assembly and a good set of gene evidence from EST
sequences, proteins of related species, and next generation data of
tiling and RNA-seq expression, one can model genes rather accurately
according to that evidence.

PASA is used for EST assembly and gene validation. BLAST is used to
locate related proteins (tblastn), and annotate predicted genes (blastp).
Exonerate refines protein gene mappings.
The Augustus predictor models genes using all evidence: ESTs, mapped
proteins, genome tiling and RNA-Seq expression.

In practice, several prediction runs with Augustus are used, with
different evidence sets and weightings, to fully model genes. One
prediction version weights transcript expression evidence strongly, but
lacks good gene span boundaries. Another weights protein homology with
full gene boundaries strongly, and includes tandem duplication evidence.
Other predictors, such as fgenesh, GeneID, SNAP and Gnomon, are valuable
additions.

Methods for combining predictions into one best set are still problematic.
Several of the available gene combining programs fail to use evidence of
gene expression, homology and tandem duplications in determining best
models. To overcome this, a new evidence-weighted combining program has
been developed to merge several prediction runs into a final best set.
Gene models are first annotated with all evidence they recover, then
combined into a best set using an evidence-weighting heuristic algorithm
that matches experience-based techniques used by expert gene annotators.
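
The two-step idea above (annotate each model with the evidence it
recovers, then pick a best set by weighted evidence) can be sketched
roughly as below. This is only an illustration, not the actual Genes2
combining program: the weights, the GeneModel fields, the greedy
overlap rule and the model IDs are all assumptions made for this
example.

```python
from dataclasses import dataclass

# Hypothetical evidence weights, for illustration only; the weights
# actually used for Genes2 are not given in this document.
WEIGHTS = {"homolog": 3.0, "est": 2.0, "expressed": 1.0, "paralog": 0.5}


@dataclass
class GeneModel:
    model_id: str
    scaffold: str
    start: int
    end: int
    strand: str
    evidence: dict  # evidence type -> fraction of that evidence recovered (0..1)

    def score(self) -> float:
        """Weighted sum of the evidence this model recovers."""
        return sum(WEIGHTS.get(kind, 0.0) * frac
                   for kind, frac in self.evidence.items())


def overlaps(a: GeneModel, b: GeneModel) -> bool:
    """True if two models share bases on the same scaffold and strand."""
    return (a.scaffold == b.scaffold and a.strand == b.strand
            and a.start <= b.end and b.start <= a.end)


def select_best(models):
    """Greedy stand-in for the combiner: take models in order of
    decreasing evidence score, skipping any that overlaps a kept model."""
    kept = []
    for m in sorted(models, key=lambda m: m.score(), reverse=True):
        if not any(overlaps(m, k) for k in kept):
            kept.append(m)
    return sorted(kept, key=lambda m: (m.scaffold, m.start))


# Made-up models from three predictors at two loci.
demo = [
    GeneModel("aug25_g1", "scaffold_1", 100, 900, "+",
              {"homolog": 0.9, "est": 0.8}),
    GeneModel("jgi_g1", "scaffold_1", 150, 700, "+",
              {"homolog": 0.6, "est": 0.7}),
    GeneModel("gno_g2", "scaffold_1", 1200, 2000, "+",
              {"est": 0.5, "expressed": 1.0}),
    GeneModel("aug26u_g3", "scaffold_1", 1500, 1800, "-",
              {"paralog": 1.0}),
]

if __name__ == "__main__":
    for m in select_best(demo):
        print(m.model_id, round(m.score(), 2))
```

On the demo list, aug25_g1 is kept over the overlapping jgi_g1 because
it recovers more homology and EST evidence, while aug26u_g3 is kept
despite its low score because it lies on the opposite strand. A real
combiner would also need to handle partial overlaps, tandem duplicate
boundaries and model corrections, which this sketch ignores.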