Data Sources
Our lab has made genomic datasets publicly available to the scientific community. These data are available from the following sources.
Whole Genome Sequencing
As part of our NIH-funded genome sequencing studies of autism (the REACH Project), we have made the variant calls and whole genome sequence data available through the National Database for Autism Research (NDAR).
Psychiatric Genomics Consortium
The PGC CNV resource is now publicly available through the PGC CNV browser. The rare CNV call set from the PGC schizophrenia CNV study can be obtained from the European Genome-Phenome Archive (Study accession #EGAS00001001960).
Genome Wide Estimates of Site Specific Mutability
(Michaelson et al., Cell 2012). Mutability index (MI) values are listed for each position in the hg18 genome. The values are bundled by chromosome, and each .Rdata file (an R workspace file) contains an Rle object called 's' that holds the MI values. The log10 MI has been multiplied by 100 so that the values can be stored as integers. After dividing by 100, a value of 0 indicates a predicted mutation rate approximately equal to the genome average, and values of 1 and -1 represent, respectively, a 10x increase and a 10x decrease in mutation rate relative to the genome average (since the values are on the log10 scale).
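The scaling described above can be illustrated with a short sketch. It is written in Python purely to show the arithmetic and is not part of the distributed data or of any lab code. The function names and the toy run values and run lengths are hypothetical. The list pair is only a stand-in for the run-length encoding used by the Rle object 's', which in practice would be read in R.

```python
from bisect import bisect_left
from itertools import accumulate


def fold_change(stored_value):
    """Convert a stored integer (100 * log10 MI) to a fold change
    in mutation rate relative to the genome average."""
    return 10.0 ** (stored_value / 100.0)


def value_at(position, run_values, run_lengths):
    """Return the stored value at a 1-based position in a
    run-length encoded vector (hypothetical stand-in for an Rle)."""
    run_ends = list(accumulate(run_lengths))  # 1-based end of each run
    if not run_ends or position < 1 or position > run_ends[-1]:
        raise IndexError("position outside encoded range")
    return run_values[bisect_left(run_ends, position)]


# Toy data, invented for illustration: three runs covering nine positions.
run_values = [0, 100, -100]
run_lengths = [3, 2, 4]

for pos in (1, 4, 9):
    stored = value_at(pos, run_values, run_lengths)
    print(pos, stored, fold_change(stored))
```

Under these assumptions, a stored value of 0 maps to a fold change of 1 (genome average), 100 maps to a 10x increase, and -100 maps to a 10x decrease.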
Software
Our lab has developed bioinformatic tools for the analysis of whole genome sequence data. These tools are available from the following sources.
Sebatlab Github
Most software applications that are being developed in our laboratory are available on our GitHub page.
Determining Parent of Origin
Code for determining the parent of origin of de novo SNVs.
forestSV
A statistical learning approach, based on Random Forests, that integrates prior knowledge about the characteristics of structural variants and leads to improved discovery in high-throughput sequencing data. The implementation of this technique, forestSV, offers high sensitivity and specificity coupled with the flexibility of a data-driven approach.
forestDNM
An R package built around a classifier that was trained to predict true de novo germline mutations (DNMs), using features derived from family genotype data contained in a VCF. The classifier was trained on 10 families with monozygotic twins, whose putative DNMs had undergone extensive experimental validation (the classifier was trained to predict validation status). In an independent test set of held-out data from the 10 families, sensitivity was > 95