DIST by dleelab

What's New?

[07/14/2015] Recently we developed a novel method/software for directly imputing summary statistics for unmeasured SNPs from mixed ethnicity cohorts (DISTMIX). DISTMIX extends DIST capabilities to multi-ethnic cohorts analysis while 1) maintaining imputation accuracy and speed, 2) not requiring genetic data and 3) being applicable to pedigree data. A pre-compiled executable for DISTMIX (v0.2.0) built under Linux is available now!

Introduction

"DIST is a software program for directly imputing the normally distributed summary statistics of unmeasured SNPs in a GWAS/meta-analysis without first imputing subject level genotypes. This direct imputation of the normally distributed statistics (Z-scores) is achieved by employing the conditional expectation formula for multivariate normal variates and using the correlation structure from a relevant reference population (e.g., 1000 Genomes data set). When compared to genotype imputation methods, DIST i) requires only a fraction of their computational resources, ii) has comparable imputation accuracy for independent subjects and iii) is readily applicable to the imputation of association statistics coming from large pedigree data. Thus, DIST is useful for a fast imputation of summary results for i) studies of unrelated subjects which a) do not provide subject level genotypes or b) have a large size and ii) family association studies. DIST is written in C++ and can be easily used under Linux, Windows, MacOS etc."

Citing DIST

DIST: Direct imputation of summary statistics for unmeasured SNPs, Donghyung Lee; T. Bernard Bigdeli; Brien Riley; Ayman Fanous; Silviu-Alin Bacanu; Bioinformatics (2013) 29 (22): 2925-2927. doi: 10.1093/bioinformatics/btt500

Download DIST

The current release (Version 1.0.0) of DIST is for a Linux user. The pre-compiled executables for other operating systems (e.g., Windows, MacOS) will be available soon. The latest source codes of DISTMIX are available upon request (dustorm_at_gmail_dot_com).

Direct download link	Version	Release Date
DIST for a Linux user	v1.0.0	11/05/2013

Download Reference Panels

Direct download link	NCBI build	Release Date
1000 Genomes Phase1 Release3 European	build 37 (hg19)	Nov. 23 2010
UK10K	build 37 (hg19)

DIST input file format

DIST takes as input a plain text file with rows and columns denoting SNPs and variables, respectively. The file contains six columns: 1) rsid, 2) chromosome number, 3) base pair position, 4) reference allele, 5) alternative allele and 6) normally distributed GWAS/meta-analysis summary statistic (Z-score). DIST does not require the input data to be sorted in ascending order by chromosome number and base pair position or rsid. The first line of the input file should be column names/headers. Data entries on each line should be separated by white space. Here is a sample DIST input file, with 5 SNPs:

rsid        chr  bp         a1  a2   z

rs1000109   9    117908733  C   T   -0.714464

rs10001109  4    44904653   C   T    0.721919

rs1000112   22   28273339   C   G   -0.583666

rs10001127  4    130547376  T   C    0.069329

rs1000113   5    150240076  T   C   -1.447288

Note: If the input data comes a different genome assembly (e.g. NCBI build 36 (hg18)) than NCBI build 37 (hg19), it needs to be converted to NCBI build 37 (hg19) using a software, called liftOver, from UCSC.

DIST output file format

The DIST output file has ten columns: 1) rsid, 2) chromosome number, 3) base pair position, 4) reference allele, 5) alternative allele, 6) reference allele frequency, 7) Z-score, 8) DIST imputation information, 9) p-value and 10) type. The first line of the output file is column names/headers. The tenth column, type, contains 0, 1 or 2. 0 means the corresponding SNP is unmeasured (thus, imputed) one but has been genotyped in the reference panel; 1 means the corresponding SNP is measured one and has been genotyped in the reference panel; 2 means the corresponding SNP is measured one but has not been genotyped in the reference panel. Here is a sample DIST output file, with 5 SNPs:

rsid        chr  bp     a1  a2  af1         z           info        pval      type

rs74970982  1    61442  G   A   0.951444    0.00678502  0.252236    0.994586  0

rs62637820  1    61920  A   G   0.0629921  -0.0502051   0.490445    0.959959  0

rs76735897  1    61987  G   A   0.440945   -0.0614578   0.528252    0.950995  0

rs77573425  1    61989  C   G   0.458005   -0.0552563   1           0.955934  1

rs28599927  1    62271  G   A   NA         -0.0314447   1           0.974915  2

Options

Option	Short Flag	Parameter	Default	Description
--version	-v	none	none	Prints version information.
--help	-h	none	none	Outputs a full description of all DIST options.
--config	-f	filename	none	The configuration filename. DIST reads user-defined settings in this file at startup.
--reference	-r	filename	none	The filename of the reference data set.
--output	-o	filename	out.dst	The output filename. This file has a header and is space-delimited. Rows correspond to SNPs and columns to variables.
--chromosome	-c	positive integer between 1 and 22	none	The chromosome number.
--num-pre	-n	positive integer	1	The number of measured SNPs to be contained in the prediction window.
--num-ext	-m	positive integer	50	The number of measured SNPs included in the extended window to the left and right of the prediction window.
--info-cutoff	-i	subunitary decimal	0.99	The information cutoff value for numerical integration. If information of an imputed SNP falls below it, DIST computes the imputed p-value by using the numerical integration method.
--absz-cutoff	-z	decimal	3.0	The absolute value for z-score cutoff for numerical integration. If the absolute value of the imputed Z-score rises above this value and the corresponding imputation information falls below a user-defined cutoff value for imputation information (0.99 by default), DIST computes the imputed p-value by using the numerical integration method.
--num-quant	-q	positive integer	1,000	The number of equally distant quantiles used for the numerical integration of the expected p-value for imputed statistics.

Note: Results from our simulation studies show that DIST achieves the highest accuracy when num-pre = 1 and num-ext = 50. Thus, it is highly recommended to use the default setting for both options.

Getting started with examples

The users are required to download a pre-compiled DIST executable (dist) and a sample input file (sample_input_chr22) from the Download DIST section above. A tgz-compressed directory with 1000 Genomes European or UK10K reference panel datasets also should be downloaded from Download Reference Panels section above and can be uncompressed using tar with the -zxvf options. Let’s assume that the reference panel datasets named as "chr#_foo.gz" are stored in a directory /path/to/reference/.

Scenario 1: We want to impute unmeasured SNPs on chromosome 22 in the sample input file and store all imputation results in a text file named as "sample_output_chr22".

>> ./dist -c 22 sample_input_chr22 -o sample_output_chr22 -r /path/to/reference/chr22_foo.gz

Scenario 2: Here, we want to impute unmeasured SNPs on chromosomes 11 in a GWAS/meta-analysis (assume the input file name is "allchr_diabetes_hg19".) and store the imputation results in a text file named as "chr11_diabetes_hg19_imputed". We want to set the number of measured SNPs in the prediction window (num-pre) to be 3 and the number of measured SNPs in each side region of the extended window (num-ext) to be 30.

>> ./dist -c 11 -n 3 -m 30 -o chr11_diabetes_hg19_imputed -r /path/to/reference/chr11_foo.gz allchr_diabetes_hg19

Scenario 3: Let's assume that we use an input file (wtccc_scz) stored in a directory (/path/to/input/) and set num-pre (n) to be 2, num-ext (m) to be 60, info-cutoff (i) to be 0.90, absz-cutoff (z) to be 3.5 and num-quant (q) to be 5000 and keep using these settings. To run DIST with these settings, we should specify these options on the command line as follows:

>> ./dist -n 2 -m 60 -i 0.90 -z 3.5 -q 5000 /path/to/input/wtccc_scz ...

Like the above example, entering frequently used options on the command line every time we run DIST is an irksome job. We can avoid this by specifying the options in a DIST configuration file, which is read by DIST at start-up. Here is a sample configuration file containing all the options:

num-pre = 2
num-ext = 60
info-cutoff = 0.90
absz-cutoff = 3.5
num-quant = 5000
input-file = /path/to/input/wtccc_scz

We named the above configuration file as "config_wtccc.dst" stored it in /path/to/config/. We can run DIST with the configuration file as follows:

>> ./dist -f /path/to/config/config_wtccc.dst ...

If we want to impute missing SNPs on chromosome 7 in the input data file, we just specify the chromosome number option (c) to be 7:

>> ./dist -f /path/to/config/config_wtccc.dst -c 7 -r /path/to/reference/chr7_foo.gz -o /path/to/output/chr7_foo_imputed

To impute unmeasured SNPs on all autosomes (chromosomes 1-22) in the above input file at once, we can use a shell script (e.g. bash):

#!/bin/bash
for i in {1..22}
do
./dist -f /path/to/config/config_wtccc.dst -c $i -r /path/to/reference/chr$i_foo.gz -o /path/to/output/chr$i_foo_imputed    
done

DIST

direct imputation of summary statistics for unmeasured SNPs