Statistics Seminar Fall 2012
  Department of Math Sciences, IUPUI
  Wednesday 12:15-1:30PM, LD 265

  (*Abstracts can be found below)  

08/22. Hanxiang Peng, Department of Math Sciences,  IUPUI
Organizing Meeting
Empirical Likelihood Tests of Multivariate Symmetries.*

08/31. Ping Ma, Department of Statistics, UIUC
Nonparametric Analysis of RNA-Seq.*

09/05. Arup Bose, Indian Statistical Institute, Kolkata, India.
Non-random and random Toeplitz matrices.*

09/12. Wei Zheng, Department of Math Sciences, IUPUI
Universally Optimal Crossover Designs under Subject Dropout.*

09/19. Samarjit Das, Indian Statistical Institute, Kolkata, India.
Intercept Poolability Test under Cross-sectional Dependence in Panel Data.*  

09/21. Lisa Jones, Department of Chemistry and Chemical Biology, IUPUI
Protein Footprinting Coupled with Mass Spectrometry for the Investigation of Protein-Protein Interactions.*

09/28. Seongho Kim, Department of Bioinformatics and Biostatistics, University of Louisville.
Constructing Metabolic Association Networks Using High-Dimensional Mass Spectrometry Data.* 







11/21. Thanksgiving, No seminar




Title: Empirical Likelihood Tests of Multivariate Symmetries. (Hanxiang Peng)
Abstract: In this talk, we will first introduce a class of tests of multivariate central symmetry. The empirical likelihood (EL) versions of these tests are then presented. An EL-based test of spherical symmetry is also given. Because symmetries are infinite dimensional, our tests are based on random vectors with growing dimension. We will discuss the asymptotic behavior of these tests under the null hypothesis and under local alternatives.

Title:  Nonparametric Analysis of RNA-Seq (Ping Ma)
Abstract: With the rapid development of second-generation sequencing technologies, RNA-Seq has become a popular tool for transcriptome analysis. It offers the chance to detect novel transcripts by obtaining tens of millions of short reads. After being mapped to the genome and/or to the reference transcripts, RNA-Seq data can be summarized by a tremendous number of short-read counts. The huge number of short-read counts enables researchers to quantify transcripts at ultra-high resolution. Recent work found that short-read counts have significant sequence bias, which makes simple transcript quantification methods questionable. Thus, more elaborate statistical models that can effectively remove the sequence bias of the short-read counts are highly desirable to make transcript quantification more accurate. In this talk, I will present some nonparametric statistical analysis for bias correction in RNA-Seq short-read counts. Since the sample size is in the tens of millions, fitting a regular nonparametric model is infeasible. I will present a novel scalable algorithm. Real RNA-Seq examples will also be presented to demonstrate the empirical performance of our method.

Title:  Non-random and random Toeplitz matrices (Arup Bose)
Abstract: It is known that the spectrum of non-random infinite-dimensional Toeplitz operators may be approached via the discrete Fourier transform of the entries of an appropriate approximating circulant matrix of increasing dimension. Random symmetric Toeplitz matrices, however, cannot be approximated by circulant matrices. Nevertheless, when the entries are i.i.d., their limiting spectrum turns out to be non-random and unbounded, with sub-Gaussian tails. It is also universal, depending not on the exact distribution of the entries but only on their variance. It is noteworthy that the corresponding result for non-symmetric matrices is not yet proven, although simulations do indicate it to be true. The sample autocovariance matrix in time series is a random Toeplitz matrix, but with dependent entries. Its limiting spectral distribution also exists but does not bear a one-to-one relation with the spectrum of the theoretical autocovariance matrix/operator. This leads to interesting statistical issues. We shall give a brief introduction to this historically famous and currently important matrix and operator, touching on some of the results mentioned above.
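As a small illustration of the circulant fact mentioned in the abstract (not material from the talk itself), the sketch below checks numerically that the eigenvalues of a circulant matrix are the discrete Fourier transform of its first column. The helper names `circulant` and `fourier_vector`, the matrix size, and the random entries are all assumptions chosen for the example.

```python
import numpy as np

def circulant(first_col):
    """Build the circulant matrix whose entry (i, j) is first_col[(i - j) mod n]."""
    c = np.asarray(first_col, dtype=float)
    n = len(c)
    idx = (np.arange(n)[:, None] - np.arange(n)[None, :]) % n
    return c[idx]

def fourier_vector(n, k):
    """k-th Fourier vector of length n; an eigenvector of every n x n circulant matrix."""
    return np.exp(2j * np.pi * np.arange(n) * k / n)

# Illustrative data only: a random first column of length 8.
rng = np.random.default_rng(0)
c = rng.standard_normal(8)
C = circulant(c)

# Eigenvalues read off from the DFT of the first column.
eigs_via_dft = np.fft.fft(c)

# Largest residual of C v_k - lambda_k v_k over all k; should be near machine precision.
max_residual = max(
    np.max(np.abs(C @ fourier_vector(len(c), k) - eigs_via_dft[k] * fourier_vector(len(c), k)))
    for k in range(len(c))
)
print(max_residual)
```

The check uses the eigenvector identity directly rather than sorting two sets of complex eigenvalues, which avoids spurious mismatches between conjugate pairs.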

Title: Universally Optimal Crossover Designs under Subject Dropout (Wei Zheng)
Abstract: Subject dropout is very common in practical applications of crossover designs. However, there is very limited design literature taking this into account. Optimality results have not yet been well established due to the complexity of the problem. This paper establishes feasible, as well as necessary and sufficient conditions for a crossover design to be universally optimal in approximate design theory in the presence of subject dropout. These conditions are essentially linear equations with respect to proportions of all possible treatment sequences being applied to subjects and hence they can be easily solved. A general algorithm is proposed to derive exact designs which are shown to be efficient and robust.

Title: Intercept Poolability Test under Cross-sectional Dependence in Panel Data (Samarjit Das)
Abstract: This talk develops a test for intercept homogeneity in fixed-effects one-way error component models, assuming slope homogeneity. The test is shown to be robust to both weak and strong cross-sectional dependence. The proposed test is shown to have a standard chi-square limiting distribution and to be free from nuisance parameters under the null hypothesis. Monte Carlo simulations also show that the proposed test delivers more accurate finite-sample size than existing tests for various combinations of N and T. The simulation study shows that the F-test is either oversized or undersized, depending on the pattern of cross-sectional dependence. The performance of the Hausman test (1978), on the other hand, is quite unstable across various DGPs; its empirical size varies from 0 percent to the nominal size, depending on the structure of the error variance-covariance matrix. The proposed test is also more powerful than the other two tests.

Title:  Protein Footprinting Coupled with Mass Spectrometry for the Investigation of Protein-Protein Interactions (Lisa M. Jones)
Abstract: In recent years protein footprinting has been widely used as a tool for analyzing protein structure by labeling solvent-accessible sites. Comparison of the labeling pattern between the protein in different states can provide information on the protein conformation. One footprinting method, fast photochemical oxidation of proteins (FPOP), uses hydroxyl radicals to oxidize solvent-exposed residues on a short time scale. FPOP utilizes an excimer laser to photolyze hydrogen peroxide to form the radicals. FPOP was used to investigate the protein-protein interactions in the capsid of the P22 bacteriophage, a double-stranded DNA virus that infects Salmonella. The assembly of its capsid is a well-studied process that is initiated via a nucleation-limited reaction. Scaffolding protein (300 copies) copolymerizes with 415 copies of the coat protein, 12 copies of portal protein, and 12-20 copies each of three ejection proteins to form the procapsid, a T=7 metastable capsid precursor. When DNA is packaged, the scaffolding protein is released and the expansion of the capsid head occurs to form the mature capsid. The protein-protein interactions within the procapsid and mature capsid have been identified. However, recent studies have shown that point mutations within the coat protein significantly alter the morphology of the capsid. In order to determine whether the protein-protein interactions within the capsid of the mutants are different from those of the wild-type capsid, FPOP was employed. The comparison of the oxidative labeling pattern between the wild-type and mutant capsids reveals differences in the protein-protein interactions in the assembly of the capsid. Additionally, the data suggest a secondary interaction that had not previously been observed in the P22 capsid. The results show the efficacy of using FPOP coupled with mass spectrometry to characterize protein-protein interactions in complex macromolecular structures.


Title: Constructing Metabolic Association Networks Using High-Dimensional Mass Spectrometry Data (Seongho Kim, joint work with Drs. Imhoi Koo and Xiang Zhang)  
Abstract: The goal of metabolic association networks is to identify the topology of a metabolic network for a better understanding of molecular mechanisms. An accurate metabolic association network enables investigation of the functional behavior of metabolites in a cell or tissue. Gaussian graphical model (GGM)-based methods have been widely used in genomics to infer biological networks. However, the performance of various GGM-based methods for the construction of metabolic association networks remains unknown in metabolomics. We therefore compared the performance of principal component regression (PCR), independent component regression (ICR), shrinkage covariance estimate (SCE), partial least squares regression (PLSR), and extrinsic similarity (ES) methods in constructing metabolic association networks by estimating partial correlation coefficient matrices when the number of variables is larger than the sample size. For the simulated data, the proposed methods PCR and ICR outperform the other methods when the network density is large, while PLSR and SCE perform better when the network density is small. For the experimental metabolomics data, PCR and ICR discover more significant edges and perform better than PLSR and SCE when the discovered edges are evaluated using KEGG pathways. These results suggest that the metabolic network is more complex than the genomic network and that PCR and ICR therefore have an advantage over PLSR and SCE in constructing metabolic association networks.
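To make the idea of estimating a partial correlation matrix with more variables than samples concrete, here is a minimal sketch that shrinks the sample covariance toward its diagonal before inverting it. It is an illustration only, not the estimators compared in the talk: the function name `shrinkage_partial_correlations`, the fixed shrinkage weight of 0.2, and the simulated data are assumptions made for the example.

```python
import numpy as np

def shrinkage_partial_correlations(X, shrinkage=0.2):
    """Partial correlations from a diagonally shrunk covariance matrix.

    X has shape (n_samples, n_variables). The shrinkage weight is a fixed,
    illustrative value (an assumption), not a data-driven estimate.
    """
    X = np.asarray(X, dtype=float)
    S = np.cov(X, rowvar=False)                 # sample covariance, singular when p > n
    target = np.diag(np.diag(S))                # diagonal shrinkage target
    S_shrunk = (1.0 - shrinkage) * S + shrinkage * target  # positive definite for shrinkage > 0
    precision = np.linalg.inv(S_shrunk)
    d = np.sqrt(np.diag(precision))
    pcor = -precision / np.outer(d, d)          # rho_ij = -omega_ij / sqrt(omega_ii * omega_jj)
    np.fill_diagonal(pcor, 1.0)
    return pcor

# Simulated example: 20 samples, 50 variables (more variables than samples).
rng = np.random.default_rng(0)
X = rng.standard_normal((20, 50))
pcor = shrinkage_partial_correlations(X)
print(pcor.shape)
```

In a network setting, an edge between two variables would then be declared when the corresponding partial correlation is judged significantly different from zero; the choice of that test is outside the scope of this sketch.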