A parametric framework for multidimensional linear regression

 

Dr Stanley Luck of Vector Analytics LLC has developed a novel parametric framework for multidimensional linear regression, following collaborative research and developments involving the identification of beneficial agronomic variation in maize.

Read more in Research Outreach.

Read the original article here: https://doi.org/10.1371/journal.pone.0262148

 

Image credit: Nico El Nino/Shutterstock

 

 

Transcript:

 

Hello and welcome to Research Pod! Thank you for listening and joining us today.  

 

In this episode, we will be discussing the work of Dr Stanley Luck, statistics consultant and founding member of Vector Analytics LLC in the US. Luck has developed an innovative parametric framework for establishing measurement error regression, even if data for both variables is subject to error. 

  

Dr Stanley Luck was motivated to investigate the applied algebraic foundations of data analysis after working on a collaborative research and development project involving the identification of beneficial agronomic variation in maize. This project applied genome-wide association studies and expression quantitative trait loci methods, or GWAS and eQTL for short, to identify genetic variants. After performing high-dimensional searches of the data, Luck observed that the results from classification and regression tree, or CART, analyses did not correspond well with the GWAS results.

 

Luck uncovered extensive research literature, and his applied algebraic investigation of the merits of various effect size measures and their associated statistical methodologies has already been recorded in two recent journal publications. In this third phase of his research into the foundations of data analysis, he investigated the issue of fitting a multidimensional line to data that are subject to stochastic error. Stochastic error is a random effect that may result in an outcome that is not expected, even though both the model and parameters are correct. This led to Luck developing a novel parametric framework for multidimensional linear regression. 

 

Ordinary Least Squares regression is a common statistical technique for modelling a two-dimensional linear relationship between an independent variable, x, and a dependent variable, y. It produces the straight line that minimises the sum of the squared differences between the observed and predicted values of y, hence the name least squares.
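For listeners following along in text, the least squares idea can be sketched in a few lines of Python. This is a minimal illustration with made-up data points, not code from Luck's work; the function name `ols_fit` and the sample values are assumptions chosen for the example.

```python
import numpy as np

def ols_fit(x, y):
    """Return (intercept, slope) of the line minimising the sum of squared residuals."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    x_mean, y_mean = x.mean(), y.mean()
    # Closed-form solution: slope = S_xy / S_xx
    slope = np.sum((x - x_mean) * (y - y_mean)) / np.sum((x - x_mean) ** 2)
    intercept = y_mean - slope * x_mean
    return intercept, slope

# Hypothetical data, for illustration only
x = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
y = np.array([1.1, 2.9, 5.2, 6.8, 9.0])

intercept, slope = ols_fit(x, y)
residuals = y - (intercept + slope * x)   # observed minus predicted
sum_sq = float(np.sum(residuals ** 2))    # the quantity that OLS minimises
```

Any other straight line through these points would give a larger value of `sum_sq`.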

 

Luck explains that while Ordinary Least Squares regression serves a definitive role in establishing fundamental concepts in data analysis, it requires the independent variable to be error-free and the variance of the residual, or error term, to be constant, or homoscedastic. If the assumption of constant error variance is violated, the Weighted Least Squares method can be used. This is an extension of Ordinary Least Squares regression in which non-negative weights are applied to the data points. The error-free condition, however, is still a requirement of the Moore-Penrose inverse algorithm that is used to estimate the parameters of the Weighted Least Squares regression model. Furthermore, if the independent variable is subject to error, the Ordinary Least Squares estimate of the slope is biased towards zero, and the Pearson correlation coefficient, which measures the strength of the linear relationship between two variables, is attenuated as well. This has spurred Luck’s longstanding research effort to develop a more general framework for measurement error regression in which data for both variables can be subject to error, known as an errors-in-variables regression model.
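The attenuation effect is easy to reproduce with simulated data. The sketch below is an illustration only, under assumed values: a true slope of 2, an independent variable with unit variance, and measurement error of unit variance added to it. With those assumptions, classical measurement error theory predicts that the fitted slope shrinks by the factor 1 / (1 + 1), so it should land near 1 rather than 2.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100_000
true_slope = 2.0  # assumed value for the simulation

x_true = rng.normal(0.0, 1.0, n)                          # error-free x, variance 1
y = 1.0 + true_slope * x_true + rng.normal(0.0, 0.5, n)   # y with its own noise
x_obs = x_true + rng.normal(0.0, 1.0, n)                  # x measured with error, variance 1

def ols_slope(x, y):
    """OLS slope of y on x: cov(x, y) / var(x)."""
    return np.cov(x, y)[0, 1] / np.var(x, ddof=1)

slope_clean = ols_slope(x_true, y)   # close to the true slope
slope_noisy = ols_slope(x_obs, y)    # biased towards zero

corr_clean = np.corrcoef(x_true, y)[0, 1]
corr_noisy = np.corrcoef(x_obs, y)[0, 1]   # attenuated as well

# Expected shrinkage factor: var(x_true) / (var(x_true) + var(error))
expected_noisy_slope = true_slope * 1.0 / (1.0 + 1.0)
```

Comparing `slope_clean` with `slope_noisy`, and `corr_clean` with `corr_noisy`, shows both the slope and the correlation shrinking once x is measured with error.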

 

 

Measurement error modelling is a sub-discipline of statistics with an extensive literature and a long history. Luck relates how the wide range of opinions about both the statistical framework and the methodology of measurement error suggests that the standard textbook treatment of linear regression may be incomplete. Moreover, the confusion surrounding the fundamental role of measurement error models and Weighted Least Squares optimisation in partitioning the effects of errors contributes to the problem of irreproducibility in data analysis.

 

 

Luck discusses his novel idea that statistical measures of linear dependence, including covariance, correlation, and regression slope, are all subject to the chain rule. The chain rule is used to differentiate a function of a function, or composite function of the form f(g(x)). He also notes that the standard linear regression framework is bivariate because it is based on the Cartesian representation y = f(x), in which the dependent variable y is explained by the independent variable x. Applying the chain rule to this linear relationship led him to discover a novel parametric framework for linear regression.
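As a rough illustration of the chain rule at work, consider an error-free line written in terms of a third variable t. This is a sketch under stated assumptions rather than Luck's derivation: the symbols a_x, b_x, a_y and b_y are introduced here for the example, with b_x nonzero and var(t) positive.

```latex
\begin{align}
x(t) &= a_x + b_x t, \qquad y(t) = a_y + b_y t, \\
\frac{dy}{dx} &= \frac{dy/dt}{dx/dt} = \frac{b_y}{b_x}, \\
\frac{\operatorname{cov}(y,t)}{\operatorname{cov}(x,t)}
  &= \frac{b_y \operatorname{var}(t)}{b_x \operatorname{var}(t)} = \frac{b_y}{b_x}.
\end{align}
```

In this idealised case the ratio of covariances with t behaves just like the ratio of derivatives, which gives a feel for why a slope can be treated as something the chain rule applies to.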

 

 

A curve can be defined using a Cartesian equation, an equation in terms of x and y only. Alternatively, a parametric equation can be used where both x and y are functions of a third variable, usually t. Luck demonstrates how employing a parametric representation, rather than a Cartesian equation, enabled him to obtain a more general framework for linear regression that also takes the experimental error in all variables into account. Using the chain rule, he transformed the ordinary linear regression method to the parametric representation (x(t), y(t)), with t corresponding to an element of the convex set formed from x and y. A convex set is a set of points in which the line segment joining any two points lies entirely within the set, so the set is connected.
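To make the parametric idea more concrete, here is a small numerical sketch. It is not Luck's published algorithm: the function `parametric_slope`, the weight `w`, and the simulated data are all assumptions for illustration. It takes t to be a convex combination of x and y, t = w*x + (1 - w)*y, and estimates the slope as cov(y, t) / cov(x, t).

```python
import numpy as np

def parametric_slope(x, y, w):
    """Estimate dy/dx as cov(y, t) / cov(x, t), with t = w*x + (1 - w)*y.

    The weight w in [0, 1] is a hypothetical stand-in for an error model.
    """
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    t = w * x + (1.0 - w) * y            # convex combination of x and y
    cov_xt = np.cov(x, t)[0, 1]
    cov_yt = np.cov(y, t)[0, 1]
    return cov_yt / cov_xt

# Simulated data: both x and y carry error around a line of slope 4/2 = 2
rng = np.random.default_rng(1)
s = rng.normal(size=5000)
x = 1.0 + 2.0 * s + rng.normal(scale=0.5, size=s.size)
y = 3.0 + 4.0 * s + rng.normal(scale=0.5, size=s.size)

cov_xy = np.cov(x, y)[0, 1]
slope_y_on_x = cov_xy / np.var(x, ddof=1)        # ordinary regression of y on x
slope_inverse = np.var(y, ddof=1) / cov_xy       # inverse of regressing x on y

slope_w1 = parametric_slope(x, y, 1.0)   # t = x: reduces to y-on-x regression
slope_w0 = parametric_slope(x, y, 0.0)   # t = y: reduces to inverse regression
slope_mid = parametric_slope(x, y, 0.5)  # lies between the two extremes
```

In this toy setting, the two familiar regressions appear as the end points w = 1 and w = 0, and intermediate weights give slopes in between, which is one way to see how a choice of t can encode assumptions about where the error sits.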

 

How does this all relate to multidimensional linear measurement error regression?

 

Taking his innovative parametric framework for two-dimensional linear regression, Luck extended this method for modelling bivariate point data and applied it to multidimensional vector data. Thus, he has created a new framework for fitting a multidimensional line for a set of linearly related variable vectors for applications of multidimensional linear regression.  

 

In this measurement error model, the relationship between variable vectors is represented by a weighted average, with the weights determined from an error model for the input data. Here, the weighted average corresponds to the minimum coefficient of variation for error, that is, the smallest ratio of the standard deviation of the error to the mean, and to the optimal signal-to-noise ratio. In the latter, the signal is the difference in response values and the noise is the natural variation within the system. This weighted average serves as the independent variable for parametric multidimensional linear regression. Luck adds that without any loss of generality, t can be regarded as a fixed variable due to the homogeneous coordinates property where all points on the line containing the slope vector are equivalent, meaning the slope is invariant to scaling of the slope vector.
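One familiar way to choose weights from an error model is inverse-variance weighting, sketched below. This is offered as an analogy, not as the weighting scheme in Luck's paper. The example assumes three hypothetical replicate measurements of the same quantity with independent errors and known standard deviations. Because the weights sum to one, the mean is unchanged, so minimising the error standard deviation also minimises the coefficient of variation.

```python
import numpy as np

def inverse_variance_weights(sigmas):
    """Weights proportional to 1 / sigma^2, normalised to sum to 1."""
    sigmas = np.asarray(sigmas, dtype=float)
    raw = 1.0 / sigmas ** 2
    return raw / raw.sum()

def weighted_error_sd(weights, sigmas):
    """SD of sum_i w_i * e_i for independent errors e_i with the given SDs."""
    weights = np.asarray(weights, dtype=float)
    sigmas = np.asarray(sigmas, dtype=float)
    return float(np.sqrt(np.sum((weights * sigmas) ** 2)))

# Hypothetical error standard deviations for three replicates
sigmas = np.array([1.0, 2.0, 4.0])

w_opt = inverse_variance_weights(sigmas)
w_equal = np.full(sigmas.size, 1.0 / sigmas.size)

sd_opt = weighted_error_sd(w_opt, sigmas)      # error SD with inverse-variance weights
sd_equal = weighted_error_sd(w_equal, sigmas)  # error SD with a plain average
```

Here the noisiest replicate receives the smallest weight, and `sd_opt` comes out smaller than `sd_equal`.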

 

In the parametric representation, the covariances form a parametric covariance vector. Moreover, the covariance is a measure of the linear dependence between the variable vectors and is therefore subject to the chain rule. Consequently, Luck was able to achieve a parametric generalisation of the pairwise Pearson correlation in the form of a parametric multi-way correlation tensor. A tensor is a generalisation of scalars and vectors, and this one measures the mutual alignment of a set of linearly related variable vectors.

 

Let’s think about the practical applications of the multidimensional parametric framework.

 

Among the many possible applications for the parametric framework for linear regression in the big data world is RNA sequencing, abbreviated to RNA-Seq. RNA-Seq is used to find the exact sequence of the building blocks that make up all RNA or ribonucleic acid molecules in a cell. It analyses the transcriptome, the collection of gene readouts in a cell, to learn more about which of the genes encoded in our DNA are turned on or off. The simplest way to quantify RNA-Seq gene expression is to count the number of reads that align with each gene. This process is known as a read count, with a read defined as the short single strand of RNA called an oligonucleotide that has been sequenced, so the count is the number of reads that overlap at a particular genomic position. 
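The counting step can be pictured with a toy example. The gene names, coordinates, and reads below are entirely hypothetical, and real pipelines use dedicated tools and handle details such as strand and multi-mapping reads. The sketch simply counts how many reads overlap each gene interval.

```python
def count_reads(genes, reads):
    """Count reads overlapping each gene; intervals are half-open [start, end)."""
    counts = {name: 0 for name in genes}
    for r_start, r_end in reads:
        for name, (g_start, g_end) in genes.items():
            # Two half-open intervals overlap if each starts before the other ends
            if r_start < g_end and g_start < r_end:
                counts[name] += 1
    return counts

# Hypothetical gene coordinates and aligned read positions
genes = {"geneA": (100, 200), "geneB": (300, 450)}
reads = [(90, 140), (150, 200), (190, 240), (310, 360), (500, 550)]

read_counts = count_reads(genes, reads)
```

With these made-up inputs, three reads overlap geneA, one overlaps geneB, and the last read falls outside both genes.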

 

In this research, Luck applies the multidimensional parametric framework algorithm to the conical dispersion analysis of error in publicly available RNA-Seq data, and shows how it estimates the measurement error regression parameters for replicate RNA-Seq data together with the quadratic error in RNA-Seq.

 

Luck remarks that the fact that statistical measures for linear dependence, such as regression slopes, covariance, and correlation coefficients, are subject to the chain rule has broad implications for multivariate statistics and data science. He concludes the following: ‘outreach that communicates the key findings from this work and helps to remove misconceptions about measurement error is important because of the application of linear regression methodology in many disciplines’. 

 

That’s all for this episode – thanks for listening, and stay subscribed to Research Pod for more of the latest science. See you again soon. 
