Univariate, Bivariate, Correlation and Causation
Univariate Data
In mathematics, a univariate object is an expression, equation, function or polynomial involving only one variable. Objects involving more than one variable are multivariate. In some cases the distinction between the univariate and multivariate cases is fundamental; for example, the fundamental theorem of algebra and Euclid's algorithm for polynomials are fundamental properties of univariate polynomials that cannot be generalized to multivariate polynomials.
In statistics, a univariate distribution characterizes one variable, although it can be applied in other ways as well. For example, univariate data are composed of a single scalar component. In time series analysis, the whole time series is the "variable": a univariate time series is the series of values over time of a single quantity. Correspondingly, a "multivariate time series" characterizes the changing values over time of several quantities. In some cases, the terminology is ambiguous, since the values within a univariate time series may be treated using certain types of multivariate statistical analyses and may be represented using multivariate distributions.
In addition to the question of scaling, a variable (criterion) in univariate statistics can be described by two important kinds of measures (also called key figures or parameters): location and variation.
- Measures of location (e.g. mode, median, arithmetic mean) describe where the data is centred.
- Measures of variation (e.g. range, interquartile range, standard deviation) describe how widely the data is scattered.
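The measures named above can be computed with the Python standard library; a minimal sketch, using a small hypothetical data set:

```python
# Location and variation measures for a hypothetical sample,
# using only the Python standard library.
from statistics import mean, median, mode, stdev, quantiles

data = [2, 4, 4, 5, 7, 9, 11]

# Measures of location: where the data is centred
print(mode(data))    # most frequent value: 4
print(median(data))  # middle value: 5
print(mean(data))    # arithmetic mean: 6

# Measures of variation: how widely the data is scattered
print(max(data) - min(data))      # range: 9
q1, q2, q3 = quantiles(data, n=4) # quartiles
print(q3 - q1)                    # interquartile range: 5.0
print(round(stdev(data), 2))      # sample standard deviation: 3.16
```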
Bivariate Data
Bivariate Data is data of two quantitative variables. This kind of data is analogous to what is known as univariate data, which is data of one quantitative variable (which could also have been deduced from their prefixes: "bi" means two and "uni" means one).
Bivariate data is used mainly in statistics, in studies that examine measures of central tendency (i.e. mean, median, mode, midrange), variability, and spread. In such studies, more than one variable is collected per individual; in a medical study, for example, both the height and the weight of each participant might be recorded, not just one of them. Bivariate data thus compares two quantitative variables for each individual, and can be shown graphically through histograms, scatterplots, or dotplots.
Example
Based on the data given below, do women generally marry at a younger age than men do?
In this example, we are given two variables: gender and age. The data is given to us below:
Ages at marriage for 10 men and 10 women
Men:
25, 26, 27, 29, 30, 31, 33, 36, 38, 40
Women:
19, 20, 21, 22, 23, 25, 26, 28, 29, 30
With this data, you can create a histogram for each group to see the results graphically and how they relate to one another. Then, find the mean of each group. The mean age at which the men married is:

(25 + 26 + 27 + 29 + 30 + 31 + 33 + 36 + 38 + 40) / 10 = 31.5
The mean age at which the women married is:

(19 + 20 + 21 + 22 + 23 + 25 + 26 + 28 + 29 + 30) / 10 = 24.3
Both distributions are slightly skewed to the right, meaning that most of the data values lie below the mean, with a longer tail of higher values to the right.
So, based on our data, women do typically marry at a younger age than men do.
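The two sample means above can be reproduced directly from the given data; the medians also confirm the slight right skew (each median falls just below its mean):

```python
# Reproducing the marriage-age summary statistics from the data above,
# using the Python standard library.
from statistics import mean, median

men   = [25, 26, 27, 29, 30, 31, 33, 36, 38, 40]
women = [19, 20, 21, 22, 23, 25, 26, 28, 29, 30]

print(mean(men))     # 31.5
print(mean(women))   # 24.3

# Each median lies below its mean, consistent with a slight right skew.
print(median(men))   # 30.5
print(median(women)) # 24.0
```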
Statistical and deterministic Relationships
A deterministic relationship implies that there is an exact mathematical relationship or dependence between variables. An example in physics is Newton's law of gravity, F = k * m1 * m2 / r^2, where the force F is proportional to a constant k and to the masses of the two objects, m1 and m2, and inversely proportional to the square of the distance r between them.
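The defining property of a deterministic relationship is that the same inputs always produce exactly the same output; a sketch (with hypothetical values, and the constant set to 1 for illustration):

```python
# A deterministic relationship: the output is an exact function of the
# inputs, with no random component. Values here are hypothetical.
def gravitational_force(k, m1, m2, r):
    """F = k * m1 * m2 / r**2"""
    return k * m1 * m2 / r**2

# The same inputs always yield the same force.
print(gravitational_force(1.0, 2.0, 3.0, 2.0))  # 1.5
print(gravitational_force(1.0, 2.0, 3.0, 2.0))  # 1.5 again, exactly
```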
A random or stochastic relationship allows for some variation in a relationship; this is where probability distributions will enter later on. The relationship may not be exact due to, for example:
- measurement errors
- reporting errors
- computing errors
- other influences.
One example is crop yield relative to rainfall. We may not be able to measure the amount of rain accurately (measurement error), we may round it off to one decimal place (reporting error), there may be a bug in the software that computes the sum (computing error), and there may be other factors, such as the quality of the fertiliser, the quality of the soil, and pollution (other influences).
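The contrast with the deterministic case can be sketched by adding a random disturbance term; the model and its coefficients below are hypothetical, chosen only to illustrate the idea:

```python
# A stochastic relationship: crop yield depends on rainfall plus a
# random disturbance standing in for measurement, reporting, and other
# errors. The coefficients and noise level are hypothetical.
import random

random.seed(0)  # fixed seed so the sketch is reproducible

def crop_yield(rainfall):
    # deterministic part + random error
    return 2.0 + 0.5 * rainfall + random.gauss(0, 0.3)

# The same rainfall now produces different observed yields.
print(crop_yield(10.0))
print(crop_yield(10.0))
```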
Regression versus Causation
Regression deals with dependence amongst variables within a model, but it cannot by itself imply causation. For example, we stated above that rainfall affects crop yield, and there is data to support this. However, this is a one-way relationship: crop yield cannot affect rainfall. A regression on its own therefore establishes association, not cause and effect; the causal direction must be argued from outside the statistics.
In short, we conclude that a statistical relationship does not imply causation.[1]
Regression versus Correlation
Correlations form a branch of analysis called correlation analysis, in which the degree of linear association between two variables is measured. If we calculate the correlation between crop yield and rainfall, we might obtain an estimate of, say, 0.69. This is reasonably high, suggesting a fairly strong linear association between them, but it does not by itself prove a causal relationship.
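The Pearson correlation coefficient can be computed by hand from its definition; the rainfall and yield figures below are made up for illustration (so the result will not reproduce the 0.69 mentioned above):

```python
# Pearson correlation computed from its definition:
# r = cov(x, y) / sqrt(var(x) * var(y)). Data is hypothetical.
from math import sqrt

rain = [5.0, 7.0, 8.0, 10.0, 12.0]
crop = [3.0, 4.5, 4.0, 6.0, 6.5]

n = len(rain)
mx = sum(rain) / n
my = sum(crop) / n
cov = sum((x - mx) * (y - my) for x, y in zip(rain, crop))
r = cov / sqrt(sum((x - mx) ** 2 for x in rain)
               * sum((y - my) ** 2 for y in crop))
print(round(r, 2))  # ≈ 0.96: a strong linear association in this toy data
```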
There is a distinction in how the two approaches regard the relationship between rainfall and crop yield. In correlation analysis, both variables are assumed to be random, with error in each; they are treated on an equal footing, and no distinction is made between them.
In regression analysis, by contrast, crop yield is the dependent variable and rainfall is the explanatory variable, according to our theory. The distinction is that the explanatory variable is assumed to have no random component: its values are treated as fixed in repeated sampling.
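The asymmetry can be seen in a least-squares fit, where rainfall (x) is taken as fixed and crop yield (y) as random; a minimal sketch using the same hypothetical figures as the correlation example:

```python
# Simple least-squares regression of yield on rainfall, using the
# closed-form formulas for slope and intercept. Data is hypothetical;
# x is treated as fixed, y as the random dependent variable.
x = [5.0, 7.0, 8.0, 10.0, 12.0]  # rainfall (explanatory, fixed)
y = [3.0, 4.5, 4.0, 6.0, 6.5]    # crop yield (dependent, random)

n = len(x)
mx, my = sum(x) / n, sum(y) / n
slope = (sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
         / sum((xi - mx) ** 2 for xi in x))
intercept = my - slope * mx
print(round(slope, 3), round(intercept, 3))  # slope ≈ 0.510, intercept ≈ 0.514
```

Note that regressing x on y would give a different line: unlike correlation, regression is not symmetric in the two variables.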
This will be important in {section on measurement}.
- ↑ Gujarati (2003, p. 23)