The mean is a parameter that measures the central location of the distribution of a random variable and is an important statistic that is widely reported in scientific literature. Although the arithmetic mean is the most commonly used statistic in describing the central location of the sample data, other variations of it, such as the truncated mean, the interquartile mean, and the geometric mean, may be better suited in a given circumstance. The characteristics of the data dictate which one of them should be used. Regardless of which mean is used, the sample mean remains a random variable. It varies with each sample that is taken from the same population. This entry discusses the use of the mean in probability and statistics, differentiates between the arithmetic mean and its variations, and examines how to determine which of them is appropriate for the data.
In probability, the mean is a parameter that measures the central location of the distribution of a random variable. For a real-valued random variable, the mean, or more appropriately the population mean, is the expected value of the random variable. That is to say, if one observes the random variable numerous times, the average of the observed values would converge in probability to the mean. For a discrete random variable Y with a probability function p(y), the expected value exists if the sum $\sum_y |y|\,p(y)$ is finite, and it is defined as
$$E(Y) = \sum_y y\,p(y), \tag{1}$$
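where y ranges over the values assigned by the random variable. For a continuous random variable with a probability density function f(y), the expected value exists if the integral $\int_{-\infty}^{\infty} |y|\,f(y)\,dy$ is finite, and it is defined as
$$E(Y) = \int_{-\infty}^{\infty} y\,f(y)\,dy. \tag{2}$$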
Comparing Equation 1 with Equation 2, one notices immediately that the f(y)dy in Equation 2 mirrors the p(y) in Equation 1, and the integration in Equation 2 is analogous to the summation in Equation 1.
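The parallel between the summation and the integration can be made concrete with a short Python sketch. It is only an illustration: the fair die, the exponential density with mean 2, the integration range, and the function names `discrete_mean` and `continuous_mean` are all assumptions chosen for the example, and the integral is approximated with a simple midpoint rule rather than evaluated exactly.

```python
import math

def discrete_mean(pmf):
    """Expected value of a discrete random variable given {value: probability}."""
    return sum(y * p for y, p in pmf.items())

def continuous_mean(pdf, lower, upper, steps=100_000):
    """Approximate the integral of y * f(y) over [lower, upper] (midpoint rule)."""
    width = (upper - lower) / steps
    total = 0.0
    for k in range(steps):
        y = lower + (k + 0.5) * width   # midpoint of the k-th slice
        total += y * pdf(y) * width     # y * f(y) dy, the analogue of y * p(y)
    return total

# Discrete case: a fair six-sided die (illustrative choice).
die = {face: 1 / 6 for face in range(1, 7)}
die_mean = discrete_mean(die)

# Continuous case: exponential density with mean beta = 2 (illustrative choice).
# The upper limit 60 truncates a tail whose contribution is negligible here.
beta = 2.0
exp_mean = continuous_mean(lambda y: math.exp(-y / beta) / beta, 0.0, 60.0)

print(round(die_mean, 6))   # 3.5
print(round(exp_mean, 4))   # 2.0
```

In both functions the same pattern appears: each possible value is weighted by its probability (or by f(y)dy) and the weighted values are accumulated.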
The above definitions help to understand conceptually the expected value, or the population mean. However, they are seldom used in research to derive the population mean. This is because in most circumstances, either the size of the population (discrete random variables) or the true probability density function (continuous random variables) is unknown, or the size of the population is so large that it becomes impractical to observe the entire population. The population mean is thus an unknown quantity.
In statistics, a sample is often taken to estimate the population mean. Results derived from data are thus called statistics (in contrast to what are called parameters in populations). If the distribution of a random variable is known, a probability model may be fitted to the sample data. The population mean is then estimated from the model parameters. For instance, if a sample can be fitted with a normal probability distribution model with parameters μ and σ, the population mean is simply estimated by the parameter μ (with σ² as the variance). If the sample can be fitted with a Gamma distribution with shape parameter α and scale parameter β, the population mean is estimated by the product of α and β (i.e., αβ), with αβ² as the variance. For an exponential random variable with parameter β, the population mean is simply β, with β² as the variance. For a chi-square (χ²) random variable with v degrees of freedom, the population mean is v, with 2v being the variance.
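The relationships between model parameters and the population mean can be checked by simulation. The sketch below is only an illustration under stated assumptions: the parameter values α = 3 and β = 2, the sample size, and the seed are arbitrary choices, and it relies on Python's `random.gammavariate(alpha, beta)` treating `beta` as a scale parameter and `random.expovariate` taking a rate (hence `1 / beta`).

```python
import random
import statistics

random.seed(12345)          # arbitrary seed, fixed only for reproducibility
n = 50_000                  # arbitrary sample size
alpha, beta = 3.0, 2.0      # illustrative parameter values

# Gamma(alpha, beta): theoretical mean alpha * beta, variance alpha * beta ** 2.
gamma_sample = [random.gammavariate(alpha, beta) for _ in range(n)]
gamma_mean = statistics.fmean(gamma_sample)
gamma_var = statistics.variance(gamma_sample)

# Exponential with mean beta: theoretical mean beta, variance beta ** 2.
exp_sample = [random.expovariate(1 / beta) for _ in range(n)]
exp_mean = statistics.fmean(exp_sample)
exp_var = statistics.variance(exp_sample)

print("gamma:", gamma_mean, "vs", alpha * beta, "|", gamma_var, "vs", alpha * beta ** 2)
print("exponential:", exp_mean, "vs", beta, "|", exp_var, "vs", beta ** 2)
```

The sample values should land close to, but not exactly on, the theoretical ones, which also illustrates that the sample mean is itself a random variable.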
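When the sample data are not fitted with a known probability model, the population mean is often inferred from the sample mean, a common practice in applied research. The most widely used sample mean for estimating the population mean is the arithmetic mean, which is calculated as the sum of the observed values of a random variable divided by the number of observations in the sample:
$$\bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i, \tag{3}$$
where $x_i$ is the ith observation and n is the number of observations in the sample. When the observations are grouped into intervals, the sample mean is calculated from the group frequencies $f_j$ and the group midpoints $x_j$ as
$$\bar{x} = \frac{1}{n}\sum_{j=1}^{m} f_j x_j, \tag{4}$$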
where m is the number of groups, and n is the total number of observations in the sample. In Equation 4, $f_j x_j$ is the total value for the jth group, with the group midpoint $x_j$ standing in for each of the $f_j$ observations in that group. A summation of the values of all groups is then the grand total of the sample, which approximates the value obtained through the summation defined in Equation 3. For instance, a sample of n = 20 observations is divided into three groups. The intervals for the three groups are 5 to 9 ($x_1 = 7$), 10 to 14 ($x_2 = 12$), and 15 to 19 ($x_3 = 17$), respectively. The corresponding frequencies are $f_1 = 6$, $f_2 = 5$, and $f_3 = 9$. The sample mean according to Equation 4 is then
$$\bar{x} = \frac{6(7) + 5(12) + 9(17)}{20} = \frac{255}{20} = 12.75.$$
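The grouped-data calculation in this example can be reproduced with a few lines of Python. This is a minimal sketch; the function name `grouped_mean` is hypothetical and chosen only for illustration.

```python
def grouped_mean(midpoints, frequencies):
    """Sample mean for grouped data: sum of f_j * x_j divided by n = sum of f_j."""
    if len(midpoints) != len(frequencies):
        raise ValueError("each group needs one midpoint and one frequency")
    n = sum(frequencies)
    return sum(f * x for f, x in zip(frequencies, midpoints)) / n

# Three groups with midpoints 7, 12, 17 and frequencies 6, 5, 9 (n = 20).
mean_value = grouped_mean([7, 12, 17], [6, 5, 9])
print(mean_value)  # 12.75
```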
Notice that in Equation 3, we summed up the values of all individual observations before arriving at the sample mean. The summation process is an arithmetic operation on the data. This requires that the data be quantitative, that is, they must be measured on either an interval or a ratio scale. For ordinal data, the arithmetic mean is not always the most appropriate measure of the central location; the median is, because it does not require the summation operation.
Notice further that in Equation 3, each observation is given an equal weight. Consequently, the arithmetic mean is highly susceptible to extreme values. Extremely low values pull the mean downward, while extremely high values inflate it. One must keep this property of the sample arithmetic mean in mind when using it to describe research results.
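A small numerical illustration shows how strongly a single extreme value can move the arithmetic mean. The data below are made up for the example; the median is printed alongside only for comparison.

```python
import statistics

scores = [48, 50, 51, 52, 49]       # made-up observations
with_outlier = scores + [500]       # the same data plus one extreme high value

print(statistics.fmean(scores))         # 50.0
print(statistics.median(scores))        # 50
print(statistics.fmean(with_outlier))   # 125.0
print(statistics.median(with_outlier))  # 50.5
```

One added observation moves the mean from 50 to 125, while the median barely changes.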
Because the arithmetic mean is susceptible to variability in the sample data, it is often insufficient to report only the sample mean without also showing the sample standard deviation. Whereas the mean describes the central location of the data, the standard deviation provides information about the variability of the data. Two sets of data with the same sample mean, but drastically different standard deviations, inform the reader that either they come from two different populations or they suffer from variability in quality control in the data collection process. Therefore, by reporting both statistics, one informs the reader of not only the quality of the data but also the appropriateness of using these statistics to describe the data, as well as the appropriate choice of statistical methods to analyze these data subsequently.
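The point about reporting both statistics can be seen with two made-up data sets that share the same mean but differ greatly in spread. The values and variable names below are purely illustrative.

```python
import statistics

tight = [49, 50, 50, 50, 51]     # made-up data, little variability
spread = [10, 30, 50, 70, 90]    # made-up data, large variability

for name, data in (("tight", tight), ("spread", spread)):
    mean = statistics.fmean(data)
    sd = statistics.stdev(data)   # sample standard deviation (n - 1 denominator)
    print(f"{name}: mean = {mean:.2f}, sd = {sd:.2f}")
# tight:  mean = 50.00, sd = 0.71
# spread: mean = 50.00, sd = 31.62
```

Reporting the mean alone would make the two samples look identical.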
Whether the mean is an appropriate or inappropriate statistic to describe the data is best illustrated by examples of some highly skewed sample data, such as data on the salaries of a corporation, on the house prices in a region, on the total family income in a nation, and so forth. These types of socioeconomic data are often distorted by a few high-income earners or a few high-end properties. The mean is thus an inappropriate statistic to describe the central location of the data, and the median would be a better statistic for the purpose. On the other hand, if one is interested in describing the height or the test score of students in a school, the sample mean would be a good description of the central tendency of the population as these types of data often follow a unimodal symmetric distribution.
Extreme values in a data set, if not inherent in a population, are often erroneous and may have either human or instrumental causes. These so-called outliers are therefore artifacts. In order to better estimate the population mean when extreme values occur in a sample, researchers sometimes order the observations in a sample from the smallest to the largest in value and then remove an equal percentage of observations from both the high end and the low end of the data range before applying the arithmetic mean definition to the remaining observations. An example is the awarding of a performance score to an athlete in a sport competition. Both the highest and the lowest scores given by the panel of judges are often removed before a final mean score is awarded. This variation of the arithmetic mean is called the truncated (or trimmed) mean. If a proportion α of the n ordered observations $x_{(1)} \le x_{(2)} \le \dots \le x_{(n)}$ is removed from each end, the truncated mean is
$$\bar{x}_{\alpha} = \frac{1}{n(1 - 2\alpha)} \sum_{i = \alpha n + 1}^{n - \alpha n} x_{(i)},$$
where αn is taken to be a whole number.
In reporting the truncated mean, one must give the percentage of the removed data points in relation to the total number of observations, that is, the value of α, in order to inform the reader how the truncated mean is arrived at. Even with the removal of some extreme values, the truncated mean is still not immune to problematic data, particularly if the sample size n is small.
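A truncated mean can be sketched in Python as follows. This is a minimal illustration, not a definitive implementation: the function name `truncated_mean` is hypothetical, α is assumed here to be the proportion removed from each end, and the number of observations dropped per end is assumed to be αn rounded down.

```python
import math

def truncated_mean(values, alpha):
    """Mean after dropping floor(alpha * n) observations from each end (assumed convention)."""
    if not 0 <= alpha < 0.5:
        raise ValueError("alpha must satisfy 0 <= alpha < 0.5")
    ordered = sorted(values)
    n = len(ordered)
    if n == 0:
        raise ValueError("at least one observation is required")
    k = math.floor(alpha * n)     # observations removed from each end
    kept = ordered[k:n - k]
    return sum(kept) / len(kept)

data = [2, 4, 5, 5, 6, 6, 7, 45]     # made-up sample with extreme values at both ends
print(truncated_mean(data, 0.0))     # 10.0, the ordinary arithmetic mean
print(truncated_mean(data, 0.25))    # 5.5, after removing two values from each end
```

With only eight observations, removing 25% from each end discards half of the data, which illustrates why the truncated mean remains fragile when n is small.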
If the entire first quartile and the entire last quartile of the data points are removed after the observations of the data set are ordered from the smallest to the largest in value, the truncated mean of the sample is called the interquartile mean. The interquartile mean can be calculated as follows:
$$\bar{x}_{\mathrm{IQM}} = \frac{2}{n} \sum_{i = n/4 + 1}^{3n/4} x_{(i)},$$
where the lower limit beneath Σ indicates that the summation starts from the (n/4 + 1)th observation of the ordered data set, and the upper limit above Σ signals that the summation ends at the (3n/4)th observation. The factor 2/n reflects that only n/2 observations remain in the sum. As written, the formula applies when n is divisible by 4.
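The interquartile mean described here can be sketched in Python. The function name `interquartile_mean` is hypothetical, and the sketch deliberately handles only the case in which n is a multiple of 4, so that the summation limits are whole numbers.

```python
def interquartile_mean(values):
    """(2 / n) times the sum of the ordered observations n/4 + 1 through 3n/4."""
    ordered = sorted(values)
    n = len(ordered)
    if n == 0 or n % 4 != 0:
        raise ValueError("this sketch assumes n is a positive multiple of 4")
    # 1-based positions n/4 + 1 .. 3n/4 correspond to the 0-based slice [n/4 : 3n/4].
    middle = ordered[n // 4 : 3 * n // 4]
    return 2 * sum(middle) / n

data = [1, 3, 4, 5, 6, 7, 9, 40]     # made-up sample, n = 8
print(interquartile_mean(data))      # 5.5, the mean of 4, 5, 6, 7
```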
The mean is frequently referred to as the average. This interchangeable usage sometimes confuses the reader because the median is sometimes also called the average, as is routine in reporting house prices. The reader must be careful about which one of these statistics is actually being referred to.
The arithmetic mean as defined in Equation 3 is not always a good measure of the central location of the sample. An example is when the data bear considerable variability that has nothing to do with quality control in data collection but is instead inherent in the random process that gives rise to the data, such as the concentration of environmental chemicals in the air. Within a given day at a given location, the concentration can vary several-fold. Another example is the growth of bacteria on artificial media. The number of bacteria growing on the media at a given time may be influenced by the number of bacteria on the media at an earlier time, by the amount of media available for growth, by the media type, by the different antibiotics incorporated in the media, by the micro growing environment, and so on. The growth of the bacteria proceeds not in a linear pattern but in a multiplicative way. The central tendency of these types of data is best described by their product rather than their sum. The geometric mean, rather than the arithmetic mean, would thus be closer to the center of the data values. The geometric mean ($\bar{x}_G$) of n observations is calculated as
$$\bar{x}_G = \sqrt[n]{\prod_{i=1}^{n} x_i} = \left(x_1 x_2 \cdots x_n\right)^{1/n}. \tag{7}$$
Comparing Equation 7 with Equation 3, one can see that the difference is that the geometric mean is obtained by multiplying the observations in the sample first and then taking the nth root of their product. In contrast, the arithmetic mean is calculated by adding up the observations first and then dividing their sum by the number of observations in the sample. Because of the multiplying and the taking-the-nth-root operations, the geometric mean can be applied only to data of positive values, not to data of negative or zero values.
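The contrast between the two calculations can be sketched directly in Python. The function name `geometric_mean` is hypothetical and the data are made up; the sketch rejects non-positive values, in line with the restriction just described.

```python
import math
import statistics

def geometric_mean(values):
    """nth root of the product of n positive observations."""
    if len(values) == 0 or any(x <= 0 for x in values):
        raise ValueError("the geometric mean requires positive observations")
    return math.prod(values) ** (1 / len(values))

counts = [1, 10, 100]                    # made-up data spanning two orders of magnitude
print(statistics.fmean(counts))          # 37.0, pulled toward the largest value
print(geometric_mean(counts))            # approximately 10, the multiplicative center
print(geometric_mean([2, 8]))            # 4.0
```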
When the sample size n is large, the product of the observations can become too large (or too small) to be represented accurately, even on modern computers, which makes taking the nth root of the product unreliable. One way to resolve this difficulty is to transform the values of all observations to a logarithmic scale. The multiplication then becomes a summation, and taking the nth root of the product is replaced by dividing the sum of the logarithms by n. The geometric mean is then obtained by applying an antilogarithm operation to the result. That is,
$$\log \bar{x}_G = \frac{1}{n} \sum_{i=1}^{n} \log x_i,$$
and the geometric mean is then
$$\bar{x}_G = \operatorname{antilog}\!\left(\frac{1}{n} \sum_{i=1}^{n} \log x_i\right).$$
Here, the base of the logarithm can be either e (≈ 2.718281828, the base of the natural logarithm) or 10. Most often, 10 is used as the base. The result does not depend on the choice, provided the antilogarithm is taken in the same base.
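The logarithmic route can be sketched in Python as follows. The function name `geometric_mean_log` and the data are illustrative assumptions; the example uses floating-point values large enough that their direct product overflows to infinity, whereas the sum of logarithms stays well within range.

```python
import math

def geometric_mean_log(values, base=10.0):
    """Geometric mean via logarithms: antilog of the mean of the logs."""
    if len(values) == 0 or any(x <= 0 for x in values):
        raise ValueError("the geometric mean requires positive observations")
    log_mean = sum(math.log(x, base) for x in values) / len(values)
    return base ** log_mean      # antilogarithm in the same base

large = [1e200] * 5                      # made-up data with very large values
naive_product = math.prod(large)         # floating-point overflow: becomes infinity
print(math.isinf(naive_product))         # True

gm_base10 = geometric_mean_log(large, base=10.0)
gm_base_e = geometric_mean_log(large, base=math.e)
print(gm_base10, gm_base_e)              # both close to 1e200
```

Both bases give the same answer up to rounding error, because the antilogarithm undoes the logarithm in whichever base was used.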
See also: Median, Random Variable, Standard Deviation