Motivation:
Averages are tools for summarizing data and finding a representative value. They facilitate comparison. Different averages suit different situations. The first and second aspects of data analysis are the collection and the presentation of data respectively. The third aspect is analysis and the fourth is interpretation. Averages are useful in both analysis and interpretation. The box plot technique gives a quick overall view. Percentile rank is another important tool of comparison.
Introduction:
We have studied in the previous chapters various methods of summarizing data and representing it graphically. However, it becomes essential to condense the data into a single value for the purpose of comparison. This value is treated as a representative of the data. There are various methods of selecting a single central value; we discuss them in detail in this chapter.
Central Tendency: By means of classification and the frequency curve we get an idea of the shape of a frequency distribution. In most frequency distributions we observe that the class frequencies are not all the same. Initially the frequency is small; it then increases, reaches a maximum in the middle part of the data, and falls off. In other words, the frequency curve is bell-shaped. The observations are thus not uniformly spread: most of them cluster in the central part of the data. This property of the observations is described as central tendency.
Naturally we select a representative observation from the central part. This is referred to as an average or measure of central tendency.
It is desired that all the important properties of the observations in the data should be represented in the average. The word average is very commonly used in day-to-day life.
For example: average marks, average profit, average run-rate of a team in one day. A single value is suitable for comparison; therefore, the average is an essential quantity. An average is a value around which most of the observations are clustered, hence this single value itself gives a clear idea of the phenomenon under study.
There are several types of averages used in practice according to the type of data and purpose.
Objectives and Requisites of Average
Objectives of average:
1. To obtain a single representative quantity for the entire data.
2. To facilitate comparison.
There are several averages in use, hence it is necessary to discuss the requisites of good or ideal average.
Requisites of good average :
It should be simple to understand and easy to calculate.
It should be rigidly defined.
It should be based on all observations in the data.
It should be capable of further mathematical treatment.
It should be least affected by extreme observations.
It should possess sampling stability.
Types of Averages or Measures of Central Tendency
There are various types of averages used in practice. Those are listed below:
1. Arithmetic mean (simple and weighted).
2. Median.
3. Mode.
Moreover, there are two more averages obtained after transforming the data: (i) Geometric Mean, (ii) Harmonic Mean.
Among the averages stated above, the arithmetic mean, geometric mean and harmonic mean are called mathematical averages, while the median and mode are called positional averages.
According to the nature of data, type of average is decided. Different averages possess different advantages and disadvantages. The details are discussed later.
Arithmetic Mean
This is a very commonly used and widely applicable average. Definition: The arithmetic mean (A.M.), or mean, is the sum of the observations divided by the number of observations, i.e.
$$\text{A.M.} = \frac{\text{sum of observations}}{\text{number of observations}}$$
According to the different types of data calculation of A.M. differs slightly. We consider these cases as given below:
Case (i) Individual observations or ungrouped data:
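Suppose $x_1, x_2, \ldots, x_n$ is a set of n observations. By definition, the arithmetic mean will be
$$\text{A.M.} = \frac{x_1 + x_2 + \cdots + x_n}{n} \qquad \ldots (1)$$
The numerator of the right side of (1) can be symbolically written as $\sum_{i=1}^{n} x_i$, i.e. $\sum_{i=1}^{n} x_i = x_1 + x_2 + \cdots + x_n$. The symbol $\Sigma$ (sigma) represents the sum. Further, it is customary to denote the A.M. by $\bar{x}$. Hence,
$$\text{A.M.} = \bar{x} = \frac{\sum_{i=1}^{n} x_i}{n}$$
For simplicity, $\sum_{i=1}^{n} x_i$ will be written as $\sum x$.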
Case (ii) Discrete frequency distribution:
Suppose $x_1, x_2, \ldots, x_n$ are values with $f_1, f_2, \ldots, f_n$ as the corresponding frequencies. Clearly, to find the sum of the observations we need to add observation $x_1$, $f_1$ times, observation $x_2$, $f_2$ times, and so on. Hence the sum of the observations will be $f_1 x_1 + f_2 x_2 + \cdots + f_n x_n$ and the total number of observations will be $f_1 + f_2 + \cdots + f_n$. Hence,
$$\bar{x} = \frac{f_1 x_1 + f_2 x_2 + \cdots + f_n x_n}{f_1 + f_2 + \cdots + f_n}$$
Using $\Sigma$ notation we get
$$\bar{x} = \frac{\sum_{i=1}^{n} f_i x_i}{\sum_{i=1}^{n} f_i}$$
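Cases (i) and (ii) can be sketched in a few lines of Python. The function names and the small data sets below are illustrative assumptions, not part of the text; the sketch only shows that the frequency-weighted formula gives the same result as writing every observation out individually.

```python
def mean_ungrouped(xs):
    """Case (i): sum of the observations divided by their number."""
    return sum(xs) / len(xs)


def mean_discrete(values, freqs):
    """Case (ii): sum(f_i * x_i) / sum(f_i)."""
    weighted_sum = sum(f * x for x, f in zip(values, freqs))
    return weighted_sum / sum(freqs)


# Hypothetical discrete frequency distribution.
values = [0, 1, 2, 3]
freqs = [2, 5, 2, 1]

# The same data written out observation by observation.
expanded = [x for x, f in zip(values, freqs) for _ in range(f)]

print(mean_ungrouped([4, 8, 6]))      # 6.0
print(mean_discrete(values, freqs))   # 1.2
print(mean_ungrouped(expanded))       # 1.2
```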
Case (iii) Continuous frequency distribution:
In this case, the frequency is associated with the entire class and not with any specific single value. This creates difficulty in choosing $x_1, x_2, \ldots, x_n$.
For calculation purposes we make the reasonable assumption that the frequency is associated with the mid-point of the class, or equivalently that the frequency is distributed uniformly over the respective class. Thus, taking $x_1, x_2, \ldots, x_n$ as the mid-values of the class intervals, we calculate the mean by the same formula discussed in case (ii), i.e.
$$\bar{x} = \frac{\sum f_i x_i}{\sum f_i} = \frac{\sum f_i x_i}{N}$$
where $N = \sum f_i$ is the total frequency.
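The mid-point assumption of case (iii) can be illustrated with a short Python sketch. The class boundaries and frequencies are made-up values and `mean_grouped` is a hypothetical helper name.

```python
def mean_grouped(boundaries, freqs):
    """Mean of a continuous frequency distribution.

    boundaries: list of (lower, upper) class boundaries
    freqs:      class frequencies, in the same order
    Each class is represented by its mid-value.
    """
    mids = [(lower + upper) / 2 for lower, upper in boundaries]
    n_total = sum(freqs)  # N, the total frequency
    return sum(f * m for m, f in zip(mids, freqs)) / n_total


# Hypothetical distribution with classes 0-10, 10-20, 20-30, 30-40.
classes = [(0, 10), (10, 20), (20, 30), (30, 40)]
freqs = [3, 7, 6, 4]

print(mean_grouped(classes, freqs))  # 20.5
```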
Trimmed Mean
Outliers: Some observations in a data set fall outside the usual pattern of the data. Such points are called outliers.
Observations at the extremes are often treated as outliers. Computing the arithmetic mean in the presence of outliers may give misleading results, and an interpretation based on such a summary is not very reliable.
For example, examination score data may have minimum and maximum scores far off from the remaining observations.
To overcome this difficulty, it is suggested to ignore, or trim off, the outliers. A trimming factor of α% means that we ignore the topmost α% and the lowest α% of the observations (i.e. nα/100 observations at each end).
Trimmed mean: The arithmetic mean obtained by ignoring the lowest α% as well as the highest α% of the observations is called the α% trimmed mean. Note:
(1) In general α% may be taken as 5% to 10%, or even 20%; there is no hard and fast rule.
(2) We need to arrange the observations in increasing or decreasing order of magnitude before computing the trimmed mean.
(3) Many statistical software packages compute the trimmed mean.
(4) If α% of n is not an integer, it may be rounded off.
(For example: 10% of 45 comes to 4.5, so we ignore the 5 highest and the 5 lowest observations.)
Illustration 1: Weights of 20 students in kg are given below:
50, 28, 32, 30, 42, 26, 40, 31, 38, 51, 48, 33, 45, 36, 40, 29, 43, 48, 52, 60.
Obtain:
(i) 5% trimmed arithmetic mean.
(ii) 10% trimmed arithmetic mean.
(iii) arithmetic mean.
Solution: Step 1: We arrange the observations in increasing order of magnitude.
26, 28, 29, 30, 31, 32, 33, 36, 38, 40, 40, 42, 43, 45, 48, 48, 50, 51, 52, 60.
(i) The number of observations is n = 20. The trimming factor is 5%, and 5% of 20 is $20 \times \frac{5}{100} = 1$. We have to ignore the one highest (60) and the one lowest (26) observation to find the 5% trimmed mean. Thus, the mean of the middle 18 observations is the required mean, i.e.
$$\frac{28 + 29 + 30 + \cdots + 50 + 51 + 52}{18} = \frac{716}{18} = 39.7778.$$
(ii) 10% of 20 is $20 \times \frac{10}{100} = 2$. We need to ignore the two highest observations (52, 60) and the two lowest observations (26, 28). The mean of the middle 16 observations is the required mean, i.e.
$$\frac{29 + 30 + 31 + \cdots + 50 + 51}{16} = \frac{636}{16} = 39.75.$$
(iii) The usual arithmetic mean is
$$\frac{26 + 28 + 29 + \cdots + 51 + 52 + 60}{20} = \frac{802}{20} = 40.1.$$
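The computation in Illustration 1 can be reproduced with a small Python sketch. `trimmed_mean` is a hypothetical helper; it assumes that a fractional trimming count is rounded half up, which matches the "10% of 45" example in note (4) but is only one possible convention. The data are the 20 weights from the sorted list in the solution.

```python
import math


def trimmed_mean(xs, percent):
    """Mean after ignoring the lowest and highest `percent`% of observations."""
    data = sorted(xs)                          # note (2): order the data first
    n = len(data)
    k = math.floor(n * percent / 100 + 0.5)    # assumed rule: round half up
    if 2 * k >= n:
        raise ValueError("trimming would remove every observation")
    kept = data[k:n - k]
    return sum(kept) / len(kept)


weights = [50, 28, 32, 30, 42, 26, 40, 31, 38, 51,
           48, 33, 45, 36, 40, 29, 43, 48, 52, 60]

print(round(trimmed_mean(weights, 5), 4))    # 39.7778
print(round(trimmed_mean(weights, 10), 4))   # 39.75
print(trimmed_mean(weights, 0))              # 40.1 (usual arithmetic mean)
```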
Properties of Arithmetic Mean
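(1) The algebraic sum of the deviations of the observations from their arithmetic mean is zero, i.e. $\sum (x_i - \bar{x}) = 0$.
Proof: Note that $x_1 - \bar{x}, x_2 - \bar{x}, \ldots, x_n - \bar{x}$ are the deviations.
$$\text{Algebraic sum of deviations} = \sum (x_i - \bar{x}) = \sum x_i - n\bar{x} = n\bar{x} - n\bar{x} = 0$$
(since, by the definition of the mean, $\sum x_i = n\bar{x}$).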
(2) The sum of squares of the deviations taken from the arithmetic mean is minimum, i.e. $\sum (x_i - \bar{x})^2 \le \sum (x_i - a)^2$ for any constant a.
Proof: Suppose 'a' is any arbitrary constant. Then $x_i - \bar{x}$ is the deviation of $x_i$ from $\bar{x}$, and $x_i - a$ is the deviation of $x_i$ from a.
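One standard way to complete this argument is to expand the square about the mean. The sketch below uses only the notation already introduced (the mean $\bar{x}$ and the arbitrary constant a) together with property (1):

```latex
\begin{aligned}
\sum_{i=1}^{n}(x_i-a)^2
 &= \sum_{i=1}^{n}\bigl[(x_i-\bar{x})+(\bar{x}-a)\bigr]^2 \\
 &= \sum_{i=1}^{n}(x_i-\bar{x})^2
    + 2(\bar{x}-a)\sum_{i=1}^{n}(x_i-\bar{x})
    + n(\bar{x}-a)^2 \\
 &= \sum_{i=1}^{n}(x_i-\bar{x})^2 + n(\bar{x}-a)^2
    \qquad \text{(the middle term vanishes by property (1))} \\
 &\ge \sum_{i=1}^{n}(x_i-\bar{x})^2 .
\end{aligned}
```

Equality holds only when $a = \bar{x}$, since $n(\bar{x}-a)^2$ is positive for every other choice of a.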
Let $\bar{x}_1$ be the arithmetic mean of the first group of size $n_1$. Similarly, let $\bar{x}_2$ be the arithmetic mean of the second group of size $n_2$. Then the combined mean $\bar{x}_c$ is given by
$$\bar{x}_c = \frac{n_1\bar{x}_1 + n_2\bar{x}_2}{n_1 + n_2}$$
Proof: Note that
$$\bar{x}_1 = \frac{\text{Sum of observations in first group}}{n_1}$$
Hence, $n_1\bar{x}_1$ = sum of observations in the first group.
Similarly, $n_2\bar{x}_2$ = sum of observations in the second group. Thus, the combined mean $\bar{x}_c$ is
$$\bar{x}_c = \frac{(\text{Sum of the observations in first group}) + (\text{Sum of the observations in second group})}{(\text{Size of first group}) + (\text{Size of second group})}$$
$$\bar{x}_c = \frac{n_1\bar{x}_1 + n_2\bar{x}_2}{n_1 + n_2}$$
Remark: The above result can be generalised to k (k ≥ 2) groups as follows:
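Let there be k groups, with the size of the i-th group being $n_i$ and its arithmetic mean $\bar{x}_i$ (i = 1, 2, 3, ..., k). Then $\bar{x}_c$, the arithmetic mean of all the k groups combined together, is given by
$$\bar{x}_c = \frac{n_1\bar{x}_1 + n_2\bar{x}_2 + \cdots + n_k\bar{x}_k}{n_1 + n_2 + \cdots + n_k} = \frac{\sum n_i\bar{x}_i}{\sum n_i}$$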
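The combined-mean result for k groups can be checked numerically. The following Python sketch uses made-up groups and a hypothetical `combined_mean` helper; it compares the formula with the mean of the pooled observations.

```python
def combined_mean(sizes, means):
    """Combined mean of k groups: sum(n_i * mean_i) / sum(n_i)."""
    if len(sizes) != len(means):
        raise ValueError("sizes and means must have the same length")
    return sum(n * m for n, m in zip(sizes, means)) / sum(sizes)


# Hypothetical groups of observations.
group_a = [10, 20, 30]      # n1 = 3, mean 20
group_b = [40, 50]          # n2 = 2, mean 45

mean_a = sum(group_a) / len(group_a)
mean_b = sum(group_b) / len(group_b)

pooled = group_a + group_b
print(combined_mean([len(group_a), len(group_b)], [mean_a, mean_b]))  # 30.0
print(sum(pooled) / len(pooled))                                      # 30.0
```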
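(5) $\min(x_1, x_2, \ldots, x_n) \le \bar{x} \le \max(x_1, x_2, \ldots, x_n)$.
Proof: Let $\min(x_1, x_2, \ldots, x_n) = a$ and $\max(x_1, x_2, \ldots, x_n) = b$. Then $a \le x_i \le b$ for every i, so
$$\sum a \le \sum x_i \le \sum b$$
$$\therefore\; na \le \sum x_i \le nb$$
$$\therefore\; a \le \frac{\sum x_i}{n} \le b$$
$$\therefore\; a \le \bar{x} \le b$$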
Merits and Demerits of A.M.
A.M. possesses most of the requisites of a good average. Hence it is widely used. We state below its merits and demerits:
Merits:
1. It is easy to calculate and simple to follow.
2. It is based on all observations.
3. It is rigidly defined.
4. It possesses sampling stability.
5. It is capable of further mathematical treatment. Given the means and sizes of two or more groups, we can find the mean of the combined group.
Demerits :
1. It is applicable only for quantitative data.
2. It is unduly affected by extreme observations. (Hence, trimmed mean is used in some cases).
3. It cannot be computed for frequency distribution with open end class. (For an open end class we cannot find mid point).
4. It cannot be determined graphically.
5. Sometimes the arithmetic mean may not be an actual observation in the data.
For example, the average number of T.V. sets sold daily in a particular month may be 5.25, yet on no day can 5.25 T.V. sets actually be sold.
Median:
We have seen that the arithmetic mean cannot be calculated for qualitative observations like debating skill, honesty, blindness. Moreover, if a frequency distribution includes an open end class the mean cannot be computed, and the mean is unduly affected by extreme observations. In order to overcome these drawbacks, other measures of central tendency such as the median or mode are used.
Illustration: The A.M. of 38, 43, 41, 39, 52, 48, 60, 167 is 61. This cannot be said to be a representative value, because among these 8 observations 7 are smaller than the A.M. Thus, when extreme items are widely separated from most of the observations, the A.M. is not suitable and the median is preferred.
Definition: The median is the value of the middle most observation in the data when the observations are arranged in increasing (or decreasing) order of their values.
Thus, the median is the central observation. It divides the data into two equal parts: there are equal numbers of observations above and below the median. It is also called a positional average.
(i) Computation of Median-Ungrouped data: It may be noticed that in case of individual observations or ungrouped data, computation of median does not require any formula. It can be determined by inspection.
Suppose n is the number of observations in the data. If n is odd, then there is only one middle most observation, which is the ((n + 1)/2)-th observation.
On the other hand, if n is even, then there are two middle most observations, namely the (n/2)-th and the (n/2 + 1)-th. In this case we take the median to be the mean of these two middle most observations. We follow the procedure described below for calculating the median.
Step 1: Arrange the observations in increasing (or decreasing) order.
Step 2: Compute the median by the following criteria:
Median = the value of the ((n + 1)/2)-th observation, if n is odd.
Median = [(value of the (n/2)-th observation) + (value of the (n/2 + 1)-th observation)] / 2, if n is even.
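The two-step procedure for ungrouped data can be sketched in Python as follows. `median_ungrouped` is an illustrative name; the first data set is the one from the illustration above, whose A.M. is 61.

```python
def median_ungrouped(xs):
    """Median of individual observations (ungrouped data)."""
    data = sorted(xs)                  # Step 1: arrange in increasing order
    n = len(data)
    mid = n // 2
    if n % 2 == 1:
        # n odd: the ((n + 1)/2)-th observation (index mid when counting from 0)
        return data[mid]
    # n even: mean of the (n/2)-th and (n/2 + 1)-th observations
    return (data[mid - 1] + data[mid]) / 2


print(median_ungrouped([38, 43, 41, 39, 52, 48, 60, 167]))  # 45.5
print(median_ungrouped([7, 3, 5]))                          # 5
```

The median 45.5 is much closer to the bulk of the observations than the mean 61, which is pulled upwards by the single value 167.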
(ii) Computation of Median - Continuous frequency distribution: Suppose N is the total frequency. Since the variable under consideration is continuous, we can estimate the value of N/2th observation. Hence, regardless of N, whether it is even or odd in continuous frequency distribution, we take median to be the value of N/2th observation. Computational procedure:
Step 1: Obtain the class boundaries.
Step 2: Obtain less than cumulative frequencies.
Step 3: Locate the median class. Median class is the class in which median i.e. N/2th observation falls. In other words, it is in a class where less than cumulative frequency is equal to or exceeds N/2 for the first time.
Step 4: Apply the following formula and find the median:
$$\text{Median} = l + \left(\frac{N/2 - \text{c.f.}}{f}\right) \times h$$
where l = lower boundary of the median class
N = total frequency
c.f. = less than cumulative frequency of the class just preceding the median class
f = frequency of the median class
h = class width
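The steps for a continuous frequency distribution can be sketched in Python. The sketch assumes the usual interpolation formula Median = l + ((N/2 - c.f.)/f) × h with the symbols defined above; the function name and the data are illustrative assumptions.

```python
def median_grouped(boundaries, freqs):
    """Median of a continuous frequency distribution.

    boundaries: list of (lower, upper) class boundaries in increasing order
    freqs:      class frequencies, in the same order
    """
    n_total = sum(freqs)                     # N
    if n_total == 0:
        raise ValueError("total frequency is zero")
    half = n_total / 2
    cum = 0                                  # c.f. of the classes seen so far
    for (lower, upper), f in zip(boundaries, freqs):
        # Median class: first class whose less-than c.f. reaches N/2.
        if cum + f >= half:
            return lower + (half - cum) / f * (upper - lower)
        cum += f
    raise ValueError("median class not found")


# Hypothetical distribution: N = 30, so N/2 = 15 falls in the class 20-30.
classes = [(0, 10), (10, 20), (20, 30), (30, 40)]
freqs = [5, 8, 12, 5]

print(round(median_grouped(classes, freqs), 4))  # 21.6667
```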
Median - by Graphical Method:
Median can be obtained graphically by means of ogive curve. Plot less than cumulative frequency curve taking upper boundaries on x-axis, and less than cumulative frequency on y-axis. Draw a line parallel to x-axis passing through point N/2 on y-axis. From the point of intersection of the line and ogive curve, draw a perpendicular to x-axis. The value at the foot of perpendicular is the median.
Merits:
1. It is easy to understand and easy to calculate.
2. It is not affected by extreme observations.
3. It can be computed for a distribution with open end classes.
4. It can be determined graphically.
5. It is applicable to qualitative data also. In this case observations are arranged in order according to the quality and the middle most observation is obtained. The quality of this item is taken to be average quality or median quality.
Demerits :
1. It is not based on all the observations, hence it may not be a proper representative.
2. It is not capable of further mathematical treatment.
3. It is not as rigidly defined as the arithmetic mean.
Mode
It is yet another measure of central tendency developed to overcome the drawbacks of arithmetic mean. Apart from this, in some situations mode is the proper average.
Definition : The observation with maximum frequency or the most repeated observation is called as mode.
It is clear from the earlier discussion that the frequency curve is bell-shaped in the majority of situations. Thus initially the frequency is small, it increases and reaches a maximum, and then it declines. The value on the x-axis at which the maximum, or peak, of the frequency curve appears is the mode.
In the case of election results, the political party with the largest number of votes (i.e. maximum frequency) is considered the representative. Thus, it is the mode, or modal opinion. In this situation, the mode is the appropriate average.
Similarly, to estimate crop yield, a crop of very good or very poor quality is not considered. The quality of crop most commonly found is taken into account, which is nothing but the mode. In a titration experiment, out of three readings the repeated reading is taken to be the final reading. It is the mode and not the arithmetic mean. Thus in a number of situations the mode is appropriate.
(i) Computation of mode - Individual observations and Discrete frequency distribution:
In this case we can find the observation with the largest frequency just by inspection. If the largest frequency occurs twice (or more), then we say that there are two (or many) modes.
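Finding the mode by inspection amounts to counting how often each observation occurs. A minimal Python sketch, with the hypothetical helper `modes` returning every value that attains the largest frequency:

```python
from collections import Counter


def modes(xs):
    """Return all observations with the largest frequency, in sorted order."""
    if not xs:
        raise ValueError("no observations")
    counts = Counter(xs)
    top = max(counts.values())
    return sorted(value for value, count in counts.items() if count == top)


print(modes([2, 3, 3, 5, 7, 3, 5]))   # [3]
print(modes([1, 1, 2, 2, 4]))         # [1, 2]  (two modes)
```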
(ii) Mode (for continuous frequency distribution):
The mode lies in the class with maximum frequency. The position of the mode depends upon the premodal and postmodal frequencies. Clearly, if the premodal and postmodal frequencies are equal, the mode occupies the centre B of the modal class (AC) as shown in the following figure.
However, if the premodal frequency is larger than the postmodal frequency, the mode shifts proportionately to before the centre, i.e. B shifts towards A. On the other hand, if the postmodal frequency is larger than the premodal frequency, then the mode shifts towards C. It is like a tug of war between the premodal and postmodal class frequencies. In the derivation we need to find exactly by how much the mode shifts from the centre.
Suppose $f_m$ = frequency of modal class, i.e. maximum frequency
$f_1$ = premodal class frequency
$f_2$ = postmodal class frequency
The shift in the mode from the centre is in the proportion $(f_m - f_1)$ to $(f_m - f_2)$. In other words, the mode (point B) divides the line AC internally in the ratio $(f_m - f_1) : (f_m - f_2)$:
$$\frac{AB}{BC} = \frac{f_m - f_1}{f_m - f_2}$$
Using the above equation one can find AB and hence the value of the mode.
(ii) Computation of mode - Continuous frequency distribution:
Step 1: Obtain the class boundaries.
Step 2: Locate the modal class. The modal class is the class in which the mode lies, i.e. the class with the largest frequency.
Step 3: Apply the formula and find the mode.
$$\text{Mode} = l + \left(\frac{f_m - f_1}{2f_m - f_1 - f_2}\right) \times h$$
where,
l = lower boundary of modal class
$f_m$ = frequency of modal class
$f_1$ = frequency of premodal class
$f_2$ = frequency of postmodal class
h = width of modal class
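The three steps can be sketched in Python using the formula Mode = l + ((fm - f1)/(2fm - f1 - f2)) × h. The helper name and the data are illustrative assumptions. The sketch treats a modal class lying at either end of the distribution as an error, since there is then no premodal or postmodal class to use.

```python
def mode_grouped(boundaries, freqs):
    """Mode of a continuous frequency distribution.

    boundaries: list of (lower, upper) class boundaries in increasing order
    freqs:      class frequencies, in the same order
    """
    # Step 2: modal class = class with the largest frequency (first one if tied).
    m = max(range(len(freqs)), key=lambda i: freqs[i])
    if m == 0 or m == len(freqs) - 1:
        raise ValueError("modal class is at an extreme; mode is indeterminate")
    fm, f1, f2 = freqs[m], freqs[m - 1], freqs[m + 1]
    lower, upper = boundaries[m]
    # Step 3: apply the formula with h = upper - lower.
    return lower + (fm - f1) / (2 * fm - f1 - f2) * (upper - lower)


classes = [(0, 10), (10, 20), (20, 30), (30, 40)]

# Unequal neighbours: the mode is pulled towards the heavier premodal class.
print(round(mode_grouped(classes, [5, 8, 12, 5]), 4))   # 23.6364

# Equal premodal and postmodal frequencies: the mode is the class centre.
print(mode_grouped(classes[:3], [4, 10, 4]))            # 15.0
```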
(iii) Computation of mode - by Empirical relation
Arithmetic mean, mode and median are averages, hence we expect that those should be identical in value. However, this is true only in ideal situation. It is true whenever the frequency curve is perfectly symmetric and bell-shaped. For a moderately asymmetric unimodal frequency distribution, the following empirical relationship holds approximately.
$$\text{Mean} - \text{Mode} = 3\,(\text{Mean} - \text{Median}) \qquad \ldots (1)$$
In some situations the mode is ill-defined. To overcome this difficulty in computing the mode, the empirical relation (1) is used. If any two of the averages included in (1) are known, the remaining third can be computed. Therefore, if the mean and median are known, then the mode can be determined.
The empirical relation cannot be proved theoretically. Karl Pearson stated it on the basis of vast experience. The relationship has been observed to hold for a number of data sets after actual computation.
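Rearranging the empirical relation Mean - Mode = 3 (Mean - Median) gives Mode = 3 × Median - 2 × Mean. The Python sketch below applies this rearrangement to made-up values; `empirical_mode` is a hypothetical helper name, and the result is only an approximation for moderately asymmetric unimodal data.

```python
def empirical_mode(mean, median):
    """Approximate mode from the empirical relation: Mode = 3*Median - 2*Mean."""
    return 3 * median - 2 * mean


# Hypothetical values of mean and median.
mean, median = 50, 48
mode = empirical_mode(mean, median)
print(mode)                                   # 44

# The relation is satisfied: Mean - Mode equals 3 * (Mean - Median).
print(mean - mode == 3 * (mean - median))     # True
```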
(iv) Computation of mode by graphical method: Mode can be obtained graphically with the help of histogram. Mode is the x-co-ordinate of point P or the value at foot of perpendicular from P to x-axis, as shown in figure:
Merits and Demerits of mode :
Merits:
1. It is simple to understand and easy to compute.
2. It is applicable for qualitative and quantitative data.
3. It is not affected by extreme observations.
4. It can be computed for distribution with open end classes.
5. It can be determined graphically.
Demerits:
1. It is not based on all the observations.
2. It is not capable of further mathematical treatment.
3. It is not rigidly defined.
4. It is indeterminate if the modal class is at the extreme of the distribution.
Geometric Mean (G.M.)
The following example illustrates the need of an average other than arithmetic mean.
Illustration : Suppose the price of an article is increased by 100% in 1992. It is reduced by 50% in the next year. Find the average change in the price.
Solution:
Year    Change in price (%)
1992    100
1993    -50
Arithmetic mean of change in price = [100 + (-50)] / 2 = 25
Thus the average change (increase) in price is 25%.
However, we notice that the actual average change is zero. Suppose the price in 1991 is a; then in 1992 it becomes 2a, due to the 100% increase. Further, due to the 50% decrease, in 1993 the price is again a. Thus over the span of these two years the original price is retained, and the average change ought to be zero.
This situation demands average other than arithmetic mean. We compute average change in price as follows:
Here we need to determine the relative change in price. We call it as growth ratio (y). It is computed as follows:
Growth ratio (y) = Current year price/Previous year price
Therefore y indicates the per unit change in price. In 1992 the price increased by 100%, which means it doubled, hence the corresponding y = 2. In 1993 the price decreased by 50%, which means that the price was reduced to half. Therefore the corresponding y = 1/2. Suppose the price in 1991 is a.

Year    Percent change    Current year price    y = Current year price / Previous year price
1991    -                 a                     -
1992    100               2a                    2a ÷ a = 2
1993    -50               a                     a ÷ 2a = 1/2

$$\text{Average of } y = \sqrt{2 \times \tfrac{1}{2}} = 1 \qquad \ldots (1)$$
Average percent change = (Average of y - 1) × 100 = (1 - 1) × 100 = 0
The average of y determined in (1) is called a geometric mean. Therefore in this situation geometric mean gives the correct value.
Definition: The geometric mean (G.M.) of n observations is defined as the n-th root of their product.
If $x_1, x_2, \ldots, x_n$ are the observations, then the geometric mean G of these observations is given by
$$G = (x_1 \cdot x_2 \cdot x_3 \cdots x_n)^{1/n}$$
The product $x_1 x_2 x_3 \cdots x_n$ can be symbolically written as $\prod_{i=1}^{n} x_i$ ($\Pi$, read as pi, represents the product).
Therefore,
$$G = \left(\prod x_i\right)^{1/n} \qquad \ldots (2)$$
Sometimes $\prod x_i$ is too large, hence it is difficult to compute. Therefore we use logarithms and simplify it.
Clearly,
$$\log G = \frac{1}{n}\log\left(\prod x_i\right)$$
$$\log G = \frac{1}{n}\sum \log x_i$$
$$G = \text{Antilog}\left(\frac{1}{n}\sum \log x_i\right)$$
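The logarithmic form of the G.M. translates directly into code. The Python sketch below is an illustration under the assumption that natural logarithms are used, with the exponential function playing the role of the antilog; any base gives the same result. `geometric_mean` is a hypothetical helper, applied here to the growth ratios 2 and 1/2 from the price illustration.

```python
import math


def geometric_mean(xs):
    """G = antilog( (1/n) * sum(log x_i) ), for positive observations only."""
    if any(x <= 0 for x in xs):
        raise ValueError("G.M. is computed only for positive observations")
    mean_log = sum(math.log(x) for x in xs) / len(xs)
    return math.exp(mean_log)


# Growth ratios for 1992 and 1993 from the price illustration.
growth_ratios = [2, 0.5]
g = geometric_mean(growth_ratios)

print(round(g, 10))   # 1.0, i.e. no average change in price
```

Working with the sum of logarithms avoids forming the product of all the observations, which may be too large to handle when n is large.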
Uses of G.M.
Arithmetic mean is an important and widely used average, whereas geometric mean is not much in use. However, in certain situations geometric mean is more appropriate. The following are situations where, G.M. is preferred.
1. Average change in percent.
2. Average of bank interest rates.
3. Average of depreciation in the cost of a certain machine.
4. Average of population growth.
5. Average rate of returns on share.
In general, G.M. is appropriate if the values are ratios or percentages. Similarly if the values are approximately in geometric progression, then also G.M. is proper average to find rate of growth.
Due to some mathematical properties of G.M., it is popularly used in the construction of index numbers.
Merits of G.M.
1. It is based on all the observations.
2. It is rigidly defined.
3. It is capable of further mathematical treatment, such as finding the combined G.M. of two sets.
4. It is not unduly affected by extreme observations.
Demerits of G.M.
1. The serious drawback of the G.M. is that it is zero if any of the observations is zero.
2. It is not simple to understand and calculate.
3. It may be imaginary if some observations are negative. Therefore it is calculated only for data containing positive values.
4. It is not applicable to qualitative data.
5. It cannot be determined graphically.
6. It cannot be computed if the frequency distribution includes an open end class.
7. It may not be an actual observation in the data.
Harmonic Mean (H.M.)
The following example illustrates the need of an average other than A.M. and G.M.
Illustration: Suppose a train, on leaving the terminus, travels the first kilometre at a speed of 10 km per hour. For the next kilometre, the speed is 15 km per hour. Compute the average speed over these two kilometres.
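A minimal Python sketch of this illustration follows. It assumes the usual definition of the harmonic mean, namely the reciprocal of the arithmetic mean of the reciprocals of the observations (this definition is an assumption here, as it is not stated above), and compares it with the average speed computed as total distance divided by total time. `harmonic_mean` is a hypothetical helper name.

```python
def harmonic_mean(xs):
    """H.M. = n / sum(1 / x_i), for positive observations (assumed definition)."""
    if any(x <= 0 for x in xs):
        raise ValueError("H.M. is computed only for positive observations")
    return len(xs) / sum(1 / x for x in xs)


speeds = [10, 15]                  # km per hour, one kilometre at each speed

# Direct computation: total distance / total time.
total_distance = 2                 # km
total_time = 1 / 10 + 1 / 15       # hours
direct_average = total_distance / total_time

print(round(harmonic_mean(speeds), 6))   # 12.0
print(round(direct_average, 6))          # 12.0
print(sum(speeds) / len(speeds))         # 12.5 (arithmetic mean, too high)
```

The arithmetic mean of 12.5 km per hour overstates the average speed because the train spends more time on the slower kilometre.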
Merits of H.M.:
1. It is based on all observations.
2. It is rigidly defined.
3. It is capable of further mathematical treatment.
Demerits of H.M.:
1. If any of the observations is zero, the H.M. cannot be defined.
2. It is not as simple to compute or as easy to understand as the A.M.
3. It is not applicable to qualitative data.
4. It cannot be computed for a frequency distribution with an open end class.
5. It cannot be determined graphically.
6. It may not be an actual observation in the data.
7. Since the H.M. is calculated to find the average of rates etc., it is meaningful only for positive observations.
Uses of H.M.
Harmonic mean is appropriate to compute average speed, average rates etc. (where the rates are specified in units per Re.)
Ordering of A.M., G.M., H.M.
We observe the following ordering of the A.M., G.M. and H.M. for any data consisting of positive observations:
A.M. ≥ G.M. ≥ H.M.
The three averages coincide when all the observations are equal.
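The ordering can be checked numerically for positive data. The Python sketch below uses illustrative helper names (`am`, `gm`, `hm`) and made-up data; the harmonic mean is taken as the reciprocal of the mean of reciprocals, which is the usual definition and an assumption here. A numerical check on a few data sets illustrates the ordering but is not a proof.

```python
import math


def am(xs):
    """Arithmetic mean."""
    return sum(xs) / len(xs)


def gm(xs):
    """Geometric mean via logarithms (positive observations only)."""
    return math.exp(sum(math.log(x) for x in xs) / len(xs))


def hm(xs):
    """Harmonic mean: n / sum(1 / x_i) (positive observations only)."""
    return len(xs) / sum(1 / x for x in xs)


data = [10, 15]
a, g, h = am(data), gm(data), hm(data)
print(round(a, 4), round(g, 4), round(h, 4))   # 12.5 12.2474 12.0
print(a >= g >= h)                             # True
```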





