Statistics Concepts

Classification

In order to study a characteristic or a group of characteristics of any type, the first phase is to collect the data.

Raw data

The unprocessed data in terms of individual observations is called raw data.

For example, the incomes of 5000 individuals, recorded one by one, are given for analysis.

It is essential to condense such data into a suitable form. Classification can be used as a tool to condense the data.

Classification

The entire process of making homogeneous and non-overlapping groups of observations according to similarities is called classification. The groups so formed are called class intervals or classes.

The objectives of classification can be summarized as follows:

It condenses the data.
It omits unnecessary details.
It facilitates the comparison with other data.

Frequency Distribution

Frequency

The number of observations in a class is called frequency or class frequency.

Frequency Distribution

A table containing class intervals along with frequencies is called frequency distribution.
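
As a minimal sketch of these two definitions, the snippet below counts how often each value occurs in a small discrete data set. The data (number of children in 12 families) and the function name `frequency_distribution` are made up for illustration and do not come from the text.

```python
from collections import Counter

# Hypothetical raw data: number of children in 12 families (illustrative only).
raw_data = [2, 0, 1, 3, 2, 2, 1, 0, 4, 2, 1, 3]


def frequency_distribution(observations):
    """Return (value, frequency) pairs sorted by value."""
    counts = Counter(observations)
    return sorted(counts.items())


table = frequency_distribution(raw_data)
for value, frequency in table:
    print(f"{value}\t{frequency}")
```

Each row pairs a class (here a single value) with its frequency, and the frequencies add up to the number of observations.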

Frequency distribution of continuous variable:

The procedure of classification of continuous variable differs slightly from that of a discrete variable.

Procedure:

Find the smallest and the largest observation. Calculate the difference between them. This difference is called the range.
Decide the classes by dividing the range into several intervals. The number of classes should preferably be between 7 and 20.
Prepare the first column of the table by entering the class intervals.
Classify the observations one by one into the appropriate class by putting a tally mark in the second column against the corresponding class. Cross out each observation in the original data once it is tallied, to avoid double counting.
Count the tally marks and enter the number in the third column.
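
The procedure above can be sketched in code as follows. This is only an illustration under stated assumptions: the heights are invented, the classes have equal width, and an observation equal to an upper limit is placed in the next class. The function name `build_frequency_table` and its parameters are hypothetical.

```python
def build_frequency_table(observations, lower, width, num_classes):
    """Count observations in equal-width classes starting at `lower`.

    Assumption: a value equal to an upper limit goes to the next class.
    """
    frequencies = [0] * num_classes
    for x in observations:
        index = int((x - lower) // width)  # position of the class containing x
        if not 0 <= index < num_classes:
            raise ValueError(f"{x} falls outside the chosen classes")
        frequencies[index] += 1  # plays the role of one tally mark
    classes = [(lower + i * width, lower + (i + 1) * width)
               for i in range(num_classes)]
    return list(zip(classes, frequencies))


# Hypothetical raw data: heights in cm (illustrative only).
heights = [151.2, 148.0, 163.5, 170.1, 155.0, 159.9, 160.0, 167.3, 152.8, 174.6]

# Step 1: the range is the largest observation minus the smallest.
data_range = max(heights) - min(heights)

# Steps 2 to 5: seven classes of width 4 starting at 148 cover the range.
table = build_frequency_table(heights, lower=148, width=4, num_classes=7)
for (low, high), frequency in table:
    print(f"{low}-{high}\t{frequency}")
```

Counting in a single pass over the data replaces the tally marks and the crossing out, since each observation is visited exactly once.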

Methods of Classification

There are two methods of classification: (i) inclusive method and (ii) exclusive method. We bring out the difference between the two methods.

Inclusive Method

In this method, an observation equal to the upper limit is included in the same class. Therefore, the method is called the inclusive method. It can be observed that the upper limit of a class is not the same as the lower limit of the succeeding class. Therefore, a discontinuity is observed between the classes. For example, with the classes 10-19 and 20-29, the observation 19 is included in 10-19, and there is a gap between 19 and 20.

Exclusive Method

In this method, an observation equal to the upper limit does not belong to the same class. It is included in the next class. In other words, the observation equal to the upper limit is excluded from the same class. Therefore, the method is called the exclusive method. For example, the observation 4000 is included in the class 4000-5000 and not in the class whose upper limit is 4000.

In this case, the upper limit of one class is the lower limit of the subsequent class. The classes are observed to be continuous without any gap between them.
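
The difference between the two methods comes down to how the upper limit is compared. The sketch below shows one possible way to express this. The function names are hypothetical, and the class 3000-4000 is an assumed neighbour of the class 4000-5000 mentioned above.

```python
def classify_inclusive(x, classes):
    """Inclusive method: both limits belong to the class (lower <= x <= upper)."""
    for lower, upper in classes:
        if lower <= x <= upper:
            return (lower, upper)
    return None  # x falls in a gap between classes or outside them


def classify_exclusive(x, classes):
    """Exclusive method: the upper limit is excluded (lower <= x < upper)."""
    for lower, upper in classes:
        if lower <= x < upper:
            return (lower, upper)
    return None


inclusive_classes = [(0, 9), (10, 19), (20, 29)]
# The class 3000-4000 is assumed here for illustration.
exclusive_classes = [(3000, 4000), (4000, 5000), (5000, 6000)]

print(classify_inclusive(19, inclusive_classes))    # upper limit stays in 10-19
print(classify_exclusive(4000, exclusive_classes))  # upper limit moves to 4000-5000
print(classify_inclusive(19.6, inclusive_classes))  # no class covers the gap
```

The last call returns `None`, which is the discontinuity between inclusive classes in concrete form: an unrounded value such as 19.6 has no class until it is rounded or the classes are extended.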

Class limits

The two numbers designating the class interval are called class limits. The smallest possible observation that can be included in the class is the lower limit, and the largest possible observation that can be included in the class is the upper limit. For example, for the class interval 60-69, 60 is the lower limit and 69 is the upper limit.

Class boundaries

The class boundaries are the numbers up to which the actual magnitude of an observation in the class can extend. The class boundaries are also called actual limits or extended limits. For the sake of clarity, consider a frequency distribution with classes 10-19, 20-29, and so on. In this case, an observation of 19.2 is rounded off to 19 and placed in 10-19, whereas an observation of 19.6 is rounded off to 20 and placed in 20-29. Therefore, the actual magnitude of an observation in the class 20-29 lies between 19.5 and 29.5. In the same way, the class limits 10-19 correspond to the class boundaries 9.5-19.5, and the class limits 20-29 correspond to the class boundaries 19.5-29.5.

It can be clearly seen that in the case of the exclusive method of classification, class limits and class boundaries are the same. Using class boundaries, the classes are made continuous; however, the frequencies originally associated with the classes do not change.
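
One way to compute class boundaries from class limits is to split the gap between consecutive classes equally. The sketch below assumes at least two classes and the same gap between every pair of neighbouring classes. The function name `class_boundaries` is hypothetical.

```python
def class_boundaries(limits):
    """Convert class limits to class boundaries.

    Assumes at least two classes and a constant gap between neighbours.
    Half of the gap is subtracted from each lower limit and added to each
    upper limit.
    """
    gap = limits[1][0] - limits[0][1]  # e.g. 20 - 19 = 1 for 10-19, 20-29
    half = gap / 2
    return [(lower - half, upper + half) for lower, upper in limits]


inclusive_limits = [(10, 19), (20, 29), (30, 39)]
exclusive_limits = [(10, 20), (20, 30), (30, 40)]

print(class_boundaries(inclusive_limits))
print(class_boundaries(exclusive_limits))
```

For the exclusive classes the gap is zero, so the boundaries coincide with the limits, in line with the remark above.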

Class mark or Mid-values

It is the midpoint of the class interval and can be obtained as follows:

Class mark = (Lower limit + Upper limit)/2

Class-width

It is the actual length of the class interval. We can find the class width as follows:

Class width = Upper class boundary - Lower class boundary
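
A short sketch of both quantities, taking the class mark as the average of the two limits and the class width as the difference between the two class boundaries. The function names are hypothetical, and the class 10-19 with boundaries 9.5-19.5 is used as the example.

```python
def class_mark(lower, upper):
    """Midpoint of the class interval."""
    return (lower + upper) / 2


def class_width(lower_boundary, upper_boundary):
    """Actual length of the class, measured between class boundaries."""
    return upper_boundary - lower_boundary


print(class_mark(10, 19))       # from class limits
print(class_mark(9.5, 19.5))    # from class boundaries, same midpoint
print(class_width(9.5, 19.5))   # actual length of the class 10-19
```

Note that subtracting the class limits of an inclusive class (19 - 10 = 9) understates the width; the boundaries give the actual length of 10.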

Open end class

A class in which one of the limits is not specified is called an open end class.

For example, a frequency distribution whose first class is 'below 2000' and whose last class is '4000 and above' has two open end classes.

The class 'below 2000' has no lower limit and the class '4000 and above' has no upper limit. Therefore, these classes are open end classes. Open end classes are used whenever the extreme observations are widely spread. In the case of an income distribution or the classification of sales of a company, open end classes may be required. Open end classes create some problems in further analysis. Therefore, as far as possible, open end classes should be avoided.

Cumulative Frequencies

In many situations, it is required to find the number of observations below or above a certain value. For example, one may need the number of persons below the poverty line from a frequency distribution of income, or the number of candidates scoring above 60 from a frequency distribution of examination marks. In such cases, cumulative frequencies are very useful. There are two types of cumulative frequencies:

Less than type cumulative frequency
More than type cumulative frequency

Less than Type Cumulative Frequency

Less than type cumulative frequency of a class is the number of observations less than or equal to the upper limit of the corresponding class. Similarly, more than type cumulative frequency is the number of observations more than or equal to the lower limit of the corresponding class.

It is clear from the above explanation that the less than type cumulative frequencies can be obtained by computing the cumulative sum of frequencies from the lowest class to the highest class. Similarly, the more than type cumulative frequencies can be obtained by computing the cumulative sum of frequencies from the highest class to the lowest class.
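
Both types of cumulative frequency can be sketched with a running sum. The classes and frequencies below are invented for illustration; the classes follow the inclusive method so that "less than or equal to the upper limit" applies directly.

```python
from itertools import accumulate

# Hypothetical frequency distribution (illustrative only).
classes = [(0, 9), (10, 19), (20, 29), (30, 39)]
frequencies = [4, 9, 12, 5]

# Less than type: running sum from the lowest class to the highest.
less_than = list(accumulate(frequencies))

# More than type: running sum from the highest class to the lowest,
# then flipped back so it lines up with the original class order.
more_than = list(accumulate(reversed(frequencies)))[::-1]

for (lower, upper), lt, mt in zip(classes, less_than, more_than):
    print(f"{lower}-{upper}\tat most {upper}: {lt}\tat least {lower}: {mt}")
```

The last less than value and the first more than value both equal the total number of observations, here 30.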

It can be noted that the less than type cumulative frequency never decreases from one class to the next. The less than type cumulative frequency of the lowest class is the same as its class frequency, and that of the highest class is the total number of observations. For more than type cumulative frequencies, exactly the reverse pattern is seen.

A table containing upper limits along with less than type cumulative frequency or lower limits along with more than type cumulative frequency is called a cumulative frequency distribution.

Relative Frequency

Two different frequency distributions may not have the same total frequency, hence for the purpose of comparison and interpretation, sometimes it is better to express the frequency of a class in terms of proportion (or percentage) of the total number of observations. The proportion of the number of observations in a class is the relative frequency. Therefore,

Relative frequency = Class frequency/Total frequency

It can be noted that the relative frequency maintains the same pattern which is observed in class frequencies. The total of relative frequencies is 1.
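
As a small sketch, relative frequency can be computed by dividing each class frequency by the total frequency, which gives the proportion of observations in the class. The frequencies below are invented for illustration.

```python
import math

# Hypothetical class frequencies (illustrative only).
frequencies = [4, 9, 12, 5]
total = sum(frequencies)

# Proportion of all observations that fall in each class.
relative = [f / total for f in frequencies]
percentages = [100 * r for r in relative]

for f, r in zip(frequencies, relative):
    print(f"{f}\t{r:.3f}")

# The relative frequencies add up to 1 (up to floating-point rounding).
print(math.isclose(sum(relative), 1.0))
```

Because every frequency is divided by the same total, the largest class by frequency is also the largest by relative frequency, which makes distributions with different totals comparable.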

Guidelines for the Choice of Classes

The number of classes should not be too large or too small. Sturges' rule: If N is the total number of observations to be classified, then according to Sturges' rule, the number of classes is approximately 1 + 3.322 log N, where log is the logarithm to base 10.
As another rule of thumb, the number of classes is approximately √N.
As far as possible, classes should be of uniform width.
As far as possible open end classes should be avoided.
The class width should preferably be 5 or a multiple of 5.
The lower limit of the first class should preferably be a multiple of 5.

For example: The classes may be of the type 0-9, 10-19 ... or 10-20, 20-30... etc.
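
The two rules for the number of classes can be compared numerically. The sketch below assumes the logarithm in Sturges' rule is taken to base 10, which is what the constant 3.322 (approximately 1/log 2) implies. The function names are hypothetical.

```python
import math


def sturges_classes(n):
    """Sturges' rule: approximately 1 + 3.322 * log10(n) classes."""
    return 1 + 3.322 * math.log10(n)


def sqrt_classes(n):
    """Rule of thumb: approximately sqrt(n) classes."""
    return math.sqrt(n)


for n in (100, 5000):
    print(n, round(sturges_classes(n)), round(sqrt_classes(n)))
```

For small data sets the two rules give similar answers, but for large N the square root rule suggests far more classes than Sturges' rule, so either result is best read as a starting point rather than a fixed requirement.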

Graphical Presentation of Statistical Data

Graphs are easy to understand and leave a lasting impression. They take voluminous, dry data and present the facts in an attractive and impressive manner. They facilitate comparison, so conclusions can be drawn quickly. Moreover, patterns present in the data are exhibited more clearly by graphs.

Histogram

It is one of the most widely used graphs for representing a frequency distribution. It is a series of adjacent rectangles erected on the X-axis with the class interval as base; hence the width of each rectangle is equal to the class width. The area of each rectangle is proportional to the class frequency. In the case of the inclusive method of classification, the extended class interval, that is, the interval designated by the class boundaries, is used as the base.

Since the base of the rectangle is class width, there is a slight difference in the procedure of construction of histogram when the classes are of equal width and when those are of unequal width.

Case (i) Classes of equal width:

In this case, the height of the rectangle is proportional to frequency.

Case (ii) Classes of unequal width:

In this case, the height of the rectangle is proportional to frequency density, where

Frequency density = Class frequency/Class width
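
The following sketch computes rectangle heights for classes of unequal width, assuming the usual definition of frequency density as class frequency divided by class width. The classes and frequencies are invented, and the function name `frequency_density` is hypothetical.

```python
import math


def frequency_density(frequency, lower_boundary, upper_boundary):
    """Class frequency divided by class width."""
    return frequency / (upper_boundary - lower_boundary)


# Hypothetical distribution with unequal class widths (illustrative only).
classes = [(0, 10), (10, 20), (20, 40), (40, 80)]
frequencies = [5, 12, 16, 8]

heights = [frequency_density(f, low, high)
           for f, (low, high) in zip(frequencies, classes)]

# Area = height * width recovers the class frequency for every rectangle.
areas = [h * (high - low) for h, (low, high) in zip(heights, classes)]

for (low, high), f, h in zip(classes, frequencies, heights):
    print(f"{low}-{high}\tfrequency {f}\theight {h}")
```

The class 20-40 has the largest frequency but not the tallest rectangle, because its frequency is spread over a wider base. Using frequency itself as the height would overstate the wide classes.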

Note:

A serious drawback of the histogram is that it cannot be drawn for a frequency distribution with an open end class.
In the case of discrete variables, the histogram need not contain adjacent rectangles; they may be separated as in a bar diagram.
Histograms are useful to find the mode, which is discussed in the 5th chapter.
The histogram does not remain the same if the class width is changed.
