pt.1 MEASURES OF CENTRAL TENDENCY
3.1 The Why
Organizing data into tables and graphs can help make a data set more meaningful. These methods, however, do not provide as much information as numerical measures. Descriptive statistics are numerical measures that describe a distribution by providing information on the central tendency of the distribution, the width of the distribution, and the distribution’s shape.
A measure of central tendency characterizes an entire set of data in terms of a single representative number.
Measures of central tendency measure the “middleness” of a distribution of scores in three ways: the mean, median, and mode.
3.2 The What
The most commonly used measure of central tendency is the mean, the average observation in a set of observations. There are two manifestations (yes, I said it!) of the mean: the arithmetic mean and the geometric mean. We are not interested in the geometric mean at this point, but you can look it up if you are interested. We will stick to the arithmetic mean.
Arithmetic mean: called simply the mean or the average when the context is clear, it is the sum of a collection of numbers divided by the count of numbers in the collection.
We can calculate the mean for our distribution of exam scores (from the previous lesson) by adding all of the scores together and dividing by the total number of scores. Mathematically, this would be:
$$\mu = \frac{\sum X}{N}$$
where
$\mu$ (mu) represents the mean,
$\sum$ represents “the sum of”,
X represents the individual scores, and
N represents the number of scores in the distribution
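To calculate the mean, then, we sum all of the Xs, or scores, and divide by the total number of scores in the distribution (N). You may have also seen this formula represented as follows:
$$\bar{X} = \frac{\sum X}{N}$$
Here $\bar{X}$ (“X-bar”) is another common symbol for the mean, typically the mean of a sample.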
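As a minimal sketch in R, the calculation described above can be done by hand and checked against the built-in mean() function. The vector scores is made-up illustration data, not the exam scores from the previous lesson.

```r
# Hypothetical scores, invented for illustration only
scores <- c(56, 62, 70, 74, 78, 80, 85, 91)

n <- length(scores)              # N: the number of scores
manual_mean <- sum(scores) / n   # the sum of all X divided by N

manual_mean                      # 74.5
mean(scores)                     # the built-in function gives the same value
```

Doing it by hand once is mostly a way to see that mean() is nothing more than "add everything up and divide by how many there are".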
One of the main shortcomings of the mean is that it is influenced by extreme scores (what we sometimes refer to as outliers).
An outlier is an observation point that is distant from other observations.
Outliers can distort the results by giving an inaccurate representation of the distribution of the population.
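To see how much a single outlier can move the mean, here is a small sketch in R. The incomes are invented for illustration (they are not taken from any table in this lesson), and the extreme value of 500,000 is an assumption chosen only to make the effect obvious.

```r
# Hypothetical monthly incomes in Kshs., invented for illustration only
incomes <- c(20000, 22000, 25000, 27000, 30000)
with_outlier <- c(incomes, 500000)   # add one extreme value

mean_without <- mean(incomes)        # 24800
mean_with <- mean(with_outlier)      # 104000

mean_without
mean_with
```

With the outlier included, the mean is larger than every one of the five typical incomes, so it no longer describes a "typical" value at all.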
Another measure of central tendency, the median, is used in situations in which the mean might not be representative of a distribution. Let’s use a different distribution of scores to demonstrate when it might be appropriate to use the median rather than the mean. Imagine that you are considering taking a job with a small computer company. When you interview for the position, the owner of the company informs you that the mean income for employees at the company is approximately Kshs. 100,000 and that the company has 25 employees. Most people would view this as good news. Having learned in a statistics class that the mean might be influenced by extreme scores, you ask to see the distribution of 25 incomes. The distribution is shown below.
Table 3.1 Employee Salaries Distribution
The mean for this distribution is Kshs. 99,920. Notice that, as claimed, the mean income of company employees is very close to Kshs. 100,000. Notice also, however, that the mean in this case is not very representative of central tendency, or “middleness.” In this distribution, the mean is thrown off center or inflated by one very extreme score of Kshs. 1,800,000 (the income of the company’s owner, needless to say). This extremely high income pulls the mean toward it and thus increases or inflates the mean. Thus, in distributions with one or a few extreme scores (either high or low), the mean will not be a good indicator of central tendency. In such cases, a better measure of central tendency is the median.
The median is the middle score in a distribution after the scores have been arranged from highest to lowest or lowest to highest.
The distribution of incomes in Table 3.1 is already ordered from lowest to highest. To determine the median, we simply have to find the middle score. In this situation, with 25 scores, that would be the 13th score. You can see that the median of the distribution would be an income of Kshs. 27,000, which is far more representative of the central tendency for this distribution of incomes.
Why is the median not as influenced as the mean by extreme scores? Think about the calculation of each of these measures. When calculating the mean, we must add in the atypical income of Kshs. 1,800,000, thus distorting the calculation. When determining the median, however, we do not consider the size of the Kshs. 1,800,000 income; it is only a score at one end of the distribution whose numerical value does not have to be considered in order to locate the middle score in the distribution. The point to remember is that the median is not affected by extreme scores in a distribution because it is only a positional value. The mean is affected because its value is determined by a calculation that has to include the extreme value.
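The positional nature of the median can be illustrated with a short sketch in R. The seven incomes below are hypothetical (they are not the 25 incomes of Table 3.1); only the extreme value of Kshs. 1,800,000 is borrowed from the example in the text.

```r
# Hypothetical incomes in Kshs., already sorted; not the data of Table 3.1
incomes <- c(20000, 22000, 25000, 27000, 30000, 32000, 1800000)

median(incomes)   # 27000, the 4th of the 7 ordered values
mean(incomes)     # pulled upward by the extreme value

# Make the extreme value ten times larger
bigger <- replace(incomes, length(incomes), 18000000)

median(bigger)    # still 27000: the middle position has not changed
mean(bigger)      # much larger than before
```

However large the last income becomes, it still occupies the last position, so the middle score stays where it was.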
In distributions with an even number of observations, the median is calculated by averaging the two middle scores. In other words, we determine the middle point between the two middle scores.
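A minimal sketch of the even-numbered case in R, using four made-up scores. The variable names are illustrative only.

```r
# Hypothetical scores with an even number of observations
scores <- c(3, 9, 5, 12)

sorted <- sort(scores)                      # 3 5 9 12
n <- length(sorted)
middle_two <- sorted[c(n / 2, n / 2 + 1)]   # the two middle scores: 5 and 9
manual_median <- mean(middle_two)           # 7

manual_median
median(scores)                              # the built-in function agrees: 7
```

Note that the resulting median, 7, is not one of the observed scores; with an even number of observations that is perfectly normal.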
The third measure of central tendency is the mode—the score in a distribution that occurs with the greatest frequency. Sometimes, several scores occur with equal frequency. Thus, a distribution may have two modes (bimodal), three modes (trimodal), or even more. The mode is the only indicator of central tendency that can be used with nominal data. Although it can also be used with ordinal, interval, or ratio data, the mean and median are more reliable indicators of the central tendency of a distribution, and the mode is seldom used.
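Base R has no built-in function for the statistical mode, so the sketch below defines a small helper. The name stat_mode is hypothetical, and this is only one possible way to write it; it returns every value tied for the highest frequency, so bimodal and trimodal cases are covered, and it also works with nominal (text) data.

```r
# Hypothetical helper, not part of base R
stat_mode <- function(x) {
  x <- x[!is.na(x)]                              # drop missing values
  counts <- table(x)                             # frequency of each unique value
  modes <- names(counts)[counts == max(counts)]  # value(s) with the top count
  if (is.numeric(x)) as.numeric(modes) else modes
}

stat_mode(c(2, 3, 3, 5, 7))          # one mode: 3
stat_mode(c(1, 1, 2, 2, 3))          # bimodal: 1 and 2
stat_mode(c("red", "blue", "red"))   # nominal data: "red"
```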
3.3 The How
mydata <- read.table("testdata.txt", header = TRUE) # import your dataset; header = TRUE assumes the first row holds the variable names
# attach(mydata) # optional: lets you call variables directly, e.g. mean(INCOME)
names(mydata) # this shows us all the variable names
mean(mydata$INCOME) # find the mean income
# mean(mydata$INCOME, na.rm = TRUE) # remove NA values before computation
median(mydata$INCOME, na.rm = TRUE) # returns the middle observation
mode(mydata$INCOME) # does something weird!
# the function mode() in R returns the storage mode (variable type), e.g. "numeric"
temp <- table(mydata$INCOME) # "temp" is a frequency table:
# its names are the sorted unique values in INCOME
# and its entries count how many times each value occurs
names(temp)[temp == max(temp)]
# this returns the value(s) with the highest count, as character strings
# (wrap the result in as.numeric() if you need numbers)
# This happens to be the mode!
# R knows you will want to see the measures of central tendency
summary(mydata$INCOME) # so it supplies the mean and median (though not the mode) in one command
We will discuss the “Min.”, “1st Qu.”, “3rd Qu.” and “Max.” values in the next class. These relate to measures of dispersion. We will also look at range, standard deviation, percentiles, skewness and other similar animals. Until next time.