In order to better understand the nature of probabilistic decisions, consider the following court case of The People v. Collins, 1968. In this case, the robbery victim was unable to identify his assailant. All that the victim could recall was that the assailant was female with a blonde ponytail. In addition, he remembered that she fled the scene in a yellow convertible that was driven by an African American male who had a full beard. The suspect in the case fit the description given by the victim, so the question was “Could the jury be sure, beyond a reasonable doubt, that the woman on trial was the robber?” The evidence against her was as follows: She was blonde and often wore her hair in a ponytail; her codefendant friend was an African American male with a moustache, beard, and a yellow convertible. The attorney for the defense stressed the fact that the victim could not identify this woman as the woman who robbed him, and that therefore there should be reasonable doubt on the part of the jury.

The prosecutor, on the other hand, called an expert in probability theory who testified to the following: The probability of all of the above conditions (being blonde and often having a ponytail and having an African American male friend and his having a full beard, and his owning a yellow convertible) co-occurring, assuming these characteristics are independent, was 1 in 12 million. The expert further testified that the combination of characteristics was so unusual that the jury could in fact be certain “beyond a reasonable doubt” that the woman was the robber. The jury returned a verdict of “guilty” (Arkes & Hammond, 1986; Halpern, 1996).

As can be seen in the previous example, the legal system operates on probability and recognizes that we can never be absolutely certain when deciding whether an individual is guilty. Thus, the standard of “beyond a reasonable doubt” was established and jurors base their decisions on probability, whether they realize it or not. Most decisions that we make on a daily basis are, in fact, based on probabilities. Diagnoses made by doctors, verdicts produced by juries, decisions made by business executives regarding expansion and what products to carry, decisions regarding whether individuals are admitted to colleges, and most everyday decisions all involve using probability. In addition, all games of chance (for example, cards, horse racing, the stock market) involve probability.

If you think about it, there is very little in life that is certain. Therefore, most of our decisions are probabilistic and having a better understanding of probability will help you with those decisions. In addition, because probability also plays an important role in science, that is another important reason for us to have an understanding of it.


Probability is a measure of chance, and we shall propose general rules for calculating the probability of combinations of simple events.

When all outcomes are equally likely, the probability of an event is the number of ways the event can occur divided by the total number of possible outcomes.

The tossing of a coin is a simple example of a large class of games of chance with certain common features. Each game is decided on the results or outcomes of one or more trials, where a trial might be rolling a die, tossing a coin, or drawing a card from a pack. If the outcomes are distinguishable, we say they are mutually exclusive, and if they are the only possible results they are also said to be exhaustive. There may be more than one way of listing the outcomes. If we draw a card from the pack, the outcomes red, black are mutually exclusive and exhaustive, but so are the outcomes Spades, Hearts, Diamonds, and Clubs. The trials are also said to be independent if the result of one trial does not depend on the outcome of any previous trial, or any combination of previous trials.


Probabilities are often presented or expressed as proportions. Proportions vary between 0.0 and 1.0, where a probability of 0.0 means the event certainly will not occur and a probability of 1.0 means that the event is certain to occur. Thus, any probability between 0.0 and 1.0 represents an event with some degree of uncertainty to it. How much uncertainty depends on the exact probability with which we are dealing. For example, a probability close to 0.0 represents an event that is almost certain not to occur, and a probability close to 1.0 represents an event that is almost certain to occur. On the other hand, a probability of .50 represents maximum uncertainty.

Let’s start with a simple example of probability. What is the probability of getting a “head” when tossing a coin? In this example, we have to consider how many ways there are to get a “head” on a coin toss (there is only one way, the coin lands heads up) and how many possible outcomes there are (there are two possible outcomes, either a “head” or a “tail”). So, the probability of a “head” in a coin toss is:

$$\Pr(\text{head}) = \frac{\text{number of ways to get a head}}{\text{number of possible outcomes}} = \frac{1}{2} = 0.5$$
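The count above (one favourable outcome out of two) gives a probability of one half, and this can be checked empirically. The following is only a sketch: the seed and the number of tosses are arbitrary choices, and `sample()` is used here to stand in for a fair coin.

```r
# Simulate tosses of a fair coin (seed and sample size are arbitrary choices)
set.seed(1)
tosses <- sample(c("head", "tail"), size = 10000, replace = TRUE)

# Proportion of heads: should be close to the theoretical value 1/2
prop_heads <- mean(tosses == "head")
prop_heads
```

With more tosses the proportion tends to settle closer to 0.5, although any single run will differ slightly.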


Set Theory: A set is a collection of items or events. The items within a set are generally referred to as elements. A set can be an element of another set.

The Universal Set is the set of all possible elements. In probability, the universal set is the set of all possible outcomes of a trial (experiment).

Sample space: In order to avoid continually referring to particular games or experiments, it is useful to employ an abstract representation for a trial and its outcomes. Each distinguishable and indecomposable outcome, or simple event, is regarded as a point in a sample space, S. Thus, for the experiment of drawing a card from a pack the sample space contains 52 points. Every collection of simple events or set of points of S is called an event. The Sample space is an example of a universal set.

Intersection: The intersection of two sets A, B is the set of points of S which belong to both A and B and is an event. Thus the intersection of the sets {HH, TH, HT} and {HT, TT} is the set containing the single point HT. This event may be called ‘heads on the first coin and tails on the second coin’. It may happen that the two sets have no points in common, that is, their intersection is the empty set. Simply, an intersection of two sets is a set containing elements common to both sets.

Union: This is defined as the set which contains all the points of S which are in either A or B (or both). Thus, the union of the events {HH, TH, HT} and {HT, TT}, in the present example, is the event {HH, TH, HT, TT}, which contains every point in the sample space and may reasonably be called ‘the certain event’. In other words, a union of sets is a set that contains all unique elements of the sets.
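The two-coin example can be reproduced with R's built-in set functions `intersect()` and `union()`. The vector names `A` and `B` are chosen here only for illustration.

```r
# The two events from the two-coin example, stored as character vectors
A <- c("HH", "TH", "HT")
B <- c("HT", "TT")

intersect(A, B)  # elements common to both sets
union(A, B)      # all unique elements of the two sets
```

The intersection holds the single point HT, and the union holds all four points of the sample space.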


Figure 5.1: A Venn Diagram representing the intersection of two sets.


The Venn diagram is a simple graphical tool used to represent set theory computations. In a Venn diagram, the sample space is represented by a rectangle and any event by a circle in this rectangle.


An ice-cream firm, before launching three new flavors, conducts a tasting with the assistance of 60 schoolboys. The findings were summarized as:

32 liked A

24 liked B

31 liked C

10 liked A and B

11 liked A and C

14 liked B and C

6 liked A and B and C.

Since there are only three flavors, A, B, and C, to consider, the information provided can easily be grasped through a diagram. Can you draw a Venn diagram to represent this relationship?
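As a check on your diagram, the count in each region can be worked out from the figures above. This sketch assumes that "liked A and B" includes the boys who liked all three flavors (the usual reading of such summaries); the variable names are made up for illustration.

```r
# Survey figures from the text
total <- 60
A <- 32; B <- 24; C <- 31
AB <- 10; AC <- 11; BC <- 14
ABC <- 6

# Regions of the Venn diagram (assumes pairwise counts include the triple overlap)
only_AB <- AB - ABC                     # liked A and B but not C
only_AC <- AC - ABC                     # liked A and C but not B
only_BC <- BC - ABC                     # liked B and C but not A
only_A  <- A - only_AB - only_AC - ABC  # liked A alone
only_B  <- B - only_AB - only_BC - ABC  # liked B alone
only_C  <- C - only_AC - only_BC - ABC  # liked C alone

at_least_one <- only_A + only_B + only_C + only_AB + only_AC + only_BC + ABC
none <- total - at_least_one            # liked none of the flavors

c(only_A = only_A, only_B = only_B, only_C = only_C, none = none)
```

Under that assumption, 58 of the 60 boys liked at least one flavor and 2 liked none.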


For every event, E, in the sample space S we assign a non-negative number, called the probability of E and denoted by Pr(E), so that axioms (a) to (c) below are satisfied. Rule (d) then gives the probability that two independent events both occur.

(a) For every event E, Pr(E) ≥ 0 (non-negativity)

(b) For the certain event, Pr(S) = 1 (the probabilities of all outcomes sum to 1)

(c) If E1, E2 are mutually exclusive events, Pr(E1 ∪ E2) = Pr(E1) + Pr(E2) (additivity)

(d) If E1, E2 are independent events, Pr(E1 ∩ E2) = Pr(E1) × Pr(E2) (multiplication rule)
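Rules (c) and (d) can be checked on the card example used earlier. The sketch below builds a 52-card deck as a data frame and treats every card as equally likely; the helper `pr()` and the column names are assumptions made for this illustration.

```r
# Build a standard 52-card deck: one row per card
deck <- expand.grid(
  rank = c("A", 2:10, "J", "Q", "K"),
  suit = c("Spades", "Hearts", "Diamonds", "Clubs"),
  stringsAsFactors = FALSE
)

# Probability of an event = share of the equally likely cards that belong to it
pr <- function(event) mean(event)

heart <- deck$suit == "Hearts"
spade <- deck$suit == "Spades"
king  <- deck$rank == "K"

# (c) Additivity: hearts and spades are mutually exclusive
pr(heart | spade)
pr(heart) + pr(spade)

# (d) Multiplication rule: rank and suit are independent in a single draw
pr(king & heart)
pr(king) * pr(heart)
```

In each pair the two lines give the same value, as the rules require.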

Conditional Probability

Conditional probability measures the probability that an event will occur “given that” another event has occurred. For two events, A and B, the conditional probability of B given that A has occurred is denoted as Pr(B | A). Provided Pr(A) > 0, it is calculated as:

$$\Pr(B \mid A) = \frac{\Pr(A \cap B)}{\Pr(A)}$$
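As a small worked case, take the ice-cream tasting figures from earlier in this lesson: 32 of the 60 boys liked flavor A, and 10 liked both A and B. The sketch below computes the probability that a boy liked B given that he liked A, as the probability of both events divided by the probability of A; the variable names are made up for illustration.

```r
n <- 60                  # boys in the tasting
pr_A <- 32 / n           # Pr(liked A)
pr_A_and_B <- 10 / n     # Pr(liked A and B)

# Conditional probability of liking B given that A was liked
pr_B_given_A <- pr_A_and_B / pr_A
pr_B_given_A
```

The result equals 10/32: among the 32 boys who liked A, 10 also liked B.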

Bayes’ Theorem

Also called Bayes’ rule or Bayes’ law, it describes how to update our belief about an event in light of the evidence we have observed. Sounds vague and mysterious? Not to worry, this will become clearer when we start making deductions based on the probability estimates. For now what you need to remember is that Bayes’ law tells us something about the future based on what we have observed in the past.

The formula:

$$\Pr(A \mid B) = \frac{\Pr(B \mid A)\,\Pr(A)}{\Pr(B)}$$

Can you try and derive this formula using the formula for conditional probability?
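If you want to check your attempt, one possible derivation is sketched below. It assumes Pr(A) > 0 and Pr(B) > 0 and uses only the definition of conditional probability, applied once in each direction.

```latex
\begin{align*}
\Pr(A \mid B) &= \frac{\Pr(A \cap B)}{\Pr(B)}
  && \text{definition, conditioning on } B \\
\Pr(B \mid A) &= \frac{\Pr(A \cap B)}{\Pr(A)}
  \;\Longrightarrow\; \Pr(A \cap B) = \Pr(B \mid A)\,\Pr(A)
  && \text{definition, conditioning on } A \\
\Pr(A \mid B) &= \frac{\Pr(B \mid A)\,\Pr(A)}{\Pr(B)}
  && \text{substitute into the first line}
\end{align*}
```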

In the next lesson we look at random variables and probability distributions.



Normal Distributions

When a distribution of scores is very large, it tends to approximate a pattern called a normal distribution. When plotted as a frequency polygon, a normal distribution forms a symmetrical, bell-shaped pattern often called a normal curve (see Figure 5.1). We say that the pattern approximates a normal distribution because a true normal distribution is a theoretical construct not actually observed in the real world.

The normal distribution is a theoretical frequency distribution that has certain special characteristics. First, it is bell-shaped and symmetrical—the right half is a mirror image of the left half. Second, the mean, median, and mode are equal and are located at the center of the distribution. Third, the normal distribution is unimodal—it has only one mode. Fourth, most of the observations are clustered around the center of the distribution, with far fewer observations at the ends, or “tails,” of the distribution. Lastly, when standard deviations are used on the x-axis, the percentage of scores falling between the mean and any point on the x-axis is the same for all normal curves. We will discuss the normal distribution more extensively in later lessons.
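The last property can be illustrated with R's `pnorm()` function, which gives the cumulative probability of a normal distribution. The sketch below measures the share of scores between the mean and one standard deviation above it for three normal curves; the means and standard deviations are arbitrary choices.

```r
# Share of scores between the mean and one standard deviation above the mean
between_mean_and_1sd <- function(mu, sigma) {
  pnorm(mu + sigma, mean = mu, sd = sigma) - pnorm(mu, mean = mu, sd = sigma)
}

between_mean_and_1sd(0, 1)     # standard normal curve
between_mean_and_1sd(100, 15)  # arbitrary mean and standard deviation
between_mean_and_1sd(50, 3)    # another arbitrary curve
```

All three calls return the same value, roughly 0.34, whatever the mean and standard deviation.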

Figure 5.1 Normal Curve


Although we typically think of the normal distribution as being similar to the curve depicted in Figure 5.1, there are variations in the shape of normal distributions. Kurtosis refers to how flat or peaked a normal distribution is. In other words, kurtosis refers to the degree of dispersion among the scores, or whether the distribution is tall and skinny or short and fat. The normal distribution depicted in Figure 5.1 is called mesokurtic—meso means “middle.” Mesokurtic curves have peaks of medium height and the distributions are moderate in breadth. Now look at the two distributions depicted in Figure 5.2.

The normal distribution on the left is leptokurtic—lepto means “thin.” Leptokurtic curves are tall and thin, with only a few scores in the middle of the distribution having a high frequency. Last, see the curve on the right side of Figure 5.2. This is a platykurtic curve—platy means “broad” or “flat.” Platykurtic curves are short and more dispersed (broader). In a platykurtic curve, there are many scores around the middle score that all have a similar frequency.

Figure 5.2 Kurtosis

Positively Skewed Distributions

Most distributions do not approximate a normal or bell-shaped curve. Instead, they are skewed, or lopsided. In a skewed distribution, scores tend to cluster at one end or the other of the x-axis, with the tail of the distribution extending in the opposite direction. In a positively skewed distribution, the peak is to the left of the center point and the tail extends toward the right, or in the positive direction. (See Figure 5.3.)

Notice that what is skewing the distribution, or throwing it off center, are the scores toward the right or positive direction. A few individuals have extremely high scores that pull the distribution in that direction. Notice also what this does to the mean, median, and mode. These three measures do not have the same value, nor are they all located at the center of the distribution as they are in a normal distribution. The mode—the score with the highest frequency—is the high point on the distribution. The median divides the distribution in half. The mean is pulled in the direction of the tail of the distribution; that is, the few extreme scores pull the mean toward them and inflate it.

Negatively Skewed Distributions

The opposite of a positively skewed distribution is a negatively skewed distribution—a distribution in which the peak is to the right of the center point and the tail extends toward the left, or in the negative direction. The term negative refers to the direction of the skew. As can be seen in Figure 5.3, in a negatively skewed distribution, the mean is pulled toward the left by the few extremely low scores in the distribution. As in all distributions, the median divides the distribution in half, and the mode is the most frequently occurring score in the distribution.
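The pull of the tail on the mean can be seen with two small, made-up sets of scores, one with a long right tail and its mirror image with a long left tail. The data below are hypothetical and chosen only to make the arithmetic easy to follow.

```r
# Hypothetical positively skewed scores: most are low, one is very high
pos <- c(1, 1, 1, 2, 2, 3, 4, 10)
mean(pos)    # pulled toward the high score
median(pos)

# Mirror image: most scores are high, one is very low
neg <- 11 - pos
mean(neg)    # pulled toward the low score
median(neg)
```

In the first set the mean lies above the median, and in the second it lies below it.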

Figure 5.3 Skewness

Central Moment

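The kth central moment (or moment about the mean) of a data population is:

$$\mu_k = \frac{1}{N}\sum_{i=1}^{N}(x_i - \mu)^k$$

Similarly, the kth central moment of a data sample is:

$$m_k = \frac{1}{n}\sum_{i=1}^{n}(x_i - \bar{x})^k$$

In particular, the second central moment of a population is its variance.

The skewness of a data population is defined by the following formula, where μ2 and μ3 are the second and third central moments.

$$\gamma_1 = \frac{\mu_3}{\mu_2^{3/2}}$$

The kurtosis of a univariate population is defined by the following formula, where μ2 and μ4 are the second and fourth central moments.

$$\gamma_2 = \frac{\mu_4}{\mu_2^{2}} - 3$$

Subtracting 3 sets the kurtosis of a normal distribution to zero; some texts define kurtosis without this term.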
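A minimal sketch of these quantities in R follows. It assumes the usual definitions: the kth central moment is the average of the kth powers of the deviations from the mean, skewness is the third central moment divided by the second raised to the power 3/2, and kurtosis is the fourth central moment divided by the square of the second, minus 3. The function names are chosen for this illustration and are not part of base R.

```r
# kth central moment: average of the kth powers of deviations from the mean
central_moment <- function(x, k) mean((x - mean(x))^k)

# Skewness and (excess) kurtosis built from the central moments
skewness <- function(x) central_moment(x, 3) / central_moment(x, 2)^(3 / 2)
kurtosis <- function(x) central_moment(x, 4) / central_moment(x, 2)^2 - 3

x <- c(1, 2, 3, 4, 5)   # a small symmetric data set
central_moment(x, 2)    # variance, dividing by N
skewness(x)             # zero for symmetric data
kurtosis(x)
```

Because these functions divide by N rather than N - 1, `central_moment(x, 2)` will not match R's `var(x)`, which uses N - 1.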

Sampling Techniques

A good reading assignment on Sampling Techniques


There are many different methods through which sampling can be done. Simple Random Sampling is considered to be the ideal sampling method for research. However, a shortage of time and money creates the need to opt for other means of sampling.

Probability Methods:

This group of methods is preferred for sampling because it opens up opportunities for the most powerful statistical analysis.

The different probability methods are:

  • Simple Random Sampling: It works best when the whole population is available.
  • Stratified Sampling: This kind of sampling works best when there are specific subgroups to be investigated and the researcher carries out random sampling within each subgroup.
  • Systematic Sampling: This method is workable when a stream of representative people is available.
  • Cluster Sampling: It is largely workable when the population groups are separated and the access…
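The first three methods in the list can be sketched in R with the base function `sample()`. Everything about the population below (its size, the `group` column, the sample sizes) is hypothetical and serves only to show the mechanics.

```r
set.seed(42)  # arbitrary seed so the draws can be repeated

# Hypothetical population of 100 people in two subgroups
population <- data.frame(
  id = 1:100,
  group = rep(c("urban", "rural"), times = c(60, 40))
)

# Simple random sampling: every person has the same chance of selection
srs <- population[sample(nrow(population), size = 10), ]

# Stratified sampling: a random sample of 5 within each subgroup
strata <- split(population, population$group)
stratified <- do.call(
  rbind,
  lapply(strata, function(g) g[sample(nrow(g), size = 5), ])
)

# Systematic sampling: a random start, then every 10th person
start <- sample(10, size = 1)
systematic <- population[seq(from = start, to = nrow(population), by = 10), ]
```

Each approach returns ten people here, but only the stratified sample guarantees that both subgroups are represented equally.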




I find statistics rather mischievous. Like lady wisdom in the Book of Proverbs, one endeavours to obtain her riches but the more one obtains them, the more one becomes detached from the realities of everyday living. Statisticians use complex (sometimes even boring) concepts to explain everyday phenomena. This practise is fine if the audience is fellow detached statisticians, but when the audience is laymen, this elitism loses its logic.

The aim of this course is not to teach you to be a statistician (of those we have plenty). The objective is that you become a questioner of human behaviour: that you look at phenomena and ask how, when, and why they are happening. The course goes ahead to introduce a few tools that will help you to answer those questions.


This course is open to all with a high-school level mathematics and statistics knowledge. I have always found that it is not really the level of exposure that determines success in an endeavour, but the will to do. So the main prerequisite for this course is the will to learn and the desire to ask questions of your society. This course is not for those who want to flaunt their statistical wizardry.


The course shall be delivered through a series of lessons that will be published every Monday, Wednesday and Friday on this blog. We will systematically work through the topics covered in a typical undergraduate introductory statistics class. We shall also use the R language as a tool for statistical computations. THIS IS NOT AN R CLASS! We shall simply present R as a tool for carrying out statistical analyses. Detailed R tutorials can be downloaded here for those who are so inclined.


What’s the point of all this? Well, as mentioned before, I hope that you will be able to ask probing questions of phenomena that you observe. In statistics we call that exploratory statistics. Ultimately, we ask these questions so that we are better able to deal with these phenomena in the future, be better prepared, take advantage, etc. This is known as predictive analytics.

This is an introductory course. The content is designed in a way that the statistical concepts are explained very simply. Nevertheless, where concepts may not be very clear, examples are given to illustrate the same. It is the tutor’s hope that this course will whet the students’ appetite for more and we can consequently get into intermediate and even advanced statistical concepts.

Ready to go? Start here.


A measure of central tendency provides information about the “middleness” of a distribution of scores, but not about the width or spread of the distribution. To assess the width of a distribution, we need a measure of variability or dispersion. A measure of variation indicates how scores are dispersed around the mean of the distribution.


The simplest measure of variation is the range—the difference between the lowest and the highest score in a distribution. To find the range, simply subtract the lowest score from the highest score.

Table 4.1 Two-Class Score Comparisons

Class 1      Class 2
0            45
50           50
100          55
∑ = 150      ∑ = 150
µ = 50       µ = 50

In the example above the range for Class 1 is 100 points, whereas the range for Class 2 is 10 points. Thus, the range provides some information concerning the difference in the spread of the distributions. In this simple measure of variation, however, only the highest and lowest scores enter the calculation, and all other scores are ignored.
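The two ranges can be reproduced in R from the scores in Table 4.1. Base R's `range()` returns the lowest and highest values, so `diff(range(x))` gives the range as defined here; the vector names are chosen for illustration.

```r
class1 <- c(0, 50, 100)   # scores for Class 1 in Table 4.1
class2 <- c(45, 50, 55)   # scores for Class 2 in Table 4.1

range(class1)             # lowest and highest score
diff(range(class1))       # highest minus lowest: 100
diff(range(class2))       # highest minus lowest: 10
```

Both classes have the same mean of 50, yet their ranges differ by a factor of ten.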


The nth percentile of an observation variable is the value that cuts off the first n percent of the data values when it is sorted in ascending order.

quantile(INCOME, c(.32, .57, .98)) #finds the 32nd, 57th and 98th percentiles


There are several quartiles of an observation variable. The first quartile, or lower quartile, is the value that cuts off the first 25% of the data when it is sorted in ascending order. The second quartile, or median, is the value that cuts off the first 50%. The third quartile, or upper quartile, is the value that cuts off the first 75%.

quantile(INCOME) #with no second argument, gives the minimum, the first, second and third quartiles, and the maximum

The inter-quartile range is the difference between the third quartile and the first quartile.
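In R the inter-quartile range can be found either from the quartiles or directly with the built-in `IQR()` function. The incomes below are hypothetical (in thousands) and stand in for the INCOME variable used elsewhere in this lesson.

```r
# Hypothetical incomes, in thousands
income <- c(15, 20, 22, 25, 25, 27, 30, 35, 40)

# First and third quartiles, then their difference
q <- quantile(income, c(0.25, 0.75))
iqr_by_hand <- unname(q[2] - q[1])
iqr_by_hand

# Same result in one call
IQR(income)
```

Note that R offers several ways of interpolating quantiles, so other software may report slightly different quartiles for the same data.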

Average Deviation and Standard Deviation

More sophisticated measures of variation use all of the scores in the distribution in their calculation. The most commonly used measure of variation is the standard deviation. Most people have heard this term before and may even have calculated a standard deviation if they have taken a statistics class. However, many people who know how to calculate a standard deviation do not really appreciate the information it provides.

To begin, let’s think about what the phrase standard deviation means. Other words that might be substituted for the word standard include average, normal, or usual. The word deviation means to diverge, move away from, or digress. Putting these terms together, we see that the standard deviation means the average movement away from something. But what? It is the average movement away from the center of the distribution—the mean.

The standard deviation, then, is the average distance of all of the scores in the distribution from the mean or central point of the distribution—or, as we shall see shortly, the square root of the average squared deviation from the mean. Think about how you would calculate the average distance of all of the scores from the mean of the distribution. First, you would have to determine how far each score is from the mean; this is the deviation, or difference, score. Then, you would have to average these scores. This is the basic idea behind calculating the standard deviation.

Let’s use these data to calculate the average distance from the mean. We will begin with a calculation that is slightly simpler than the standard deviation, known as the average deviation. The average deviation is essentially what the name implies: the average distance of all of the scores from the mean of the distribution. First, we find the deviation score for each observation by subtracting the mean from the score:

$$X - \mu$$

Then we need to sum the deviation scores. Notice, however, that if we were to sum these scores, they would add to zero. Therefore, we first take the absolute value of the deviation scores (the distance from the mean, irrespective of direction). To calculate the average deviation, we sum the absolute value of each deviation score:

$$\sum |X - \mu|$$

Then we divide by the total number of scores (N) to find the average deviation:

$$AD = \frac{\sum |X - \mu|}{N}$$
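The steps just described can be sketched in R using the scores from Table 4.1. The function name `average_deviation` is made up for this illustration; base R has no function of that name.

```r
class1 <- c(0, 50, 100)   # scores for Class 1 in Table 4.1
class2 <- c(45, 50, 55)   # scores for Class 2 in Table 4.1

# Deviation scores sum to zero, which is why absolute values are needed
sum(class1 - mean(class1))

# Average deviation: mean of the absolute deviation scores
average_deviation <- function(x) mean(abs(x - mean(x)))

average_deviation(class1)  # about 33.33
average_deviation(class2)  # about 3.33
```

The much larger value for Class 1 reflects how widely its scores are spread around the shared mean of 50.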

Although the average deviation is fairly easy to compute, it is not as useful as the standard deviation because, as we will see in later modules, the standard deviation is used in many other statistical procedures.

The standard deviation is very similar to the average deviation. The only difference is that rather than taking the absolute value of the deviation scores, we use another method to “get rid of” the negative deviation scores—we square the deviation scores.

The formula for the standard deviation is:

$$\sigma = \sqrt{\frac{\sum (X - \mu)^2}{N}}$$

Notice that the formula is similar to that for the average deviation. We determine the deviation scores, square the deviation scores, sum the squared deviation scores, and divide by the number of scores in the distribution. Lastly, we take the square root of that number. Why? Squaring the deviation scores has inflated them. We now need to bring the squared deviation scores back to the same level of measurement as the mean so that the standard deviation is measured on the same scale as the mean.

If, however, you are using sample data to estimate the population standard deviation, then the standard deviation formula must be slightly modified. The modification provides what is called an “unbiased estimator” of the population standard deviation based on sample data. The modified formula is:

$$s = \sqrt{\frac{\sum (X - \bar{X})^2}{N - 1}}$$

where

s = unbiased estimator of population standard deviation

X = each individual score

X̄ = sample mean

N = number of scores in the sample

The main difference is in the denominator—dividing by N – 1 versus N. The reason is that the standard deviation within a small sample may not be representative of the population; that is, there may not be as much variability in the sample as there actually is in the population. We, therefore, divide by N – 1, because dividing by a smaller number increases the standard deviation and thus provides a better estimate of the population standard deviation.
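The effect of the denominator can be seen with the Class 1 scores from Table 4.1. R's built-in `sd()` divides by N - 1, so the version that divides by N is written out by hand below; the name `pop_sd` is an assumption for this sketch.

```r
class1 <- c(0, 50, 100)   # scores for Class 1 in Table 4.1

# Standard deviation dividing by N (treats the scores as the whole population)
pop_sd <- function(x) sqrt(mean((x - mean(x))^2))

pop_sd(class1)   # about 40.82
sd(class1)       # 50, because sd() divides by N - 1
```

With only three scores the gap between the two versions is large; it shrinks as the number of scores grows.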

sd(INCOME) #standard deviation of INCOME (R divides by N - 1)

var(INCOME) #variance of INCOME (R divides by N - 1)

The variance is the square of the standard deviation.



3.1 The Why

Organizing data into tables and graphs can help make a data set more meaningful. These methods, however, do not provide as much information as numerical measures. Descriptive statistics are numerical measures that describe a distribution by providing information on the central tendency of the distribution, the width of the distribution, and the distribution’s shape.

A measure of central tendency characterizes an entire set of data in terms of a single representative number.

Measures of central tendency measure the “middleness” of a distribution of scores in three ways: the mean, median, and mode.

3.2 The What


The most commonly used measure of central tendency is the mean, the average observation in a set of observations. There are two manifestations (yes, I said it!) of the mean: the arithmetic mean and the geometric mean. We are not interested in the geometric mean at this point but you can look at it here if you are interested. We will stick to the arithmetic mean.

Arithmetic mean: called simply the mean or average when the context is clear, it is the sum of a collection of numbers divided by the number of numbers in the collection.

We can calculate the mean for our distribution of exam scores (from the previous lesson) by adding all of the scores together and dividing by the total number of scores. Mathematically, this would be:

$$\mu = \frac{\sum X}{N}$$

where

µ (pronounced “mu”) is the symbol for the population mean,

Σ (sigma) is the symbol for “the sum of,”

X represents the individual scores, and

N represents the number of scores in the distribution.

To calculate the mean, then, we sum all of the Xs, or scores, and divide by the total number of scores in the distribution (N). You may have also seen this formula represented as follows:

$$\bar{X} = \frac{\sum X}{N}$$

In this case X̄ (x-bar) represents a sample mean.

One of the main shortcomings of the mean is that the mean is influenced by extreme scores (what we sometimes refer to as outliers).

An outlier is an observation point that is distant from other observations.

Outliers can distort the results by giving an inaccurate representation of the distribution of the population.



Another measure of central tendency, the median, is used in situations in which the mean might not be representative of a distribution. Let’s use a different distribution of scores to demonstrate when it might be appropriate to use the median rather than the mean. Imagine that you are considering taking a job with a small computer company. When you interview for the position, the owner of the company informs you that the mean income for employees at the company is approximately Kshs. 100,000 and that the company has 25 employees. Most people would view this as good news. Having learned in a statistics class that the mean might be influenced by extreme scores, you ask to see the distribution of 25 incomes. The distribution is shown below.

Table 3.1 Employee Salaries Distribution

Income (Kshs.)    Frequency    Frequency × Income
15,000            1            15,000
20,000            2            40,000
22,000            1            22,000
23,000            2            46,000
25,000            5            125,000
27,000            2            54,000
30,000            3            90,000
32,000            1            32,000
35,000            2            70,000
38,000            1            38,000
39,000            1            39,000
40,000            1            40,000
42,000            1            42,000
45,000            1            45,000
1,800,000         1            1,800,000
Total             N = 25       2,498,000

The mean for this distribution is Kshs. 99,920. Notice that, as claimed, the mean income of company employees is very close to Kshs. 100,000. Notice also, however, that the mean in this case is not very representative of central tendency, or “middleness.” In this distribution, the mean is thrown off center or inflated by one very extreme score of Kshs. 1,800,000 (the income of the company’s owner, needless to say). This extremely high income pulls the mean toward it and thus increases or inflates the mean. Thus, in distributions with one or a few extreme scores (either high or low), the mean will not be a good indicator of central tendency. In such cases, a better measure of central tendency is the median.
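The mean quoted above can be reproduced in R from the frequency table. The vectors below are typed in from Table 3.1; their names are chosen for illustration.

```r
# Incomes and their frequencies, copied from Table 3.1
income <- c(15000, 20000, 22000, 23000, 25000, 27000, 30000, 32000,
            35000, 38000, 39000, 40000, 42000, 45000, 1800000)
freq <- c(1, 2, 1, 2, 5, 2, 3, 1, 2, 1, 1, 1, 1, 1, 1)

sum(freq)                        # number of employees: 25
sum(income * freq)               # total of all incomes: 2,498,000
sum(income * freq) / sum(freq)   # mean income: 99,920

# Equivalent shortcut using base R
weighted.mean(income, freq)
```

Multiplying each income by its frequency before summing is the same as adding up all 25 individual incomes.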

The median is the middle score in a distribution after the scores have been arranged from highest to lowest or lowest to highest.

The distribution of incomes in Table 3.1 is already ordered from lowest to highest. To determine the median, we simply have to find the middle score. In this situation, with 25 scores, that would be the 13th score. You can see that the median of the distribution would be an income of Kshs. 27,000, which is far more representative of the central tendency for this distribution of incomes.

Why is the median not as influenced as the mean by extreme scores? Think about the calculation of each of these measures. When calculating the mean, we must add in the atypical income of Kshs. 1,800,000, thus distorting the calculation. When determining the median, however, we do not consider the size of the Kshs. 1,800,000 income; it is only a score at one end of the distribution whose numerical value does not have to be considered in order to locate the middle score in the distribution. The point to remember is that the median is not affected by extreme scores in a distribution because it is only a positional value. The mean is affected because its value is determined by a calculation that has to include the extreme value.

In distributions with an even number of observations, the median is calculated by averaging the two middle scores. In other words, we determine the middle point between the two middle scores.
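The points made about the median can be checked in R with the salary figures from Table 3.1. The sketch expands the frequency table into the 25 individual salaries with `rep()`; the vector names and the four-value example at the end are assumptions made for illustration.

```r
# Incomes and frequencies from Table 3.1, expanded to 25 individual salaries
income <- c(15000, 20000, 22000, 23000, 25000, 27000, 30000, 32000,
            35000, 38000, 39000, 40000, 42000, 45000, 1800000)
freq <- c(1, 2, 1, 2, 5, 2, 3, 1, 2, 1, 1, 1, 1, 1, 1)
salaries <- rep(income, times = freq)

sort(salaries)[13]   # the 13th of 25 ordered scores: 27000
median(salaries)     # same value

# Dropping the owner's income changes the mean a lot, the median hardly at all
staff <- salaries[salaries < 1800000]
mean(staff)          # about 29083
median(staff)

# Even number of observations: the median is the average of the two middle scores
median(c(1, 3, 5, 7))  # 4
```

Without the owner's income there are 24 salaries, so the median of `staff` is itself an average of the two middle scores.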


The third measure of central tendency is the mode—the score in a distribution that occurs with the greatest frequency. Sometimes, several scores occur with equal frequency. Thus, a distribution may have two modes (bimodal), three modes (trimodal), or even more. The mode is the only indicator of central tendency that can be used with nominal data. Although it can also be used with ordinal, interval, or ratio data, the mean and median are more reliable indicators of the central tendency of a distribution, and the mode is seldom used.

 3.3 The How

R Code

mydata <- read.table("testdata.txt", header = TRUE) #import your dataset; header = TRUE assumes the first row holds the variable names

#attach(mydata) # Uncomment in case you want to work with the variables directly

names(mydata) #This shows us all the variable names

mean(INCOME) #Works only after attach(mydata); otherwise use mydata$INCOME as below


mean(mydata$INCOME) #Find the mean income

#mean(mydata$INCOME, na.rm=TRUE) #Remove NA values before computation

median(INCOME, na.rm=TRUE) #returns the middle observation

mode(INCOME) #does something weird!

# the function mode( ) in R returns the variable type

temp <- table(as.vector(INCOME)) #Frequency table of INCOME

#The first row of "temp" is a sorted list of all unique values in the vector INCOME

#The second row in "temp" counts how many occurrences of each value

names(temp)[temp == max(temp)]

#This returns the names of the values that have the highest count in temp's second row

#This happens to be the mode!

#R knows you will want to see all the measures of central tendency

summary(INCOME) # So it supplies them all in one command

We will discuss the “Min.”, “1st Qu.”, “3rd Qu.” and “Max.” values in the next class. These are measures of dispersion. We will also look at range, standard deviation, percentiles, skewness and other similar animals. Until next time.