In order to better understand the nature of probabilistic decisions, consider the following court case of The People v. Collins, 1968. In this case, the robbery victim was unable to identify his assailant. All that the victim could recall was that the assailant was female with a blonde ponytail. In addition, he remembered that she fled the scene in a yellow convertible that was driven by an African American male who had a full beard. The suspect in the case fit the description given by the victim, so the question was “Could the jury be sure, beyond a reasonable doubt, that the woman on trial was the robber?” The evidence against her was as follows: She was blonde and often wore her hair in a ponytail; her codefendant friend was an African American male with a moustache, beard, and a yellow convertible. The attorney for the defense stressed the fact that the victim could not identify this woman as the woman who robbed him, and that therefore there should be reasonable doubt on the part of the jury.

The prosecutor, on the other hand, called an expert in probability theory who testified to the following: The probability of all of the above conditions co-occurring (being blonde, often wearing a ponytail, having an African American male friend, his having a full beard, and his owning a yellow convertible), when these characteristics are treated as independent, was 1 in 12 million. The expert further testified that the combination of characteristics was so unusual that the jury could in fact be certain “beyond a reasonable doubt” that the woman was the robber. The jury returned a verdict of “guilty” (Arkes & Hammond, 1986; Halpern, 1996).

As can be seen in the previous example, the legal system operates on probability and recognizes that we can never be absolutely certain when deciding whether an individual is guilty. Thus, the standard of “beyond a reasonable doubt” was established and jurors base their decisions on probability, whether they realize it or not. Most decisions that we make on a daily basis are, in fact, based on probabilities. Diagnoses made by doctors, verdicts produced by juries, decisions made by business executives regarding expansion and what products to carry, decisions regarding whether individuals are admitted to colleges, and most everyday decisions all involve using probability. In addition, all games of chance (for example, cards, horse racing, the stock market) involve probability.

If you think about it, there is very little in life that is certain. Therefore, most of our decisions are probabilistic, and a better understanding of probability will help you with those decisions. Probability also plays an important role in science, which is another reason to understand it.


Probability is a measure of chance, and we shall propose general rules for calculating the probability of combinations of simple events.

Probability refers to the number of ways a particular outcome (event) can occur divided by the total number of possible outcomes (events), assuming that all outcomes are equally likely.

The tossing of a coin is a simple example of a large class of games of chance with certain common features. Each game is decided on the results or outcomes of one or more trials, where a trial might be rolling a die, tossing a coin, or drawing a card from a pack. If the outcomes are distinguishable, we say they are mutually exclusive, and if they are the only possible results they are also said to be exhaustive. There may be more than one way of listing the outcomes. If we draw a card from the pack, the outcomes red, black are mutually exclusive and exhaustive, but so are the outcomes Spades, Hearts, Diamonds, and Clubs. The trials are also said to be independent if the result of one trial does not depend on the outcome of any previous trial, or any combination of previous trials.


Probabilities are often presented or expressed as proportions. Proportions vary between 0.0 and 1.0, where a probability of 0.0 means the event certainly will not occur and a probability of 1.0 means that the event is certain to occur. Thus, any probability between 0.0 and 1.0 represents an event with some degree of uncertainty to it. How much uncertainty depends on the exact probability with which we are dealing. For example, a probability close to 0.0 represents an event that is almost certain not to occur, and a probability close to 1.0 represents an event that is almost certain to occur. On the other hand, a probability of .50 represents maximum uncertainty.

Let’s start with a simple example of probability. What is the probability of getting a “head” when tossing a coin? In this example, we have to consider how many ways there are to get a “head” on a coin toss (there is only one way, the coin lands heads up) and how many possible outcomes there are (there are two possible outcomes, either a “head” or a “tail”). So, the probability of a “head” in a coin toss is:

Pr(head) = 1 / 2 = 0.5
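The counting definition can be sketched in R, the tool used later in this course. The helper name classical_prob is hypothetical, and the sketch assumes every outcome in the sample space is equally likely.

```r
# Classical probability: favourable outcomes / all possible outcomes.
# Assumes every outcome in the sample space is equally likely.
classical_prob <- function(event, sample_space) {
  sum(sample_space %in% event) / length(sample_space)
}

coin <- c("head", "tail")
p_head <- classical_prob("head", coin)       # 1 way out of 2

die <- 1:6
p_even <- classical_prob(c(2, 4, 6), die)    # 3 ways out of 6

print(p_head)
print(p_even)
```

The same function works for any finite list of equally likely outcomes, such as the 52 cards in a pack.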


Set Theory: A set is a collection of items or events. The items within a set are generally referred to as elements. A set can be an element of another set.

The Universal Set is the set of all possible elements. In probability, the universal set is the set of all possible outcomes of a trial (experiment).

Sample space: In order to avoid continually referring to particular games or experiments, it is useful to employ an abstract representation for a trial and its outcomes. Each distinguishable and indecomposable outcome, or simple event, is regarded as a point in a sample space, S. Thus, for the experiment of drawing a card from a pack the sample space contains 52 points. Every collection of simple events or set of points of S is called an event. The Sample space is an example of a universal set.

Intersection: The intersection of two sets A, B is the set of points of S which belong to both A and B and is an event. Thus the intersection of the sets {HH, TH, HT} and {HT, TT} is the set containing the single point HT. This event may be called ‘heads on the first coin and tails on the second coin’. It may happen that the two sets have no points in common, that is, their intersection is the empty set. Simply, an intersection of two sets is a set containing elements common to both sets.

Union: This is defined as the set which contains all the points of S which are in either A or B (or both). Thus, the union of the events {HH, TH, HT} and {HT, TT}, in the present example, is the event {HH, TH, HT, TT}, which contains every point in the sample space and may reasonably be called ‘the certain event’. In other words, a union of sets is a set that contains all unique elements of the sets.
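The two-coin example can be reproduced with R's built-in set functions. This is a minimal sketch; the variable names are illustrative.

```r
# Events for two coin tosses, written as sets of outcomes.
S <- c("HH", "HT", "TH", "TT")   # sample space
A <- c("HH", "TH", "HT")         # at least one head
B <- c("HT", "TT")               # tails on the second coin

A_and_B <- intersect(A, B)       # outcomes that are in both A and B
A_or_B  <- union(A, B)           # outcomes that are in A or B (or both)

print(A_and_B)
print(A_or_B)
print(setequal(A_or_B, S))       # TRUE when the union is the certain event
```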


Figure 5.1: A Venn Diagram representing the intersection of two sets.


The Venn diagram is a simple graphical tool used to represent set theory computations. In a Venn diagram, the sample space is represented by a rectangle and any event by a circle in this rectangle.


An ice-cream firm, before launching three new flavors, conducts a tasting with the assistance of 60 schoolboys. The findings were summarized as:

32 liked A

24 liked B

31 liked C

10 liked A and B

11 liked A and C

14 liked B and C

6 liked A and B and C.

Since there are only three flavors, A, B, and C, to consider, the information provided can easily be grasped through a diagram. Can you draw a Venn diagram to represent this relationship?
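One way to check your diagram is to compute the size of each region from the tasting counts. The sketch below uses the inclusion-exclusion rule; the variable names are illustrative and not part of the original survey.

```r
# Counts from the ice-cream tasting (60 schoolboys).
n_total <- 60
n_A <- 32; n_B <- 24; n_C <- 31
n_AB <- 10; n_AC <- 11; n_BC <- 14
n_ABC <- 6

# Inclusion-exclusion: boys who liked at least one flavor.
at_least_one <- n_A + n_B + n_C - n_AB - n_AC - n_BC + n_ABC
liked_none   <- n_total - at_least_one

# Boys who liked exactly one flavor (outer parts of each circle).
only_A <- n_A - n_AB - n_AC + n_ABC
only_B <- n_B - n_AB - n_BC + n_ABC
only_C <- n_C - n_AC - n_BC + n_ABC

print(c(at_least_one = at_least_one, liked_none = liked_none))
print(c(only_A = only_A, only_B = only_B, only_C = only_C))
```

If the seven regions of your Venn diagram add up to the "at least one" count, the diagram is consistent with the survey.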


For every event E in the sample space S, we assign a non-negative number, called the probability of E and denoted by Pr(E), so that the following axioms are satisfied.

(a) For every event E, Pr(E) ≥ 0 (non-negativity)

(b) For the certain event, Pr(S) = 1 (the probabilities of all outcomes sum to 1)

(c) If E1, E2 are mutually exclusive events, Pr(E1 ∪ E2) = Pr(E1) + Pr(E2) (additivity)

(d) If E1, E2 are independent events, Pr(E1 ∩ E2) = Pr(E1) × Pr(E2) (multiplication rule)
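The additivity and multiplication rules can be checked on a pack of 52 cards. This is a sketch under the assumption that every card is equally likely to be drawn; the helper pr and the data frame deck are illustrative names.

```r
# A standard 52-card pack, one row per card.
deck <- expand.grid(rank = c("A", 2:10, "J", "Q", "K"),
                    suit = c("Spades", "Hearts", "Diamonds", "Clubs"),
                    stringsAsFactors = FALSE)

# Probability of an event = proportion of cards that belong to it.
pr <- function(event) mean(event)

hearts <- deck$suit == "Hearts"
spades <- deck$suit == "Spades"
aces   <- deck$rank == "A"

# (c) Additivity: hearts and spades are mutually exclusive.
p_union <- pr(hearts | spades)
p_sum   <- pr(hearts) + pr(spades)

# (d) Multiplication: rank and suit are independent.
p_joint   <- pr(aces & hearts)
p_product <- pr(aces) * pr(hearts)

print(c(p_union, p_sum))
print(c(p_joint, p_product))
```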

Conditional Probability

Conditional probability measures the probability that an event will occur “given that” another event has occurred. For two events, A and B, the conditional probability of B given that A has occurred is denoted as Pr(B | A). It is calculated as:

Pr(B | A) = Pr(A ∩ B) / Pr(A), provided Pr(A) > 0
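As an illustration, consider rolling two fair dice. The sketch below computes the probability that the total is at least 10 given that the first die shows a 6, using the usual definition Pr(B | A) = Pr(A ∩ B) / Pr(A). The dice example is an assumption added for illustration.

```r
# All 36 equally likely outcomes of rolling two dice.
rolls <- expand.grid(first = 1:6, second = 1:6)

A <- rolls$first == 6                      # first die shows a 6
B <- rolls$first + rolls$second >= 10      # total is at least 10

p_B         <- mean(B)                     # unconditional probability of B
p_B_given_A <- mean(A & B) / mean(A)       # Pr(A and B) / Pr(A)

print(p_B)
print(p_B_given_A)
```

Knowing that the first die is a 6 raises the probability of a high total, which is exactly the kind of update conditional probability describes.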

Bayes’ Theorem

Also called Bayes’ rule or Bayes’ law, it relates our updated belief about an event, after seeing new evidence, to the belief we held before seeing that evidence. Sounds vague and mysterious? Not to worry, this will become clearer when we start making deductions based on the probability estimates. For now what you need to remember is that Bayes’ law tells us something about the future based on what we have observed in the past.

The formula:

Pr(A | B) = Pr(B | A) × Pr(A) / Pr(B), provided Pr(B) > 0

Can you try and derive this formula using the formula for conditional probability?
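Before attempting the derivation, it may help to check Bayes’ rule numerically. The sketch below reuses the ice-cream tasting counts from earlier in this lesson (32 boys liked A, 24 liked B, 10 liked both, out of 60) and assumes the usual statement of the rule, Pr(A | B) = Pr(B | A) × Pr(A) / Pr(B). It is a check on one data set, not a proof.

```r
# Probabilities taken from the ice-cream tasting counts.
p_A  <- 32 / 60     # liked flavor A
p_B  <- 24 / 60     # liked flavor B
p_AB <- 10 / 60     # liked both A and B

# Conditional probability of B given A, from its definition.
p_B_given_A <- p_AB / p_A

# Pr(A | B) computed two ways.
p_A_given_B_direct <- p_AB / p_B                   # from the definition
p_A_given_B_bayes  <- p_B_given_A * p_A / p_B      # from Bayes' rule

print(p_A_given_B_direct)
print(p_A_given_B_bayes)
```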

In the next lesson we look at random variables and probability distributions.



Normal Distributions

When a distribution of scores is very large, it tends to approximate a pattern called a normal distribution. When plotted as a frequency polygon, a normal distribution forms a symmetrical, bell-shaped pattern often called a normal curve (see Figure 5.1). We say that the pattern approximates a normal distribution because a true normal distribution is a theoretical construct not actually observed in the real world.

The normal distribution is a theoretical frequency distribution that has certain special characteristics. First, it is bell-shaped and symmetrical—the right half is a mirror image of the left half. Second, the mean, median, and mode are equal and are located at the center of the distribution. Third, the normal distribution is unimodal—it has only one mode. Fourth, most of the observations are clustered around the center of the distribution, with far fewer observations at the ends, or “tails,” of the distribution. Lastly, when standard deviations are used on the x-axis, the percentage of scores falling between the mean and any point on the x-axis is the same for all normal curves. We will discuss the normal distribution more extensively in later lessons.
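The last characteristic can be illustrated with R's pnorm() function, which gives the area under a normal curve to the left of a point. The sketch below compares two normal curves with different means and standard deviations; the helper name prop_mean_to_z and the chosen parameter values are assumptions for illustration.

```r
# Proportion of a normal distribution lying between the mean and
# a point z standard deviations above the mean.
prop_mean_to_z <- function(z, mu, sigma) {
  pnorm(mu + z * sigma, mean = mu, sd = sigma) - 0.5
}

standard_curve <- prop_mean_to_z(1, mu = 0,   sigma = 1)
other_curve    <- prop_mean_to_z(1, mu = 100, sigma = 15)

print(standard_curve)   # roughly 0.34
print(other_curve)      # the same proportion
```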

Figure 5.1 Normal Curve


Although we typically think of the normal distribution as being similar to the curve depicted in Figure 5.1, there are variations in the shape of normal distributions. Kurtosis refers to how flat or peaked a normal distribution is. In other words, kurtosis refers to the degree of dispersion among the scores, or whether the distribution is tall and skinny or short and fat. The normal distribution depicted in Figure 5.1 is called mesokurtic—meso means “middle.” Mesokurtic curves have peaks of medium height and the distributions are moderate in breadth. Now look at the two distributions depicted in Figure 5.2.

The normal distribution on the left is leptokurtic—lepto means “thin.” Leptokurtic curves are tall and thin, with only a few scores in the middle of the distribution having a high frequency. Last, see the curve on the right side of Figure 5.2. This is a platykurtic curve—platy means “broad” or “flat.” Platykurtic curves are short and more dispersed (broader). In a platykurtic curve, there are many scores around the middle score that all have a similar frequency.

Figure 5.2 Kurtosis

Positively Skewed Distributions

Most distributions do not approximate a normal or bell-shaped curve. Instead, they are skewed, or lopsided. In a skewed distribution, scores tend to cluster at one end or the other of the x-axis, with the tail of the distribution extending in the opposite direction. In a positively skewed distribution, the peak is to the left of the center point and the tail extends toward the right, or in the positive direction. (See Figure 5.3.)

Notice that what is skewing the distribution, or throwing it off center, are the scores toward the right or positive direction. A few individuals have extremely high scores that pull the distribution in that direction. Notice also what this does to the mean, median, and mode. These three measures do not have the same value, nor are they all located at the center of the distribution as they are in a normal distribution. The mode—the score with the highest frequency—is the high point on the distribution. The median divides the distribution in half. The mean is pulled in the direction of the tail of the distribution; that is, the few extreme scores pull the mean toward them and inflate it.

Negatively Skewed Distributions

The opposite of a positively skewed distribution is a negatively skewed distribution—a distribution in which the peak is to the right of the center point and the tail extends toward the left, or in the negative direction. The term negative refers to the direction of the skew. As can be seen in Figure 5.3, in a negatively skewed distribution, the mean is pulled toward the left by the few extremely low scores in the distribution. As in all distributions, the median divides the distribution in half, and the mode is the most frequently occurring score in the distribution.

Figure 5.3 Skewness

Central Moment

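The kth central moment (or moment about the mean) of a data population is:

μk = (1/N) × Σ (xi − μ)^k, summed over i = 1, …, N

Similarly, the kth central moment of a data sample is:

mk = (1/n) × Σ (xi − x̄)^k, summed over i = 1, …, n

In particular, the second central moment of a population is its variance.

The skewness of a data population is defined by the following formula, where μ2 and μ3 are the second and third central moments.

skewness = μ3 / μ2^(3/2)

The kurtosis of a univariate population is defined by the following formula, where μ2 and μ4 are the second and fourth central moments.

kurtosis = μ4 / μ2^2

Many texts and software packages report the excess kurtosis, μ4 / μ2^2 − 3, which is 0 for a normal distribution.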
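The moments can be computed directly in base R. The sketch below assumes the usual moment-ratio definitions, skewness = μ3 / μ2^(3/2) and kurtosis = μ4 / μ2^2, and uses a small made-up set of scores with one extreme high value. The function name central_moment is hypothetical.

```r
# kth central moment: the average of (x - mean)^k.
central_moment <- function(x, k) mean((x - mean(x))^k)

# Hypothetical scores with one extreme high value (positive skew).
scores <- c(1, 2, 2, 3, 3, 3, 4, 20)

m2 <- central_moment(scores, 2)   # variance (dividing by N)
m3 <- central_moment(scores, 3)
m4 <- central_moment(scores, 4)

skewness <- m3 / m2^(3 / 2)
kurtosis <- m4 / m2^2             # subtract 3 for excess kurtosis

print(c(mean = mean(scores), median = median(scores)))
print(c(skewness = skewness, kurtosis = kurtosis))
```

The positive skewness goes together with a mean that sits above the median, as described for positively skewed distributions.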

Sampling Techniques

A good reading assignment on Sampling Techniques


There are many different methods through which sampling can be done. Simple random sampling is considered the ideal sampling method for research; however, limited time and money often create the need to opt for other means of sampling.

Probability Methods:

This group of methods is preferred for sampling because it opens up the opportunity for the most powerful statistical analyses.

The different probability methods are:

  • Simple Random Sampling: It works best when the whole population is available.
  • Stratified Sampling: This kind of sampling works best when there are specific subgroups to be investigated and the researcher takes a random sample within each subgroup.
  • Systematic Sampling: This method is workable when a stream of representative people is available.
  • Cluster Sampling: It is largely workable when the population groups are separated and the access…
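As an illustration of one of these methods, systematic sampling can be sketched in R by picking a random starting point and then taking every kth member of a list. The population list, sample size and seed below are hypothetical.

```r
set.seed(42)                                   # arbitrary seed, for repeatability

# Hypothetical sampling frame of 200 people.
population <- sprintf("person_%03d", 1:200)

n <- 20                                        # desired sample size
k <- length(population) / n                    # sampling interval (here 10)

start <- sample(1:k, 1)                        # random start within the first interval
idx   <- seq(from = start, by = k, length.out = n)

systematic_sample <- population[idx]
print(systematic_sample)
```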




I find statistics rather mischievous. Like lady wisdom in the Book of Proverbs, one endeavours to obtain her riches, but the more one obtains them, the more one becomes detached from the realities of everyday living. Statisticians use complex (sometimes even boring) concepts to explain everyday phenomena. This practice is fine if the audience is fellow detached statisticians, but when the audience is laymen, this elitism loses its logic.

The aim of this course is not to teach you to be a statistician (of those we have plenty). The objective is that you become a questioner of human behaviour: that you look at phenomena and ask how, when, and why they are happening. The course then introduces a few tools that will help you to answer those questions.


This course is open to all with a high-school level mathematics and statistics knowledge. I have always found that it is not really the level of exposure that determines success in an endeavour, but the will to do. So the main prerequisite for this course is the will to learn and the desire to ask questions of your society. This course is not for those who want to flaunt their statistical wizardry.


The course shall be delivered through a series of lessons that will be published every Monday, Wednesday and Friday on this blog. We will systematically cover the topics found in a typical undergraduate introductory statistics class. We shall also use the R language as a tool for statistical computations. THIS IS NOT AN R CLASS! We shall simply present R as a tool for carrying out statistical analyses. Detailed R tutorials can be downloaded here for those who are so inclined.


What’s the point of all this? Well, as mentioned before, I hope that you will be able to ask probing questions of phenomena that you observe. In statistics we call that exploratory statistics. Ultimately, we ask these questions so that we are better able to deal with these phenomena in the future, be better prepared, take advantage, etc. This is known as predictive analytics.

This is an introductory course. The content is designed so that the statistical concepts are explained very simply. Where concepts may not be very clear, examples are given to illustrate them. It is the tutor’s hope that this course will whet the students’ appetite for more and that we can consequently get into intermediate and even advanced statistical concepts.

Ready to go? Start here.


A measure of central tendency provides information about the “middleness” of a distribution of scores, but not about the width or spread of the distribution. To assess the width of a distribution, we need a measure of variability or dispersion. A measure of variation indicates how scores are dispersed around the mean of the distribution.


The simplest measure of variation is the range—the difference between the lowest and the highest score in a distribution. To find the range, simply subtract the lowest score from the highest score.

Table 4.1 Score comparison for two classes

            Class 1    Class 2
              0          45
             50          50
            100          55
Sum (∑)     150         150
Mean (µ)     50          50

In the example above the range for Class 1 is 100 points, whereas the range for Class 2 is 10 points. Thus, the range provides some information concerning the difference in the spread of the distributions. In this simple measure of variation, however, only the highest and lowest scores enter the calculation, and all other scores are ignored.
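The ranges in Table 4.1 can be checked in R. Note that R's range() function returns the lowest and highest scores rather than their difference, so the sketch below wraps it in diff(). The vector names are illustrative.

```r
# Scores from Table 4.1.
class1 <- c(0, 50, 100)
class2 <- c(45, 50, 55)

print(range(class1))             # lowest and highest score: 0 100

range1 <- diff(range(class1))    # highest minus lowest
range2 <- diff(range(class2))

print(range1)
print(range2)
```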


The nth percentile of an observation variable is the value that cuts off the first n percent of the data values when it is sorted in ascending order.

quantile(INCOME, c(.32, .57, .98)) #finds the 32nd, 57th and 98th percentiles


There are several quartiles of an observation variable. The first quartile, or lower quartile, is the value that cuts off the first 25% of the data when it is sorted in ascending order. The second quartile, or median, is the value that cuts off the first 50%. The third quartile, or upper quartile, is the value that cuts off the first 75%.

quantile(INCOME) #gives the minimum, the first, second and third quartiles, and the maximum

The inter-quartile range is the difference between the third quartile and the first quartile. In R it can be computed with IQR(INCOME).
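Here is a small self-contained sketch of quartiles and the inter-quartile range. The five scores are made up for illustration, and the results rely on R's default quantile method (type 7); other methods can give slightly different values.

```r
# Hypothetical scores, already sorted in ascending order.
scores <- c(15, 20, 35, 40, 50)

quartiles <- quantile(scores, c(0.25, 0.50, 0.75))   # Q1, median, Q3
iqr_value <- IQR(scores)                             # Q3 minus Q1

print(quartiles)
print(iqr_value)
```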

Average Deviation and Standard Deviation

More sophisticated measures of variation use all of the scores in the distribution in their calculation. The most commonly used measure of variation is the standard deviation. Most people have heard this term before and may even have calculated a standard deviation if they have taken a statistics class. However, many people who know how to calculate a standard deviation do not really appreciate the information it provides.

To begin, let’s think about what the phrase standard deviation means. Other words that might be substituted for the word standard include average, normal, or usual. The word deviation means to diverge, move away from, or digress. Putting these terms together, we see that the standard deviation means the average movement away from something. But what? It is the average movement away from the center of the distribution—the mean.

The standard deviation, then, is the average distance of all of the scores in the distribution from the mean or central point of the distribution—or, as we shall see shortly, the square root of the average squared deviation from the mean. Think about how you would calculate the average distance of all of the scores from the mean of the distribution. First, you would have to determine how far each score is from the mean; this is the deviation, or difference, score. Then, you would have to average these scores. This is the basic idea behind calculating the standard deviation.

Let’s use these data to calculate the average distance from the mean. We will begin with a calculation that is slightly simpler than the standard deviation, known as the average deviation. The average deviation is essentially what the name implies— the average distance of all of the scores from the mean of the distribution.

First, we find the deviation score for each observation by subtracting the mean from it:

X − µ

Then we need to sum the deviation scores. Notice, however, that if we were to sum these scores, they would add to zero. Therefore, we first take the absolute value of the deviation scores (the distance from the mean, irrespective of direction). To calculate the average deviation, we sum the absolute value of each deviation score:

∑|X − µ|

Then we divide by the total number of scores (N) to find the average deviation:

AD = ∑|X − µ| / N

Although the average deviation is fairly easy to compute, it is not as useful as the standard deviation because, as we will see in later modules, the standard deviation is used in many other statistical procedures.

The standard deviation is very similar to the average deviation. The only difference is that rather than taking the absolute value of the deviation scores, we use another method to “get rid of” the negative deviation scores—we square the deviation scores.

The formula for the standard deviation is:

σ = √( ∑(X − µ)² / N )

Notice that the formula is similar to that for the average deviation. We determine the deviation scores, square the deviation scores, sum the squared deviation scores, and divide by the number of scores in the distribution. Lastly, we take the square root of that number. Why? Squaring the deviation scores has inflated them. We now need to bring the squared deviation scores back to the same level of measurement as the mean so that the standard deviation is measured on the same scale as the mean.

If, however, you are using sample data to estimate the population standard deviation, then the standard deviation formula must be slightly modified. The modification provides what is called an “unbiased estimator” of the population standard deviation based on sample data. The modified formula is:

s = √( ∑(X − X̄)² / (N − 1) )

where

s = unbiased estimator of population standard deviation

X = each individual score

X̄ = sample mean

N = number of scores in the sample

The main difference is in the denominator—dividing by N – 1 versus N. The reason is that the standard deviation within a small sample may not be representative of the population; that is, there may not be as much variability in the sample as there actually is in the population. We, therefore, divide by N – 1, because dividing by a smaller number increases the standard deviation and thus provides a better estimate of the population standard deviation.

sd(INCOME) #sample standard deviation of INCOME (R divides by N - 1)

var(INCOME) #sample variance of INCOME (also divides by N - 1)

The variance is the square of the standard deviation.
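The three measures can be compared on the two classes from Table 4.1. In this sketch the function names avg_dev and pop_sd are hypothetical helpers written for illustration; only sd() is built into R, and it uses the N - 1 version.

```r
# Scores from Table 4.1.
class1 <- c(0, 50, 100)
class2 <- c(45, 50, 55)

# Average deviation: mean absolute distance from the mean.
avg_dev <- function(x) mean(abs(x - mean(x)))

# Population standard deviation: divide by N.
pop_sd <- function(x) sqrt(mean((x - mean(x))^2))

results <- rbind(
  class1 = c(avg_dev = avg_dev(class1), pop_sd = pop_sd(class1), sample_sd = sd(class1)),
  class2 = c(avg_dev = avg_dev(class2), pop_sd = pop_sd(class2), sample_sd = sd(class2))
)
print(round(results, 2))
```

Both classes have the same mean, yet every measure of variation is ten times larger for Class 1, and the N - 1 version is always a little larger than the N version.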



3.1 The Why

Organizing data into tables and graphs can help make a data set more meaningful. These methods, however, do not provide as much information as numerical measures. Descriptive statistics are numerical measures that describe a distribution by providing information on the central tendency of the distribution, the width of the distribution, and the distribution’s shape.

A measure of central tendency characterizes an entire set of data in terms of a single representative number.

Measures of central tendency measure the “middleness” of a distribution of scores in three ways: the mean, median, and mode.

3.2 The What


The most commonly used measure of central tendency is the mean—the average observation in a set of observations. There are two manifestations (yes, I said it!) of the mean; the arithmetic mean and the geometric mean. We are not interested in the geometric mean at this point but you can look at it here if you are interested. We will stick to the arithmetic mean.

Arithmetic mean: called simply the mean or average when the context is clear, it is the sum of a collection of numbers divided by the number of numbers in the collection.

We can calculate the mean for our distribution of exam scores (from the previous lesson) by adding all of the scores together and dividing by the total number of scores. Mathematically, this would be:

µ = ∑X / N

where

µ (pronounced “mu”) represents the symbol for the population mean

∑ (sigma) represents the symbol for “the sum of”

X represents the individual scores, and

N represents the number of scores in the distribution

To calculate the mean, then, we sum all of the Xs, or scores, and divide by the total number of scores in the distribution (N). You may have also seen this formula represented as follows:

X̄ = ∑X / N

In this case X̄ (pronounced “x-bar”) represents a sample mean.

One of the main shortcomings of the mean is that the mean is influenced by extreme scores (what we sometimes refer to as outliers).

An outlier is an observation point that is distant from other observations.

Outliers can distort the results by giving an inaccurate representation of the distribution of the population.



Another measure of central tendency, the median, is used in situations in which the mean might not be representative of a distribution. Let’s use a different distribution of scores to demonstrate when it might be appropriate to use the median rather than the mean. Imagine that you are considering taking a job with a small computer company. When you interview for the position, the owner of the company informs you that the mean income for employees at the company is approximately Kshs. 100,000 and that the company has 25 employees. Most people would view this as good news. Having learned in a statistics class that the mean might be influenced by extreme scores, you ask to see the distribution of 25 incomes. The distribution is shown below.

Table 3.1 Employee Salaries Distribution

Income (Kshs.)    Frequency (f)    f × Income (Kshs.)
15,000            1                15,000
20,000            2                40,000
22,000            1                22,000
23,000            2                46,000
25,000            5                125,000
27,000            2                54,000
30,000            3                90,000
32,000            1                32,000
35,000            2                70,000
38,000            1                38,000
39,000            1                39,000
40,000            1                40,000
42,000            1                42,000
45,000            1                45,000
1,800,000         1                1,800,000
Total             N = 25           2,498,000

The mean for this distribution is Kshs. 99,920. Notice that, as claimed, the mean income of company employees is very close to Kshs. 100,000. Notice also, however, that the mean in this case is not very representative of central tendency, or “middleness.” In this distribution, the mean is thrown off center or inflated by one very extreme score of Kshs. 1,800,000 (the income of the company’s owner, needless to say). This extremely high income pulls the mean toward it and thus increases or inflates the mean. Thus, in distributions with one or a few extreme scores (either high or low), the mean will not be a good indicator of central tendency. In such cases, a better measure of central tendency is the median.

The median is the middle score in a distribution after the scores have been arranged from highest to lowest or lowest to highest.

The distribution of incomes in Table 3.1 is already ordered from lowest to highest. To determine the median, we simply have to find the middle score. In this situation, with 25 scores, that would be the 13th score. You can see that the median of the distribution would be an income of Kshs. 27,000, which is far more representative of the central tendency for this distribution of incomes.

Why is the median not as influenced as the mean by extreme scores? Think about the calculation of each of these measures. When calculating the mean, we must add in the atypical income of Kshs. 1,800,000, thus distorting the calculation. When determining the median, however, we do not consider the size of the Kshs. 1,800,000 income; it is only a score at one end of the distribution whose numerical value does not have to be considered in order to locate the middle score in the distribution. The point to remember is that the median is not affected by extreme scores in a distribution because it is only a positional value. The mean is affected because its value is determined by a calculation that has to include the extreme value.

In distributions with an even number of observations, the median is calculated by averaging the two middle scores. In other words, we determine the middle point between the two middle scores.


The third measure of central tendency is the mode—the score in a distribution that occurs with the greatest frequency. Sometimes, several scores occur with equal frequency. Thus, a distribution may have two modes (bimodal), three modes (trimodal), or even more. The mode is the only indicator of central tendency that can be used with nominal data. Although it can also be used with ordinal, interval, or ratio data, the mean and median are more reliable indicators of the central tendency of a distribution, and the mode is seldom used.

 3.3 The How

R Code

mydata <- read.table("testdata.txt", header = TRUE) #import your dataset; header = TRUE assumes the first row holds the variable names

names(mydata) #This shows us all the variable names

attach(mydata) #Optional: lets you call the variables directly, e.g. INCOME instead of mydata$INCOME

mean(mydata$INCOME) #Find the mean income

mean(INCOME) #The same result, because we used the attach() command

mean(INCOME, na.rm=TRUE) #Remove NA values before computation

median(INCOME, na.rm=TRUE) #returns the middle observation

mode(INCOME) #Not the statistical mode! The function mode( ) in R returns the storage type of the variable (e.g. "numeric")

temp <- table(as.vector(INCOME)) #"temp" is a frequency table:

#its names are the sorted unique values in the vector INCOME

#and its entries count how many times each value occurs

names(temp)[temp == max(temp)]

#this returns the value(s) with the highest count in temp, as text

# This happens to be the mode! (There may be more than one.)

#R knows you will want to see several summary measures at once

summary(INCOME) # So it supplies the minimum, quartiles, median, mean and maximum in one command
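If you do not have a data file at hand, the same measures can be checked against Table 3.1. The sketch below rebuilds the 25 salaries from the table with rep(); the vector names are illustrative.

```r
# Incomes (Kshs.) and their frequencies, copied from Table 3.1.
income <- c(15000, 20000, 22000, 23000, 25000, 27000, 30000, 32000,
            35000, 38000, 39000, 40000, 42000, 45000, 1800000)
freq   <- c(1, 2, 1, 2, 5, 2, 3, 1, 2, 1, 1, 1, 1, 1, 1)

# Expand the table into one value per employee (25 values).
salaries <- rep(income, times = freq)

salary_mean   <- mean(salaries)      # pulled up by the owner's income
salary_median <- median(salaries)    # the 13th of the 25 ordered salaries

# Mode: the value(s) with the highest count in the frequency table.
counts      <- table(salaries)
salary_mode <- as.numeric(names(counts)[counts == max(counts)])

print(c(mean = salary_mean, median = salary_median, mode = salary_mode))
```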

We will discuss the “Min.”, “1st Qu.”, “3rd Qu.” and “Max.” entries in the next class. These help describe dispersion. We will also look at range, standard deviation, percentiles, skewness and other similar animals. Until next time.


X.X Sampling

I have thrown this topic here because I am not very sure where it fits. So what is sampling?

Imagine we want to study the effect of the introduction of new parking rates on the taxi business in Nairobi. We may want to know if it has negatively impacted the business or not. So we decide to interview taxi drivers or taxi owners to get their views on the matter.

There we encounter problem number 1; it may not be feasible to interview ALL taxi drivers to get their views, either because we do not have the time or the money or both. The sum total of all the taxi drivers in Nairobi is called the population.

A population is any entire collection of people, animals, plants or things from which we may collect data. It is the entire group we are interested in, which we wish to describe or draw conclusions about.

Obviously, it would be better if we had the views of all taxi drivers in Nairobi in our study. That way we would be extremely confident with our findings. Alas, that is not possible and, therefore, we have to select a smaller group of taxi drivers whose views we will collect. This smaller group we have chosen is what is known as a sample.

A sample is a group selected from a larger group (population). By studying the sample it is hoped to draw valid conclusions about the larger group.

Here we encounter problem number 2; how do we select a smaller group of taxi drivers without appearing to be biased? The best way is if we were able to choose taxi drivers randomly. If we could get a random sample then no one can accuse us of being biased, right?

Ideally, we would like to choose a simple random sample.

A simple random sample is a subset (I will explain this concept of a subset when we start on probability) of a population, chosen so that each member of the population has an equal probability of being selected. It is meant to be an unbiased representation of the group.

Therein lies problem number 3: it is almost impossible to pick a truly random sample. Maybe the only way to do it is if you had the names of all the taxi drivers, put them in a hat and drew your sample at random. But then you will encounter problems when the members of the sample are too widely spread, or if by some coincidence, all of them belong to one company, or some issue like that. The more likely problem is that we do not have the names of all the taxi drivers in Nairobi. These issues increase the likelihood of a sampling error.

Sampling error is the extent to which the sample does not accurately reflect the population it is supposed to represent. We want to minimize sampling error as much as possible.

It would be easier if we had a method for dividing up the population into manageable units from which we could draw our sample. This is called stratified random sampling. We divide the population into strata (units or divisions) based on some meaningful criteria. In our case, we could divide the population into geographic regions, e.g. the CBD, Upperhill, Westlands, Hurlingham. Then we take a random sample of taxi drivers in each region.

Remember that we are not interested in just the sample; we want to know the effect of the parking fee increase on the entire taxi business. But we are only getting views from a smaller group (subset) of taxi drivers. Therefore, once we get the results, we have to infer what the views of the entire population would be.
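
Here is a rough sketch of stratified random sampling in R. The data frame, the 100 drivers per region and the 5 drivers drawn per region are all assumptions made up for this example; only three of the regions mentioned above are used to keep it short.

```r
# Hypothetical sampling frame: 300 drivers, 100 in each of three regions
frame <- data.frame(
  driver_id = 1:300,
  region    = rep(c("CBD", "Upperhill", "Westlands"), each = 100)
)

set.seed(1)                                   # reproducible draw
# split the driver ids by region (the strata),
# then draw 5 ids at random within each region
ids_by_region <- split(frame$driver_id, frame$region)
chosen_ids    <- unlist(lapply(ids_by_region, sample, size = 5))

stratified_sample <- frame[frame$driver_id %in% chosen_ids, ]
table(stratified_sample$region)               # number of drivers chosen per region
```

The point of the design is that every region is guaranteed to appear in the sample, which a single simple random draw cannot promise.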

Statistical inference makes use of information from a sample to draw conclusions (inferences) about the population from which the sample was taken.

Election poll results are an example of statistical inference.

Let’s define two more terms before we proceed:

A parameter is a value, usually unknown (and which we, therefore, are trying to estimate), used to represent a certain population characteristic. The mean, for instance, is a population parameter used to indicate the average value of a particular quantity. In statistics, parameters are represented by Greek letters (for example µ for mean).

A statistic is a quantity calculated from a sample of data. It is used to give information about unknown values in the corresponding population. Statistics are usually represented by Roman letters (for example s for standard deviation).
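
The difference between a parameter and a statistic can be sketched in R. Everything here is simulated: the "population" of 1000 daily earnings figures is invented, and in a real study we would not be able to compute mu at all, which is exactly why we estimate it from a sample.

```r
set.seed(7)
# Hypothetical population: daily earnings (in shillings) of 1000 taxi drivers
population <- rnorm(1000, mean = 3000, sd = 500)

mu <- mean(population)           # parameter: the population mean (normally unknown)

smp   <- sample(population, 50)  # a simple random sample of 50 drivers
x_bar <- mean(smp)               # statistic: the sample mean, our estimate of mu
s     <- sd(smp)                 # statistic: the sample standard deviation

c(mu = mu, x_bar = x_bar, s = s) # compare the parameter with the statistics
```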

Back to the lesson: Organizing Data

2.1 The Why

Sometimes we have a small sample to deal with. By just looking at the data we are able to describe it without having to resort to various statistical techniques. Oftentimes, though, that is not the case. More often than not we will be working with large data sets. For us to be able to “look” at the data, we would have to arrange it in a meaningful way.

Organizing data not only helps us to make preliminary descriptions of the data, but also gives us an indication of the kind of techniques we would need to apply in order to make more sense of the dataset in front of us.

Visualizing the Dataset

Sometimes we want to see the data. Most of us have used Excel. When you open a dataset in Excel, you are already looking at organized data! Excel is a great tool and it organizes datasets into rows and columns. The columns represent variables (remember lesson 1) while the rows represent observations.

You can also view datasets in SPSS, SAS and R. Since we are using R in this course, to view a dataset, import it and assign it to the data frame mydata:

mydata <- read.table("your_data_set") # import your dataset

mydata # view your dataset

RStudio has an even better way of displaying datasets. Can you try importing a dataset in RStudio without using the above code?

Frequency Distributions

You may have a list-like form of data as below:

Table 2.1 Scores of Students in the Class

23 45 6 73 23 23 45 50 51 34


As we said, we organize data so that meaningful conclusions can be drawn from it. One way to do that would be to sort the data from lowest to highest or vice versa. Once this is accomplished, we can condense the data into a frequency distribution (see Table 2.2), a table in which all of the scores are listed along with the frequency with which each occurs. We can also show a relative frequency distribution, which indicates the proportion of the total observations included in each score. When a relative frequency is multiplied by 100, it is read as a percentage.

Table 2.2 Frequency Distribution

Score               6    23   34   45   50   51   73   Total
Frequency           1    3    1    2    1    1    1    N=10
Relative Frequency  0.1  0.3  0.1  0.2  0.1  0.1  0.1  1.0

The frequency distribution is a way of presenting data that makes the pattern of the data easier to see. Frequency distributions are great for nominal and ordinal data.
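
As a quick sketch, a frequency distribution like Table 2.2 can be produced in R with the base functions table() and prop.table(). The variable names freq and rel_freq are just names chosen for this example.

```r
studentscores <- c(23, 45, 6, 73, 23, 23, 45, 50, 51, 34)  # scores from Table 2.1

freq     <- table(studentscores)   # counts how often each score occurs (sorted by score)
rel_freq <- prop.table(freq)       # divides each count by the total number of scores

freq                               # the frequency distribution
rel_freq                           # the relative frequency distribution
rel_freq * 100                     # the same proportions read as percentages
```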

When dealing with interval and ratio data (especially when the dataset is very large), we group the observations and create a class interval frequency distribution. We can combine individual scores into categories, or intervals, and list them along with the frequency of scores in each interval. In our exam score example, the scores range from 6 to 73, so intervals running from 0 to 80 cover every score. A rule of thumb when creating class intervals is to have between 10 and 20 categories for a large dataset; with only 10 scores fewer intervals are enough, and Table 2.3 uses eight intervals of width 10. A quick method of calculating the width of the interval is to subtract the smallest score from the largest score and then divide by the number of intervals you would like.
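
The quick width calculation can be tried out in R on the ten scores from Table 2.1. The choice of 8 intervals is an assumption made for this small example.

```r
studentscores <- c(23, 45, 6, 73, 23, 23, 45, 50, 51, 34)  # scores from Table 2.1

n_intervals <- 8                                  # assumed number of intervals we would like
width <- (max(studentscores) - min(studentscores)) / n_intervals

width                                             # (73 - 6) / 8 = 8.375
```

In practice we would round this up to a convenient width such as 10, which is the width used for the intervals in Table 2.3.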

Table 2.3 Class Interval Frequency Distribution

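Interval            0-10  10-20  20-30  30-40  40-50  50-60  60-70  70-80  Total
Frequency           1     0      3      1      2      2      0      1      N=10
Relative Frequency  0.1   0.0    0.3    0.1    0.2    0.2    0.0    0.1    1.0

Each interval includes its lower limit but not its upper limit, so a score of 50 is counted in the 50-60 interval.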

Creating Frequency Distributions with R

studentscores <- c(23,45,6,73,23,23,45,50,51,34) # create a vector of the student scores

range(studentscores) # displays the minimum and maximum score

bin <- seq(0, 80, by=10) # creates the interval boundaries 0, 10, ..., 80

bin # displays the values of our bin vector

studentscores.cut <- cut(studentscores, bin, right=FALSE) # assigns each score to an interval that includes its lower limit but not its upper limit

studentscores.frequency <- table(studentscores.cut) # creates our frequency distribution

cbind(studentscores.frequency) # displays the frequencies as a one-column matrix
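
The relative frequencies for the class intervals can be added in the same way. This is a small sketch that repeats the steps above in a self-contained form; the names intervals, freq and rel_freq are chosen just for this example.

```r
studentscores <- c(23, 45, 6, 73, 23, 23, 45, 50, 51, 34)  # scores from Table 2.1
bin <- seq(0, 80, by = 10)                                 # interval boundaries

# right = FALSE: each interval includes its lower limit but not its upper limit
intervals <- cut(studentscores, bin, right = FALSE)

freq     <- table(intervals)       # frequency of scores in each interval
rel_freq <- prop.table(freq)       # proportion of scores in each interval

# one row per interval: its label, its frequency and its relative frequency
data.frame(freq, relative = as.vector(rel_freq))
```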


Frequency distributions can provide valuable information, but sometimes a picture is of greater value. Several types of pictorial representations can be used to represent data. The choice depends on the type of data collected and what the researcher hopes to emphasize or illustrate. The most common types of data visualization are graphs: pie charts, bar graphs, histograms and frequency polygons (line graphs).

There are new and very powerful data visualization tools. You can research these tools if you would like to delve a little more into this area. For this class, we shall stick to the basics.

Bar Graphs and Histograms

Bar graphs and histograms are frequently confused. When the data collected are on a nominal scale, or if the variable is a qualitative variable (a categorical variable for which each value represents a discrete category), then a bar graph is most appropriate. A bar graph is a graphical representation of a frequency distribution in which vertical bars are centered above each category along the x-axis and are separated from each other by a space, indicating that the levels of the variable represent distinct, unrelated categories.

If the variable is a quantitative variable (the scores represent a change in quantity), or if the data collected are ordinal, interval, or ratio in scale, then a histogram can be used. A histogram is also a graphical representation of a frequency distribution in which vertical bars are centered above scores on the x-axis, but in a histogram the bars touch each other to indicate that the scores on the variable represent related, increasing values.


In both a bar graph and a histogram, the height of each bar indicates the frequency for that level of the variable on the x-axis. The spaces between the bars on the bar graph indicate not only the qualitative differences among the categories but also that the order of the values of the variable on the x-axis is arbitrary. In other words, the categories on the x-axis in a bar graph can be placed in any order. The fact that the bars are contiguous in a histogram indicates not only the increasing quantity of the variable but also that the variable has a definite order that cannot be changed.
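
Both graphs can be sketched with base R. The payment data below is invented purely to have a nominal variable to plot; the histogram uses the student scores from Table 2.1 with intervals of width 10.

```r
# Hypothetical nominal data: how 10 taxi passengers paid for their ride
payment <- c("cash", "mobile", "cash", "card", "mobile",
             "mobile", "cash", "mobile", "card", "mobile")

# Bar graph: separated bars, one per category
barplot(table(payment), xlab = "Payment method", ylab = "Frequency")

# Histogram: touching bars over a quantitative variable
studentscores <- c(23, 45, 6, 73, 23, 23, 45, 50, 51, 34)
h <- hist(studentscores, breaks = seq(0, 80, by = 10), right = FALSE,
          xlab = "Score", main = "Histogram of student scores")

h$counts   # the bar heights, i.e. the frequency in each interval
```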

Pie Chart

Like bar graphs, pie charts are used to represent categorical variables. While bar graphs display the frequencies, pie charts show the proportions (relative frequencies). Most times, the relative frequencies are represented as percentages.
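
A pie chart can be sketched with the base R function pie(). The payment data is made up for this example; the labels show each category with its percentage.

```r
# Hypothetical nominal data: how 10 taxi passengers paid for their ride
payment <- c("cash", "mobile", "cash", "card", "mobile",
             "mobile", "cash", "mobile", "card", "mobile")

shares <- prop.table(table(payment))   # relative frequency of each category
pct    <- round(100 * shares)          # the same proportions as percentages

# one slice per category, labelled with its name and percentage
pie(shares, labels = paste0(names(shares), " (", pct, "%)"))
```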

Frequency Polygons (Line Graphs)

We can also depict the data in a histogram as a frequency polygon—a line graph of the frequencies of individual scores or intervals. Again, scores (or intervals) are shown on the x-axis and frequencies on the y-axis. Once all the frequencies are plotted, the data points are connected. Frequency polygons are appropriate when the variable is quantitative or the data are ordinal, interval, or ratio. In this respect, frequency polygons are similar to histograms. Frequency polygons are especially useful for continuous data (such as age, weight, or time) in which it is theoretically possible for values to fall anywhere along the continuum. For example, an individual can weigh 120.5 pounds or be 35.5 years of age. Histograms are more appropriate when the data are discrete (measured in whole units)—for example, number of college classes taken or number of siblings.
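
One way to sketch a frequency polygon in base R is to let hist() do the counting without drawing anything, and then plot the counts against the interval midpoints. This is only one possible approach, shown here with the student scores from Table 2.1.

```r
studentscores <- c(23, 45, 6, 73, 23, 23, 45, 50, 51, 34)

# plot = FALSE: compute the interval counts and midpoints without drawing a histogram
h <- hist(studentscores, breaks = seq(0, 80, by = 10), right = FALSE, plot = FALSE)

# Frequency polygon: plot each frequency at its interval midpoint and connect the points
plot(h$mids, h$counts, type = "b",
     xlab = "Score (interval midpoint)", ylab = "Frequency",
     main = "Frequency polygon of student scores")
```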

3D Plots

Sometimes we may want to represent complex relationships using graphs. While line graphs are used to show the relationship between 2 variables, we may want to see the relationship between 3 variables. We can graph this using 3D graphing tools. Excel does not have this feature out of the box, and it is very difficult to do directly in R. Instead, we do this using a package in R.

Packages are collections of R functions, data and compiled code in a well-defined format.

R Commander (the Rcmdr package) and ggplot2 are often touted as the best plotting packages for R. Installing packages will be part of the assignment for this lesson.