In order to better understand the nature of probabilistic decisions, consider the following court case of The People v. Collins, 1968. In this case, the robbery victim was unable to identify his assailant. All that the victim could recall was that the assailant was female with a blonde ponytail. In addition, he remembered that she fled the scene in a yellow convertible that was driven by an African American male who had a full beard. The suspect in the case fit the description given by the victim, so the question was “Could the jury be sure, beyond a reasonable doubt, that the woman on trial was the robber?” The evidence against her was as follows: She was blonde and often wore her hair in a ponytail; her codefendant friend was an African American male with a moustache, beard, and a yellow convertible. The attorney for the defense stressed the fact that the victim could not identify this woman as the woman who robbed him, and that therefore there should be reasonable doubt on the part of the jury.
The prosecutor, on the other hand, called an expert in probability theory who testified to the following: The probability of all of the above conditions (being blonde, often wearing a ponytail, having an African American male friend, his having a full beard, and his owning a yellow convertible) co-occurring, assuming these characteristics are independent, was 1 in 12 million. The expert further testified that the combination of characteristics was so unusual that the jury could in fact be certain “beyond a reasonable doubt” that the woman was the robber. The jury returned a verdict of “guilty” (Arkes & Hammond, 1986; Halpern, 1996).
As can be seen in the previous example, the legal system operates on probability and recognizes that we can never be absolutely certain when deciding whether an individual is guilty. Thus, the standard of “beyond a reasonable doubt” was established and jurors base their decisions on probability, whether they realize it or not. Most decisions that we make on a daily basis are, in fact, based on probabilities. Diagnoses made by doctors, verdicts produced by juries, decisions made by business executives regarding expansion and what products to carry, decisions regarding whether individuals are admitted to colleges, and most everyday decisions all involve using probability. In addition, all games of chance (for example, cards, horse racing, the stock market) involve probability.
If you think about it, there is very little in life that is certain. Therefore, most of our decisions are probabilistic and having a better understanding of probability will help you with those decisions. In addition, because probability also plays an important role in science, that is another important reason for us to have an understanding of it.
BASIC PROBABILITY CONCEPTS
Probability is a measure of chance, and we shall propose general rules for calculating the probability of combinations of simple events.
Probability refers to the number of ways a particular outcome (event) can occur divided by the total number of possible outcomes (events), provided that all outcomes are equally likely.
The tossing of a coin is a simple example of a large class of games of chance with certain common features. Each game is decided on the results or outcomes of one or more trials, where a trial might be rolling a die, tossing a coin, or drawing a card from a pack. If no two of the outcomes can occur on the same trial, we say they are mutually exclusive, and if they are the only possible results they are also said to be exhaustive. There may be more than one way of listing the outcomes. If we draw a card from the pack, the outcomes Red and Black are mutually exclusive and exhaustive, but so are the outcomes Spades, Hearts, Diamonds, and Clubs. The trials are also said to be independent if the result of one trial does not depend on the outcome of any previous trial, or any combination of previous trials.
COUNTING METHODS
Probabilities are often presented or expressed as proportions. Proportions vary between 0.0 and 1.0, where a probability of 0.0 means the event certainly will not occur and a probability of 1.0 means that the event is certain to occur. Thus, any probability between 0.0 and 1.0 represents an event with some degree of uncertainty to it. How much uncertainty depends on the exact probability with which we are dealing. For example, a probability close to 0.0 represents an event that is almost certain not to occur, and a probability close to 1.0 represents an event that is almost certain to occur. On the other hand, a probability of .50 represents maximum uncertainty.
Let’s start with a simple example of probability. What is the probability of getting a “head” when tossing a coin? In this example, we have to consider how many ways there are to get a “head” on a coin toss (there is only one way, the coin lands heads up) and how many possible outcomes there are (there are two possible outcomes, either a “head” or a “tail”). So, the probability of a “head” in a coin toss is:
Pr(head) = (number of ways to get a head) / (number of possible outcomes) = 1/2 = 0.5
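The counting argument for the coin can also be checked by simulation. The sketch below is only an illustration: the seed and the number of tosses (10,000) are arbitrary choices, not values from the lesson.

```r
# Simulate coin tosses and estimate Pr(head) as a relative frequency.
set.seed(1)                                   # arbitrary seed, for reproducibility only
tosses <- sample(c("H", "T"), size = 10000, replace = TRUE)
p_head <- mean(tosses == "H")                 # proportion of heads observed
p_head                                        # should land close to the theoretical 1/2
```

The more tosses we simulate, the closer the observed proportion tends to get to 0.5.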
AXIOMATIC APPROACH
Set Theory: A set is a collection of items or events. The items within a set are generally referred to as elements. A set can be an element of another set.
The Universal Set is the set of all possible elements. In probability, the universal set is the set of all possible outcomes of a trial (experiment).
Sample space: In order to avoid continually referring to particular games or experiments, it is useful to employ an abstract representation for a trial and its outcomes. Each distinguishable and indecomposable outcome, or simple event, is regarded as a point in a sample space, S. Thus, for the experiment of drawing a card from a pack the sample space contains 52 points. Every collection of simple events or set of points of S is called an event. The Sample space is an example of a universal set.
Intersection: The intersection of two sets A, B is the set of points of S which belong to both A and B and is an event. Thus the intersection of the sets {HH, TH, HT} and {HT, TT} is the set containing the single point HT. This event may be called ‘heads on the first coin and tails on the second coin’. It may happen that the two sets have no points in common, that is, their intersection is the empty set. Simply, an intersection of two sets is a set containing elements common to both sets.
Union: This is defined as the set which contains all the points of S which are in either A or B (or both). Thus, the union of the events {HH, TH, HT} and {HT, TT}, in the present example, is the event {HH, TH, HT, TT}, which contains every point in the sample space and may reasonably be called ‘the certain event’. In other words, a union of sets is a set that contains all unique elements of the sets.
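The two-coin example can be reproduced with R's built-in set functions. This is a small sketch; the variable names S, A and B are chosen here to mirror the text.

```r
# Sample space for tossing two coins; each string is one simple event.
S <- c("HH", "HT", "TH", "TT")
A <- c("HH", "TH", "HT")     # first event from the text
B <- c("HT", "TT")           # second event from the text

both   <- intersect(A, B)    # points that belong to both A and B
either <- union(A, B)        # points that belong to A or B (or both)

both                         # "HT"
setequal(either, S)          # TRUE: the union is the certain event
```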
VENN DIAGRAM
Figure 5.1: A Venn Diagram representing the intersection of two sets.
The Venn diagram is a simple graphical tool used to represent set theory computations. In a Venn diagram, the sample space is represented by a rectangle and any event by a circle in this rectangle.
Example
An ice cream firm, before launching three new flavors, conducts a tasting with the assistance of 60 schoolboys. The findings were summarized as:
32 liked A
24 liked B
31 liked C
10 liked A and B
11 liked A and C
14 liked B and C
6 liked A and B and C.
Since there are only three flavors, A, B, and C, to consider, the information provided can easily be grasped through a diagram. Can you draw a Venn diagram to represent this relationship?
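One way to fill in the Venn diagram is to start from the centre (the boys who liked all three flavors) and work outwards. The sketch below does that arithmetic in R; the variable names are made up for this illustration.

```r
# Counts reported in the tasting (60 schoolboys in total).
n_total <- 60
n_A <- 32; n_B <- 24; n_C <- 31
n_AB <- 10; n_AC <- 11; n_BC <- 14
n_ABC <- 6

# Regions of the three-circle diagram, working outwards from the centre.
only_AB <- n_AB - n_ABC                      # liked A and B but not C
only_AC <- n_AC - n_ABC                      # liked A and C but not B
only_BC <- n_BC - n_ABC                      # liked B and C but not A
only_A  <- n_A - only_AB - only_AC - n_ABC   # liked A alone
only_B  <- n_B - only_AB - only_BC - n_ABC   # liked B alone
only_C  <- n_C - only_AC - only_BC - n_ABC   # liked C alone

liked_any  <- only_A + only_B + only_C + only_AB + only_AC + only_BC + n_ABC
liked_none <- n_total - liked_any            # boys outside all three circles

c(only_A = only_A, only_B = only_B, only_C = only_C,
  only_AB = only_AB, only_AC = only_AC, only_BC = only_BC,
  all_three = n_ABC, none = liked_none)
```

The seven regions inside the circles add up to 58, so 2 of the 60 boys liked none of the flavors.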
AXIOMS OF PROBABILITY
For every event E in the sample space S we assign a nonnegative number, called the probability of E and denoted by Pr(E), so that the following axioms are satisfied.
(a) For every event E, Pr(E) ≥ 0. Nonnegativity
(b) For the certain event, Pr(S) = 1. Sum of all probabilities = 1
(c) If E_{1}, E_{2} are mutually exclusive events, Pr(E_{1} ∪ E_{2}) = Pr(E_{1}) + Pr(E_{2}). Additivity
(d) If E_{1}, E_{2} are independent events, Pr(E_{1} ∩ E_{2}) = Pr(E_{1}) × Pr(E_{2}). Multiplication rule (strictly, this is the definition of independence rather than an axiom)
Conditional Probability
Conditional probability measures the probability that an event will occur “given that” another event has occurred. For two events, A and B, the conditional probability of B given that A has occurred is denoted as Pr(B | A). It is calculated as:
Pr(B | A) = Pr(A ∩ B) / Pr(A), provided Pr(A) > 0
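As a small illustration, conditional probability can be computed by counting cards in a deck, using the standard definition Pr(B | A) = Pr(A and B) / Pr(A). The choice of events below (A: the card is red, B: the card is a heart) is made up for this sketch.

```r
# Build a standard 52-card deck: every rank paired with every suit.
deck <- expand.grid(rank = c("A", 2:10, "J", "Q", "K"),
                    suit = c("Spades", "Hearts", "Diamonds", "Clubs"),
                    stringsAsFactors = FALSE)

A <- deck$suit %in% c("Hearts", "Diamonds")   # event A: the card is red
B <- deck$suit == "Hearts"                    # event B: the card is a heart

p_A         <- mean(A)                        # Pr(A)       = 26/52
p_A_and_B   <- mean(A & B)                    # Pr(A and B) = 13/52
p_B_given_A <- p_A_and_B / p_A                # Pr(B | A)
p_B_given_A                                   # 0.5
```

Knowing that the card is red doubles the chance that it is a heart, from 13/52 to 13/26.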
Bayes’ Theorem
Also called Bayes’ Rule or Bayes’ law, it relates our updated belief in an event, after seeing some evidence, to the belief we held before seeing it. Sounds vague and mysterious? Not to worry, this will become clearer when we start making deductions based on probability estimates. For now, what you need to remember is that Bayes’ law tells us how to revise a probability in light of what we have observed.
The formula:
Pr(A | B) = Pr(B | A) × Pr(A) / Pr(B), provided Pr(B) > 0
Can you try and derive this formula using the formula for conditional probability?
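If you get stuck, here is one way the derivation can go. It is a sketch that assumes both Pr(A) and Pr(B) are greater than zero, and it uses only the definition of conditional probability applied twice.

```latex
\begin{align*}
\Pr(A \mid B) &= \frac{\Pr(A \cap B)}{\Pr(B)}
  && \text{definition of conditional probability} \\
\Pr(B \mid A) &= \frac{\Pr(A \cap B)}{\Pr(A)}
  \;\Longrightarrow\; \Pr(A \cap B) = \Pr(B \mid A)\,\Pr(A)
  && \text{same definition, roles swapped} \\
\Pr(A \mid B) &= \frac{\Pr(B \mid A)\,\Pr(A)}{\Pr(B)}
  && \text{substitute into the first line}
\end{align*}
```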
In the next lesson we look at random variables and probability distributions.
When a distribution of scores is very large, it tends to approximate a pattern called a normal distribution. When plotted as a frequency polygon, a normal distribution forms a symmetrical, bell-shaped pattern often called a normal curve (see Figure 5.1). We say that the pattern approximates a normal distribution because a true normal distribution is a theoretical construct not actually observed in the real world.
The normal distribution is a theoretical frequency distribution that has certain special characteristics. First, it is bell-shaped and symmetrical: the right half is a mirror image of the left half. Second, the mean, median, and mode are equal and are located at the center of the distribution. Third, the normal distribution is unimodal: it has only one mode. Fourth, most of the observations are clustered around the center of the distribution, with far fewer observations at the ends, or “tails,” of the distribution. Lastly, when standard deviations are used on the x-axis, the percentage of scores falling between the mean and any point on the x-axis is the same for all normal curves. We will discuss the normal distribution more extensively in later lessons.
Figure 5.1 Normal Curve
Kurtosis
Although we typically think of the normal distribution as being similar to the curve depicted in Figure 5.1, bell-shaped distributions vary in shape. Kurtosis refers to how flat or peaked a distribution is, that is, whether the distribution is tall and skinny or short and fat. The normal distribution depicted in Figure 5.1 is called mesokurtic (meso means “middle”). Mesokurtic curves have peaks of medium height and the distributions are moderate in breadth. Now look at the two distributions depicted in Figure 5.2.
The distribution on the left is leptokurtic (lepto means “thin”). Leptokurtic curves are tall and thin, with only a few scores in the middle of the distribution having a high frequency. Last, see the curve on the right side of Figure 5.2. This is a platykurtic curve (platy means “broad” or “flat”). Platykurtic curves are short and more dispersed (broader). In a platykurtic curve, there are many scores around the middle score that all have a similar frequency.
Figure 5.2 Kurtosis
Positively Skewed Distributions
Most distributions do not approximate a normal or bell-shaped curve. Instead, they are skewed, or lopsided. In a skewed distribution, scores tend to cluster at one end or the other of the x-axis, with the tail of the distribution extending in the opposite direction. In a positively skewed distribution, the peak is to the left of the center point and the tail extends toward the right, or in the positive direction. (See Figure 5.3.)
Notice that what is skewing the distribution, or throwing it off center, are the scores toward the right or positive direction. A few individuals have extremely high scores that pull the distribution in that direction. Notice also what this does to the mean, median, and mode. These three measures do not have the same value, nor are they all located at the center of the distribution as they are in a normal distribution. The mode—the score with the highest frequency—is the high point on the distribution. The median divides the distribution in half. The mean is pulled in the direction of the tail of the distribution; that is, the few extreme scores pull the mean toward them and inflate it.
Negatively Skewed Distributions
The opposite of a positively skewed distribution is a negatively skewed distribution—a distribution in which the peak is to the right of the center point and the tail extends toward the left, or in the negative direction. The term negative refers to the direction of the skew. As can be seen in Figure 5.3, in a negatively skewed distribution, the mean is pulled toward the left by the few extremely low scores in the distribution. As in all distributions, the median divides the distribution in half, and the mode is the most frequently occurring score in the distribution.
Figure 5.3 Skewness
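The k^{th} central moment (or moment about the mean) of a data population is:
μ_{k} = (1/N) ∑_{i=1}^{N} (x_{i} − μ)^{k}
Similarly, the k^{th} central moment of a data sample is:
m_{k} = (1/n) ∑_{i=1}^{n} (x_{i} − x̄)^{k}
In particular, the second central moment of a population is its variance.
The skewness of a data population is defined by the following formula, where μ_{2} and μ_{3} are the second and third central moments.
γ_{1} = μ_{3} / μ_{2}^{3/2}
The kurtosis of a univariate population is defined by the following formula, where μ_{2} and μ_{4} are the second and fourth central moments.
γ_{2} = μ_{4} / μ_{2}^{2} − 3
Subtracting 3 makes the kurtosis of a normal distribution equal to 0. Some texts omit the subtraction and report μ_{4} / μ_{2}^{2}, which equals 3 for a normal distribution.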
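The moment-based measures can be computed directly in base R. The sketch below assumes the usual definitions: skewness as the third central moment divided by the second raised to the power 3/2, and kurtosis as the fourth central moment divided by the square of the second, minus 3 (the "excess" convention). The function names and the scores are made up for illustration.

```r
# k-th central moment of a set of scores: the mean of (x - mean(x))^k.
central_moment <- function(x, k) mean((x - mean(x))^k)

# Skewness: third central moment scaled by the second.
skewness <- function(x) central_moment(x, 3) / central_moment(x, 2)^(3 / 2)

# Excess kurtosis: 0 for a normal distribution (convention assumed here).
kurtosis <- function(x) central_moment(x, 4) / central_moment(x, 2)^2 - 3

x <- c(1, 2, 2, 3, 3, 3, 4, 4, 5, 12)   # made-up scores with a long right tail
skewness(x)                             # positive: the tail extends to the right
kurtosis(x)
skewness(c(1, 2, 3, 4, 5))              # 0: a perfectly symmetrical set of scores
```

A positive skewness matches a positively skewed distribution and a negative value matches a negatively skewed one.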
A good reading assignment on Sampling Techniques
There are many different methods through which sampling can be done. Simple random sampling is considered the ideal sampling method for research; however, limited time and money often create the need for other means of sampling.
Probability Methods:
This group of methods is preferred for sampling because it opens the way to the most powerful statistical analyses.
The different probability methods are:
I find statistics rather mischievous. Like lady wisdom in the Book of Proverbs, one endeavours to obtain her riches, but the more one obtains them, the more one becomes detached from the realities of everyday living. Statisticians use complex (sometimes even boring) concepts to explain everyday phenomena. This practice is fine if the audience is fellow detached statisticians, but when the audience is laymen, this elitism loses its logic.
The aim of this course is not to teach you to be a statistician (of those we have plenty). The objective is that you become a questioner of human behaviour: that you look at phenomena and ask how, when, and why they are happening. The course then introduces a few tools that will help you to answer those questions.
Who?
This course is open to all with high-school-level mathematics and statistics knowledge. I have always found that it is not really the level of exposure that determines success in an endeavour, but the will to do. So the main prerequisite for this course is the will to learn and the desire to ask questions of your society. This course is not for those who want to flaunt their statistical wizardry.
How?
The course shall be delivered through a series of lessons that will be published every Monday, Wednesday and Friday on this blog. We will systematically cover topics covered in a typical undergraduate introductory statistics class. We shall also use the R language as a tool for statistical computations. THIS IS NOT AN R CLASS! We shall simply present R as a tool for carrying out statistical analyses. Detailed R tutorials can be downloaded here for those who are so inclined.
What?
What’s the point of all this? Well, as mentioned before, I hope that you will be able to ask probing questions of phenomena that you observe. In statistics we call that exploratory statistics. Ultimately, we ask these questions so that we are better able to deal with these phenomena in the future, be better prepared, take advantage, etc. This is known as predictive analytics.
This is an introductory course. The content is designed so that the statistical concepts are explained very simply. Where concepts may still not be very clear, examples are given to illustrate them. It is the tutor’s hope that this course will whet the students’ appetite for more, so that we can later get into intermediate and even advanced statistical concepts.
Ready to go? Start here.
Range
The simplest measure of variation is the range—the difference between the lowest and the highest score in a distribution. To find the range, simply subtract the lowest score from the highest score.
Table 4.1 Two – Class Score comparisons
Class 1  Class 2 
0  45 
50  50 
100  55 
∑= 150  ∑ = 150 
µ = 50  µ = 50 
In the example above the range for Class 1 is 100 points, whereas the range for Class 2 is 10 points. Thus, the range provides some information concerning the difference in the spread of the distributions. In this simple measure of variation, however, only the highest and lowest scores enter the calculation, and all other scores are ignored.
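The two ranges can be checked in R with the scores from Table 4.1. Note that R's range() returns the lowest and highest scores, so diff() is used to get the distance between them.

```r
class1 <- c(0, 50, 100)    # Class 1 scores from Table 4.1
class2 <- c(45, 50, 55)    # Class 2 scores from Table 4.1

range(class1)              # lowest and highest score: 0 100
diff(range(class1))        # 100
diff(range(class2))        # 10
mean(class1) == mean(class2)   # TRUE: same mean, very different spread
```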
Percentile
The n^{th} percentile of an observation variable is the value that cuts off the first n percent of the data values when it is sorted in ascending order.
quantile(INCOME, c(.32, .57, .98)) #finds the 32nd, 57th and 98th percentiles
Quartile
There are several quartiles of an observation variable. The first quartile, or lower quartile, is the value that cuts off the first 25% of the data when it is sorted in ascending order. The second quartile, or median, is the value that cuts off the first 50%. The third quartile, or upper quartile, is the value that cuts off the first 75%.
quantile(INCOME) #gives the minimum, the first, second and third quartiles, and the maximum
The interquartile range is the difference between the third quartile and the first quartile.
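A short sketch of the interquartile range in R follows. It assumes that INCOME holds the 25 employee incomes of Table 3.1, rebuilt here from their frequencies. The results use R's default quantile method; other methods can give slightly different quartiles.

```r
# Rebuild the 25 incomes of Table 3.1 from the values and their frequencies.
income_values <- c(15000, 20000, 22000, 23000, 25000, 27000, 30000, 32000,
                   35000, 38000, 39000, 40000, 42000, 45000, 1800000)
frequencies   <- c(1, 2, 1, 2, 5, 2, 3, 1, 2, 1, 1, 1, 1, 1, 1)
INCOME <- rep(income_values, times = frequencies)

q <- quantile(INCOME, c(0.25, 0.75))   # first and third quartiles
iqr_by_hand <- unname(q[2] - q[1])     # third quartile minus first quartile
iqr_by_hand                            # 10000
IQR(INCOME)                            # the built-in function gives the same value
```

Unlike the range, the interquartile range ignores the owner's extreme income, because only the middle half of the scores is used.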
Average Deviation and Standard Deviation
More sophisticated measures of variation use all of the scores in the distribution in their calculation. The most commonly used measure of variation is the standard deviation. Most people have heard this term before and may even have calculated a standard deviation if they have taken a statistics class. However, many people who know how to calculate a standard deviation do not really appreciate the information it provides.
To begin, let’s think about what the phrase standard deviation means. Other words that might be substituted for the word standard include average, normal, or usual. The word deviation means to diverge, move away from, or digress. Putting these terms together, we see that the standard deviation means the average movement away from something. But what? It is the average movement away from the center of the distribution—the mean.
The standard deviation, then, is the average distance of all of the scores in the distribution from the mean or central point of the distribution—or, as we shall see shortly, the square root of the average squared deviation from the mean. Think about how you would calculate the average distance of all of the scores from the mean of the distribution. First, you would have to determine how far each score is from the mean; this is the deviation, or difference, score. Then, you would have to average these scores. This is the basic idea behind calculating the standard deviation.
Let’s use these data to calculate the average distance from the mean. We will begin with a calculation that is slightly simpler than the standard deviation, known as the average deviation. The average deviation is essentially what the name implies— the average distance of all of the scores from the mean of the distribution.
First, we determine how far each score is from the mean; this is the deviation score:
X − µ
Then we need to sum the deviation scores. Notice, however, that if we were to sum these scores, they would add to zero. Therefore, we first take the absolute value of the deviation scores (the distance from the mean, irrespective of direction). To calculate the average deviation, we sum the absolute value of each deviation score:
∑|X − µ|
Then we divide by the total number of scores (N) to find the average deviation:
AD = ∑|X − µ| / N
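The steps for the average deviation can be sketched in R using the two classes from Table 4.1. The function name average_deviation is made up for this illustration; base R has no function of that name.

```r
class1 <- c(0, 50, 100)    # Class 1 scores from Table 4.1
class2 <- c(45, 50, 55)    # Class 2 scores from Table 4.1

average_deviation <- function(x) {
  deviations <- x - mean(x)   # deviation scores; these always sum to zero
  mean(abs(deviations))       # sum of absolute deviations divided by N
}

sum(class1 - mean(class1))    # 0, which is why absolute values are needed
average_deviation(class1)     # 100 / 3, about 33.3
average_deviation(class2)     # 10 / 3, about 3.3
```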
Although the average deviation is fairly easy to compute, it is not as useful as the standard deviation because, as we will see in later modules, the standard deviation is used in many other statistical procedures.
The standard deviation is very similar to the average deviation. The only difference is that rather than taking the absolute value of the deviation scores, we use another method to “get rid of” the negative deviation scores—we square the deviation scores.
The formula for the standard deviation is:
σ = √( ∑(X − µ)² / N )
Notice that the formula is similar to that for the average deviation. We determine the deviation scores, square the deviation scores, sum the squared deviation scores, and divide by the number of scores in the distribution. Lastly, we take the square root of that number. Why? Squaring the deviation scores has inflated them. We now need to bring the squared deviation scores back to the same level of measurement as the mean so that the standard deviation is measured on the same scale as the mean.
If, however, you are using sample data to estimate the population standard deviation, then the standard deviation formula must be slightly modified. The modification makes the squared quantity, the sample variance s², an unbiased estimator of the population variance. The modified formula is:
s = √( ∑(X − X̄)² / (N − 1) )
where
s = sample estimate of the population standard deviation
X = each individual score
X̄ = sample mean
N = number of scores in the sample
The main difference is in the denominator—dividing by N – 1 versus N. The reason is that the standard deviation within a small sample may not be representative of the population; that is, there may not be as much variability in the sample as there actually is in the population. We, therefore, divide by N – 1, because dividing by a smaller number increases the standard deviation and thus provides a better estimate of the population standard deviation.
sd(INCOME) #standard deviation of INCOME
var(INCOME) #variance of INCOME
The variance is the square of the standard deviation. Note that sd() and var() in R divide by N – 1, so they return the sample versions.
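The difference between dividing by N and by N – 1 can be seen on the Class 1 scores from Table 4.1. The sketch below computes both versions by hand and compares the second with R's built-in sd() and var().

```r
scores <- c(0, 50, 100)                  # Class 1 scores from Table 4.1
N <- length(scores)
squared_deviations <- (scores - mean(scores))^2

pop_sd    <- sqrt(sum(squared_deviations) / N)         # divide by N
sample_sd <- sqrt(sum(squared_deviations) / (N - 1))   # divide by N - 1

pop_sd                                   # about 40.82
sample_sd                                # 50
all.equal(sample_sd, sd(scores))         # TRUE: sd() divides by N - 1
all.equal(sample_sd^2, var(scores))      # TRUE: the variance is the sd squared
```

With only three scores the two versions differ noticeably; the gap shrinks as N grows.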
3.1 The Why
Organizing data into tables and graphs can help make a data set more meaningful. These methods, however, do not provide as much information as numerical measures. Descriptive statistics are numerical measures that describe a distribution by providing information on the central tendency of the distribution, the width of the distribution, and the distribution’s shape.
A measure of central tendency characterizes an entire set of data in terms of a single representative number.
Measures of central tendency measure the “middleness” of a distribution of scores in three ways: the mean, median, and mode.
3.2 The What
Mean
The most commonly used measure of central tendency is the mean—the average observation in a set of observations. There are two manifestations (yes, I said it!) of the mean; the arithmetic mean and the geometric mean. We are not interested in the geometric mean at this point but you can look at it here if you are interested. We will stick to the arithmetic mean.
Arithmetic mean (simply the mean or average when the context is clear): the sum of a collection of numbers divided by the number of numbers in the collection.
We can calculate the mean for our distribution of exam scores (from the previous lesson) by adding all of the scores together and dividing by the total number of scores. Mathematically, this would be:
µ = ∑X / N
where
µ (pronounced “mu”) represents the symbol for the population mean
∑ represents the symbol for “the sum of”
X represents the individual scores, and
N represents the number of scores in the distribution
To calculate the mean, then, we sum all of the Xs, or scores, and divide by the total number of scores in the distribution (N). You may have also seen this formula represented as follows:
X̄ = ∑X / N
In this case X̄ represents a sample mean.
One of the main shortcomings of the mean is that the mean is influenced by extreme scores (what we sometimes refer to as outliers).
An outlier is an observation point that is distant from other observations.
Outliers can distort the results by giving an inaccurate representation of the distribution of the population.
Median
Another measure of central tendency, the median, is used in situations in which the mean might not be representative of a distribution. Let’s use a different distribution of scores to demonstrate when it might be appropriate to use the median rather than the mean. Imagine that you are considering taking a job with a small computer company. When you interview for the position, the owner of the company informs you that the mean income for employees at the company is approximately Kshs. 100,000 and that the company has 25 employees. Most people would view this as good news. Having learned in a statistics class that the mean might be influenced by extreme scores, you ask to see the distribution of 25 incomes. The distribution is shown below.
Table 3.1 Employee Salaries Distribution
INCOME  FREQUENCY  fX 
15,000  1  15,000 
20,000  2  40,000 
22,000  1  22,000 
23,000  2  46,000 
25,000  5  125,000 
27,000  2  54,000 
30,000  3  90,000 
32,000  1  32,000 
35,000  2  70,000 
38,000  1  38,000 
39,000  1  39,000 
40,000  1  40,000 
42,000  1  42,000 
45,000  1  45,000 
1,800,000  1  1,800,000 
Total  N = 25  ∑fX = 2,498,000 
The mean for this distribution is Kshs. 99,920. Notice that, as claimed, the mean income of company employees is very close to Kshs. 100,000. Notice also, however, that the mean in this case is not very representative of central tendency, or “middleness.” In this distribution, the mean is thrown off center or inflated by one very extreme score of Kshs. 1,800,000 (the income of the company’s owner, needless to say). This extremely high income pulls the mean toward it and thus increases or inflates the mean. Thus, in distributions with one or a few extreme scores (either high or low), the mean will not be a good indicator of central tendency. In such cases, a better measure of central tendency is the median.
The median is the middle score in a distribution after the scores have been arranged from highest to lowest or lowest to highest.
The distribution of incomes in Table 3.1 is already ordered from lowest to highest. To determine the median, we simply have to find the middle score. In this situation, with 25 scores, that would be the 13th score. You can see that the median of the distribution would be an income of Kshs. 27,000, which is far more representative of the central tendency for this distribution of incomes.
Why is the median not as influenced as the mean by extreme scores? Think about the calculation of each of these measures. When calculating the mean, we must add in the atypical income of Kshs. 1,800,000, thus distorting the calculation. When determining the median, however, we do not consider the size of the Kshs. 1,800,000 income; it is only a score at one end of the distribution whose numerical value does not have to be considered in order to locate the middle score in the distribution. The point to remember is that the median is not affected by extreme scores in a distribution because it is only a positional value. The mean is affected because its value is determined by a calculation that has to include the extreme value.
In distributions with an even number of observations, the median is calculated by averaging the two middle scores. In other words, we determine the middle point between the two middle scores.
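The mean and median of Table 3.1 can be reproduced in R by rebuilding the 25 incomes from their frequencies. The four-score vector at the end is a made-up example of the even-numbered case.

```r
# Rebuild the 25 incomes of Table 3.1 from the values and their frequencies.
income_values <- c(15000, 20000, 22000, 23000, 25000, 27000, 30000, 32000,
                   35000, 38000, 39000, 40000, 42000, 45000, 1800000)
frequencies   <- c(1, 2, 1, 2, 5, 2, 3, 1, 2, 1, 1, 1, 1, 1, 1)
INCOME <- rep(income_values, times = frequencies)

sum(income_values * frequencies)   # total of the fX column: 2498000
mean(INCOME)                       # 99920, inflated by the owner's income
median(INCOME)                     # 27000, the 13th of the 25 ordered scores

# With an even number of scores, the median averages the two middle scores.
median(c(20000, 25000, 27000, 30000))   # 26000
```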
Mode
The third measure of central tendency is the mode—the score in a distribution that occurs with the greatest frequency. Sometimes, several scores occur with equal frequency. Thus, a distribution may have two modes (bimodal), three modes (trimodal), or even more. The mode is the only indicator of central tendency that can be used with nominal data. Although it can also be used with ordinal, interval, or ratio data, the mean and median are more reliable indicators of the central tendency of a distribution, and the mode is seldom used.
3.3 The How
R Code
mydata <- read.table("testdata.txt", header = TRUE) #import your dataset (header = TRUE assumes the first row holds the variable names)
#attach(mydata) # In case you want to work with the variables directly
names(mydata) #This shows us all the variable names
#mean(INCOME) #If you use the attach() command, you can call variables directly
#otherwise
mean(mydata$INCOME) #Find the mean income
#mean(mydata$INCOME, na.rm=TRUE) #Remove NA values before computation
INCOME <- mydata$INCOME #keep the column in its own vector for the calls below
median(INCOME, na.rm=TRUE) #returns the middle observation
mode(INCOME) #does something weird!
# the function mode( ) in R returns the storage type of the variable, not the statistical mode
temp <- table(as.vector(INCOME)) #"temp" holds each unique value of INCOME, in sorted order,
#as a name, together with a count of how many times that value occurs.
names(temp)[temp == max(temp)]
#this returns the names of the values that have the highest count in temp
# This happens to be the mode! (It is returned as text; wrap it in as.numeric() if you need a number.)
#R knows you will want to see several summary measures at once
summary(INCOME) # So it supplies the mean and median (but not the mode) in one command
We will discuss the “Min.”, “1st Qu.”, “3rd Qu.” and “Max.” values in the next class. These relate to measures of dispersion. We will also look at range, standard deviation, percentiles, skewness and other similar animals. Until next time.
I have thrown this topic here because I am not very sure where it fits. So what is sampling?
Imagine we want to study the effect of the introduction of new parking rates on the taxi business in Nairobi. We may want to know if it has negatively impacted the business or not. So we decide to interview taxi drivers or taxi owners to get their views on the matter.
There we encounter problem number 1; it may not be feasible to interview ALL taxi drivers to get their views, either because we do not have the time or the money or both. The sum total of all the taxi drivers in Nairobi is called the population.
A population is any entire collection of people, animals, plants or things from which we may collect data. It is the entire group we are interested in, which we wish to describe or draw conclusions about.
Obviously, it would be better if we had the views of all taxi drivers in Nairobi in our study. That way we would be extremely confident with our findings. Alas, that is not possible and, therefore, we have to select a smaller group of taxi drivers whose views we will collect. This smaller group we have chosen is what is known as a sample.
A sample is a group selected from a larger group (population). By studying the sample it is hoped to draw valid conclusions about the larger group.
Here we encounter problem number 2; how do we select a smaller group of taxi drivers without appearing to be biased? The best way is if we were able to choose taxi drivers randomly. If we could get a random sample then no one can accuse us of being biased, right?
Ideally, we would like to choose a simple random sample.
A simple random sample is a subset (I will explain this concept of a subset when we start on probability) of a population chosen so that each member of the population has an equal probability of being selected. It is meant to be an unbiased representation of the group.
Therein lies problem number 3; it is almost impossible to pick a truly random sample. Maybe the only way to do it is if you had the names of all the taxi drivers, put them in a hat and draw your sample at random. But then you will encounter problems when the members of the sample are too widely spread, or if by some coincidence, all of them belong to one company, or some issue like that. The more likely problem is that we do not have the names of all the taxi drivers in Nairobi. These issues increase the likelihood of a sampling error.
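If we did have the full list of names, the "names in a hat" procedure is easy to mimic in R. The sketch below is purely illustrative: the list of 500 drivers, the sample size of 30 and the seed are all invented.

```r
set.seed(42)                             # arbitrary seed, for reproducibility only
drivers <- paste0("driver_", 1:500)      # hypothetical list of all taxi drivers
srs <- sample(drivers, size = 30)        # draw 30 names without replacement

length(srs)                              # 30
anyDuplicated(srs)                       # 0: nobody is drawn twice
```

Because sample() gives every name the same chance of being drawn, this is a simple random sample of the list.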
Sampling error occurs when the sample does not accurately reflect the population it is supposed to represent. We want to minimize sampling error as much as possible.
It would be easier if we had a method for dividing up the population into manageable units from which we could draw our sample. This is called stratified random sampling. We divide the population into strata (units or divisions) based on some meaningful criteria. In our case, we could divide the population into geographic regions, e.g. the CBD, Upperhill, Westlands, Hurlingham. Then we take a random sample of taxi drivers in each area.
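A stratified random sample can be sketched in base R by splitting the list of drivers by area and sampling within each area. Everything in the example is hypothetical: the 340 drivers, the three areas used, their sizes and the 10% sampling fraction.

```r
set.seed(7)   # arbitrary seed, for reproducibility only

# Hypothetical list of drivers with the area (stratum) each one works in.
frame <- data.frame(
  driver = paste0("driver_", 1:340),
  area   = rep(c("CBD", "Upperhill", "Westlands"), times = c(160, 80, 100)),
  stringsAsFactors = FALSE
)

# Draw a 10% simple random sample of row numbers within each stratum.
rows_by_area <- split(seq_len(nrow(frame)), frame$area)
picked <- unlist(lapply(rows_by_area, function(idx) {
  idx[sample.int(length(idx), size = round(0.10 * length(idx)))]
}))
stratified <- frame[picked, ]

table(stratified$area)   # CBD 16, Upperhill 8, Westlands 10
```

Sampling the same fraction from every area keeps each area represented in proportion to its size.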
Remember that we are not interested in just the sample; we want to know the effect of the parking fee increase on the entire taxi business. But we are only getting the views of a smaller group (subset) of taxi drivers. Therefore, once we get the results, we have to infer what the views of the whole population would be.
Statistical inference makes use of information from a sample to draw conclusions (inferences) about the population from which the sample was taken.
Election poll results are an example of statistical inference.
Let’s define two more terms before we proceed:
A parameter is a value, usually unknown (and which we, therefore, are trying to estimate), used to represent a certain population characteristic. The mean, for instance, is a population parameter used to indicate the average value of a particular quantity. In statistics, parameters are represented by Greek letters (for example µ for mean).
A statistic is a quantity calculated from a sample of data. It is used to give information about unknown values in the corresponding population. Statistics are usually represented by Roman letters (for example s for standard deviation).
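To see the difference between a parameter and a statistic, here is a small simulation. The population is entirely made up (10,000 simulated daily parking costs with a mean near 300), so that, unlike in real life, we can actually compute the parameter and compare it with the statistic.

```r
set.seed(7)
# Hypothetical population: simulated daily parking costs for 10,000 drivers
population <- rnorm(10000, mean = 300, sd = 50)
mu <- mean(population)          # parameter: the population mean (normally unknown)

smp   <- sample(population, size = 100)
x_bar <- mean(smp)              # statistic: the sample mean, our estimate of mu
s     <- sd(smp)                # statistic: the sample standard deviation

c(mu = mu, x_bar = x_bar, s = s)
```

The sample mean will usually be close to, but not exactly equal to, the population mean.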
Back to the lesson: Organizing Data
2.1 The Why
Sometimes we have a small sample to deal with. By just looking at the data we are able to describe it without having to resort to various statistical techniques. Often, though, that is not the case. More often than not we will be working with large data sets. For us to be able to “look” at the data, we would have to arrange it in a meaningful way.
Organizing data helps us make preliminary descriptions of the data. It also gives us an indication of the kind of techniques we would need to apply in order to make more sense of the dataset in front of us.
Visualizing the Dataset
Sometimes we want to see the data. Most of us have used Excel. When you open a dataset in Excel, you have already organized data! Excel is a great tool and it organizes datasets into rows and columns. The columns represent variables (remember lesson 1) while the rows represent observations.
You can also view datasets in SPSS, SAS and R. Since we are using R in this course, to view a dataset, import it and assign it to the data frame mydata:
> mydata <- read.table("your_data_set") # import your dataset
> mydata # view your dataset
RStudio has an even better way of displaying datasets. Can you try importing a dataset in RStudio without using the above code?
Frequency Distributions
You may have data in a list-like form as below:
Table 2.1 Scores of Students in the Class
23  45  6  73  23  23  45  50  51  34 
As we said, we organize data so that meaningful conclusions can be drawn from it. One way to do that would be to sort the data from lowest to highest or vice versa. Once this is accomplished (see Table 2.2), we can try to condense the data into a frequency distribution: a table in which all of the scores are listed along with the frequency with which each occurs. We can also show a relative frequency distribution, which indicates the proportion of the total observations included in each score. When a relative frequency is multiplied by 100, it is read as a percentage.
Table 2.2 Frequency Distribution
Score  6  23  34  45  50  51  73  
Frequency  1  3  1  2  1  1  1  N=10 
Relative Frequency  0.1  0.3  0.1  0.2  0.1  0.1  0.1  1.0 
The frequency distribution is a way of presenting data that makes the pattern of the data easier to see. Frequency distributions are great for nominal and ordinal data.
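The sorting and counting described above can be done in a few lines of R. This is a minimal sketch using the ten student scores from Table 2.1; table() and prop.table() are base R functions.

```r
studentscores <- c(23, 45, 6, 73, 23, 23, 45, 50, 51, 34)

sort(studentscores)                  # scores from lowest to highest
freq     <- table(studentscores)     # how many times each distinct score occurs
rel_freq <- prop.table(freq)         # each frequency divided by the total (N = 10)

# Stack the two rows so the result resembles a frequency distribution table
rbind(Frequency = freq, "Relative Frequency" = rel_freq)
```

Multiplying rel_freq by 100 gives the percentages.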
When dealing with interval and ratio data (especially when the dataset is very large), we group the observations and create a class interval frequency distribution. We can combine individual scores into categories, or intervals, and list them along with the frequency of scores in each interval. In our exam score example, the scores fall between 0 and 80, an 80-point range. A rule of thumb when creating class intervals is to have between 10 and 20 categories, although a dataset as small as ours can do with fewer. A quick method of calculating what the width of the interval should be is to subtract the smallest score from the largest score and then divide by the number of intervals you would like.
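The quick width calculation can be sketched as follows. The choice of 8 intervals is an assumption made for this small example, not a rule.

```r
studentscores <- c(23, 45, 6, 73, 23, 23, 45, 50, 51, 34)

score_range <- diff(range(studentscores))   # largest minus smallest score: 73 - 6
n_intervals <- 8                            # assumed number of intervals
width       <- score_range / n_intervals    # raw interval width
width
```

The raw width is then rounded up to a convenient value (here, 10) so that the interval boundaries are easy to read.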
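Table 2.3 Class Interval Frequency Distribution
Interval  0-10  10-20  20-30  30-40  40-50  50-60  60-70  70-80  
Frequency  1  0  3  1  2  2  0  1  N=10 
Relative Frequency  0.1  0.0  0.3  0.1  0.2  0.2  0.0  0.1  1.0 
Each interval includes its lower boundary but not its upper boundary, so a score of 50 is counted in the 50-60 interval.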
Creating Frequency Distributions with R
studentscores <- c(23,45,6,73,23,23,45,50,51,34) # create a vector of the student scores
range(studentscores) # displays the minimum and maximum score
bin <- seq(0, 80, by=10) # creates the interval boundaries 0, 10, ..., 80
bin # displays the values of our bin vector
studentscores.cut <- cut(studentscores, bin, right=FALSE) # assigns each score to an interval; right=FALSE includes the lower boundary, not the upper
studentscores.frequency <- table(studentscores.cut) # creates our frequency distribution
cbind(studentscores.frequency) # displays the frequencies as a single column
DATA VISUALIZATION
Frequency distributions can provide valuable information, but sometimes a picture is of greater value. Several types of pictorial representations can be used to represent data. The choice depends on the type of data collected and what the researcher hopes to emphasize or illustrate. The most common types of data visualization are graphs: pie charts, bar graphs, histograms and frequency polygons (line graphs).
There are new and very powerful data visualization tools. You can research these tools if you would like to delve a little more into this area. For this class, we shall stick to the basics.
Bar Graphs and Histograms
Bar graphs and histograms are frequently confused. When the data collected are on a nominal scale, or if the variable is a qualitative variable (a categorical variable for which each value represents a discrete category), then a bar graph is most appropriate. A bar graph is a graphical representation of a frequency distribution in which vertical bars are centered above each category along the x-axis and are separated from each other by a space, indicating that the levels of the variable represent distinct, unrelated categories.
If the variable is a quantitative variable (the scores represent a change in quantity), or if the data collected are ordinal, interval, or ratio in scale, then a histogram can be used. A histogram is also a graphical representation of a frequency distribution in which vertical bars are centered above scores on the x-axis, but in a histogram the bars touch each other to indicate that the scores on the variable represent related, increasing values.
In both a bar graph and a histogram, the height of each bar indicates the frequency for that level of the variable on the x-axis. The spaces between the bars on the bar graph indicate not only the qualitative differences among the categories but also that the order of the values of the variable on the x-axis is arbitrary. In other words, the categories on the x-axis in a bar graph can be placed in any order. The fact that the bars are contiguous in a histogram indicates not only the increasing quantity of the variable but also that the variable has a definite order that cannot be changed.
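A quick way to see the difference is to draw both in base R. In this sketch the taxi company labels are made up for illustration; the scores are the ten student scores used earlier.

```r
# Hypothetical nominal data: the company each of 8 drivers works for
company      <- c("A", "B", "A", "C", "B", "A", "C", "A")
company_freq <- table(company)
barplot(company_freq, xlab = "Company", ylab = "Frequency")   # bars are separated

# Quantitative data: student scores grouped into intervals of width 10
studentscores <- c(23, 45, 6, 73, 23, 23, 45, 50, 51, 34)
h <- hist(studentscores, breaks = seq(0, 80, by = 10), right = FALSE,
          xlab = "Score", main = "Student scores")            # bars touch
h$counts                                                      # frequency per interval
```

Reordering the categories in company would not change the meaning of the bar graph, whereas the intervals of the histogram have a fixed order.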
Pie Chart
Like bar graphs, pie charts are used to represent categorical variables. While bar graphs display the frequencies, pie charts show the proportions (relative frequencies). Most of the time, the relative frequencies are shown as percentages.
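A minimal pie chart sketch in base R, again using made-up company labels, could look like this:

```r
# Hypothetical nominal data: the company each of 8 drivers works for
company <- c("A", "B", "A", "C", "B", "A", "C", "A")

# Convert frequencies to percentages (relative frequency x 100)
pct <- round(100 * prop.table(table(company)), 1)

# Label each slice with its category and percentage
pie(pct, labels = paste0(names(pct), " (", pct, "%)"))
```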
Frequency Polygons (Line Graphs)
We can also depict the data in a histogram as a frequency polygon, a line graph of the frequencies of individual scores or intervals. Again, scores (or intervals) are shown on the x-axis and frequencies on the y-axis. Once all the frequencies are plotted, the data points are connected. Frequency polygons are appropriate when the variable is quantitative or the data are ordinal, interval, or ratio. In this respect, frequency polygons are similar to histograms. Frequency polygons are especially useful for continuous data (such as age, weight, or time) in which it is theoretically possible for values to fall anywhere along the continuum. For example, an individual can weigh 120.5 pounds or be 35.5 years of age. Histograms are more appropriate when the data are discrete (measured in whole units), for example, number of college classes taken or number of siblings.
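One possible way to draw a frequency polygon in base R is to let hist() do the counting without plotting, then connect the interval midpoints with a line. This is just a sketch using the student scores from earlier.

```r
studentscores <- c(23, 45, 6, 73, 23, 23, 45, 50, 51, 34)

# Count scores per interval without drawing the histogram
h <- hist(studentscores, breaks = seq(0, 80, by = 10), right = FALSE, plot = FALSE)

# Plot the frequency of each interval at its midpoint and join the points
plot(h$mids, h$counts, type = "b",
     xlab = "Score (interval midpoint)", ylab = "Frequency",
     main = "Frequency polygon of student scores")
```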
3D Plots
Sometimes we may want to represent complex relationships using graphs. While line graphs are used to show the relationship between two variables, we may want to see the relationship between three variables. We can graph this using 3D graphing tools. Excel does not have this feature out of the box and it is very difficult to do directly in R. Instead, we do this using a package in R.
Packages are collections of R functions, data and compiled code in a well-defined format.
R Commander (Rcmdr) and ggplot2 are touted to be the best plotting packages for R. Installing packages will be part of the assignment for this lesson.
Ensure that you have R and RStudio installed on your PC. For installation instructions go to:
Click here to download RStudio
Once you have successfully installed R and RStudio, create a folder in a place that is easily accessible and call it r_working_directory. Then go to the desktop and right click on the R icon. Select properties. In the “Start In:” text box put the location of your r_working_directory folder
(e.g. C:\Users\Kevin\Documents\r_working_directory)
Be sure to read the R tutorial. Download R manual. Remember that R has an inbuilt help function that is accessible within the console, either by typing ? followed by whatever it is you want to know more about or by using the help() function. To test this out, type ?help in the R console.
To start us off, we will work with a simple data set. Sign up here to receive the dataset. Put your extracted data into your r_working_directory. Open RStudio, import the first dataset (testdata.txt) and list all the variable names (column names), then import the second dataset and do the same thing. Having problems? Try the help function and see where you are going wrong. Try opening the dataset in Notepad and look at the data. Then try to open each of them in Excel (Hint: you have to import the txt file into Excel).
Some of the code can be used as below.
The R Code
# First we get our data
mydata <- read.table("testdata.txt")
# ?read.table
# Help with the function read.table(). Will also list all the arguments associated with the function
# mydata <- read.csv("testdata.csv", sep=",", header=TRUE)
# sometimes your data is in csv format and you want to tell R as much
names(mydata) # lists all the variable names
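If you want to experiment before you receive the dataset, the following self-contained sketch writes a tiny made-up file to a temporary location and reads it back. The column names and values are invented, and header = TRUE assumes the first line of the file holds the variable names (check whether that is true of your own file).

```r
# Write a small, made-up whitespace-separated file to a temporary location
tmp <- tempfile(fileext = ".txt")
writeLines(c("name score", "Amina 23", "Brian 45"), tmp)

# header = TRUE tells R that the first line contains the column names
demo <- read.table(tmp, header = TRUE)
names(demo)    # lists the variable names
demo           # view the dataset
```

Without header = TRUE, R would treat the first line as data and name the columns V1, V2 and so on.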
Submit the following:
The objective of statistics can be divided into three parts:
Description begins with careful observation. It involves carefully observing behavior in order to describe it. Description allows us to learn about behavior and when it occurs. Let’s say, for example, that you were interested in the channel-surfing behavior of males and females. Careful observation and description would be needed in order to determine whether or not there were any gender differences in channel-surfing. Description allows us to observe that two events are systematically related to one another. Without description as a first step, predictions cannot be made.
Explanation allows us to identify the causes that determine when and why a behavior occurs. In order to explain a behavior, we need to demonstrate that we can manipulate the factors needed to produce or eliminate the behavior. For example, in our channel-surfing example, if gender predicts channel-surfing, what might cause it? It could be genetic or environmental. Maybe males have less tolerance for commercials and thus channel-surf at a greater rate. Maybe females are more interested in the commercials. Maybe the attention span of females is greater. Maybe something associated with having a Y chromosome increases channel-surfing, or something associated with having two X chromosomes leads to less channel-surfing. Obviously the possible explanations are numerous and varied. As scientists, we test these possibilities to identify the best explanation of why a behavior occurs.
Prediction allows us to identify the factors that indicate when an event or events will occur. In other words, knowing the level of one variable allows us to predict the approximate level of the other variable.
Variable: An event or behavior that has at least two values
We know that if one variable is present at a certain level, then there is a greater likelihood that the other variable will be present at a certain level. For example, if we observed that males channel-surf with greater frequency than females, we could then make predictions about how often males and females might change channels when given the chance.
1.2 The What
So we want to describe or explain or predict. The question is what? We are interested in phenomena; that is, an event, a characteristic or behavior. We could be interested in the channel surfing behavior of people or the relationship between traffic levels and weather patterns. To do that we have to make observations; to observe the phenomena we want to describe or explain or predict.
An observation is an instance of a variable
This is a good time to talk about a variable. A variable is a phenomenon that can take on two or more values. The weight of the people in a classroom, for instance, is a variable. Each person (observation) will have their own weight (value of the variable).
Data, therefore, is a set of observations that is arranged in a meaningful way. We shall talk about the arrangement of data later. For now let us focus on the observations themselves.
Data is the focal point of all our statistical activity. How much and what kind of data we have will greatly influence what we can do. It may seem that we are spending a lot of time on what may be considered unnecessary but these are the building blocks.
Characteristics or Properties of Data
These properties include identity, magnitude, equal unit size, and absolute zero. When a measure has the property of identity, objects that are different receive different scores. For example, if members of this class had different heights, they would all receive different measurements. Measurements have the property of magnitude (also called ordinality) when the ordering of the numbers reflects the ordering of the variable. In other words, numbers are assigned in order so that some numbers represent more or less of the variable being measured than others.
Measurements have an equal unit size when a difference of 1 is the same amount throughout the entire scale. For example, the difference between people who are 64 kilos and 65 kilos is the same as the difference between people who are 72 kilos and 73 kilos. The difference in each situation (1 kg) is identical. Notice how this differs from the property of magnitude. Were we to simply line up and rank a group of individuals based on their weight, the scale would have the properties of identity and magnitude, but not equal unit size. Can you think about why this would be so? We would not actually measure people’s weight in kilos, but simply order them in terms of how big they appear, from smallest (the person receiving a score of 1) to biggest (the person receiving the highest score). Thus, our scale would not meet the criteria of equal unit size. In other words, the difference in weight between the two people receiving scores of 1 and 2 might not be the same as the difference in weight between the two people receiving scores of 3 and 4.
Lastly, measures have an absolute zero when assigning a score of zero indicates an absence of the variable being measured. For example, bank account balance would have the property of absolute zero because a score of 0 on this measure would mean an individual has no money in the bank. However, a score of 0 does not always indicate an absolute zero. As an example, think about the temperature scale in degrees Celsius or Fahrenheit. That measurement scale has a score of 0 (the thermometer can read 0 degrees), but does that score indicate an absence of temperature? No, it indicates a very cold temperature. Hence, it does not have the property of absolute zero.
SCALES OF MEASUREMENT
Why are the properties of data important? They are important because data in itself might not be very useful. If we have a bunch of data (please do not use this phrase outside of this class!) we want to do something with it. That “something” is called manipulation. I am not talking about the evil and conniving manipulation of soap opera villains. Manipulation is simply the use of some techniques to convert one or more quantities into another quantity or quantities.
Every January, people are always pledging to join the gym because they have “added weight”. This means they had a starting weight and then gained some more weight to total into the weight they now have (say 65kg + 2kg = 67kg). Conversely, let’s say that a group of men went for a workshop on how to become better men in society. Can we say they have added “manness”? No. The quantity gender cannot be added. Can you explain why that is from the qualities of data above?
As noted previously, the level or scale of measurement depends on the properties of the data. There are four scales of measurement (nominal, ordinal, interval, and ratio), and each of these scales has one or more of the properties described in the previous section. As we will see later on, it is important to establish the scale of measurement of your data in order to determine the appropriate statistical test to use when analyzing the data.
A nominal scale is one in which objects or individuals are broken into categories that have no numerical properties. Nominal scales have the characteristic of identity but lack the other properties. Variables measured on a nominal scale are often referred to as categorical variables because the measuring scale involves dividing the data into categories. However, the categories carry no numerical weight. Some examples of categorical variables, or data measured on a nominal scale, include ethnicity, gender, and political affiliation.
An ordinal scale is one in which objects or individuals are categorized and the categories form a rank order along a continuum. Data measured on an ordinal scale have the properties of identity and magnitude but lack equal unit size and absolute zero. Ordinal data are often referred to as ranked data because the data are ordered from highest to lowest, or biggest to smallest. For example, the number ranks students are given in school based on performance in an exam (number 1, 2, etc.) is an ordinal scale. This variable would carry identity and magnitude because each individual receives a rank (a number) that carries identity, and beyond simple identity it conveys information about order or magnitude (how many students performed better or worse in the class).
An interval scale is one in which the units of measurement (intervals) between the numbers on the scale are all equal in size. When using an interval scale, the properties of identity, magnitude, and equal unit size are met. For example, the temperature scale is an interval scale of measurement. A given temperature carries identity (days with different temperatures receive different scores on the scale), magnitude (cooler days receive lower scores and hotter days receive higher scores), and equal unit size (the difference between 20 and 21 degrees is the same as that between 30 and 31 degrees.) However, the temperature scale does not have an absolute zero. Because of this, we are not able to form ratios based on this scale (for example, 50 degrees is not twice as hot as 25 degrees).
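A short calculation shows why ratios mislead on an interval scale. The sketch below assumes the temperatures are in degrees Celsius and converts them to kelvin, a temperature scale that does have an absolute zero.

```r
temp_c <- c(25, 50)                 # two temperatures, assumed to be in degrees Celsius

temp_c[2] / temp_c[1]               # 2, but this ratio is not meaningful

temp_k <- temp_c + 273.15           # the same temperatures in kelvin
temp_k[2] / temp_k[1]               # about 1.08: 50 degrees is nowhere near "twice as hot"

diff(temp_c)                        # differences, however, are meaningful: 25 degrees
```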
A ratio scale is one in which, in addition to order and equal units of measurement, there is an absolute zero that indicates an absence of the variable being measured. Ratio data have all four properties of measurement—identity, magnitude, equal unit size, and absolute zero. Examples of ratio scales of measurement include weight, time, and height. Each of these scales has identity (individuals who weigh different amounts would receive different scores), magnitude (those who weigh less receive lower scores than those who weigh more), and equal unit size (1 kg is the same weight anywhere along the scale and for any person using the scale). These scales also have an absolute zero, which means a score of zero reflects an absence of that variable. This also means that ratios can be formed. For example, a weight of 100 kg is twice as much as a weight of 50 kg.
Table 1.1: SCALES OF MEASUREMENT
Scale  Nominal  Ordinal  Interval  Ratio
Example  Ethnicity, Religion, Gender  Class rank, Letter grade  Temperature  Weight, Height, Time
Properties  Identity  Identity, Magnitude  Identity, Magnitude, Equal unit size  Identity, Magnitude, Equal unit size, Absolute zero
Mathematical Operations  None  Rank order  Add, Subtract, Multiply, Divide  Add, Subtract, Multiply, Divide
Typical Statistics Used  Mode, Chi-square  Mode, Median, Wilcoxon test  Mode, Median, Mean, t test, ANOVA  Mode, Median, Mean, t test, ANOVA
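The four scales map loosely onto data types in R. The sketch below uses made-up values; treating nominal data as a factor and ordinal data as an ordered factor is a common convention rather than the only way to do it.

```r
# Nominal: categories with identity only (hypothetical party labels)
party <- factor(c("A", "B", "A", "C"))
table(party)                          # counting categories is meaningful

# Ordinal: categories with a rank order (letter grades, C lowest, A highest)
grade <- factor(c("B", "A", "C", "A"), levels = c("C", "B", "A"), ordered = TRUE)
grade[1] < grade[2]                   # TRUE: a B ranks below an A

# Interval: equal units but no absolute zero (temperatures, assumed Celsius)
temp_c <- c(20, 25, 30)
diff(temp_c)                          # differences are meaningful

# Ratio: equal units and an absolute zero (weights in kg)
weight_kg <- c(50, 100)
weight_kg[2] / weight_kg[1]           # ratios are meaningful: twice as heavy
```

Trying to take the mean of party or grade gives a warning and NA, which is R's way of saying the operation does not suit that scale.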
Another means of classifying variables is in terms of whether they are discrete or continuous in nature. Discrete variables usually consist of whole-number units or categories. They are made up of chunks or units that are detached and distinct from one another. A change in value occurs a whole unit at a time, and decimals do not make sense with discrete scales. Most nominal and ordinal data are discrete. For example, gender, political party, and ethnicity are discrete scales. Some interval or ratio data can be discrete. For example, the number of children someone has would be reported as a whole number (discrete data), yet it is also ratio data (you can have a true zero and form ratios).
Continuous variables usually fall along a continuum and allow for fractional amounts. The term continuous means that it “continues” between the whole-number units. Examples of continuous variables are age (22.7 years), height (64.5 inches), and weight (113.25 kg). Most interval and ratio data are continuous in nature.
1.3 The How
Psychiatrists say that we cannot remember the first three years of our lives. Our minds just blot those memories out. For most of us, it is not just these memories that we forget: where we put our keys, our friends’ birthdays and that all-important meeting are just some of the things we regularly forget. Man is, by his very nature, inadequate. That is why he built machines!
All throughout school, you have been trained to memorize concepts. This is not one of those classes. I do not want you to memorize anything. I want you to understand the concept. Anything else can always be found when needed. When you understand how to do it, you can always find the means and the tools to do the job required. Especially in exploratory analysis (you don’t have to know what this is now), the job is in trying to figure out what you want to do, rather than the doing itself.
I will try as much as possible to provide reference material that further explains the concepts introduced in this course. Nevertheless, we shall focus on the practical application of the concepts rather than the theoretical.
TOOLS
As mentioned earlier, man built tools to help make his work easier. Therefore, in this class, we shall also make use of statistical tools/packages to help make our work easier. For statistical analysis, we have several options:
The above 5 tools are the most commonly used tools for statistical analysis. Each one has its own advantages and disadvantages and the choice of which one to use will depend on many factors including expertise of the user, size of task, cost, etc. In this class we shall be using R. There are many reasons for using R but the main reason we are using it in this class is because it is free while all the others have to be purchased.
Please see our R Tutorials. For this lesson, we need to be able to install and run R, create data and import data into R.
Next: Descriptive Statistics I – Organizing Data