# LESSON 2: ORGANIZING DATA

X.X Sampling

I have thrown this topic here because I am not very sure where it fits. So what is sampling?

Imagine we want to study the effect of the introduction of new parking rates on the taxi business in Nairobi. We may want to know if it has negatively impacted the business or not. So we decide to interview taxi drivers or taxi owners to get their views on the matter.

There we encounter problem number 1: it may not be feasible to interview ALL taxi drivers to get their views, because we do not have the time, the money, or both. All the taxi drivers in Nairobi, taken together, are called the population.

A population is any entire collection of people, animals, plants or things from which we may collect data. It is the entire group we are interested in, which we wish to describe or draw conclusions about.

Obviously, it would be better if we had the views of all taxi drivers in Nairobi in our study. That way we would be extremely confident in our findings. Alas, that is not possible and, therefore, we have to select a smaller group of taxi drivers whose views we will collect. This smaller group we have chosen is what is known as a sample.

A sample is a group selected from a larger group (the population). By studying the sample, we hope to draw valid conclusions about the larger group.

Here we encounter problem number 2: how do we select a smaller group of taxi drivers without appearing to be biased? The best way would be to choose taxi drivers randomly. If we could get a random sample, then no one could accuse us of being biased, right?

Ideally, we would like to choose a simple random sample.

A simple random sample is a subset (I will explain this concept of a subset when we start on probability) of a population in which each member of the population has an equal probability of being chosen. It is meant to be an unbiased representation of the group.
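If a complete list of drivers were available, drawing a simple random sample in R could look like the sketch below. The roster `drivers` is made up for illustration (hypothetical IDs, not real data), and the sample size of 50 is an arbitrary choice. `sample()` draws without replacement by default, so every driver has the same chance of being picked and nobody is picked twice.

```r
# Hypothetical roster of 500 taxi drivers (made-up IDs, for illustration only)
drivers <- paste0("driver_", 1:500)

set.seed(42)                        # fix the seed so the draw can be repeated
srs <- sample(drivers, size = 50)   # 50 drivers, drawn without replacement

length(srs)          # number of drivers selected
anyDuplicated(srs)   # 0 means no driver was selected twice
```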

Therein lies problem number 3: it is almost impossible to pick a truly random sample. Maybe the only way to do it would be to have the names of all the taxi drivers, put them in a hat and draw your sample at random. But then you will encounter problems when the members of the sample are too widely spread, or if, by some coincidence, all of them belong to one company, or some issue like that. The more likely problem is that we do not have the names of all the taxi drivers in Nairobi. These issues increase the likelihood of a sampling error.

Sampling error is the difference between what the sample shows and what is true of the population; it arises when the sample does not accurately reflect the population it is supposed to represent. We want to minimize sampling error as much as possible.

It would be easier if we had a method for dividing up the population into manageable units from which we could draw our sample. This is called stratified random sampling. We divide the population into strata (units or divisions) based on some meaningful criteria. In our case, we could divide the population into geographic regions, e.g. the CBD, Upperhill, Westlands, Hurlingham. Then we take a random sample of taxi drivers in each area.

Remember that we are not interested in just the sample: we want to know the effect of the parking fee increase on the entire taxi business. But we are only getting the views of a smaller group (subset) of taxi drivers. Therefore, once we get the results, we have to infer what they tell us about the population.
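A rough sketch of stratified random sampling in R is shown below. Everything about the data is an assumption made for illustration: the `population` data frame, the number of drivers in each region, and the choice to sample 10% of every stratum. The idea is simply to split the population by region and then draw a random sample inside each region.

```r
# Hypothetical population: 400 drivers across three regions (made-up counts)
set.seed(1)
population <- data.frame(
  driver = paste0("driver_", 1:400),
  region = rep(c("CBD", "Upperhill", "Westlands"), times = c(160, 100, 140))
)

# Group the row numbers by region (one group per stratum)
rows_by_region <- split(seq_len(nrow(population)), population$region)

# Draw 10% of the rows at random within each stratum
picked <- unlist(
  lapply(rows_by_region,
         function(rows) sample(rows, size = round(0.1 * length(rows)))),
  use.names = FALSE
)
stratified <- population[picked, ]

table(stratified$region)   # number of drivers sampled per region
```

Sampling the same fraction from every region keeps each region's share of the sample equal to its share of the population.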

Statistical inference makes use of information from a sample to draw conclusions (inferences) about the population from which the sample was taken.

Election poll results are an example of statistical inference.

Let’s define two more terms before we proceed:

A parameter is a value, usually unknown (and which we, therefore, are trying to estimate), used to represent a certain population characteristic. The mean, for instance, is a population parameter used to indicate the average value of a particular quantity. In statistics, parameters are represented by Greek letters (for example µ for mean).

A statistic is a quantity calculated from a sample of data. It is used to give information about unknown values in the corresponding population. Statistics are usually represented by Roman letters (for example s for standard deviation).
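To make the difference between a parameter and a statistic concrete, here is a small sketch in R. The population of daily earnings is simulated (made-up numbers, not real data), which lets us do something that is normally impossible: compute the true population mean and compare it with the estimate from a sample.

```r
# Made-up population: daily earnings (KSh) of 1,000 hypothetical drivers
set.seed(7)
earnings <- round(rnorm(1000, mean = 3000, sd = 500))

mu <- mean(earnings)                       # parameter: population mean (usually unknown)

sample_earnings <- sample(earnings, 50)    # simple random sample of 50 drivers
x_bar <- mean(sample_earnings)             # statistic: sample mean, our estimate of mu
s <- sd(sample_earnings)                   # statistic: sample standard deviation

c(mu = mu, x_bar = x_bar, s = s)
x_bar - mu                                 # how far this sample's estimate is from mu
```

Running the sketch with a different seed gives a different sample and therefore a different value of `x_bar`, while `mu` stays the same.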

Back to the lesson: Organizing Data

2.1 The Why

Sometimes we have a small sample to deal with. Just by looking at the data we can describe it without resorting to statistical techniques. Often, though, that is not the case: more often than not we will be working with large data sets. To be able to “look” at the data, we have to arrange it in a meaningful way.

Organizing data helps us make preliminary descriptions of the data, and it also gives us an indication of the kind of techniques we will need to apply in order to make more sense of the dataset in front of us.

Visualizing the Dataset

Sometimes we want to see the data. Most of us have used Excel. When you open a dataset in Excel, you have already organized data! Excel is a great tool and it organizes datasets into rows and columns. The columns represent variables (remember lesson 1) while the rows represent observations.

You can also view datasets in SPSS, SAS and R. Since we are using R in this course, to view a dataset, import it and assign it to the data frame mydata:

mydata <- read.table("your_data_set") # import your dataset (use the path to your own file)

mydata # view your dataset
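Printing a whole dataset is only practical when it is small. For larger ones, a few base R functions give a quick overview. In the sketch below, the built-in `mtcars` data frame stands in for an imported dataset (that substitution is an assumption, so the example runs without a file).

```r
mydata <- mtcars   # built-in data frame, standing in for an imported file

head(mydata)       # first six rows (observations)
str(mydata)        # every column (variable) with its type and first few values
dim(mydata)        # number of rows and number of columns
```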

RStudio has an even better way of displaying datasets. Can you try importing a dataset in RStudio without using the above code?

Frequency Distributions

You may have a list-like form of data as below:

Table 2.1 Scores of Students in the Class

 23 45 6 73 23 23 45 50 51 34

As we said, we organize data so that meaningful conclusions can be drawn from it. One way to do that is to sort the data from lowest to highest or vice versa. Once this is accomplished (see Table 2.2), we can condense the data into a frequency distribution: a table in which all of the scores are listed along with the frequency with which each occurs. We can also show a relative frequency distribution, which gives the proportion of the total observations that fall on each score. When a relative frequency is multiplied by 100, it is read as a percentage.

Table 2.2 Frequency Distribution

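| Score | Frequency | Relative Frequency |
|-------|-----------|--------------------|
| 6     | 1         | 0.1                |
| 23    | 3         | 0.3                |
| 34    | 1         | 0.1                |
| 45    | 2         | 0.2                |
| 50    | 1         | 0.1                |
| 51    | 1         | 0.1                |
| 73    | 1         | 0.1                |
| Total | N=10      | 1.0                |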
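The same frequency and relative frequency distributions can be produced in R. This is a minimal sketch using the scores from Table 2.1; the names `freq` and `rel_freq` are just chosen for this example.

```r
studentscores <- c(23, 45, 6, 73, 23, 23, 45, 50, 51, 34)

sort(studentscores)               # scores from lowest to highest

freq <- table(studentscores)      # how often each distinct score occurs
rel_freq <- prop.table(freq)      # each frequency divided by the total (N = 10)

# Put frequency, relative frequency and percentage side by side
cbind(Frequency = freq,
      RelativeFrequency = rel_freq,
      Percent = 100 * rel_freq)
```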

The frequency distribution is a way of presenting data that makes the pattern of the data easier to see. Frequency distributions are great for nominal and ordinal data.

When dealing with interval and ratio data (especially when the dataset is very large), we group the observations and create a class interval frequency distribution. We can combine individual scores into categories, or intervals, and list them along with the frequency of scores in each interval. In our exam score example, the scores range from 6 to 73, so intervals running from 0 to 80 (an 80-point range) cover all of them. A rule of thumb when creating class intervals is to have between 10 and 20 categories, although a very small dataset like ours can do with fewer. A quick method of calculating what the width of the interval should be is to subtract the smallest score from the largest score and then divide by the number of intervals you would like.
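As a worked example of the quick method for the interval width, here is the arithmetic in R. The choice of 8 intervals is an assumption made for this small dataset.

```r
studentscores <- c(23, 45, 6, 73, 23, 23, 45, 50, 51, 34)

n_intervals <- 8                                          # number of intervals we would like
score_range <- max(studentscores) - min(studentscores)    # 73 - 6 = 67
width <- score_range / n_intervals                        # 67 / 8 = 8.375

width
```

A width of 8.375 is awkward to read, so in practice we round up to a convenient number such as 10.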

Table 2.3 Class Interval Frequency Distribution

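| Interval | Frequency | Relative Frequency |
|----------|-----------|--------------------|
| 0-10     | 1         | 0.1                |
| 10-20    | 0         | 0.0                |
| 20-30    | 3         | 0.3                |
| 30-40    | 1         | 0.1                |
| 40-50    | 2         | 0.2                |
| 50-60    | 2         | 0.2                |
| 60-70    | 0         | 0.0                |
| 70-80    | 1         | 0.1                |
| Total    | N=10      | 1.0                |

Each interval includes its lower bound but not its upper bound, so a score of 50 is counted in 50-60, not in 40-50.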

Creating Frequency Distributions with R

studentscores <- c(23, 45, 6, 73, 23, 23, 45, 50, 51, 34) # create a vector of the student scores

range(studentscores) # displays the minimum and maximum score

bin <- seq(0, 80, by = 10) # creates the interval boundaries 0, 10, ..., 80

bin # displays the values of our bin vector

studentscores.cut <- cut(studentscores, bin, right = FALSE) # assigns each score to an interval; right = FALSE includes the lower bound, not the upper

studentscores.frequency <- table(studentscores.cut) # creates our frequency distribution

cbind(studentscores.frequency) # displays the frequencies as a single column
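To get the relative frequencies for the class intervals as well, one option is `prop.table()`. The sketch below repeats the steps above so that it runs on its own; the names `interval_freq` and `interval_rel` are chosen for this example.

```r
studentscores <- c(23, 45, 6, 73, 23, 23, 45, 50, 51, 34)
bin <- seq(0, 80, by = 10)

# right = FALSE gives intervals of the form [0,10), [10,20), ...
interval_freq <- table(cut(studentscores, bin, right = FALSE))
interval_rel <- prop.table(interval_freq)   # frequencies divided by N

# One row per interval, with both kinds of frequency
data.frame(Interval = names(interval_freq),
           Frequency = as.vector(interval_freq),
           RelativeFrequency = as.vector(interval_rel))
```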

DATA VISUALIZATION

Frequency distributions can provide valuable information, but sometimes a picture is of greater value. Several types of pictorial representations can be used to represent data. The choice depends on the type of data collected and what the researcher hopes to emphasize or illustrate. The most common types of data visualization are graphs: pie charts, bar graphs, histograms and frequency polygons (line graphs).

There are new and very powerful data visualization tools. You can research these tools if you would like to delve a little more into this area. For this class, we shall stick to the basics.

Bar Graphs and Histograms

Bar graphs and histograms are frequently confused. When the data collected are on a nominal scale, or if the variable is a qualitative variable (a categorical variable for which each value represents a discrete category), then a bar graph is most appropriate. A bar graph is a graphical representation of a frequency distribution in which vertical bars are centered above each category along the x-axis and are separated from each other by a space, indicating that the levels of the variable represent distinct, unrelated categories.

If the variable is a quantitative variable (the scores represent a change in quantity), or if the data collected are ordinal, interval, or ratio in scale, then a histogram can be used. A histogram is also a graphical representation of a frequency distribution in which vertical bars are centered above scores on the x-axis, but in a histogram the bars touch each other to indicate that the scores on the variable represent related, increasing values.

In both a bar graph and a histogram, the height of each bar indicates the frequency for that level of the variable on the x-axis. The spaces between the bars on the bar graph indicate not only the qualitative differences among the categories but also that the order of the values of the variable on the x-axis is arbitrary. In other words, the categories on the x-axis in a bar graph can be placed in any order. The fact that the bars are contiguous in a histogram indicates not only the increasing quantity of the variable but also that the variable has a definite order that cannot be changed.
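Both kinds of graph are available in base R. In the sketch below, the `region` vector is made-up categorical data (the region in which each of 12 hypothetical drivers works), while the histogram reuses the student scores from Table 2.1.

```r
# Made-up categorical data: region of 12 hypothetical taxi drivers
region <- c("CBD", "Westlands", "CBD", "Upperhill", "CBD", "Westlands",
            "Upperhill", "CBD", "Westlands", "CBD", "Upperhill", "CBD")

# Bar graph: one bar per category, with gaps between the bars
barplot(table(region), xlab = "Region", ylab = "Frequency",
        main = "Bar graph: drivers per region")

# Histogram: one bar per interval, and the bars touch
studentscores <- c(23, 45, 6, 73, 23, 23, 45, 50, 51, 34)
h <- hist(studentscores, breaks = seq(0, 80, by = 10), right = FALSE,
          xlab = "Score", ylab = "Frequency",
          main = "Histogram: student scores")

h$counts   # the frequency in each interval, as used for the bar heights
```

Reordering the categories in the bar graph would not change its meaning, whereas the intervals in the histogram have to stay in increasing order.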

Pie Chart

Like bar graphs, pie charts are used to represent categorical variables. While bar graphs display the frequencies, pie charts show the proportions (relative frequencies). Most times, the relative frequencies are represented as percentages.
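A pie chart can be drawn with the base R function `pie()`. The sketch below uses made-up categorical data (the `region` of 12 hypothetical drivers) and labels every slice with its percentage.

```r
# Made-up categorical data: region of 12 hypothetical taxi drivers
region <- c("CBD", "Westlands", "CBD", "Upperhill", "CBD", "Westlands",
            "Upperhill", "CBD", "Westlands", "CBD", "Upperhill", "CBD")

freq <- table(region)                    # frequency of each category
pct <- round(100 * prop.table(freq))     # relative frequency as a percentage

pie(freq,
    labels = paste0(names(freq), " (", pct, "%)"),
    main = "Pie chart: share of drivers per region")
```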

Frequency Polygons (Line Graphs)

We can also depict the data in a histogram as a frequency polygon—a line graph of the frequencies of individual scores or intervals. Again, scores (or intervals) are shown on the x-axis and frequencies on the y-axis. Once all the frequencies are plotted, the data points are connected. Frequency polygons are appropriate when the variable is quantitative or the data are ordinal, interval, or ratio. In this respect, frequency polygons are similar to histograms. Frequency polygons are especially useful for continuous data (such as age, weight, or time) in which it is theoretically possible for values to fall anywhere along the continuum. For example, an individual can weigh 120.5 pounds or be 35.5 years of age. Histograms are more appropriate when the data are discrete (measured in whole units)—for example, number of college classes taken or number of siblings.
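One way to sketch a frequency polygon in base R is to plot the frequency of each interval against the interval's midpoint and join the points. The example reuses the student scores from Table 2.1 with intervals of width 10; the variable names are chosen for this example.

```r
studentscores <- c(23, 45, 6, 73, 23, 23, 45, 50, 51, 34)
bin <- seq(0, 80, by = 10)

# Frequency of scores in each interval [0,10), [10,20), ...
freq <- table(cut(studentscores, bin, right = FALSE))

# Midpoint of each interval: 5, 15, ..., 75
midpoints <- bin[-length(bin)] + 5

# type = "o" draws the points and connects them with lines
plot(midpoints, as.vector(freq), type = "o",
     xlab = "Score (interval midpoint)", ylab = "Frequency",
     main = "Frequency polygon: student scores")
```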

3D Plots

Sometimes we may want to represent complex relationships using graphs. While line graphs are used to show the relationship between 2 variables, we may want to see the relationship between 3 variables. We can graph this using 3D graphing tools. Excel does not have this feature out of the box and it is very difficult to do it directly in R. We do this using a package in R.

Packages are collections of R functions, data and compiled code in a well-defined format.

R Commander (the Rcmdr package) and ggplot2 are touted to be among the best plotting packages for R. Installing packages will be part of the assignment for this lesson.
