Lesson 1: Data

1.1 The Why


The objective of statistics can be divided into three parts:

  • To describe
  • To explain
  • To predict

Description begins with careful observation of behavior. It allows us to learn about a behavior and when it occurs. Let’s say, for example, that you were interested in the channel-surfing behavior of males and females. Careful observation and description would be needed in order to determine whether or not there were any gender differences in channel-surfing. Description allows us to observe that two events are systematically related to one another. Without description as a first step, predictions cannot be made.

Explanation allows us to identify the causes that determine when and why a behavior occurs. In order to explain a behavior, we need to demonstrate that we can manipulate the factors needed to produce or eliminate the behavior. In our channel-surfing example, if gender predicts channel-surfing, what might cause it? It could be genetic or environmental. Maybe males have less tolerance for commercials and thus channel-surf at a greater rate. Maybe females are more interested in the content of commercials. Maybe the attention span of females is greater. Maybe something associated with having a Y chromosome increases channel-surfing, or something associated with having two X chromosomes leads to less channel-surfing. Obviously the possible explanations are numerous and varied. As scientists, we test these possibilities to identify the best explanation of why a behavior occurs.

Prediction allows us to identify the factors that indicate when an event or events will occur. In other words, knowing the level of one variable allows us to predict the approximate level of the other variable.

Variable: An event or behavior that has at least two values

We know that if one variable is present at a certain level, then there is a greater likelihood that the other variable will be present at a certain level. For example, if we observed that males channel-surf with greater frequency than females, we could then make predictions about how often males and females might change channels when given the chance.
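
To make the step from description to prediction concrete, here is a minimal sketch in Python (used only for illustration; the course itself uses R). The observations are hypothetical numbers made up for this example, not real measurements, and the function names `describe` and `predict` are assumptions of the sketch.

```python
from statistics import mean

# Hypothetical observations (made up for illustration only):
# (gender, channel changes per hour)
observations = [
    ("male", 12), ("male", 9), ("male", 15),
    ("female", 6), ("female", 8), ("female", 4),
]

def describe(obs):
    """Describe: average channel changes per hour for each gender."""
    groups = {}
    for gender, changes in obs:
        groups.setdefault(gender, []).append(changes)
    return {gender: mean(values) for gender, values in groups.items()}

def predict(summary, gender):
    """Predict: use the group average as the expected value for a new person."""
    return summary[gender]

summary = describe(observations)
print(summary)
print(predict(summary, "male"))
```

Note that the prediction is only as good as the description it rests on: with no observed averages, there is nothing to predict from.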

1.2 The What

So we want to describe or explain or predict. The question is what? We are interested in phenomena; that is, an event, a characteristic or behavior. We could be interested in the channel surfing behavior of people or the relationship between traffic levels and weather patterns. To do that we have to make observations; to observe the phenomena we want to describe or explain or predict.

An observation is an instance of a variable

This is a good time to talk about a variable. A variable is a phenomenon that can take on two or more values. The weight of the people in a classroom, for instance, is a variable. Each person (observation) will have their own weight (value of the variable).

Data, therefore, is a set of observations that is arranged in a meaningful way. We shall talk about the arrangement of data later. For now let us focus on the observations themselves.
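
One way to picture this, sketched in Python for illustration (the course itself uses R): each observation is a record, each variable is a named field in that record, and the data is the collection of records. The names and weights below are hypothetical, and the helpers `values_of` and `is_variable` are assumptions of this sketch.

```python
# Hypothetical classroom data: each dictionary is one observation (a person),
# and each key is a variable measured on that person.
data = [
    {"name": "A", "weight_kg": 64.0},
    {"name": "B", "weight_kg": 72.5},
    {"name": "C", "weight_kg": 65.0},
]

def values_of(variable, observations):
    """Collect the values one variable takes across all observations."""
    return [obs[variable] for obs in observations]

def is_variable(values):
    """A variable must take at least two distinct values."""
    return len(set(values)) >= 2

weights = values_of("weight_kg", data)
print(weights)
print(is_variable(weights))
```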

Data is the focal point of all our statistical activity. How much and what kind of data we have will greatly influence what we can do. It may seem that we are spending a lot of time on what may be considered unnecessary detail, but these are the building blocks.

Characteristics or Properties of Data

These properties include identity, magnitude, equal unit size, and absolute zero. When a measure has the property of identity, objects that are different receive different scores. For example, if members of this class had different heights, they would all receive different measurements. Measurements have the property of magnitude (also called ordinality) when the ordering of the numbers reflects the ordering of the variable. In other words, numbers are assigned in order so that some numbers represent more or less of the variable being measured than others.

Measurements have an equal unit size when a difference of 1 is the same amount throughout the entire scale. For example, the difference between people who are 64 kilos and 65 kilos is the same as the difference between people who are 72 kilos and 73 kilos. The difference in each situation (1 kg) is identical. Notice how this differs from the property of magnitude. Were we to simply line up and rank a group of individuals based on their weight, the scale would have the properties of identity and magnitude, but not equal unit size. Can you think of why this would be so? We would not actually measure people’s weight in kilos, but simply order them in terms of how big they appear, from smallest (the person receiving a score of 1) to biggest (the person receiving the highest score). Thus, our scale would not meet the criteria of equal unit size. In other words, the difference in weight between the two people receiving scores of 1 and 2 might not be the same as the difference in weight between the two people receiving scores of 3 and 4.
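
The loss of equal unit size under ranking can be seen in a short Python sketch (Python is used only for illustration; the course itself uses R). The names and weights are hypothetical, and `rank_by_weight` is a helper invented for this example.

```python
# Hypothetical weights in kilograms (made up for illustration).
weights_kg = {"Ann": 52, "Ben": 64, "Cat": 65, "Dan": 90}

def rank_by_weight(weights):
    """Assign rank 1 to the lightest person, 2 to the next, and so on."""
    ordered = sorted(weights, key=weights.get)
    return {name: position for position, name in enumerate(ordered, start=1)}

ranks = rank_by_weight(weights_kg)

# Weight difference between each pair of neighbouring ranks.
ordered = sorted(weights_kg, key=weights_kg.get)
gaps = [weights_kg[b] - weights_kg[a] for a, b in zip(ordered, ordered[1:])]

print(ranks)  # neighbouring ranks always differ by exactly 1
print(gaps)   # the underlying weight differences are not equal
```

Every step from one rank to the next is "1", yet the weight gaps behind those steps differ, which is why a ranking keeps identity and magnitude but not equal unit size.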

Lastly, measures have an absolute zero when assigning a score of zero indicates an absence of the variable being measured. For example, bank account balance would have the property of absolute zero because a score of 0 on this measure would mean an individual has no money in the bank. However, a score of 0 does not always indicate an absolute zero. As an example, think about the Celsius temperature scale. That measurement scale has a score of 0 (the thermometer can read 0 degrees), but does that score indicate an absence of temperature? No, it indicates a cold temperature (the freezing point of water). Hence, it does not have the property of absolute zero.


Why are the properties of data important? They are important because data in itself might not be very useful. If we have a bunch of data (please do not use this phrase outside of this class!) we want to do something with it. That “something” is called manipulation. I am not talking about the evil and conniving manipulation of soap opera villains. Manipulation is simply the use of some techniques to convert one or more quantities into another quantity or quantities.

Every January, people pledge to join the gym because they have “added weight”. This means they had a starting weight and then gained some more weight, giving the total weight they now have (say 65 kg + 2 kg = 67 kg). Conversely, let’s say that a group of men went for a workshop on how to become better men in society. Can we say they have added “manness”? No. The quantity gender cannot be added. Can you explain why that is from the properties of data above?

As noted previously, the level or scale of measurement depends on the properties of the data. There are four scales of measurement (nominal, ordinal, interval, and ratio), and each of these scales has one or more of the properties described in the previous section. As we will see later on, it is important to establish the scale of measurement of your data in order to determine the appropriate statistical test to use when analyzing the data.

A nominal scale is one in which objects or individuals are broken into categories that have no numerical properties. Nominal scales have the characteristic of identity but lack the other properties. Variables measured on a nominal scale are often referred to as categorical variables because the measuring scale involves dividing the data into categories. However, the categories carry no numerical weight. Some examples of categorical variables, or data measured on a nominal scale, include ethnicity, gender, and political affiliation.

An ordinal scale is one in which objects or individuals are categorized and the categories form a rank order along a continuum. Data measured on an ordinal scale have the properties of identity and magnitude but lack equal unit size and absolute zero. Ordinal data are often referred to as ranked data because the data are ordered from highest to lowest, or biggest to smallest. For example, the numbered ranks students are given in school based on performance in an exam (number 1, 2, etc.) form an ordinal scale. This variable would carry identity and magnitude because each individual receives a rank (a number) that carries identity, and beyond simple identity it conveys information about order or magnitude (how many students performed better or worse in the class).

An interval scale is one in which the units of measurement (intervals) between the numbers on the scale are all equal in size. When using an interval scale, the properties of identity, magnitude, and equal unit size are met. For example, the Celsius or Fahrenheit temperature scale is an interval scale of measurement. A given temperature carries identity (days with different temperatures receive different scores on the scale), magnitude (cooler days receive lower scores and hotter days receive higher scores), and equal unit size (the difference between 20 and 21 degrees is the same as that between 30 and 31 degrees). However, the temperature scale does not have an absolute zero. Because of this, we are not able to form ratios based on this scale (for example, 50 degrees is not twice as hot as 25 degrees).
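
A quick check shows why ratios fail on an interval scale. The sketch below is in Python for illustration only (the course itself uses R) and assumes the 50 and 25 degrees in the example are in Celsius. It converts both to Fahrenheit with the standard formula and compares the ratios.

```python
def celsius_to_fahrenheit(celsius):
    """Standard conversion: F = C * 9/5 + 32."""
    return celsius * 9 / 5 + 32

# Assumption for this sketch: the two temperatures are in degrees Celsius.
hot_c, mild_c = 50.0, 25.0

ratio_celsius = hot_c / mild_c
ratio_fahrenheit = celsius_to_fahrenheit(hot_c) / celsius_to_fahrenheit(mild_c)

print(ratio_celsius)     # 2.0
print(ratio_fahrenheit)  # about 1.58, i.e. 122 / 77
```

The same two days give a ratio of 2 in one unit and about 1.58 in the other. Because the zero point of each scale is arbitrary, the "twice as hot" claim depends on the unit chosen and so carries no real meaning.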

A ratio scale is one in which, in addition to order and equal units of measurement, there is an absolute zero that indicates an absence of the variable being measured. Ratio data have all four properties of measurement—identity, magnitude, equal unit size, and absolute zero. Examples of ratio scales of measurement include weight, time, and height. Each of these scales has identity (individuals who weigh different amounts would receive different scores), magnitude (those who weigh less receive lower scores than those who weigh more), and equal unit size (1 kg is the same weight anywhere along the scale and for any person using the scale). These scales also have an absolute zero, which means a score of zero reflects an absence of that variable. This also means that ratios can be formed. For example, a weight of 100 kg is twice as much as a weight of 50 kg.
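
The relationship between the four properties and the four scales can be written down as a small lookup. This is a sketch in Python for illustration only (the course itself uses R); the names `SCALES`, `scale_for` and `ratios_meaningful` are invented for this example.

```python
# Properties held by each scale, as described in the text.
SCALES = {
    "nominal": {"identity"},
    "ordinal": {"identity", "magnitude"},
    "interval": {"identity", "magnitude", "equal unit size"},
    "ratio": {"identity", "magnitude", "equal unit size", "absolute zero"},
}

def scale_for(properties):
    """Return the scale whose property set matches exactly, else None."""
    wanted = set(properties)
    for name, props in SCALES.items():
        if props == wanted:
            return name
    return None

def ratios_meaningful(scale):
    """Ratios can be formed only when the scale has an absolute zero."""
    return "absolute zero" in SCALES[scale]

print(scale_for(["identity", "magnitude"]))
print(ratios_meaningful("interval"))
print(ratios_meaningful("ratio"))
```

Each scale adds exactly one property to the one before it, which is why the scales form a ladder from nominal up to ratio.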











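Summary of the four scales of measurement:

  • Nominal. Examples: ethnicity, gender, political affiliation. Properties: identity. Mathematical operations: none. Typical statistics used: chi-square.
  • Ordinal. Examples: class rank, letter grade. Properties: identity, magnitude. Mathematical operations: rank order. Typical statistics used: Wilcoxon test.
  • Interval. Examples: temperature. Properties: identity, magnitude, equal unit size. Mathematical operations: add, subtract. Typical statistics used: t-test.
  • Ratio. Examples: weight, height, time. Properties: identity, magnitude, equal unit size, absolute zero. Mathematical operations: add, subtract, multiply, divide. Typical statistics used: t-test.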



Another means of classifying variables is in terms of whether they are discrete or continuous in nature. Discrete variables usually consist of whole-number units or categories. They are made up of chunks or units that are detached and distinct from one another. A change in value occurs a whole unit at a time, and decimals do not make sense with discrete scales. Most nominal and ordinal data are discrete. For example, gender, political party, and ethnicity are discrete scales. Some interval or ratio data can be discrete. For example, the number of children someone has would be reported as a whole number (discrete data), yet it is also ratio data (you can have a true zero and form ratios).

Continuous variables usually fall along a continuum and allow for fractional amounts. The term continuous means that it “continues” between the whole-number units. Examples of continuous variables are age (22.7 years), height (64.5 inches), and weight (113.25 kg). Most interval and ratio data are continuous in nature.
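
As a rough illustration, the Python sketch below (Python for illustration only; the course itself uses R) checks whether a set of recorded values consists only of whole numbers. This is a heuristic on the recorded values, not a definition: a continuous variable that happens to be rounded to whole numbers would also pass. The sample values are hypothetical and `looks_discrete` is a name invented for this example.

```python
def looks_discrete(values):
    """True when every recorded value is a whole number (e.g. counts)."""
    return all(float(value).is_integer() for value in values)

# Hypothetical values (made up for illustration).
number_of_children = [0, 2, 1, 3]
heights_in_inches = [64.5, 70.0, 61.25]

print(looks_discrete(number_of_children))
print(looks_discrete(heights_in_inches))
```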

1.3 The How

Psychiatrists say that we cannot remember the first three years of our lives. Our minds just blot those memories out. For most of us, it is not just these memories that we forget: where we put our keys, our friends’ birthdays and that all-important meeting are just some of the things we regularly forget. Man is, by his very nature, inadequate. That is why he built machines!

All throughout school, you have been trained to memorize concepts. This is not one of those classes. I do not want you to memorize anything. I want you to understand the concepts. Anything else can always be found when needed. When you understand how to do it, you can always find the means and the tools to do the job required. Especially in exploratory analysis (you don’t have to know what this is now), the job is in trying to figure out what you want to do, rather than the doing itself.

I will try as much as possible to provide reference material that further explains the concepts introduced in this course. Nevertheless, we shall focus on the practical application of the concepts rather than the theoretical.


As mentioned earlier, man built tools to help make his work easier. Therefore, in this class, we shall also make use of statistical tools/packages to help make our work easier. For statistical analysis, we have several options:

  1. MS Excel
  2. SPSS
  3. SAS
  4. STATA
  5. R

The above 5 tools are among the most commonly used tools for statistical analysis. Each one has its own advantages and disadvantages, and the choice of which one to use will depend on many factors including the expertise of the user, the size of the task, cost, etc. In this class we shall be using R. There are many reasons for using R, but the main reason we are using it in this class is that it is free, while all the others have to be purchased.

Please see our R Tutorials. For this lesson, we need to be able to install and run R, create data and import data into R.

Next: Descriptive Statistics I – Organizing Data


