For my senior year of high school, having completed the high school calculus course, I focused on statistics and probability for mathematics, using the self-paced course on Stanford Online Lagunita and videos from Khan Academy. (If you are interested, I highly recommend the former.)

Statistics has developed a poor reputation over the years. It’s difficult. It’s biased. It’s unreliable. It’s outright wrong. Nevertheless, many colleges and many degree programs require that students enroll in an introductory statistics course. (The college that I will attend come the fall does not have this requirement, but that’s beside the point.)

Technically, I could have terminated high school mathematics study in 10th grade, but I’m a nerd and enjoy math, so I continued. I decided to take statistics this year for two main reasons: 1) According to family, friends, and the internet, it’s useful, and 2) after two years of calculus, I wanted a change of mathematical pace.

Working through Stanford Lagunita and Khan Academy these past 7+ months, I began to realize just how important a statistics class is.

Statistics surrounds us. We encounter it in the grocery store, on the news, in classrooms, during debates – anywhere you can collect, organize, interpret, and present data of any kind. This prevalence of statistical data leaves it open to abuse. Misleading statistics may occur by error, but a more dangerous circumstance occurs when analysts produce misleading statistics deliberately.

It is crucial to be able to recognize graphs with exaggerated scales, sampling bias, blurred lines between correlation and causation, and other data manipulations or mistakes.

Image from Vertical Measures

Statistics is concerned with comparing differences in data, whether between years, products, or people. It has four main parts:

  1. Identification of a population
  2. Selection of a sub-group, or sample, of the population through randomization
  3. Summary of the collected data in exploratory data analysis (EDA)
  4. Conclusions through inference

Casual statistics usually ends at #3. We identified our population as the American people, we chose a sample of Americans, and we collected data from them in a survey about their opinions on gun laws/conflicts/religion/(insert any other data topic of interest). The end. Let us now present to the public that this percentage of Americans disagree with this, or agree with that, or believe that blue is the king of colors.

When assessing statistical data, the underlying framework is just as important as – dare I say, more important than – the data itself. Right from the start, in part 2, statistics can break down.

Randomization, randomization, randomization.

This is a buzzword in statistics. The sample taken in part 2 should truly represent the population identified in part 1. If a sample is not random, bias in the data is likely. For example, say a news station invites viewers to log on to their website and answer a poll about the Common Core. Participation is voluntary. Who will actually join the poll? Most likely, those with the strongest opinions on Common Core will. In statistics, this is called volunteer bias.
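To see how much a voluntary poll can distort a result, here is a minimal simulation sketch. Every number in it is an assumption made up for illustration: a hypothetical population in which 30% oppose the policy, and opponents who are five times as likely as everyone else to bother answering the poll.

```python
import random


def simulate_poll(population_size=100_000, sample_size=1_000, seed=0):
    """Compare a simple random sample with a voluntary-response poll.

    All rates below are hypothetical, chosen only to illustrate the effect.
    """
    rng = random.Random(seed)

    # Hypothetical population: True means the person opposes the policy.
    population = [rng.random() < 0.30 for _ in range(population_size)]
    true_share = sum(population) / population_size

    # Simple random sample: every person is equally likely to be chosen.
    random_sample = rng.sample(population, sample_size)
    random_share = sum(random_sample) / sample_size

    # Voluntary poll: assume opponents respond 5% of the time,
    # everyone else only 1% of the time.
    volunteers = [
        opposes
        for opposes in population
        if rng.random() < (0.05 if opposes else 0.01)
    ]
    volunteer_share = sum(volunteers) / len(volunteers)

    return true_share, random_share, volunteer_share


true_share, random_share, volunteer_share = simulate_poll()
print(f"True share opposed:       {true_share:.3f}")
print(f"Random sample estimate:   {random_share:.3f}")
print(f"Voluntary poll estimate:  {volunteer_share:.3f}")
```

Under these assumed response rates, the random sample should land close to the true 30%, while the voluntary poll should report that roughly two-thirds of respondents are opposed, even though nothing about the population changed.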

My statistics courses beat that random sample criterion of analysis until it was dry and dead. The short and sweet of it is: If you fail in that step, all results implode. A couple of lessons on the importance of random samples will make you look more carefully at the statistical data in books, pamphlets, and online articles.

Image from the Centers for Disease Control and Prevention

The topic that has particularly enraptured me recently is inference, part 4 of the statistics process. In inference, you analyze the data about the sample in order to draw conclusions about the population. It’s not enough to survey 1,000 people and find that 58% support such-and-such a motion. You have to ask, “How statistically significant is that result? Does it matter as much as the media has made it out to?”

Statistics has three main inferential methods: point estimation, interval estimation, and hypothesis testing. (The following definitions come straight from my notes, and may or may not be verbatim definitions from Stanford Lagunita.)

  • Point estimation estimates an unknown parameter using a single number from the sample data.
  • Interval estimation estimates an unknown parameter using an interval of values called a confidence interval. It builds on a point estimate by attaching a margin of error to it.
  • Hypothesis testing makes a claim about a population and checks data for evidence against it.
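As a rough sketch of how the three methods fit together, the Python below applies them to the earlier survey example (58% of 1,000 people in support). It uses the standard normal approximation for a proportion. The null hypothesis of 50% support and the 95% confidence level are my own assumptions for illustration, and the function name is hypothetical.

```python
from math import sqrt
from statistics import NormalDist


def proportion_inference(successes, n, null_p=0.5, confidence=0.95):
    """Point estimate, confidence interval, and two-sided z-test for a proportion.

    Uses the normal approximation, which is reasonable for large samples.
    """
    standard_normal = NormalDist()

    # Point estimation: a single number from the sample.
    p_hat = successes / n

    # Interval estimation: point estimate plus or minus a margin of error.
    z_star = standard_normal.inv_cdf(1 - (1 - confidence) / 2)
    standard_error = sqrt(p_hat * (1 - p_hat) / n)
    interval = (p_hat - z_star * standard_error, p_hat + z_star * standard_error)

    # Hypothesis testing: how far is p_hat from the claimed value null_p,
    # measured in standard errors computed under that claim?
    null_standard_error = sqrt(null_p * (1 - null_p) / n)
    z = (p_hat - null_p) / null_standard_error
    p_value = 2 * (1 - standard_normal.cdf(abs(z)))

    return p_hat, interval, z, p_value


p_hat, interval, z, p_value = proportion_inference(580, 1000)
print(f"Point estimate: {p_hat:.2f}")
print(f"95% confidence interval: ({interval[0]:.3f}, {interval[1]:.3f})")
print(f"z statistic: {z:.2f}, p-value: {p_value:.2g}")
```

With these inputs the interval runs from about 55% to 61%, so it sits entirely above 50%, and the test statistic is about five standard errors from the assumed null value. Both point the same way: if the sample was truly random, a result this far from an even split is very unlikely to be chance alone.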

Hypothesis testing, the most complex of the three, uses the confidence interval of interval estimation, among other fun statistical constructs, to determine how amazing the observed data is. Data found through surveys, polls, experiments, or observational studies means nothing if you do not put it in context. Consider, for example, a finding that teenagers get, on average, only 7 hours of sleep: it carries far more weight when it comes from a sample of 1,000 teenagers than when it comes from a sample of 10.
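A quick way to put numbers on the sample-size point is to compare margins of error. The sketch below assumes a standard deviation of 1.5 hours for teenage sleep, a figure I made up purely for illustration, and uses the normal approximation. (For a sample as small as 10, a t-based interval would be more appropriate and somewhat wider.)

```python
from math import sqrt
from statistics import NormalDist


def margin_of_error(std_dev, n, confidence=0.95):
    """Margin of error for a sample mean under the normal approximation."""
    z_star = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    return z_star * std_dev / sqrt(n)


# Hypothetical spread of nightly sleep, in hours (an assumption, not real data).
ASSUMED_SD_HOURS = 1.5

small = margin_of_error(ASSUMED_SD_HOURS, 10)
large = margin_of_error(ASSUMED_SD_HOURS, 1000)

print(f"n = 10:   7 hours, give or take {small:.2f} hours")
print(f"n = 1000: 7 hours, give or take {large:.2f} hours")
```

Because the margin of error shrinks with the square root of the sample size, going from 10 to 1,000 teenagers (100 times as many) makes the estimate 10 times as precise: under this assumption, roughly plus or minus 0.93 hours versus plus or minus 0.09 hours.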

In the future, I’ll talk more about z-tests, t-tests, p-values, and all of the other wonderful components involved in hypothesis testing. It’s a bit too much to squeeze into the end of this post. For now, I just wanted to introduce you to the value in this field and, hopefully, pass a little of my excitement about statistics to you.