Statistics for Absolute Beginners: A Plain English Introduction

Author: Oliver Theobald

Publisher: Independently published

Genres: Mathematics

Publish Date: June 18, 2020

ISBN-10: B08BDZ2DCF

Pages: 156

File Type: Epub

Language: English

Book Preface

“Let’s listen to the data.”

“Do you have the numbers to back that up?”

We live in an age and society where we trust technology and quantifiable information more than we trust each other—and sometimes ourselves. The gut feeling and conviction of Steve Jobs to know “what consumers would later want” is revered and romanticized. Yet there’s sparse literature (Blink by Malcolm Gladwell is a notable exception), an eerie absence of online learning courses, and little sign of a mainstream movement promoting one person’s unaided intuition as a prerequisite to success in business. Everyone is too preoccupied with thinking about quantitative evidence, including the personal data generated by Apple’s expanding line of products. Extensive customer profiling and procuring data designed to wrench out our every hidden desire are dominant and pervasive trends in business today.

Perhaps Jobs represents a statistical anomaly. His legacy cannot be wiped from the dataset, but few in the business world would set out to emulate him without data in their pocket. As Wired Magazine’s Editor-in-chief Chris Anderson puts it, we don’t need theories but rather data to look at and analyze in the current age of big data. [1]

Data—both big and small—is collected instantly and constantly: how far we travel each day, who we interact with online and where we spend our money. Every bit of data has a story to tell. But, left isolated, these parcels of information rest dormant and underutilized—equivalent to Lego blocks cordoned into bags of separate pieces.

Data, though, is extraordinarily versatile in the hands of the right operator. Like Lego laid out across the floor, it can be arranged and merged to serve in a variety of ways and rearranged to derive value beyond its primary purpose. A demonstration of data’s secondary value came in 2002 when Amazon signed a deal with AOL granting it access to user data from AOL’s e-commerce platform. While AOL viewed their data in terms of its primary value (recorded sales data), Amazon saw a secondary value that would improve its ability to push personalized product recommendations to users. By gaining access to data that documented what AOL users were browsing and purchasing, Amazon was able to improve the performance of its own product recommendations, explains Amazon’s former Chief Scientist Andreas Weigend. [2]

Various fields of data analytics, including machine learning, data mining, and deep learning, continue to improve our ability to unlock patterns hidden in data for direct or secondary analysis, as typified by Amazon. But behind each new advanced technique is a trusted and lasting method of attaining insight, popularized more than 250 years ago under the title of this book.

While primary methods of statistical analysis date back to at least the 5th Century BC, it wasn’t until the 18th Century AD that these and newly evolved methods coalesced into a distinctive sub-field of mathematics and probability known today as statistics.

A notable frontrunner to the developments of the 18th Century was John Graunt’s publication Natural and Political Observations Made upon the Bills of Mortality. The London-born haberdasher [3] and his friend, William Petty, are credited with developing the early techniques of census analysis that later provided the framework for modern demographic studies.

Graunt developed the first “life table,” which surmised the probability of survival amongst age groups during a public health crisis that hit Europe in the mid-1600s. By analyzing the weekly bills of mortality (deaths), Graunt and Petty attempted to create a warning system to offset the spread of the bubonic plague in London. While the system was never actually implemented, it served as a useful statistical exercise in estimating London’s sizeable population.

Probability theory evolved during this same period courtesy of new theories published by Gerolamo Cardano, Blaise Pascal, and Pierre de Fermat. As an accomplished chess player and gambler in Italy, Cardano observed dice games to comprehend and distill the basic concepts of probability. This included expressing the chance of a desired result by defining odds as the ratio of favorable to unfavorable outcomes. He subsequently wrote Liber de ludo aleae (Book on Games of Chance) in 1564, but the book wasn’t published until a century later in 1663. Beyond its section on effective cheating methods, Cardano’s thesis was well received as the first systematic treatment of probability.

A decade earlier, in 1654, Pierre de Fermat and Blaise Pascal (also known for his work on the arithmetical triangle and as co-inventor of the mechanical calculator) collaborated to develop the concept of expected value, which was again developed for the purpose of interpreting gambling scenarios. To end a game early, Pascal and de Fermat devised a method to divide the stakes equitably based on the calculated probability that each player had of winning. The French duo’s study of the mathematical theory of probability helped develop the concept of expected value and the closely related law of large numbers.

Pascal and de Fermat found that as the number of independent trials increases, the average of the outcomes creeps toward an expected value, which is calculated as the sum of each possible value multiplied by the probability of its occurrence. If you continually roll a six-sided die, for example, the average value of all the results moves close to 3.5.

Example

(1 + 2 + 3 + 4 + 5 + 6) x (1/6)

21 x 0.16666666666667

= 3.5
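
The same arithmetic can be checked in code. What follows is a minimal sketch rather than anything taken from the book: the function names expected_value and average_of_rolls are illustrative, and a fixed random seed is assumed so that the simulated rolls are repeatable.

```python
import random
from fractions import Fraction


def expected_value(values, probabilities):
    """Sum of each possible value multiplied by its probability."""
    return sum(v * p for v, p in zip(values, probabilities))


def average_of_rolls(n_rolls, seed=0):
    """Average result of n_rolls simulated rolls of a fair six-sided die."""
    rng = random.Random(seed)  # fixed seed is an assumption, for repeatability
    return sum(rng.randint(1, 6) for _ in range(n_rolls)) / n_rolls


faces = [1, 2, 3, 4, 5, 6]
fair = [Fraction(1, 6)] * 6  # each face is equally likely

print(float(expected_value(faces, fair)))  # 3.5

# The running average should drift toward 3.5 as the number of rolls grows.
for n in (10, 1_000, 100_000):
    print(n, average_of_rolls(n))
```

With only a handful of rolls the average can sit well away from 3.5; with many rolls it settles close to it, which is the behavior the law of large numbers describes.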

By the 18th Century, further breakthroughs in probability theory and the study of demography (based on Graunt’s prior work in census studies) combined to spawn the modern field of statistics. Derived from the Latin stem “sta,” meaning “to stand, set down, make or be firm,” the field of statistics was initially limited to policy discussions and the condition of the state. [4] The earliest known recording of the term is linked to the German word “statistik,” which was popularized and supposedly coined by the German political scientist Gottfried Achenwall (1719-1772) in his 1748 publication Vorbereitung zur Staatswissenschaft. [5]

The German word “statistik” is thought to have borrowed from the Modern Latin term “statisticum collegium” (lecture course on state affairs), the Italian word “statista” (statesman or one skilled in statecraft), and the Latin word “status” (meaning a station, position, place, order, arrangement, or condition). [6] The new term was later published in the English-speaking world by Sir John Sinclair in his 1791 publication the Statistical Account of Scotland.

By the close of the 18th Century, “statistics” was synonymous with the systematic collection and analysis of a state’s demographic and economic information. The regular recording of state resources had been in practice since ancient times, but deeper than an exercise in state bookkeeping, the new moniker inspired specialist studies in utilizing data to inform decision-making and incorporated the latest methods in distribution and probability.

Statistics subsequently expanded in scope during the 19th Century. No longer confined to assessing the condition of the state, attention was recalibrated to all fields of study, including medicine and sport, which is how we recognize statistics today. But while emerging fields like “machine learning” and “data mining” sound new and exciting now, “statistics” generally evokes memories of a dry and compulsory class taught in college or high school. In the book Naked Statistics: Stripping the Dread from the Data, author Charles Wheelan writes that students often complain that “statistics is confusing and irrelevant,” but outside the classroom, they are glad to discuss batting averages, the wind chill factor, grade point averages and how to reliably measure the performance of an NFL quarterback. [7] As Wheelan observes, a large number of people study statistics as part of their education, but very few know how to apply these methods past examination day despite an inherent curiosity and interest in measuring things and especially performance.

This dichotomy has begun to change with the recent popularity of data science, which has grown in favor since Charles Wheelan’s book was published in 2012. From the planning of data collection to advanced techniques of predictive analysis, statistics is applied across nearly all corners of data science. Machine learning, in particular, overlaps with inferential statistics, which involves extracting a sample from a pool of data and making generalized predictions about the full population. Like inferential statistics, machine learning draws on a set of observations to discover underlying patterns that are used to form predictions.
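
The idea of drawing a sample and generalizing to the full population can be sketched in a few lines. This is an illustration only: the population below is made up (10,000 simulated values with an assumed mean of 170 and standard deviation of 10), and the sample size of 200 and the random seed are arbitrary choices.

```python
import random
import statistics

rng = random.Random(42)  # fixed seed is an assumption, for repeatability

# Hypothetical population that we pretend is too large to measure in full.
population = [rng.gauss(170, 10) for _ in range(10_000)]

# Draw a random sample without replacement and use it to estimate
# the population mean.
sample = rng.sample(population, 200)
sample_mean = statistics.mean(sample)

# In practice the population mean is unknown; it is computed here only
# to show how close the sample-based estimate lands.
population_mean = statistics.mean(population)

print(round(sample_mean, 1), round(population_mean, 1))
```

The sample mean will rarely match the population mean exactly, and quantifying that gap is the job of the inferential techniques covered later, such as confidence levels and hypothesis testing.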

In this absolute beginners’ introduction to statistics, we focus primarily on inferential statistics to prepare you for further study in the field of data science and other areas of quantitative research where statistical inference is applied. While there is a vast number of statistical methods to master, this introductory book covers core inferential techniques including hypothesis testing, linear regression analysis, confidence levels, probability theory, and data distribution. Descriptive methods such as central tendency measures and standard deviation are also covered in the first half of the book. These methods complement inferential analysis by allowing statisticians to familiarize themselves with the makeup and general features of the dataset. (In statistics, you can never be too familiar with your data.)

Before we proceed to the next chapter, it’s important to note that there are four major categories of statistical measures used to describe data. Those four categories are:

1) Measures of Frequency: Counts the number of times a particular data value occurs in the dataset, such as the number of Democrat and Republican voters within a sample population.

2) Measures of Central Tendency: Examines data values that accumulate in the middle of the dataset’s distribution such as the median and mode. Discussed in Chapter 5.

3) Measures of Spread: Describes how similar or varied observed values are within the dataset such as standard deviation. Discussed in Chapter 6.

4) Measures of Position: Identifies the exact location of an observed value within the dataset such as standard scores. Discussed in Chapter 7.
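
To make the four categories concrete, here is one measure from each, computed on a small made-up list of scores. This is a hedged sketch, not the book’s own example: the data is hypothetical, and the population form of the standard deviation is assumed (a sample standard deviation would give a slightly larger figure).

```python
import statistics
from collections import Counter

scores = [52, 61, 61, 67, 70, 70, 70, 78, 85, 96]  # hypothetical data

# 1) Frequency: how many times each value occurs.
frequency = Counter(scores)

# 2) Central tendency: the median and the mode.
median = statistics.median(scores)
mode = statistics.mode(scores)

# 3) Spread: population standard deviation (an assumption, see above).
std_dev = statistics.pstdev(scores)

# 4) Position: the standard score (z-score) of the highest observation,
#    i.e. how many standard deviations it sits above the mean.
mean = statistics.mean(scores)
z_score = (96 - mean) / std_dev

print(frequency[70], median, mode)
print(round(std_dev, 2), round(z_score, 2))
```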