Why Aren’t We Better Prepared to Deal With Asymmetric Data?

Kurtosis Doesn’t Measure Peakedness. The Central Limit Theorem Won’t (Always) Save You. What Are The Implications of Asymmetries in War? Superficial Knowledge Doesn’t Equate to Expertise.


Statistics is arguably the most important branch of applied mathematics because of how prevalent data-driven decision-making has become. Many non-STEM programs at the undergraduate and graduate level require a course in statistics for the same reason (to the absolute horror of those who would rather write essays or examine inconsequential theory ad nauseam). It's clear that statistical inference is integral to professional and academic life. So why aren't we taught to deal with asymmetric data and to understand their consequences better?

I was working on a visualization project for a class when this idea really hit me. The objective of the assignment was to take a data set and visualize the distribution of the variables in it. I found a data set that met the assignment's criteria, and it all seemed easy enough. At the time, I expected to see something like this when I plotted the distribution of the variables (give or take):

Well, maybe I didn’t expect a perfectly symmetrical distribution…but you get the idea

This is what I got instead:

Google “lol what pear” to get a visual understanding of how I felt after I plotted this

In an attempt to "fix" the data, I resorted to the only recourse in my statistical toolkit at the time: the logarithmic transformation. Here's what the transformed data look like:

The data were re-scaled, but it’s still a far cry from fitting anything resembling the normal distribution. I realized I didn’t quite understand the implications of the distribution’s shape. None of the courses in statistics I had taken up to that point spent a great deal of time covering asymmetric distributions. This situation was a catalyst for reevaluating how I approached statistical inference.
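For context, the transformation itself is a one-liner. Here's a minimal sketch, assuming a data frame df with a strictly positive column deaths (the names are mine, not from the original project):

# compare the raw and log-transformed shapes side by side
# note: log() requires strictly positive values; use log1p() if zeros are present
par(mfrow = c(1, 2))
hist(df$deaths, main = "Raw", xlab = "deaths")
hist(log(df$deaths), main = "Log-transformed", xlab = "log(deaths)")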

The Third and Fourth Moments

Skewness and kurtosis are the standard measures of how well a distribution conforms to the normal distribution. I had been introduced to skewness and understood its consequences (somewhat) but had never been taught anything about kurtosis. I searched for the word "kurtosis" in the most recent statistics textbook I used for a graduate-level course. There was nothing. To be fair, "skewness" is mentioned five times in the text, and the course focused on application rather than theory. I thought about other statistics courses I've taken, such as my introductory statistics course and the econometrics courses during my undergraduate studies. Non-normal distributions were covered, but the focus was on the methodology used to justify assuming normality (the law of large numbers and the central limit theorem always came to the rescue).

A More Technical Review of Kurtosis

I'm going to focus on kurtosis because there seems to be confusion about what it actually measures. In his book The (Mis)behavior of Markets, Benoit Mandelbrot says kurtosis measures how well real data fit the ideal bell curve, and that kurtosis is the "spice" in the statistical broth (I prefer to call it the statistical meatball). In the notes of his book, Mandelbrot goes into more detail:

“Kurtosis is one of the four standard measures of a distribution curve’s shape, which are based on the first four “moments.” The first moment is the average value; the second is the variance; the third is the skewness — a measure of how asymmetrically the data are distributed around the average; and the fourth is kurtosis, a measure of how tall or squat the curve is. A bell curve has a kurtosis of three. Larger values imply the curve is tall in the center, with fat tails.”

That last sentence is important because it's actually a bit misleading. The height of a curve (its "peakedness") is not determined by kurtosis, as Peter H. Westfall explains in his article Kurtosis as Peakedness, 1905–2014. R.I.P. Westfall suggests kurtosis gets conflated with peakedness because heavy-tailed distributions sometimes have higher peaks than light-tailed distributions, and he uses a histogram of n = 1000 Cauchy random variables to illustrate the point:

The dotted lines are at +/- one standard deviation

The equation for kurtosis is:

$$k = \frac{1}{n}\sum_{i=1}^{n} Z_i^4$$

where

$$Z_i = \frac{x_i - m}{s}$$

and m and s are the sample mean and sample standard deviation of the Cauchy sample above.
Now let's use R to write our own function to calculate kurtosis for the Cauchy distribution.

# cauchy distribution
set.seed(12344)
cauchy <- rcauchy(1000)
#---------------------------
# create a kurtosis function
#---------------------------
k <- function(x) {
  # standardize, then average the fourth powers
  z <- (x - mean(x)) / sd(x)
  (1 / length(x)) * sum(z^4)
}
#-----------------------------------------------
# calculate kurtosis for the cauchy distribution
#-----------------------------------------------
k_cauchy <- k(cauchy)
k_cauchy
## [1] 436.5128

If you remember from Mandelbrot's note, k = 3 for a normal distribution. For this Cauchy sample, k = 436.5128. That's a spicy meatball. (Strictly speaking, the Cauchy distribution has no finite moments at all, so this number is a property of the sample rather than the distribution.) It confirms what we already saw in the histogram: we're dealing with heavy-tailed data. But now, let's focus on the proportion of the kurtosis statistic determined by the data within one standard deviation of the mean. Put another way, we'll determine the proportion of the statistic contributed by the data near the peak versus the data far from it. To do this, we'll isolate the calculation of Z in its own function and compute k manually.

# isolate Z in its own function
Z <- function(x) (x - mean(x)) / sd(x)
# calculate z for the cauchy distribution
z_cauchy <- Z(cauchy)
# kurtosis contribution from within one standard deviation
# (divide by the full n = 1000, not the subset size)
z_in <- z_cauchy[abs(z_cauchy) <= 1]
k_in <- (1 / 1000) * sum(z_in^4)
# kurtosis contribution from outside one standard deviation
z_out <- z_cauchy[abs(z_cauchy) > 1]
k_out <- (1 / 1000) * sum(z_out^4)
# the two pieces add back up to the original statistic
k_in + k_out
## [1] 436.5128
k_cauchy
## [1] 436.5128
# proportion of k within one standard deviation
k_in / k_cauchy
## [1] 1.670914e-05
# proportion of k outside one standard deviation
k_out / k_cauchy
## [1] 0.9999833

The values outside one standard deviation account for 99.998% of the kurtosis statistic. Of note, only 17 of the 1000 Cauchy random variables fall outside a single standard deviation. This means 1.7% of the data accounts for essentially all of the kurtosis. To quote Westfall, "…the notion that the kurtosis statistic has anything to do with the data near the peak is nothing short of silly with these data." Distributions can have the same kurtosis but different peaks. Kurtosis is about the outliers, not the peaks.
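To make that last point concrete, here's a small demonstration of my own (the same idea as Westfall's, though not an example from his paper): the Laplace distribution and Student's t-distribution with 6 degrees of freedom both have a theoretical kurtosis of 6, yet the Laplace has a sharp cusp at its center while the standardized t is much flatter there. The k() function from earlier recovers roughly the same kurtosis for both:

set.seed(12344)
n <- 1e6
# laplace sample via inverse transform, scaled to unit variance
b <- 1 / sqrt(2)                        # variance of laplace is 2 * b^2 = 1
u <- runif(n) - 0.5
laplace <- -b * sign(u) * log(1 - 2 * abs(u))
# student's t with 6 df, also scaled to unit variance (var = 6 / 4)
t6 <- rt(n, df = 6) / sqrt(6 / 4)
# both theoretical kurtoses are 6; the sample values land in that
# neighborhood, though the t estimate is noisy (its 8th moment is infinite)
k(laplace)
k(t6)
# the densities at the center are visibly different
plot(density(laplace), main = "Same kurtosis, different peaks")
lines(density(t6), lty = 2)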

Слава Україні (Glory to Ukraine)

The very asymmetric data I plotted earlier (the plot where I told you to google "lol what pear") is the distribution of deaths suffered by Ukrainian government forces in their ongoing war with the Russian state and its proxies. Here is the distribution again, with labels:

The kurtosis of the distribution is 962.2642, a really spicy meatball (Гостра Фрикаделька, Ukrainian for "spicy meatball"). I found that a single observation in the data is responsible for 99.35008% of the kurtosis. I plotted the number of deaths per observation against its individual z-score and scaled the size of the points by each observation's contribution to k. Here's the graph:

See the point on the top right? That's the meatball. That observation represents the Battle of Ilovaisk (Бої за Іловайськ). It marks the point when regular Russian troops became directly involved and Ukrainian government forces suffered heavy losses as a result. The battle had huge implications and heavily influenced the course of the war, reversing the gains Ukrainian forces had made against Russian proxy forces up to that point.
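Here's a minimal sketch of how a per-observation breakdown like the one in the plot can be computed, assuming the death counts live in a numeric vector called deaths (the name is mine, not from the original analysis):

# per-observation contribution to the kurtosis statistic,
# reusing the Z() function defined earlier
z_deaths <- Z(deaths)
k_i      <- z_deaths^4 / length(deaths)   # each observation's piece of k
k_total  <- sum(k_i)                      # identical to k(deaths)
# share of the statistic attributable to each observation;
# the maximum is the Ilovaisk observation
share <- k_i / k_total
max(share)
# deaths against z-scores, point size scaled by contribution
plot(z_deaths, deaths, cex = 1 + 20 * share,
     xlab = "z-score", ylab = "deaths")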

A picture I took at a war memorial in Kyiv. A large number of the fallen whose pictures are memorialized on that wall died at Ilovaisk.

What Are The Implications?

Given the properties of the distribution of Ukrainian deaths we've analyzed, the average becomes an almost meaningless statistic (m ≈ 1.56). Ukrainian forces suffered 366 deaths at the Battle of Ilovaisk alone (according to the Uppsala data). When a distribution is this asymmetric, it's the extremes that matter: they determine its statistical properties. An extreme event in war could mean a huge loss of life or the loss of a nation's sovereignty. In economic terms, an extreme event could trigger a crippling recession. In financial terms, it could mean going bust in a second.

The Central Limit Theorem Will Not (Always) Save You

When asymmetric distributions were covered in my statistics and econometrics courses, the recourse was always the central limit theorem (CLT) and the law of large numbers (LLN). The LLN roughly states that if a distribution has a finite mean, the average of independent random variables drawn from it converges to that mean as the sample size increases. The CLT states that the sum of n independent random variables with a finite second moment (variance) ends up looking like a normal distribution (Taleb 28). This is hugely important in linear regression because a consistent estimator converges in probability to the quantity being estimated (Taleb 128).
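The "finite mean" condition is doing real work in that definition. As a quick sketch of my own (not an example from Taleb's book), compare the running mean of a uniform sample, which settles down quickly, with the running mean of a Cauchy sample, which has no mean to converge to:

set.seed(12344)
n <- 1e4
# running mean of a uniform sample: converges to 0.5
run_unif <- cumsum(runif(n)) / seq_len(n)
# running mean of a cauchy sample: never settles down
run_cauchy <- cumsum(rcauchy(n)) / seq_len(n)
par(mfrow = c(1, 2))
plot(run_unif, type = "l", main = "Uniform", ylab = "running mean")
abline(h = 0.5, lty = 2)
plot(run_cauchy, type = "l", main = "Cauchy", ylab = "running mean")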

How Fast Do the Distributions Converge?

So, given our definitions of the LLN and CLT, all we need is more data. That's what I thought when I first began to understand these concepts: more data will always fix the problem. However, I never thought about how much data is required in practice, or how the underlying distribution of the data affects convergence. To visualize these concepts, let's look at a uniform distribution with support [0, 1] and then add two, three, and four independent, identically distributed variables to the original variable.

Fast Convergence: the Uniform Distribution
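For reference, here's a minimal base-R sketch of my own that reproduces this kind of experiment (the original post's code may differ):

set.seed(12344)
n <- 1e5
# draws of sums of 1, 2, 3, and 4 iid uniform(0, 1) variables
sums <- lapply(1:4, function(m) colSums(matrix(runif(m * n), nrow = m)))
# the bell shape appears almost immediately
par(mfrow = c(2, 2))
for (m in 1:4) hist(sums[[m]], breaks = 50,
                    main = paste(m, "summand(s)"), xlab = "sum")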

As you can see from the graph above, the distribution became bell-shaped immediately. Now let’s explore a different distribution, the Pareto. Take a look at the histogram below:

In this Pareto sample, approximately 80% of the observations fall below the mean. The sample was randomly generated using the Pareto package in R. When I first plotted the distribution, I was struck by how similar it looked to the distribution I plotted from the Ukrainian conflict data. I calculated how much of the Ukrainian deaths data fell below the mean: it was 80% (80.003428% to be exact). Look at the two distributions side by side:

The deaths data is even more fat-tailed
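Here's a minimal sketch of that below-the-mean check, using inverse-transform sampling instead of the Pareto package so it runs in base R; the tail index of 1.5 is my own choice, picked because it puts roughly 80% of the draws below the sample mean:

set.seed(12344)
n     <- 1e5
alpha <- 1.5   # tail index; smaller alpha means fatter tails
t0    <- 1     # minimum value (scale)
# inverse-transform sampling from a Pareto(t0, alpha)
pareto <- t0 * runif(n)^(-1 / alpha)
# fraction of observations below the sample mean: roughly 0.8
mean(pareto < mean(pareto))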

We saw earlier how quickly the uniform converged with four summands. Let's see what happens to the convergence of the Pareto distribution with 100 summands:

Now 1000:

Now 10,000:

Eventually you'll get there, but this distribution is very stubborn about losing its skewness. In theory, we know the distribution will converge as n approaches infinity. But in practice, we never have infinite data. What if the underlying distribution of an estimator in a regression equation is fat-tailed? Doesn't this mean our predictions are flawed? I'm still trying to better understand that question.
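The summation experiment itself can be sketched the same way as the uniform one. One caveat of my own: I'm using a tail index above 2 here so the variance is finite and the CLT formally applies; with a fatter tail (like the 1.5 used above) the variance is infinite and the sums never normalize at all:

set.seed(12344)
reps  <- 2000
alpha <- 2.5   # variance is finite, but the tail is fat enough
               # that the bell shape emerges only very slowly
# distribution of the sum of m pareto draws, for m = 100, 1000, 10000
par(mfrow = c(1, 3))
for (m in c(100, 1000, 10000)) {
  s <- replicate(reps, sum(runif(m)^(-1 / alpha)))
  hist(s, breaks = 100, main = paste(m, "summands"), xlab = "sum")
}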

Final Thoughts

I am by no means an expert in anything I've written about in this article (I actually don't like the word "expert" because I feel too many people pass superficial knowledge off as expertise). The more I discover and explore different types of data, the more I observe asymmetries. The more I observe asymmetries, the more I wonder why my formal education was so bereft of tools for dealing with them (the answer was usually a log transformation). Even in my professional life, where I mostly deal with categorical data, I observe asymmetries regularly, and they have very important consequences. Why aren't we better prepared to deal with asymmetries?

Notes

The technical review of kurtosis came from Westfall's paper, and I used his examples. The examples of the speed of convergence came from Taleb's book; I used his examples as well.

For a really great treatment of asymptotics and consistency in linear regression, see this slide deck made by a former professor of mine, Ed Rubin.

If you’d like to review all of the code I wrote for this blog, you can view it here.

References

Mandelbrot, B. and Hudson, R., 2008. The (Mis)behavior of Markets. New York: Basic Books.

Westfall, P., 2014. Kurtosis as Peakedness, 1905–2014. R.I.P. The American Statistician, 68(3), pp. 191–195.

Taleb, N. N., 2020. Statistical Consequences of Fat Tails: Real World Preasymptotics, Epistemology, and Applications. pp. 28, 131, 133. (download for free here)
