Vilfredo Federico Pareto was an Italian economist and sociologist who described what became known as the famous Pareto Principle.
In short, it says that roughly 80% of the outcomes come from 20% of the causes. That means that if you are able to identify that small set of root causes, you can fix about 80% of your issues.
In R we can draw a Pareto chart (bars sorted from largest to smallest, plus a line for the cumulative percentage) with the built-in graphics... or we can go with the far more trendy, fashionable, and powerful ggplot2 library (a self-described implementation of "the grammar of graphics").
First of all, we will need the data to feed the chart.
# Sample data: defect types and their frequencies
defects <- c("A" = 50, "B" = 30, "C" = 15, "D" = 5)
# Sort in descending order
defects_sorted <- sort(defects, decreasing = TRUE)
# Calculate cumulative percentages
cumulative_freq <- cumsum(defects_sorted)
cumulative_pct <- cumulative_freq / sum(defects_sorted) * 100
df_defects <- data.frame(
  category = names(defects_sorted),
  frequency = as.numeric(defects_sorted),
  cumulative_freq,
  cumulative_pct
)
What happened above? We created a named vector of different defects and their frequencies (A happens 50 times, B 30 times, etc.).
After that, we sorted the defects in descending order (remember, we want to find the root causes behind 80% of the trouble, and those are easier to spot if we put the larger ones first).
Then we completed a typical frequency table with the cumulative frequency and the cumulative relative frequency (which is the data we want to show in the chart later on), and put everything into a data frame.
Now we have a beautiful data frame that contains the following data:
  category frequency cumulative_freq cumulative_pct
A        A        50              50             50
B        B        30              80             80
C        C        15              95             95
D        D         5             100            100
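With the table in hand, the "vital few" categories can also be read off in code instead of by eye. This is only a minimal sketch, assuming an 80% cut-off; the names threshold, n_needed and vital_few are my own and are not used anywhere else in this article:

```r
# Same sample data as above
defects <- c("A" = 50, "B" = 30, "C" = 15, "D" = 5)
defects_sorted <- sort(defects, decreasing = TRUE)
cumulative_pct <- cumsum(defects_sorted) / sum(defects_sorted) * 100

# Assumed cut-off: 80 follows the Pareto Principle, but it is only a convention
threshold <- 80

# Position of the first category whose cumulative percentage reaches the cut-off
n_needed <- unname(which(cumulative_pct >= threshold)[1])

# Names of the categories that together reach the cut-off
vital_few <- names(defects_sorted)[seq_len(n_needed)]
print(vital_few)
```

For the sample data this returns A and B, the two defects that together account for 80% of the total.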
If you don't have the ggplot2 library installed, simply install it with the following instruction:
install.packages("ggplot2")
Then, let the magic happen:
library(ggplot2)
ggplot(data = df_defects, mapping = aes(x = category, y = frequency)) +
  geom_col() +
  geom_line(mapping = aes(x = category, y = cumulative_pct), group = 1,
            colour = "red", size = 3) +
  xlab("Category") +
  ylab("Frequency")
ggplot2 can look a bit scary at first, but believe me, it makes perfect sense once you are past the initial learning curve. For now let's look at the (beautiful) output:
As you can see, the red line shows the accumulated percentage on top of every category. In this example, the first two columns on the left reach exactly 80%, and we need the third one to get above it.
Going back to the code.
ggplot() creates an empty canvas and sets which dataset to use (df_defects) and how to map the data (categories go on the x axis and frequencies on the y axis).
geom_col() draws the actual bars based on those mappings.
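geom_line() adds a red line showing cumulative percentages, with group = 1 telling ggplot2 to connect all points into one continuous line, and size = 3 making it thick and visible (from ggplot2 3.4.0 onwards, linewidth is the preferred argument for line thickness and size gives a deprecation warning).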
xlab() and ylab() add descriptive labels to the axes so readers know what they're looking at.
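One caveat: the chart above lines up nicely partly by coincidence. The sample frequencies sum to 100, so counts and percentages share the same scale, and the names A to D sort alphabetically in the same order as they do by frequency. For other data, a sketch along these lines may help. The defect names and counts here are made up for illustration, and the approach (a factor to fix the bar order, plus a secondary axis built with sec_axis()) is just one reasonable option, not the only way to do it:

```r
library(ggplot2)

# Hypothetical data: totals do not sum to 100 and names are not in frequency order
defects <- c("Scratch" = 12, "Dent" = 41, "Crack" = 7, "Stain" = 20)
defects_sorted <- sort(defects, decreasing = TRUE)
total <- sum(defects_sorted)

df_defects <- data.frame(
  category = names(defects_sorted),
  frequency = as.numeric(defects_sorted),
  cumulative_freq = as.numeric(cumsum(defects_sorted))
)

# A factor with explicit levels keeps the bars in descending order;
# a plain character column would be drawn alphabetically
df_defects$category <- factor(df_defects$category, levels = df_defects$category)

pareto_plot <- ggplot(df_defects, aes(x = category, y = frequency)) +
  geom_col() +
  # The line is drawn in count units so it shares the bars' scale
  geom_line(aes(y = cumulative_freq), group = 1, colour = "red") +
  geom_point(aes(y = cumulative_freq), colour = "red") +
  # The right-hand axis relabels those counts as percentages of the total
  scale_y_continuous(
    name = "Frequency",
    sec.axis = sec_axis(~ . / total * 100, name = "Cumulative %")
  ) +
  xlab("Category")

print(pareto_plot)
```

The left axis still reads as raw counts, while the right axis shows the same heights as a percentage of the total, so the 80% mark can be read directly whatever the data adds up to.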
In summary, and unlike other methods I am more used to (this blew my mind at the beginning), in ggplot2 you build visualizations by adding layers with the + operator, like stacking transparent sheets.
Each function adds a specific element: ggplot() creates the foundation, geom_col() adds bars, geom_line() overlays a line, and xlab() and ylab() add the axis labels.
This modular approach lets you combine different visual elements incrementally and reach a high degree of customization and beauty, if you have some decent taste.
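To make the layering idea concrete, here is the same chart built one step at a time, with every intermediate result stored in a variable. The variable names (base_plot, with_bars and so on) are just mine for illustration, and the data frame is retyped by hand so the snippet runs on its own:

```r
library(ggplot2)

# Same values as the data frame built earlier in the article
df_defects <- data.frame(
  category = c("A", "B", "C", "D"),
  frequency = c(50, 30, 15, 5),
  cumulative_pct = c(50, 80, 95, 100)
)

# Step 1: the foundation, with data and mappings but nothing drawn yet
base_plot <- ggplot(df_defects, aes(x = category, y = frequency))

# Step 2: add the bars
with_bars <- base_plot + geom_col()

# Step 3: overlay the cumulative line on top of the bars
with_line <- with_bars +
  geom_line(aes(y = cumulative_pct), group = 1, colour = "red")

# Step 4: finish with the axis labels
final_plot <- with_line + xlab("Category") + ylab("Frequency")

print(final_plot)
```

Each intermediate object is a complete plot in its own right, so you can print with_bars or with_line to see how the chart grows with every layer.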
If you want to learn more about ggplot2, I recommend this other article from another user at dev.to.