Way to Avoid the Pitfall of Cherry-Picking Data

I often cherry-pick samples from a population, even though statisticians have long since developed proper sampling methods for exactly this need. To break the habit, I made myself learn about probability sampling.

In this post, I'll talk about implementing cluster sampling in C. Cluster sampling is one of those handy statistical sampling methods for pulling samples from a population. There are other cool probability sampling techniques too—like simple random sampling, stratified random sampling, and others—but let’s focus on cluster sampling for now.

Why bother with cluster sampling? When you're dealing with a huge population, building a full sampling frame that lists every individual is time-consuming, expensive, and often overkill. Cluster sampling only needs a list of the clusters to draw from, which is much easier to put together.

And if your data is geographically distributed and naturally forms clusters (like neighborhoods, schools, or districts), then cluster sampling is a no-brainer. It’s efficient, practical, and saves you a ton of effort.

So, let’s dive into the game!

Steps for cluster sampling:

Group the members of the population into 𝑁 clusters.
Do an SRS (simple random sampling) to choose 𝑛 of the 𝑁 clusters.
Then estimate the population parameter using all the units in the selected clusters.
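
Step 2 is the only step that involves randomness, so here is a minimal sketch of how the SRS over clusters could look in C. The function name sample_clusters and the use of rand() are assumptions for illustration and are not taken from the program shown later in this post. rand() with a modulo has a slight bias, which a real survey tool would probably want to avoid.

```c
#include <stdlib.h>

/* Hypothetical helper: picks n distinct cluster indices out of 0..N-1 by
 * simple random sampling without replacement (a partial Fisher-Yates shuffle).
 * Writes the chosen indices to out[0..n-1].
 * Returns 0 on success, -1 on invalid arguments or allocation failure. */
int sample_clusters(int N, int n, int *out) {
    if (N <= 0 || n <= 0 || n > N || out == NULL) return -1;

    int *pool = malloc((size_t)N * sizeof *pool);
    if (pool == NULL) return -1;
    for (int i = 0; i < N; i++) pool[i] = i;

    for (int i = 0; i < n; i++) {
        /* Pick a random position in the not-yet-chosen tail and swap it in. */
        int j = i + rand() % (N - i);
        int tmp = pool[i];
        pool[i] = pool[j];
        pool[j] = tmp;
        out[i] = pool[i];
    }

    free(pool);
    return 0;
}
```

Calling srand() once with a seed before sample_clusters(415, 20, picked) would fill picked with 20 distinct indices; how those indices map to actual district IDs depends on the sampling frame.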

Assume this dataset comes from AirAvata, a startup company that provides delivery services. They have expertise in air delivery. They plan to expand their services to a city, let's call it City A. However, AirAvata needs to estimate the appropriate delivery rate for the residents of City A.

Objectives:

Estimate the average amount a person spends per month on goods delivery in City A.
Estimate the 95% confidence interval (CI) for that average.

Assumption:

There are 415 districts in City A.

Since City A is already divided into districts, cluster sampling is a natural fit, with each district acting as one cluster. The dataset has three columns:

District : the ID of the sampled district
People : the number of people in the district
Total Spend : the total monthly spend on goods delivery in the district (in Rupiah)
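
If the dataset is stored as CSV with these three columns, one row could be parsed roughly like this. It is only a sketch under that assumption; SurveyRow and parse_row are hypothetical names used for illustration.

```c
#include <stdio.h>

/* Hypothetical row type mirroring the three columns described above. */
typedef struct {
    int district;        /* district ID */
    int people;          /* number of people in the district */
    double total_spend;  /* total monthly spend in Rupiah */
} SurveyRow;

/* Parses one "District,People,Total Spend" line.
 * Returns 1 if all three fields were read and look sane, 0 otherwise
 * (the CSV header line therefore yields 0 and can simply be skipped). */
int parse_row(const char *line, SurveyRow *row) {
    if (line == NULL || row == NULL) return 0;
    int fields = sscanf(line, "%d,%d,%lf",
                        &row->district, &row->people, &row->total_spend);
    return fields == 3 && row->people > 0 && row->total_spend >= 0.0;
}
```

Rejecting rows with a non-positive people count keeps a bad line from silently distorting the per-person average later on.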

Based on the dataset above, we have the following information:

Number of all districts in City A = 415 districts
Number of sampled districts in City A = 20 districts

To estimate the average amount a person spends per month on goods delivery, we use:

The equation above comes from the statistics course material by Pacmann.
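
The equation itself is not reproduced in this text. Judging from the functions in the code that follows, it is presumably the ratio estimator for cluster sampling. The notation here is an assumption: y_i is the total spend in sampled district i, M_i is the number of people in it, N is the number of districts in the city, and n is the number of sampled districts.

```latex
\hat{r} = \frac{\sum_{i=1}^{n} y_i}{\sum_{i=1}^{n} M_i},
\qquad
s_c^2 = \frac{1}{n-1} \sum_{i=1}^{n} \left( y_i - \hat{r}\, M_i \right)^2

\widehat{\mathrm{Var}}(\hat{\tau}) = N (N - n)\, \frac{s_c^2}{n},
\qquad
\hat{M} = N \cdot \frac{1}{n} \sum_{i=1}^{n} M_i,
\qquad
\widehat{\mathrm{Var}}(\hat{r}) = \frac{\widehat{\mathrm{Var}}(\hat{\tau})}{\hat{M}^2}

\text{95\% CI:} \quad \hat{r} \pm 1.96 \sqrt{\widehat{\mathrm{Var}}(\hat{r})}
```

Each term maps onto the code: s_c^2 is calculate_sc_squared, the variance of the total is calculate_variance_tau_hat, and the last two steps happen inside calculate_and_display_metrics.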

In the code below, I represent the dataset with a struct, then implement the equation above in several functions.

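#include <math.h>
#include <stdio.h>

#define TOTAL_DISTRICTS 415  // N: number of districts in City A
#define Z_95_CI 1.96         // z value used for the 95% confidence interval

// One sampled district (cluster)
typedef struct 
{
    int district;        // district ID
    int people;          // number of people in the district
    double total_spend;  // total monthly spend in the district (Rupiah)
} SurveyData;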

// Sample variance between clusters around the ratio r (needs count >= 2)
double calculate_sc_squared(SurveyData* data, int count, double r) {
    double sum = 0.0;
    for (int i = 0; i < count; i++) {
        double term = data[i].total_spend - (r * data[i].people);
        sum += pow(term, 2);
    }
    return sum / (count - 1);
}

// Estimated variance of the population total: N (N - n) sc^2 / n
double calculate_variance_tau_hat(int N, int n, double sc_squared) {
    return N * (N - n) * (sc_squared / n);
}

// Computes the point estimate, its variance, and the 95% CI, then prints them
void calculate_and_display_metrics(SurveyData* data, int count) {
    int total_people = 0;
    double total_spend = 0.0;
    double average_spend;

    // Calculate totals
    for (int i = 0; i < count; i++) {
        total_people += data[i].people;
        total_spend += data[i].total_spend;
    }

    printf("Total people across all districts: %d\n", total_people);
    printf("Total spend across all districts: %.2f\n", total_spend);

    if (total_people > 0) {
        average_spend = total_spend / (double)total_people;
        printf("\nAverage monthly spend per person on goods delivery: %.2f\n", average_spend);

        // Calculate variance components
        double sc_squared = calculate_sc_squared(data, count, average_spend);
        double var_y_total_est = calculate_variance_tau_hat(TOTAL_DISTRICTS, count, sc_squared);

        printf("\nVariance Calculations:\n");
        printf("Sampled cluster variance (sc²): %.4f\n", sc_squared);
        printf("Variance of total pop. est.: (Rp %.0f)^2\n", sqrt(var_y_total_est));

        // Calculate mean people per district and total population estimate
        double M_bar = (double)total_people / count;
        double M_tot_est = TOTAL_DISTRICTS * M_bar;

        printf("\nPopulation Estimates:\n");
        printf("Mean people per district: %.0f\n", M_bar);
        printf("Estimated total population in city A: %.0f\n", M_tot_est);

        // Variance of mean per person estimator
        double var_y_mean_est = var_y_total_est / pow(M_tot_est, 2);
        printf("Variance of mean per person est.: (Rp %.0f)^2\n", sqrt(var_y_mean_est));

        // Confidence interval calculations
        printf("\nConfidence Interval:\n");
        printf("z-stat (95%% CI): %.2f\n", Z_95_CI);
        double d = Z_95_CI * sqrt(var_y_mean_est);
        printf("Margin of error (d): Rp %.0f\n", d);
        printf("95%% Confidence Interval: Rp %.2f ± Rp %.0f\n", 
               average_spend, d);
        printf("                      (Rp %.0f - Rp %.0f)\n",
               average_spend - d, average_spend + d);
    } else {
        printf("\nError: No people data available for calculation\n");
    }
}
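
To check the numbers without the printing code, the same calculation can be packed into a function that returns its results. This is a sketch, not the version from the repository; Cluster, Estimate, and estimate_mean_spend are hypothetical names, and it assumes at least two sampled clusters.

```c
/* Hypothetical types for illustration only. */
typedef struct { int people; double total_spend; } Cluster;
typedef struct { double mean; double var_mean; } Estimate;

/* Ratio estimate of the mean spend per person and the estimated variance
 * of that mean. N = clusters in the population, n = sampled clusters.
 * Returns {0, 0} if n < 2 or the sample contains no people. */
Estimate estimate_mean_spend(const Cluster *c, int n, int N) {
    Estimate e = {0.0, 0.0};
    double people = 0.0, spend = 0.0;
    for (int i = 0; i < n; i++) {
        people += c[i].people;
        spend += c[i].total_spend;
    }
    if (n < 2 || people <= 0.0) return e;

    e.mean = spend / people;  /* ratio estimator r */

    double ss = 0.0;          /* sum of squared residuals around r * M_i */
    for (int i = 0; i < n; i++) {
        double resid = c[i].total_spend - e.mean * c[i].people;
        ss += resid * resid;
    }
    double sc2 = ss / (n - 1);
    double var_total = (double)N * (N - n) * sc2 / n;
    double m_total = N * (people / n);  /* estimated population size */
    e.var_mean = var_total / (m_total * m_total);
    return e;
}
```

The margin of error is then the z value times the square root of var_mean, which is what calculate_and_display_metrics prints as d.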

Here is the input data, followed by the output of the compiled program:

District,People,Total Spend

412,171,64050000
324,185,64500000
182,363,123900000
77,148,55200000
418,257,87450000
185,311,107250000
124,244,74400000
320,416,129750000
332,330,104700000
16,305,100500000
352,272,94800000
417,146,49050000
10,195,65700000
32,238,86250000
94,378,126300000
420,239,78900000
214,326,111600000
200,197,69750000
308,182,66750000
268,318,96750000
data: 20 

Number of all districts in city A: 415
Number of sampled districts in city A: 20
Total records: 21

Total people across all districts: 5221
Total spend across all districts: 1757550000.00

Average monthly spend per person on goods delivery: 336630.91

Variance Calculations:
Sampled cluster variance (sc²): 26330932965053.5312
Variance of total pop. est.: (Rp 464558833)^2

Population Estimates:
Mean people per district: 261
Estimated total population in city A: 108336
Variance of mean per person est.: (Rp 4288)^2

Confidence Interval:
z-stat (95% CI): 1.96
Margin of error (d): Rp 8405
95% Confidence Interval: Rp 336630.91 ± Rp 8405
                      (Rp 328226 - Rp 345036)

The full implementation in C is posted on GitHub.

Mean number of people per district: 261
Estimated total number of people in City A: 108336
Variance of the estimated mean per person: (Rp 4288)^2

With 95% confidence, we estimate that the true average monthly spending on goods delivery in City A falls between Rp 328,226 and Rp 345,036.

References:
Pacmann statistics course (recommended)
GeeksForGeeks

Rama Reksotinoyo @ramareksotinoyo
