Benchmarking CSV File Processing: Golang vs NestJS vs PHP vs Python
Linda Sebastian

Linda Sebastian @rocklinda

About: A software engineer with expertise in backend development.

Location:
Indonesia
Joined:
Sep 29, 2021

Benchmarking CSV File Processing: Golang vs NestJS vs PHP vs Python

Publish Date: Aug 12 '24
34 24

Introduction

Processing large CSV files efficiently is a common requirement in many applications, from data analysis to ETL (Extract, Transform, Load) processes. In this article, I want to benchmark the performance of four popular programming languages—Golang, NodeJS with NestJS, PHP, and Python—in handling large CSV files on a MacBook Pro M1. I aim to determine which language provides the best performance for this task.

Test Environment

Hardware: MacBook Pro M1, 256GB SSD, 8GB RAM

Software:

  • macOS Sonoma 14.5
  • PHP 8.3.6
  • Golang 1.22.4
  • Node.js 22.0.0 with NestJS
  • Python 3.12.3

Test Data

I used a synthetic CSV file named sales_data.csv with approximately 1 million rows, each containing transaction details such as transaction_id, product_id, quantity, price, and timestamp.

Task Description

For each language, the script performs the following tasks:

  1. Reads the CSV file.
  2. Calculates the total sales amount.
  3. Identifies the product with the highest sales.

Implementation

Here are the scripts used for each language:

Golang Script:

sales.go



package main

import (
    "encoding/csv"
    "fmt"
    "os"
    "strconv"
    "time"
)

func main() {
    start := time.Now()

    file, err := os.Open("../generate-csv/sales_data.csv")
    if err != nil {
        fmt.Println("Error:", err)
        return
    }
    defer file.Close()

    reader := csv.NewReader(file)
    _, _ = reader.Read() // Skip header

    totalSales := 0.0
    productSales := make(map[string]float64)

    for {
        line, err := reader.Read()
        if err != nil {
            break
        }
        productID := line[1]
        quantity, _ := strconv.Atoi(line[2])
        price, _ := strconv.ParseFloat(line[3], 64)
        total := float64(quantity) * price

        totalSales += total
        productSales[productID] += total
    }

    var topProduct string
    var topSales float64
    for product, sales := range productSales {
        if sales > topSales {
            topProduct = product
            topSales = sales
        }
    }

    elapsed := time.Since(start)
    fmt.Printf("Golang Execution time: %s\n", elapsed)
    fmt.Printf("Total Sales: $%.2f\n", totalSales)
    fmt.Printf("Top Product: %s with sales $%.2f\n", topProduct, topSales)
}



Enter fullscreen mode Exit fullscreen mode

NestJS Script:

csv.service.ts



import { Injectable } from '@nestjs/common';
import * as fs from 'fs';
import * as fastcsv from 'fast-csv';

// path file CSV
const GLOBAL_CSV_PATH = '../generate-csv/sales_data.csv';

@Injectable()
@Injectable()
export class CsvService {
  async parseCsv(): Promise<{
    nestExecutionTime: number;
    totalSales: number;
    topProductSales: number;
  }> {
    return new Promise((resolve, reject) => {
      const startTime = process.hrtime();

      let totalSales = 0;
      const productSales: { [key: string]: number } = {};

      fs.createReadStream(GLOBAL_CSV_PATH)
        .pipe(fastcsv.parse({ headers: true, delimiter: ',' }))
        .on('data', (row) => {
          const productID = row.product_id;
          const quantity = parseInt(row.quantity, 10);
          const price = parseFloat(row.price);
          const total = quantity * price;
          totalSales += total;
          if (!productSales[productID]) {
            productSales[productID] = 0;
          }
          productSales[productID] += total;
        })
        .on('end', () => {
          const topProduct = Object.keys(productSales).reduce((a, b) =>
            productSales[a] > productSales[b] ? a : b,
          );
          const topProductSales = productSales[topProduct] || 0;
          const endTime = process.hrtime(startTime);
          const nestExecutionTime = endTime[0] + endTime[1] / 1e9;

          console.log(`NestJS Execution time: ${nestExecutionTime} seconds`);
          console.log(`Total Sales: $${totalSales}`);
          console.log(
            `Top Product: ${topProduct} with sales $${topProductSales}`,
          );

          resolve({
            nestExecutionTime,
            totalSales,
            topProductSales,
          });
        })
        .on('error', (error) => reject(error));
    });
  }
}


Enter fullscreen mode Exit fullscreen mode

csv.controller.ts



import { Controller, Get } from '@nestjs/common';
import { CsvService } from './csv.service';

@Controller('csv')
export class CsvController {
  constructor(private readonly csvService: CsvService) {}

  @Get('parse')
  async parseCsv(): Promise<{
    nestExecutionTime: number;
    totalSales: number;
    topProductSales: number;
  }> {
    return this.csvService.parseCsv();
  }
}


Enter fullscreen mode Exit fullscreen mode

PHP Script

sales.php



<?php
$start_time = microtime(true);

$file = fopen("../generate-csv/sales_data.csv", "r");
$total_sales = 0;
$product_sales = [];

fgetcsv($file); // Skip header
while (($line = fgetcsv($file)) !== false) {
    $product_id = $line[1];
    $quantity = (int)$line[2];
    $price = (float)$line[3];
    $total = $quantity * $price;

    $total_sales += $total;
    if (!isset($product_sales[$product_id])) {
        $product_sales[$product_id] = 0;
    }
    $product_sales[$product_id] += $total;
}
fclose($file);

arsort($product_sales);
$top_product = array_key_first($product_sales);

$end_time = microtime(true);
$execution_time = ($end_time - $start_time);

echo "PHP Execution time: ".$execution_time." seconds\n";
echo "Total Sales: $".$total_sales."\n";
echo "Top Product: ".$top_product." with sales $".$product_sales[$top_product]."\n";


Enter fullscreen mode Exit fullscreen mode

Python Script



import csv
import time

# Input file name config
input_file = '../generate-csv/sales_data.csv'


def parse_csv(file_path):
    start_time = time.time()

    total_sales = 0
    product_sales = {}

    with open(file_path, mode='r') as file:
        reader = csv.DictReader(file)

        for row in reader:
            product_id = row['product_id']
            quantity = int(row['quantity'])
            price = float(row['price'])
            total = quantity * price
            total_sales += total

            if product_id not in product_sales:
                product_sales[product_id] = 0
            product_sales[product_id] += total

    top_product = max(product_sales, key=product_sales.get)
    execution_time = time.time() - start_time

    return {
        'total_sales': total_sales,
        'top_product': top_product,
        'top_product_sales': product_sales[top_product],
        'execution_time': execution_time,
    }


if __name__ == "__main__":
    result = parse_csv(input_file)
    print(f"Python Execution time: {result['execution_time']:.2f} seconds")
    print(f"Total Sales: ${result['total_sales']:.2f}")
    print(f"Top Product: {result['top_product']} with sales ${
          result['top_product_sales']:.2f}")



Enter fullscreen mode Exit fullscreen mode

Results

Here are the results of our benchmark tests:

Golang

  • Execution time: 466.69975ms
  • Total Sales: $274654985.36
  • Top Product: Product 1126 with sales $305922.81

golang result

NestJS

  • Execution time: 6.730134208 seconds
  • Total Sales: $274654985.36000216
  • Top Product: 1126 with sales $305922.8099999997

nestJS result

PHP

  • Execution time: 1.5142710208893 seconds
  • Total Sales: $274654985.36
  • Top Product: 1126 with sales $305922.81

PHP result

Python

  • Execution time: 2.56 seconds
  • Total Sales: $274654985.36
  • Top Product: 1126 with sales $305922.81

Python result

Analysis

My benchmark reveals several interesting insights:

Execution Time: Golang performed the best in terms of execution time, followed closely by PHP8, while NestJS took the longest time to complete the task.
Memory Usage: Build NestJS demonstrated efficient memory usage, while Python showed higher memory consumption.
Ease of Implementation: Golang provided the most straightforward implementation, while NestJS required more lines of code and complexity.

Conclusion

Based on my findings, Golang offers the best performance speed and memory efficiency, making it an excellent choice for handling large datasets.

Complete Code

You can get the full code on My Github repository
csv-parsing-battle.

Update Refactor Go

I refactor the part of Go from two loops to one loop for simplicity.



package main

import (
    "encoding/csv"
    "fmt"
    "os"
    "strconv"
    "time"
)

func main() {
    start := time.Now()

    file, err := os.Open("../generate-csv/sales_data.csv")
    if err != nil {
        fmt.Println("Error:", err)
        return
    }
    defer file.Close()

    reader := csv.NewReader(file)
    _, _ = reader.Read() // Skip header

    totalSales := 0.0
    productSales := make(map[string]float64)

    var topProduct string
    var topSales float64

    for {
        line, err := reader.Read()
        if err != nil {
            break
        }

        productID := line[1]
        quantity, _ := strconv.Atoi(line[2])
        price, _ := strconv.ParseFloat(line[3], 64)
        total := float64(quantity) * price

        totalSales += total
        productSales[productID] += total

        // Update topProduct and topSales in the same loop
        if productSales[productID] > topSales {
            topProduct = productID
            topSales = productSales[productID]
        }
    }

    elapsed := time.Since(start)
    fmt.Printf("Golang Execution time: %s\n", elapsed)
    fmt.Printf("Total Sales: $%.2f\n", totalSales)
    fmt.Printf("Top Product: %s with sales $%.2f\n", topProduct, topSales)
}


Enter fullscreen mode Exit fullscreen mode

The result barely any difference.

  • Execution time: 457.721917ms
  • Total Sales: $274654985.36
  • Top Product: Product 1126 with sales $305922.81 Go Refactor

Comments 24 total

  • salah eddine
    salah eddineAug 12, 2024

    Using Polars with Python is more flexible and powerful for handling large datasets and performing complex data manipulation. While Golang is powerful, it isn't designed for working with datasets like data frames. However, Golang could dominate if a framework similar to Polars is developed for it

    • Linda Sebastian
      Linda SebastianAug 13, 2024

      Thank you for your insight, in this case, I just want to try plain Python. Next time I will try Polars Python with a larger dataset.

  • Marcelloh
    MarcellohAug 12, 2024

    why the second loop in Go, because you could have looked for the topProduct in the first loop.
    (then you don't need a map, saving memory and such...)

    I think this might be faster :-)

    • Linda Sebastian
      Linda SebastianAug 13, 2024

      You are right I just realized now. Thank you for your sharp eye.

      • Marcelloh
        MarcellohAug 13, 2024

        Can you check if the outcome is a bit more in favour of Go? (and let us know)

        • Linda Sebastian
          Linda SebastianAug 13, 2024

          I made some changes you can see in my article above. Changing from two loops to one loop doesn't have any difference, I still need a map for storing variables and getting a product with the highest price. But, I think if in real-world cases which lot of logic will make a difference. Do you have any thoughts about this? Let me know...

          • Marcelloh
            MarcellohAug 13, 2024

            Linda, please have a look at this (to understand more about what I meant)
            goplay.space/#Ny0OT89_zNP
            I can't test it, since I don't have the file.

            Trick is: no map, no loop for lookup. This must be slightly better.

            • Linda Sebastian
              Linda SebastianAug 13, 2024

              I think you misunderstood the case, topProduct is the sum of all the sales not the product with the highest price. There's no way I can get the sum value of each product with an integer variable. I still need a map to collect all the sums and get the highest value of total sales. My approach is the same as storing values in an array in PHP/NodeJS or a list in Python.

              • 𒎏Wii 🏳️‍⚧️
                𒎏Wii 🏳️‍⚧️Aug 13, 2024

                You're missing the case of duplicate products.

                If a product appears more than once in the list, you need to add the values.

  • grant horwood
    grant horwoodAug 12, 2024

    just an fyi: if you're benchmarking a php script, hrtime stopwatch is a better option.

    php.net/manual/en/class.hrtime-sto...

  • Kamil Yesil
    Kamil YesilAug 13, 2024

    I check GitHub. Why nestjs run web server? Other tests direct run. It's not fair. Maybe run js only nodejs or bun.

    • Linda Sebastian
      Linda SebastianAug 14, 2024

      I thought about it as well, you are right. I will fix it later.

  • Iván Vitta
    Iván VittaAug 15, 2024

    So compiled languages are faster than interpreted languages?

    • Linda Sebastian
      Linda SebastianAug 27, 2024

      I can't say that is correct or incorrect because if we want to know about this we need to compare between all compile languages and interpreter. In this experiment I only used Go.

  • Aman Kumar
    Aman KumarAug 15, 2024

    I couldn't help but see abnormalities with results. And no justice for nodejs there.

    I cloned your repository and run tests by writing my javascript version running with both nodejs and bunjs.
    Results good.

    Processor: Ryzen 7 5800H
    RAM: 16GB DDR4

    Nodejs and Bunjs:

    PS C:\aman\csv-parsing-battle\read-csv-nodejs> node .\index.js
    Nodejs Execution Time: 0.5335701
    Total Sales: $ 274836733.6899998
    Top Product: 1067 with sales $1067
    
    PS C:\aman\csv-parsing-battle\read-csv-nodejs> bun index.js
    Nodejs Execution Time: 0.4552085
    Total Sales: $ 274836733.6899998
    Top Product: 1067 with sales $1067
    
    Enter fullscreen mode Exit fullscreen mode

    The same golang code for reference:

    PS C:\aman\csv-parsing-battle\read-csv-go> go run .\sales.go
    Golang Execution time: 245.1086ms
    Total Sales: $274836733.69
    Top Product: 1067 with sales $308326.83
    
    Enter fullscreen mode Exit fullscreen mode

    EDIT:
    Added a PR github.com/rocklinda/csv-parsing-b...
    Used node 22

  • Lucas Sproule
    Lucas SprouleAug 18, 2024

    Why use a library for Javascript if you won't use a library for python?

    • Linda Sebastian
      Linda SebastianAug 27, 2024

      The initial idea is that I really want to know if the NestJS framework is fast enough or not because I used this framework at my previous workplace. However, the nodeJS is already updated in GitHub with help from @amankrokx you can take a look.

      • Aman Kumar
        Aman KumarAug 27, 2024

        I think the huge execution time difference is not because of using nestjs but rather the streaming way of getting CSV file and parsing it. The function call with rows can also contribute to it.
        If we read the CSV all at once and then process it, I beleive it will run way within 6 seconds.

        Then again, I haven't really used nestjs so maybe some other factors might be involved.
        But overall, processing is processing, as long as compiler/interpreter produces similar machine code, it should perform similarly.

  • Steve McDougall
    Steve McDougallAug 18, 2024

    You could probably refactor the PHP code to use generators for better speed. 1 million rows is enough data to see an improvement. It's a tipping point, if it isn't enough data it's actually slower - go figure

    • Linda Sebastian
      Linda SebastianAug 27, 2024

      I've never used generators before. Yeah, I will refactor this later to see the difference.

  • ChooKing
    ChooKingAug 19, 2024

    I would like to see Golang compared to other native code. All the other languages you tested are interpreted.

    • Linda Sebastian
      Linda SebastianAug 27, 2024

      Yeah, that's a good idea. The reason why I use those languages is because of their popularity in BE technology, I don't consider between interpreter and compile.

  • Jeroen Deviaene (Jerodev)
    Jeroen Deviaene (Jerodev)Aug 19, 2024

    In Go, you are converting the quantity to an int to then convert it to a float directly after. Converting to float immediately should noticably impact performance.

Add comment