CSV Challenge
jorin

jorin @jorinvo

About: indie dev. building https://taleshape.com

Location:
chiang mai, thailand
Joined:
Feb 3, 2017

CSV Challenge

Publish Date: Nov 10 '17
18 35

You got your hands on some data that was leaked from a social network and you want to help the poor people.

Luckily you know a government service to automatically block a list of credit cards.

The service is a little old school though and you have to upload a CSV file in the exact format. The upload fails if the CSV file contains invalid data.

The CSV files should have two columns, Name and Credit Card. Also, it must be named after the following pattern:

YYYYMMDD.csv.

The leaked data doesn't have credit card details for every user and you need to pick only the affected users.

The data was published here:

data.json

You don't have much time to act.

What tools would you use to get the data, format it correctly and save it in the CSV file?


Do you have a crazy vim configuration that allows you to do all of this inside your editor? Are you a shell power user and write this as a one-liner? How would you solve this in your favorite programming language?

Show your solution in the comments below!

Comments 35 total

  • Thomas Rayner
    Thomas RaynerNov 10, 2017

    PowerShell to the rescue!

    $json = invoke-webrequest 'gist.githubusercontent.com/jorinvo...' | convertfrom-json

    $json | select name,creditcard | export-csv "$(get-date -format yyyyMMdd).csv" -NoTypeInformation

  • Tobias Salzmann
    Tobias SalzmannNov 10, 2017

    ramda-cli:

    curl -s https://gist.githubusercontent.com/jorinvo/7f19ce95a9a842956358/raw/e319340c2f6691f9cc8d8cc57ed532b5093e3619/data.json \
    | ramda 'filter where name: (complement isNil), creditcard: (complement isNil)' 'map (x) -> x.name + ", " + x.creditcard' -o raw > `date +%Y%m%d.csv`
    

    scala:

    import java.io.{BufferedWriter, FileOutputStream, OutputStreamWriter}
    import java.text.SimpleDateFormat
    import java.util.Date
    
    import io.circe.generic.auto._
    import io.circe.parser._
    
    object Data extends App {
      case class CCInfo(name: Option[String], creditcard: Option[String])
    
      val url = "https://gist.githubusercontent.com/jorinvo/7f19ce95a9a842956358/raw/e319340c2f6691f9cc8d8cc57ed532b5093e3619/data.json"
      val json = scala.io.Source.fromURL(url).mkString
    
      val infos = decode[List[CCInfo]](json).toOption.get
    
      val lines = infos.collect{case CCInfo(Some(name), Some(creditcard)) => s"$name, $creditcard"}
    
      Helper.writeToFile(lines, s"${Helper.formatDate("yyyyMMdd")}.csv")
    }
    
    object Helper {
      def writeToFile(lines: TraversableOnce[String], fileName: String): Unit = {
        val writer = new BufferedWriter(new OutputStreamWriter(new FileOutputStream(fileName)))
        for (x <- lines) {
          writer.write(x + "\n")
        }
        writer.close()
      }
    
      def formatDate(format: String, date: Date = new Date()) = 
        new SimpleDateFormat(format).format(new Date())
    } 
    
  • Florian Rohrer
    Florian RohrerNov 11, 2017

    Nice post!

    import json
    from csv import DictWriter
    
    with open("data.json", "r") as f:
        users = json.load(f)
    
    cols = ["name", "creditcard"]
    with open("20150425.csv", "w", newline='') as f:
        dw = DictWriter(f, cols)
        dw.writeheader()
        for u in users:
            if u["creditcard"]:
                dw.writerow({k: u[k] for k in cols})
    

    All users share the same date. So I didn't bother and didn't write into separate files.
    Another thing, I was going to write "Hey, that's not valid json you are giving us.", because I saw the objects are in a list and that list is not wrapped into an outer object. But my Python parser did not complain, so it turns out valid. You learn something new every day.

    • Tobias Salzmann
      Tobias SalzmannNov 11, 2017

      Seems like json can have an array at the root, even according to the first standard: tools.ietf.org/html/rfc4627, section 2

    • jorin
      jorinNov 11, 2017

      Having arrays on the top-level of JSON documents is indeed valid although it is definitely an anti-pattern. By doing so you block yourself from adding any meta information in the future.
      If you build an API, you always want to wrap an array in an object. Then you can add additional fields like possible errors or pagination later on.
      e.g.

      {
        "data": [],
        "status": "not ok",
        "error": { "code": 123, "message": "..." },
        "page": 42
      }
      
      • Tobias Salzmann
        Tobias SalzmannNov 12, 2017

        Personally, I'd prefer the array in most cases. If I call an endpoint called customers, I would expect it to return an array of customers, not something that contains such an array, might or might not have an error and so on.
        If I want to stream the response, I'd also be better off with an array, because whatever streaming library I use probably supports it.

  • Ayman Nedjmeddine
    Ayman NedjmeddineNov 11, 2017

    A oneliner if you're a linuxer 😉

    curl -sSLo- https://gist.githubusercontent.com/jorinvo/7f19ce95a9a842956358/raw/e319340c2f6691f9cc8d8cc57ed532b5093e3619/data.json \
    | jq -r '.[] | {name: .name, creditcard: .creditcard} | join(",")' \
    > `date +%Y%m%d`.csv
    

    However, there is something you have not mentioned in your post: Should the CSV file have the header line?

    If yes, then use this:

    echo 'name,creditcard' > `date +%Y%m%d`.csv && \
    curl -sSLo- https://gist.githubusercontent.com/jorinvo/7f19ce95a9a842956358/raw/e319340c2f6691f9cc8d8cc57ed532b5093e3619/data.json \
    | jq -r '.[] | {name: .name, creditcard: .creditcard} | join(",")' \
    >> `date +%Y%m%d`.csv
    
    • Richard Metzler
      Richard MetzlerNov 11, 2017

      Doesn't the second solution need a >> in the last line, so the output is appended?

      • Ayman Nedjmeddine
        Ayman NedjmeddineNov 12, 2017

        Yes, it does. (Didn't copy the correct version)

        Thanks ☺

    • Devin Weaver
      Devin WeaverNov 12, 2017

      This adds quotes.

      "Dax Brekke II,1234-2121-1221-1211"
      "Brando Stanton Jr.,1228-1221-1221-1431"
      "Lacey McDermott PhD,"
      "Elza Bauch,"
      

      Maybe adding this sed command:

      curl -sSLo- https://gist.githubusercontent.com/jorinvo/7f19ce95a9a842956358/raw/e319340c2f6691f9cc8d8cc57ed532b5093e3619/data.json \
      | jq '.[] | {name: .name, creditcard: .creditcard} | join(",")' \
      | sed -e 's/^"//' -e 's/"$//' -e 's/\\"/"/g' \
      > "$(date +%Y%m%d).csv"
      
  • Thorsten Hirsch
    Thorsten HirschNov 11, 2017

    Well, at work I would use a tool called "IBM Transformation Extender", which is specialised on data transformation. It breaks the job down into 3 tasks:

    1. create the csv output format (there's a gui for that)
    2. import some example json data in order to create the input format
    3. develop the "map" by configuring 1 as output, 2 as input, and the following "mapping rule" for the transformation:
    =f_record(EXTRACT(Record:json, PRESENT(creditcard:.:json)))
    

    ...and in f_record() one would simply drag'n'drop the name and the credit card fields from the input to the output.

    Not the cheapest solution, obviously, but its maintainability is great if you have hundreds of these mappings.

  • Jakub Karczewski
    Jakub KarczewskiNov 12, 2017

    Since I started learning Ruby this week my solution written in it :D

    require 'open-uri'
    require 'json'
    
    url = "https://gist.githubusercontent.com/jorinvo/7f19ce95a9a842956358/raw/e319340c2f6691f9cc8d8cc57ed532b5093e3619/data.json"
    data = JSON.parse(open(url).read)
    i = 0
    
    File.open(DateTime.now.strftime("%Y%m%d") + ".csv", "w") do |f|
        f.write("Name,\"Credit Card\"")
        data.each do |record|
            if record["creditcard"]
                i+=1
                name = record["name"].match(/\s/) ? "\""+ record["name"] +"\"" : record["name"]
                f.write("\n"+name+","+record["creditcard"])
            end    
        end 
    end
    
    printf("Created CSV file, %d affected accounts detected", i)
    

    Thanks for another great challenge Jorin :)

  • Jonathan Stowe
    Jonathan StoweNov 12, 2017

    Perl 6? :

    use JSON::Fast;
    
    my $json = 'data.json'.IO.slurp;
    
    my $d = Date.today;
    my $out-filename = sprintf "%04i%02i%02i.csv", $d.year, $d.month, $d.day;
    
    my $out = $out-filename.IO.open(:w);
    
    for from-json($json).list -> %row {
        if %row<creditcard> {
            $out.say: %row<name>, ',', %row<creditcard>;
        }
    }
    $out.close;
    
    

    Of course in reality you'd probably want to use Text::CSV to properly format the CSV output in order to handle quoting and escaping properly.

  • Šimon Let
    Šimon LetNov 12, 2017

    Oneliner:

    curl "https://gist.githubusercontent.com/jorinvo/7f19ce95a9a842956358/raw/e319340c2f6691f9cc8d8cc57ed532b5093e3619/data.json" 2>/dev/null | \
     jq '.[] | .name +","+ .creditcard' --raw-output > `date +"%Y%m%d.csv"`
    
  • reed1
    reed1Nov 12, 2017

    Almost all (except 2 at this time) submission writes csv by hand, not using library. The output will not be valid if a value contains , or "

    • Michael
      MichaelNov 12, 2017

      True. I have not thought of that.
      I open the csv in LibreOffice, to make sure it comes out fine, but with really big files, it might not be possible.

  • Doshirae
    DoshiraeNov 12, 2017
    require "json"
    require "open-uri"
    url = "https://gist.githubusercontent.com/jorinvo/7f19ce95a9a842956358/raw/e319340c2f6691f9cc8d8cc57ed532b5093e3619/data.json"
    data = JSON.parse(open(url).read)
    
    filtered_data = data.select { |line| not line["creditcard"].nil? }
    file = File.open(DateTime.now.strftime("%Y%m%d") + ".csv", 'w')
    file.write "Name,Creditcart\n"
    filtered_data.each do |line|
        file.write [line["name"], line["creditcard"]].join(',')
        file.write("\n")
    end
    

    Or if you guys line nasty oneliners (requre statements don't count)

    require "json"
    require "open-uri"
    File.open(DateTime.now.strftime("%Y%m%d") + ".csv", 'w') { |file| file.write "Name,Creditcard\n"; JSON.parse(open("https://gist.githubusercontent.com/jorinvo/7f19ce95a9a842956358/raw/e319340c2f6691f9cc8d8cc57ed532b5093e3619/data.json").read).select { |line| not line["creditcard"].nil? }.each { |line| file.write "#{line['name']},#{line['creditcard']}\n" } }
    

    I'm trying to do it in Elixir now :D

  • Timur Zurbaev
    Timur ZurbaevNov 12, 2017

    PHP:

    <?php
    
    $json = json_decode(file_get_contents('https://gist.githubusercontent.com/jorinvo/7f19ce95a9a842956358/raw/e319340c2f6691f9cc8d8cc57ed532b5093e3619/data.json'), true);
    
    $users = array_filter($json, function (array $item) {
        return !empty($item['name']) && !empty($item['creditcard']);
    });
    
    $file = fopen(date('Ymd').'.csv', 'w+');
    
    foreach ($users as $user) {
        fputcsv($file, [$user['name'], $user['creditcard']]);
    }
    
    fclose($file);
    
    • Michael Orji
      Michael OrjiNov 12, 2017

      You beat me to the PHP implementation. And your solution is so elegant.

  • Michael
    MichaelNov 12, 2017

    I set myself a time limit of 15 minutes, with no google. I did not know how to download using python, so i used wget or powershell. The rest is straight forward.

    #!/usr/bin/env python3
    import json
    from datetime import datetime
    import os
    from sys import platform
    
    URL = "https://gist.githubusercontent.com/jorinvo/7f19ce95a9a842956358/raw/e319340c2f6691f9cc8d8cc57ed532b5093e3619/data.json"
    
    if platform == "linux" or platform == "linux2" or platform == "darwin":
        os.system("wget -O data.json %s" % URL)
    elif platform == "win32" or platform == "win64":
        os.system("powershell Invoke-WebRequest -Uri %s -OutFile data.json" % URL)
    
    with open('data.json', 'r') as input_file:
        input_data = json.load(input_file)
    
    with open('%s.csv' % datetime.today().strftime('%Y%m%d'), 'w') as output_file:
        for victim in input_data:
            if victim['creditcard']:
                output_file.write("%s,%s\n" % (victim['name'], victim['creditcard']))
    
  • Josh Cheek
    Josh CheekNov 12, 2017

    solution

    A few things to note: cache is a program I wrote that caches command-line invocations, it's to make it cheap to iterate (e.g. so you don't have to hit the network each time) github.com/JoshCheek/dotfiles/blob...

    My shell is fish (fishshell.com) which allows multi-line editing, and the parentheses in fish are like backticks in bash, so the > (...) is redirecting the output into a file whose name is the result of the ...

  • Devin Weaver
    Devin WeaverNov 12, 2017

    A vanilla Node.JS version:

    #!/usr/bin/env node
    
    function escapeCSV(str) {
      if (str == null) { return ''; }
      return /[",]/.test(str) ? `"${str.replace(/"/g, '\\"')}"` : str;
    }
    
    let data = require('./sample.json');
    process.stdout.write('Name,Credit Card\n');
    for (let { name, creditcard } of data) {
      let line = [name, creditcard].map(escapeCSV).join(',');
      process.stdout.write(`${line}\n`);
    }
    
  • Francesco Cogno
    Francesco CognoNov 12, 2017

    Aaaand Rust :)

    Really an overkill for this task but fun nevertheless!

    extern crate chrono;
    extern crate csv;
    extern crate futures;
    extern crate hyper;
    extern crate hyper_tls;
    extern crate serde;
    #[macro_use]
    extern crate serde_derive;
    extern crate serde_json;
    extern crate tokio_core;
    
    use futures::prelude::*;
    use futures::future::ok;
    use tokio_core::reactor::Core;
    use hyper::client::Client;
    use hyper_tls::HttpsConnector;
    use chrono::{DateTime, FixedOffset};
    use std::collections::HashMap;
    use csv::Writer;
    use std::fs::File;
    
    
    #[derive(Debug, Deserialize, Clone)]
    struct Record {
        name: String,
        email: Option<String>,
        city: Option<String>,
        mac: String,
        timestamp: String,
        creditcard: Option<String>,
    }
    
    #[derive(Debug, Clone)]
    struct RecordParsed {
        record: Record,
        ts: DateTime<FixedOffset>,
    }
    
    const FORMAT: &'static str = "%Y%m%d";
    
    fn main() {
        let mut core = Core::new().unwrap();
        let client = Client::configure()
            .connector(HttpsConnector::new(4, &core.handle()).unwrap())
            .build(&core.handle());
    
        let uri = "https://gist.githubusercontent.com/jorinvo/7f19ce95a9a842956358/raw/e319340c2f6691f9cc8d8cc57ed532b5093e3619/data.json"
            .parse()
            .unwrap();
    
        let fut = client.get(uri).and_then(move |resp| {
            resp.body().concat2().and_then(move |body| {
                let array: Vec<Record> = serde_json::from_slice(&body as &[u8]).unwrap();
                let mut a_parsed: HashMap<String, Vec<RecordParsed>> = HashMap::new();
    
                array
                    .into_iter()
                    .filter(|item| item.creditcard.is_some())
                    .map(|item| {
                        let dt =
                            DateTime::parse_from_str(&item.timestamp, "%Y-%m-%d %H:%M:%S %z").unwrap();
    
                        let rp = RecordParsed {
                            record: item,
                            ts: dt,
                        };
    
                        let date_only = format!("{}.csv", rp.ts.format(FORMAT).to_string());
    
                        let ret = match a_parsed.get_mut(&date_only) {
                            Some(ar) => {
                                ar.push(rp);
                                None
                            }
                            None => {
                                let mut ar: Vec<RecordParsed> = Vec::new();
                                ar.push(rp);
                                Some(ar)
                            }
                        };
    
                        if let Some(ar) = ret {
                            a_parsed.insert(date_only, ar);
                        }
                    })
                    .collect::<()>();
    
                a_parsed
                    .iter()
                    .map(|(key, array)| {
                        println!("generating file == {:?}", key);
                        let file = File::create(key).unwrap();
                        let mut wr = Writer::from_writer(file);
    
                        array
                            .iter()
                            .map(|record| {
                                let creditcard = match record.record.creditcard {
                                    Some(ref c) => c,
                                    None => panic!("should have filtered those!"),
                                };
                                wr.write_record(&[&record.record.name, creditcard]).unwrap();
                            })
                            .collect::<()>();
                    })
                    .collect::<()>();
    
                ok(())
            })
        });
    
        core.run(fut).unwrap();
    }
    
  • Devin Weaver
    Devin WeaverNov 12, 2017

    Since the input JSON could be really large, here is a Node.JS steaming version (using stream-json package):

    #!/usr/bin/env node
    let fs = require('fs');
    let { Transform } = require('stream');
    let StreamArray = require("stream-json/utils/StreamArray");
    let stream = StreamArray.make();
    
    function escapeCSV(str) {
      if (str == null) { return ''; }
      return /[",]/.test(str) ? `"${str.replace(/"/g, '\\"')}"` : str;
    }
    
    class CsvStream extends Transform {
      constructor() {
        super({objectMode: true});
      }
      _transform(chunk, enc, cb) {
        let { name, creditcard } = chunk.value;
        let line = [name, creditcard].map(escapeCSV).join(',');
        this.push(`${line}\n`);
        cb();
      }
    }
    
    process.stdin
      .pipe(stream.input);
    
    stream.output
      .pipe(new CsvStream())
      .pipe(process.stdout);
    
  • Al
    AlNov 13, 2017

    R

    library("jsonlite")
    
    frames <- fromJSON("https://gist.githubusercontent.com/jorinvo/7f19ce95a9a842956358/raw/e319340c2f6691f9cc8d8cc57ed532b5093e3619/data.json")
    frames <- frames[!is.na(frames$creditcard),]
    frames <- frames[,c("name","creditcard")]
    
    write.csv(frames, file="20171112.csv", row.names=FALSE)
    
  • alex䷰dante
    alex䷰danteNov 13, 2017
    import csv
    import requests
    
    LEAK_URL = 'https://gist.githubusercontent.com/jorinvo/7f19ce95a9a842956358/raw/e319340c2f6691f9cc8d8cc57ed532b5093e3619/data.json'
    
    with open('20171112.csv', 'w') as csv_file:
        writer = csv.DictWriter(csv_file, ['Name', 'Credit Card'])
        writer.writeheader()
        writer.writerows(
            {'Name':x['name'],'Credit Card':x['creditcard']} for x in requests.get(LEAK_URL).json() if x['creditcard']
        )
    
  • Ryan Palo
    Ryan PaloNov 13, 2017

    Using the CSV module to avoid any quoting pitfalls. :)

    require 'CSV'
    require 'date'
    require 'JSON'
    
    data = JSON.parse(`curl #{ARGV[0]}`)
    filename = Date.today.strftime('%Y%m%d') + '.csv'
    
    CSV.open("#{filename}.csv", 'w') do |csv|
      data
        .select { |item| item['name'] && item['creditcard'] }
        .map { |item| [item['name'], item['creditcard']] }
        .sort
        .each { |item| csv << item }
    end
    
    • jorin
      jorinNov 13, 2017

      Ruby is still one of the most pretty languages!
      Maybe you can use the open(url).read from require 'open-uri' instead of curl to allow it to run on other systems 🙂

      Alernatively could look like this:

      CSV.open "#{Date.today.strftime '%Y%m%d'}.csv", 'w' do |csv|
        JSON.parse(open(ARGV[0]).read).each { |x| csv << x if x['creditcard'] }
      end
      
      • Ryan Palo
        Ryan PaloNov 13, 2017

        Oh, I like that!

        1. I didn't know about those extra options for CSV. Awesome.
        2. I didn't know about the open-uri built-in. Also awesome.
        3. I love the short and sweet each block! It even feels a little Pythonic, which is nice. Also also awesome!
  • jorin
    jorinNov 15, 2017

    Just leaving a note here for everyone that would like see more tools and solutions. Checkout the original CSV Challenge.

  • Dave Bucklin
    Dave BucklinNov 15, 2017

    awk:

    awk '
        BEGIN {FS=",";OFS=","}
        /creditcard/ {
            split($1,namearr,":")
            name = namearr[2]; gsub(/"/,"",name)
            split($5,dtarr,":")
            dt = dtarr[2]; gsub(/"| .+|-/,"",dt)
            split($6,ccarr,":")
            cc = ccarr[2]; gsub(/"|}/,"",cc)
            fname=dt ".csv"
            print name, cc >> fname
        }
    ' data.json
    
Add comment