Lateral Flattening of JSON Data in Python

Introduction

Lateral flattening is the process of converting nested or hierarchical JSON structures into flat, tabular formats. Python’s pandas library provides powerful tools for this task, particularly the json_normalize function. This article explores how to use Python to explode arrays and flatten nested JSON data for use in analytics, databases, or machine learning pipelines.

JSON’s nested structure is ideal for web APIs and configuration files but becomes cumbersome for:

Relational databases (e.g., PostgreSQL, MySQL).
Tabular analysis tools (e.g., pandas, Excel).
Machine learning models (most require 2D input).

Flattening resolves nested keys and explodes arrays into rows or columns, enabling compatibility with these systems.

Python Implementation with `pandas`

Key Function: `json_normalize`

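The pandas.json_normalize function flattens nested objects into columns and, when given a record_path, explodes an array into rows. Its parameters include:

data: The parsed JSON input (dict or list of dicts).
record_path: The key (or list of keys) leading to the array to explode.
meta: Fields to repeat on every exploded row as metadata (e.g., id, name).
record_prefix: A string prepended to the column names produced from the exploded records (e.g., contactId_).
meta_prefix: A string prepended to the metadata column names (e.g., user_).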

Example 1: Exploding a Simple Array

Input JSON:

{
  "id": 1,
  "name": "John Doe",
  "contactIds": [1, 2, 3, 4]
}

Python Code:

import pandas as pd

data = {
    "id": 1,
    "name": "John Doe",
    "contactIds": [1, 2, 3, 4]
}

# Explode the "contactIds" array into rows
df = pd.json_normalize(
    data,
    record_path="contactIds",  # Array to explode
    meta=["id", "name"],       # Fields to retain
    record_prefix="contactId_" # Optional: prefix for the exploded column names
)

print(df)

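Output (the exploded scalars arrive in a column named 0, so the prefix produces contactId_0; the meta columns follow the record column):

   contactId_0 id      name
0            1  1  John Doe
1            2  1  John Doe
2            3  1  John Doe
3            4  1  John Doe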
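
If the goal is a single column that keeps the original key name, DataFrame.explode is a possible alternative to record_path for arrays of plain values. The following is a minimal sketch that reuses the record from Example 1; the variable names are illustrative only.

```python
import pandas as pd

data = {"id": 1, "name": "John Doe", "contactIds": [1, 2, 3, 4]}

# Wrap the dict in a list so the whole array sits in one cell,
# then emit one row per array element with a fresh 0..n-1 index.
df = pd.DataFrame([data]).explode("contactIds", ignore_index=True)

print(df.columns.tolist())  # ['id', 'name', 'contactIds']
print(len(df))              # 4
```

The exploded column has object dtype, so a cast such as astype(int) may be needed before numeric work.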

Example 2: Handling Nested Objects

For JSON with nested objects (e.g., address.street), json_normalize automatically concatenates keys:

Input JSON:

{
  "id": 1,
  "name": "John Doe",
  "address": {
    "street": "123 Main St",
    "city": "Anytown"
  }
}

Python Code:

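data = {"id": 1, "name": "John Doe", "address": {"street": "123 Main St", "city": "Anytown"}}

df = pd.json_normalize(data)  # Nested keys are joined with "."
print(df)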

Output:

   id      name address.street address.city
0   1  John Doe    123 Main St      Anytown

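To replace the dots with underscores, pass regex=False so that "." is treated as a literal character rather than a regular-expression wildcard (the default differs between pandas versions):

df.columns = df.columns.str.replace(".", "_", regex=False)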
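
The separator can also be chosen at flattening time through the sep parameter of json_normalize, which avoids a separate rename step. A short sketch, using the record from Example 2:

```python
import pandas as pd

data = {
    "id": 1,
    "name": "John Doe",
    "address": {"street": "123 Main St", "city": "Anytown"},
}

# sep replaces the default "." used to join nested key names.
df = pd.json_normalize(data, sep="_")

print(df.columns.tolist())
```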

Example 3: Complex Nesting (Arrays of Objects)

For arrays containing nested objects, json_normalize combines key concatenation and array explosion:

Input JSON:

{
  "id": 1,
  "orders": [
    {"item": "A", "price": 10},
    {"item": "B", "price": 20}
  ]
}

Python Code:

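data = {"id": 1, "orders": [{"item": "A", "price": 10}, {"item": "B", "price": 20}]}

df = pd.json_normalize(
    data,
    record_path="orders",  # Explode the "orders" array
    meta=["id"],           # Keep "id" as metadata
    meta_prefix="user_"    # Name the metadata column "user_id"
)

print(df)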

Output:

  item  price user_id
0    A     10       1
1    B     20       1

Advanced Customization

Handling Missing Data

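The errors parameter controls what happens when a key listed in meta is missing from some records: "raise" (the default) raises a KeyError, while "ignore" fills the missing value with NaN. Keys missing from the exploded records themselves become NaN regardless of this setting.

pd.json_normalize(data, record_path="orders", meta=["id", "name"], errors="ignore")  # A missing "name" becomes NaN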
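
To make the difference between the two settings concrete, here is a small sketch with hypothetical records in which the second one lacks a "name" key:

```python
import pandas as pd

# Hypothetical records: the second one has no "name" key.
data = [
    {"id": 1, "name": "John Doe", "orders": [{"item": "A", "price": 10}]},
    {"id": 2, "orders": [{"item": "B", "price": 20}]},
]

# Default behaviour (errors="raise"): a missing meta key raises KeyError.
try:
    pd.json_normalize(data, record_path="orders", meta=["id", "name"])
    raised = False
except KeyError:
    raised = True

# errors="ignore": the missing meta value is filled with NaN instead.
df = pd.json_normalize(
    data, record_path="orders", meta=["id", "name"], errors="ignore"
)

print(raised)
print(df)
```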

Flattening Multiple Levels

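json_normalize flattens nested objects at any depth on its own; nested arrays are the harder case. For an array inside an array, pass record_path as a list of keys (e.g., ["orders", "items"]). For more irregular structures, combine json_normalize with DataFrame.explode or a custom recursive function.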
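
As an illustration of reaching an array nested inside another array, the sketch below uses a hypothetical record in which each order carries its own items list. The field names (order_id, items, sku, qty) are assumptions made for the example.

```python
import pandas as pd

# Hypothetical record: an array ("items") nested inside an array ("orders").
data = {
    "id": 1,
    "orders": [
        {"order_id": 100, "items": [{"sku": "A", "qty": 1}, {"sku": "B", "qty": 2}]},
        {"order_id": 101, "items": [{"sku": "C", "qty": 5}]},
    ],
}

df = pd.json_normalize(
    data,
    record_path=["orders", "items"],      # walk orders, then explode items
    meta=["id", ["orders", "order_id"]],  # a nested meta key is given as a path
)

print(df)
```

Each row is one item, with the top-level id and the enclosing order's order_id repeated alongside it (the latter in a column named orders.order_id).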

Alternatives to `pandas`

flatdict Library: A lightweight third-party package that flattens nested dicts into single-level, delimiter-joined keys without requiring pandas.
Manual Recursion: Custom Python functions for edge cases.
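
For the manual route, a recursive helper can be quite small. The function below is a hypothetical sketch, not part of any library: it joins nested keys with a separator and turns array positions into key segments, which gives wide columns rather than extra rows.

```python
def flatten(value, prefix="", sep="."):
    """Flatten nested dicts and lists into a single-level dict.

    Hypothetical helper for illustration. List elements are keyed by
    their index; empty dicts and lists produce no keys at all.
    """
    if isinstance(value, dict):
        items = value.items()
    elif isinstance(value, list):
        items = enumerate(value)
    else:
        # Leaf value: store it under the key path built so far.
        return {prefix: value}

    flat = {}
    for key, child in items:
        child_key = f"{prefix}{sep}{key}" if prefix else str(key)
        flat.update(flatten(child, child_key, sep))
    return flat


record = {"id": 1, "address": {"city": "Anytown"}, "contactIds": [1, 2]}
flat_record = flatten(record)
print(flat_record)
# {'id': 1, 'address.city': 'Anytown', 'contactIds.0': 1, 'contactIds.1': 2}
```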

When to Avoid Flattening

Preserving Hierarchy: Nested JSON is more efficient for tree-like data (e.g., organizational charts).
APIs: Clients often expect nested responses.

Footnotes

Use max_level in json_normalize to control flattening depth (e.g., max_level=2).
Flattened JSON may increase storage size due to duplicated metadata.
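
The effect of max_level mentioned in the first footnote can be seen by comparing a full flatten with a depth-limited one. A minimal sketch with a hypothetical three-level record:

```python
import pandas as pd

# Hypothetical record with three levels of nesting.
data = {"id": 1, "user": {"name": "John Doe", "address": {"city": "Anytown"}}}

full = pd.json_normalize(data)                  # flatten every level
shallow = pd.json_normalize(data, max_level=1)  # stop after one level

print(full.columns.tolist())
print(shallow.columns.tolist())
```

With max_level=1 the address object is left as a dict in a single user.address column instead of being split into user.address.city.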

dotnet dev @dotnetdev