Parsing structured strings in Java has always been painful. Most developers reach for regular expressions, split(), or manual slicing. But these techniques are error-prone, hard to read, and most importantly, unsafe at compile time.
The StringFormat class makes parsing so easy that even a beginner can implement it with a one-liner.
🧩 Regex is Not Easy
Take a common use case: parsing a file path like this:
/logs/2024/05/16/system.log
You want to extract year, month, day, and filename. Here’s what most people do:
private static final Pattern LOG_PATH = Pattern.compile(
    "/logs/(\\d{4})/(\\d{2})/(\\d{2})/(.+)\\.log"
);
// elsewhere in code:
Matcher matcher = LOG_PATH.matcher(path);
if (matcher.matches()) {
  String year = matcher.group(1);
  String month = matcher.group(2);
  String day = matcher.group(3);
  String file = matcher.group(4);
}
Some problems:
- Regex pattern readability sucks.
- Use group names and your pattern readability sucks even more.
- With pattern and extraction logic far apart, you could get the groups out of order.
- Group indices are magic numbers.
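To make the named-group point concrete, here is the same pattern rewritten with named groups, using only java.util.regex. The class name LogPathRegex and the parse helper are made up for illustration. The names do get rid of the magic indices, but the pattern grows longer, and the names are still plain strings that the compiler cannot check against the pattern.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper class, for illustration only.
public class LogPathRegex {
  private static final Pattern LOG_PATH = Pattern.compile(
      "/logs/(?<year>\\d{4})/(?<month>\\d{2})/(?<day>\\d{2})/(?<file>.+)\\.log");

  /** Returns "year-month-day:file", or null if the path does not match. */
  public static String parse(String path) {
    Matcher matcher = LOG_PATH.matcher(path);
    if (!matcher.matches()) {
      return null;
    }
    // A typo in any of these group names is only caught at runtime
    // (IllegalArgumentException), never by the compiler.
    return matcher.group("year") + "-" + matcher.group("month") + "-"
        + matcher.group("day") + ":" + matcher.group("file");
  }

  public static void main(String[] args) {
    System.out.println(parse("/logs/2024/05/16/system.log")); // 2024-05-16:system
  }
}
```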
✅ Structured Parsing with StringFormat
The same logic using StringFormat:
private static final StringFormat FORMAT =
    new StringFormat("/logs/{year}/{month}/{day}/{file}.log");
FORMAT.parseOrThrow(
    "/logs/2024/05/16/system.log",
    (year, month, day, file) ->
        new Log(parseInt(year), parseInt(month), parseInt(day), file));
That’s it. No string math, no group numbers.
🛡️ Compile-time Safety
StringFormat performs compile-time validation of the lambda:
// ❌ Error: parameter count mismatch
new StringFormat("{a}-{b}")
    .parseOrThrow("1-2", (a, b, c) -> { });
    //                   ~~~~~~~~~~~~~~~
    // Compilation error: too many parameters

// ❌ Error: parameter order mismatch
new StringFormat("{a}-{b}")
    .parseOrThrow("1-2", (b, a) -> { });
    //                   ~~~~~~~
    // Compilation error: expected order (a, b)

// ✅ Correct: order matches field declaration
new StringFormat("{a}-{b}")
    .parseOrThrow("1-2", (a, b) -> ...);
This level of safety is not possible with regex, split(), or ad-hoc parsing.
No need to “remember” group indices or keep documentation in sync with code: the compiler checks it for you.
✅ Scanning repeatedly
If the input string contains many such file paths, you can lazily scan them all:
List<Log> logs = FORMAT.scan(input, (year, month, day, file) -> ...)
    .filter(...)  // apply whatever filtering you care about
    .limit(10)    // if you want up to 10 such files
    .toList();
✅ Regex is Unpredictable
Regex engines in Java (and many other platforms) use NFA-based backtracking, which means certain patterns (especially involving nested repetitions or ambiguous alternations) can cause catastrophic slowdowns.
Even simple-looking regexes like:
href="([^"]*)"
or:
((a|b|c)+)+
can trigger catastrophic backtracking when matched against malicious input like nested quotes or repeated characters. These aren’t toy examples. Real-world systems have gone down because of regex backtracking:
- Stack Overflow 2016: a regex used to trim whitespace from the start and end of lines caused a site-wide outage when a post containing roughly 20,000 consecutive whitespace characters triggered a backtracking explosion (postmortem).
- Cloudflare 2019: a single WAF rule with a pathological regex caused CPU saturation and took down large parts of the network (incident report).
RE2, Google's safe regex engine, avoids this with DFA-based guarantees, but at the cost of expressive power (e.g., no backreferences or lookaround), and it is often slower than hand-written substring logic.
✅ StringFormat uses indexOf() calls
StringFormat avoids regex entirely. It splits the template into static fragments and uses a left-to-right scan driven by String.indexOf(...) to locate placeholder matches.
This gives several benefits:
- ✅ Deterministic linear scanning
- ✅ No backtracking
- ✅ No regex syntax or escaping
- ✅ Fast fragment matching (in practice, indexOf(...) is extremely fast for short needles)
- ✅ Safe under adversarial input (no regex ReDoS risk)
In real workloads, where fragments are small (/, :, @, etc.) and inputs are well-structured, indexOf() typically outperforms regex by a wide margin, often many times faster per match.
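The fragment-and-indexOf idea can be sketched in plain Java. To be clear, this is not Mug's actual implementation: the class name TemplateParser, its parse method, and the choice to return a list of strings are all assumptions made for illustration. It splits the template into literal fragments once, then walks the input left to right, taking the first occurrence of each fragment and never going back to retry an earlier choice.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Optional;

// Hypothetical sketch of indexOf-driven template parsing; not Mug's code.
public final class TemplateParser {
  // Literal text around the placeholders: n placeholders give n + 1 fragments.
  private final List<String> fragments = new ArrayList<>();

  public TemplateParser(String template) {
    int pos = 0;
    int open;
    while ((open = template.indexOf('{', pos)) >= 0) {
      int close = template.indexOf('}', open);
      if (close < 0) {
        throw new IllegalArgumentException("Unclosed '{' in: " + template);
      }
      fragments.add(template.substring(pos, open));
      pos = close + 1;
    }
    fragments.add(template.substring(pos));
    // "{a}{b}" leaves no literal text to search for between the two values.
    for (int i = 1; i < fragments.size() - 1; i++) {
      if (fragments.get(i).isEmpty()) {
        throw new IllegalArgumentException("Adjacent placeholders in: " + template);
      }
    }
  }

  /** Returns the placeholder values in template order, or empty if no match. */
  public Optional<List<String>> parse(String input) {
    String prefix = fragments.get(0);
    int last = fragments.size() - 1;
    if (last == 0) {
      // No placeholders: the input must equal the template text.
      return input.equals(prefix) ? Optional.of(List.of()) : Optional.empty();
    }
    if (!input.startsWith(prefix)) {
      return Optional.empty();
    }
    List<String> values = new ArrayList<>();
    int pos = prefix.length();
    // Middle fragments: the first occurrence wins, with no backtracking.
    for (int i = 1; i < last; i++) {
      String fragment = fragments.get(i);
      int found = input.indexOf(fragment, pos);
      if (found < 0) {
        return Optional.empty();
      }
      values.add(input.substring(pos, found));
      pos = found + fragment.length();
    }
    // The final fragment is anchored to the end of the input.
    String suffix = fragments.get(last);
    int end = input.length() - suffix.length();
    if (end < pos || !input.endsWith(suffix)) {
      return Optional.empty();
    }
    values.add(input.substring(pos, end));
    return Optional.of(values);
  }

  public static void main(String[] args) {
    TemplateParser parser = new TemplateParser("/logs/{year}/{month}/{day}/{file}.log");
    // Prints: Optional[[2024, 05, 16, system]]
    System.out.println(parser.parse("/logs/2024/05/16/system.log"));
  }
}
```

Each middle fragment is located by a single forward indexOf call, so the scan position only moves forward. The trade-off is that the sketch cannot express anything a literal fragment cannot, which is exactly the restriction that keeps it predictable.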
🧠 Summary
If you're building tooling that parses structured text:
- Don't use regex unless you need full pattern generality.
- Don’t trust that it will behave the same under scale or attack.
- Use simple substring matching where structure allows, like StringFormat does.
↔️ Bidirectional
While parsing is the main use case, the same format string also supports formatting:
String path = FORMAT.format(year, month, day, file);
The format string is always the source of truth. No string concatenation. No misplaced .append() chains.
Similar compile-time protection exists: you'll get a compilation error if you pass in the wrong number of args or, for example, get the file and year/month/day in the wrong order.
With traditional String.format("input size for %s is %s", user.id(), input.size()), it's not a good idea to stash away the format string as a constant because then it's easy to pass the format arguments in the wrong order.
But with StringFormat and its compile-time check, it's safe to do so, making it easier to reuse the same format string at different places.
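The hazard with a shared String.format constant is easy to reproduce in plain Java. The class, constant, and method names below are made up for illustration. Both calls compile, and because %s accepts any object, the swapped call also runs without error and quietly produces the wrong message.

```java
// Hypothetical demo class, for illustration only.
public class FormatOrderDemo {
  // A shared format string, defined far away from its call sites.
  static final String INPUT_SIZE_FORMAT = "input size for %s is %s";

  public static String correct(String userId, int inputSize) {
    return String.format(INPUT_SIZE_FORMAT, userId, inputSize);
  }

  // Arguments swapped by mistake: still compiles, still runs.
  public static String swapped(String userId, int inputSize) {
    return String.format(INPUT_SIZE_FORMAT, inputSize, userId);
  }

  public static void main(String[] args) {
    System.out.println(correct("alice", 42)); // input size for alice is 42
    System.out.println(swapped("alice", 42)); // input size for 42 is alice
  }
}
```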
format() uses direct string concatenation (+), which is generally faster than an explicit StringBuilder on Java 9 and higher.
🐍 Python's parse
Looking around, Python offers similar syntax in its parse library:
from parse import parse
result = parse("/logs/{year}/{month}/{day}/{file}.log", "/logs/2024/05/17/server.log")
print(result.named) # {'year': '2024', 'month': '05', 'day': '17', 'file': 'server'}
However, parse() offers no compile-time enforcement, and under the hood it is a wrapper around regex, so it suffers from the same NFA-based regex performance overhead and potentially disastrous backtracking.
🌍 Other Languages
Language | Closest Equivalent | Readability | Compile-Time Safety | Bidirectional | Notes |
---|---|---|---|---|---|
Java | StringFormat | ✅ High | ✅ Yes | ✅ Yes | Template-based parsing with lambda + type safety |
Python | parse | ✅ High | ❌ No | ✅ Yes | Clean syntax, runtime-only verification |
JS/TS | path-to-regexp, match() | ⚠️ Medium | ❌ No | ⚠️ Partial | URL-focused, lacks general structure matching |
Go | regexp (manual group extraction) | ❌ Low | ❌ No | ❌ No | Verbose and error-prone, no field names |
C++ | std::regex, manual parsing | ❌ Low | ❌ No | ❌ No | No built-in structure mapping; verbose |
Kotlin | Regex, manual destructuring | ⚠️ Medium | ❌ No | ❌ No | No declarative templates, no type checks |
🔍 Observations
- Python is the only language with syntax on par with StringFormat, but all verification is deferred to runtime.
- JavaScript, Go, C++, and Kotlin rely on regex or ad-hoc logic without structural templates.
- Only Java + StringFormat delivers template-based parsing with lambda field binding and compiler-checked safety.
👉 GitHub Repo: Mug
Can StringFormat support automatic type validation and conversion like %d or Python parse's :d? It seems parse can also support datetime parsing, like ti?