
Introduction
Benchmarking remains a critical (and often underestimated) tool when designing or validating large-scale data platforms. While many teams rely on synthetic workloads or production replays, standardized benchmarks still play a key role when comparing architectures, tuning clusters, or validating infrastructure choices.
TPCx-HS is one of those benchmarks: designed to stress big data systems at scale, with a focus on throughput, storage, and distributed processing.
However, the original reference implementations were designed for older Hadoop-centric stacks, making them increasingly difficult to run in modern environments built around Apache Spark 2.x / 3.x and Kubernetes.
This is why I decided to modernize TPCx-HS:
- Make it runnable on current Spark versions
- Make it Kubernetes-native
- Keep it close to the spirit of the original benchmark, without over-engineering
The result is an open-source project you can find here:
👉 https://github.com/julienlau/tpcx-hs
Why TPCx-HS Still Matters
TPCx-HS (a standardized big data benchmark derived from the classic TeraSort workload) is designed to evaluate:
- Large-scale data generation
- Distributed sort performance
- End-to-end system throughput (CPU, network, storage)
In practice, it’s a great stress test for:
- Object storage vs HDFS
- Shuffle performance
- Executor sizing and memory pressure
- Scheduler behavior under load
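Concretely, a TPCx-HS run is a pipeline of phases: generate a dataset, sort it end to end, then validate the output. As a hedged sketch only, a run might look like the following; the class names, jar name, paths, and scale below are illustrative assumptions, so check the repository's documentation for the actual entry points and parameters:

```shell
# Illustrative TPCx-HS phases. Class names, jar name, bucket paths,
# and scale are ASSUMPTIONS -- see the repo README for real values.

SCALE_ROWS=10000000              # ~1 GB at 100 bytes/row; real runs target TBs
JAR=tpcx-hs.jar                  # hypothetical artifact name
DATA=s3a://benchmarks/tpcx-hs    # hypothetical bucket

# Phase 1: data generation -- stresses write throughput
spark-submit --class HSGen "$JAR" "$SCALE_ROWS" "$DATA/input"

# Phase 2: distributed sort -- stresses shuffle, CPU, and network
spark-submit --class HSSort "$JAR" "$DATA/input" "$DATA/output"

# Phase 3: validation -- checks ordering and completeness of the output
spark-submit --class HSValidate "$JAR" "$DATA/output" "$DATA/validate"
```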
Yet, many teams avoid it today because:
- Legacy scripts assume YARN
- Spark versions are outdated
- Kubernetes is not supported out of the box
That gap is exactly what this project addresses.
What Was Changed
1. Spark 2.x and 3.x Compatibility
The codebase was refactored to:
- Run on Spark 2.x and Spark 3.x
- Avoid deprecated APIs
- Use configuration patterns that work across versions
This makes it usable on:
- On-prem clusters
- Cloud managed Spark
- Spark-on-Kubernetes distributions
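One pattern that travels well across Spark 2.x and 3.x is keeping all tuning in `spark-submit --conf` flags rather than hard-coding it, so the same jar runs unmodified on either version. A minimal sketch, where the entry-point class, jar name, paths, and values are placeholders rather than recommendations:

```shell
# Version-agnostic submission: tuning lives in --conf flags, not in
# code, so the same jar runs on Spark 2.x and 3.x unchanged.
# Class name, jar, paths, and values are placeholders.
spark-submit \
  --class HSSort \
  --conf spark.serializer=org.apache.spark.serializer.KryoSerializer \
  --conf spark.sql.shuffle.partitions=2000 \
  --conf spark.executor.memory=8g \
  --conf spark.executor.cores=4 \
  tpcx-hs.jar s3a://benchmarks/input s3a://benchmarks/output
```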
2. Kubernetes-First Design
Running Spark on Kubernetes introduces very different constraints compared to YARN:
- No shared filesystem by default
- Ephemeral executors
- Externalized storage (S3, MinIO, etc.)
- Explicit resource isolation
The project was adapted to:
- Work cleanly with Spark on Kubernetes
- Be compatible with object storage backends
- Avoid assumptions tied to HDFS or static nodes
This makes it suitable for modern cloud-native data platforms.
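To make those constraints concrete, here is a hedged sketch of what a Spark-on-Kubernetes submission against an S3-compatible backend (e.g. MinIO) can look like. The API server address, namespace, container image, service account, endpoint, jar path, and entry-point class are all placeholders to adapt to your cluster:

```shell
# Sketch of Spark on Kubernetes with an S3-compatible object store.
# Every name below (API server, namespace, image, service account,
# endpoint, class, jar path) is a placeholder, not a prescription.
spark-submit \
  --master k8s://https://my-apiserver:6443 \
  --deploy-mode cluster \
  --class HSSort \
  --conf spark.kubernetes.namespace=benchmarks \
  --conf spark.kubernetes.container.image=my-registry/spark-tpcx-hs:latest \
  --conf spark.kubernetes.authenticate.driver.serviceAccountName=spark \
  --conf spark.executor.instances=16 \
  --conf spark.hadoop.fs.s3a.endpoint=http://minio.benchmarks.svc:9000 \
  --conf spark.hadoop.fs.s3a.path.style.access=true \
  --conf spark.hadoop.fs.s3a.access.key="$S3_ACCESS_KEY" \
  --conf spark.hadoop.fs.s3a.secret.key="$S3_SECRET_KEY" \
  local:///opt/tpcx-hs.jar \
  s3a://benchmarks/input s3a://benchmarks/output
```

Note that the jar is addressed with `local://`, i.e. baked into the container image, since executors are ephemeral and there is no shared filesystem to stage it on.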
3. Minimalism Over Reinvention
This is not a full benchmarking framework.
Deliberate choices:
- No custom orchestration layer
- No vendor-specific tuning
- No attempt to “optimize the benchmark itself”
The goal is simple:
Provide a clean, runnable, understandable TPCx-HS implementation for modern Spark platforms.
You are expected to tune your cluster, not the benchmark.
Typical Use Cases
This implementation is useful if you want to:
- Compare Spark-on-K8s vs legacy YARN
- Stress test object storage throughput
- Validate executor sizing and shuffle behavior
- Reproduce performance issues under controlled load
- Establish a known baseline before a production rollout
It is intentionally infrastructure-agnostic.
What This Is Not
To set expectations clearly:
- ❌ Not a certified TPC submission
- ❌ Not a vendor benchmark
- ❌ Not an auto-tuning solution
It is a practical engineering tool, not a marketing benchmark.
Who This Is For
This project will be most useful to:
- Data platform engineers
- SREs working on Spark infrastructure
- Architects migrating from Hadoop/YARN to Kubernetes
- Anyone who needs realistic, heavy Spark workloads without production data
If you’ve ever asked:
“Can my Spark platform really handle this scale?”
this benchmark helps you answer that question.
Repository
📦 Source code and documentation:
👉 https://github.com/julienlau/tpcx-hs
Contributions, feedback, and performance reports are welcome.