Introduction

Benchmarking remains a critical (and often underestimated) tool when designing or validating large-scale data platforms. While many teams rely on synthetic workloads or production replays, standardized benchmarks still play a key role when comparing architectures, tuning clusters, or qualifying infrastructure choices.

TPCx-HS is one of those benchmarks: designed to stress big data systems at scale, with a focus on throughput, storage, and distributed processing.

However, the original reference implementations were designed for older Hadoop-centric stacks, making them increasingly difficult to run in modern environments built around Apache Spark 2.x / 3.x and Kubernetes.

This is why I decided to modernize TPCx-HS:

  • Make it runnable on current Spark versions
  • Make it Kubernetes-native
  • Keep it close to the spirit of the original benchmark, without over-engineering

The result is an open-source project you can find here:
👉 https://github.com/julienlau/tpcx-hs


Why TPCx-HS Still Matters

TPCx-HS, derived from the classic TeraSort workload family (TeraGen / TeraSort / TeraValidate), is designed to evaluate:

  • Large-scale data generation
  • Distributed sort performance
  • End-to-end system throughput (CPU, network, storage)
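In miniature, the benchmark boils down to three phases: generate fixed-width records, sort them globally by key, then verify the ordering. The toy, single-process sketch below illustrates the shape of those phases using the classic TeraSort record layout (100-byte records, 10-byte keys); the real benchmark runs the equivalent steps as distributed Spark jobs over terabytes, and the function names here are illustrative, not the project's API.

```python
import random

RECORD_LEN = 100  # TeraSort-style record: 100 bytes
KEY_LEN = 10      # the first 10 bytes form the sort key

def hs_gen(n_records, seed=42):
    """Toy stand-in for the data-generation phase: n pseudo-random records."""
    rng = random.Random(seed)
    return [bytes(rng.randrange(256) for _ in range(RECORD_LEN))
            for _ in range(n_records)]

def hs_sort(records):
    """Toy stand-in for the sort phase: order records by their 10-byte key."""
    return sorted(records, key=lambda r: r[:KEY_LEN])

def hs_validate(records):
    """Toy stand-in for the validation phase: check global key ordering."""
    return all(records[i][:KEY_LEN] <= records[i + 1][:KEY_LEN]
               for i in range(len(records) - 1))

assert hs_validate(hs_sort(hs_gen(1_000)))
```

At cluster scale, the sort phase is where shuffle, network, and storage throughput dominate, which is why the benchmark is such an effective stress test.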

In practice, it’s a great stress test for:

  • Object storage vs HDFS
  • Shuffle performance
  • Executor sizing and memory pressure
  • Scheduler behavior under load

Yet, many teams avoid it today because:

  • Legacy scripts assume YARN
  • Spark versions are outdated
  • Kubernetes is not supported out of the box

That gap is exactly what this project addresses.


What Was Changed

1. Spark 2.x and 3.x Compatibility

The codebase was refactored to:

  • Run on Spark 2.x and Spark 3.x
  • Avoid deprecated APIs
  • Use configuration patterns that work across versions

This makes it usable on:

  • On-prem clusters
  • Managed cloud Spark services
  • Spark-on-Kubernetes distributions
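A version-portable configuration pattern means sticking to properties that behave the same across releases. As an illustration, a spark-defaults fragment like the one below uses only property names that exist in both Spark 2.x and 3.x; the values are examples, not the project's defaults.

```
# spark-defaults.conf -- illustrative values only
spark.serializer                 org.apache.spark.serializer.KryoSerializer
spark.executor.cores             4
spark.executor.memory            8g
spark.dynamicAllocation.enabled  false
spark.sql.shuffle.partitions     400
```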

2. Kubernetes-First Design

Running Spark on Kubernetes introduces very different constraints compared to YARN:

  • No shared filesystem by default
  • Ephemeral executors
  • Externalized storage (S3, MinIO, etc.)
  • Explicit resource isolation

The project was adapted to:

  • Work cleanly with Spark on Kubernetes
  • Be compatible with object storage backends
  • Avoid assumptions tied to HDFS or static nodes

This makes it suitable for modern cloud-native data platforms.
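Under these constraints, a run is launched with plain spark-submit against the Kubernetes API server, with object storage configured through the S3A connector. A minimal submission sketch might look like the following; the API server address, container image, namespace, jar path, main class, endpoint, and arguments are all placeholders to adapt to your cluster, not values taken from the project.

```shell
spark-submit \
  --master k8s://https://<api-server>:6443 \
  --deploy-mode cluster \
  --name tpcx-hs \
  --conf spark.kubernetes.namespace=benchmarks \
  --conf spark.kubernetes.container.image=<your-spark-image> \
  --conf spark.executor.instances=8 \
  --conf spark.hadoop.fs.s3a.endpoint=http://minio.example:9000 \
  --conf spark.hadoop.fs.s3a.path.style.access=true \
  --class <main-class> \
  local:///opt/spark/jars/<tpcx-hs-jar> <args>
```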


3. Minimalism Over Reinvention

This is not a full benchmarking framework.

Deliberate choices:

  • No custom orchestration layer
  • No vendor-specific tuning
  • No attempt to “optimize the benchmark itself”

The goal is simple:

Provide a clean, runnable, understandable TPCx-HS implementation for modern Spark platforms.

You are expected to tune your cluster, not the benchmark.


Typical Use Cases

This implementation is useful if you want to:

  • Compare Spark-on-K8s vs legacy YARN
  • Stress test object storage throughput
  • Validate executor sizing and shuffle behavior
  • Reproduce performance issues under controlled load
  • Have a known baseline before production rollout

It is intentionally infrastructure-agnostic.


What This Is Not

To set expectations clearly:

  • ❌ Not a certified TPC submission
  • ❌ Not a vendor benchmark
  • ❌ Not an auto-tuning solution

It is a practical engineering tool, not a marketing benchmark.


Who This Is For

This project will be most useful to:

  • Data platform engineers
  • SREs working on Spark infrastructure
  • Architects migrating from Hadoop/YARN to Kubernetes
  • Anyone who needs realistic, heavy Spark workloads without production data

If you’ve ever asked:

“Can my Spark platform really handle this scale?”

This benchmark helps you answer it.

Repository

📦 Source code and documentation:
👉 https://github.com/julienlau/tpcx-hs

Contributions, feedback, and performance reports are welcome.

