Introduction

Benchmarking remains a critical (and often underestimated) tool when designing or validating large-scale data platforms. While many teams rely on synthetic workloads or production replays, standardized benchmarks still play a key role when comparing architectures, tuning clusters, or qualifying infrastructure choices.

TPCx-HS is one of those benchmarks: designed to stress big data systems at scale, with a focus on throughput, storage, and distributed processing.

However, the original reference implementations were designed for older Hadoop-centric stacks, making them increasingly difficult to run in modern environments built around Apache Spark 2.x / 3.x and Kubernetes.

This is why I decided to modernize TPCx-HS:

  • Make it runnable on current Spark versions
  • Make it Kubernetes-native
  • Keep it close to the spirit of the original benchmark, without over-engineering

The result is an open-source project you can find here:
👉 https://github.com/julienlau/tpcx-hs


Why TPCx-HS Still Matters

TPCx-HS, derived from the classic TeraSort workload family (TeraGen / TeraSort / TeraValidate), is designed to evaluate:

  • Large-scale data generation
  • Distributed sort performance
  • End-to-end system throughput (CPU, network, storage)
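In miniature, the benchmark boils down to three phases: generate fixed-width records, sort them globally by key, then verify the ordering. The toy, single-process sketch below illustrates the shape of those phases using the classic TeraSort record layout (100-byte records, 10-byte keys); the real benchmark runs the equivalent steps as distributed Spark jobs over terabytes, and the function names here are illustrative, not the project's API.

```python
import random

RECORD_LEN = 100  # TeraSort-style record: 100 bytes
KEY_LEN = 10      # the first 10 bytes form the sort key

def hs_gen(n_records, seed=42):
    """Toy stand-in for the data-generation phase: n pseudo-random records."""
    rng = random.Random(seed)
    return [bytes(rng.randrange(256) for _ in range(RECORD_LEN))
            for _ in range(n_records)]

def hs_sort(records):
    """Toy stand-in for the sort phase: order records by their 10-byte key."""
    return sorted(records, key=lambda r: r[:KEY_LEN])

def hs_validate(records):
    """Toy stand-in for the validation phase: check global key ordering."""
    return all(records[i][:KEY_LEN] <= records[i + 1][:KEY_LEN]
               for i in range(len(records) - 1))

assert hs_validate(hs_sort(hs_gen(1_000)))
```

At cluster scale, the sort phase is where shuffle, network, and storage throughput dominate, which is why the benchmark is such an effective stress test.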

In practice, it’s a great stress test for:

  • Object storage vs HDFS
  • Shuffle performance
  • Executor sizing and memory pressure
  • Scheduler behavior under load

Yet, many teams avoid it today because:

  • Legacy scripts assume YARN
  • Spark versions are outdated
  • Kubernetes is not supported out of the box

That gap is exactly what this project addresses.


What Was Changed

1. Spark 2.x and 3.x Compatibility

The codebase was refactored to:

  • Run on Spark 2.x and Spark 3.x
  • Avoid deprecated APIs
  • Use configuration patterns that work across versions

This makes it usable on:

  • On-prem clusters
  • Managed cloud Spark services
  • Spark-on-Kubernetes distributions
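A version-portable configuration pattern means sticking to properties that behave the same across releases. As an illustration, a spark-defaults fragment like the one below uses only property names that exist in both Spark 2.x and 3.x; the values are examples, not the project's defaults.

```
# spark-defaults.conf -- illustrative values only
spark.serializer                 org.apache.spark.serializer.KryoSerializer
spark.executor.cores             4
spark.executor.memory            8g
spark.dynamicAllocation.enabled  false
spark.sql.shuffle.partitions     400
```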

2. Kubernetes-First Design

Running Spark on Kubernetes introduces very different constraints compared to YARN:

  • No shared filesystem by default
  • Ephemeral executors
  • Externalized storage (S3, MinIO, etc.)
  • Explicit resource isolation

The project was adapted to:

  • Work cleanly with Spark on Kubernetes
  • Be compatible with object storage backends
  • Avoid assumptions tied to HDFS or static nodes

This makes it suitable for modern cloud-native data platforms.
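Under these constraints, a run is launched with plain spark-submit against the Kubernetes API server, with object storage configured through the S3A connector. A minimal submission sketch might look like the following; the API server address, container image, namespace, jar path, main class, endpoint, and arguments are all placeholders to adapt to your cluster, not values taken from the project.

```shell
spark-submit \
  --master k8s://https://<api-server>:6443 \
  --deploy-mode cluster \
  --name tpcx-hs \
  --conf spark.kubernetes.namespace=benchmarks \
  --conf spark.kubernetes.container.image=<your-spark-image> \
  --conf spark.executor.instances=8 \
  --conf spark.hadoop.fs.s3a.endpoint=http://minio.example:9000 \
  --conf spark.hadoop.fs.s3a.path.style.access=true \
  --class <main-class> \
  local:///opt/spark/jars/<tpcx-hs-jar> <args>
```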


3. Minimalism Over Reinvention

This is not a full benchmarking framework.

Deliberate choices:

  • No custom orchestration layer
  • No vendor-specific tuning
  • No attempt to “optimize the benchmark itself”

The goal is simple:

Provide a clean, runnable, understandable TPCx-HS implementation for modern Spark platforms.

You are expected to tune your cluster, not the benchmark.


Typical Use Cases

This implementation is useful if you want to:

  • Compare Spark-on-K8s vs legacy YARN
  • Stress test object storage throughput
  • Validate executor sizing and shuffle behavior
  • Reproduce performance issues under controlled load
  • Have a known baseline before production rollout

It is intentionally infrastructure-agnostic.


What This Is Not

To set expectations clearly:

  • ❌ Not a certified TPC submission
  • ❌ Not a vendor benchmark
  • ❌ Not an auto-tuning solution

It is a practical engineering tool, not a marketing benchmark.


Who This Is For

This project will be most useful to:

  • Data platform engineers
  • SREs working on Spark infrastructure
  • Architects migrating from Hadoop/YARN to Kubernetes
  • Anyone who needs realistic, heavy Spark workloads without production data

If you’ve ever asked:

“Can my Spark platform really handle this scale?”

This benchmark helps you answer it.

Repository

📦 Source code and documentation:
👉 https://github.com/julienlau/tpcx-hs

Contributions, feedback, and performance reports are welcome.

