Setting PySpark up on Windows[^1] for local runs is relatively easy… when you know the supplementary steps and hidden requirements necessary to make it work.
As one of the dependencies is winutils[^2], the Spark version that can run on Windows depends on the maintainer’s release of `winutils.exe` and `hadoop.dll`.
For example, as of 10 Nov 2023, because the latest winutils version is 3.3.5, Windows users can only use Spark 3.3.x, i.e., Spark’s MAJOR.MINOR version must match winutils’ MAJOR.MINOR version.
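As a quick sanity check once `pyspark` is installed (the steps follow below), here’s a minimal sketch that pulls out the MAJOR.MINOR part you need to match:

```python
import pyspark

# Only the first two components (MAJOR.MINOR) must match the winutils
# release, e.g. "3.3" for winutils 3.3.5.
major, minor = pyspark.__version__.split(".")[:2]
print(f"PySpark MAJOR.MINOR: {major}.{minor}")
```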
‘Nuff said, the steps for setting up PySpark on Windows are as follows:
- Download `winutils.exe` and `hadoop.dll` from the winutils repo.
- Download Spark that matches winutils’ (MAJOR.MINOR) version.
- Unzip Spark and copy `winutils.exe` and `hadoop.dll` to Spark’s `bin` directory.
- Set the environment variables `SPARK_HOME` and `HADOOP_HOME` to that Spark directory (or set them per-process, as in the sketch after this list).
- Download and unzip the Java Development Kit (JDK).
- Set the environment variable `JAVA_HOME` to that JDK directory.
- Add `%SPARK_HOME%\bin` and `%JAVA_HOME%\bin` to the `PATH` environment variable.
- Create a virtualenv/venv[^3] and pip install `pyspark` (again, the (MAJOR.MINOR) version must match winutils’).
- Navigate to the `Scripts` directory of the newly created virtualenv/venv and copy `python.exe` as `python3.exe`.
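If everything is wired up, a bare-bones smoke test should run end to end. Here’s a minimal sketch; the `C:\...` paths are placeholders for wherever you unzipped Spark and the JDK, and the variables can just as well be set system-wide as in the steps above:

```python
import os

# Placeholder paths -- point these at your own Spark and JDK directories.
os.environ["SPARK_HOME"] = r"C:\spark-3.3.2-bin-hadoop3"
os.environ["HADOOP_HOME"] = r"C:\spark-3.3.2-bin-hadoop3"  # winutils.exe/hadoop.dll sit in its bin
os.environ["JAVA_HOME"] = r"C:\jdk-17.0.2"
os.environ["PATH"] = rf"{os.environ['SPARK_HOME']}\bin;{os.environ['JAVA_HOME']}\bin;" + os.environ["PATH"]

from pyspark.sql import SparkSession

# local[*] runs Spark inside a local JVM using all available cores.
spark = SparkSession.builder.master("local[*]").appName("smoke-test").getOrCreate()
print(spark.range(5).count())  # expect: 5
spark.stop()
```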
PySpark should work now; or does it? Nope, two caveats as of Nov 2023:
- The latest Python version PySpark supports is 3.10, i.e., 3.11 and 3.12 will error out for certain operations, e.g., `.createDataFrame()` (see the sketch after this list).
- The latest JDK Spark supports is 17.0, i.e., newer versions like the latest Long Term Support (LTS) JDK may cause issues.
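For instance, the following sketch is the sort of thing that works fine on Python 3.10 but can error out on 3.11/3.12 (the toy data is made up):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.master("local[*]").getOrCreate()

# On Python <= 3.10 this prints a two-row DataFrame; on 3.11/3.12 the
# .createDataFrame() call is one of the operations that can error out.
df = spark.createDataFrame([(1, "a"), (2, "b")], schema=["id", "label"])
df.show()
spark.stop()
```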
[^1]: Yes, Windows. Don’t judge. Most companies in Asia still issue Windows laptops.
[^2]: Another way is to use Hadoop Bare Naked Local FS with some (non-trivial) code to get it running.
[^3]: You develop with virtualenv/venv, don’t you? Do you?!