Setting PySpark up on Windows[^1] for local runs is relatively easy… when you know the supplementary steps and hidden requirements necessary to make it work.
As one of the dependencies is winutils[^2], the Spark version that can run on Windows depends on the maintainer’s release of `winutils.exe` and `hadoop.dll`.
For example, as of 10 Nov 2023, because the latest winutils version is 3.3.5, Windows users can only use Spark 3.3.x, i.e., Spark’s MAJOR.MINOR version must match winutils’ MAJOR.MINOR version.
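As a quick sanity check once `pyspark` is installed (the steps follow below), here’s a minimal sketch that pulls out the MAJOR.MINOR part you need to match:

```python
import pyspark

# Only the first two components (MAJOR.MINOR) must match the winutils
# release, e.g. "3.3" for winutils 3.3.5.
major, minor = pyspark.__version__.split(".")[:2]
print(f"PySpark MAJOR.MINOR: {major}.{minor}")
```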
‘Nuff said, the steps for setting up PySpark on Windows are as follows:
- Download `winutils.exe` and `hadoop.dll` from the winutils repo.
- Download Spark that matches winutils’ (MAJOR.MINOR) version.
- Unzip Spark and copy `winutils.exe` and `hadoop.dll` to Spark’s `bin` directory.
- Set the environment variables `SPARK_HOME` and `HADOOP_HOME` to that Spark directory (or set them per-process, as in the sketch after this list).
- Download and unzip the Java Development Kit (JDK).
- Set the environment variable `JAVA_HOME` to that JDK directory.
- Add `%SPARK_HOME%\bin` and `%JAVA_HOME%\bin` to the `PATH` environment variable.
- Create a virtualenv/venv[^3] and pip install `pyspark` (again, the (MAJOR.MINOR) version must match winutils’).
- Navigate to the `Scripts` directory of the newly created virtualenv/venv and copy `python.exe` as `python3.exe`.
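If everything is wired up, a bare-bones smoke test should run end to end. Here’s a minimal sketch; the `C:\...` paths are placeholders for wherever you unzipped Spark and the JDK, and the variables can just as well be set system-wide as in the steps above:

```python
import os

# Placeholder paths -- point these at your own Spark and JDK directories.
os.environ["SPARK_HOME"] = r"C:\spark-3.3.2-bin-hadoop3"
os.environ["HADOOP_HOME"] = r"C:\spark-3.3.2-bin-hadoop3"  # winutils.exe/hadoop.dll sit in its bin
os.environ["JAVA_HOME"] = r"C:\jdk-17.0.2"
os.environ["PATH"] = rf"{os.environ['SPARK_HOME']}\bin;{os.environ['JAVA_HOME']}\bin;" + os.environ["PATH"]

from pyspark.sql import SparkSession

# local[*] runs Spark inside a local JVM using all available cores.
spark = SparkSession.builder.master("local[*]").appName("smoke-test").getOrCreate()
print(spark.range(5).count())  # expect: 5
spark.stop()
```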
PySpark should work now; or does it? Nope, two caveats as of Nov 2023:
- The latest Python version PySpark supports is 3.10, i.e., 3.11 and 3.12 will error out for certain operations, e.g., `.createDataFrame()` (see the sketch after this list).
- The latest JDK Spark supports is 17.0, i.e., newer versions like the latest Long Term Support (LTS) JDK may cause issues.
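For instance, the following sketch is the sort of thing that works fine on Python 3.10 but can error out on 3.11/3.12 (the toy data is made up):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.master("local[*]").getOrCreate()

# On Python <= 3.10 this prints a two-row DataFrame; on 3.11/3.12 the
# .createDataFrame() call is one of the operations that can error out.
df = spark.createDataFrame([(1, "a"), (2, "b")], schema=["id", "label"])
df.show()
spark.stop()
```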
[^1]: Yes, Windows. Don’t judge. Most companies in Asia still issue Windows laptops.
[^2]: Another way is to use Hadoop Bare Naked Local FS with some (non-trivial) code to get it running.
[^3]: You develop with virtualenv/venv, don’t you? Do you?!