Existing approaches and drawbacks
All alternative approaches – like Conda, packed virtual environments, PEX, and cluster-pack – require the packages to be installed locally first. That virtual Python environment (venv) is then wrapped into an archive file and shipped to the cluster nodes. Installing the packages and preparing the archive is usually done outside the PySpark application, either manually in a terminal or by running a shell script (Conda, venv-pack, PEX). Running such commands might be difficult for users, especially in a notebook environment like Jupyter.
With some approaches, the archive contains the entire Python installation, including packages the cluster nodes do not need, rather than just those required by the PySpark application. Shipping only the minimal set of required packages can improve start-up speed.
Those archived environments accompany the PySpark application. Consecutive versions of the app very likely depend on the same package versions, so the archives replicate identical Python environments over and over again. Installing the packages from a package repository at runtime instead significantly reduces the footprint of a deployed PySpark application.
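For illustration, the conventional archive-based workflow looks roughly like this (a minimal sketch along the lines of the venv-pack example in the Spark documentation; the archive name and paths are illustrative, and the environment must have been packed beforehand, outside the application, e.g. with pip install pandas pyarrow venv-pack followed by venv-pack -o pyspark_venv.tar.gz):
import os
from pyspark.sql import SparkSession

# point the executors at the Python interpreter inside the unpacked archive
os.environ["PYSPARK_PYTHON"] = "./environment/bin/python"

# ship the packed environment to the cluster nodes
# (use spark.yarn.dist.archives instead when running on YARN)
spark = SparkSession.builder \
    .config("spark.archives", "pyspark_venv.tar.gz#environment") \
    .getOrCreate()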
Install dependencies from Python code
An alternative approach is provided by the spark-extension package, which has the following objectives:
- Small footprint: Only required packages are shipped to the cluster nodes, not the entire environment that starts the PySpark application. Dependencies are downloaded at runtime to keep the profile of the application low.
- Isolation: Packages installed by one PySpark application must not be visible to other PySpark applications.
- Native code: Support packages with native code (compiled libraries).
- Interactive: Support package installation without restarting an active PySpark session.
- Based on PIP: Utilize PIP features like fetching from PyPi, caching files, building wheels, or version ranges.
The spark-extension package allows the user to install PIP packages and Poetry projects from within an active Spark session. This makes it especially useful for working with interactive notebooks like Jupyter. There is no need to restart the Spark session to install more packages.
These three steps are needed to install packages:
1. Add spark-extension to your PySpark application by either installing the pyspark-extension PyPi package or by adding the spark-extension Maven package (note the difference in the package names) as a dependency, as follows:
1.1. If you install the pyspark-extension PyPi package (on the driver only):
pip install pyspark-extension==2.11.1.3.5
1.2. If you add the spark-extension Maven package as a dependency:
from pyspark.sql import SparkSession
spark = SparkSession.builder \
.config("spark.jars.packages", "uk.co.gresearch.spark:spark-extension_2.12:2.11.0-3.5") \
.getOrCreate()
Alternatively, provide the Maven package on the command line when submitting the application or starting a PySpark shell:
spark-submit --packages uk.co.gresearch.spark:spark-extension_2.12:2.11.0-3.5 [script.py]
pyspark --packages uk.co.gresearch.spark:spark-extension_2.12:2.11.0-3.5
2. Import the gresearch.spark package
# noinspection PyUnresolvedReferences
from gresearch.spark import *
3. Install PIP packages or Poetry projects
spark.install_pip_package("pandas", "pyarrow")
spark.install_poetry_project("../my-poetry-project/", poetry_python="../venv-poetry/bin/python")
The install_pip_package function supports all pip install arguments:
# install packages with version specs
spark.install_pip_package("pandas==1.4.3", "pyarrow~=8.0.0")
# install packages from package sources (e.g. git clone https://github.com/pandas-dev/pandas.git)
spark.install_pip_package("./pandas/")
# install packages from git repo
spark.install_pip_package("git+https://github.com/pandas-dev/pandas.git@main")
# use a pip cache directory to cache downloaded and built whl files
spark.install_pip_package("pandas", "pyarrow", "--cache-dir", "/home/user/.cache/pip")
# use an alternative index url (other than https://pypi.org/simple)
spark.install_pip_package("pandas", "pyarrow", "--index-url", "https://artifacts.company.com/pypi/simple")
# install pip packages quietly (--quiet only suppresses the output of PIP)
spark.install_pip_package("pandas", "pyarrow", "--quiet")