Existing approaches and drawbacks
All alternative approaches – like Conda, packed virtual environments, PEX, and cluster-pack – require the packages to be installed locally first. The resulting Python environment is then wrapped into an archive file and shipped to the cluster nodes. Installing the packages and preparing the archive is usually done outside the PySpark application, either manually in a terminal or by running a shell script (Conda, venv-pack, PEX). Running such commands can be difficult for users, especially in a notebook environment like Jupyter.
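As an illustration, here is a minimal sketch of that archive-based workflow, assuming an environment packed with conda-pack into a hypothetical my-env.tar.gz and a Spark version (3.1 or later) that supports the spark.archives configuration:

# packed outside the application, e.g. in a terminal:
#   conda pack -n my-env -o my-env.tar.gz
import os
from pyspark.sql import SparkSession

# the archive is unpacked on each node under the alias "environment";
# point the executors at the Python interpreter inside it
os.environ["PYSPARK_PYTHON"] = "./environment/bin/python"
spark = SparkSession.builder \
    .config("spark.archives", "my-env.tar.gz#environment") \
    .getOrCreate()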
For some approaches, the archive contains the entire Python installation, not just the packages the PySpark application actually requires on the cluster nodes. Shipping only the minimal set of required packages can improve start-up speed.
Those archived environments are deployed alongside the PySpark application. Consecutive versions of the app very likely depend on the same package versions, so the archives replicate identical Python environments over and over again. Installing the packages from a package repository at runtime instead significantly reduces the footprint of a deployed PySpark application.
Install dependencies from Python code
An alternative approach is provided by the spark-extension package, which has the following objectives:
- Small footprint: Only required packages are shipped to the cluster nodes, not the entire environment that starts the PySpark application. Dependencies are downloaded at runtime to keep the profile of the application low.
- Isolation: Other PySpark apps must not see the packages.
- Native code: Support packages with native code (compiled libraries).
- Interactive: Support package installation without restarting an active PySpark session.
- Based on PIP: Utilize PIP features like fetching from PyPI, caching files, building wheels, or version ranges.
The spark-extension package allows the user to install PIP packages and Poetry projects from within an active Spark session. This makes it especially useful for working with interactive notebooks like Jupyter. There is no need to restart the Spark session to install more packages.
These three steps are needed to install packages:
1. Add spark-extension to your PySpark application by either installing the pyspark-extension PyPI package or by adding the spark-extension Maven package (note the difference in package names) as a dependency, as follows:
1.1. If you install the pyspark-extension PyPI package (on the driver only):
pip install pyspark-extension==2.11.1.3.5
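With the PyPI package installed on the driver, the Spark session can then be created as usual; assuming the package bundles the matching spark-extension jar, no additional spark.jars.packages configuration should be required:

from pyspark.sql import SparkSession

# no spark.jars.packages needed here, assuming the jar ships with the pyspark-extension package
spark = SparkSession.builder.getOrCreate()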
1.2. If you add the spark-extension Maven package as a dependency:
from pyspark.sql import SparkSession

spark = SparkSession.builder \
    .config("spark.jars.packages", "uk.co.gresearch.spark:spark-extension_2.12:2.11.0-3.5") \
    .getOrCreate()
Alternatively, add the package on the command line when submitting the application or starting the PySpark shell:

spark-submit --packages uk.co.gresearch.spark:spark-extension_2.12:2.11.0-3.5 [script.py]

pyspark --packages uk.co.gresearch.spark:spark-extension_2.12:2.11.0-3.5
2. Import the gresearch.spark package
# noinspection PyUnresolvedReferences
from gresearch.spark import *
3. Install PIP packages or Poetry projects
spark.install_pip_package("pandas", "pyarrow") spark.install_poetry_project("../my-poetry-project/", poetry_python="../venv-poetry/bin/python")
The install_pip_package function supports all pip install arguments:
# install packages with version specs
spark.install_pip_package("pandas==1.4.3", "pyarrow~=8.0.0")

# install packages from package sources (e.g. git clone https://github.com/pandas-dev/pandas.git)
spark.install_pip_package("./pandas/")

# install packages from git repo
spark.install_pip_package("git+https://github.com/pandas-dev/pandas.git@main")

# use a pip cache directory to cache downloaded and built whl files
spark.install_pip_package("pandas", "pyarrow", "--cache-dir", "/home/user/.cache/pip")

# use an alternative index url (other than https://pypi.org/simple)
spark.install_pip_package("pandas", "pyarrow", "--index-url", "https://artifacts.company.com/pypi/simple")

# install pip packages quietly (only disables output of PIP)
spark.install_pip_package("pandas", "pyarrow", "--quiet")
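To illustrate the effect, here is a short usage sketch: once pandas and pyarrow have been installed this way, code that needs them on the executors, such as a pandas UDF, works as usual without restarting the session:

import pandas as pd
from pyspark.sql.functions import pandas_udf

@pandas_udf("double")
def plus_one(v: pd.Series) -> pd.Series:
    # runs on the executors and requires pandas and pyarrow there
    return v + 1.0

spark.range(10).select(plus_one("id")).show()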