Python Packaging
Brief Introduction to Python Packages
A package is a collection of Python modules that can be imported into Python scripts. Compared with standalone Python scripts, a package organizes code in a structured way to provide specific functionality, e.g., image processing or machine learning algorithms. numpy and scipy are good examples of Python packages. Basic package usage includes import ... and from ... import ....
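As a minimal illustration of the two import forms, the sketch below uses the standard-library package json instead of numpy or scipy, only so that it runs without installing anything. The record contents are made up for the example.

```python
# Form 1: import the package and reach its functions through the package name.
import json

# Form 2: import one specific name from the package into the current namespace.
from json import dumps

# Illustrative data only; any JSON-serializable dictionary would do.
record = {"name": "my_proj", "version": "0.0.1"}

text_a = json.dumps(record, sort_keys=True)  # accessed via the package namespace
text_b = dumps(record, sort_keys=True)       # accessed via the imported name

# Both forms refer to the same function, so the results are identical.
print(text_a == text_b)
```

The first form keeps the origin of each name visible at the call site, while the second is shorter when only a few names are needed.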
Packages have several advantages for developing and distributing Python code. They are the standard way to distribute code and can serve as building blocks for different projects. Versioning also enables continuous updates and progress tracking. Some guidelines for writing high-quality packages can be found here.
Writing Projects as a Package
In general, it is hard to complete a project (e.g., an ML project) with a single script. Therefore, we need to organize our code so that it is easy to read, debug, and maintain. A package provides a good structure for organizing the project code, which facilitates development and possible distribution. It also supports editable installs during development. Of course, if we only need simple scripts to implement the project, a package structure is not necessary.
To do this, we need to analyze the project requirements and break them into different functional modules. Each module can be further implemented by functions or classes. Then, we pack all functional modules into a package and import the package into our example scripts to achieve various goals.
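The workflow above can be sketched in a self-contained way. The package name demo_pkg, its modules preprocess and model, and the functions normalize and predict are all hypothetical names chosen for illustration. The package is written to a temporary directory only so that the example can run as a single file; in a real project these would be ordinary files on disk.

```python
import importlib
import sys
import tempfile
from pathlib import Path


def make_demo_package(root: Path) -> None:
    """Create a tiny hypothetical package 'demo_pkg' with two functional modules."""
    pkg = root / "demo_pkg"
    pkg.mkdir()
    # __init__.py marks the directory as a package and exposes its public functions.
    (pkg / "__init__.py").write_text(
        "from .preprocess import normalize\n"
        "from .model import predict\n"
    )
    # Functional module 1: data preparation.
    (pkg / "preprocess.py").write_text(
        "def normalize(values):\n"
        "    top = max(values)\n"
        "    return [v / top for v in values]\n"
    )
    # Functional module 2: a stand-in for the 'model' (just an average here).
    (pkg / "model.py").write_text(
        "def predict(values):\n"
        "    return sum(values) / len(values)\n"
    )


def run_example_script() -> float:
    """Play the role of a file in scripts/ that imports and uses the package."""
    with tempfile.TemporaryDirectory() as tmp:
        root = Path(tmp)
        make_demo_package(root)
        sys.path.insert(0, str(root))
        importlib.invalidate_caches()
        try:
            demo_pkg = importlib.import_module("demo_pkg")
            return demo_pkg.predict(demo_pkg.normalize([1.0, 2.0, 4.0]))
        finally:
            # Undo the path change and forget the temporary modules.
            sys.path.remove(str(root))
            for name in [n for n in sys.modules
                         if n == "demo_pkg" or n.startswith("demo_pkg.")]:
                del sys.modules[name]


result = run_example_script()
print(result)
```

The script only combines functions that the package exposes, which mirrors the split between image_to_latex/ and scripts/ described in the text.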
We take the Image To Latex Project as a typical Python project example. In the project, the basic functions are organized in image_to_latex/, which serves as a package. Scripts that achieve different goals by calling the package are put in scripts/. In this way, other users can directly use the image-to-latex package to implement their projects by referring to the examples in the scripts/.
Note that we could implement all requirements inside the package. For example, we could implement the training and the UI inside the package and encapsulate them in a run() function, so that calling run() alone gives a complete Image-to-Latex recognition system. While this is doable, it may introduce extra complexity into the package. We should treat a package as a tool for the project: we import the functionality of a package to build the project and meet the requirements.
We refer to the post Coding Practice for Python Projects for a reference structure for Python projects.
Packaging Configuration
If we want to distribute our Python code, we first need to package it into an agreed format and then ship it. A distributed package is also called a library. There are two types of distributions for Python packages: source and binary. Source distributions are suitable for small libraries. In practice, binary distributions are more common and are shipped as wheels, a package format for shipping libraries with compiled artifacts. Mature Python libraries can be uploaded to PyPI so that all Python users can find and use them. See also Overview of Python Packaging.
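The two distribution types can usually be told apart by file name. The sketch below is a rough illustration, assuming the common conventions that source distributions end in .tar.gz and wheels end in .whl with the Python, ABI, and platform tags as the last three dash-separated fields. The file names are hypothetical, and the helper names distribution_kind and wheel_tags are made up for this example.

```python
def distribution_kind(filename: str) -> str:
    """Classify a distribution file by its extension (simplified)."""
    if filename.endswith(".whl"):
        return "binary (wheel)"
    if filename.endswith(".tar.gz"):
        return "source (sdist)"
    return "unknown"


def wheel_tags(filename: str) -> tuple:
    """Return the (python tag, abi tag, platform tag) of a wheel file name."""
    if not filename.endswith(".whl"):
        raise ValueError("not a wheel file name")
    stem = filename[: -len(".whl")]
    # The last three dash-separated fields are the compatibility tags.
    return tuple(stem.split("-")[-3:])


# Hypothetical file names for the my_proj example used later in this post.
sdist_name = "my_proj-0.0.1.tar.gz"
wheel_name = "my_proj-0.0.1-py3-none-any.whl"

print(distribution_kind(sdist_name))
print(distribution_kind(wheel_name))
print(wheel_tags(wheel_name))
```

A pure-Python wheel typically carries generic tags such as py3-none-any, whereas a wheel with compiled artifacts names a specific interpreter and platform.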
There are three configuration files related to Python packaging: setup.py, setup.cfg, and pyproject.toml. These files are used to package Python code into distributable libraries. They are common but confusing, so we discuss their history for clarity.
A Brief History of Python Build Tools
We need extra tools to convert our code into packages. Such a tool is called a Python build tool. Nowadays, many build tools are available, and each needs its own configuration files, which can be confusing. Therefore, it is beneficial to know some history of Python build tools. The following is based on the post Extra: A History of Python Build Tools.
As we have seen, especially for binary libraries, we need tools to compile (build) the code and then package it, and people have developed many tools for this. At the very beginning, in Python 2.2, distutils was a module in Python’s standard library that allowed users to install and distribute their own packages. It was later superseded by setuptools and deprecated by PEP 632.
To use setuptools to build a Python project, we generally call its setup() function. We create a setup.py script in the project that calls functions from setuptools, and run the script to build the project. At this stage, people had to write a Python script to build a project. If we wanted to change some build parameters, we needed to read and understand the setup.py script and then modify it, which is not considered good style because the configuration is mixed with too much logic. Therefore, to make the configuration clearer, people extracted the settings (or options) in setup.py (more specifically, the arguments of the setup() function) into a new configuration file, setup.cfg. It is then sufficient to change build options in the configuration file, and there is no need to write complex code in setup.py. See What’s the difference between setup.py and setup.cfg in python projects.
However, setuptools is not in Python’s standard library. This means that if we want to package a Python project, we first need to install the setuptools package and then use it to process the project's required packages. For example, suppose we create a Python package foo that uses numpy. To build the package, we first need to install setuptools and tell it that the required package is numpy. In fact, we need both numpy and setuptools to build our project. In the era of distutils, this was not a problem for Python developers, as distutils was shipped as part of Python’s standard library. So can we use a configuration file to specify setuptools as a required package? Besides, there are other Python build tools, for example, flit, hatch, pdm, poetry, trampolim, and whey. Can we also use a configuration file to specify which one to use?
This consideration is reflected in PEP 517/518, starting in 2015, where people introduced a standardized configuration file, pyproject.toml, to specify the build configuration. Since the majority of Python projects were built with setuptools, at first two configuration files, pyproject.toml and setup.cfg, were used together: the first specifies that setuptools is the build tool, and the second specifies the setup options.
Moving on to 2020, PEP 621 incorporated project metadata (build options) into pyproject.toml. In this way, there is no need for setup.cfg, since everything can be written in a single pyproject.toml. With PEP 660, the Python community standardized a way to use wheel files to create editable installs, and therefore setup.py is no longer required either. As a result, a current Python project only needs to include one pyproject.toml; setup.py and setup.cfg are no longer needed for build configuration. PyPA also recommends that modern Python projects use pyproject.toml for build configuration.
Difference Between Three Files
From the history, we know that:
- setup.py is a Python script for building a Python project using utilities from the package setuptools.
- setup.cfg is a straightforward configuration file for the setup() function in setup.py. It was created to reduce the complex logic needed to set build configurations. People can modify configurations directly in this file.
- pyproject.toml is a new configuration file that unifies build backend selection and build configuration. It is recommended to use pyproject.toml for build configuration.
This post is helpful: Understanding setup.py, setup.cfg and pyproject.toml in Python
Usage
Some useful references:
- A Practical Guide to Using Setup.py
- A Practical Guide to Setuptools and Pyproject.toml
- Writing your pyproject.toml, tutorial by PyPA
- Configuring setuptools using pyproject.toml files, tutorial by setuptools
Use setup.py Only
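from setuptools import setup, find_packages

setup(
    name='my_proj',
    version='0.0.1',
    description='pip install test with setup.py only',
    # Include the top-level package and all of its subpackages.
    packages=find_packages(include=['my_proj', 'my_proj.*']),
    # Packages required at runtime.
    install_requires=[
        'numpy>=1.26.0',
        'scipy>=1.13.0',
    ],
    # Optional extras, installed with: pip install my_proj[interactive]
    extras_require={
        'interactive': ['matplotlib>=3.6.0'],
    },
)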
Use setup.py and setup.cfg
# setup.py
from setuptools import setup
setup()
# setup.cfg
[metadata]
name = my_proj
version = 0.0.1
description = pip install test with setup.cfg
[options]
packages = find:
install_requires =
    numpy >= 1.26.0
    scipy >= 1.13.0
[options.extras_require]
interactive = matplotlib>=3.6.0
[options.packages.find]
include = my_proj, my_proj.*
We can find more specifications of setup.cfg in Configuring setuptools using setup.cfg files
Note: We cannot use a single setup.cfg for building. A setup.py file containing a setup() function call is still required even if the configuration resides in setup.cfg. They need to be used together. See Configuring setuptools using setup.cfg files
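To illustrate why a declarative setup.cfg is easier to handle than logic in setup.py, the sketch below reads the same options with the standard-library configparser module, without executing any project code. This is only an illustration: setuptools uses its own loader for setup.cfg, and the variable names here are assumptions made for the example.

```python
import configparser

# The same content as the setup.cfg above, embedded as a string for the example.
SETUP_CFG = """\
[metadata]
name = my_proj
version = 0.0.1

[options]
packages = find:
install_requires =
    numpy >= 1.26.0
    scipy >= 1.13.0

[options.extras_require]
interactive = matplotlib>=3.6.0
"""

parser = configparser.ConfigParser()
parser.read_string(SETUP_CFG)

name = parser["metadata"]["name"]
version = parser["metadata"]["version"]

# A multi-line value is returned as one string, so split it into requirements.
raw_requires = parser["options"]["install_requires"]
requires = [line.strip() for line in raw_requires.splitlines() if line.strip()]

print(name, version)
print(requires)
```

Because the file is plain data, changing a build option means editing one line rather than reading and modifying a script.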
Use pyproject.toml and setup.cfg
The setup.cfg remains the same as the previous approach.
[build-system]
requires = ["setuptools"]
build-backend = "setuptools.build_meta"
Use pyproject.toml Only
[build-system]
requires = ["setuptools >= 61.0"]
build-backend = "setuptools.build_meta"
[project]
name = "my_proj"
version = "0.0.1"
description = "pip install test with pyproject.toml only."
requires-python = ">=3.8"
dependencies = [
    "numpy>=1.26.0",
    "scipy>=1.13.0",
]
[project.optional-dependencies]
interactive = ["matplotlib>=3.6.0"]
[tool.setuptools.packages.find]
where = ["."]
include = ["my_proj", "my_proj.*"]
requirements.txt vs pyproject.toml
One is for configuring environments, the other is for building packages. Their contents overlap in some cases, which often causes confusion.
Although the two files share some content, i.e., dependencies, they serve completely different purposes. The same applies to requirements.txt and setup.py. See install_requires vs requirements files
pyproject.toml describes dependencies through the dependencies and optional-dependencies fields. These are the dependencies that build and install tools use when building and installing the project. We can think of them as a listing of “abstract” requirements that the project needs to run correctly.
However, a dependency may itself depend on other packages. requirements.txt lists pip install arguments in a file. It aims to show which packages are needed to set up the environment in which the package runs. In other words, requirements.txt tells us which packages are needed for a complete environment. It often contains an exhaustive listing of pinned versions.
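The difference between abstract and pinned requirements can be made concrete with a small check. The helper is_pinned is a hypothetical, simplified function written for this example (it is not part of pip), and the pinned version numbers are illustrative rather than taken from a real environment.

```python
def is_pinned(requirement: str) -> bool:
    """Return True if the requirement fixes one exact version with '=='."""
    # Drop an inline comment, if any, and surrounding whitespace.
    spec = requirement.split("#", 1)[0].strip()
    # 'pkg==1.2.*' is a prefix match, not an exact pin, so exclude wildcards.
    return "==" in spec and "*" not in spec


# Abstract requirements, as they would appear in pyproject.toml.
abstract = ["numpy>=1.26.0", "scipy>=1.13.0"]

# Pinned requirements, as they might appear in requirements.txt
# (version numbers are illustrative only).
pinned = ["numpy==1.26.4", "scipy==1.13.1"]

print([is_pinned(r) for r in abstract])
print([is_pinned(r) for r in pinned])
```

Abstract requirements state what range of versions the project can work with, while pinned requirements record one concrete environment that can be reproduced.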
Note that requirements.txt is not used for building the package. We run pip install -r requirements.txt to set up the environment in which the package runs, but the package itself is not built by this step. The file is more like a deployment document that describes the environment.
If we simply write scripts and do not want to build a package, then requirements.txt is sufficient. We install the necessary packages and then run the scripts. However, note that using import and from ... import ... between self-written scripts is not good practice unless the scripts are organized in a package structure.