Linux package management

Updated: 2005-03-02
Created: 2003

A less schematic discussion of many of the same concepts is available here.

How to map packages onto filesystems

  1. A package contains a collection of names and files.
  2. So does a filesystem; indeed a package contains a mini filesystem.
  3. Traditional method: just copy the package into the filesystem.
  4. Problems:
    • Cannot uninstall easily.
    • Overlaps among packages undetected.
    • Overwrites because of overlaps cannot be undone.
  5. Why is this bad?
    • Security.
    • Lack of reproducibility for mass installs.
    • Lack of reinstallability, especially configs.
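
The overlap problem in point 4 can be made concrete with a small sketch. It assumes a package is modelled simply as a mapping from relative path to content; the package names and contents are made up for illustration. The function reports every path claimed by more than one package, which is exactly the check that a plain copy never performs.

```python
from collections import defaultdict

def find_overlaps(packages):
    """Return {path: [package names]} for paths claimed by more than one package.

    `packages` maps a package name to a dict of {relative path: content}.
    """
    owners = defaultdict(list)
    for name, files in packages.items():
        for path in files:
            owners[path].append(name)
    return {path: names for path, names in owners.items() if len(names) > 1}

# Hypothetical packages, for illustration only.
packages = {
    "fileutils": {"bin/ls": "ls-binary", "etc/DIR_COLORS": "defaults"},
    "site-config": {"etc/DIR_COLORS": "site settings"},
}
overlaps = find_overlaps(packages)
```

With a plain copy the second package would silently replace etc/DIR_COLORS, and nothing would record that the first version ever existed.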

Requirements

Bear in mind the concept of unavoidable functionality: it must be implemented somewhere, and the only choice is whether that is in the computer or in the user's head (e.g. spooling, compilers, ...).

  1. Cascading package merging, undoable. Reasons:
    • Patches/upgrades.
    • Configuration: default config, site config, host config, each in a separate package.
  2. Very simple requirements, implementation hard:
    • Add package to filesystem.
    • Given a package, list all files in it (by filesystem, which includes listing files that are not installed because they are overridden).
    • Given a filesystem (which can be just a simple file), list all packages in it (this includes listing all files that are not in any package).
    • Remove package from filesystem, restoring state before adding (undoing overrides).
  3. Additional requirements:
    • List set of package installation prerequisite capabilities.
    • List set of package provided capabilities.
    • Handle very large numbers of packages and files.
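
The two listing requirements can be sketched as pure queries over a record of what was installed and in which order. This is a minimal sketch under stated assumptions: a package is just a set of paths, the last-installed package wins an overlap, and all names (effective_owner, the example packages) are hypothetical.

```python
def effective_owner(install_order, packages, path):
    """Return the last-installed package that contains `path`, or None."""
    owner = None
    for name in install_order:
        if path in packages[name]:
            owner = name
    return owner

def list_package_files(install_order, packages, name):
    """List every file of an installed package, flagging overridden ones."""
    report = {}
    for path in sorted(packages[name]):
        owner = effective_owner(install_order, packages, path)
        report[path] = "installed" if owner == name else f"overridden by {owner}"
    return report

def list_filesystem_owners(install_order, packages, filesystem_paths):
    """Map every path in the filesystem to its owner (None means unowned)."""
    return {p: effective_owner(install_order, packages, p)
            for p in sorted(filesystem_paths)}

# Hypothetical data, for illustration only.
packages = {
    "emacs": {"usr/bin/emacs", "usr/lib/emacs/site-init.el"},
    "emacs-site-config": {"usr/lib/emacs/site-init.el"},
}
install_order = ["emacs", "emacs-site-config"]
filesystem = {"usr/bin/emacs", "usr/lib/emacs/site-init.el", "etc/motd"}

emacs_files = list_package_files(install_order, packages, "emacs")
owners = list_filesystem_owners(install_order, packages, filesystem)
```

The linear scan in effective_owner is only for clarity; the requirement to handle very large numbers of packages and files would call for an index.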

The problem with overlapping packages

  1. Overlaps must be detected.
  2. Overlapped files must be saved on install.
  3. Saved overlaps must be restored on uninstall.
  4. This can happen to several levels.
  5. Sophisticated state tracking.
  6. What about partially overlapping files? Not a problem to be solved at this level; thus the trend towards splitting files into directories.
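
The save and restore steps above, including overlaps to several levels, can be sketched with one stack of versions per path. This is a toy in-memory model, not how any existing package manager stores its state; the class and package names are hypothetical.

```python
class LayeredFilesystem:
    """Toy filesystem keeping, per path, a stack of (owner, content) versions."""

    def __init__(self, initial=None):
        # Files that exist before any package is installed have owner None.
        self.stacks = {p: [(None, c)] for p, c in (initial or {}).items()}

    def install(self, name, files):
        for path, content in files.items():
            # Pushing keeps whatever was there before: the overlap is saved.
            self.stacks.setdefault(path, []).append((name, content))

    def uninstall(self, name):
        for path in list(self.stacks):
            # Drop this package's layer wherever it sits in the stack.
            self.stacks[path] = [(o, c) for o, c in self.stacks[path] if o != name]
            if not self.stacks[path]:
                del self.stacks[path]

    def read(self, path):
        return self.stacks[path][-1][1]   # top of the stack is the visible file

    def owner(self, path):
        return self.stacks[path][-1][0]

fs = LayeredFilesystem({"etc/app.conf": "hand-written"})
fs.install("app", {"etc/app.conf": "default", "bin/app": "binary"})
fs.install("app-site", {"etc/app.conf": "site"})
fs.install("app-host", {"etc/app.conf": "host"})
fs.uninstall("app-site")          # removing a middle layer changes nothing visible
visible = fs.read("etc/app.conf")
```

Removing the remaining packages in any order eventually brings back the hand-written file, which is the state tracking that point 5 refers to.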

Filesystem structure

  1. A filesystem is a classification system.
  2. It is implemented as a set of names that map (many-to-one) to a set of files.
  3. Directories need not actually be implemented, can be entirely virtual (but beware search permissions). Names don't necessarily have any given structure.

Multidimensional view

  1. Each name can consist of keywords, listed in any order.
  2. usr/lib/emacs/site-init.el same as lib/site-init.el/emacs/usr.
  3. Any set of keywords defines a directory.
  4. Any unique subset of a file's keywords identifies it.
  5. cd changes the set of default keywords.
  6. This solves the package problem.
  7. This does not exist, we need kludges.
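
Since no such filesystem exists, the idea can only be illustrated with a toy model. The sketch below treats a name as an unordered set of keywords; the class and method names are hypothetical, and it ignores the case of a name that repeats a keyword.

```python
class KeywordFilesystem:
    """Toy store in which a file name is an unordered set of keywords."""

    def __init__(self):
        self.files = {}               # frozenset of keywords -> content
        self.defaults = frozenset()   # the "current directory"

    @staticmethod
    def parse(name):
        return frozenset(k for k in name.split("/") if k)

    def add(self, name, content):
        self.files[self.parse(name)] = content

    def cd(self, name):
        # cd replaces the set of default keywords.
        self.defaults = self.parse(name)

    def ls(self, name=""):
        """Any set of keywords defines a directory: list the files that carry them all."""
        wanted = self.parse(name) | self.defaults
        return sorted("/".join(sorted(k)) for k in self.files if wanted <= k)

    def lookup(self, name):
        """Return the content if the keywords identify exactly one file."""
        wanted = self.parse(name) | self.defaults
        matches = [c for k, c in self.files.items() if wanted <= k]
        if len(matches) != 1:
            raise LookupError(f"{len(matches)} files match {sorted(wanted)}")
        return matches[0]

kfs = KeywordFilesystem()
kfs.add("usr/lib/emacs/site-init.el", "elisp source")
kfs.add("usr/bin/emacs", "editor binary")
same = kfs.lookup("lib/site-init.el/emacs/usr")   # keyword order is irrelevant
short = kfs.lookup("site-init.el")                # a unique subset is enough
kfs.cd("bin")
from_bin = kfs.lookup("emacs")                    # defaults disambiguate
```

If the package name is simply one more keyword, listing by that keyword gives the package view and listing by bin or man gives the merged view, with no copying or linking.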

Hierarchical view

  1. The set of keywords must be listed in an order given at file construction.
  2. Each different ordering defines a different name.
  3. Which ordering for package installation? In practice two:
    • Package name tree prefix:
      fileutils/bin/ls
      fileutils/man/ls.1
      textutils/bin/cat
      textutils/man/cat.1
      
    • Merged trees, no leading package name, only categories:
      bin/ls
      man/ls.1
      bin/cat
      man/cat.1
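
The relation between the two orderings can be sketched as follows: going from the package-leading tree to the merged tree means dropping the first component, and the dropped component is the ownership information that then has to be kept somewhere else. The function name is hypothetical.

```python
from pathlib import PurePosixPath

def to_merged(package_leading):
    """Strip the leading package component, recording which packages owned each path."""
    merged = {}
    for name in package_leading:
        package, *rest = PurePosixPath(name).parts
        merged.setdefault("/".join(rest), []).append(package)
    return merged

depot = [
    "fileutils/bin/ls",
    "fileutils/man/ls.1",
    "textutils/bin/cat",
    "textutils/man/cat.1",
]
merged = to_merged(depot)
merged_names = sorted(merged)   # what the user sees in the merged tree
```

The keys are the merged tree; the values are what a merged tree alone cannot express.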
      

Which view?

  1. Package name leading solves the package problem.
  2. However, because of paths, UNIX/Linux uses the merged tree structure (except for /opt), with several trees and subtrees.
  3. How to preserve package ownerships in merged trees?
    • In-band multiple views, via link (hard or symbolic) farms: one canonical (because of overlap restore) package leading view (the depot), one merged view.
    • Out-of-band databases: one canonical merged view, database tracks package ownership.
  4. In-band solves to a large extent the package problem, but has other problems:
    • Hard links don't span partitions, don't apply to directories, cpio not suitable.
    • Symbolic links are ugly, inefficient, fragile.
  5. Out-of-band requires a lot of extra work, and does not easily solve the overlap restore problem.
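
The in-band approach can be sketched with the standard library: a package-leading depot, a merged tree made of symbolic links, and ownership recovered by reading the link. This is a minimal sketch in a temporary directory, assuming a POSIX system where symbolic links can be created; the function names and directory layout are hypothetical, and it refuses overlaps instead of saving them.

```python
import os
import tempfile

def build_link_farm(depot, merged):
    """Symlink every file under depot/<package>/... into the merged tree."""
    for package in sorted(os.listdir(depot)):
        package_root = os.path.join(depot, package)
        for dirpath, _dirs, filenames in os.walk(package_root):
            rel_dir = os.path.relpath(dirpath, package_root)
            target_dir = os.path.normpath(os.path.join(merged, rel_dir))
            os.makedirs(target_dir, exist_ok=True)
            for filename in filenames:
                link = os.path.join(target_dir, filename)
                if os.path.lexists(link):
                    raise FileExistsError(f"overlap on {link}")
                os.symlink(os.path.join(dirpath, filename), link)

def owning_package(depot, path):
    """Recover the owner in-band: the first component of the link target under the depot."""
    if not os.path.islink(path):
        return None   # a real file, so not part of any package
    target = os.path.relpath(os.readlink(path), depot)
    return target.split(os.sep)[0]

with tempfile.TemporaryDirectory() as root:
    depot = os.path.join(root, "depot")
    merged = os.path.join(root, "merged")
    for name in ("fileutils/bin/ls", "textutils/bin/cat"):
        os.makedirs(os.path.dirname(os.path.join(depot, name)))
        open(os.path.join(depot, name), "w").close()
    os.makedirs(merged)
    open(os.path.join(merged, "motd"), "w").close()   # a local, unpackaged file
    build_link_farm(depot, merged)
    merged_bin = sorted(os.listdir(os.path.join(merged, "bin")))
    owner_of_ls = owning_package(depot, os.path.join(merged, "bin", "ls"))
    owner_of_motd = owning_package(depot, os.path.join(merged, "motd"))
```

No database is needed for either query, but every access goes through a link, which is the fragility and inefficiency noted above.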

Package structure

Files
  • The single most important part is the list of files that are part of the package. In theory this is all that should be necessary and desirable; any other information might render the package installation stateful.
  • The other issue is whether the paths are absolute or relative, or both. Ideally relative, but rare.
Scripts
  • Often packages contain pre/post install/remove scripts. These are bad news, because the package state is carried inside more or less invisible or incomprehensible code.
  • They are usually used to edit configuration files or to start/stop daemons, automagically, something that the user should do themselves.
Metadata
  • Metadata is not bad as long as it does not affect the semantics of the package.
  • It usually includes both package and packager information.
  • Particularly important is version information: for the original sources and for the particular package instance.

Dependencies

Not so much package specific, but package system specific.

Some packages contain only dependencies, usually called virtual packages.

Build requisites
  • If the package manager provides a particular build logic, a package might be tagged with the list of packages that must be installed in order to build it.
  • Absolutely essential for distribution builders.
  • It creates a number of very tricky situations.
Runtime requisites
  • List of packages or capabilities.
  • Sometimes list of shared libraries (bad idea).
Runtime provides
These are most useful if generic: most packages require functionality from another package, not a specific package.
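
Matching runtime requisites against generic provides can be sketched as a set difference. The package data is illustrative only, and the rule that a package implicitly provides its own name is an assumption of this sketch.

```python
def missing_capabilities(packages, selection):
    """Return {package: required capabilities that nothing in `selection` provides}."""
    provided = set()
    for name in selection:
        provided.add(name)   # assumption: a package provides its own name
        provided |= packages[name].get("provides", set())
    missing = {}
    for name in selection:
        unmet = packages[name].get("requires", set()) - provided
        if unmet:
            missing[name] = unmet
    return missing

# Illustrative package data, not taken from any real distribution database.
packages = {
    "mutt": {"requires": {"mail-transport-agent"}},
    "postfix": {"provides": {"mail-transport-agent"}},
    "exim": {"provides": {"mail-transport-agent"}},
}
alone = missing_capabilities(packages, ["mutt"])
with_postfix = missing_capabilities(packages, ["mutt", "postfix"])
with_exim = missing_capabilities(packages, ["mutt", "exim"])
```

Because mutt asks for a capability and not for a specific package, either mail transport agent satisfies it.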

Some solutions

  1. Lots of package formats, there is a converter, Alien, and a related comparison of the package formats it can convert.

Depot

  1. The common ancestor of all link farm systems, developed at CMU.
  2. Each volume has a depot subdirectory in which packages are installed with package name leading.
  3. Installation merges into the filesystem by creating symbolic links.
  4. Optimisations: if a filesystem directory contains files from a single package, that is done as a symbolic link to the package directory. This can change if files from other packages have to be installed in that directory.
  5. Easy to list bits of filesystem that are not in any package, and which package any bit of filesystem comes from (both encoded in the symbolic link).
  6. Restoring overlaps requires extra state.
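
The optimisation in point 4 can be sketched as a planning step: link a whole directory when a single package owns everything under it, and link individual files otherwise. This is not Depot's actual algorithm, only an illustration of the idea; the depot/<package>/ layout and the names are hypothetical, and files that belong to no package are ignored.

```python
from collections import defaultdict
from pathlib import PurePosixPath

def plan_links(packages):
    """Return {merged path: depot target}, folding directories owned by one package."""
    dir_owners = defaultdict(set)
    for package, files in packages.items():
        for f in files:
            for parent in PurePosixPath(f).parents:
                if str(parent) != ".":
                    dir_owners[str(parent)].add(package)
    plan = {}
    for package, files in packages.items():
        for f in files:
            parts = PurePosixPath(f).parts
            # Walk from the top directory down; fold at the first one owned alone.
            for depth in range(1, len(parts)):
                directory = "/".join(parts[:depth])
                if dir_owners[directory] == {package}:
                    plan[directory] = f"depot/{package}/{directory}"
                    break
            else:
                plan[f] = f"depot/{package}/{f}"
    return plan

packages = {
    "fileutils": ["bin/ls", "man/man1/ls.1"],
    "textutils": ["bin/cat", "man/man1/cat.1"],
    "emacs": ["bin/emacs", "lib/emacs/site-init.el"],
}
plan = plan_links(packages)
```

Here lib becomes a single link into the emacs package, while bin and man need one link per file. Installing a second package with files under lib would force that directory link to be unfolded into a real directory, which is the change mentioned in point 4.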

Lots of others ('stow' etc.)

stow
Symbolic link farm.
GRAFT
Another symbolic link farm.
SEPP
Symbolic link farm, but quite mad (a single per-package wrapper is used to invoke every command in the package).
Slackware package tools
tar archives with scripts and manifest.
Portage
Used by the Gentoo distribution.
SLP
Used by the Stampede distribution.

Major Linux ones

  1. All popular package managers are out-of-band; perhaps this is not that good.
  2. RPM is the LSB package manager, so the others matter really little; perhaps this is not that good.

RPM

  1. Very very badly documented.
  2. Out of band; state is kept in binary format in a Berkeley DB database. Various versions of DB have been used.
  3. Package files are in-band, with a binary header followed by a cpio archive.
  4. Overrides are sort of handled: checksums, and some files may be renamed to .rpmsave (overridden) or .rpmnew (not overriding). Probably its best feature.
  5. Only popular package manager that uses cpio; not a bad choice overall, as tar is a bigger mess, especially with long filenames. Historical reasons too.
  6. All package metadata contained in a .spec file. Poorly designed format, in particular for relocatable packages.
  7. Each distribution defines different .spec file extensions. Because of this and different distribution filesystem layouts, RPM packages are not portable.
  8. LSB supposedly standardises RPM and filesystem layouts.
  9. Conectiva Linux had modified Debian's APT to work with RPM instead of DPKG.
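
The checksum-based override handling in point 4 can be sketched as a three-way comparison between the file shipped by the old package, the file on disk, and the file shipped by the new package. This is a simplified illustration, not RPM's code: the function name, the returned strings, the noreplace flag as modelled here, and the choice of SHA-256 are all assumptions of the sketch.

```python
import hashlib

def digest(data: bytes) -> str:
    return hashlib.sha256(data).hexdigest()

def config_upgrade_action(old_pkg: bytes, on_disk: bytes, new_pkg: bytes,
                          noreplace: bool) -> str:
    """Decide what to do with a configuration file when upgrading a package."""
    old, disk, new = digest(old_pkg), digest(on_disk), digest(new_pkg)
    if disk == new:
        return "keep"       # already identical to the new version
    if disk == old:
        return "replace"    # never edited locally, safe to overwrite
    if new == old:
        return "keep"       # edited locally, and the package did not change it
    # Edited locally and changed by the packager: one side must be set aside.
    if noreplace:
        return "write new as .rpmnew"
    return "move current to .rpmsave, then replace"

untouched = config_upgrade_action(b"a=1\n", b"a=1\n", b"a=2\n", noreplace=False)
edited = config_upgrade_action(b"a=1\n", b"a=9\n", b"a=2\n", noreplace=False)
edited_kept = config_upgrade_action(b"a=1\n", b"a=9\n", b"a=2\n", noreplace=True)
```

This handles a single level of override only; it does not provide the multi-level save and restore described under the overlap problem.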

DPKG

  1. Out of band installation; state is kept as a set of text files, about five files for each package.
  2. Package files are in-band, all belong to an ar archive that contains a couple of tar archives and a tag file.
  3. Very poor implementation choices, in particular the state directory can contain tens of thousands of files.
  4. More complete dependency management than others.
  5. Clever, but grossly inefficient, frontends.
  6. Not in the LSB, fortunately.
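
The package file layout in point 2 can be sketched by walking the members of an ar archive by hand. The reader handles only the common ar format with short member names; the member names used in the demonstration are the conventional ones for a binary package, the payloads are fake, and make_member is a hypothetical helper that exists only to build test input.

```python
import io

AR_MAGIC = b"!<arch>\n"

def ar_members(data: bytes):
    """Yield (name, payload) for each member of a common-format ar archive."""
    stream = io.BytesIO(data)
    if stream.read(len(AR_MAGIC)) != AR_MAGIC:
        raise ValueError("not an ar archive")
    while True:
        header = stream.read(60)
        if not header:
            break
        if len(header) != 60 or header[58:60] != b"`\n":
            raise ValueError("corrupt ar member header")
        # Fixed-width ASCII fields: name (16 bytes) ... size (10 bytes) + 2 magic bytes.
        name = header[0:16].decode("ascii").rstrip().rstrip("/")
        size = int(header[48:58].decode("ascii"))
        payload = stream.read(size)
        if size % 2:
            stream.read(1)   # members are padded to an even length
        yield name, payload

def make_member(name: str, payload: bytes) -> bytes:
    """Hypothetical helper: encode one member so the reader has something to parse."""
    header = f"{name:<16}{0:<12}{0:<6}{0:<6}{'100644':<8}{len(payload):<10}"
    padding = b"\n" if len(payload) % 2 else b""
    return header.encode("ascii") + b"`\n" + payload + padding

archive = AR_MAGIC + b"".join([
    make_member("debian-binary", b"2.0\n"),    # the tag file
    make_member("control.tar.gz", b"abc"),     # fake payload standing in for metadata
    make_member("data.tar.gz", b"defgh"),      # fake payload standing in for the files
])
members = list(ar_members(archive))
```

In a real package the two tar payloads could then be handed to the standard tarfile module.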

POSIX/Solaris pkgadd

  1. Used by most proprietary UNIXes, also used at one time by Caldera, which has now switched to RPM.
  2. Package is a tar archive with in-band scripts.
  3. Package is first unpacked in a temporary directory, and then copied to its definitive resting place.

Other issues

Package naming

General principle: left to right increasing specificity.

Filesystem layout

It should have major hierarchies (e.g. /, usr) and frameworks/subhierarchies (e.g. TeX, X11R6, grass) in which multiple related packages (related by use of the same libraries or data formats) get merged.

Package building

Resources