Modifications:
- [x] Move `spack.util.string` to `llnl.string`
- [x] Remove dependency of `llnl` on `spack.error`
- [x] Move path of `spack.util.path` to `llnl.path`
- [x] Move `spack.util.environment.get_host_*` to `spack.spec`
Instead of showing
```
==> Error: Timed out waiting for a write lock.
```
show
```
==> Error: Timed out waiting for a write lock after 1.200ms and 4 attempts on file: /some/file
```
s.t. we actually get to see where acquiring a lock failed even when not
running in debug mode.
And use pretty time units everywhere, so we don't get 1.45e-9 seconds
but 1.450ns etc.
This adds lockfile tracking to Spack's lock mechanism, so that we ensure that there
is only one open file descriptor per inode.
The `fcntl` locks that Spack uses are associated with an inode and a process.
This is convenient, because if a process exits, it releases its locks.
Unfortunately, this also means that if you close a file, *all* locks associated
with that file's inode are released, regardless of whether the process has any
other open file descriptors on it.
Because of this, we need to track open lock files so that we only close them when
a process no longer needs them. We do this by tracking each lockfile by its
inode and process id. This has several nice properties:
1. Tracking by pid ensures that, if we fork, we don't inadvertently track the parent
process's lockfiles. `fcntl` locks are not inherited across forks, so we'll
just track new lockfiles in the child.
2. Tracking by inode ensures that referencs are counted per inode, and that we don't
inadvertently close a file whose inode still has open locks.
3. Tracking by both pid and inode ensures that we only open lockfiles the minimum
number of times necessary for the locks we have.
Note: as mentioned elsewhere, these locks aren't thread safe -- they're designed to
work in Python and assume the GIL.
Tasks:
- [x] Introduce an `OpenFileTracker` class to track open file descriptors by inode.
- [x] Reference-count open file descriptors and only close them if they're no longer
needed (this avoids inadvertently releasing locks that should not be released).
* fix remaining flake8 errors
* imports: sort imports everywhere in Spack
We enabled import order checking in #23947, but fixing things manually drives
people crazy. This used `spack style --fix --all` from #24071 to automatically
sort everything in Spack so PR submitters won't have to deal with it.
This should go in after #24071, as it assumes we're using `isort`, not
`flake8-import-order` to order things. `isort` seems to be more flexible and
allows `llnl` mports to be in their own group before `spack` ones, so this
seems like a good switch.
In debug mode, processes taking an exclusive lock write out their node name to
the lock file. We were using `getfqdn()` for this, but it seems to produce
inconsistent results when used from within some github actions containers.
We get this error because getfqdn() seems to return a short name in one place
and a fully qualified name in another:
```
File "/home/runner/work/spack/spack/lib/spack/spack/test/llnl/util/lock.py", line 1211, in p1
assert lock.host == self.host
AssertionError: assert 'fv-az290-764....cloudapp.net' == 'fv-az290-764'
- fv-az290-764.internal.cloudapp.net
+ fv-az290-764
!!!!!!!!!!!!!!!!!!!! Interrupted: stopping after 1 failures !!!!!!!!!!!!!!!!!!!!
== 1 failed, 2547 passed, 7 skipped, 22 xfailed, 2 xpassed in 1238.67 seconds ==
```
This seems to stem from https://bugs.python.org/issue5004.
We don't really need to get a fully qualified hostname for debugging, so use
`gethostname()` because its results are more consistent. This seems to fix the
issue.
Signed-off-by: vsoch <vsoch@users.noreply.github.com>
- [x] add `concretize.lp`, `spack.yaml`, etc. to licensed files
- [x] update all licensed files to say 2013-2021 using
`spack license update-copyright-year`
- [x] appease mypy with some additions to package.py that needed
for oneapi.py
* switch from bool to int debug levels
* Added debug options and changed lock logging to use more detailed values
* Limit installer and timestamp PIDs to standard debug output
* Reduced verbosity of fetch/stage/install output, changing most to debug level 1
* Combine lock log methods; change build process install to debug
* Changed binary cache install messages to extraction messages
Fixes#9394Closes#13217.
## Background
Spack provides the ability to enable/disable parallel builds through two options: package `parallel` and configuration `build_jobs`. This PR changes the algorithm to allow multiple, simultaneous processes to coordinate the installation of the same spec (and specs with overlapping dependencies.).
The `parallel` (boolean) property sets the default for its package though the value can be overridden in the `install` method.
Spack's current parallel builds are limited to build tools supporting `jobs` arguments (e.g., `Makefiles`). The number of jobs actually used is calculated as`min(config:build_jobs, # cores, 16)`, which can be overridden in the package or on the command line (i.e., `spack install -j <# jobs>`).
This PR adds support for distributed (single- and multi-node) parallel builds. The goals of this work include improving the efficiency of installing packages with many dependencies and reducing the repetition associated with concurrent installations of (dependency) packages.
## Approach
### File System Locks
Coordination between concurrent installs of overlapping packages to a Spack instance is accomplished through bottom-up dependency DAG processing and file system locks. The runs can be a combination of interactive and batch processes affecting the same file system. Exclusive prefix locks are required to install a package while shared prefix locks are required to check if the package is installed.
Failures are communicated through a separate exclusive prefix failure lock, for concurrent processes, combined with a persistent store, for separate, related build processes. The resulting file contains the failing spec to facilitate manual debugging.
### Priority Queue
Management of dependency builds changed from reliance on recursion to use of a priority queue where the priority of a spec is based on the number of its remaining uninstalled dependencies.
Using a queue required a change to dependency build exception handling with the most visible issue being that the `install` method *must* install something in the prefix. Consequently, packages can no longer get away with an install method consisting of `pass`, for example.
## Caveats
- This still only parallelizes a single-rooted build. Multi-rooted installs (e.g., for environments) are TBD in a future PR.
Tasks:
- [x] Adjust package lock timeout to correspond to value used in the demo
- [x] Adjust database lock timeout to reduce contention on startup of concurrent
`spack install <spec>` calls
- [x] Replace (test) package's `install: pass` methods with file creation since post-install
`sanity_check_prefix` will otherwise error out with `Install failed .. Nothing was installed!`
- [x] Resolve remaining existing test failures
- [x] Respond to alalazo's initial feedback
- [x] Remove `bin/demo-locks.py`
- [x] Add new tests to address new coverage issues
- [x] Replace built-in package's `def install(..): pass` to "install" something
(i.e., only `apple-libunwind`)
- [x] Increase code coverage
Our `LockTransaction` class was reading overly aggressively. In cases
like this:
```
1 with spack.store.db.read_transaction():
2 with spack.store.db.write_transaction():
3 ...
```
The `ReadTransaction` on line 1 would read in the DB, but the
WriteTransaction on line 2 would read in the DB *again*, even though we
had a read lock the whole time. `WriteTransaction`s were only
considering nested writes to decide when to read, but they didn't know
when we already had a read lock.
- [x] `Lock.acquire_write()` return `False` in cases where we already had
a read lock.
If a write transaction was nested inside a read transaction, it would not
write properly on release, e.g., in a sequence like this, inside our
`LockTransaction` class:
```
1 with spack.store.db.read_transaction():
2 with spack.store.db.write_transaction():
3 ...
4 with spack.store.db.read_transaction():
...
```
The WriteTransaction on line 2 had no way of knowing that its
`__exit__()` call was the last *write* in the nesting, and it would skip
calling its write function.
The `__exit__()` call of the `ReadTransaction` on line 1 wouldn't know
how to write, and the file would never be written.
The DB would be correct in memory, but the `ReadTransaction` on line 4
would re-read the whole DB assuming that other processes may have
modified it. Since the DB was never written, we got stale data.
- [x] Make `Lock.release_write()` return `True` whenever we release the
*last write* in a nest.
Lock transactions were actually writing *after* the lock was
released. The code was looking at the result of `release_write()` before
writing, then writing based on whether the lock was released. This is
pretty obviously wrong.
- [x] Refactor `Lock` so that a release function can be passed to the
`Lock` and called *only* when a lock is really released.
- [x] Refactor `LockTransaction` classes to use the release function
instead of checking the return value of `release_read()` / `release_write()`
- remove the old LGPL license headers from all files in Spack
- add SPDX headers to all files
- core and most packages are (Apache-2.0 OR MIT)
- a very small number of remaining packages are LGPL-2.1-only
Fixes#9166
This is intended to reduce errors related to lock timeouts by making
the following changes:
* Improves error reporting when acquiring a lock fails (addressing
#9166) - there is no longer an attempt to release the lock if an
acquire fails
* By default locks taken on individual packages no longer have a
timeout. This allows multiple spack instances to install overlapping
dependency DAGs. For debugging purposes, a timeout can be added by
setting 'package_lock_timeout' in config.yaml
* Reduces the polling frequency when trying to acquire a lock, to
reduce impact in the case where NFS is overtaxed. A simple
adaptive strategy is implemented, which starts with a polling
interval of .1 seconds and quickly increases to .5 seconds
(originally it would poll up to 10^5 times per second).
A test is added to check the polling interval generation logic.
* The timeout for Spack's whole-database lock (e.g. for managing
information about installed packages) is increased from 60s to
120s
* Users can configure the whole-database lock timeout using the
'db_lock_timout' setting in config.yaml
Generally, Spack locks (those created using spack.llnl.util.lock.Lock)
now have no timeout by default
This does not address implementations of NFS that do not support file
locking, or detect cases where services that may be required
(nfslock/statd) aren't running.
Users may want to be able to more-aggressively release locks when
they know they are the only one using their Spack instance, and they
encounter lock errors after a crash (e.g. a remote terminal disconnect
mentioned in #8915).
- Fixes a bug in `llnl.util.lock`
- Locks in the current directory would fail because the parent directory
was the empty string.
- Fix this and return '.' for the parent of locks in the current
directory.
- Clean up error messages for when a lock can't be created, or when an
exclusive (write) lock can't be taken on a file.
- Add a number of subclasses of LockError to distinguish timeouts from
permission issues.
- Add an explicit check to prevent the user from taking a write lock on a
read-only file.
- We had a check for this for when we try to *upgrade* a lock on an RO
file, but not for an initial write lock attempt.
- Add more tests for different lock permission scenarios.
- write locks previously wrote information about the lock holder (host
and pid), and read locks woudl read this in.
- This is really only for debugging, so only enable it then
- add some tests that target debug info, and improve multiproc lock test
output
- spack.util.lock behaves the same as llnl.util.lock, but Lock._lock and
Lock._unlock do nothing.
- can be disabled with a control variable.
- configuration options can enable/disable locking:
- `locks` option in spack configuration controls whether Spack will use filesystem locks or not.
- `-l` and `-L` command-line options can force-disable or force-enable locking.
- Spack will check for group- and world-writability before disabling
locks, and it will not allow a group- or world-writable instance to
have locks disabled.
- update documentation
- Lock test can be run either as a node-local test or as an MPI test.
- Lock test is now parametrized by filesystem, so you can test the
locking capabilities of your NFS, Lustre, or GPFS filesystem. See docs
for details.
- Closing and re-opening to upgrade to write will lose all existing read
locks on this process.
- If we didn't allow ranges, sleeping until no reads would work.
- With ranges, we may never be able to take some legal write locks
without invalidating all reads. e.g., if a write lock has distinct
range from all reads, it should just work, but we'd have to close the
file, reopen, and re-take reads.
- It's easier to just check whether the file is writable in the first
place and open for writing from the start.
- Lock now only opens files read-only if we *can't* write them.
Major stuff:
- Created a FileCache for managing user cache files in Spack. Currently just
handles virtuals.
- Moved virtual cache from the repository to the home directory so that users do
not need write access to Spack repositories to use them.
- Refactored `Transaction` class in `database.py` -- moved it to
`LockTransaction` in `lock.py` and made it reusable by other classes.
Other additions:
- Added tests for file cache and transactions.
- Added a few more tests for database
- Fixed bug in DB where writes could happen even if exceptions were raised
during a transaction.
- `spack uninstall` now attempts to repair the database when it discovers that a
prefix doesn't exist but a DB record does.