spack/lib/spack
Harmen Stoppels 0c0831861c
Avoid quadratic complexity in log parser (#26568)
TL;DR: there are matching groups trying to match 1 or more occurrences of
something. We don't use the matching group. Therefore it's sufficient to test
for 1 occurrence. This reduce quadratic complexity to linear time.

---

When parsing logs of an mpich build, I'm getting a 4 minute (!!) wait
with 16 threads for regexes to run:

```
In [1]: %time p.parse("mpich.log")
Wall time: 4min 14s
```

That's really unacceptably slow... 

After some digging, it seems a few regexes tend to have `O(n^2)` scaling
where `n` is the string / log line length. I don't think they *necessarily*
should scale like that, but it seems that way. The common pattern is this

```
([^:]+): error
```

which matches `: error` literally, and then one or more non-colons before that. So
for a log line like this:

```
abcdefghijklmnopqrstuvwxyz: error etc etc
```

Any of these are potential group matches when using `search` in Python:

```
abcdefghijklmnopqrstuvwxyz
 bcdefghijklmnopqrstuvwxyz
  cdefghijklmnopqrstuvwxyz
                         ⋮
                        yz
                         z
```

but clearly the capture group should return the longest match.

My hypothesis is that Python has a very bad implementation of `search`
that somehow considers all of these, even though it can be implemented
in linear time by scanning for `: error` first, and then greedily expanding
the longest possible `[^:]+` match to the left. If Python indeed considers
all possible matches, then with `n` matches of length `1 .. n` you
see the `O(n^2)` slowness (i verified this by replacing + with {1,k}
and doubling `k`, it doubles the execution time indeed).

This PR fixes this by removing the `+`, so effectively changing the 
O(n^2) into a O(n) worst case.

The reason we are fine with dropping `+` is that we don't use the
capture group anywhere, so, we just ensure `:: error` is not a match
but `x: error` is.

After going from O(n^2) to O(n), the 15MB mpich build log is parsed
in `1.288s`, so about 200x faster.

Just to be sure I've also updated `^CMake Error.*:` to `^CMake Error`,
so that it does not match with all the possible `:`'s in the line.
Another option is to use `.*?` there to make it quit scanning as soon as
possible, but what line that starts with `CMake Error` that does not have
a colon is really a false positive...
2021-10-12 00:05:11 -07:00
..
docs installer: Support showing status information in terminal title (#16259) 2021-10-11 17:54:59 +02:00
env cc: make error messages more clear 2021-10-04 18:30:19 -07:00
external Avoid quadratic complexity in log parser (#26568) 2021-10-12 00:05:11 -07:00
llnl installer: Support showing status information in terminal title (#16259) 2021-10-11 17:54:59 +02:00
spack installer: Support showing status information in terminal title (#16259) 2021-10-11 17:54:59 +02:00