# -----------------------------------------------------------------------------
# CMake - Cross Platform Makefile Generator
# Copyright 2000-2017 Kitware, Inc. and Contributors
# All rights reserved.
#
# Redistribution and use in source and binary forms, with or without
# modification, are permitted provided that the following conditions
# are met:
#
# * Redistributions of source code must retain the above copyright
#   notice, this list of conditions and the following disclaimer.
#
# * Redistributions in binary form must reproduce the above copyright
#   notice, this list of conditions and the following disclaimer in the
#   documentation and/or other materials provided with the distribution.
#
# * Neither the name of Kitware, Inc. nor the names of Contributors
#   may be used to endorse or promote products derived from this
#   software without specific prior written permission.
#
# THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS
# "AS IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT
# LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR
# A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT
# HOLDER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL,
# SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT
# LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE,
# DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY
# THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT
# (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE
# OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
#
# -----------------------------------------------------------------------------
#
# The above copyright and license notice applies to distributions of
# CMake in source and binary form. Third-party software packages supplied
# with CMake under compatible licenses provide their own copyright notices
# documented in corresponding subdirectories or source files.
#
# -----------------------------------------------------------------------------
#
# CMake was initially developed by Kitware with the following sponsorship:
#
#  * National Library of Medicine at the National Institutes of Health
#    as part of the Insight Segmentation and Registration Toolkit (ITK).
#
#  * US National Labs (Los Alamos, Livermore, Sandia) ASC Parallel
#    Visualization Initiative.
#
#  * National Alliance for Medical Image Computing (NAMIC) is funded by the
#    National Institutes of Health through the NIH Roadmap for Medical
#    Research, Grant U54 EB005149.
#
#  * Kitware, Inc.
# -----------------------------------------------------------------------------
"""Functions to parse build logs and extract error messages.
|
|
|
|
|
|
|
|
This is a python port of the regular expressions CTest uses to parse log
|
|
|
|
files here:
|
|
|
|
|
|
|
|
https://github.com/Kitware/CMake/blob/master/Source/CTest/cmCTestBuildHandler.cxx
|
|
|
|
|
|
|
|
This file takes the regexes verbatim from there and adds some parsing
|
|
|
|
algorithms that duplicate the way CTest scrapes log files. To keep this
|
|
|
|
up to date with CTest, just make sure the ``*_matches`` and
|
|
|
|
``*_exceptions`` lists are kept up to date with CTest's build handler.
|
|
|
|
"""

from __future__ import print_function
from __future__ import division

import io
import math
import multiprocessing
import re
import sys
import threading
import time

from contextlib import contextmanager

_error_matches = [
    "^FAIL: ",
    "^FATAL: ",
    "^failed ",
    "FAILED",
    "Failed test",
    "^[Bb]us [Ee]rror",
    "^[Ss]egmentation [Vv]iolation",
    "^[Ss]egmentation [Ff]ault",
    ":.*[Pp]ermission [Dd]enied",
    "[^ :]:[0-9]+: [^ \\t]",
    "[^:]: error[ \\t]*[0-9]+[ \\t]*:",
    "^Error ([0-9]+):",
    "^Fatal",
    "^[Ee]rror: ",
    "^Error ",
    "[0-9] ERROR: ",
    "^\"[^\"]+\", line [0-9]+: [^Ww]",
    "^cc[^C]*CC: ERROR File = ([^,]+), Line = ([0-9]+)",
    "^ld([^:])*:([ \\t])*ERROR([^:])*:",
    "^ild:([ \\t])*\\(undefined symbol\\)",
    "[^ :] : (error|fatal error|catastrophic error)",
    "[^:]: (Error:|error|undefined reference|multiply defined)",
    "[^:]\\([^\\)]+\\) ?: (error|fatal error|catastrophic error)",
    "^fatal error C[0-9]+:",
    ": syntax error ",
    "^collect2: ld returned 1 exit status",
    "ld terminated with signal",
    "Unsatisfied symbol",
    "^Unresolved:",
    "Undefined symbol",
    "^Undefined[ \\t]+first referenced",
    "^CMake Error",
    ":[ \\t]cannot find",
    ":[ \\t]can't find",
    ": \\*\\*\\* No rule to make target [`'].*\\'. Stop",
    ": \\*\\*\\* No targets specified and no makefile found",
    ": Invalid loader fixup for symbol",
    ": Invalid fixups exist",
    ": Can't find library for",
    ": internal link edit command failed",
    ": Unrecognized option [`'].*\\'",
    "\", line [0-9]+\\.[0-9]+: [0-9]+-[0-9]+ \\([^WI]\\)",
    "ld: 0706-006 Cannot find or open library file: -l ",
    "ild: \\(argument error\\) can't find library argument ::",
    "^could not be found and will not be loaded.",
    "^WARNING: '.*' is missing on your system",
    "s:616 string too big",
    "make: Fatal error: ",
    "ld: 0711-993 Error occurred while writing to the output file:",
    "ld: fatal: ",
    "final link failed:",
    "make: \\*\\*\\*.*Error",
    "make\\[.*\\]: \\*\\*\\*.*Error",
    "\\*\\*\\* Error code",
    "nternal error:",
    "Makefile:[0-9]+: \\*\\*\\* .* Stop\\.",
    ": No such file or directory",
    ": Invalid argument",
    "^The project cannot be built\\.",
    "^\\[ERROR\\]",
    "^Command .* failed with exit code",
]


_error_exceptions = [
    "instantiated from ",
    "candidates are:",
    ": warning",
    ": WARNING",
    ": \\(Warning\\)",
    ": note",
    " ok",
    "Note:",
    ":[ \\t]+Where:",
    "[^ :]:[0-9]+: Warning",
    "------ Build started: .* ------",
]
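
# How the match/exception pairs interact (illustration only): a line counts
# as an error when it matches one of the ``_error_matches`` regexes and none
# of the ``_error_exceptions`` (see ``_match`` at the bottom of this file).
# For example:
#
#     line = "foo.c:12: warning: unused variable 'x'"
#     re.search("[^ :]:[0-9]+: [^ \\t]", line)   # matches; looks like an error
#     re.search(": warning", line)               # also matches; the exception
#                                                # prevents a false positive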


#: Regexes to match lines with warnings in them
_warning_matches = [
    "[^ :]:[0-9]+: warning:",
    "[^ :]:[0-9]+: note:",
    "^cc[^C]*CC: WARNING File = ([^,]+), Line = ([0-9]+)",
    "^ld([^:])*:([ \\t])*WARNING([^:])*:",
    "[^:]: warning [0-9]+:",
    "^\"[^\"]+\", line [0-9]+: [Ww](arning|arnung)",
    "[^:]: warning[ \\t]*[0-9]+[ \\t]*:",
    "^(Warning|Warnung) ([0-9]+):",
    "^(Warning|Warnung)[ :]",
    "WARNING: ",
    "[^ :] : warning",
    "[^:]: warning",
    "\", line [0-9]+\\.[0-9]+: [0-9]+-[0-9]+ \\([WI]\\)",
    "^cxx: Warning:",
    "file: .* has no symbols",
    "[^ :]:[0-9]+: (Warning|Warnung)",
    "\\([0-9]*\\): remark #[0-9]*",
    "\".*\", line [0-9]+: remark\\([0-9]*\\):",
    "cc-[0-9]* CC: REMARK File = .*, Line = [0-9]*",
    "^CMake Warning",
    "^\\[WARNING\\]",
]


#: Regexes that filter out false positives in the warning matches
_warning_exceptions = [
    "/usr/.*/X11/Xlib\\.h:[0-9]+: war.*: ANSI C\\+\\+ forbids declaration",
    "/usr/.*/X11/Xutil\\.h:[0-9]+: war.*: ANSI C\\+\\+ forbids declaration",
    "/usr/.*/X11/XResource\\.h:[0-9]+: war.*: ANSI C\\+\\+ forbids declaration",
    "WARNING 84 :",
    "WARNING 47 :",
    "warning: Clock skew detected. Your build may be incomplete.",
    "/usr/openwin/include/GL/[^:]+:",
    "bind_at_load",
    "XrmQGetResource",
    "IceFlush",
    "warning LNK4089: all references to [^ \\t]+ discarded by .OPT:REF",
    "ld32: WARNING 85: definition of dataKey in",
    "cc: warning 422: Unknown option \"\\+b",
    "_with_warning_C",
]


#: Regexes to match file/line numbers in error/warning messages
_file_line_matches = [
    "^Warning W[0-9]+ ([a-zA-Z.\\:/0-9_+ ~-]+) ([0-9]+):",
    "^([a-zA-Z./0-9_+ ~-]+):([0-9]+):",
    "^([a-zA-Z.\\:/0-9_+ ~-]+)\\(([0-9]+)\\)",
    "^[0-9]+>([a-zA-Z.\\:/0-9_+ ~-]+)\\(([0-9]+)\\)",
    "^([a-zA-Z./0-9_+ ~-]+)\\(([0-9]+)\\)",
    "\"([a-zA-Z./0-9_+ ~-]+)\", line ([0-9]+)",
    "File = ([a-zA-Z./0-9_+ ~-]+), Line = ([0-9]+)",
]
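
# Illustration (comment only, not executed): each pattern captures the file
# and the line number as two groups. The GCC-style pattern above, for example:
#
#     m = re.search(_file_line_matches[1], "src/main.c:42: error: ...")
#     m.groups()   # -> ("src/main.c", "42")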


class LogEvent(object):
    """Class representing interesting events (e.g., errors) in a build log."""

    def __init__(self, text, line_no,
                 source_file=None, source_line_no=None,
                 pre_context=None, post_context=None):
        self.text = text
        self.line_no = line_no
        self.source_file = source_file
        self.source_line_no = source_line_no
        self.pre_context = pre_context if pre_context is not None else []
        self.post_context = post_context if post_context is not None else []
        self.repeat_count = 0

    @property
    def start(self):
        """First line in the log with text for the event or its context."""
        return self.line_no - len(self.pre_context)

    @property
    def end(self):
        """One past the last line with text for the event or its context."""
        return self.line_no + len(self.post_context) + 1

    def __getitem__(self, line_no):
        """Index event text and context by actual line number in file."""
        if line_no == self.line_no:
            return self.text
        elif line_no < self.line_no:
            # lines before the event live in pre_context; the offset is
            # negative, so this indexes back from the end of the list
            return self.pre_context[line_no - self.line_no]
        elif line_no > self.line_no:
            # lines after the event live in post_context, zero-based
            return self.post_context[line_no - self.line_no - 1]
|
|
|
|
def __str__(self):
|
|
|
|
"""Returns event lines and context."""
|
2022-11-15 17:07:54 +08:00
|
|
|
out = io.StringIO()
|
2017-10-01 13:39:21 +08:00
|
|
|
for i in range(self.start, self.end):
|
|
|
|
if i == self.line_no:
|
|
|
|
out.write(' >> %-6d%s' % (i, self[i]))
|
|
|
|
else:
|
|
|
|
out.write(' %-6d%s' % (i, self[i]))
|
|
|
|
return out.getvalue()
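
# Illustrative sketch (hypothetical values): events index by absolute log
# line number, and ``end`` is exclusive:
#
#     event = LogEvent("fatal error: foo", 10,
#                      pre_context=["making foo"], post_context=["stop."])
#     event.start, event.end   # -> (9, 12)
#     event[9], event[10]      # -> ('making foo', 'fatal error: foo')
#     print(event)             # marks line 10 with '>>'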


class BuildError(LogEvent):
    """LogEvent subclass for build errors."""


class BuildWarning(LogEvent):
    """LogEvent subclass for build warnings."""


def chunks(l, n):
    """Divide l into n approximately-even chunks."""
    chunksize = int(math.ceil(len(l) / n))
    return [l[i:i + chunksize] for i in range(0, len(l), chunksize)]
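
# Illustrative: the chunk size is rounded up, so the last chunk may be
# short, e.g.
#
#     chunks(list(range(10)), 3)   # -> [[0, 1, 2, 3], [4, 5, 6, 7], [8, 9]]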


@contextmanager
def _time(times, i):
    """Accumulate the elapsed time of the managed block into ``times[i]``."""
    start = time.time()
    yield
    end = time.time()
    times[i] += end - start
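
# Illustrative: each timed block charges its wall-clock duration to one
# slot of a caller-owned list (``do_work`` is hypothetical):
#
#     times = [0.0, 0.0]
#     with _time(times, 1):
#         do_work()
#     # times[1] increased by the seconds spent in do_work()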


def _match(matches, exceptions, line):
    """True if line matches a regex in matches and none in exceptions."""
    return (any(m.search(line) for m in matches) and
            not any(e.search(line) for e in exceptions))
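
# Illustrative: a line is an event only if some "match" regex hits and no
# "exception" regex does (the patterns here are made up):
#
#     ms = [re.compile(r'error')]
#     es = [re.compile(r'instantiated from')]
#     _match(ms, es, 'fatal error: oops')           # -> True
#     _match(ms, es, 'error: instantiated from x')  # -> False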


def _profile_match(matches, exceptions, line, match_times, exc_times):
    """Profiled version of ``_match()``.

    Timing is expensive so we have two whole functions.  This is much
    longer because we have to break up the ``any()`` calls.

    """
    # for/else: the else branch runs only if no pattern matched (no break)
    for i, m in enumerate(matches):
        with _time(match_times, i):
            if m.search(line):
                break
    else:
        return False

    for i, m in enumerate(exceptions):
        with _time(exc_times, i):
            if m.search(line):
                return False

    return True


def _parse(lines, offset, profile):
    def compile(regex_array):
        return [re.compile(regex) for regex in regex_array]

    error_matches = compile(_error_matches)
    error_exceptions = compile(_error_exceptions)
    warning_matches = compile(_warning_matches)
    warning_exceptions = compile(_warning_exceptions)
    file_line_matches = compile(_file_line_matches)

    matcher = _match
    timings = []
    if profile:
        matcher = _profile_match
        timings = [
            [0.0] * len(error_matches), [0.0] * len(error_exceptions),
            [0.0] * len(warning_matches), [0.0] * len(warning_exceptions)]

    errors = []
    warnings = []
    for i, line in enumerate(lines):
        # use CTest's regular expressions to scrape the log for events
        if matcher(error_matches, error_exceptions, line, *timings[:2]):
            event = BuildError(line.strip(), offset + i + 1)
            errors.append(event)
        elif matcher(warning_matches, warning_exceptions, line, *timings[2:]):
            event = BuildWarning(line.strip(), offset + i + 1)
            warnings.append(event)
        else:
            continue

        # get file/line number for each event, if possible
        for flm in file_line_matches:
            match = flm.search(line)
            if match:
                event.source_file, event.source_line_no = match.groups()

    return errors, warnings, timings


def _parse_unpack(args):
    """Unpack a tuple of arguments; ``Pool.map`` passes a single argument."""
    return _parse(*args)


class CTestLogParser(object):
    """Log file parser that extracts errors and warnings."""
    def __init__(self, profile=False):
        # whether to record timing information
        self.timings = []
        self.profile = profile

    def print_timings(self):
        """Print out profile of time spent in different regular expressions."""
        def stringify(elt):
            return elt if isinstance(elt, str) else elt.pattern

        index = 0
        for name, arr in [('error_matches', _error_matches),
                          ('error_exceptions', _error_exceptions),
                          ('warning_matches', _warning_matches),
                          ('warning_exceptions', _warning_exceptions)]:
            print()
            print(name)
            for i, elt in enumerate(arr):
                # times are stored in seconds; display in microseconds
                print("%16.2f %s" % (
                    self.timings[index][i] * 1e6, stringify(elt)))
            index += 1

    def parse(self, stream, context=6, jobs=None):
        """Parse a log file by searching each line for errors and warnings.

        Args:
            stream (str or file-like): filename or stream to read from
            context (int): lines of context to extract around each log event
            jobs (int): number of worker processes to parse with; defaults
                to the machine's CPU count

        Returns:
            (tuple): two lists containing ``BuildError`` and
                ``BuildWarning`` objects.
        """
        if isinstance(stream, str):
            with open(stream) as f:
                return self.parse(f, context, jobs)

        lines = [line for line in stream]

        if jobs is None:
            jobs = multiprocessing.cpu_count()

        # single-thread small logs
        if len(lines) < 10 * jobs:
            errors, warnings, self.timings = _parse(lines, 0, self.profile)

        else:
            # Build arguments for parallel jobs
            args = []
            offset = 0
            for chunk in chunks(lines, jobs):
                args.append((chunk, offset, self.profile))
                offset += len(chunk)

            # create a pool and farm out the matching job
            pool = multiprocessing.Pool(jobs)
            try:
                # this is a workaround for a Python bug in Pool with ctrl-C
                if sys.version_info >= (3, 2):
                    max_timeout = threading.TIMEOUT_MAX
                else:
                    max_timeout = 9999999
                results = pool.map_async(_parse_unpack, args, 1).get(max_timeout)
                errors, warnings, timings = zip(*results)
            finally:
                pool.terminate()

            # merge results
            errors = sum(errors, [])
            warnings = sum(warnings, [])

            if self.profile:
                self.timings = [
                    [sum(i) for i in zip(*t)] for t in zip(*timings)]

        # add log context to all events
        for event in (errors + warnings):
            i = event.line_no - 1
            # clamp the start of the slice so a negative index can't wrap
            event.pre_context = [
                l.rstrip() for l in lines[max(0, i - context):i]]
            event.post_context = [
                l.rstrip() for l in lines[i + 1:i + context + 1]]

        return errors, warnings
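
# Illustrative end-to-end usage (hypothetical log path):
#
#     parser = CTestLogParser(profile=True)
#     errors, warnings = parser.parse('spack-build.out', context=4, jobs=2)
#     for err in errors:
#         print(err)             # event line with '>>' marker plus context
#     parser.print_timings()     # per-regex time spent, in microseconds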