Improve log parsing performance (#7093)

* Allow dashes in command names and fix command name handling

- Commands should allow dashes in their names like the rest of spack,
  e.g. `spack log-parse`
  - It might be too late for `spack build-cache` (since it is already
    called `spack buildcache`), but we should try to avoid such
    inconsistencies in naming conventions

- The code was inconsistent about where commands should be referred to
  by their python module name (e.g. `log_parse`) and where the actual
  command name should be used (e.g. `log-parse`).

- This made it hard to make a command with a dash in the name, and it
  made `SpackCommand` fail to recognize commands with dashes.

- The code now uses the user-facing name with dashes for function
  parameters, then converts it to the module name when needed (see the
  sketch below).
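
A minimal sketch of the convention (helper names here are illustrative,
not necessarily the actual spack API):

    # Hypothetical helpers: user-facing command names use dashes,
    # python module names use underscores.
    def python_name(cmd_name):
        """Convert a command name like 'log-parse' to 'log_parse'."""
        return cmd_name.replace('-', '_')

    def cmd_name(python_name):
        """Convert a module name like 'log_parse' to 'log-parse'."""
        return python_name.replace('_', '-')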

* Improve performance of log parsing

- A number of regular expressions from ctest_log_parser have really poor
  performance, mostly due to untethered expressions with * or + (i.e.,
  they don't start with ^, so the repetition has to be checked at every
  position in the string by Python's backtracking regex implementation);
  see the sketch after this list.

- I can't verify that CTest's regexes work with an added ^, so I don't
  really want to touch them.  When I tried adding it, some tests broke.

- Instead of relying only on "efficient" regular expressions, this adds
  a prefilter() class that allows the parser to quickly check a cheap
  precondition before evaluating any of the expensive regexes.

- Preconditions do things like check whether the string contains "error"
  or "warning" (linear-time substring checks) before evaluating regexes
  that would require them.  It's sad that Python doesn't use Thompson
  NFA matching (see https://swtch.com/~rsc/regexp/regexp1.html)

- Even with Python's slow implementation, this makes the parser ~200x
  faster on the input we tried it on.
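
To illustrate the idea (patterns and names here are illustrative, not
taken verbatim from the change):

    import re

    # Untethered pattern: with no leading '^', Python's backtracking
    # engine retries the match at every starting position in the line.
    expensive = re.compile(r"([^:]+): (error|undefined reference)")

    def search_with_prefilter(line):
        # Linear-time substring tests run first; the expensive regex
        # only runs on the few lines that could possibly match.
        if 'error' in line or 'undefined reference' in line:
            return expensive.search(line)
        return None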

* Add `spack log-parse` command and improve the display of parsed logs

- Add better coloring and line wrapping to the log parse output.  This
  makes nasty build output easier to read, with line numbers alongside
  each event.

- `spack log-parse` allows the log parsing logic used at the end of
  builds to be executed on arbitrary files, which is handy even outside
  of spack.

- Also provides a profile option -- we can profile arbitrary files and
  show which regular expressions in the magic CTest parser take the most
  time (example invocation below).
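
For example (the flag spelling is inferred from the description above,
so treat it as approximate):

    spack log-parse --profile /path/to/build.log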

* Parallelize log parsing

- Log parsing now uses multiple processes for long logs
- Lines from logs are divided into chunks and farmed out to <ncpus>
  worker processes (see the sketch below)
- Add -j option to `spack log-parse`
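
Conceptually, the parallel path looks like this (a simplified sketch;
the real implementation is in the diff below):

    import multiprocessing

    def parse_in_parallel(lines, jobs, parse_chunk):
        # One chunk per job; track each chunk's offset so reported
        # line numbers stay correct in the merged results.
        chunksize = max(1, (len(lines) + jobs - 1) // jobs)
        args = [(lines[i:i + chunksize], i)
                for i in range(0, len(lines), chunksize)]
        pool = multiprocessing.Pool(jobs)
        try:
            return pool.map(parse_chunk, args)
        finally:
            pool.terminate()
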
Author: Todd Gamblin
Date: 2018-01-31 21:57:56 -08:00
Committed by: becker33
Parent: 514f0bf5c5
Commit: df758e1cfc

6 changed files with 403 additions and 91 deletions


@@ -65,18 +65,66 @@
up to date with CTest, just make sure the ``*_matches`` and
``*_exceptions`` lists are kept up to date with CTest's build handler.
"""
from __future__ import print_function
from __future__ import division
import re
import math
import multiprocessing
import time
from contextlib import contextmanager
from six import StringIO
from six import string_types
class prefilter(object):
    """Make regular expressions faster with a simple prefiltering predicate.

    Some regular expressions seem to be much more costly than others.  In
    most cases, we can evaluate a simple precondition, e.g.::

        lambda x: "error" in x

    to avoid evaluating expensive regexes on all lines in a file.  This
    can reduce parse time for large files by orders of magnitude when
    evaluating lots of expressions.

    A ``prefilter`` object is designed to act like a regex, but
    ``search`` and ``match`` check the precondition before bothering to
    evaluate the regular expression.

    Note that ``match`` and ``search`` just return ``True`` and ``False``
    at the moment.  Make them return a ``MatchObject`` or ``None`` if it
    becomes necessary.
    """
    def __init__(self, precondition, *patterns):
        self.patterns = [re.compile(p) for p in patterns]
        self.pre = precondition
        self.pattern = "\n ".join(('MERGED:',) + patterns)

    def search(self, text):
        return self.pre(text) and any(p.search(text) for p in self.patterns)

    def match(self, text):
        return self.pre(text) and any(p.match(text) for p in self.patterns)
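# Illustrative usage: a prefilter acts like a compiled regex, but only
# evaluates its patterns when the cheap precondition holds, e.g.:
#
#   pf = prefilter(lambda s: 'error' in s, r"([^:]+): error")
#   pf.search("foo.c: error 42")    # True
#   pf.search("clean build line")   # False; the regex never runs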
error_matches = [
_error_matches = [
prefilter(
lambda x: any(s in x for s in (
'Error:', 'error', 'undefined reference', 'multiply defined')),
"([^:]+): error[ \\t]*[0-9]+[ \\t]*:",
"([^:]+): (Error:|error|undefined reference|multiply defined)",
"([^ :]+) : (error|fatal error|catastrophic error)",
"([^:]+)\\(([^\\)]+)\\) ?: (error|fatal error|catastrophic error)"),
prefilter(
lambda s: s.count(':') >= 2,
"[^ :]+:[0-9]+: [^ \\t]"),
"^[Bb]us [Ee]rror",
"^[Ss]egmentation [Vv]iolation",
"^[Ss]egmentation [Ff]ault",
":.*[Pp]ermission [Dd]enied",
"([^ :]+):([0-9]+): ([^ \\t])",
"([^:]+): error[ \\t]*[0-9]+[ \\t]*:",
"^Error ([0-9]+):",
"^Fatal",
"^Error: ",
@@ -86,9 +134,6 @@
"^cc[^C]*CC: ERROR File = ([^,]+), Line = ([0-9]+)",
"^ld([^:])*:([ \\t])*ERROR([^:])*:",
"^ild:([ \\t])*\\(undefined symbol\\)",
"([^ :]+) : (error|fatal error|catastrophic error)",
"([^:]+): (Error:|error|undefined reference|multiply defined)",
"([^:]+)\\(([^\\)]+)\\) ?: (error|fatal error|catastrophic error)",
"^fatal error C[0-9]+:",
": syntax error ",
"^collect2: ld returned 1 exit status",
@@ -128,7 +173,7 @@
"^Command .* failed with exit code",
]
error_exceptions = [
_error_exceptions = [
"instantiated from ",
"candidates are:",
": warning",
@@ -143,32 +188,38 @@
]
#: Regexes to match file/line numbers in error/warning messages
warning_matches = [
"([^ :]+):([0-9]+): warning:",
"([^ :]+):([0-9]+): note:",
_warning_matches = [
prefilter(
lambda x: 'warning' in x,
"([^ :]+):([0-9]+): warning:",
"([^:]+): warning ([0-9]+):",
"([^:]+): warning[ \\t]*[0-9]+[ \\t]*:",
"([^ :]+) : warning",
"([^:]+): warning"),
prefilter(
lambda x: 'note:' in x,
"^([^ :]+):([0-9]+): note:"),
prefilter(
lambda x: any(s in x for s in ('Warning', 'Warnung')),
"^(Warning|Warnung) ([0-9]+):",
"^(Warning|Warnung)[ :]",
"^cxx: Warning:",
"([^ :]+):([0-9]+): (Warning|Warnung)",
"^CMake Warning.*:"),
"file: .* has no symbols",
"^cc[^C]*CC: WARNING File = ([^,]+), Line = ([0-9]+)",
"^ld([^:])*:([ \\t])*WARNING([^:])*:",
"([^:]+): warning ([0-9]+):",
"^\"[^\"]+\", line [0-9]+: [Ww](arning|arnung)",
"([^:]+): warning[ \\t]*[0-9]+[ \\t]*:",
"^(Warning|Warnung) ([0-9]+):",
"^(Warning|Warnung)[ :]",
"WARNING: ",
"([^ :]+) : warning",
"([^:]+): warning",
"\", line [0-9]+\\.[0-9]+: [0-9]+-[0-9]+ \\([WI]\\)",
"^cxx: Warning:",
".*file: .* has no symbols",
"([^ :]+):([0-9]+): (Warning|Warnung)",
"\\([0-9]*\\): remark #[0-9]*",
"\".*\", line [0-9]+: remark\\([0-9]*\\):",
"cc-[0-9]* CC: REMARK File = .*, Line = [0-9]*",
"^CMake Warning.*:",
"^\\[WARNING\\]",
]
#: Regexes to match file/line numbers in error/warning messages
warning_exceptions = [
_warning_exceptions = [
"/usr/.*/X11/Xlib\\.h:[0-9]+: war.*: ANSI C\\+\\+ forbids declaration",
"/usr/.*/X11/Xutil\\.h:[0-9]+: war.*: ANSI C\\+\\+ forbids declaration",
"/usr/.*/X11/XResource\\.h:[0-9]+: war.*: ANSI C\\+\\+ forbids declaration",
@@ -188,7 +239,7 @@
]
#: Regexes to match file/line numbers in error/warning messages
file_line_matches = [
_file_line_matches = [
"^Warning W[0-9]+ ([a-zA-Z.\\:/0-9_+ ~-]+) ([0-9]+):",
"^([a-zA-Z./0-9_+ ~-]+):([0-9]+):",
"^([a-zA-Z.\\:/0-9_+ ~-]+)\\(([0-9]+)\\)",
@@ -250,25 +301,120 @@ class BuildWarning(LogEvent):
"""LogEvent subclass for build warnings."""
def chunks(l, n):
"""Divide l into n approximately-even chunks."""
chunksize = int(math.ceil(len(l) / n))
return [l[i:i + chunksize] for i in range(0, len(l), chunksize)]
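# e.g., chunks(list(range(10)), 3) == [[0, 1, 2, 3], [4, 5, 6, 7], [8, 9]]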
@contextmanager
def _time(times, i):
start = time.time()
yield
end = time.time()
times[i] += end - start
def _match(matches, exceptions, line):
"""True if line matches a regex in matches and none in exceptions."""
return (any(m.search(line) for m in matches) and
not any(e.search(line) for e in exceptions))
def _profile_match(matches, exceptions, line, match_times, exc_times):
"""Profiled version of match().
Timing is expensive so we have two whole functions. This is much
longer because we have to break up the ``any()`` calls.
"""
for i, m in enumerate(matches):
with _time(match_times, i):
if m.search(line):
break
else:
return False
for i, m in enumerate(exceptions):
with _time(exc_times, i):
if m.search(line):
return False
else:
return True
def _parse(lines, offset, profile):
def compile(regex_array):
return [regex if isinstance(regex, prefilter) else re.compile(regex)
for regex in regex_array]
error_matches = compile(_error_matches)
error_exceptions = compile(_error_exceptions)
warning_matches = compile(_warning_matches)
warning_exceptions = compile(_warning_exceptions)
file_line_matches = compile(_file_line_matches)
matcher, args = _match, []
timings = []
if profile:
matcher = _profile_match
timings = [
[0.0] * len(error_matches), [0.0] * len(error_exceptions),
[0.0] * len(warning_matches), [0.0] * len(warning_exceptions)]
errors = []
warnings = []
for i, line in enumerate(lines):
# use CTest's regular expressions to scrape the log for events
if matcher(error_matches, error_exceptions, line, *timings[:2]):
event = BuildError(line.strip(), offset + i + 1)
errors.append(event)
elif matcher(warning_matches, warning_exceptions, line, *timings[2:]):
event = BuildWarning(line.strip(), offset + i + 1)
warnings.append(event)
else:
continue
# get file/line number for each event, if possible
for flm in file_line_matches:
match = flm.search(line)
if match:
event.source_file, source_line_no = match.groups()
return errors, warnings, timings
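# Pool.map() passes each worker a single argument, so this wrapper
# unpacks the (lines, offset, profile) tuples built in parse() below.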
def _parse_unpack(args):
return _parse(*args)
class CTestLogParser(object):
"""Log file parser that extracts errors and warnings."""
def __init__(self):
def compile(regex_array):
return [re.compile(regex) for regex in regex_array]
def __init__(self, profile=False):
# whether to record timing information
self.timings = []
self.profile = profile
self.error_matches = compile(error_matches)
self.error_exceptions = compile(error_exceptions)
self.warning_matches = compile(warning_matches)
self.warning_exceptions = compile(warning_exceptions)
self.file_line_matches = compile(file_line_matches)
def print_timings(self):
"""Print out profile of time spent in different regular expressions."""
def stringify(elt):
return elt if isinstance(elt, str) else elt.pattern
def parse(self, stream, context=6):
index = 0
for name, arr in [('error_matches', _error_matches),
('error_exceptions', _error_exceptions),
('warning_matches', _warning_matches),
('warning_exceptions', _warning_exceptions)]:
print()
print(name)
for i, elt in enumerate(arr):
print("%16.2f %s" % (
self.timings[index][i] * 1e6, stringify(elt)))
index += 1
def parse(self, stream, context=6, jobs=None):
"""Parse a log file by searching each line for errors and warnings.
Args:
@@ -281,30 +427,45 @@ def parse(self, stream, context=6):
"""
if isinstance(stream, string_types):
with open(stream) as f:
return self.parse(f)
return self.parse(f, context, jobs)
lines = [line for line in stream]
errors = []
warnings = []
for i, line in enumerate(lines):
# use CTest's regular expressions to scrape the log for events
if _match(self.error_matches, self.error_exceptions, line):
event = BuildError(line.strip(), i + 1)
errors.append(event)
elif _match(self.warning_matches, self.warning_exceptions, line):
event = BuildWarning(line.strip(), i + 1)
warnings.append(event)
else:
continue
if jobs is None:
jobs = multiprocessing.cpu_count()
# get file/line number for each event, if possible
for flm in self.file_line_matches:
match = flm.search(line)
if match:
event.source_file, source_line_no = match.groups()
# single-thread small logs
if len(lines) < 10 * jobs:
errors, warnings, self.timings = _parse(lines, 0, self.profile)
# add log context, as well
else:
# Build arguments for parallel jobs
args = []
offset = 0
for chunk in chunks(lines, jobs):
args.append((chunk, offset, self.profile))
offset += len(chunk)
# create a pool and farm out the matching job
pool = multiprocessing.Pool(jobs)
try:
# this is a workaround for a Python bug in Pool with ctrl-C
results = pool.map_async(_parse_unpack, args, 1).get(9999999)
errors, warnings, timings = zip(*results)
finally:
pool.terminate()
# merge results
errors = sum(errors, [])
warnings = sum(warnings, [])
if self.profile:
self.timings = [
[sum(i) for i in zip(*t)] for t in zip(*timings)]
# add log context to all events
for event in (errors + warnings):
i = event.line_no - 1
event.pre_context = [
l.rstrip() for l in lines[i - context:i]]
event.post_context = [