node.follow_system_log may not find a really existing log line #8720

Open

vponomaryov opened this issue Sep 16, 2024 · 4 comments

@vponomaryov (Contributor) commented Sep 16, 2024

Issue description

  • This issue is a regression.
  • It is unknown if this issue is a regression.

When running the disrupt_load_and_stream nemesis, we expect start and end log messages to appear for the load-and-stream operation.

In this case, the expected log lines were correctly found in the logs of the first two nodes used:

node-5:

2024-09-14 03:34:08,395 f:load_utils.py   l:116  c:utils                p:INFO  > Running load and stream on the node longevity-large-partitions-200k-pks-db-node-f25dd9b8-0-5 for keyspace1.standard1'
...
2024-09-14T03:34:19.865+00:00 longevity-large-partitions-200k-pks-db-node-f25dd9b8-0-5     !INFO | scylla[5477]:  [shard  2:strm] sstables_loader - Done loading new SSTables for keyspace=keyspace1, table=standard1, load_and_stream=true, primary_replica_only=false, status=succeeded
...
2024-09-14 03:34:19,942 f:wait.py         l:161  c:sdcm.wait            p:DEBUG > Start line patterns ['(?:storage_service|sstables_loader) - load_and_stream:'] were found.
...
2024-09-14 03:34:19,954 f:wait.py         l:167  c:sdcm.wait            p:DEBUG > End line patterns ['(?:storage_service|sstables_loader) - Done loading new SSTables for keyspace=keyspace1, table=standard1, load_and_stream=true.*status=(.*)'] were found.
...

node-4:

2024-09-14 03:34:25,024 f:load_utils.py   l:116  c:utils                p:INFO  > Running load and stream on the node longevity-large-partitions-200k-pks-db-node-f25dd9b8-0-4 for keyspace1.standard1'
...
Sep 14 03:34:34.436147 longevity-large-partitions-200k-pks-db-node-f25dd9b8-0-4 scylla[5486]:  [shard  2:strm] sstables_loader - Done loading new SSTables for keyspace=keyspace1, table=standard1, load_and_stream=true, primary_replica_only=false, status=succeeded
...
2024-09-14 03:34:34,570 f:wait.py         l:161  c:sdcm.wait            p:DEBUG > Start line patterns ['(?:storage_service|sstables_loader) - load_and_stream:'] were found.
...
2024-09-14 03:34:34,741 f:wait.py         l:167  c:sdcm.wait            p:DEBUG > End line patterns ['(?:storage_service|sstables_loader) - Done loading new SSTables for keyspace=keyspace1, table=standard1, load_and_stream=true.*status=(.*)'] were found.
...

But on the next node, node-3, the end log line was not detected, and a 5-minute timeout error was raised:
node-3:

2024-09-14 03:34:39,579 f:load_utils.py   l:116  c:utils                p:INFO  > Running load and stream on the node longevity-large-partitions-200k-pks-db-node-f25dd9b8-0-3 for keyspace1.standard1'
...
Sep 14 03:34:48.813688 longevity-large-partitions-200k-pks-db-node-f25dd9b8-0-3 scylla[5499]:  [shard  2:strm] sstables_loader - Done loading new SSTables for keyspace=keyspace1, table=standard1, load_and_stream=true, primary_replica_only=false, status=succeeded
...
2024-09-14 03:34:49,126 f:wait.py         l:161  c:sdcm.wait            p:DEBUG > Start line patterns ['(?:storage_service|sstables_loader) - load_and_stream:'] were found.
...
2024-09-14 03:39:40.777: (DisruptionEvent Severity.ERROR) period_type=end event_id=9716dc9f-3bb2-4d74-9713-f9616c04899b duration=6m47s: nemesis_name=LoadAndStream target_node=Node longevity-large-partitions-200k-pks-db-node-f25dd9b8-0-1 [34.73.216.73 | 10.142.0.157] errors=Timeout occurred while waiting for end log line ['(?:storage_service|sstables_loader) - Done loading new SSTables for keyspace=keyspace1, table=standard1, load_and_stream=true.*status=(.*)'] on node: longevity-large-partitions-200k-pks-db-node-f25dd9b8-0-3
Traceback (most recent call last):
  File "/home/ubuntu/scylla-cluster-tests/sdcm/nemesis.py", line 5220, in wrapper
    result = method(*args[1:], **kwargs)
  File "/home/ubuntu/scylla-cluster-tests/sdcm/nemesis.py", line 1679, in disrupt_load_and_stream
    SstableLoadUtils.run_load_and_stream(load_on_node, **kwargs)
  File "/home/ubuntu/scylla-cluster-tests/sdcm/utils/sstable/load_utils.py", line 113, in run_load_and_stream
    with wait_for_log_lines(node, start_line_patterns=[cls.LOAD_AND_STREAM_RUN_EXPR],
  File "/usr/local/lib/python3.10/contextlib.py", line 142, in __exit__
    next(self.gen)
  File "/home/ubuntu/scylla-cluster-tests/sdcm/wait.py", line 166, in wait_for_log_lines
    raise TimeoutError(end_ctx)
TimeoutError: Timeout occurred while waiting for end log line ['(?:storage_service|sstables_loader) - Done loading new SSTables for keyspace=keyspace1, table=standard1, load_and_stream=true.*status=(.*)'] on node: longevity-large-partitions-200k-pks-db-node-f25dd9b8-0-3
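
For context, the waiting mechanism in the traceback is a context manager. Below is a minimal sketch of the call site, reconstructed from the traceback and the patterns above; the trigger_load_and_stream helper is a hypothetical stand-in, and the exact keyword arguments may differ from the real SCT code.

from sdcm.wait import wait_for_log_lines

START_PATTERN = r"(?:storage_service|sstables_loader) - load_and_stream:"
END_PATTERN = (r"(?:storage_service|sstables_loader) - Done loading new SSTables "
               r"for keyspace=keyspace1, table=standard1, load_and_stream=true.*status=(.*)")

# node: a cluster node object, as in run_load_and_stream above.
# The context manager starts following the node's system log before the body
# runs; on exit it waits until both patterns are seen, or raises TimeoutError.
with wait_for_log_lines(node, start_line_patterns=[START_PATTERN],
                        end_line_patterns=[END_PATTERN]):
    trigger_load_and_stream(node)  # hypothetical stand-in for the actual refresh call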

Steps to Reproduce

  1. Run longevity with the disrupt_load_and_stream nemesis
  2. See error

Expected behavior: if a log line exists, it must be detected.

Actual behavior: an existing log line was not detected.

Impact

False negative.

How frequently does it reproduce?

Unknown

Installation details

SCT Version: master
Scylla version (or git commit hash): master

Logs

@fruch (Contributor) commented Sep 17, 2024

@soyacz

What do you think: how can we get into a situation where we start waiting only after the action here?

soyacz added a commit to soyacz/scylla-cluster-tests that referenced this issue Sep 18, 2024
We missed waiting time before recreating iterator. This causes
unnecessary high cpu utilization.

refs: scylladb#8720
@soyacz (Contributor) commented Sep 18, 2024

The first issue is that the start line was discovered late (it appeared at 03:34:39 and was detected at 03:34:49); maybe SCT was too busy, or the message was not flushed to disk.
I don't know why.
I tried to analyze the code, but found only a CPU-saturation issue when waiting for logs: #8743
I don't know if it helps, but I think we should add a bit of sleeping before recreating the log reader iterator.
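
A minimal, self-contained sketch of that shape of fix (an illustration only, not SCT's actual log reader): sleep before recreating the exhausted iterator, so the loop polls instead of busy-spinning.

import re
import time

def wait_for_pattern(log_path, pattern, timeout=300, poll_interval=0.1):
    # Simplified follower: re-reads the file from the start on each pass.
    regex = re.compile(pattern)
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        with open(log_path, encoding="utf-8", errors="ignore") as log:
            for line in log:
                if regex.search(line):
                    return line
        # Without this sleep the loop recreates the iterator immediately,
        # saturating a CPU core while waiting for new lines (see #8743).
        time.sleep(poll_interval)
    raise TimeoutError(f"pattern {pattern!r} not found within {timeout}s")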

@fruch (Contributor) commented Sep 18, 2024

> The first issue is that the start line was discovered late (it appeared at 03:34:39 and was detected at 03:34:49); maybe SCT was too busy, or the message was not flushed to disk. I don't know why. I tried to analyze the code, but found only a CPU-saturation issue when waiting for logs: #8743. I don't know if it helps, but I think we should add a bit of sleeping before recreating the log reader iterator.

Maybe we can combine the patterns and wait for both at the same time?

So at least start using both generators at the same time?
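
A hypothetical sketch of that idea, assuming the follower yields log lines as strings: one pass tracks both patterns, so neither one can be picked up late relative to the other.

import re

def wait_for_both(lines, start_pattern, end_pattern):
    # Scan a single stream of log lines for both patterns in one pass.
    start_re, end_re = re.compile(start_pattern), re.compile(end_pattern)
    start_seen = end_seen = False
    for line in lines:
        start_seen = start_seen or bool(start_re.search(line))
        end_seen = end_seen or bool(end_re.search(line))
        if start_seen and end_seen:
            return True
    return False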

@soyacz (Contributor) commented Sep 18, 2024

> Maybe we can combine the patterns and wait for both at the same time? So at least start using both generators at the same time?

We are starting the generators at the same time, before triggering the action. That's why this issue is a mystery to me.
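
For reference, a schematic of how the context manager is structured, pieced together from the traceback above; follow_system_log's keyword arguments and the drain_until_found helper are assumptions, not the real sdcm.wait code.

from contextlib import contextmanager

@contextmanager
def wait_for_log_lines(node, start_line_patterns, end_line_patterns, timeout=300):
    # Both followers are created here, BEFORE the with-body triggers the
    # action, so any line written after this point should be visible to them.
    start_follower = node.follow_system_log(patterns=start_line_patterns)
    end_follower = node.follow_system_log(patterns=end_line_patterns)
    yield  # the caller runs load-and-stream inside the with-block
    # On exit, drain each follower until its patterns appear or we time out.
    drain_until_found(start_follower, timeout)  # hypothetical helper
    drain_until_found(end_follower, timeout)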

fruch pushed a commit that referenced this issue Sep 23, 2024
We missed waiting time before recreating iterator. This causes
unnecessary high cpu utilization.

refs: #8720