Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

When nodetool command timeout, kill scylla-jmx with sigquit #303

Open
fruch opened this issue Feb 4, 2021 · 19 comments · Fixed by #304 or #307
Open

When nodetool command timeout, kill scylla-jmx with sigquit #303

fruch opened this issue Feb 4, 2021 · 19 comments · Fixed by #304 or #307

Comments

@fruch
Copy link
Contributor

fruch commented Feb 4, 2021

We had few reports across the broad that nodetool commands are getting stuck.

a suggestion was raise by @elcallio to try catch and collect enough information when those things happen:
scylladb/scylladb#7991 (comment)

@bhalevy
Copy link
Member

bhalevy commented Feb 7, 2021

Refs scylladb/scylladb#7244

@bhalevy
Copy link
Member

bhalevy commented Feb 7, 2021

@fruch who's working on this?

@fruch
Copy link
Contributor Author

fruch commented Feb 7, 2021

@fruch who's working on this?

Currently no one... there are more incidents of that getting stuck ?

@bhalevy
Copy link
Member

bhalevy commented Feb 7, 2021

Yes, this frequently happens in dtest, see scylladb/scylladb#7244
We need to get to this ASAP.

@fruch
Copy link
Contributor Author

fruch commented Feb 7, 2021

I'll take a closer look (it mainly happen on those sec. indexes tests ? or on more then one test file / class ?)

@fruch fruch changed the title When nodetool command timeout, kill scylla-jmx with sigint When nodetool command timeout, kill scylla-jmx with sigquit Feb 8, 2021
@fruch
Copy link
Contributor Author

fruch commented Feb 8, 2021

@elcallio seem that other places point to using SIGQUIT, (and not SIGINT), https://access.redhat.com/solutions/18178

I wasn't able to reproduce those issue, but I want to make sure I'm using the correct signal (and capturing the correct logs)

fruch added a commit to fruch/scylla-ccm that referenced this issue Feb 8, 2021
…ired`

When nodetool command get timeout, we try to send `SIGQUIT` to get
a threaddump inforamtion into scylla-jmx stdout.

Close: scylladb#303
Ref: scylladb/scylladb#7991 (comment)
fruch added a commit to fruch/scylla-ccm that referenced this issue Feb 8, 2021
…ired`

When nodetool command get timeout, we try to send `SIGQUIT` to get
a threaddump inforamtion into scylla-jmx stdout.

Close: scylladb#303
Ref: scylladb/scylladb#7991 (comment)
fruch added a commit to fruch/scylla-ccm that referenced this issue Feb 9, 2021
…ired`

When nodetool command get timeout, we try to send `SIGQUIT` to get
a threaddump inforamtion into scylla-jmx stdout.

Close: scylladb#303
Ref: scylladb/scylladb#7991 (comment)
@fruch fruch closed this as completed in #304 Feb 9, 2021
fruch added a commit that referenced this issue Feb 9, 2021
When nodetool command get timeout, we try to send `SIGQUIT` to get
a threaddump inforamtion into scylla-jmx stdout.

Close: #303
Ref: scylladb/scylladb#7991 (comment)
fruch added a commit to fruch/scylla-ccm that referenced this issue Feb 14, 2021
The `SIGQUIT` is follow too soon by a forcefull kill, i.e. we don't let
the process enough time to print the output to stdout

Close: scylladb#303
Ref: scylladb/scylladb#7991 (comment)
fruch added a commit that referenced this issue Feb 14, 2021
The `SIGQUIT` is follow too soon by a forcefull kill, i.e. we don't let
the process enough time to print the output to stdout

Close: #303
Ref: scylladb/scylladb#7991 (comment)
@bhalevy
Copy link
Member

bhalevy commented Feb 17, 2021

Unfortunately, we still don't see the stacktrace :-(
https://jenkins.scylladb.com/view/master/job/scylla-master/job/dtest-release/745/artifact/logs-release.2/dtest.log

2021-02-16 13:43:55,626 23284   ccm                            ERROR    | node2: nodetool timeout, going to kill scylla-jmx with SIGQUIT

2021-02-16 13:44:00,633 23284   dtest                          DEBUG    | secondary_indexes_test.py:TestSecondaryIndexes.test_insert_data_after_recreating_cf - Test failed with errors: [(<secondary_indexes_test.TestSecondaryIndexes testMethod=test_insert_data_after_recreating_cf>, (<class 'subprocess.TimeoutExpired'>, TimeoutExpired(['/jenkins/workspace/scylla-master/dtest-release/scylla/.ccm/scylla-repository/495b7b5596ab5f2bd1f1149f5b56c0e550716f79/scylla-tools-java/bin/nodetool', '-h', '127.0.60.2', '-p', '7260', 'info'], 60), <traceback object at 0x7f116ab0f3c0>))]

But https://jenkins.scylladb.com/view/master/job/scylla-master/job/dtest-release/745/artifact/logs-release.2/1613483043675_secondary_indexes_test.TestSecondaryIndexes.test_insert_data_after_recreating_cf/node2_system.log.jmx

Using config file: /jenkins/workspace/scylla-master/dtest-release/scylla/.dtest/dtest-4i__py64/test/node2/conf/scylla.yaml
Connecting to http://127.0.60.2:10000
Starting the JMX server
JMX is enabled to receive remote connections on port: 7260

@bhalevy bhalevy reopened this Feb 17, 2021
@bhalevy
Copy link
Member

bhalevy commented Feb 17, 2021

Interestingly, but likely unrelated, shortly after, in a following test, there's this:

2021-02-16 13:47:50,886 2315    dtest                          DEBUG    | secondary_indexes_test.py:TestSecondaryIndexes.test_remove_node_after_index_build - Test failed with errors: [(<secondary_indexes_test.TestSecondaryIndexes testMethod=test_remove_node_after_index_build>, (<class 'ccmlib.node.NodetoolError'>, NodetoolError("Nodetool command '/jenkins/workspace/scylla-master/dtest-release/scylla/.ccm/scylla-repository/495b7b5596ab5f2bd1f1149f5b56c0e550716f79/scylla-tools-java/bin/nodetool -h 127.0.72.1 -p 7172 flush' failed; exit status: 1; stdout: nodetool: Unable to connect to Scylla API server: java.net.ConnectException: Connection refused (Connection refused)\nSee 'nodetool help' or 'nodetool help <command>'.\n"), <traceback object at 0x7f116ab0e500>))]

https://jenkins.scylladb.com/view/master/job/scylla-master/job/dtest-release/745/artifact/logs-release.2/1613483955639_snapshot_test.TestSchemaFileInSnapshot.test_restoring_by_schema_with_mv_use_sstableloader_multiple_tables/node1_system.log.jmx

Using config file: /jenkins/workspace/scylla-master/dtest-release/scylla/.dtest/dtest-f_t8avbo/test/node1/conf/scylla.yaml
Error: JMX connector server communication error: service:jmx:rmi://76836c581099:7177
sun.management.AgentConfigurationError: java.rmi.server.ExportException: Port already in use: 7177; nested exception is: 
	java.net.BindException: Address already in use (Bind failed)
	at sun.management.jmxremote.ConnectorBootstrap.exportMBeanServer(ConnectorBootstrap.java:800)
	at sun.management.jmxremote.ConnectorBootstrap.startRemoteConnectorServer(ConnectorBootstrap.java:468)
	at sun.management.Agent.startAgent(Agent.java:262)
	at sun.management.Agent.startAgent(Agent.java:452)
Caused by: java.rmi.server.ExportException: Port already in use: 7177; nested exception is: 
	java.net.BindException: Address already in use (Bind failed)
	at sun.rmi.transport.tcp.TCPTransport.listen(TCPTransport.java:346)
	at sun.rmi.transport.tcp.TCPTransport.exportObject(TCPTransport.java:254)
	at sun.rmi.transport.tcp.TCPEndpoint.exportObject(TCPEndpoint.java:411)
	at sun.rmi.transport.LiveRef.exportObject(LiveRef.java:147)
	at sun.rmi.server.UnicastServerRef.exportObject(UnicastServerRef.java:236)
	at sun.management.jmxremote.ConnectorBootstrap$PermanentExporter.exportObject(ConnectorBootstrap.java:199)
	at javax.management.remote.rmi.RMIJRMPServerImpl.export(RMIJRMPServerImpl.java:146)
	at javax.management.remote.rmi.RMIJRMPServerImpl.export(RMIJRMPServerImpl.java:122)
	at javax.management.remote.rmi.RMIConnectorServer.start(RMIConnectorServer.java:404)
	at sun.management.jmxremote.ConnectorBootstrap.exportMBeanServer(ConnectorBootstrap.java:796)
	... 3 more
Caused by: java.net.BindException: Address already in use (Bind failed)
	at java.net.PlainSocketImpl.socketBind(Native Method)
	at java.net.AbstractPlainSocketImpl.bind(AbstractPlainSocketImpl.java:387)
	at java.net.ServerSocket.bind(ServerSocket.java:375)
	at java.net.ServerSocket.<init>(ServerSocket.java:237)
	at java.net.ServerSocket.<init>(ServerSocket.java:128)
	at sun.rmi.transport.proxy.RMIDirectSocketFactory.createServerSocket(RMIDirectSocketFactory.java:45)
	at sun.rmi.transport.proxy.RMIMasterSocketFactory.createServerSocket(RMIMasterSocketFactory.java:345)
	at sun.rmi.transport.tcp.TCPEndpoint.newServerSocket(TCPEndpoint.java:666)
	at sun.rmi.transport.tcp.TCPTransport.listen(TCPTransport.java:335)

Note that a different post is in use.

@bhalevy
Copy link
Member

bhalevy commented Feb 17, 2021

It smells like scylla-jmx might not exit cleanly and is holding on to the api port:

bhalevy@dt tmp$ egrep 'Unable to connect to Scylla API server|TimeoutExpired' dtest-745.log
2021-02-16 10:35:32,995 30817   dtest                          DEBUG    | nodetool_additional_test.py:TestNodetool.scrub_with_one_node_expect_data_loss_test - Test failed with errors: [(<nodetool_additional_test.TestNodetool testMethod=scrub_with_one_node_expect_data_loss_test>, (<class 'ccmlib.node.NodetoolError'>, NodetoolError("Nodetool command '/jenkins/workspace/scylla-master/dtest-release/scylla/.ccm/scylla-repository/495b7b5596ab5f2bd1f1149f5b56c0e550716f79/scylla-tools-java/bin/nodetool -h 127.0.55.1 -p 7155 scrub ks' failed; exit status: 1; stdout: nodetool: Unable to connect to Scylla API server: java.net.ConnectException: Connection refused (Connection refused)\nSee 'nodetool help' or 'nodetool help <command>'.\n"), <traceback object at 0x7f116ab86230>))]
2021-02-16 13:44:00,633 23284   dtest                          DEBUG    | secondary_indexes_test.py:TestSecondaryIndexes.test_insert_data_after_recreating_cf - Test failed with errors: [(<secondary_indexes_test.TestSecondaryIndexes testMethod=test_insert_data_after_recreating_cf>, (<class 'subprocess.TimeoutExpired'>, TimeoutExpired(['/jenkins/workspace/scylla-master/dtest-release/scylla/.ccm/scylla-repository/495b7b5596ab5f2bd1f1149f5b56c0e550716f79/scylla-tools-java/bin/nodetool', '-h', '127.0.60.2', '-p', '7260', 'info'], 60), <traceback object at 0x7f116ab0f3c0>))]
2021-02-16 13:47:50,886 2315    dtest                          DEBUG    | secondary_indexes_test.py:TestSecondaryIndexes.test_remove_node_after_index_build - Test failed with errors: [(<secondary_indexes_test.TestSecondaryIndexes testMethod=test_remove_node_after_index_build>, (<class 'ccmlib.node.NodetoolError'>, NodetoolError("Nodetool command '/jenkins/workspace/scylla-master/dtest-release/scylla/.ccm/scylla-repository/495b7b5596ab5f2bd1f1149f5b56c0e550716f79/scylla-tools-java/bin/nodetool -h 127.0.72.1 -p 7172 flush' failed; exit status: 1; stdout: nodetool: Unable to connect to Scylla API server: java.net.ConnectException: Connection refused (Connection refused)\nSee 'nodetool help' or 'nodetool help <command>'.\n"), <traceback object at 0x7f116ab0e500>))]
2021-02-16 13:59:12,626 20989   dtest                          DEBUG    | snapshot_test.py:TestSchemaFileInSnapshot.test_restoring_by_schema_with_mv_use_sstableloader_multiple_tables - Test failed with errors: [(<snapshot_test.TestSchemaFileInSnapshot testMethod=test_restoring_by_schema_with_mv_use_sstableloader_multiple_tables>, (<class 'ccmlib.node.NodetoolError'>, NodetoolError("Nodetool command '/jenkins/workspace/scylla-master/dtest-release/scylla/.ccm/scylla-repository/495b7b5596ab5f2bd1f1149f5b56c0e550716f79/scylla-tools-java/bin/nodetool -h 127.0.77.1 -p 7177 flush' failed; exit status: 1; stdout: nodetool: Unable to connect to Scylla API server: java.net.ConnectException: Connection refused (Connection refused)\nSee 'nodetool help' or 'nodetool help <command>'.\n"), <traceback object at 0x7f116ab97910>))]
2021-02-16 14:13:00,291 1890    dtest                          DEBUG    | sstableloader_test.py:TestMigration_with_2_1_x.migrate_sstable_with_compact_storage_test - Test failed with errors: [(<sstableloader_test.TestMigration_with_2_1_x testMethod=migrate_sstable_with_compact_storage_test>, (<class 'ccmlib.node.NodetoolError'>, NodetoolError("Nodetool command '/jenkins/workspace/scylla-master/dtest-release/scylla/.ccm/scylla-repository/495b7b5596ab5f2bd1f1149f5b56c0e550716f79/scylla-tools-java/bin/nodetool -h 127.0.77.1 -p 7177 flush -- ks' failed; exit status: 1; stdout: nodetool: Unable to connect to Scylla API server: java.net.ConnectException: Connection refused (Connection refused)\nSee 'nodetool help' or 'nodetool help <command>'.\n"), <traceback object at 0x7f116ac84910>))]
2021-02-16 14:24:57,385 32096   dtest                          DEBUG    | sstableloader_test.py:TestMigration_with_3_0_mc_prepared.migrate_sstable_without_compression_test - Test failed with errors: [(<sstableloader_test.TestMigration_with_3_0_mc_prepared testMethod=migrate_sstable_without_compression_test>, (<class 'ccmlib.node.NodetoolError'>, NodetoolError("Nodetool command '/jenkins/workspace/scylla-master/dtest-release/scylla/.ccm/scylla-repository/495b7b5596ab5f2bd1f1149f5b56c0e550716f79/scylla-tools-java/bin/nodetool -h 127.0.77.1 -p 7177 flush -- ks' failed; exit status: 1; stdout: nodetool: Unable to connect to Scylla API server: java.net.ConnectException: Connection refused (Connection refused)\nSee 'nodetool help' or 'nodetool help <command>'.\n"), <traceback object at 0x7f116ab8ca50>))]
2021-02-16 14:28:56,148 18175   dtest                          DEBUG    | sstableloader_test.py:TestMigration_with_3_0_md_prepared.migrate_sstable_with_range_tombstone_test - Test failed with errors: [(<sstableloader_test.TestMigration_with_3_0_md_prepared testMethod=migrate_sstable_with_range_tombstone_test>, (<class 'ccmlib.node.NodetoolError'>, NodetoolError("Nodetool command '/jenkins/workspace/scylla-master/dtest-release/scylla/.ccm/scylla-repository/495b7b5596ab5f2bd1f1149f5b56c0e550716f79/scylla-tools-java/bin/nodetool -h 127.0.77.1 -p 7177 flush -- ks' failed; exit status: 1; stdout: nodetool: Unable to connect to Scylla API server: java.net.ConnectException: Connection refused (Connection refused)\nSee 'nodetool help' or 'nodetool help <command>'.\n"), <traceback object at 0x7f116abac730>))]
2021-02-16 14:34:15,017 10842   dtest                          DEBUG    | sstableloader_test.py:TestMigration_with_3_0_x_prepared.migrate_sstable_with_wrong_partitioner_test - Test failed with errors: [(<sstableloader_test.TestMigration_with_3_0_x_prepared testMethod=migrate_sstable_with_wrong_partitioner_test>, (<class 'ccmlib.node.NodetoolError'>, NodetoolError("Nodetool command '/jenkins/workspace/scylla-master/dtest-release/scylla/.ccm/scylla-repository/495b7b5596ab5f2bd1f1149f5b56c0e550716f79/scylla-tools-java/bin/nodetool -h 127.0.72.1 -p 7172 flush -- ks' failed; exit status: 1; stdout: nodetool: Unable to connect to Scylla API server: java.net.ConnectException: Connection refused (Connection refused)\nSee 'nodetool help' or 'nodetool help <command>'.\n"), <traceback object at 0x7f116ab4b050>))]
2021-02-16 15:11:51,915 14222   dtest                          DEBUG    | update_cluster_layout_tests.py:TestUpdateClusterLayout.simple_kill_new_node_while_bootstrapping_test - Test failed with errors: [(<update_cluster_layout_tests.TestUpdateClusterLayout testMethod=simple_kill_new_node_while_bootstrapping_test>, (<class 'ccmlib.node.NodetoolError'>, NodetoolError("Nodetool command '/jenkins/workspace/scylla-master/dtest-release/scylla/.ccm/scylla-repository/495b7b5596ab5f2bd1f1149f5b56c0e550716f79/scylla-tools-java/bin/nodetool -h 127.0.72.1 -p 7172 status' failed; exit status: 1; stdout: nodetool: Unable to connect to Scylla API server: java.net.ConnectException: Connection refused (Connection refused)\nSee 'nodetool help' or 'nodetool help <command>'.\n"), <traceback object at 0x7f116a973f00>))]
2021-02-16 15:37:48,009 9598    dtest                          DEBUG    | wide_rows_test.py:TestWideRows_with_LeveledCompactionStrategy.test_large_cell_detector_with_ttl_on_row - Test failed with errors: [(<wide_rows_test.TestWideRows_with_LeveledCompactionStrategy testMethod=test_large_cell_detector_with_ttl_on_row>, (<class 'ccmlib.node.NodetoolError'>, NodetoolError("Nodetool command '/jenkins/workspace/scylla-master/dtest-release/scylla/.ccm/scylla-repository/495b7b5596ab5f2bd1f1149f5b56c0e550716f79/scylla-tools-java/bin/nodetool -h 127.0.72.1 -p 7172 flush' failed; exit status: 1; stdout: nodetool: Unable to connect to Scylla API server: java.net.ConnectException: Connection refused (Connection refused)\nSee 'nodetool help' or 'nodetool help <command>'.\n"), <traceback object at 0x7f116ab1e050>))]
2021-02-16 15:52:13,261 22285   dtest                          DEBUG    | wide_rows_test.py:TestWideRows_with_LeveledCompactionStrategy.test_multiple_rows_with_large_cells_detector - Test failed with errors: [(<wide_rows_test.TestWideRows_with_LeveledCompactionStrategy testMethod=test_multiple_rows_with_large_cells_detector>, (<class 'ccmlib.node.NodetoolError'>, NodetoolError("Nodetool command '/jenkins/workspace/scylla-master/dtest-release/scylla/.ccm/scylla-repository/495b7b5596ab5f2bd1f1149f5b56c0e550716f79/scylla-tools-java/bin/nodetool -h 127.0.77.1 -p 7177 flush' failed; exit status: 1; stdout: nodetool: Unable to connect to Scylla API server: java.net.ConnectException: Connection refused (Connection refused)\nSee 'nodetool help' or 'nodetool help <command>'.\n"), <traceback object at 0x7f1170357640>))]
2021-02-16 15:58:03,477 6169    dtest                          DEBUG    | wide_rows_test.py:TestWideRows_with_SizeTieredCompactionStrategy.test_large_cell_in_materialized_view - Test failed with errors: [(<wide_rows_test.TestWideRows_with_SizeTieredCompactionStrategy testMethod=test_large_cell_in_materialized_view>, (<class 'ccmlib.node.NodetoolError'>, NodetoolError("Nodetool command '/jenkins/workspace/scylla-master/dtest-release/scylla/.ccm/scylla-repository/495b7b5596ab5f2bd1f1149f5b56c0e550716f79/scylla-tools-java/bin/nodetool -h 127.0.72.1 -p 7172 flush' failed; exit status: 1; stdout: nodetool: Unable to connect to Scylla API server: java.net.ConnectException: Connection refused (Connection refused)\nSee 'nodetool help' or 'nodetool help <command>'.\n"), <traceback object at 0x7f116a9686e0>))]
2021-02-16 16:39:23,616 9355    dtest                          DEBUG    | materialized_views_test.py:TestMaterializedViews.base_replica_repair_test - Test failed with errors: [(<materialized_views_test.TestMaterializedViews testMethod=base_replica_repair_test>, (<class 'ccmlib.node.NodetoolError'>, NodetoolError("Nodetool command '/jenkins/workspace/scylla-master/dtest-release/scylla/.ccm/scylla-repository/495b7b5596ab5f2bd1f1149f5b56c0e550716f79/scylla-tools-java/bin/nodetool -h 127.0.77.1 -p 7177 repair ks t' failed; exit status: 1; stdout: nodetool: Unable to connect to Scylla API server: java.net.ConnectException: Connection refused (Connection refused)\nSee 'nodetool help' or 'nodetool help <command>'.\n"), <traceback object at 0x7f11686f9cd0>))]
2021-02-16 16:54:31,954 23123   dtest                          DEBUG    | materialized_views_test.py:TestMaterializedViews.drop_mv_during_base_table_writes_test - Test failed with errors: [(<materialized_views_test.TestMaterializedViews testMethod=drop_mv_during_base_table_writes_test>, (<class 'ccmlib.node.NodetoolError'>, NodetoolError("Nodetool command '/jenkins/workspace/scylla-master/dtest-release/scylla/.ccm/scylla-repository/495b7b5596ab5f2bd1f1149f5b56c0e550716f79/scylla-tools-java/bin/nodetool -h 127.0.72.1 -p 7172 flush' failed; exit status: 1; stdout: nodetool: Unable to connect to Scylla API server: java.net.ConnectException: Connection refused (Connection refused)\nSee 'nodetool help' or 'nodetool help <command>'.\n"), <traceback object at 0x7f117035c230>))]
2021-02-16 18:15:09,388 26671   dtest                          DEBUG    | bypass_cache_test.py:TestBypassCache.test_alter_table_caching_enable - Test failed with errors: [(<bypass_cache_test.TestBypassCache testMethod=test_alter_table_caching_enable>, (<class 'ccmlib.node.NodetoolError'>, NodetoolError("Nodetool command '/jenkins/workspace/scylla-master/dtest-release/scylla/.ccm/scylla-repository/495b7b5596ab5f2bd1f1149f5b56c0e550716f79/scylla-tools-java/bin/nodetool -h 127.0.72.1 -p 7172 flush' failed; exit status: 1; stdout: nodetool: Unable to connect to Scylla API server: java.net.ConnectException: Connection refused (Connection refused)\nSee 'nodetool help' or 'nodetool help <command>'.\n"), <traceback object at 0x7f116a9b6b90>))]

@fruch
Copy link
Contributor Author

fruch commented Feb 17, 2021

@bhalevy, oh great so it even made things worse ?

@bhalevy
Copy link
Member

bhalevy commented Feb 17, 2021

@bhalevy, oh great so it even made things worse ?

I'm not sure, it may have exposed an existing issue.
Looking at scylla_node.do_stop

def do_stop(self, gently=True):
"""
Stop the node.
- gently: Let Scylla and Scylla JMX clean up and shut down properly.
Otherwise do a 'kill -9' which shuts down faster.
"""
if not self.is_running():
return False
self._update_jmx_pid(wait=False)
if self.scylla_manager and self.scylla_manager.is_agent_available:
self._update_scylla_agent_pid()
for proc in [self._process_jmx, self._process_scylla, self._process_agent]:
if proc:
if gently:
try:
proc.terminate()
except OSError:
pass
else:
try:
proc.kill()
except OSError:
pass
else:
signal_mapping = {True: signal.SIGTERM, False: signal.SIGKILL}
for pid in [self.jmx_pid, self.pid, self.agent_pid]:
if pid:
try:
os.kill(pid, signal_mapping[gently])
except OSError:
pass
return True

It looks like we're not stopping scylla_jmx if the node is not considered running.
I think that to be on the safe side will we shouldn't do this optimization and attempt to stop/kill all process regardless of node.is_running(), certainly with gently=False.

@fruch
Copy link
Contributor Author

fruch commented Feb 17, 2021

@bhalevy so 5sec wasn't enough for it to actually dump the threaddump...

@fruch
Copy link
Contributor Author

fruch commented Feb 17, 2021

@bhalevy so 5sec wasn't enough for it to actually dump the threaddump...

is those happening only on run on monster ? i.e. when all the suite is run in parallel ?

@bhalevy
Copy link
Member

bhalevy commented Feb 17, 2021

is those happening only on run on monster ? i.e. when all the suite is run in parallel ?

The timeouts happen also with no parallelism.
See https://jenkins.scylladb.com/view/master/job/scylla-master/job/dtest-release-parallel/33/artifact/logs-release.005/dtest.log
that's with `NOSE_PROCESSES=1'

@bhalevy
Copy link
Member

bhalevy commented Feb 17, 2021

The port-in-use issue could be unrelated.
If the scylla process dies for any reason we will not stop scylla_jmx.

@bhalevy
Copy link
Member

bhalevy commented Feb 17, 2021

@bhalevy so 5sec wasn't enough for it to actually dump the threaddump...

@fruch I doubt that it's a matter of time.
Dumping the stack trace should be instantaneous.
Maybe it's hung so hard it doesn't even respond to the signal.

@bhalevy
Copy link
Member

bhalevy commented Feb 17, 2021

Cc @penberg

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
2 participants