I have a use case where I use a `zookeeper::ZooKeeper` client instance to maintain an ephemeral znode while my application does other work. I've found that when I kill the ZooKeeper server I'm testing against, the client panics in its reconnection logic on an internal thread. This leaves my application running, but with the client connection in a non-functional state.
The backtrace I see is the following:

```
thread 'io' panicked at 'called `Result::unwrap()` on an `Err` value: Os { code: 2, kind: NotFound, message: "No such file or directory" }', src/libcore/result.rs:1009:5
note: Some details are omitted, run with `RUST_BACKTRACE=full` for a verbose backtrace.
stack backtrace:
   0: std::sys::unix::backtrace::tracing::imp::unwind_backtrace
             at src/libstd/sys/unix/backtrace/tracing/gcc_s.rs:49
   1: std::sys_common::backtrace::_print
             at src/libstd/sys_common/backtrace.rs:71
   2: std::panicking::default_hook::{{closure}}
             at src/libstd/sys_common/backtrace.rs:59
             at src/libstd/panicking.rs:211
   3: std::panicking::default_hook
             at src/libstd/panicking.rs:227
   4: <std::panicking::begin_panic::PanicPayload<A> as core::panic::BoxMeUp>::get
             at src/libstd/panicking.rs:491
   5: std::panicking::continue_panic_fmt
             at src/libstd/panicking.rs:398
   6: std::panicking::try::do_call
             at src/libstd/panicking.rs:325
   7: core::char::methods::<impl char>::escape_debug
             at src/libcore/panicking.rs:95
   8: core::alloc::Layout::repeat
             at /rustc/9fda7c2237db910e41d6a712e9a2139b352e558b/src/libcore/macros.rs:26
   9: <zookeeper::acl::Acl as core::clone::Clone>::clone
             at /rustc/9fda7c2237db910e41d6a712e9a2139b352e558b/src/libcore/result.rs:808
  10: zookeeper::io::ZkIo::reconnect
             at /Users/dtw/.cargo/registry/src/github.com-1ecc6299db9ec823/zookeeper-0.5.5/src/io.rs:326
  11: zookeeper::io::ZkIo::ready_zk
             at /Users/dtw/.cargo/registry/src/github.com-1ecc6299db9ec823/zookeeper-0.5.5/src/io.rs:429
  12: zookeeper::io::ZkIo::ready
             at /Users/dtw/.cargo/registry/src/github.com-1ecc6299db9ec823/zookeeper-0.5.5/src/io.rs:366
  13: zookeeper::io::ZkIo::ready_timer
             at /Users/dtw/.cargo/registry/src/github.com-1ecc6299db9ec823/zookeeper-0.5.5/src/io.rs:549
  14: zookeeper::zookeeper::ZooKeeper::connect::{{closure}}
             at /Users/dtw/.cargo/registry/src/github.com-1ecc6299db9ec823/zookeeper-0.5.5/src/zookeeper.rs:78
```
I believe this is due to the `unwrap()` call at this line: https://github.com/bonifaido/rust-zookeeper/blob/e25f2a0ee6cc2667430054f08c8c69fca1c8c4e9/src/io.rs#L326
I also have a listener on the connection that currently just logs the state transitions of the client. I see the client go through the `Connected -> NotConnected` and `NotConnected -> Connecting` state transitions before the panic happens.
To reproduce this behavior I've been using Docker to start and stop a local ZK server, using the official Docker Hub `zookeeper` image. To run the server and expose its client port on a machine with Docker installed:

```
docker run --rm -p 2181:2181 --name test-zookeeper -d zookeeper
```
I could handle the disconnect from within my application by watching for the `NotConnected` event and taking action from there (either exiting the rest of the application or trying to rebuild the client), but I think it would be nice to resolve some of this within the client library as well. It doesn't seem like the client's internal thread should panic, leaving `Connecting` as the last client state event the caller receives.
Two options that come to mind for handling this situation are:
- Instead of panicking, publish some sort of client state indicating the connection has permanently failed. It looks like `ZkState::Closed` might already fit the situation and could potentially be published in this case.
- Add a bit more logic to the reconnect routine to retry continually, or perhaps support a definable policy for how many times to retry before entering the state described in option one.
What do you think about these options? Would you be amenable to a PR that, at the least, handles the case where the reconnect fails by publishing a `ZkState::Closed` event to the listeners?