Timed out while reading configuration over HTTP
Hi I am struggling with an issue connecting to cb.
I have a cluster of 3 couchbase 2.0 community servers running on amazon ec2s. client servers are connected to them using the Java API through Tomcat
today I got a Crash on on of the client machines :
The system is going down for power off NOW!
...
INFO: Closing Spring root WebApplicationContext
Feb 10, 2013 8:35:53 AM org.apache.catalina.loader.WebappClassLoader clearReferencesThreads
SEVERE: The web application [] appears to have started a thread named [Memcached IO over {MemcachedConnection to ip-10-196-218-47.ec2.internal/10.196.218.47:11210 domU-12-31-39-02-31-09.compute-1.internal/10.248.54.247:11210 ip-10-46-217-81.ec2.internal/10.46.217.81:11210}] but has failed to stop it. This is very likely to create a memory leak.
Feb 10, 2013 8:35:53 AM org.apache.catalina.loader.WebappClassLoader clearReferencesThreads
SEVERE: The web application [] appears to have started a thread named [New I/O client worker #1-1] but has failed to stop it. This is very likely to create a memory leak.
...I've terminated this instance and created a new one but it can't seem to connect to the cluster , I get this error in Java :
013-02-10 09:43:25.247 WARN com.couchbase.client.vbucket.ConfigurationProviderHTTP: Timed out while reading configuration over HTTP java.net.SocketTimeoutException: connect timed out
I've got two machines that are connected and working, and one that fails to connect, I tried it with new machines and they all fail. I've tried connecting through different servers in the cb cluster and still fail.
CBStats all on the bucket and node gives me :
accepting_conns: 1 auth_cmds: 2831 auth_errors: 0 bucket_active_conns: 1 bucket_conns: 18 bytes: 1382987584 bytes_read: 18683928043 bytes_written: 295454749194 cas_badval: 21308 cas_hits: 1137248 cas_misses: 0 cmd_flush: 0 cmd_get: 11812704 cmd_set: 329016 conn_yields: 24850 connection_structures: 5000 curr_connections: 37 curr_conns_on_port_11209: 27 curr_conns_on_port_11210: 8 curr_items: 289612 curr_items_tot: 572649 curr_temp_items: 0 daemon_connections: 4 decr_hits: 0 decr_misses: 0 delete_hits: 38951 delete_misses: 4611 ep_access_scanner_last_runtime: 0 ep_access_scanner_num_items: 33276 ep_access_scanner_task_time: 2013-02-11 10:00:00 ep_allow_data_loss_during_shutdown: 1 ep_alog_block_size: 4096 ep_alog_path: /mnt/ebs/PoolPartyBucket/access.log ep_alog_sleep_time: 1440 ep_alog_task_time: 10 ep_backend: couchdb ep_bg_fetch_delay: 0 ep_bg_fetched: 169 ep_bg_load: 861623 ep_bg_load_avg: 5098 ep_bg_max_load: 79133 ep_bg_max_wait: 3817 ep_bg_meta_fetched: 0 ep_bg_min_load: 96 ep_bg_min_wait: 42 ep_bg_num_samples: 169 ep_bg_remaining_jobs: 0 ep_bg_wait: 33946 ep_bg_wait_avg: 200 ep_chk_max_items: 5000 ep_chk_period: 1800 ep_chk_persistence_remains: 0 ep_chk_persistence_timeout: 10 ep_chk_remover_stime: 5 ep_commit_num: 2515908 ep_commit_time: 11 ep_commit_time_total: 26399696 ep_concurrentDB: 1 ep_config_file: ep_couch_bucket: PoolPartyBucket ep_couch_host: localhost ep_couch_port: 11213 ep_couch_reconnect_sleeptime: 250 ep_couch_response_timeout: 180000 ep_data_age: 1 ep_data_age_highwat: 37 ep_data_traffic_enabled: 0 ep_dbinit: 0 ep_dbname: /mnt/ebs/PoolPartyBucket ep_degraded_mode: 0 ep_diskqueue_drain: 2927055 ep_diskqueue_fill: 2926454 ep_diskqueue_items: 0 ep_diskqueue_memory: 0 ep_diskqueue_pending: 0 ep_exp_pager_stime: 3600 ep_expired_access: 0 ep_expired_pager: 0 ep_expiry_window: 3 ep_failpartialwarmup: 0 ep_flush_all: false ep_flush_duration_total: 33458 ep_flushall_enabled: 0 ep_flusher_state: running ep_flusher_todo: 0 ep_getl_default_timeout: 15 ep_getl_max_timeout: 30 ep_ht_locks: 5 ep_ht_size: 3079 ep_inconsistent_slave_chk: 0 ep_initfile: ep_io_num_read: 169 ep_io_num_write: 2921439 ep_io_read_bytes: 213771 ep_io_write_bytes: 17503118437 ep_item_begin_failed: 0 ep_item_commit_failed: 0 ep_item_flush_expired: 0 ep_item_flush_failed: 0 ep_item_num_based_new_chk: 1 ep_items_rm_from_checkpoints: 1770140 ep_keep_closed_chks: 0 ep_klog_block_size: 4096 ep_klog_compactor_queue_cap: 500000 ep_klog_compactor_stime: 3600 ep_klog_flush: commit2 ep_klog_max_entry_ratio: 10 ep_klog_max_log_size: 2147483647 ep_klog_path: ep_klog_sync: commit2 ep_kv_size: 1268463867 ep_max_checkpoints: 2 ep_max_data_size: 2097152000 ep_max_item_size: 20971520 ep_max_size: 2097152000 ep_max_txn_size: 10000 ep_max_vbuckets: 1024 ep_mem_high_wat: 1572864000 ep_mem_low_wat: 1258291200 ep_mem_tracker_enabled: true ep_min_data_age: 0 ep_mlog_compactor_runs: 0 ep_mutation_mem_threshold: 0 ep_num_access_scanner_runs: 23 ep_num_eject_failures: 105711 ep_num_expiry_pager_runs: 548 ep_num_non_resident: 199322 ep_num_not_my_vbuckets: 5 ep_num_ops_del_meta: 0 ep_num_ops_get_meta: 319 ep_num_ops_set_meta: 0 ep_num_pager_runs: 2 ep_num_value_ejects: 199759 ep_oom_errors: 0 ep_overhead: 18173688 ep_pager_active_vb_pcnt: 40 ep_pager_unbiased_period: 60 ep_pending_ops: 0 ep_pending_ops_max: 0 ep_pending_ops_max_duration: 0 ep_pending_ops_total: 0 ep_postInitfile: ep_queue_age_cap: 900 ep_queue_size: 0 ep_startup_time: 1358522939 ep_storage_age: 0 ep_storage_age_highwat: 35 ep_store_max_concurrency: 10 ep_store_max_readers: 9 ep_store_max_readwrite: 1 ep_stored_val_type: ep_tap_ack_grace_period: 300 ep_tap_ack_initial_sequence_number: 1 ep_tap_ack_interval: 1000 ep_tap_ack_window_size: 10 ep_tap_backfill_resident: 0.9 ep_tap_backlog_limit: 5000 ep_tap_backoff_period: 5 ep_tap_bg_fetch_requeued: 0 ep_tap_bg_fetched: 3043860 ep_tap_bg_max_pending: 500 ep_tap_keepalive: 300 ep_tap_noop_interval: 20 ep_tap_requeue_sleep_time: 0.1 ep_tap_throttle_cap_pcnt: 10 ep_tap_throttle_queue_cap: 1000000 ep_tap_throttle_threshold: 90 ep_tmp_oom_errors: 0 ep_too_old: 0 ep_too_young: 0 ep_total_cache_size: 1261592079 ep_total_del_items: 121353 ep_total_enqueued: 2926454 ep_total_new_items: 608274 ep_total_persisted: 2921439 ep_uncommitted_items: 0 ep_value_size: 1205492644 ep_vb0: 0 ep_vb_snapshot_total: 403443 ep_vb_total: 682 ep_vbucket_del: 0 ep_vbucket_del_fail: 0 ep_version: 2.0.0r_145_g4ac7f92 ep_waitforwarmup: 0 ep_warmup: 1 ep_warmup_batch_size: 1000 ep_warmup_dups: 0 ep_warmup_min_items_threshold: 100 ep_warmup_min_memory_threshold: 100 ep_warmup_oom: 0 ep_warmup_thread: complete ep_warmup_time: 17646 get_hits: 11774820 get_misses: 37884 incr_hits: 0 incr_misses: 0 libevent: 2.0.11-stable limit_maxbytes: 67108864 listen_disabled_num: 0 max_conns_on_port_11209: 1000 max_conns_on_port_11210: 9000 mem_used: 1382987584 pid: 1512 pointer_size: 64 rejected_conns: 0 rusage_system: 4168.646269 rusage_user: 28050.627656 tap_checkpoint_end_received: 247989 tap_checkpoint_end_sent: 246888 tap_checkpoint_start_received: 249839 tap_checkpoint_start_sent: 247791 tap_connect_received: 246 tap_delete_received: 82679 tap_delete_sent: 3032897 tap_mutation_received: 1350278 tap_mutation_sent: 16131847 tap_opaque_received: 6656 tap_opaque_sent: 46796 tap_vbucket_set_received: 682 threads: 4 time: 1360499327 total_connections: 4740 uptime: 1976414 vb_active_curr_items: 289612 vb_active_eject: 62755 vb_active_expired: 0 vb_active_ht_memory: 8563192 vb_active_itm_memory: 758064076 vb_active_meta_data_memory: 29914947 vb_active_num: 341 vb_active_num_non_resident: 62586 vb_active_num_ref_ejects: 53 vb_active_num_ref_items: 43915 vb_active_ops_create: 325000 vb_active_ops_delete: 55701 vb_active_ops_reject: 0 vb_active_ops_update: 1170615 vb_active_perc_mem_resident: 78 vb_active_queue_age: 0 vb_active_queue_drain: 1553967 vb_active_queue_fill: 1553797 vb_active_queue_memory: 0 vb_active_queue_pending: 0 vb_active_queue_size: 0 vb_dead_num: 0 vb_pending_curr_items: 0 vb_pending_eject: 0 vb_pending_expired: 0 vb_pending_ht_memory: 0 vb_pending_itm_memory: 0 vb_pending_meta_data_memory: 0 vb_pending_num: 0 vb_pending_num_non_resident: 0 vb_pending_num_ref_ejects: 0 vb_pending_num_ref_items: 0 vb_pending_ops_create: 0 vb_pending_ops_delete: 0 vb_pending_ops_reject: 0 vb_pending_ops_update: 0 vb_pending_perc_mem_resident: 0 vb_pending_queue_age: 0 vb_pending_queue_drain: 0 vb_pending_queue_fill: 0 vb_pending_queue_memory: 0 vb_pending_queue_pending: 0 vb_pending_queue_size: 0 vb_replica_curr_items: 283037 vb_replica_eject: 137004 vb_replica_expired: 0 vb_replica_ht_memory: 8563192 vb_replica_itm_memory: 503528003 vb_replica_meta_data_memory: 29262613 vb_replica_num: 341 vb_replica_num_non_resident: 136736 vb_replica_num_ref_ejects: 145 vb_replica_num_ref_items: 1939 vb_replica_ops_create: 283274 vb_replica_ops_delete: 65652 vb_replica_ops_reject: 0 vb_replica_ops_update: 1021197 vb_replica_perc_mem_resident: 51 vb_replica_queue_age: 0 vb_replica_queue_drain: 1373088 vb_replica_queue_fill: 1372657 vb_replica_queue_memory: 0 vb_replica_queue_pending: 0 vb_replica_queue_size: 0 version: 1.4.4_600_g7ea975a
so I know the cluster can accept new connections
I can telnet to all ports without a problem from the client to any CB node.
Everything was working fine except inconsistent number of replicated Items that started in a previous crash. but other than that for the last 3 weeks I haven't experience any trouble with the CB cluster.
Can anyone provide any insight for the root of this problem?
Thanks
Thanks for reply, I created a new cluster and replicated all the data to it via XDCR. then I switched the client to connect to it. but the same issues reappeared, initial connection was fine on two servers but the third one failed, and after a while a server closed the connection which is weird since everything was working fine before.
I replaced the Java API to version 1.1 and used Connection Factory to create the connection and now things appears to work fine, I haven't seen a problem connecting since the update but not enough time has past to tell if its reliable.
I want to implement connection shutdown like you suggested, but I am not sure whats the correct way to do it, since my application needs to be constantly connected to the CB servers, should I do it with Connection Manager like the beer example?
Whats the point of shutting down a client when its useless without a connection, I mean a shutdown might be needed by the nodes, but the client is killed in my system if a connection fails so is there still a point for a client shutdown?
Why is the memcached thread suspended? for a month it worked fine and now in 2 days it happend every few hours, could this be related to the configuration fetching problem?
Thanks a lot for the help!
Yuval
Hi,
So I am still having a problem when a new instance tries to connect to the couchbase cluster, other issues seems to be solved.
Again when I have 2 instances running and connected to the cb cluster, I am starting a new instance and I receive this warning after enabling logging
2013-03-28 12:39:08 DEBUG ConfigurationProviderHTTP:? - Attempting to read configuration from URI: http://xx.xxx.xxx.xxx:8091/pools
2013-03-28 12:39:09 WARN ConfigurationProviderHTTP:? - Timed out while reading configuration over HTTP
java.net.SocketTimeoutException: connect timed out
when I try to get http://xx.xxx.xxx.xxx:8091/pools using curl, I am getting the file correctly. from the same server.
what should I do next to solve this issue?
Thanks
For the first issue, you'll want to be calling the .shutdown() method on the client object at the appropriate time. That should make that issue go away.
For the second, I know you have connectivity, but there could be something wrong when trying to get the bucket configuration. Are there any unusual entries in the logs on the cluster?
You may also try getting the configuration for a bucket like a client would. During bootstrap, it'll go to a URL like:
http://(node):8091/pools/default/bucketsStreaming/(bucketname)
You may want to try that with some other HTTP client to see if it's seeing the same thing.
If you want to see this from the Java client, turn the debug to either CONFIG or DEBUG level on com.couchbase.client.vbucket.* (depending on whether you're using JDK logging or Log4J-- see the docs). That would be good info on what exact URL it's trying to retrieve when timing out.