Cluster with one node down fails to show documents

I have a simple cluster in my demo environment with only 2 nodes (each with 6GB RAM and 2 vCPUs). One of them is down at the moment, and I have severe problems using the database from the one active node…

If I try to open Documents in the web console, it fails to show any documents and just throws this at me:

{ "error": "{exit,{{nodedown,'ns_1@db2.dalsgaard-data.dk'},\n {gen_server,call,\n [{'ns_memcached-data','ns_1@db2.dalsgaard-data.dk'},\n {get_keys,[512,513,514,515,516,517,518,519,520,521,522,\n 523,524,525,526,527,528,529,530,531,532,533,\n 534,535,536,537,538,539,540,541,542,543,544,\n 545,546,547,548,549,550,551,552,553,554,555,\n 556,557,558,559,560,561,562,563,564,565,566,\n 567,568,569,570,571,572,573,574,575,576,577,\n 578,579,580,581,582,583,584,585,586,587,588,\n 589,590,591,592,593,594,595,596,597,598,599,\n 600,601,602,603,604,605,606,607,608,609,610,\n 611,612,613,614,615,616,617,618,619,620,621,\n 622,623,624,625,626,627,628,629,630,631,632,\n 633,634,635,636,637,638,639,640,641,642,643,\n 644,645,646,647,648,649,650,651,652,653,654,\n 655,656,657,658,659,660,661,662,663,664,665,\n 666,667,668,669,670,671,672,673,674,675,676,\n 677,678,679,680,681,682,683,684,685,686,687,\n 688,689,690,691,692,693,694,695,696,697,698,\n 699,700,701,702,703,704,705,706,707,708,709,\n 710,711,712,713,714,715,716,717,718,719,720,\n 721,722,723,724,725,726,727,728,729,730,731,\n 732,733,734,735,736,737,738,739,740,741,742,\n 743,744,745,746,747,748,749,750,751,752,753,\n 754,755,756,757,758,759,760,761,762,763,764,\n 765,766,767,768,769,770,771,772,773,774,775,\n 776,777,778,779,780,781,782,783,784,785,786,\n 787,788,789,790,791,792,793,794,795,796,797,\n 798,799,800,801,802,803,804,805,806,807,808,\n 809,810,811,812,813,814,815,816,817,818,819,\n 820,821,822,823,824,825,826,827,828,829,830,\n 831,832,833,834,835,836,837,838,839,840,841,\n 842,843,844,845,846,847,848,849,850,851,852,\n 853,854,855,856,857,858,859,860,861,862,863,\n 864,865,866,867,868,869,870,871,872,873,874,\n 875,876,877,878,879,880,881,882,883,884,885,\n 886,887,888,889,890,891,892,893,894,895,896,\n 897,898,899,900,901,902,903,904,905,906,907,\n 908,909,910,911,912,913,914,915,916,917,918,\n 919,920,921,922,923,924,925,926,927,928,929,\n 930,931,932,933,934,935,936,937,938,939,940,\n 
941,942,943,944,945,946,947,948,949,950,951,\n 952,953,954,955,956,957,958,959,960,961,962,\n 963,964,965,966,967,968,969,970,971,972,973,\n 974,975,976,977,978,979,980,981,982,983,984,\n 985,986,987,988,989,990,991,992,993,994,995,\n 996,997,998,999,1000,1001,1002,1003,1004,1005,\n 1006,1007,1008,1009,1010,1011,1012,1013,1014,\n 1015,1016,1017,1018,1019,1020,1021,1022,1023],\n [{include_docs,false},\n {inclusive_end,true},\n {limit,10},\n {start_key,undefined},\n {end_key,undefined}]},\n infinity]}}}", "reason": "unknown error" }
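Reading the error another way: the Documents listing does a get_keys scan across all 1024 vBuckets, and with two nodes roughly half of them (512-1023 here) are active on the node that is down, so the gen_server call to ns_memcached on that node fails with nodedown. A minimal sketch of that lookup, using a hand-made map in the shape of the vBucketServerMap that the REST API returns at /pools/default/buckets/<bucket> (hostnames and the even split are illustrative assumptions, not taken from the real cluster):

```python
# Hand-made stand-in for a 2-node vBucketServerMap: vBucketMap[i][0] is the
# index (into serverList) of the node that holds the ACTIVE copy of
# vBucket i. A real map also lists replica indexes.
vbucket_server_map = {
    "serverList": ["db1.example.com:11210", "db2.example.com:11210"],
    "vBucketMap": [[0] for _ in range(512)] + [[1] for _ in range(512)],
}

down_node = "db2.example.com:11210"  # placeholder for the failed node
down_idx = vbucket_server_map["serverList"].index(down_node)

# Every vBucket whose active copy lives on the down node cannot be
# scanned, which is exactly the ID range in the error above.
unreachable = [
    vb for vb, owners in enumerate(vbucket_server_map["vBucketMap"])
    if owners[0] == down_idx
]

print(len(unreachable), unreachable[0], unreachable[-1])  # 512 512 1023
```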

Shouldn’t it continue to respond from the one active node???

How can I best troubleshoot this issue?

I’m using: Community Edition 6.0.0 build 1693 on CentOS

Actually, the reason why the other node is “down” is a little strange…

I can ping it and it responds quickly… but I cannot open the Couchbase admin page (nor Webmin). I cannot ssh to it - but if I open it directly (via the Virtual Infrastructure client - it's a VM) then everything runs smoothly (and with low CPU), so I guess there is more to it - and I'm not sure why this happens. The two nodes were set up in exactly the same way - and the other never fails…

I'm not sure what to look for in this scenario.

… but the main question about the cluster is still very relevant!

@jda,

Try giving your CB cluster a little more horsepower.

Try testing with 3 nodes using the recommended specs in the link below.

Thanks for your reply @househippo

But this is a demo/test installation with 1-3 users… Surely 6GB and two vCPUs per server should be enough?

And I understand that the more cluster nodes the merrier… But with two nodes (and one replica) you do in effect have a “mirror” - and that should keep the database running on the one available node until node two is back up, shouldn't it?
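As I understand it (a sketch of the mechanism, not official behaviour): with one replica the data is indeed mirrored, but the surviving node only starts answering for the other node's vBuckets once a failover (manual, or auto-failover if enabled) promotes the replicas to active. Until then those vBuckets have no live active copy, which matches the nodedown error above. Illustrated with a toy vBucket map, where node 0 is the survivor and node 1 the failed node:

```python
# Toy map for a 2-node, 1-replica bucket: each entry is [active, replica]
# (node indexes; -1 means "no copy"). Half the vBuckets are active on
# each node, with the replica on the other.
vbucket_map = [[0, 1] for _ in range(512)] + [[1, 0] for _ in range(512)]

def fail_over(vb_map, failed):
    """Promote replicas of the failed node; drop its copies from the map."""
    new_map = []
    for active, replica in vb_map:
        if active == failed:
            # Replica on the survivor becomes active; no replica remains.
            new_map.append([replica, -1])
        else:
            new_map.append([active, -1 if replica == failed else replica])
    return new_map

after = fail_over(vbucket_map, failed=1)
assert all(active == 0 for active, _ in after)
print("all 1024 vBuckets served by node 0 after failover")
```

So the mirror is there, but something has to actually perform the failover before the survivor serves everything.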