Couchbase: possible reasons for 10x difference in cbs-pillowfight latency test, when running in a cluster mode

StackOverflow https://stackoverflow.com//questions/23013429

Question

So I ran a simple test:

cbs-pillowfight -h localhost -b default -i 1 -I 10000 -T

Got:

[10717.252368] Run
              +---------+---------+---------+---------+
[ 20 -  29]us |## - 257
[ 30 -  39]us |# - 106
[ 40 -  49]us |###################### - 2173
[ 50 -  59]us |################ - 1539
[ 60 -  69]us |######################################## - 3809
[ 70 -  79]us |################ - 1601
[ 80 -  89]us |## - 254
[ 90 -  99]us |# - 101
[100 - 109]us | - 43
[110 - 119]us | - 17
[120 - 129]us | - 48
[130 - 139]us | - 23
[140 - 149]us | - 14
[150 - 159]us | - 5
[160 - 169]us | - 5
[170 - 179]us | - 1
[180 - 189]us | - 3
[210 - 219]us | - 1
[270 - 279]us | - 1
              +----------------------------------------

Then a cluster was created by adding this node to another i7 node. The 'default' bucket is definitely smaller than 1 GB; it has 1 replica and 2 writers, and flush is not enabled.

Now, the same command (with both hosts used) produces:

  • 50% in 100-200 us, 1% in 200-900 us, 49% from 900 us up to "1 to 9 ms"! WTF.

After adding the -r (ratio) switch set to 90% SETs:

  • 25% in 100-200 us, 74% around 900 us, the remainder in "1 to 9 ms"!

So it seems that write performance suffers badly in clustered mode; why is there such a large, 10x drop? The network is clean, and there are no high-load services running.
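One plausible explanation for the bimodal histogram: in clustered mode the client hashes each key to a vBucket, and with two nodes roughly half the vBuckets live on the remote machine, so about half the operations pay a full network round trip while the rest stay local. A back-of-the-envelope sketch (the 100 us local and 1000 us remote figures below are assumptions for illustration, not measurements):

```python
# Sketch: expected latency mix when ~half the keys hash to a remote node.
# LOCAL_US and REMOTE_US are assumed values, not measured ones.
LOCAL_US = 100      # assumed latency of an op served by the local node
REMOTE_US = 1000    # assumed latency of an op that crosses the network

def expected_mean_us(remote_fraction):
    """Mean latency when remote_fraction of ops go over the wire."""
    return (1 - remote_fraction) * LOCAL_US + remote_fraction * REMOTE_US

# With vBuckets split evenly across two nodes, ~50% of ops are remote,
# so the mean jumps ~5x even though local ops are completely unchanged.
print(expected_mean_us(0.0))   # single-node case
print(expected_mean_us(0.5))   # two-node case, half the ops remote
```

Under these assumed numbers the mean goes from 100 us to 550 us, which matches the observed "half fast, half slow" split better than any uniform slowdown would.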


UPD1.

Forgot to add the ideal case: -r 100.

  • 25% in 100-200 us, 74% around 900 us.

This makes me think that:

  • A) the benchmark code is blocking somewhere (a quick read of it showed no signs of that);
  • B) the server is doing some unlogged magic on SETs that I don't understand. Reconfiguration? The replication factor? Isn't that nonsense for such a small dataset? That's what I'm trying to ask here.
  • C) a network problem. But Wireshark shows nothing.

UPD2.

Stopped both nodes and moved them to tmpfs. The "normal" responses improved by about 20 us, but the slow responses remain slow.

..[cut]
[ 50 -  59]us |## - 164
[ 60 -  69]us |#### - 321
[ 70 -  79]us |######## - 561
[ 80 -  89]us |########## - 701
[ 90 -  99]us |############ - 844
[100 - 109]us |########## - 717
[110 - 119]us |####### - 514
[120 - 129]us |##### - 336
[130 - 139]us |### - 230
[140 - 149]us |## - 175
[150 - 159]us |## - 135
[160 - 169]us |# - 81
..[cut]
[930 - 939]us | - 24
[940 - 949]us |## - 139
[950 - 959]us |##### - 339
[960 - 969]us |####### - 474
[970 - 979]us |####### - 534
[980 - 989]us |###### - 467
[990 - 999]us |##### - 342
[  1 -   9]ms |######################################## - 2681
[ 10 -  19]ms | - 1
..[cut]

UPD3: screenshot @ imgur


Solution

The problem was "solved" by switching to a three-node configuration on a gigabit network.

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow