Question

Short story: My DDS subscriber cannot see data from my DDS publisher. What am I missing?

Long story:

QNX 6.4.1 VM A -- Broken Publisher. IP ends with .113
QNX 6.4.1 VM B -- Working Publisher. IP ends with .114
Windows 7      -- Subscriber/Main Dev box. IP ends with .203
RTI DDS 5.0    -- Middleware version

I have a QNX VM (hosted on the network, not on my machine) that is publishing some data via RTI DDS. The data never shows up in my Windows 7 subscriber application.

Interestingly enough, I can put the same code on VM B, and the subscriber gets data. Thinking this must be a Windows 7 firewall issue, I swapped VM A's IP address with VM B's, but this did not help.

Using Wireshark, I can see some heartbeat traffic from VM A, but no data. From VM B, I see both the heartbeat traffic and the data. A sanitized Wireshark snippet is below.

Wireshark Output

NDDS_DISCOVERY_PEERS is set to include both multicast and the explicit IP address of the other side of each conversation. The QoS profiles are the same, and RTI Analyzer indicates the Match Analysis was successful (all green).

VM A: NDDS_DISCOVERY_PEERS=udpv4://239.255.0.1,udpv4://127.0.0.1,udpv4://BLAH.203

VM B: NDDS_DISCOVERY_PEERS=udpv4://239.255.0.1,udpv4://127.0.0.1,udpv4://BLAH.203

Windows 7 box: NDDS_DISCOVERY_PEERS=udpv4://239.255.0.1,udpv4://127.0.0.1,udpv4://BLAH.113,udpv4://BLAH.114

With those addresses included in NDDS_DISCOVERY_PEERS, other folks on the network can see DDS traffic from VM A with DDS Spy on their Windows 7 boxes. My Windows 7 box cannot.

The Windows 7 event log does not appear to show the firewall or WFP stopping the data packets.

RTI DDS Spy run from my Windows 7 machine shows that VM A's (0A061071) writers are visible on the network, but no data is being received. It also shows that the readers in my subscriber on my Windows 7 machine are visible, though they show up at an odd IP address.

Bonus question (out of curiosity only, NOT the primary question): why does traffic on my local machine show up in DDS SPY as 192.168.11.1 instead of my machine's IP or even 127.0.0.1?

RTI DDS SPY Output

Main Question: What am I missing?

Update: route print on my Windows 7 box appears to show that I have joined a multicast group with VM A; netsh interface ip show joins seems to concur.

Investigation Update:

  1. I rebooted the VM to no effect.

  2. I rebooted the Windows box to no effect.

  3. I removed the multicast from the NDDS_DISCOVERY_PEERS environment variables on both sides to no effect.

  4. The Windows 7 box has three network interfaces (plus loopback): one LAN connection and two (unrelated) VM adapters. We are working with the LAN connection. The QNX VM has one network interface (plus loopback). Note that the working VM and the broken VM use different Ethernet drivers, as they are slightly different flavors of QNX 6.4.1: the broken one's interface is wm0, and the working one's is en0. I don't believe this is the issue, but it is a difference.

  5. I ran DDS Spy on the "broken" QNX VM while it was playing back, and I got DDS data. I don't have a good way to sniff the network between the VM host and my Windows 7 machine to see whether the data makes it out of the interface, but the transmitted packet count on the QNX VM's Ethernet interface indicates that it is definitely transmitting something, and the Wireshark captures on the Windows 7 machine itself show that at least some traffic is making it through.

  6. Other folks on the LAN here can see the DDS traffic from the "broken" VM, which leads me to believe this is a Windows setup issue rather than a broken VM--I just can't see what it could be. I've re-checked the firewall, to no avail; I would have thought that if it were a firewall issue, the problem would have gone away when I swapped IP addresses between VM A and VM B. In any case, the Windows 7 firewall is currently off, and the problem persists.

  7. Below are several screens of Wireshark output. I skipped a bunch between the third and the fourth, as after the fourth, the traffic tended to look like the bottom of the fourth until the end.

Image 1, Image 2, Image 3 (skipped a bunch here), Image 4 (pretty much continues on like the last 11 lines above)

What else should I try?

Update: To answer Rose's question below: running rtiddsping -publisher on the bad VM and rtiddsping -subscriber on the Windows box works appropriately.

I wonder if this issue is caused by the weird IP address. The address it happens to publish and somehow latch onto belongs to a local VM Ethernet adapter (separate from VM A). See the screenshot below.

Win7 Ipconfig

The address I would like it to attach to is 10.6.6.203. Any way I can specify that?


Solution

More than a year later this happened to me again with a different virtual machine. I had it working yesterday, so I was very suspicious. I scoured all my code changes for the past 24 hours for issues, but didn't find any. Then I decided to see if IT had pushed any patches to my computer.

Guess what? The Windows Firewall had been aggressively updated since the day before: rules missing or changed, etc. The log said packets were being dropped. I opened up the firewall filters a bit, and suddenly everything worked again. I hesitate to close this issue, as I am not 100% sure this was exactly the same as last year, but it feels very similar. I suspect that last year the firewall settings were not logging the packet drops.

Long and short of it: if DDS suddenly stops working, check your firewall settings.

OTHER TIPS

A couple of things to try:

  1. Try running rtiddsping -publisher on the broken VM and rtiddsping -subscriber on Windows. This has two advantages:

    • The data type is small and well-known, so if there is some problem with the data being fragmented due to the different Ethernet drivers, it will not occur with rtiddsping, which may help track down the problem.
    • rtiddsping prints out when the publisher and subscriber discover each other, so you will be able to confirm that discovery is completing correctly on both sides. I am guessing discovery is working, because Analyzer shows both applications, but it is good to confirm.
  2. If you see the same problem with rtiddsping that you see with your application, increase the verbosity with rtiddsping -verbosity 3, and then -verbosity 5. At the highest verbosity level, rtiddsping prints (a lot of) additional output, which may give a hint about what is happening.

To answer your bonus question about Spy: the reason Spy is showing that IP address is that it is one of the addresses being announced as part of discovery. During discovery, a DomainParticipant can announce up to four IP addresses at which it can be reached. Spy chooses one of those to display, but it may not be the address actually being used to communicate with the application. If your machine does not have any interface with the 192.168.11.1 address, that could indicate a larger problem. (It may be normal, though, as long as the correct IP is one of the four that are announced.)
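On the related question of making the participant use the desired LAN address: RTI Connext lets you restrict the built-in UDPv4 transport to specific interfaces via participant properties. The fragment below is a hedged sketch, not verified against 5.0 specifically: the property name allow_interfaces_list comes from RTI's UDPv4 transport documentation (some releases document it as allow_interfaces), and 10.6.6.203 is a placeholder for your LAN adapter's address.

```xml
<!-- Hypothetical participant QoS fragment: restrict the builtin UDPv4
     transport to a single interface. Verify the exact property name
     (allow_interfaces_list vs. allow_interfaces) for your Connext version. -->
<participant_qos>
  <property>
    <value>
      <element>
        <name>dds.transport.UDPv4.builtin.parent.allow_interfaces_list</name>
        <value>10.6.6.203</value>
      </element>
    </value>
  </property>
</participant_qos>
```

With the transport pinned to one interface, the participant should stop announcing the unrelated VM-adapter address during discovery.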

Looking through the packet trace images, there is nothing that is obviously the problem. A few things I notice:

  • There seems to be a normal pattern of heartbeats/ACKNACKs in the final packet trace image. This indicates that there is some bidirectional communication between the two applications.
  • It is difficult to tell from these images whether the DATA being sent from .113 to .203 consists of participant-to-participant messages or real discovery messages - except for two packets: packet #805 and packet #816 (fragments 811-815) look like discovery announcements being sent to .203. This indicates that you have at least four entities (DataWriters or DataReaders) in your application on .113.

So, discovery data is being sent by the application on .113. It is being received and reassembled by Wireshark, but that does not necessarily mean it was received correctly by the application.

Packet #816 has a heartbeat on the end of it. It is possible that packet #818 or #819 is the ACKNACK responding to that heartbeat, but I can't be sure from the image. The next step is to look at those ACKNACKs from .203 to .113 to see if .203 thinks it has received all the discovery data. Here is an example of a HB/ACKNACK pair where a discovery DataReader has received all data:

Submessage: HEARTBEAT
... 
firstSeqNumber: 1
lastSeqNumber: 1

The heartbeat sequence number is 1, which indicates it has only sent an announcement about a single DataReader.

Submessage: ACKNACK
... 
readerSNState: 2/0:
    bitmapBase: 2
    numBits: 0

The readerSNState is 2/0, meaning it has received everything before sequence number 2 and nothing is missing. If there is anything other than 0 in the bitmap, it indicates that the DataReader did not receive some data.
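As a sanity check on that arithmetic, here is a small sketch (plain Python, not an RTI API) of how a readerSNState is interpreted: everything below bitmapBase is acknowledged, and set bits in the bitmap mark missing sequence numbers. Bit ordering is simplified here; the RTPS wire format packs the bitmap into 32-bit words.

```python
def missing_seq_numbers(bitmap_base, num_bits, bitmap):
    """Interpret an RTPS ACKNACK readerSNState (simplified).

    Everything below bitmap_base has been received; bit i of `bitmap`
    (when set) marks sequence number bitmap_base + i as still missing.
    """
    return [bitmap_base + i for i in range(num_bits) if (bitmap >> i) & 1]

# The 2/0 state from the trace: all data before seq 2 received, nothing missing.
print(missing_seq_numbers(2, 0, 0b0))   # -> []

# A reader that missed seq 2 would instead NACK it: base=2, 2 bits, bitmap 0b01.
print(missing_seq_numbers(2, 2, 0b01))  # -> [2]
```

If the ACKNACKs from .203 ever show a non-empty list like the second case, the subscriber side is dropping discovery data.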

If you can confirm that the application is receiving all the discovery data correctly, it will be helpful to use a Wireshark filter that shows only user data, since the images do not distinguish discovery traffic from user traffic.

Wireshark filter for just rtps2 user data: rtps2 && (rtps2.traffic_nature == 3 || rtps2.traffic_nature == 1)

We had a similar issue with this. Here is the environment in a very summarized way:

  • A publisher
  • A working subscriber (laptop)
  • A non-working subscriber (desktop)

Both subscribers ran exactly the same software (the desktop was cloned from the laptop with Clonezilla), but rtiddsspy was blind from the desktop's point of view. The opposite direction worked well, however: the publisher machine's rtiddsspy saw the desktop. The laptop and publisher machines always worked well together, as did the laptop and desktop (they saw each other's subscriptions).

The workaround for this (based on https://community.rti.com/content/forum-topic/discovery-issues) was to increase the MTU on the desktop NIC. Don't ask me why, but it worked.

EDIT: At the beginning, the MTU on the publisher was set to a higher value than on the subscriber, so we changed the subscriber's MTU to match the publisher's.
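The connection to fragmentation can be sketched numerically. This is plain back-of-the-envelope IPv4 arithmetic (RTI also fragments large samples at the RTPS level with DATA_FRAG, which this ignores), assuming a 20-byte IP header, an 8-byte UDP header, and the 8-byte fragment-offset granularity: a publisher with a larger MTU emits datagrams that must be carried as several IP fragments on a smaller-MTU path, and any one lost fragment loses the whole sample.

```python
def ip_fragment_count(udp_payload_len, mtu, ip_header=20, udp_header=8):
    """Rough count of IPv4 fragments for one UDP datagram at a given MTU.

    Each fragment carries its own IP header, and every fragment payload
    except the last must be a multiple of 8 bytes. Illustration only.
    """
    total = udp_payload_len + udp_header       # bytes to carry after the IP header
    per_frag = (mtu - ip_header) // 8 * 8      # usable payload bytes per fragment
    return -(-total // per_frag)               # ceiling division

# A 9000-byte DDS sample on a standard 1500-byte MTU link:
print(ip_fragment_count(9000, 1500))  # -> 7 fragments
print(ip_fragment_count(1000, 1500))  # -> 1 (fits in a single frame)
```

A mismatch like the one described above means one side's "single packet" is the other side's multi-fragment burst, which is exactly where flaky NIC drivers tend to misbehave.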

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow