Windows Storage Replica: dealing with “WaitingForDestination” status
A few months ago we decided to change the configuration of some of our virtualization servers that were using Hyper-V Replica to maintain replicas we could fall back on in case of a disaster. We used to replicate those machines so we could restore them quickly in case of issues, or even start them directly on the destination host if needed.
I will cover the reasons why we decided to switch from Hyper-V Replica to Windows Storage Replica in a separate, longer post, but one thing we noticed quite soon during our first migrations was that our replicated volumes sometimes appeared to hang in a WaitingForDestination status. As the name implies, that status should mean that the source volume cannot contact the destination one, so replication is suspended.
We tried to diagnose the issue for some time. We were sure it wasn't a communication problem between the two machines: the connection from source to destination was clearly available, and some machines even had volumes replicated onto each other, that is, Server 1 was replicating volume X to Server 2 while Server 2 was replicating volume Z to Server 1. In those cases replication was suspended in one direction, say from Server 1 to Server 2, but working fine from Server 2 to Server 1. Clearly that was not a connection issue.
At first we thought the server would eventually re-establish a working connection to the destination after retrying from time to time, but in one case the source kept replication suspended for three days without self-healing. That was odd, because during the very same period the reverse replication kept working fine. While one replica was stuck in WaitingForDestination on the source, the same replication task was also displaying a Failed status on the destination instead of the usual ContinuouslyReplicating.
The replica was clearly stuck, and I noticed there are very few pages on the Internet dealing with this status and this situation. In a couple of cases someone suggested restarting the Storage Replica service, which is an option but not a very handy one: to restart the service we would have to shut down ALL of the virtual machines hosted on that server.
When we did shut down the VMs running on that host and restarted the Storage Replica service, replication started right away without issues, so it was clear that something was wrong with the service itself. Some users reported that they resolved a WaitingForDestination status by issuing a
Get-SRGroup | Sync-SRGroup
command, which is supposed to start or resume replication. In our case it did nothing, and restarting the service seemed to be the only option other than rebooting the machine itself. Not great, given that replication could stop at any moment and was not going to self-heal. Something similar occasionally happened with Hyper-V Replica too, but in most cases that service was able to restart replication on its own.
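Before resorting to service restarts, it helps to see exactly which group is in which state on each node. A minimal sketch using the StorageReplica module's query cmdlets (the exact property names on the group and replica objects may vary slightly between Windows Server versions):

```powershell
# Inspect Storage Replica state on the local node.
# Requires the StorageReplica module (Windows Server with the feature installed).
Import-Module StorageReplica

# List every replication group and the per-volume replication status
Get-SRGroup | ForEach-Object {
    $group = $_
    $group.Replicas | ForEach-Object {
        [pscustomobject]@{
            Group             = $group.Name
            DataVolume        = $_.DataVolume
            ReplicationStatus = $_.ReplicationStatus   # e.g. ContinuouslyReplicating, WaitingForDestination
        }
    }
}

# Show each partnership, i.e. which server/group is source and which is destination
Get-SRPartnership
```

Running this on both nodes makes the asymmetry described above obvious: one direction shows ContinuouslyReplicating while the other sits in WaitingForDestination.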
Then, purely by chance, we discovered a neat trick to restart replication in such cases. While trying to make sure that no data replication was actually in progress, we mounted the volume on the destination server by issuing the usual
Mount-SRDestination -ComputerName "DestinationServer" -Name "ReplicaGroupName" -TemporaryPath T:\
command, checked whether an older version of the files to replicate was present on the destination volume, and then dismounted it with
Dismount-SRDestination -ComputerName "DestinationServer" -Name "ReplicaGroupName"
and guess what? Yep, replication restarted. Not surprisingly it started with an InitialCopyBlock task that basically re-synced the whole source volume, but the good news was that we didn't have to dismount the source volume and thus stop all the affected services.
We tried this trick a couple more times in the same situation and it worked fine: in all those cases replication simply restarted, once even skipping the initial copy block and just resyncing the changes.
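Putting the whole workaround together, the sequence we ended up using looks roughly like this. The server name, group name, and T:\ path are placeholders; the temporary path must exist on the destination and have enough free space for the mount:

```powershell
# Workaround for a replica stuck in WaitingForDestination:
# briefly mount the destination copy, then dismount it.
# "DestinationServer" and "ReplicaGroupName" are placeholders.

# 1. Mount an accessible copy of the destination volume
Mount-SRDestination -ComputerName "DestinationServer" `
                    -Name "ReplicaGroupName" `
                    -TemporaryPath T:\

# 2. (Optional) peek at the mounted copy to confirm its contents
Get-ChildItem T:\ | Select-Object -First 10

# 3. Dismount it again; in our experience this kicked replication
#    back to life, usually via an InitialCopyBlock full resync
Dismount-SRDestination -ComputerName "DestinationServer" `
                       -Name "ReplicaGroupName"
```

No source-side downtime is required, which is the whole point compared to restarting the Storage Replica service.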
Now the question remains: why was the replica failing with such a weird status even though there clearly were no communication issues between those servers?
We have a theory, even if we haven't been able to test it yet. While our log volumes are large (several hundred GB), during the first week of deployment we were moving VMs both small and big, in some cases bigger than the log volume itself. So we suspect, without being able to prove it, that the replica sometimes got stuck because of that: the log volume was probably unable to absorb the total amount of data to transfer, and the reported status is not helpful at all in diagnosing it.
That remains unproven. We didn't verify it because, after the initial deployment, we stopped moving big files over those volumes. Besides, the most important thing was being able to restart replication.
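If the log-size theory is right, one possible mitigation would be enlarging the replication log so it can absorb one-off transfers bigger than the current log. A sketch, assuming the log volume has the free space (Set-SRGroup's -LogSizeInBytes parameter is the documented way to resize the log; the group name is a placeholder, and the change should be applied to the group on both nodes):

```powershell
# Check the current log size configured for each replication group...
Get-SRGroup | Select-Object Name, LogSizeInBytes

# ...and grow it, e.g. to 64 GB, if large one-off transfers
# (VM moves bigger than the log) are expected.
Set-SRGroup -Name "ReplicaGroupName" -LogSizeInBytes 64GB
```

We haven't validated that a bigger log actually prevents the WaitingForDestination hang; it just follows from the theory above.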
I hope this is helpful to someone else. 🙂 It's not fun to have this kind of issue with no clue about what's going on and only a few Web pages reporting the same problem.