Like a few others out there, I recently encountered the dreaded -1007781356 DCOM error. It started when a client notified me that, after migrating 8000+ users from one pool to another, a small handful, about 150, wouldn’t move. Most of those users showed the following error: “Distributed Component Object Model (DCOM) operation begin move away failed.” with “RollbackMoveAway failed -1007781356”.
We found that about 20 of those users had a slightly different error, DCOM -1007200250. These errors gave a bit of additional detail in the message: “Distributed Component Object Model (DCOM) operation begin move away failed because user was not found in database.” which can be seen below.
After some additional investigation, other issues became apparent. The Backup Service was showing an error state.
Further, there were LS Backup Service 4073 errors showing “Microsoft Lync Server 2013, Backup Service user store backup module detected items having pool ownership conflict during import.” The list of users shown included many of those who wouldn’t move. This confirmed that we were dealing with a pool ownership conflict, where the user partially exists in multiple pool SQL databases.
Export-CsUserData was also failing with the error: “Export-CsUserData : Timeout expired. The timeout period elapsed prior to completion of the operation or the server is not responding. This failure occured while attempting to connect to the Principle server.”
Initial Solution Attempt
While I wasn’t entirely sure that all of these issues were related, the timeline in which they started manifesting lined up closely.
The first thing I did? I’m not ashamed to admit that I went for the search engines. There’s no use reinventing the wheel if this is a common problem. Unfortunately, I didn’t find much. There were similar errors, but their resolutions didn’t seem to line up, as our problem was only affecting a handful of users. Further, any approach I took had to be handled with extreme care, as this environment has tens of thousands of users. The blog I encountered with the most helpful information was John Cook’s.
With an identical error and symptoms, he was able to contact Microsoft PSS, who had a tool that could resolve the issue. Before we headed down that path, I wanted to see if there were any additional workarounds I could get at for our environment.
Taking a cue from John’s blog, the VerifyUserDataReplication.exe tool from the Lync Resource Kit gave me an output of the identical user set found in the LS Backup Service 4073 errors, and it also lined up nicely with the users who refused to move.
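For anyone wanting to do the same comparison without eyeballing three lists, here is a minimal Python sketch. It assumes you have already pulled the user addresses out of each source by hand or by script; the function name, the source labels and the sample addresses are all hypothetical and not part of any Lync tooling.

```python
def cross_reference(failed_moves, event_4073_users, replication_errors):
    """Compare three lists of user addresses.

    Returns (in_all, partial):
      in_all  - users present in every source
      partial - users missing from at least one source, mapped to the
                sources they DID appear in
    """
    def normalize(users):
        # SIP addresses are compared case-insensitively; trim stray whitespace.
        return {u.strip().lower() for u in users if u.strip()}

    sources = {
        "failed_moves": normalize(failed_moves),
        "event_4073": normalize(event_4073_users),
        "replication": normalize(replication_errors),
    }
    in_all = set.intersection(*sources.values())
    in_any = set.union(*sources.values())
    partial = {
        user: sorted(name for name, users in sources.items() if user in users)
        for user in in_any - in_all
    }
    return in_all, partial
```

Users that land in `partial` are the ones worth a second look, since they show up in only some of the three places.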
We had a reasonably good backup of user data despite the Export-CSUserData timeouts, and located one of the users who hadn’t logged in for quite some time to use as a guinea pig. Using our guinea pig account, we were able to move the user with the command:
Move-CsUser -Identity <Identity> -Target <OtherPool> -Force
The -Force parameter is what got us there: it ignores the user data, which in our case was what prevented us from moving the account. After that, we were able to run Update-CsUserData to merge the contacts back in for this user from our backup. The remaining users were scheduled for a forced move and restore that night.
As a side note, it was comforting to see sharp guys out there such as Flinchböt fighting the same issues at the same time and coming up with the same approach. http://flinchbot.wordpress.com/2014/09/17/moving-immovable-users/
Almost, but Not Quite
The rest of the users moved successfully, accomplishing the initial goal. However, the LS Backup Service 4073 errors and VerifyUserDataReplication.exe were still reporting the issue. A sample of the VerifyUserDataReplication.exe output can be seen below.
Info: reading batches served by jpprdl3sql1.adms.lyncfix.com\lync13Tokyo from backup pool.
Info: 6247 batches are returned from deprdl3sql1.adms.lyncfix.com\lync13Berlin.
Info: 116161 items are returned from deprdl3sql1.adms.lyncfix.com\lync13Berlin.
Info: reading batches served by jpprdl3sql1.adms.lyncfix.com\lync13Tokyo from source pool.
Info: 6248 batches are returned from jpprdl3sql1.adms.lyncfix.com\lync13Tokyo.
Info: 116366 items are returned from jpprdl3sql1.adms.lyncfix.com\lync13Tokyo.
Info: comparing batches served by jpprdl3sql1.adms.lyncfix.com\lync13Tokyo in source pool and backup pool.
Error: batch bf36c405-0396-429e-bac3-001dd81d17b6 has item 6abec8cb-5857-4cc4-8c44-dd99d9e47206-urn:hcd:email@example.com whose partial version 3 in source pool is less than or equal to batch’s partial version 4 in backup pool. It cannot find same item in backup pool.
Error: batch bf36c405-0396-429e-bac3-001dd81d17b6 has item 6abec8cb-5857-4cc4-8c44-dd99d9e47206-urn:lcd:firstname.lastname@example.org whose partial version 3 in source pool is less than or equal to batch’s partial version 4 in backup pool. It cannot find same item in backup pool.
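With a long list of these errors, it can help to group them by batch. The following is a rough Python sketch that assumes the tool's output has been saved to a text file and that the error lines follow the wording shown in the sample above; the regular expression and function name are my own, not something shipped with the Resource Kit.

```python
import re
from collections import defaultdict

# Matches the "Error: batch ... has item ..." lines from the sample output.
# "batch.s" tolerates either a straight or a curly apostrophe.
ERROR_RE = re.compile(
    r"Error: batch (?P<batch>[0-9a-fA-F-]{36}) has item (?P<item>\S+) "
    r"whose partial version (?P<source>\d+) in source pool is less than or "
    r"equal to batch.s partial version (?P<backup>\d+) in backup pool"
)

def parse_replication_errors(lines):
    """Group error lines by batch ID.

    Returns {batch_id: [(item, source_version, backup_version), ...]}.
    Lines that are not errors (for example the Info lines) are skipped.
    """
    by_batch = defaultdict(list)
    for line in lines:
        match = ERROR_RE.search(line)
        if match:
            by_batch[match["batch"]].append(
                (match["item"], int(match["source"]), int(match["backup"]))
            )
    return dict(by_batch)
```

Calling `parse_replication_errors(open("verify_output.txt", encoding="utf-8"))` (the file name is just a placeholder) gives one entry per batch, and the user address sits at the tail of each item string.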
Clearly, the move got us past the initial DCOM error by ignoring the database issue, but it didn’t clean up the offending database records. The next approach was to completely disable a user and try again. If we couldn’t resolve the issue with a forced move, surely removing the user from Lync would do the trick. We figured we could always recreate the user later if this approach worked. We went back to our guinea pig account and ran Disable-CsUser. This command deletes all the attribute information related to Lync Server from an Active Directory user account.
Even with the account “removed” from Lync, the issue persisted. The user still showed up in the LS Backup Service 4073 errors and the VerifyUserDataReplication.exe output. We re-enabled the user and reimported the user data from our backup. It makes sense that this approach didn’t work: it only removes the Active Directory attributes, and there is no backend database cleanup. But if the forced move and the delete don’t do it, what will?
The final answer was simple. Now that the DCOM error was out of the way thanks to the Move-CsUser -Force command, we could freely move the users back to the old pool, and then move them again to the final destination. Success! Using this method, the pool conflict error was resolved, the username was removed from the event log, and the VerifyUserDataReplication.exe output no longer reported the account as an issue.
Moving the previously force-moved users back to their old pool, then back again to their new home, cleaned up the database enough that not only did our events disappear, but Export-CsUserData stopped experiencing its timeouts and the Backup Service returned to a normal state. We’re now back to a healthy environment.
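To make the order of operations explicit, here is a small Python sketch that writes out the sequence of commands for one stuck user. It is only an illustration of the steps described above: the function name, the sample identity and the pool FQDNs are hypothetical, and the Update-CsUserData step is left as a comment because its parameters depend on how your backup was taken.

```python
def build_move_plan(identity, old_pool, new_pool):
    """Return the PowerShell commands, in order, for one stuck user."""
    return [
        # 1. Forced move: ignores the user data that was blocking the move.
        f'Move-CsUser -Identity "{identity}" -Target "{new_pool}" -Force',
        # 2. Restore step, left as a reminder rather than a real command.
        "# Restore the user's data from backup with Update-CsUserData",
        # 3. Round trip: back to the old pool, then on to the final pool.
        f'Move-CsUser -Identity "{identity}" -Target "{old_pool}"',
        f'Move-CsUser -Identity "{identity}" -Target "{new_pool}"',
    ]

if __name__ == "__main__":
    for command in build_move_plan(
        "sip:user@example.com", "oldpool.example.com", "newpool.example.com"
    ):
        print(command)
```

Printing the plan rather than running it is deliberate. In an environment this size, I would review each command and try it on a single test account before touching anyone else.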
A Special Thanks
I’d like to extend a special thanks to John A Cook and Flinchböt for sharing their experiences and letting me talk through mine. Please feel free to reach out to me here or on Twitter @CAnthonyCaragol if you’re experiencing issues of your own.