FRS was implemented in Windows 2000 to
perform functions similar to the LMRepl (LAN Manager Replication) service in Windows NT. FRS was
designed to replicate the contents of SYSVOL (GPOs and scripts) and
Distributed File System (DFS) replica link-targets. NTFRS, or File
Replication Service, communicates with replication partners to determine
when changes are made to the replica set (SYSVOL or DFS) and replicates
that data to all downstream partners. It is a multithreaded,
multimaster replication engine. FRS relies on AD for its replication
topology (NTDS connection objects) and specific replica set information,
such as partners. FRS is dependent upon AD objects and AD replication,
which in turn depend on connectivity, DNS, and RPC (Remote Procedure
Call). This is vital to remember when troubleshooting.
1. Basic Operation
Basic FRS operation is illustrated in Figure 1.
In step 1, a GPO is modified on DC1. In step 2, a temporary copy of the file
is placed in the staging directory. In step 3, a change order is
issued to the downstream partner, DC2; and in step 4, the downstream
partner, DC2, receives the change order and pulls the file from DC1's
staging directory into its own Do_Not_Remove_NtFrs_Preinstall_Directory.
The file is then moved to its proper folder. When all downstream
partners have pulled the file, the file is removed from DC1's staging
directory.
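To make that flow concrete, here is a minimal Python sketch of the sequence in Figure 1: a change is staged on the upstream partner, a change order goes to each downstream partner, each partner pulls the file into its preinstall area and then installs it, and the staging copy is removed once every partner has pulled it. The class and method names are invented for this illustration and do not correspond to any real FRS interface.

```python
# Illustrative sketch of the FRS change-order flow in Figure 1.
# Names (ReplicaMember, receive_change_order, etc.) are made up for this
# example; they do not correspond to any real FRS API.

class ReplicaMember:
    def __init__(self, name):
        self.name = name
        self.staging = {}          # files staged for outbound replication
        self.preinstall = {}       # files pulled but not yet installed
        self.replica_tree = {}     # the live SYSVOL/DFS content
        self.downstream = []       # downstream partners

    def local_change(self, path, data):
        """Steps 1-2: a file changes locally and a copy is staged."""
        self.replica_tree[path] = data
        self.staging[path] = data
        # Step 3: issue a change order to every downstream partner.
        pulled = [partner.receive_change_order(self, path)
                  for partner in self.downstream]
        # Step 5: once all partners have pulled the file, clear staging.
        if all(pulled):
            del self.staging[path]

    def receive_change_order(self, upstream, path):
        """Step 4: pull from the upstream staging area, then install."""
        self.preinstall[path] = upstream.staging[path]        # pull
        self.replica_tree[path] = self.preinstall.pop(path)   # move into place
        return True


dc1, dc2 = ReplicaMember("DC1"), ReplicaMember("DC2")
dc1.downstream.append(dc2)
dc1.local_change(r"Policies\{GUID}\gpt.ini", "version=2")
print(dc2.replica_tree)   # DC2 now holds the updated file
```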
2. Replica Set Structure
There are two basic types of replica sets: SYSVOL and DFS, sometimes referred to as non-SYSVOL. Figure 2
shows this concept. SYSVOL contents exist only on DCs. Thus, all DCs
are in a replica set that replicates SYSVOL information. DFS servers,
however, contain user-defined data and can exist on DCs, member servers,
or both. Note the two DCs in Figure 2 that participate in both the SYSVOL and DFS replica sets.
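The overlap in Figure 2 can be expressed as a simple membership model. The sketch below uses illustrative server names only; it shows a couple of DCs belonging to both the SYSVOL replica set and a DFS replica set, while a member server belongs only to the DFS set.

```python
# Minimal membership model for Figure 2; server and set names are illustrative.
from dataclasses import dataclass, field

@dataclass
class ReplicaSet:
    name: str
    kind: str                       # "SYSVOL" or "DFS"
    members: set = field(default_factory=set)

sysvol = ReplicaSet("Domain System Volume", "SYSVOL")
dfs    = ReplicaSet("Sales DFS link", "DFS")

# SYSVOL members are always DCs; DFS members can be DCs or member servers.
sysvol.members.update({"DC1", "DC2", "DC3"})
dfs.members.update({"DC2", "DC3", "FILESRV1"})

# The DCs that appear in both replica sets, as in Figure 2:
print(sysvol.members & dfs.members)   # {'DC2', 'DC3'}
```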
SYSVOL File Structure
This process is pretty simple when everything is
working, but unfortunately things break. Much of this depends on AD
replication, so if AD replication to a certain DC breaks, then FRS also
gets in trouble. Because AD is also a multimaster engine, each DC has
its own version of reality, until replication takes place and all DCs
share the same data. Thus, a DC that falls behind in replication is out
of touch with reality.
A junction point (also referred to as reparse point,
directory junction, and volume mount point) is a physical location on a
hard disk that points to another location on a disk or storage device.
Think of junction points as links in the file system—sort of a tunnel
that binds two ends into one because it connects two locations on the
disk to each other. SYSVOL uses junction points to manage a single
instance store by placing a junction point at the %systemroot%\sysvol\sysvol directory. For more information, refer to Microsoft KB article 324175 “Best Practices for SYSVOL Maintenance” and KB article 205524 “How to Create and Manipulate NTFS Junction Points.”
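If you want to verify that a path really is a junction before copying or deleting anything, the hedged sketch below uses Python's standard library on Windows to test for the reparse-point attribute. The path shown follows the default SYSVOL layout described above and may differ on your system.

```python
# Check whether a path carries the reparse-point attribute (junctions,
# volume mount points, and symlinks all set it). Windows-only; Python 3.5+.
import os
import stat

def is_reparse_point(path: str) -> bool:
    attrs = os.stat(path, follow_symlinks=False).st_file_attributes
    return bool(attrs & stat.FILE_ATTRIBUTE_REPARSE_POINT)

# Example: the SYSVOL junction discussed above (adjust for your system).
path = os.path.expandvars(r"%systemroot%\sysvol\sysvol")
for entry in os.scandir(path):
    kind = "junction/reparse point" if is_reparse_point(entry.path) else "normal directory"
    print(entry.path, "->", kind)
```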
3. FRS Design Considerations
Any discussion of FRS topology design must include AD
replication topology design. Although FRS contains unique elements and
objects, such as replica members, subscriber objects, SYSVOL data, and
so forth, replication takes place over the framework of AD replication
components, such as NTDS connection objects, sites, and site links (to
name the important ones). Thus, an inefficient AD replication topology
results in an inefficient FRS topology, with errors and failures in
both.
In FRS, it's not possible to provide strictly coherent data in
a multimaster server environment composed of tens or hundreds of
members, because not all servers might be connected at the same time, and
even if they were, the cost to synchronize them would be prohibitive. Rather, the
contents of a replica tree in FRS are loosely coherent, meaning that
after all outside changes have stopped and all objects have replicated, all
replica trees on all connected members will hold the same data.
An efficient topology can minimize the effects of this latency.
Good topology is directly related to the overall
speed of replication. Tweaking the replication schedule from the default
can have far-reaching consequences. For instance, you might have a goal
of reducing replication traffic over a certain link between sites, so
you adjust the replication schedule to replicate every three hours and
only between 7 p.m. and 5 a.m. each day. This would significantly
increase the time to consistency. For example, if a Group Policy with a
security setting change were created at 10 a.m., it would not begin
replicating until 7 p.m. that evening. With replication latency in the
mix, it might not reach all the DCs by 5 a.m. the next morning and would
have to wait for the next replication window. The point here is not
that a restrictive schedule is good or bad, but that you must understand
the consequences when you implement one.
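To put numbers to that example, the short sketch below computes the first replication opportunity for a change made at 10 a.m. when the link replicates only every three hours between 7 p.m. and 5 a.m. The schedule values are simply the ones from the paragraph above; the helper function is hypothetical.

```python
# Worked example for the schedule above: replication allowed only from
# 19:00 to 05:00, every 3 hours. A change made at 10:00 waits until 19:00.
from datetime import datetime, timedelta

def next_replication(change_time, window_start=19, window_end=5, interval_hours=3):
    """Return the first scheduled replication after change_time (hypothetical helper)."""
    t = change_time.replace(minute=0, second=0, microsecond=0)
    for hour_offset in range(48):
        candidate = t + timedelta(hours=hour_offset)
        in_window = candidate.hour >= window_start or candidate.hour < window_end
        on_interval = candidate.hour % interval_hours == window_start % interval_hours
        if candidate >= change_time and in_window and on_interval:
            return candidate
    return None

change = datetime(2004, 5, 17, 10, 0)     # GPO edited at 10 a.m.
print(next_replication(change))            # 2004-05-17 19:00:00
```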
4. Common FRS Problems and Solutions
FRS was very much an evolving product in Windows
2000, and a common troubleshooting step was checking whether
you had the latest FRS hotfix. During Windows Server 2003 development,
Microsoft took these problem issues and created better solutions, or at
least workarounds, and ported them into Windows 2000 service packs
starting with SP2. As of this writing, SP4 has been released and is
current with all FRS hotfixes. However, if you have a problem with FRS,
contact Microsoft to see whether there is a newer FRS hotfix. With luck,
these hotfixes will become less common, and FRS will be more stable. SP4
and Windows Server 2003 have made FRS fairly stable and much less prone to
error. However, this depends on the Administrator being aware of the
issues and understanding how to design the FRS structure to take
advantage of the fixes. These issues and best practices are described in
this section.
Junction Points
Removal of the junction point causes FRS replication
to fail. Likewise, copying the junction point creates another SYSVOL
tree. I saw a case where an Admin copied the entire SYSVOL tree to his
DC's desktop for backup. Because he copied the junction point, it set up
a duplicate SYSVOL tree and replicated. That DC had two SYSVOL trees.
Deleting the whole directory would have wiped out the SYSVOL tree on all
DCs in the domain. We resolved it by using the ResKit utility,
LinkD.exe, to delete the junction point in the duplicate directory, and
then deleted the directory. What he should have done was to copy just
the contents of %systemroot%\sysvol\domain\policies and %systemroot%\sysvol\domain\scripts to a directory outside of %systemroot%\sysvol.
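A safer backup, along the lines of what was just described, is to copy only the content folders and never the junction. The sketch below is a minimal illustration; the paths assume the default SYSVOL layout and a made-up backup location, so verify them on your own DC before using anything like this.

```python
# Back up only the SYSVOL content (Policies and scripts), never the
# sysvol\sysvol junction. Paths assume the default layout; verify them first.
# Requires Python 3.8+ for dirs_exist_ok.
import os
import shutil

sysvol_domain = os.path.expandvars(r"%systemroot%\sysvol\domain")
backup_root = r"C:\SysvolBackup"    # hypothetical location outside %systemroot%\sysvol

for folder in ("Policies", "scripts"):
    src = os.path.join(sysvol_domain, folder)
    dst = os.path.join(backup_root, folder)
    if os.path.isdir(src):
        shutil.copytree(src, dst, dirs_exist_ok=True)
        print(f"Copied {src} -> {dst}")
```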
Morphed Directories
Morphed directories and files have been replicated,
but an exact copy already exists on the target. FRS protects the
original by renaming the incoming duplicate, which is
referred to as a morph. These duplicate
directories or files are renamed by appending
_NtFrs_xxxxxxxx to the name, where xxxxxxxx is a random eight-character identifier. This
usually occurs when an Authoritative Restore forces an entire
SYSVOL tree to replicate to multiple replica set members at the same
time. The Administrator must decide which is the newest, most correct
version to keep. If it's the morphed version, delete the original and
rename the morphed folder to remove the _NtFrs_xxxxxxxx suffix. If
it's the original, simply delete the morphed version. Morphed directory
contents are not replicated, and if the morphed copy holds more recent data, you might
lose changes if the cause of the morphed directory is not resolved. For
more information, see Microsoft KB article 328492, “Folder Name Is
Changed to FolderName_NtFrs_<xxxxxxxx>.”
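If you decide the morphed copies are the ones to keep, the renaming step can be scripted. The sketch below finds folders whose names end in the _NtFrs_xxxxxxxx pattern under the Policies folder (default path assumed, and the suffix is assumed to be hexadecimal) and performs the rename that strips the suffix; treat it as an illustration and review each conflict by hand before deleting or renaming anything.

```python
# Find morphed folders (Name_NtFrs_xxxxxxxx) and rename them back when the
# original no longer exists. Illustration only; review each conflict first.
import os
import re

# xxxxxxxx is assumed to be eight hexadecimal characters here.
MORPH_SUFFIX = re.compile(r"^(?P<base>.+)_NtFrs_[0-9a-fA-F]{8}$")

policies = os.path.expandvars(r"%systemroot%\sysvol\domain\Policies")

for entry in os.scandir(policies):
    match = MORPH_SUFFIX.match(entry.name)
    if entry.is_dir() and match:
        original = os.path.join(policies, match.group("base"))
        print(f"Morphed copy: {entry.path}")
        if not os.path.exists(original):
            # Keep the morphed copy: rename it back to the original name.
            os.rename(entry.path, original)
            print(f"Renamed to {original}")
        else:
            print(f"Original still exists at {original}; decide which to keep.")
```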
Version Vector Joins
When you join a new DC to the domain, a version vector join (vvjoin) is performed between the new DC and each of the other DCs in the domain. This also takes place when replication fails over to a new partner DC. In Windows 2000,
this was a parallel process that caused a lot of grief because it pulled
the entire SYSVOL tree from every DC in the domain at the same time.
This caused problems not only in network performance, but also
in DC performance, because it had the potential to take a DC offline
during the process. Windows Server 2003 and Windows 2000 SP3+ corrected
this by making it a serialized process. The new DC performs a vvjoin during
promotion and, after that is complete, contacts the other DCs in the domain
one at a time for changes. If the source DC is up-to-date, the vvjoin is
still performed against the others, but no replication takes place.
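The serialized behavior can be pictured with a few lines of Python. Everything below is a placeholder model, not real FRS code: the new DC walks its partner list one at a time and pulls only what it lacks, so a partner that is already consistent triggers no additional full pull.

```python
# Toy model of the serialized vvjoin in Windows 2000 SP3+/Windows Server 2003.
# All names and structures here are placeholders for illustration.
from dataclasses import dataclass, field

@dataclass
class DC:
    name: str
    files: dict = field(default_factory=dict)   # path -> version number

def vvjoin_serialized(new_dc, partners):
    """Contact partners one at a time; pull only what the new DC lacks."""
    for partner in partners:                     # serialized: one partner at a time
        changes = {p: v for p, v in partner.files.items()
                   if new_dc.files.get(p, -1) < v}
        if not changes:
            continue                             # vvjoin done, nothing to replicate
        new_dc.files.update(changes)
        print(f"{new_dc.name} pulled {len(changes)} change(s) from {partner.name}")

dc_new = DC("DC-NEW")
dc1 = DC("DC1", {"gpt.ini": 3, "logon.cmd": 1})
dc2 = DC("DC2", {"gpt.ini": 3, "logon.cmd": 1})   # already consistent with DC1
vvjoin_serialized(dc_new, [dc1, dc2])
# DC-NEW pulls from DC1; DC2 holds nothing new, so no second full pull occurs.
```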
Staging Area Problems and Excessive FRS Replication
This is an oldie but a goodie. There are still many
Administrators who are not aware of this important issue. Changes made
to files in SYSVOL create temporary files in the staging directory %systemroot%\sysvol\staging\domain. The junction point %Systemroot%\sysvol\staging areas\<domain name>
points to that location as well, so it appears that the files are
duplicated when they are not. The junction point is simply a pointer.
The file stays in the staging directories until all downstream partners
have pulled it.
Some programs that scan files, such as antivirus
and defragmentation programs, as well as file system policies applied
to the SYSVOL tree through Group Policy, modified the security
descriptors of the files. Each modification forced a change order, causing all files
in the SYSVOL tree to be copied to the staging directories. This
dumped huge numbers of files into the staging directories,
exceeded the 660MB default limit, and caused FRS
replication to stop. There was a Registry value to increase this limit,
but that just gave you some breathing room until you could
resolve the problem (see Microsoft KB article 264822, “File Replication
Service Stops Responding When Staging Area is Full”). This behavior
has changed in Windows Server 2003 and Windows 2000 SP3+. As noted in
Microsoft KB article 307319, “Changes to the File Replication Service,”
when the staging area reaches 90% of its capacity, FRS deletes the oldest
staging files until the directory is only 60% full, and keeps replication going.
This repeats whenever the directory fills up again, never allowing the
staging files to exceed the limit.
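The 90 percent/60 percent behavior described in KB article 307319 amounts to a simple cleanup loop. The sketch below models it with an in-memory list of staging files; only the percentages and the 660MB default come from the text, and everything else is an invented example.

```python
# Model of the SP3+/Windows Server 2003 staging cleanup described above:
# when the staging area reaches 90% of its limit, the oldest staging files
# are deleted until usage drops to 60%. The file list is purely in-memory.

STAGING_LIMIT_KB = 660 * 1024            # default staging space limit

def enforce_staging_limit(staged_files):
    """staged_files: list of (created_order, size_kb); returns the files kept."""
    used = sum(size for _, size in staged_files)
    if used < 0.9 * STAGING_LIMIT_KB:
        return staged_files              # below 90%: leave everything alone
    staged_files = sorted(staged_files)  # oldest (lowest created_order) first
    while staged_files and used > 0.6 * STAGING_LIMIT_KB:
        _, size = staged_files.pop(0)    # delete the oldest staging file
        used -= size
    return staged_files

# Example: 700,000 KB of staged files triggers cleanup back below 60%.
files = [(i, 7000) for i in range(100)]   # 100 files, 7,000 KB each
print(len(enforce_staging_limit(files)))  # 57 files remain
```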
note
Note that most antivirus vendors now have
FRS-friendly versions of their products. If you ask and they don't know,
find another vendor. This is a well-known problem and they should have a
solution. For more information, see Microsoft KB article 815263,
“Antivirus, Backup, and Disk Optimization Programs That Are Compatible
with NTFRS.”
Microsoft made significant improvements on this issue in Windows 2000 SP3 and Windows Server 2003 in two ways:
Reduction of excessive FRS replication
(Microsoft KB article 811370, “Issues that are fixed in the post-Service
Pack 3 release of Ntfrs.exe”):
FRS detects these unnecessary updates to the files (presumably based on
frequency) and suppresses the updates. The Administrator is notified
with event ID 13567 in the NTFRS event log. This was available as a
Windows 2000 post-SP3 hotfix 811370 as well as Windows Server 2003, and
is described in Microsoft KB article 315045, “FRS Event 13567 is
Recorded in the File Replication Service Event Log After you Install
Service Pack 3.”
Replication is not stopped if the staging directory fills (Microsoft KB article 307319):
In Windows 2000 SP3+ and Windows Server 2003, when the staging directory
gets to 90% capacity, the oldest files are deleted until it is reduced
to 60% capacity, thus preventing replication from stopping and taking
the DC offline. Note that this is not a “fix.” The fix is to find out
what is causing this huge volume of files to be dumped into the staging
area.
note
There is really no reason to experience staging area
problems with Windows Server 2003 or Windows 2000 SP3+ if you are using
FRS-aware versions of defragmenters and antivirus products. These new
versions, combined with the new features in FRS that make it more
tolerant of error conditions, will reduce or eliminate most common FRS
issues.
Journal Wrap
When changes are made to files in the NTFS, an entry
is made in the NTFS journal indicating a new file, a deletion, or a
modification. When the NTFS journal gets filled, it wraps and writes
over the oldest entries. FRS uses the NTFS journal to detect files in
SYSVOL that have changed so it can start replication. If a lot of files
are changed and FRS gets overwhelmed, the NTFS journal fills and
begins overwriting the oldest entries before FRS has processed them. This leaves FRS lost,
because it needs those entries. The NTFS journal size in Windows Server 2003
was simply increased to 128MB, a dramatic increase over the Windows 2000
limit of 32MB. This should significantly reduce the opportunity for
experiencing journal wrap errors and the resulting nonauthoritative
restore.
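Conceptually, the NTFS journal behaves like a fixed-size ring buffer, and a journal wrap is what happens when FRS falls so far behind that the entries it still needs have already been overwritten. The sketch below illustrates that idea with a bounded deque; it is a teaching model, not the real journal format, and the capacity is an arbitrary stand-in for the 32MB/128MB limits.

```python
# Toy model of journal wrap: a fixed-size journal overwrites its oldest
# entries, and a consumer that falls behind finds a gap in sequence numbers.
from collections import deque

JOURNAL_CAPACITY = 1000                   # stand-in for the 32MB/128MB limit

journal = deque(maxlen=JOURNAL_CAPACITY)  # oldest entries fall off the front
next_usn = 0

def record_change(path):
    """Append a change record; old records are silently overwritten."""
    global next_usn
    journal.append((next_usn, path))
    next_usn += 1

# A burst of changes larger than the journal can hold...
for i in range(5000):
    record_change(f"Policies\\file{i}.pol")

# ...and a consumer (FRS) that last read sequence number 100:
last_read_usn = 100
oldest_available = journal[0][0]
if oldest_available > last_read_usn + 1:
    print(f"Journal wrap: entries {last_read_usn + 1}..{oldest_available - 1} "
          "were overwritten before FRS processed them.")
```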
Authoritative and Nonauthoritative Restore
Authoritative and nonauthoritative restore in FRS are not
related to authoritative and nonauthoritative restore in AD, which is
performed with the Ntdsutil.exe tool. In regard to FRS, these terms refer to a restore of
the SYSVOL tree only.
Authoritative Restore
Authoritative restore uses a “big hammer” approach to
getting SYSVOL on all DCs in sync with a single source. Although
Microsoft now says it was never intended to be a “silver bullet”
solution to FRS issues, it was used extensively during the days when
antivirus products were causing huge numbers of files to be dumped into
the staging areas. Although this created other problems, it was the best
we could do at the time to recover. Today, there probably aren't a lot
of valid reasons to do an authoritative restore. In fact, Microsoft says
that authoritative restore is used too much as a quick fix rather than
finding the root cause, and is an excellent way to bring down a domain.
Authoritative restore also wipes out all FRS data—Group Policy
templates, associated .ini files, scripts, and anything you have placed
in the SYSVOL directory tree.
Authoritative restore usage assumes that all DCs in
the domain hold corrupt copies of the SYSVOL tree and that the NTFRS
database is corrupt. This needs to be investigated and resolved to
prevent the situation from recurring. This condition is rare, and
the Administrator should use authoritative restore only as a last
resort.
So, now that you are properly frightened about using
this, you can refer to Microsoft KB article 315457, “How to Rebuild
SYSVOL and Its Content in a Domain,” for details on how to do it. The KB
article basically says to 1) back up SYSVOL; 2) turn FRS off on all
DCs; 3) set the “burflags” Registry value; 4) pick a source DC, delete
the SYSVOL data, and copy the backed-up version to this DC; and 5) turn FRS back on,
first on the source and then on one DC at a time, until they are all in sync.
The confusing thing to me was what was meant by
“SYSVOL.” According to the author of the KB article, it means the data
in the SYSVOL tree. You must leave the SYSVOL structure in place and
just replace the data. Furthermore, the “data” in SYSVOL really boils
down to the contents of %systemroot%\sysvol\sysvol\<domainname>\policies and %systemroot%\sysvol\sysvol\<domainname>\scripts,
unless you are one of those mavericks who create their own directories
and use FRS to replicate their own data, in which case you'd have to
include those as well. Note that this data populates the %systemroot%\sysvol\domain\policies
directory via the junction point, so there is no need to replace it in both
places. If you delete the SYSVOL tree, you'll also delete the junction
point, which will replicate and delete all the SYSVOL data on all DCs;
this is what the authoritative restore does anyway, but then you have to
re-create it, so just don't do it.
Nonauthoritative Restore
Nonauthoritative restore is the “little hammer”
approach. Unlike the authoritative restore, which syncs all DCs to a
common source, nonauthoritative restore syncs one out-of-date DC with an
up-to-date source. Thus, only one source and one satellite are
involved. This is less intrusive than the authoritative restore because
it can mess up only two DCs rather than all of them.
In this case, FRS is stopped on the target (out-of-date) DC, the Burflags value is set to D2 on the target, and then FRS is started on the target.
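As a sketch of how the Burflags step might be scripted, the Python fragment below stops FRS, sets the value with the standard winreg module, and starts FRS again. The registry path shown is the commonly documented NtFrs location for BurFlags, but treat it as an assumption and verify it against Microsoft's documentation; in practice you would normally just use regedit and net stop/net start by hand.

```python
# Hedged sketch of the nonauthoritative-restore steps on the target DC:
# stop FRS, set BurFlags to D2, start FRS. The registry path below is the
# commonly documented NtFrs location; verify it before using this for real.
import subprocess
import winreg

BURFLAGS_KEY = (r"SYSTEM\CurrentControlSet\Services\NtFrs\Parameters"
                r"\Backup/Restore\Process at Startup")

subprocess.run(["net", "stop", "ntfrs"], check=True)       # stop FRS

with winreg.OpenKey(winreg.HKEY_LOCAL_MACHINE, BURFLAGS_KEY, 0,
                    winreg.KEY_SET_VALUE) as key:
    winreg.SetValueEx(key, "BurFlags", 0, winreg.REG_DWORD, 0xD2)

subprocess.run(["net", "start", "ntfrs"], check=True)      # start FRS
# The target DC then pulls a fresh copy of SYSVOL from an upstream partner.
```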
Unlike authoritative restore, there are good
reasons to perform a nonauthoritative restore. When a DC gets out of sync and can't catch up—when a
serious FRS error occurs such as a journal wrap error which can disable a
DC—action must be taken to get the data in sync between the broken DC
and a DC with a good copy of the FRS (SYSVOL) data. Windows 2000
behavior is to automatically perform a nonauthoritative restore on a DC
that experiences an error such as journal wrap by having the out-of-date
DC contact a partner and pull the SYSVOL tree. However, because this is
intrusive and disables the DC for a period of time, Windows 2000 SP3
and Windows Server 2003 do not do this automatically. Rather, they log event
13568 and let the Administrator do it at his or her leisure, presumably
during off hours. Refer to Microsoft KB article 307319.