DNS primary secondary not switching

Hi all,

As strange as it may sound, I have always had this problem (on W2003, 2008, 2008-R2, 2012, 2012-R2)...

I have 2 DCs which are also DNS servers with AD-integrated zone(s). DNS clients use one as primary and the other as secondary. The configuration is mixed, in the sense that some DNS clients use DC1 as primary and DC2 as secondary, while others use the opposite order, DC2 then DC1.

My problem: if, from the DNS client's perspective, the primary DNS is down (that sometimes happens), the client will NOT use the secondary, whatever the config might be (DNS1->DNS2 or DNS2->DNS1).

If I then check the second DNS server using nslookup, it works OK.

I know (from theory) that if the primary DNS is not answering at all, the client will use the secondary DNS server if one is configured in the client's IP config. This has never worked.

Thank you for your

October 20th, 2013 2:37pm

Hi,

 This cases you have to verify Prefix DNS and wins server. I hope in your environment WINS server should be in place. So when ever you configure Primary DNS & WINS on NIC settings have to configure and Forward need to verify.

October 20th, 2013 2:52pm

Hello,

The question here is: does this happen AFTER logon to the domain, when the preferred DNS server then becomes unavailable, or during startup of the machine?

If this happens after logon to the domain, what you see is CORRECT. Once an available DNS server is in use, all other configured DNS servers are IGNORED until you reboot the client.

There is NO automatic failover between the DNS servers configured on a machine's NICs.

October 20th, 2013 8:07pm

Thanks SePy. I have some trouble following your writing. There are several words missing or not in proper order...

DNS prefixes are OK, and there is no problem with WINS/NetBIOS name resolution, only with DNS failover.

BTW, what is your work at Microsoft (MGSI)?

Thanks.

October 21st, 2013 12:52pm

The client-side service doesn't toggle back and forth between DNS entries. That's not how it works. And this applies to all operating systems: Windows, Linux, BeOS, Unix (including Macs), etc. It's an industry standard defining how the client-side resolver service works.

To summarize, if a DNS query has already occurred and the client has already received a response, then it is cached for the TTL of the record (you can run "ipconfig /displaydns" to show what's in the cache and the remaining TTL of each record, and repeat the command to watch the TTL count down). If there was no prior query and it's not cached, or the TTL has expired, and there are multiple DNS entries on a machine's NIC (whether a DC, member server or client), it will ask the first entry first. If it receives a response, but the DNS server does not have the zone data (such as if you were to use your ISP's DNS or your router as a DNS address and expect that to work with AD), then the response will be NXDOMAIN. That still counts as a response, even though it was the wrong one, so the client will not go on to the next DNS entry in the NIC's list.

If the first entry doesn't respond at all (a NULL response, such as when the DNS server is down), the client will go to the second entry after a timeout period, which can last 15 seconds or more as it keeps retrying the first one. At that point it REMOVES the first entry from the eligible resolvers list and won't go back to it for another 15 minutes (unless you force it by restarting the DNS Client service). When a DC/DNS is down, or taken offline on purpose, such as for DC maintenance during production hours, it may cause issues within AD when accessing a resource such as a printer or folder, getting GPOs to apply, etc.
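
To make the behavior described above easier to follow, here is a rough Python sketch of it. This is only a toy model of what the post describes, not the actual Windows resolver code. The class name, the `ask` callback, the server names and the example zone are all made up for illustration, and the 15-minute penalty is simply the figure quoted above.

```python
from dataclasses import dataclass, field

# Figure quoted in the post: a silent server is skipped for about 15 minutes.
PENALTY_SECONDS = 15 * 60


@dataclass
class ToyResolver:
    """Toy model of the client-side behavior described above (hypothetical, not Windows code)."""
    servers: list                                   # DNS servers in NIC order
    penalized_until: dict = field(default_factory=dict)

    def eligible(self, now):
        """Servers that are not on the penalty list at time `now`, in NIC order."""
        return [s for s in self.servers if self.penalized_until.get(s, 0) <= now]

    def query(self, name, ask, now=0.0):
        """ask(server, name) returns an answer, "NXDOMAIN", or None for no response."""
        for server in self.eligible(now):
            reply = ask(server, name)
            if reply is None:
                # No response at all: penalize this server and try the next entry.
                self.penalized_until[server] = now + PENALTY_SECONDS
                continue
            # Any response, including NXDOMAIN, ends the search.
            return server, reply
        return None, None


def fake_network(server, name):
    """Hypothetical network: dc1 is down, dc2 hosts the zone, isp knows nothing about AD."""
    zone = {"dc2": {"files.corp.example": "192.0.2.5"}}
    if server == "dc1":
        return None
    return zone.get(server, {}).get(name, "NXDOMAIN")


resolver = ToyResolver(["dc1", "dc2"])
print(resolver.query("files.corp.example", fake_network, now=0))
print(resolver.eligible(now=60))               # dc1 is still penalized
print(resolver.eligible(now=PENALTY_SECONDS))  # dc1 is eligible again
```

Note how a list such as ["isp", "dc1"] would stop at "isp" with NXDOMAIN, which is the reason not to mix an ISP or router address with the AD DNS servers.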

-

WINS NetBIOS, Browser Service, Disabling NetBIOS, & Direct Hosted SMB (DirectSMB). Troubleshooting the browser service.
Client side resolution process chart.
The DNS Client Side Resolver algorithm.
If one DC or DNS goes down, does a client logon to another DC or use the other DNS server in the NIC?
DNS Forwarders Algorithm and multiple DNS addresses (if you've configured more than one forwarders or more than one IP in the NIC's DNS list)
Published by Ace Fekay, MCT, MVP DS on Nov 29, 2009 at 10:28 PM
http://msmvps.com/blogs/acefekay/archive/2009/11/29/dns-wins-netbios-amp-the-client-side-resolver-browser-service-disabling-netbios-direct-hosted-smb-directsmb-if-one-dc-is-down-does-a-client-logon-to-another-dc-and-dns-forwarders-algorithm.aspx

DNS Clients and Timeouts (Part 1 & Part 2), karammasri [MSFT] Dec 2011 6:18 AM
http://blogs.technet.com/b/stdqry/archive/2011/12/02/dns-clients-and-timeouts-part-1.aspx
http://blogs.technet.com/b/stdqry/archive/2011/12/15/dns-clients-and-timeouts-part-2.aspx

-

October 21st, 2013 5:57pm

Hi,

I would like to check if you need further assistance.

Thanks.

October 24th, 2013 3:58am

Hi Alex.

Thanks for asking. It seems that the problem is more the second DC/DNS than DNS itself. I'm about to run dcdiag on it. I have found SYSVOL and other replication problems.

I am busy at this time but will check in soon.

Thanks.

October 24th, 2013 10:25am

Can this be cross-referenced with another forum (DFS/DC)?

DNS looks to work OK and does fail over as expected. It is my second DC that does not accept client logons, which leads to everything not working properly (incl. DNS) until the first DC comes back online.

Dcdiag on that second DC (2nd out of 2) showed some errors and I started to examine them. I found that there is a severe DFS-R problem syncing the SYSVOL folder.

The DFS-Event-log says:

The DFS Replication service stopped replication on the folder with the following local path: C:\Windows\SYSVOL_DFSR\domain. This server has been disconnected from other partners for 287 days, which is longer than the time allowed by the MaxOfflineTimeInDays parameter (60). DFS Replication considers the data in this folder to be stale, and this server will not replicate the folder until this error is corrected.

To resume replication of this folder, use the DFS Management snap-in to remove this server from the replication group, and then add it back to the group. This causes the server to perform an initial synchronization task, which replaces the stale data with fresh data from other members of the replication group.

I tried the "remove this server from the replication group" by doing :

dfsradmin membership delete /rgname:"Domain System Volume" /rfname:"SYSVOL Share" /memname:theBadDC

But this is impossible because the result is:

Failed: DELLDC (SYSVOL Share): The membership cannot be deleted. The operation is not supported on SYSVOL replication groups.

Any ideas how to go on from here?

Thanks.

October 24th, 2013 5:23pm

It's actually an AD (Directory Services) issue and question. We can keep it here, or get it moved to the DS forum at the moderator's discretion.

What event ID# is that?

Are you also seeing a Jrnl-Wrap error in the event viewer?

Or are you seeing any Event IDs 13568, 13508, 1388, 1988, 2042, 2023, 2095, 1113, 1115, 2103?

How about Replication error 8614?

-

If you are not seeing any errors about AD being unable to replicate beyond the AD tombstone lifetime, which would require you to forcibly demote the DC and re-promote it, you can get away with reinitializing just SYSVOL replication using the BurFlags option, provided that replication as a whole is still working between your two DCs.

Here's how to do it...

How to Recover a Journal Wrap Error (JRNL_WRAP_ERROR) and a Corrupted SYSVOL from a Good DC What option do I use, D4 or D2? Whats the Difference between D4 and D2?
http://blogs.msmvps.com/acefekay/2013/08/28/how-to-recover-a-journal-wrap-error-jrnl_wrap_error-and-a-corrupted-sysvol-from-a-good-dc-what-option-do-i-use-d4-or-d2-whats-the-difference-between-d4-and-d2/

October 24th, 2013 6:22pm

Is there an antivirus on the DCs that hasn't been properly configured with exclusions for DC functions? That is a major cause of issues on DCs.

Also run the following and post back.

dcdiag /v > c:\dcdiag.txt                        - Run on each DC.
ipconfig /all > c:\ipconfig.txt                  - Run on each DC.
repadmin /queue * > c:\repadminQueue.txt         - Shows if anything is in the queue waiting to replicate
repadmin /showrepl > c:\rep-showrepl.txt         - Run on each DC. This helps understand the replication topology and replication failures
nltest /dsgetdc:<domain.local> /force            - Run on each DC. This tests secure channels between DCs
repadmin /showreps > c:\rep-showreps.txt         - Run on each DC. This switch shows if the partitions have replicated or not
repadmin /replsum > c:\rep-replsummary.txt       - Run on each DC. Shows replication summary. You can also use the output to create report.
Event Log Errors:                                - From each DC.
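
If you have to collect this on several DCs, something like the following Python sketch can gather the output in one folder. It only wraps the commands listed above. The function names, the output folder and the "nltest.txt" file name are my own assumptions for illustration, and nothing is executed until you call collect() yourself on a DC.

```python
import subprocess
from pathlib import Path


def diag_commands(domain):
    """Return (argv, output file name) pairs for the commands listed above."""
    return [
        (["dcdiag", "/v"], "dcdiag.txt"),
        (["ipconfig", "/all"], "ipconfig.txt"),
        (["repadmin", "/queue", "*"], "repadminQueue.txt"),
        (["repadmin", "/showrepl"], "rep-showrepl.txt"),
        # The post does not redirect nltest to a file; "nltest.txt" is a hypothetical name.
        (["nltest", f"/dsgetdc:{domain}", "/force"], "nltest.txt"),
        (["repadmin", "/showreps"], "rep-showreps.txt"),
        (["repadmin", "/replsum"], "rep-replsummary.txt"),
    ]


def collect(domain, out_dir, run=subprocess.run):
    """Run every command and save its standard output under out_dir."""
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    written = []
    for argv, filename in diag_commands(domain):
        result = run(argv, capture_output=True, text=True)
        target = out / filename
        target.write_text(result.stdout)
        written.append(target)
    return written


# Example (run on each DC; the domain name and folder are placeholders):
# collect("domain.local", r"c:\diag")
```

The `run` parameter is only there so the function can be tried out with a stand-in instead of the real tools.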

-

Post the info to www.skydrive.com or another sharing site, since this info can get pretty large for a forum post.

Thanks.

-

Also, to generate a report and check that the required ports are open (an AV can block them), check out the following...

1. Download The Active Directory Replication Status Tool (ADREPLSTATUS) - if you see anything in the report in RED, then we have an issue.
   http://www.microsoft.com/en-us/download/details.aspx?id=30005
     This tool requires .Net Framework 4. If it's not installed, download and install it:
       Microsoft .NET Framework 4 (Web Installer)
       http://www.microsoft.com/en-us/download/details.aspx?id=17851
 
2. Run PortQry GUI choosing the "Domains & Trusts" option between each other (DCs). Run the test from a DC to a DC from both sides to each other, or you can also run it from a client to a DC. Post only errors with "NOTLISTENING," 0x00000001, and 0x00000002. You can ignore UDP 389 and UDP 88 messages. If you see TCP 42 errors, that just means WINS is not running on the target server.
       PortQryUI - GUI - Version 2.0 8/2/2004
       http://www.microsoft.com/download/en/details.aspx?id=24009
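
If PortQry is not at hand, a quick TCP-only check can be sketched in Python as below. This is not a replacement for PortQry: it does not test UDP or the RPC endpoint mapper contents, and the port list is only a set of commonly cited AD-related TCP ports that I am assuming for illustration, not PortQry's exact "Domains & Trusts" list. The host name in the example comment is a placeholder.

```python
import socket

# Commonly cited AD-related TCP ports (an assumption for illustration only).
AD_TCP_PORTS = {
    53: "DNS",
    88: "Kerberos",
    135: "RPC endpoint mapper",
    389: "LDAP",
    445: "SMB",
    3268: "Global Catalog",
}


def tcp_port_open(host, port, timeout=2.0):
    """Return True if a TCP connection to host:port succeeds within the timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


def scan(host, ports=AD_TCP_PORTS, timeout=2.0):
    """Return {port: True/False} for every port in `ports`."""
    return {port: tcp_port_open(host, port, timeout) for port in ports}


# Example (run from one DC against the other; "dc6" is a placeholder host name):
# for port, is_open in scan("dc6").items():
#     print(port, AD_TCP_PORTS[port], "open" if is_open else "NOT reachable")
```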

October 24th, 2013 6:39pm


Hi,

I would like to check if you need further assistance.

Thanks.

November 6th, 2013 12:21pm

If you can provide specific event errors, we can provide specific steps.

Since it's a CA, you can't demote it, which is one of the many reasons we don't recommend installing other things on a DC.

My blog has specifics for the different event errors, but it would be helpful if you can elaborate on what you're seeing.

November 6th, 2013 9:33pm

Yes, I know that. But you can migrate it. I have already done CA migrations. Not a fun task. In my scenario, the machine count must be kept low and the number of roles per machine high, so servers often have several roles. The "PDC" DC was supposed to be "stable enough" to host the AD-DS, AD-CA & DNS roles. Not more (in 2013 on WS 2012-R2 this should be considered an acceptable risk). But sometimes it happens... If I cannot resolve the problem, I will migrate the CA to another server before demoting the DC, but that would be the very last resort...

I will keep you informed about the testing and will provide data. Time frame: from now to 1 week.

Thanks for your support.

November 7th, 2013 12:29pm

For DCs, we usually recommend the following:

  • Do not make a DC multihomed, unless you are teaming the NICs.
  • The only acceptable service and features to install are:
    - DNS
    - DHCP
    - WINS
  • Do not install anything else on a DC

-

I realize that many companies have budget constraints that force them to use a DC for multiple purposes, but it does complicate a DC, and the DC role itself can hinder the other services, because one of the things it does is disable the write cache on the controller card, which will slow other things to a crawl. It also means you can't demote a DC with something major on it, such as a CA, SQL, Exchange, etc., and it complicates DC recovery procedures. After all, it is your directory service database.

I have more on that in a blog, but I'll spare posting it, as long as you understand the limitations.

November 7th, 2013 11:57pm

To my FB email address? That won't be possible, since I don't publicly advertise it.

Just post it to Skydrive or some other sharing site, and share the link here. I'll check it, and it will give others a chance to review it in case I missed something.

What is "AC?"

And from what you're saying about SYSVOL not replicating (you disabled the DFS service on purpose?), you have quite a few more issues going on than I realized. It may be more than can be fixed through multiple forum posts, and may actually require a qualified engineer who is familiar with AD to take the time required to fix this on site or remotely.

November 13th, 2013 8:24pm

Ace,

There is nothing more than SYSVOL not replicating. As you know, SYSVOL may replicate with NTFRS or DFS-R. This one was migrated to SYSVOL-DFSR, as has been possible since W2008, if I remember well.

If this is too much for asking here, then I will say thanks anyway and delete the thread.

Regarding FB, there is the logo in your signature. Clicking it shows your FB page and the "message" button...

I did not send the files, only the link to my Skydrive.

Regards,

November 13th, 2013 9:08pm

You don't have to delete the thread. I just don't normally respond to direct assistance requests from forum posts. You would be surprised at the number of requests I used to get sent to my personal email. It became overwhelming at one point. The idea of the forums is a collaborative effort to offer assistance, which is why we usually ask to post your data to a sharing site and provide us a link to evaluate the data and hopefully come up with a resolution. Most, if not all, MVPs share the same viewpoint on this, since after all I'm working full time (MVPs are not Microsoft employees).

I hope that makes sense?

As for your FB message, I did not receive it since my FB security settings are locked down.

November 13th, 2013 11:36pm

Ace,

Here are the logs... http://sdrv.ms/1j4xxzI

Regards,

November 15th, 2013 12:57pm

I put this together in a hurry and may have forgotten things, but this is my general assessment.

-

DC6 - As the dcdiag section below shows, it looks like DC6 has the problematic SYSVOL, which is why the NETLOGON share is not available (it's a subfolder called scripts under SYSVOL\sysvol\tommy.local\scripts)

      Starting test: NetLogons

         * Network Logons Privileges Check
         Unable to connect to the NETLOGON share! (\\DC6\netlogon)

         [DC6] An net use or LsaPolicy operation failed with error 67,

         The network name cannot be found..

-

DNS settings:

This is based on best practices...

As for DC6's DNS settings, on DC6, go into the NIC, IPv6 properties, and set DNS to obtain automatically so the ::1 doesn't show up as a DNS address. Then remove its own IP, (172.17.0.149), since that's redundant with the loopback.

On delldc's DNS settings, change it so 172.17.0.149 is the first, and the loopback is the second.

-

Repl Latency

There are a bunch of latency retired vector invocations showing, which means there have been replication problems between them. One of the main causes of replication issues is antivirus. Uninstall any antivirus on the DCs during the troubleshooting process. Check with the vendor to find out how to configure exclusions on a DC.

-

DellDC -

Dcdiag DFS errors: Did you try the recommendation in the dcdiag to reinitialize it?
"To resume replication of this folder, use the DFS Management snap-in to remove this server from the replication group, and then add it back to the group. This causes the server to perform an initial synchronization task, which replaces the stale data with fresh data from other members of the replication group. "

But if the other DC doesn't have a set to replicate, then that won't work. Was a DFS migration attempted? If so, was it ever successful?

How to attack this depends on the EventID that is associated with it.
See if this discussion helps:

DFS Replication Service stopped on one folder, with Error 9098 (associated Event ID 4004, but there may be others associated with it)
http://social.technet.microsoft.com/Forums/windowsserver/en-US/f8f62854-84b3-4998-9aae-04830fb126fe/dfs-replication-service-stopped-on-one-folder-with-error-9098?forum=winserverfiles

See this article for a hotfix:

The DFS Replication service may stop responding when it initializes the replication process for the replicated folders on a computer that is running Windows Server 2003 R2, Windows Server 2008, or Windows Server 2008 R2
http://support.microsoft.com/kb/977381

-

Time service and Virtualization

Another thing that could have caused the whole thing is the time service. On Hyper-V, VMware or Xen, you must disable time sync with the host; otherwise, if the time is synced from the host and that throws it off by more than 5 minutes, it can cause authentication problems. Disable time sync on all hosts.
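
As a small illustration of the 5-minute figure, the check below compares two clock readings against that tolerance. It is only a sketch: the function name is made up, the timestamps are invented, and in practice you would read the actual offsets with the w32tm commands further down rather than compute them yourself.

```python
from datetime import datetime, timedelta, timezone

# Threshold quoted above: more than 5 minutes of skew can break authentication.
MAX_SKEW = timedelta(minutes=5)


def skew_too_large(dc_time, member_time, tolerance=MAX_SKEW):
    """Return True if the two clocks differ by more than the tolerance."""
    return abs(dc_time - member_time) > tolerance


# Invented example readings, both in UTC.
dc_clock = datetime(2013, 11, 15, 20, 0, 0, tzinfo=timezone.utc)
print(skew_too_large(dc_clock, dc_clock + timedelta(minutes=3)))
print(skew_too_large(dc_clock, dc_clock - timedelta(minutes=7)))
```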

-

Which DC is the PDC? That must be configured with an external time source. Reconfigure it as such:

============
If you've experimented with time settings and unknowingly changed the default behavior, you can set the time settings back to default:

1. On the DC that you're experiencing issues with, run the following in a command prompt:
 net stop w32time
 w32tm /unregister
 w32tm /register
 net start w32time
 
2. On the Server in question, run the following in a command prompt:
 "net time /setsntp: " (without the quotes; I put them there to show the blank space before the closing quote)
 [This tells the client (whether a DC or workstation) to delete the current registry settings for time and use default settings.]
 
Restart the time service:
 Net stop w32time && net start w32time
 
3. On the PDC Emulator run the following in a command prompt:
 W32tm /config /manualpeerlist:time.windows.com /syncfromflags:manual /reliable:yes /update
 W32tm /resync /rediscover
 
Restart the time service:
 Net stop w32time && net start w32time
 
4. On each DC that does not hold the PDC Emulator role, run the following in a command prompt:
 w32tm /config /syncfromflags:domhier /update
 W32tm /resync /rediscover

Restart the time service:
 Net stop w32time && net start w32time
 
5. Check the new configuration:
 W32tm /query /configuration
 w32tm /query /source
 W32tm /monitor

6. This will also take out any errors in the Event Viewer, if there were any.
============

-

PortQRY

Looks like UDP 138 is blocked. AV causing it?? Not sure.

-

Event log errors? Check both for any AD related errors and post them, please.

-

Summary

Generally speaking, it looks like replication is working except for DFS and SYSVOL, which looks to me like a failed DFS migration. This will cause logon issues because GPOs are not accessible from DC6.

Note: With all due respect, due to the complexity of what's going on, something like this may take some time to fix, and it may be prudent to give Microsoft PSS a call to fix this. It's only about $275 for the ticket, and they will take all the time required to fix it.
-

November 15th, 2013 8:47pm

Ace,

Thanks for looking into this. I knew that this would not be an easy issue. I will live with that 1 DC for the moment. It's on a Hyper-V cluster, so it's "highly available"... The time needed to figure out and fix this is far more than what it takes to create a new domain (a small one in my case) and migrate what is needed. DFS was always hard to troubleshoot (for me), but this one is special because the SYSVOL share is handled specially, not as an ordinary DFS-R folder. There are no AVs on the DCs. Delldc is the PDC, and time sync and all is configured properly and works fine. The recommendation in the dcdiag log doesn't apply to SYSVOL DFS-R. It's a misleading error message.

Thanks again.

November 15th, 2013 9:53pm

If that's misleading and the only issue is sysvol, then that's an easy fix. Try to fix it using my blog I posted, and go from there. 
November 15th, 2013 11:52pm

Finally resolved the SYSVOL replication problem. Did a System State backup on the DC/DNS/CA server, then a System State restore marked as authoritative. This reset various synchronization state, including DFSR SYSVOL.

Everything is now back to normal.

Secondary question: what is the best practice for the DNS addresses if the 2 DCs are also the 2 DNS servers, and why?

Now:

DC1 points to itself as DNS1 and to DC2 as DNS2

DC2 points to itself as DNS1 and to DC1 as DNS2.

Thanks.

January 10th, 2014 10:51am

I'm happy to hear our suggestions helped.

For DNS: Point the first one to the partner, then itself or the loopback as the second entry. Do not point to itself as the first entry. Besides, the DNS BPA looks for that now.
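
Spelled out for a pair of DCs, the ordering above can be sketched like this in Python. The function name is made up, and the addresses are hypothetical ones from the documentation range, not the ones in your environment.

```python
LOOPBACK = "127.0.0.1"


def dns_client_order(partner_ips, self_entry=LOOPBACK):
    """Return the NIC DNS list for one DC: partner DC(s) first, itself last."""
    partners = [ip for ip in partner_ips if ip != self_entry]
    if not partners:
        raise ValueError("need at least one partner DNS server")
    return partners + [self_entry]


# Hypothetical addresses for the two DCs.
dcs = {"DC1": "192.0.2.10", "DC2": "192.0.2.11"}
for name, ip in dcs.items():
    partners = [other for other in dcs.values() if other != ip]
    print(name, dns_client_order(partners))
```

Pass the DC's own IP as `self_entry` if you prefer that over the loopback for the second entry.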

January 10th, 2014 11:36am

thanks for the DNS tip.

Regarding "I'm happy to hear our suggestions helped", would you please point me to the post that suggested doing an authoritative restore? I must have overlooked that one and lost a lot of time...

Thanks.

January 10th, 2014 11:45am

January 10th, 2014 8:00pm

Thanks. I did not read that, and I remember why... As happened again today, the first click on the link (the blog post) got a 404. I did not try again at that time. I did today and got to the blog post. Indeed, an AD authoritative restore is proposed there.

Thanks. I will mark your first answer as "the answer".

Best regards,

January 13th, 2014 5:54am

This topic is archived. No further replies will be accepted.
