How to fix it long term: if you have fewer than 150k clients, plan on a future migration away from your "why did you have a CAS and a hierarchy to begin with" setup to a standalone primary site, so you don't have to deal with replication anymore.
Short term: In theory, the first thing to try is in your console: Monitoring, Database Replication, right-click the link that shows "Link Failed" and run "Replication Link Analyzer". It will likely find things and try
to fix them, usually through a reinit. All you can do is try it, and hope.
Another thing to "look" at. It's nothing but looking; you can't affect anything by doing this. In SQL Server Management Studio, connect to both servers, one on each side of the failed link.
On each CM_xxx database, run exec spDiagDRS
In the 4th or 5th pane of those results, you'll see *exactly* which replication group is failed, and the 'lastsynctime' for that group (on that server). In general, if the last sync time is within the last 30-60 minutes, for us, that's fine, EVEN
IF it says degraded (or even failed). The difference between degraded and failed is actually just a threshold setting. I forget the default, but if you have a huge obnoxious site with bad links, your sync times might run longer than the defaults, and even if replication
is actually working (just delayed) it might trigger a degraded or failed status.
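If you'd rather not eyeball the timestamps, that rule of thumb is easy to sketch in Python. This is only an illustration: the function name and the way you get the 'lastsynctime' value out of the results are assumptions (you'd copy the value from the query output yourself), and the 60-minute tolerance is just our comfort level, not a product default.

```python
from datetime import datetime, timedelta


def sync_is_recent(last_sync_time: datetime, now: datetime,
                   tolerance_minutes: int = 60) -> bool:
    """Return True if the replication group synced within the tolerance window.

    last_sync_time is the 'lastsynctime' value copied from the results
    (hypothetical input; this sketch does not query SQL itself).
    """
    age = now - last_sync_time
    return age <= timedelta(minutes=tolerance_minutes)


# Example: a group that last synced 45 minutes ago is fine by this rule,
# no matter what the console status says.
checked_at = datetime(2024, 1, 1, 12, 45)
print(sync_is_recent(datetime(2024, 1, 1, 12, 0), checked_at))
print(sync_is_recent(datetime(2024, 1, 1, 9, 0), checked_at))
```

Make sure both timestamps are in the same time zone before comparing them.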
The other thing I look at to see "is there an issue" is the 3rd pane of the exec spDiagDRS results: "IncomingMessagesInQueue" and/or "OutgoingMessagesInQueue". On a busy site, there will always be a count
in there; it's rarely 0. But if it's 100,000 or more AND there hasn't just been a hotfix release (in which case that might be normal), and it just keeps growing and NOT shrinking (re-run exec spDiagDRS a few times to watch that count), there might be some process
in SQL that is blocking messages from being processed. If you aren't an awesome guru at diagnosing SQL, the easiest thing would be to simply reboot your servers. If whatever-it-was is some chronic issue that will just recur 15 minutes after the reboot,
it won't help, but it also just might kick loose whatever was blocking replication. And if it doesn't help and you have to open a call with Microsoft, at least when their first suggestion is "have you tried rebooting" you can say yes.
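
The "is the queue stuck" check above boils down to two conditions: the count is big, and it keeps going up between runs. Here's a minimal Python sketch of that logic, assuming you jot down the queue count each time you re-run the query. The function name and the list-of-samples input are made up for illustration, and it knows nothing about whether a hotfix just went out, so that part is still on you.

```python
from typing import Sequence


def backlog_looks_stuck(samples: Sequence[int], threshold: int = 100_000) -> bool:
    """Return True if the queue count is large and has only grown.

    samples holds queue counts noted from successive runs, oldest first
    (hypothetical input; this sketch does not query SQL itself).
    """
    if len(samples) < 2:
        return False  # one reading can't show a trend
    never_shrinks = all(later >= earlier
                        for earlier, later in zip(samples, samples[1:]))
    grew_overall = samples[-1] > samples[0]
    return samples[-1] >= threshold and never_shrinks and grew_overall


# Example: three readings taken a few minutes apart.
print(backlog_looks_stuck([120_000, 135_000, 150_000]))  # big and climbing
print(backlog_looks_stuck([150_000, 140_000, 120_000]))  # big but draining
print(backlog_looks_stuck([500, 900, 1_200]))            # climbing but small
```

A big number that is draining just means replication is catching up, which is why the sketch wants more than one reading before it calls anything stuck.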