Use Powershell to replace characters within a specific string

I'm using a Powershell script to automate the replacement of some troublesome characters from an xml file such as & ' -

The script I have works well for these characters, but I also want to remove the double quote character " but only if it is used within an xml attribute (which unfortunately is enclosed by double quotes) so I obviously cannot remove all double quotes from the xml file as this will stop the attributes from working as they should.

My Powershell script is below:

(Get-Content C:\test\communication.xml) | 
Foreach-Object {$_ -replace "&", "+" -replace "", "GBP" -replace "'", "" -replace "", " "} |
Set-Content C:\test\communication.xml

What I'd like to be able to do is to remove ONLY the double quotes that make up part the XML attributes that are themselves enclosed by a pair of double quotes as below. I know that Powershell looks at each line as a separate object so suspect this should be quite easy, possibly by using conditions?

An example XML file is below:

<?xml version="1.0" encoding="UTF-8"?>
<Portal> 
<communication updates="Text data with no double quotes in the attribute" />
<communication updates="Text data that "includes" double quotes within the double quotes for the attribute" />
</Portal>

In the above example I'd like to remove only the double quotes that immediately surround the word includes BUT not the double quotes that are to the left of the word Text or to the right of the word attribute. The words used for the XML attributes will change on a regular basis but the left double quote will always be to the immediate right of the s= part of the string and the right double quote will always be to the left of a space forward slash combination /

Thanks


July 1st, 2013 5:04pm

Could it help to count the amount of ", and if there are more than 2, remove all except first and last?
Free Windows Admin Tool Kit Click here and download it now
July 1st, 2013 5:07pm

Thanks Martin, yes this sounds like it would work but the first line of the XML document which defines the document type cannot be amended and also has more than 2 " so this would need to be discounted from the rule.

Do you have example code please for the counting suggestion?

July 1st, 2013 5:13pm

It is impossible to have quotes within an attribute value.  It is not legal XML.

There is no reasonable way to do what you are asking.

Free Windows Admin Tool Kit Click here and download it now
July 1st, 2013 5:33pm

There are RegEx patterns that can match this sort of thing.  They will take some amount of skill to design.

Characters like & must be escaped.  &amp; is use for &.

&quot;
&apos;

The XML parser will manage these character entity references:
http://en.wikipedia.org/wiki/List_of_XML_and_HTML_character_entity_references

July 1st, 2013 5:39pm

Hi JRV, I think you may have misunderstood my original question...

I'm aware that double quotes inside XML attributes are illegal characters which is why I want to remove them.

I have users who are manually editing XML attributes and although I've told them not to use double quotes they're human after all and unfortunately make mistakes, so what I want is a script that I can run using task scheduler that regularly checks these XML files and removes any illegal characters, thus making the XML files compliant before they need to be used in an HTML page and read using Javascript.

Anyway, your more constructive answer below was actually taken a bit further by a very helpful soul on StackOverflow and just in case you're interested their code (which works perfectly) is below.

(Get-Content C:\test\communication.xml) |
Foreach-Object {$_ -replace "&", "+" `
    -replace "", "GBP" `
    -replace "'", "" `
    -replace "", " " `
    -replace "(?<!\?xml.*)(?<=`".*?)`"(?=.*?`")", ""} |
Set-Content C:\test\communication.xml

Free Windows Admin Tool Kit Click here and download it now
July 2nd, 2013 4:53am

A positive look-ahead will do this under very restrictive circumstances.  If humans are making the mistake the code will fail in many cases.

I would ask why anyone is manually editing XML when there are many tools for editing it corr

July 2nd, 2013 7:24am

There are some ways to do this, but they are all going to be complex.

Ever think of shooting the first user that makes a mistake as a lesson for all?  <joking - in case of doubt>

But seriously, there are much better ways of editing XML that would largely avoid this issue - why not use them?

Sorry if this is not the answer you wanted.

Free Windows Admin Tool Kit Click here and download it now
July 2nd, 2013 9:50am

That is a solution I hadn't though of.  I like it.  I'll try it next time and let you know.

Any idea what I can do with all of those excess bodies?

July 2nd, 2013 10:01am

Sadly, the shooting, even if beneficial to the global gene pool, is probably a violation of human rights and thus the ECHR would probably be upset. An alternative might be a month at McDonald's Hanburger Univ as preparation for their next career.

Seriously, one way might be to utilise PowerShell's error handling.

The above XML fragment, when loaded yeilds an error, that pinpoints the error. WIth some grunt-like coding you could try loading the document and if there's an error, parsing the error record and working out where the error starts, then go from there.

I can kind of see how you can do it, but I don't have the couple of hours I'd need to work it all out properly.

Free Windows Admin Tool Kit Click here and download it now
July 2nd, 2013 10:13am

I have looked for a solutiuo0n like this. Unfortunately the Microsoft tool does not let us load then parse the text.  In other tools this is possible.  We can get a list of violations in the order they occur and then fix this. 

July 2nd, 2013 10:24am

When I try to load the bad xml above, I get this error:

Psh[Cookham8:C:\foo]>$a = [xml] (Get-Content .\foo.xml)
Cannot convert value "System.Object[]" to type "System.Xml.XmlDocument". Error: "'includes' is an unexpected token. Expecting white space. Line 4, position 41."
At line:1 char:1
You could wrap it something like this

$loadOK = $False
While (! $loadok) {
  try {
  $xml = [xml] (get-content dodgy.xml -ea stop)
  $LoadOk = $true
  }
  Catch {
    ... parse the error message to get out the line
    number where the error occurs, then do some 
    grunt parsing of raw text, saving it all again
    before another loop).
}

Free Windows Admin Tool Kit Click here and download it now
July 2nd, 2013 10:32am

Thomas - That would maybe speed things up but it does not get us an exact item to fix.  It also does not tell us what the exact error is correctly.  Good parsers will isolate the exact text and error as an object.

July 2nd, 2013 10:43am

LIke I said, it's not perfect, but grepping the error message you can work out which line the error is in and at what char, you could then work out which line the error is in and go from there. If the relevant column number holds a ", then look right till you get the next " and delete both. VERY workmanlike but could work.

I prefer the option of use education.

Free Windows Admin Tool Kit Click here and download it now
July 2nd, 2013 10:46am

Thanks very much for your suggestions guys (I particularly like the shooting one!) but as I mentioned above, this issue is now fixed.

This is the response from a helpful user on StackOverflow and an explanation of how it works. I have implemented the code and created scheduled tasks to run every 15 minutes to delete the rogue characters from my XML files and it is all working perfectly.

------------------

Try this regex:

"(?<!\?xml.*)(?<=`".*?)`"(?=.*?`")"

In your code it would be:

(Get-Content C:\test\communication.xml) | 
Foreach-Object {$_ -replace "&", "+" `
    -replace "", "GBP" `
    -replace "'", "" `
    -replace "", " " `
    -replace "(?<!\?xml.*)(?<=`".*?)`"(?=.*?`")", ""} |
Set-Content C:\test\communication.xml

This will take any " that has a " in-front of and behind it (except a line that has ?xml in it) and replace it with nothing.

Breakdown of regex;

(?<!\?xml.*)(?<=`".*?)`"(?=.*?`")

1. (?<!\?xml.*)----> Excludes any line that has "?xml" before the first quote
2. (?<=`".*?)------> Lookbehind searching for a quotation mark.  
       The ` is to escape the quotation mark, which is needed for powershell
3. `"--------------> The actual quotation mark you are searching for
4. (?=.*?`")-------> Lookahead searching for a quotation mark

July 2nd, 2013 4:39pm

This topic is archived. No further replies will be accepted.

Other recent topics Other recent topics