• Hi All

    Please note that at the Chandoo.org Forums there is Zero Tolerance to Spam

    Post Spam and you Will Be Deleted as a User

    Hui...

  • When starting a new post, to receive a quicker and more targeted answer, Please include a sample file in the initial post.

Remove XML tags from field with Excel Power Query

Prajac

New Member
I am new to power query and I've never come across XML files before. I have an xml data source which when opened in excel had lots of XML tags in it that need cleansing. I found a VBA function which I adapted to give me my desired cleansed data. Its a bit messy (as is the data) but it works. What is the most efficient way to do this in power query? I tried List.RemoveItems and the standard find and replace. I got stuck at trying to find equivalent of "<[^>]+>". Thank you in advance. The following is the VBA that I adapted

>>> use code - tags <<<
Code:
Function StripHTML(cell As Range) As String Dim RegEx As Object Set RegEx = CreateObject("vbscript.regexp") Dim sInput As String Dim sOut As String sInput = cell.Text
sInput = Replace(sInput, "\x0D\x0A", Chr(10))
sInput = Replace(sInput, "\x00", Chr(10))

'replace HTML breaks and end of paragraphs with line breaks
sInput = Replace(sInput, "</p>" & Chr(10) & "<p>&nbsp;</p>", Chr(10) & Chr(10))
sInput = Replace(sInput, "<BR>", Chr(10))
sInput = Replace(sInput, "<br />&chr(10)&<br />", Chr(10))
sInput = Replace(sInput, "<br /><br />", Chr(10))
sInput = Replace(sInput, "<br /> <br />", Chr(10))
sInput = Replace(sInput, "<br />&nbsp;<br />", Chr(10))
sInput = Replace(sInput, "<br />" & Chr(10) & "&nbsp;<br />", Chr(10))
sInput = Replace(sInput, "<br /> <br /> ", Chr(10))
sInput = Replace(sInput, Chr(10) & " " & Chr(10), Chr(10))
sInput = Replace(sInput, "<br />", Chr(10))
sInput = Replace(sInput, "<ol>" & Chr(10) & "<li>", "")
sInput = Replace(sInput, "</ol>" & Chr(10) & "</li>" & Chr(10) & "<li>", "")
sInput = Replace(sInput, "</ol>" & Chr(10) & "</li>" & Chr(10) & "</ol>" & Chr(10) & "</li>" & Chr(10) & "<li>", "")
sInput = Replace(sInput, "<ul>" & Chr(10) & "<li>", "")
sInput = Replace(sInput, Chr(10) & Chr(10) & Chr(10) & Chr(10) & Chr(10) & Chr(10), Chr(10))
sInput = Replace(sInput, Chr(10) & Chr(10) & Chr(10) & Chr(10) & Chr(10), Chr(10))
sInput = Replace(sInput, Chr(10) & Chr(10) & Chr(10) & Chr(10), Chr(10))
sInput = Replace(sInput, Chr(10) & Chr(10) & Chr(10), Chr(10))
sInput = Replace(sInput, Chr(10) & Chr(10), Chr(10))
sInput = Replace(sInput, Chr(10) & " ", Chr(10))



'add back all of the special characters
sInput = Replace(sInput, "&ndash;", "–")
sInput = Replace(sInput, "&mdash;", "—")
sInput = Replace(sInput, "&iexcl;", "¡")
sInput = Replace(sInput, "&iquest;", "¿")
sInput = Replace(sInput, "&quot;", "")
sInput = Replace(sInput, "&ldquo;", "")
sInput = Replace(sInput, "&rdquo;", "")
sInput = Replace(sInput, "", "'")
sInput = Replace(sInput, "&lsquo;", "'")
sInput = Replace(sInput, "&rsquo;", "’")
sInput = Replace(sInput, "&laquo;", "«")
sInput = Replace(sInput, "&raquo;", "»")
sInput = Replace(sInput, "&nbsp;", " ")
sInput = Replace(sInput, "&amp;", "&")
sInput = Replace(sInput, "&cent;", "¢")
sInput = Replace(sInput, "&copy;", "©")
sInput = Replace(sInput, "&divide;", "÷")
sInput = Replace(sInput, "&gt;", ">")
sInput = Replace(sInput, "&lt;", "<")
sInput = Replace(sInput, "&micro;", "µ")
sInput = Replace(sInput, "&middot;", "·")
sInput = Replace(sInput, "&para;", "¶")
sInput = Replace(sInput, "&plusmn;", "±")
sInput = Replace(sInput, "&euro;", "€")
sInput = Replace(sInput, "&pound;", "£")
sInput = Replace(sInput, "&reg;", "®")
sInput = Replace(sInput, "&sect;", "§")
sInput = Replace(sInput, "&trade;", "™")
sInput = Replace(sInput, "&yen;", "¥")
sInput = Replace(sInput, "&aacute;", "á")
sInput = Replace(sInput, "&Aacute;", "Á")
sInput = Replace(sInput, "&agrave;", "à")
sInput = Replace(sInput, "&Agrave;", "À")
sInput = Replace(sInput, "&acirc;", "â")
sInput = Replace(sInput, "&Acirc;", "Â")
sInput = Replace(sInput, "&aring;", "å")
sInput = Replace(sInput, "&Aring;", "Å")
sInput = Replace(sInput, "&atilde;", "ã")
sInput = Replace(sInput, "&Atilde;", "Ã")
sInput = Replace(sInput, "&auml;", "ä")
sInput = Replace(sInput, "&Auml;", "Ä")
sInput = Replace(sInput, "&aelig;", "æ")
sInput = Replace(sInput, "&AElig;", "Æ")
sInput = Replace(sInput, "&ccedil;", "ç")
sInput = Replace(sInput, "&Ccedil;", "Ç")
sInput = Replace(sInput, "&eacute;", "é")
sInput = Replace(sInput, "&Eacute;", "É")
sInput = Replace(sInput, "&egrave;", "è")
sInput = Replace(sInput, "&Egrave;", "È")
sInput = Replace(sInput, "&ecirc;", "ê")
sInput = Replace(sInput, "&Ecirc;", "Ê")
sInput = Replace(sInput, "&euml;", "ë")
sInput = Replace(sInput, "&Euml;", "Ë")
sInput = Replace(sInput, "&iacute;", "í")
sInput = Replace(sInput, "&Iacute;", "Í")
sInput = Replace(sInput, "&igrave;", "ì")
sInput = Replace(sInput, "&Igrave;", "Ì")
sInput = Replace(sInput, "&icirc;", "î")
sInput = Replace(sInput, "&Icirc;", "Î")
sInput = Replace(sInput, "&iuml;", "ï")
sInput = Replace(sInput, "&Iuml;", "Ï")
sInput = Replace(sInput, "&ntilde;", "ñ")
sInput = Replace(sInput, "&Ntilde;", "Ñ")
sInput = Replace(sInput, "&oacute;", "ó")
sInput = Replace(sInput, "&Oacute;", "Ó")
sInput = Replace(sInput, "&ograve;", "ò")
sInput = Replace(sInput, "&Ograve;", "Ò")
sInput = Replace(sInput, "&ocirc;", "ô")
sInput = Replace(sInput, "&Ocirc;", "Ô")
sInput = Replace(sInput, "&oslash;", "ø")
sInput = Replace(sInput, "&Oslash;", "Ø")
sInput = Replace(sInput, "&otilde;", "õ")
sInput = Replace(sInput, "&Otilde;", "Õ")
sInput = Replace(sInput, "&ouml;", "ö")
sInput = Replace(sInput, "&Ouml;", "Ö")
sInput = Replace(sInput, "&szlig;", "ß")
sInput = Replace(sInput, "&uacute;", "ú")
sInput = Replace(sInput, "&Uacute;", "Ú")
sInput = Replace(sInput, "&ugrave;", "ù")
sInput = Replace(sInput, "&Ugrave;", "Ù")
sInput = Replace(sInput, "&ucirc;", "û")
sInput = Replace(sInput, "&Ucirc;", "Û")
sInput = Replace(sInput, "&uuml;", "ü")
sInput = Replace(sInput, "&Uuml;", "Ü")
sInput = Replace(sInput, "&yuml;", "ÿ")
sInput = Replace(sInput, "", "´")
sInput = Replace(sInput, "", "`")
sInput = Replace(sInput, "&#39;", "'")
sInput = Replace(sInput, "&ordm;", "º")
sInput = Replace(sInput, "&deg;", "º")
sInput = Replace(sInput, "&bull;", Chr(10) & "-")
sInput = Replace(sInput, "&sup2;", "²")
sInput = Replace(sInput, "' && task.HTMLContent != '", "")
sInput = Replace(sInput, "' && task.HTMLNotes != '", "")
sInput = Replace(sInput, "' -->", "")
sInput = Replace(sInput, "&le;", "<=")
sInput = Replace(sInput, "&frac12;", "½")
sInput = Replace(sInput, "&frac14;", "¼")
sInput = Replace(sInput, "&frac34;", "¾")
sInput = Replace(sInput, "&mu;", ChrW(181))
sInput = Replace(sInput, "&Omega;", ChrW(937))

'replace all the remaining HTML Tags
With RegEx
.Global = True
.IgnoreCase = True
.MultiLine = True
.Pattern = "<[^>]+>" 'Regular Expression for HTML Tags.

End With
sInput = Replace(sInput, Chr(10) & Chr(10), Chr(10))

sOut = RegEx.Replace(sInput, "")
StripHTML = sOut
Set RegEx = Nothing
End Function
 
Last edited by a moderator:
Then I have doubts it is a real XML file, maybe it is text with XML code.
For your reference, I'm uploading an example on a sample XML (https://www.w3schools.com/xml/cd_catalog.xml)

Code:
let
    Source = Xml.Tables(File.Contents("G:\Uploads\SampleXML.xml")),
    Table0 = Source{0}[Table],
    #"Changed Type" = Table.TransformColumnTypes(Table0,{{"TITLE", type text}, {"ARTIST", type text}, {"COUNTRY", type text}, {"COMPANY", type text}, {"PRICE", Int64.Type}, {"YEAR", Int64.Type}})
in
    #"Changed Type"
 

Attachments

  • Chandoo_XML_Sample_PQ.xlsx
    17.8 KB · Views: 5
I have to download a file off a website which has a .xml extension. Here is a sample of the data. When you open the file in excel it puts the column names in ScheduleId, ScheduleTitle etc but the data in some of the columns do not convert such as '&lt;br /&gt;&lt;br /&gt;' The following is an extract from the xml file.

As You've noted >>> use code - tags <<<
Code:
<Schedule>
      <ScheduleId>308</ScheduleId>
      <ScheduleTitle>Introductory Procedures - Actuators</ScheduleTitle>
      <ScheduleReference>01-01</ScheduleReference>
      <ScheduleType>Core</ScheduleType>
      <ScheduleGroups>
        <ScheduleGroup>Actuators</ScheduleGroup>
        <ScheduleGroup>Building Access and Controls</ScheduleGroup>
        <ScheduleGroup>SUP List - Actuators</ScheduleGroup>
      </ScheduleGroups>
      <ScheduleDate>20 Oct 2015</ScheduleDate>
      <ScheduleVersion>1</ScheduleVersion>
      <UnitOfMeasure>Not specified</UnitOfMeasure>
      <AnnualServiceTiming>0</AnnualServiceTiming>
      <Introductions>
        <Introduction>
          <Content>The following procedures apply to all types of actuator and require that before any maintenance is undertaken all safety precautions must be strictly adhered to. Care must be taken to ensure that any maintenance does not unduly interfere with the system&amp;#39;s environmental conditions.&lt;br /&gt;&lt;br /&gt;Consideration should be given to the discharge of secondary energy i.e. steam, pressurised water, hot oil, hot surfaces, spring pressures etc.&lt;br /&gt;&lt;br /&gt;When working on equipment it should be noted that any auxiliaries fitted may have supplies fed from a separate circuit to that of the actuator. Therefore should the actuator supply be disconnected it should not be taken for granted that the auxiliary supply has also been disconnected.&lt;br /&gt;&lt;br /&gt;To check that the actuator is performing its correct task, it is immaterial as to whether it is mounted on a damper, louvre, valve or any other piece of equipment. Its purpose is to control to a position between two limits, by either opening, closing or moving appropriate pieces of equipment.&lt;br /&gt;&lt;br /&gt;&lt;strong&gt;Other Actuator Types&lt;/strong&gt;&lt;br /&gt;&lt;br /&gt;The function of an actuator can be split into two main categories: ON/OFF or MODULATING. Falling within these two categories are the&amp;nbsp;types listed in the SFG20&amp;nbsp;menu for Actuators.&lt;br /&gt;&lt;br /&gt;Although the majority of actuator types are covered in the menu, new developments may mean that the engineer is not fully aware of the operating characteristics of the device under scrutiny. If any doubt arises it would be preferred if the manufacturer were questioned prior to any checks or adjustments being carried out.&lt;br /&gt;&lt;br /&gt;An example of the above is the VAV actuator sensor controller. This is a type of actuator that has the controller built into the actuator housing. The sensor is connected via a length of cable and will commonly be sensing air velocity or pressure. The controller will have been set to the required VAV box demand and these settings should not be altered without prior consultation with the client, the VAV box manufacturer and the actuator manufacturer. In this example, cross-referencing should be made between the Actuators, Controllers and Sensors sections (SFG 01, SFG 14 and SFG 50).</Content>
          <Content>Please refer to the overarching introduction (SFG 00-01) to make sure you are of the correct skill level as indicated within the task schedule to carry out the described works.  Ensure you have read and understood the manufacturer’s recommendations, carried out risk assessment(s) on each item of plant to identify the correct frequency of maintenance, identified all safety procedures that need to be applied and recorded in order to carry out the work in a safe and reliable manner.</Content>
          <Notes>&lt;p&gt;&lt;p&gt;</Notes>
        </Introduction>
      </Introductions>
      <Tasks />
      <ServiceTimings />
      <Legislations />
    </Schedule>
 
Last edited by a moderator:
Do not open the file in excel, but apply the procedure as I have explained above in #2.
Load the data via PQ from file from XML.
 
Hi
I can't seem to get this to work. There are a few steps I have to take to expand the columns. Also some of the expanded columns do not fully expand. It says "Table".


Is there any way I can send sample file. Its too big to upload
 
This the query editor text

as written >>> use code - tags <<<
Code:
let
    Source = Xml.Tables(File.Contents("C:XML Download 190220.xml")),
    Schedules = Source{0}[Schedules],
    Schedule = Schedules{0}[Schedule],
    #"Changed Type" = Table.TransformColumnTypes(Schedule,{{"ScheduleId", Int64.Type}, {"ScheduleTitle", type text}, {"ScheduleReference", type text}, {"ScheduleType", type text}, {"ScheduleDate", type date}, {"ScheduleVersion", Int64.Type}, {"UnitOfMeasure", type text}, {"AnnualServiceTiming", Int64.Type}}),
    #"Expanded ScheduleGroups" = Table.ExpandTableColumn(#"Changed Type", "ScheduleGroups", {"ScheduleGroup"}, {"ScheduleGroups.ScheduleGroup"}),
    #"Expanded Tasks" = Table.ExpandTableColumn(#"Expanded ScheduleGroups", "Tasks", {"Element:Text"}, {"Tasks.Element:Text"}),
    #"Expanded ServiceTimings" = Table.ExpandTableColumn(#"Expanded Tasks", "ServiceTimings", {"Element:Text"}, {"ServiceTimings.Element:Text"}),
    #"Expanded Introductions" = Table.ExpandTableColumn(#"Expanded ServiceTimings", "Introductions", {"Introduction"}, {"Introductions.Introduction"}),
    #"Expanded Introductions.Introduction" = Table.ExpandTableColumn(#"Expanded Introductions", "Introductions.Introduction", {"Content", "Notes"}, {"Introductions.Introduction.Content", "Introductions.Introduction.Notes"}),
    #"Removed Columns2" = Table.RemoveColumns(#"Expanded Introductions.Introduction",{"Legislations"}),
    #"Expanded ScheduleGroups.ScheduleGroup" = Table.ExpandTableColumn(#"Removed Columns2", "ScheduleGroups.ScheduleGroup", {"Element:Text"}, {"ScheduleGroups.ScheduleGroup.Element:Text"})
in
    #"Expanded ScheduleGroups.ScheduleGroup"
 
Last edited by a moderator:
XML is notoriously annoying to create universal parser, and when it's done, it has large foot print.

Your main issue here, is that whomever created the site and process for generating this xml didn't follow best practice and left HTML encoded string as is without decoding (among other things). As well, there are tags used within element of xml. Such as "$#39;" etc. EDIT: Meant "<p>" here.

In my opinion, these should not be in xml but should reside in separate layer. But I digress.

In your case, there is no sure fire way to decode every HTML encode string, most will be handled automatically. Like "&lt;p&gt;" decoded as "<p>".
However, other's like "$#39;" will not be decoded as apostrophe ('). These must be replaced out, either by column -> replace values, or by using separate table of encoding translation.

EDIT: Note that the reason why "$#39;" isn't decoded automatically... is that apostrophe IS legal character within XML string. Hence, there was no need to have this HTML encoded in the first place. So, it's not in the list of automatic decoding and left as is.

Also, looking at your M code. Structure is different from sample provided in your previous post. There are elements which isn't present in your sample.

As XML is very concrete and when best practices are not followed, will introduce strong coupling between the actual representation and processing, (this is one such case...) Without the appropriate sample that actually represent your XML structure, bit difficult to help.
 
Last edited:
Back
Top