How to kick out the script tag and its content from some HTML elements?

shahin

Is there any way to remove the "script" tags and their content from some HTML elements? I am pasting below the HTML containing the script tags and their content along with the other elements. All I wish to do is strip the script tags and their content out of the document and keep the rest. Thanks in advance for any help.
Code:
Sub Customize_Documents()
  
    With CreateObject("InternetExplorer.Application")
        .Visible = False
        .navigate "C:\Users\ar\Desktop\content.html"
        While .Busy = True Or .readyState < 4: DoEvents: Wend
        Debug.Print .document.body.outerHTML
        .Quit
    End With

End Sub

Content I'm trying with:
Code:
<div id="ea84293f_1215_4896_ba89_78b670296912">

<script>
var strwarn = "Alpha Kappa Alpha Sorority, Incorporated is not responsible for the content on chapter Websites.";
</script>

<h1>Alpha Alpha Omega</h1>
<br>
<h3>Address:</h3>
<script>
var website="http://www.akapittsburghaao.net";
if (website!="" && website.substring(0,4)!="http") website="http://" + website;
if ("P. O. BOX 5044"!="") document.write("P. O. BOX 5044<br/>");
if (""!="") document.write("<br/>");
if (""!="") document.write("<br/>");
document.write("PITTSBURGH, PA 15206<br/>");
if ("UNITED STATES"!="" && "UNITED STATES"!="UNITED STATES") document.write("UNITED STATES<br />");
if (website!="") document.write("<a onclick='alert(\""+strwarn+"\");' href='"+website+"' target=_blank>"+website+"</a>");

if ("Rhonda Taliaferro"!="") document.write("<h3>Primary Contact:</h3>Rhonda Taliaferro<br/>");
if ("rhondalt@verizon.net"!="") document.write("<a href='mailto:rhondalt@verizon.net'>rhondalt@verizon.net</a><br/>");
</script>
P. O. BOX 5044<br>PITTSBURGH, PA 15206<br>
<a onclick='alert("Alpha Kappa Alpha Sorority, Incorporated is not responsible for the content on chapter Websites.");' href="http://www.akapittsburghaao.net" target="_blank">http://www.akapittsburghaao.net</a>
<h3>Primary Contact:</h3>Rhonda Taliaferro<br>
<a href="mailto:rhondalt@verizon.net">rhondalt@verizon.net</a>
<br>
</div>
 
Use the Replace function, combined with something that grabs <script>...</script>,
such as RegEx or a combination of InStr, Mid, Len, etc.

Or use RegEx group/subgroup pattern matching.
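
The RegEx suggestion above could be sketched roughly as follows. This is only an illustration, not code posted in the thread: the function name StripScriptTags and the pattern are assumptions, and it relies on the late-bound VBScript.RegExp object being available on the machine.

```vba
' Hypothetical helper: returns the HTML with every <script>...</script> block removed.
Function StripScriptTags(ByVal html As String) As String
    With CreateObject("VBScript.RegExp")
        .Global = True        ' replace every match, not just the first
        .IgnoreCase = True    ' also match <SCRIPT> as IE may upper-case tags
        ' [\s\S]*? matches across line breaks, lazily, up to the nearest closing tag
        .Pattern = "<script\b[^>]*>[\s\S]*?</script>"
        StripScriptTags = .Replace(html, "")
    End With
End Function
```

It could then be called from the original macro as Debug.Print StripScriptTags(.document.body.outerHTML). A regular expression is fine for a small, known page like this one, but it is not a general HTML parser.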
 
Got it, sir. Whenever I find you online, several questions come to mind. This one is not related to this thread: I think I have seen somewhere in this forum that you used ".responseText" with InternetExplorer. Isn't that right, sir?
 
Nope, never have. It isn't possible, since IE.Document does not have responseText as a member.

The only instance where you may have seen .responseText together with IE is when I loaded .responseText (from XMLHttp) into IE.Document using the write method for demo purposes.

See link.
https://chandoo.org/forum/threads/w...ils-in-website-through-vba.30640/#post-182628

Or if I used IE to scrape JavaScript-generated content (dynamic URL) and then used XMLHttp with that content for further scraping. Though I can't think of an example of this type of code.
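
For readers wondering what combining .responseText with an HTML document might look like, here is a minimal sketch. It is not the code behind the link above: the URL is a placeholder, the procedure name is made up, and it loads the response into a late-bound "htmlfile" document through body.innerHTML rather than into IE.Document with the write method.

```vba
' Hypothetical demo: fetch a page with XMLHttp, then parse the response as a DOM.
Sub LoadResponseTextIntoDocument()
    Dim http As Object, doc As Object

    Set http = CreateObject("MSXML2.XMLHTTP")
    http.Open "GET", "https://example.com/", False   ' placeholder URL, synchronous request
    http.send

    ' responseText belongs to the XMLHttp object, not to the document
    Set doc = CreateObject("htmlfile")
    doc.body.innerHTML = http.responseText

    Debug.Print doc.body.innerText
End Sub
```

Note that content generated by JavaScript will not appear this way, since nothing executes the page's scripts.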
 
Thanks, sir. I had been confused about that for a long time. I started this thread to remove the script tags and their content, but I think that is no longer necessary because I can access the content I would like to parse. The only barrier I'm facing is that I can't parse the name located next to "Primary Contact". I can parse the email, though. How can I get the name as well? Thanks in advance, sir. I tried the code below.
Code:
Sub Customize_Documents()
    With CreateObject("InternetExplorer.Application")
        .Visible = False
        .navigate "C:\Users\ar\Desktop\content.html"
        While .Busy = True Or .readyState < 4: DoEvents: Wend
        Set post = .document.querySelector("a[href^='mailto']")
        [A1] = post.innerText
        [A2] = post.PreviousSibling.PreviousSibling.PreviousSibling.textContent
        .Quit
    End With
End Sub
It gives the following result:

rhondalt@verizon.net
Primary Contact:

But I expect:
rhondalt@verizon.net
Rhonda Taliaferro
 
Solved it. I don't understand why, but what ".innerText" could not do, ".textContent" did successfully:

Code:
Sub Customize_Documents()
    With CreateObject("InternetExplorer.Application")
        .Visible = False
        .navigate "C:\Users\ar\Desktop\content.html"
        While .Busy = True Or .readyState < 4: DoEvents: Wend
        Set post = .document.querySelector("a[href^='mailto']")
        [A1] = post.PreviousSibling.PreviousSibling.textContent
        [B1] = post.innerText
        .Quit
    End With
End Sub
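
A note on why this works, offered as a likely explanation rather than a confirmed one: the working version steps back two siblings from the mailto link instead of three, which lands on the bare text between </h3> and <br> instead of on the <h3> itself. That text is a text node, not an element, and text nodes normally expose textContent and nodeValue but not innerText. Counting siblings by hand is fragile, so a sketch like the one below walks backwards until it finds a non-empty text node. The function name PrecedingText is made up for illustration.

```vba
' Hypothetical helper: returns the nearest non-blank text node before the given node.
Function PrecedingText(ByVal node As Object) As String
    Dim sib As Object
    Set sib = node.previousSibling
    Do While Not sib Is Nothing
        If sib.nodeType = 3 Then                      ' 3 = text node
            If Len(Trim$(sib.nodeValue)) > 0 Then     ' Trim$ removes spaces only
                PrecedingText = Trim$(sib.nodeValue)
                Exit Function
            End If
        End If
        Set sib = sib.previousSibling                 ' skip elements such as <br>
    Loop
End Function
```

With this in place, [A1] = PrecedingText(post) should give the same name as the two PreviousSibling calls for the markup posted above.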
 