• Hi All

    Please note that at the Chandoo.org Forums there is Zero Tolerance to Spam

    Post Spam and you Will Be Deleted as a User

    Hui...

  • When starting a new post, to receive a quicker and more targeted answer, Please include a sample file in the initial post.

How to extract values from Text

BSM

New Member
Hi,

I have an excel report containing source code of our FAQs. It has lot of "href" references which point to different URLs. I want to extract the URL against each FAQ and validate it. The result can appear on next column against each FAQ

Would appreciate the help. This can save a lot of my time currently as I manually extract href URLs for each FAQ and validate them

Thanks,
 
Hi ,

Extracting the URL can be done by using formulae , but validating each URL will need VBA.

Either way , the first thing to be done is to have the data ; please upload your file.

Narayan
 
  • Like
Reactions: BSM
Hi BSM,

Welcome to Chandoo.org forum.

A sample file would help readers to understand your data structure and your requirement. SO kindly, upload a sample file.

Regards,
 
  • Like
Reactions: BSM
Thanks Somendra and Narayan.Appreciate your quick response. I have attached the sample file. The URLs marked in red color are invalid whereas the one in blue are valid
 

Attachments

  • abcd.xlsx
    16.9 KB · Views: 10
Hi, BSM!

As a new user you might want to (I'd say should and must) read this:
http://chandoo.org/forum/forums/new-users-please-start-here.14/

And regarding your issue, at column B you have content extracted from the HTML code of some web page, and you've colored some in red and cyan, at least. Now the questions are:
a) What did you tried to mean with "validate"?
b) How would you identify which of the multiple URL at each cell should be validated?

Regards!
 
Hi SirJB7,

The sample html code is part of web FAQ that our team built from time to time.Here is the use case. The KM team keep adding/updating content in FAQ. Currently there is no way to validate these URLs as there are possibility that a URL which an old FAQ was referencing earlier is no longer valid and needs to be identified and corrected.

I have access to FAQ source code and can pull them into excel through a simple SQL query. As of now this is (validation of each URL in a FAQ) is done manually and am seeking your expert help to make the task easier using excel power.

a) By Validate I mean that the extracted URL should be valid and not throw "page not found" or any other error. user NARAYANK991 first reply mentioned about using VBA approach and he asked me for the sample xls

b) All the URLs extracted from step (a) above needs to be validated and listed alongwith validation result

Rgds,
BSM
 
Hi, BSM!

Not only they're all invalid but they can't be all validated, since the 1st one does actually raise a 404 error but the other two handle that error and perform a redirection. So I think you could only aspire to extract the existing URLs and not much more.

Regards!
 
Hi SirJB7,

I got your point.

In that case can we exclude the redirect's case and only flag the one where HTTP response codes are of 400 or 404 types? Extract URL and this would suffice.

Iam interested to see how it can be done in excel.

Thanks,
 
Hi, BSM!

From VBA it'll be very difficult -if possible, which I doubt- to perform such validations, you'll have to get involved with PHP or Java or some other languages to do that. IMHO far beyond the scope of these forums.

Regards!
 
Hi ,

Two points :

1. See if the code at the following link helps in the first part :

http://www.mrexcel.com/forum/excel-questions/632198-check-if-url-valid.html

An explanation of this technique is available here :

http://www.jpsoftwaretech.com/vba/validate-url/

2. Regarding the second part , it should be possible , since both pages are returning the text :

We are sorry , the page you requested could not be found

in the title ; searching for this text or at least the relevant parts of this text in the title of the returned page does not require Java or PHP.

Narayan
 
Thanks Narayan and SirJB7. I will wait for your inputs and in the meantime will also check the links
 
Narayan,

I tested the code provided by Leith in http://www.mrexcel.com/ forum and found that it was even giving "invalid URL" error even for a valid link. I have updated the forum post with my comments

Now if you guys can help me to clear the 1st step i.e. extract multiple URLs for a row, I can apply the Leith VBA code

Thanks.
 
Hi ,

That seems surprising , since the return code it is checking for , 200 , will normally not be returned for a valid URL.

Can you post a valid URL where it returns an invalid URL message ?

Narayan
 
Hi BSM,
I had done something similar in recent past, since that data is sensitive I can't share it here.
I used mostly find and replace to extract the desired text. Soundy bizarre?
First of all try to find out possible start point and end point of your link for e.g start point may be 'href=' or 'link'. Now you can replace the 'href=' by href= & char(10). Do same thing for end point of your link. Once you are sure you have covered all. you can split your data with text to columns using char(10) i.e Ctrl + J as delimiter.
Now you can write simple macro to check if any cell contains text like 'http', if not clear that cell. Now with GOTo Special delete all blank cells. Depending on number of links in your cell, they will be listed next to your original data.

With Regards
Rudra
 
Back
Top