• Hi All

    Please note that at the Chandoo.org Forums there is Zero Tolerance to Spam

    Post Spam and you Will Be Deleted as a User

    Hui...

  • When starting a new post, to receive a quicker and more targeted answer, Please include a sample file in the initial post.

Excel Power Query: Handling json data

jan_tothemoon

New Member
Hi all,

Hope you can help me on this please,

I´m trying to do pull the Data Highlights from the EIA website to an excel file (https://www.eia.gov/). I tried what this post discuss (https://chandoo.org/forum/threads/w...ble-pages-how-to-select-the-other-page.46424/) but I can´t make it work. My goal is to build and energy dashboard in excel with data like this from differente websites, so if I can understand how it works for this I think that I would be able to advance on my task.


Kind Regards
78713
 
This has nothing to do with JSON. Data is actually in html content of the page.
However, since page is heavily nested in div and other style container tags, it is difficult to find/navigate to the desired data from elements table. This isn't a easy site to scrape data using PowerQuery due to this.

I don't have time today to detail every step. But here's quick steps to get you started.

Use Text.FromBinary() to parse raw html code string of the page.
Code:
Source = Text.FromBinary(Web.Contents("https://www.eia.gov/"))

Then paste the result to text editor. Identify the tags that precede and follows the content that you are after.

In this page's case. '<div class="accent">' and '</div>'. However, due to nature of html tag syntax, end tag will almost never be unique.
You will need to locate next start tag that is unique. In this case, '<div class="l-row l-two-col-right-narrow">'.

You will then need to use Text.Split to extract out portion between identified tags. Where {#} is used to specify list index of split text.
Code:
mystr = Text.Split(Text.Split(Source, "<div class=""accent"">"){1},"<div class=""l-row l-two-col-right-narrow"">"){0}

For ease of manipulation. You will then convert this into list.
Code:
mylist = Lines.FromText(mystr)

See below for list generated.
<h2>Data Highlights</h2>
<h3><a href="https://www.eia.gov/dnav/pet/hist/LeafHandler.ashx?n=PET&s=RCLC1&f=D">WTI crude oil futures price</a></h3>
<p class="p12">4/27/2022: <b>$102.02/barrel</b><br/>
<p><span class="ico_down">down</span>$0.73 from week earlier<br />
<span class="ico_up">up</span>$39.08 from year earlier</p>
<h3><a href="https://www.eia.gov/dnav/ng/hist/rngc1d.htm">Natural gas futures price</a></h3>
<p class="p12">4/27/2022: <b>$7.267/MMBtu</b><br/>
<p><span class="ico_up">up</span>$0.330 from week earlier<br />
<span class="ico_up">up</span>$4.394 from year earlier</p>
<h3><a href="https://www.eia.gov/coal/production/weekly/">Weekly coal production</a></h3>
<p class="p12">4/23/2022: <b>9.874 million tons</b><br/>
<p><span class="ico_down">down</span>0.631 million tons from week earlier<br />
<span class="ico_down">down</span>0.973 million tons from year earlier</p>
<h3><a href="https://ir.eia.gov/ngs/ngs.html">Natural gas inventories</a></h3>
<p class="p12">4/22/2022: <b>1,490 Bcf</b><br/>
<p><span class="ico_up">up</span>40 Bcf from week earlier<br />
<span class="ico_down">down</span>406 Bcf from year earlier</p>
<h3><a href="https://www.eia.gov/dnav/pet/hist/LeafHandler.ashx?n=PET&s=WCESTUS1&f=W">Crude oil inventories</a></h3>
<p class="p12">4/22/2022: <b>414.4 million barrels</b><br/>
<p><span class="ico_up">up</span>0.7 million barrels from week earlier<br />
<span class="ico_down">down</span>78.7 million barrels from year earlier</p>
</div>
</div>
</div>
</div>

Now, you'll need to further process this list. Using list functions and using site's html structure.
You can see that <h3> contains header for each section. Each section having 3 rows and a blank row separating them. So each section is comprised of 5 rows.

Therefore, from list index 1 (i.e. 2nd row) of the list, you need 5 * count of <h3> (i.e. 5) = 25 rows.

To count number of <h3> tags. You'd use List.Select and List.Count.
Code:
h3count = List.Count(List.Select(mylist, each Text.Contains(_, "<h3>")))

You can then use List.Range to extract pertinent part of the list.
Code:
cleanList = List.Range(myList, 1, h3Count * 5)

This should give you enough of starting point to work with. See below for all the steps combined in query.
Code:
let
    Source = Text.FromBinary(Web.Contents("https://www.eia.gov/")),
    myStr = Text.Split(Text.Split(Source, "<div class=""accent"">"){1},"<div class=""l-row l-two-col-right-narrow"">"){0},
    myList = Lines.FromText(myStr),
    h3Count = List.Count(List.Select(myList, each Text.Contains(_, "<h3>"))),
    cleanList = List.Range(myList, 1, h3Count * 5)
in
    cleanList
 
Hi Chihiro,

Very good explanation how to get there, I have not M coding/coding experience and I understood you perfectly.
Nevertheless, since I'm extremely beginner in this, I can´t clean it properly and convert it into a a useful table to show in a report.

Out there in the web, I found this code that cleans html text from tables but I doesn´t work for me, could help on this? PFA the code:
Code:
let func =   
 (Table as table, Function, TypeForColumns as type, optional ColumnNames as list) =>
let
    columnNames = if ColumnNames = null then Table.ColumnNames(Table) else ColumnNames,
    Transformation = Table.TransformColumns( Table, List.Transform(columnNames, each {_, Function, TypeForColumns} ) )
in
    Transformation ,
documentation = [
Documentation.Name =  " Table.TransformAllColumns.pq ",
Documentation.Description = " Transforms all columns of a <code>table</code>  with one <code>function</code> and one <code>type</code>. ",
Documentation.LongDescription = " Transforms all columns of a <code>table</code> with one <code>function</code> and one <code>type</code>. Optionial <code>ColumnNames</code> to limit to a specific list. ",
Documentation.Category = " Table ",
Documentation.Source = " www.TheBIccountant.com https://wp.me/p6lgsG-2dQ .   ",
Documentation.Version = " 1.0 ",
Documentation.Author = " Imke Feldmann ",
Documentation.Examples = {[Description =  "  ",
Code = " TableTransformAllColumns( #table( {""TextColumn1"", ""TextColumn2""}, List.Zip( { {""123<code>456</code>"" ,""789<code>101</code>""}, {""ABC<code>DEF</code>"" ,""GHI<code>JKL</code>""} } ) ), fnRemoveHtmlTags, type text) ",
Result = " #table( {""TextColumn1"", ""TextColumn2""}, List.Zip( { {""123456"" ,""789101""}, {""ABCDEF"" ,""GHIJKL""} } ) ) "]}]
 
 in 
  Value.ReplaceType(func, Value.ReplaceMetadata(Value.Type(func), documentation))

Source: https://www.thebiccountant.com/2019...-columns-at-once-in-power-bi-and-power-query/

Hope this thread can help others.

<Mod edit: Put M-code inside code tag for legibility>
 
Last edited by a moderator:
FYI - If it is <table> element within html that you wanted to extract. You don't need special convoluted function. The function you posted above is to apply transformation(s) using single function to entire table or to specific columns within PowerQuery. Not directly related to parsing out table from html.

See below for how to parse out <table> element from html string.
WebSite single URL multiple table pages, how to download all pages? | Chandoo.org Excel Forums - Become Awesome in Excel

Note: Or if you just wanted to strip all html tags. use Html.Table(HtmlCode, {{"text",":root"}})

I found bit of time today, so here's how you can parse it out from this site.

After what I posted. It is simple matter of understanding structure and parsing out texts. Note that what you want to extract from that site isn't really a table and not structured as such.

Below code will transform it to more manageable form. While the code can be condensed significantly. I left majority as individual steps so that it is easier to follow in "Applied Steps" pane in PowerQuery.

Code:
let
    Source = Text.FromBinary(Web.Contents("https://www.eia.gov/")),
    myStr = Text.Split(Text.Split(Source, "<div class=""accent"">"){1},"<div class=""l-row l-two-col-right-narrow"">"){0},
    myList = Lines.FromText(myStr),
    h3Count = List.Count(List.Select(myList, each Text.Contains(_, "<h3>"))),
    cleanList = List.Range(myList, 1, h3Count * 5),
    finalList = List.Select(cleanList, each Text.Length(Text.Trim(_)) > 1),
    #"Converted to Table" = Table.FromList(finalList, Splitter.SplitByNothing(), null, null, ExtraValues.Error),
    #"Added Custom" = Table.AddColumn(#"Converted to Table", "Heading", each if Text.Contains([Column1],"<h3>") then Text.Trim(List.Last(Text.Split(Text.Split([Column1],"</a>"){0},">"))) else null),
    #"Filled Down" = Table.FillDown(#"Added Custom",{"Heading"}),
    #"Grouped Rows" = Table.Group(#"Filled Down", {"Heading"}, {{"Temp", each _, type table [Column1=text, Heading=text]}}),
    #"Added Custom1" = Table.AddColumn(#"Grouped Rows", "str", each Text.Combine(List.Select([Temp][Column1], each Text.Contains(_,"<h3>") <> true))),
    #"Removed Columns" = Table.RemoveColumns(#"Added Custom1",{"Temp"}),
    #"Added Custom2" = Table.AddColumn(#"Removed Columns", "Date", each Text.Split(Text.Split([str],":"){0},">"){1}),
    #"Added Custom3" = Table.AddColumn(#"Added Custom2", "Value", each Text.Split(Text.Split([str],"</b>"){0},"<b>"){1}),
    #"Added Custom4" = Table.AddColumn(#"Added Custom3", "IndicatorWk", each List.Last(Text.Split(Text.Split([str],"</span>"){0},">"))),
    #"Added Custom5" = Table.AddColumn(#"Added Custom4", "ChangeWk", each Text.Split(Text.Split([str],"</span>"){1},"<"){0}),
    #"Added Custom6" = Table.AddColumn(#"Added Custom5", "IndicatorYr", each List.Last(Text.Split(Text.Split([str],"</span>"){1},">"))),
    #"Added Custom7" = Table.AddColumn(#"Added Custom6", "ChangeYr", each Text.Replace(List.Last(Text.Split([str],"</span>")),"</p>","")),
    #"Removed Columns1" = Table.RemoveColumns(#"Added Custom7",{"str"}),
    #"Changed Type" = Table.TransformColumnTypes(#"Removed Columns1",{{"Heading", type text}, {"Date", type date}, {"Value", type text}, {"IndicatorWk", type text}, {"ChangeWk", type text}, {"IndicatorYr", type text}, {"ChangeYr", type text}})
in
    #"Changed Type"

Resulting table below.

HeadingDateValueIndicatorWkChangeWkIndicatorYrChangeYr
WTI crude oil futures price4/28/2022$105.36/barrelup$1.57 from week earlierup$41.50 from year earlier
Natural gas futures price4/28/2022$6.888/MMBtudown$0.069 from week earlierup$3.963 from year earlier
Weekly coal production4/23/20229.874 million tonsdown0.631 million tons from week earlierdown0.973 million tons from year earlier
Natural gas inventories4/22/20221,490 Bcfup40 Bcf from week earlierdown406 Bcf from year earlier
Crude oil inventories4/22/2022414.4 million barrelsup0.7 million barrels from week earlierdown78.7 million barrels from year earlier
 
GraH, Chihiro,
Really appreciate it! Thank you.
I've been working in this the whole week and I undertood your steps (even though I think it will be a challenge for me to get to others "tables" alone)

This is very helpful. Regards,
 
Back
Top