Auto tagging your old blog posts

As you may know, I recently moved my blog from Blogger to WordPress. Since I started blogging way back in 2004, most of my posts were untagged. I have used Blogger labels ever since they were introduced, but almost 400 of my posts were still untagged. I had two options:

  • Manually edit posts and tag / categorize them
  • Auto tag posts through some plug-in and then fine tune the post tags over the next few months

I chose the second option and promptly installed the Simple Tags plugin. Even though Simple Tags can automatically tag all your posts, you still have to provide the list of tags to use. I could create the list by typing out everything I think I have been blogging about (e.g. IIM, CAT, Indore, software, excel, data, business, advertising, IRIS, Utsaha, marketing), but over the years I have blogged about so many things that just recalling them would take a lot of time, and I could easily miss some tags. So I wanted to analyze my posts (400+ of them) to generate the tags.

  • My first job was to somehow convert the blog posts into plain text so that they are easy to analyze. So I exported the entire blog using WordPress's incredible XML export option [go to the WordPress blog dashboard > Manage > Export and select the author whose posts you want to export]. Then I opened the file in Notepad++ to see its structure. As you can see below, the post contents are enclosed in content:encoded tags.
    [Image: WordPress XML export file structure]
  • Then I wrote a small Java program to extract all the post contents and count the words. The logic is simple:
    1. Find the first "content:encoded" tag
    2. While the next word is not "/content:encoded":
    3. increase that word's frequency by 1
    4. get the next word
    5. Find the next "content:encoded" tag and repeat
    6. Write the output in the form word: frequency to a text file
  • Once I had the words and their frequencies in a text file, I used my favorite tool (that's right, Excel) and sorted the list in descending order of frequency. There were 25,000+ unique words in my blog posts. But the list needed a bit of cleaning: since I had used StringTokenizer to parse words, it returned tokens like “<a”, “@something”, “hello.” and “because,”. After cleaning and removing all the unnecessary symbols and special characters, I was left with at least 23,000 words and their frequencies.
  • I ignored all words with a frequency of more than 175 (just an arbitrary number I picked) or less than 15.
    [Image: blog post word frequencies]
  • I took the remaining list and manually scanned it to pick out the distinct words. This way I could knock off most of the synonyms (like advertising and ads, marketing and selling, b-school and bschool).
  • After all this I had a list of about 100 words. I edited the words for clarity and pasted them into the Simple Tags plugin page. Then I ran the auto-tag feature and all my posts were tagged.
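
If you want to try the counting and filtering steps yourself, here is a minimal sketch in Java. It is not my original program, just one way to follow the same logic. A few things in it are assumptions: the file names are made up, it uses a regular expression to find the content:encoded blocks instead of walking word by word, it lower-cases every word, and it does the cleaning and the 15 to 175 filtering in code, whereas I did those in Excel.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.StringTokenizer;
import java.util.TreeMap;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class WordFrequency {

    // Matches the body of each <content:encoded> element; DOTALL lets '.' span line breaks.
    private static final Pattern POST_BODY =
            Pattern.compile("<content:encoded>(.*?)</content:encoded>", Pattern.DOTALL);

    // Counts cleaned, lower-cased words across all post bodies in the export text.
    static Map<String, Integer> countWords(String exportXml) {
        Map<String, Integer> counts = new TreeMap<>();
        Matcher posts = POST_BODY.matcher(exportXml);
        while (posts.find()) {
            // Drop the CDATA wrapper first, then HTML tags, so "<a" is never counted as a word.
            String text = posts.group(1)
                    .replace("<![CDATA[", " ")
                    .replace("]]>", " ")
                    .replaceAll("<[^>]*>", " ");
            StringTokenizer words = new StringTokenizer(text);
            while (words.hasMoreTokens()) {
                // Assumption: keep only a-z and 0-9, so "hello." becomes "hello"
                // and "b-school" becomes "bschool".
                String word = words.nextToken()
                        .toLowerCase(Locale.ROOT)
                        .replaceAll("[^a-z0-9]", "");
                if (!word.isEmpty()) {
                    counts.merge(word, 1, Integer::sum);
                }
            }
        }
        return counts;
    }

    // Keeps only the words whose frequency lies between min and max, inclusive.
    static Map<String, Integer> keepBand(Map<String, Integer> counts, int min, int max) {
        Map<String, Integer> kept = new TreeMap<>();
        for (Map.Entry<String, Integer> e : counts.entrySet()) {
            if (e.getValue() >= min && e.getValue() <= max) {
                kept.put(e.getKey(), e.getValue());
            }
        }
        return kept;
    }

    public static void main(String[] args) throws IOException {
        // Hypothetical file names; pass the real export path as the first argument.
        Path in = Paths.get(args.length > 0 ? args[0] : "wordpress-export.xml");
        String xml = new String(Files.readAllBytes(in), StandardCharsets.UTF_8);

        // One "word: frequency" line per surviving word, in alphabetical order.
        List<String> lines = new ArrayList<>();
        for (Map.Entry<String, Integer> e : keepBand(countWords(xml), 15, 175).entrySet()) {
            lines.add(e.getKey() + ": " + e.getValue());
        }
        Files.write(Paths.get("word-frequencies.txt"), lines, StandardCharsets.UTF_8);
    }
}
```

The output file still needs the manual pass described above: merging synonyms and picking the words that actually make sense as tags.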

Now most of my previous posts are tagged, so I can automatically locate related posts when I write something.

Well, in case you have migrated your blog from Blogger to WordPress and are trying to tag your posts, please drop a comment here; I can send you my Java program, which generates word frequencies from the XML export. You can then analyze them and find the tags for your posts.
The program is no longer in a reusable state (I have edited it so much that it is pretty much useless to everyone except me). If you are trying to write it yourself, follow the logic above and you should be able to do it. Good luck.


16 Responses

  1. @all.. I am sorry to say this, but the program is no longer in a reusable condition. But it is a fairly simple program, you can follow the logic mentioned in the post to recreate it yourself.

    all the best. 🙂
