Auto tagging your old blog posts

Posted on January 7th, 2008 in Analytics, blogging, hacks, ideas, technology - 16 comments

As you may know, I recently moved my blog from Blogger to WordPress. Since I started blogging way back in 2004, most of my posts were untagged. I have used Blogger labels ever since they were introduced, but almost 400 of my posts were still untagged. I had two options:

  • Manually edit posts and tag / categorize them
  • Auto tag posts through some plug-in and then fine tune the post tags over the next few months

I chose the second option and promptly installed the Simple Tags plugin. Even though Simple Tags can automatically tag all your posts, you still have to provide the list of tags to use. I could have created the list by typing out everything I think I have been blogging about (e.g. IIM, CAT, Indore, software, excel, data, business, advertising, IRIS, Utsaha, marketing, etc.), but over the years I have blogged about so many things that just recalling them all would take a lot of time, and I could easily miss some tags. So I wanted to analyze my posts (400+ of them) to generate the tags.

  • My first job was to somehow convert the blog posts into plain text so that they are easy to analyze. So I exported the entire blog using WP's incredible XML export option [go to the WordPress dashboard > Manage > Export and select the author whose posts you want to export]. Then I opened the file in Notepad++ to see its structure. As you can see below, the post contents are enclosed in content:encoded tags.
    [Image: WordPress XML export file structure]
  • Then I wrote a small Java program to extract all the post contents and count the words. The logic is simple:
    1. Find the first "content:encoded" tag
    2. While the next word is not "/content:encoded":
    3. increase that word's frequency by 1
    4. get the next word
    5. Find the next "content:encoded" tag and repeat
    6. Write the output in the form word: frequency to a text file
  • Once I had the words and their frequencies in a text file, I used my favorite tool, that's right, Excel, and sorted the list in descending order of frequency. There were 25,000+ unique words in my blog posts. But the list needed a bit of cleaning: since I had used StringTokenizer to parse the words, it returned tokens like “<a”, “@something”, “hello.” and “because,”. After cleaning and removing all the unnecessary symbols and special characters, I was left with at least 23,000 words and their frequencies.
  • I discarded all the words with a frequency of more than 175 (just an arbitrary number I picked) or less than 15.
    [Image: blog post word frequencies]
  • I took the remaining list and manually scanned it to pick out the unique words. This way I could knock off most of the synonyms (like advertising and ads, marketing and selling, b-school and bschool, etc.).
  • After all this I had a list of about 100 words. I then edited the words for clarity and pasted them into the Simple Tags plugin page. Then I ran the auto-tag feature and all my posts were tagged.
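
The counting logic from the steps above can be sketched in Java roughly as follows. This is not the original program, only a minimal illustration: the class name WordFrequency, the method name count, lowercasing every token, and the two command-line arguments (export file, output file) are all assumptions made for the example. It splits on whitespace with StringTokenizer, so it produces the same kind of noisy tokens described above (WordPress also wraps post bodies in CDATA markers, which will show up glued to the first and last word of each post).

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.Map;
import java.util.StringTokenizer;
import java.util.TreeMap;

// Hypothetical helper class; the name and structure are illustrative only.
final class WordFrequency {
    static final String OPEN = "<content:encoded>";
    static final String CLOSE = "</content:encoded>";

    /** Counts whitespace-separated tokens found between content:encoded tags. */
    static Map<String, Integer> count(String xml) {
        Map<String, Integer> freq = new TreeMap<>();
        int start = xml.indexOf(OPEN);                 // step 1: first opening tag
        while (start >= 0) {
            int bodyStart = start + OPEN.length();
            int end = xml.indexOf(CLOSE, bodyStart);
            if (end < 0) {
                break;                                 // unterminated tag: stop scanning
            }
            StringTokenizer tokens = new StringTokenizer(xml.substring(bodyStart, end));
            while (tokens.hasMoreTokens()) {           // steps 2-4: walk the words
                // Lowercasing is an assumption, so "Excel" and "excel" count together.
                freq.merge(tokens.nextToken().toLowerCase(), 1, Integer::sum);
            }
            start = xml.indexOf(OPEN, end + CLOSE.length()); // step 5: next post
        }
        return freq;
    }

    public static void main(String[] args) throws IOException {
        // args[0] = path to the XML export, args[1] = output text file (both assumed).
        String xml = new String(Files.readAllBytes(Paths.get(args[0])), StandardCharsets.UTF_8);
        StringBuilder out = new StringBuilder();
        // step 6: one "word: frequency" line per distinct token
        count(xml).forEach((word, n) ->
                out.append(word).append(": ").append(n).append(System.lineSeparator()));
        Files.write(Paths.get(args[1]), out.toString().getBytes(StandardCharsets.UTF_8));
    }
}
```

The output file has one word: frequency pair per line, which can be opened in Excel by splitting on the colon.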

Now most of my previous posts are tagged, so related posts can be located automatically whenever I write something.

Well, in case you have migrated your blog from Blogger to WordPress and are trying to tag your posts, please drop a comment here; I can send you my Java program, which generates the word frequencies from the XML export. You can then analyze them and find out the tags for your posts.
The program is no longer in a reusable condition (I have edited it so much that it is pretty much useless to everyone except me). If you are trying to write it yourself, follow the logic above and you should be able to do this. Good luck.
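
If you are rebuilding this yourself, the cleanup and cut-off steps that were done by hand in Excel could also be scripted. The sketch below is one possible way to do it, not what was actually used: the class TagCandidates and its methods are hypothetical names, and stripping leading and trailing non-alphanumeric characters is only an assumed approximation of the manual cleaning. The input is whatever word-to-count map your counting step produces.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical helper class; names and cleaning rules are assumptions.
final class TagCandidates {
    /** Strips leading/trailing characters that are not letters or digits, then lowercases. */
    static String clean(String token) {
        return token.replaceAll("^[^\\p{L}\\p{N}]+|[^\\p{L}\\p{N}]+$", "").toLowerCase();
    }

    /** Re-keys the counts by cleaned word, adding up tokens that clean to the same word. */
    static Map<String, Integer> mergeCleaned(Map<String, Integer> raw) {
        Map<String, Integer> merged = new TreeMap<>();
        for (Map.Entry<String, Integer> e : raw.entrySet()) {
            String word = clean(e.getKey());
            if (!word.isEmpty()) {                 // drop tokens that were only symbols
                merged.merge(word, e.getValue(), Integer::sum);
            }
        }
        return merged;
    }

    /** Keeps only words whose frequency lies between min and max, inclusive. */
    static Map<String, Integer> keepBetween(Map<String, Integer> freq, int min, int max) {
        Map<String, Integer> kept = new TreeMap<>();
        freq.forEach((word, n) -> {
            if (n >= min && n <= max) {
                kept.put(word, n);
            }
        });
        return kept;
    }

    /** Returns the entries sorted by descending frequency, like the Excel sort. */
    static List<Map.Entry<String, Integer>> sortedByFrequency(Map<String, Integer> freq) {
        List<Map.Entry<String, Integer>> entries = new ArrayList<>(freq.entrySet());
        entries.sort(Map.Entry.<String, Integer>comparingByValue().reversed());
        return entries;
    }
}
```

With the cut-offs from this post, the call would be keepBetween(mergeCleaned(counts), 15, 175). Merging synonyms such as "advertising" and "ads" still needs a manual pass over the result.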

Written by Chandoo

16 Responses to “Auto tagging your old blog posts”

  1. [...] my first challenge was to organize the posts for better discovery and search. I have done that by automatically generating tags and tagging posts using a bit of analysis and a bit of [...]

  2. I would love to have a copy of your Java program. Would you mind posting it for download?

  3. Ram says:

    Hi chandoo

    great post. lemme try it. we at http://www.findnearyou.com, have tons of content and are looking at some neat way to autotag all of them.

  4. Chandoo says:

    @Ram - thanks for the comments
    @Online Sho... - Will send you the program in email

  5. Ajay says:

    gr8 post
    never know before this
    thanks

  6. Agel says:

    I really thankful to you for posting this article and I never know about this.

  7. sunish says:

    I recently moved from blogger to wordpress.

    Did some manual tagging for old posts.

    Can you send me the java program?

    Regards
    S

  8. MinPin says:

    This looks really helpful. I'd love to get your program too!

  9. JonnieCrunch says:

    Hey, this looks really helpful.. can you please send me the java app?

  10. Albert says:

    Hi P.H.D. - could you possibly send me a copy of that java app? That would rock. Thanks.

  11. Shawn says:

    Hey, great idea. I'm in the same situation and was wondering if I could get a copy of your java program?

  12. Rebecca says:

    Hate to add a 'me to!' post, but I'd also love a copy of the program. Thanks!

  13. Chandoo says:

    @all.. I am sorry to say this, but the program is no longer in a reusable condition. But it is a fairly simple program, you can follow the logic mentioned in the post to recreate it yourself.

    all the best. 🙂

  14. Athif says:

    Hi Chandoo,
    Athif here. BTW, which wordpress related posts plugin do you use?
