Auto tagging your old blog posts

Posted on January 7th, 2008 in Analytics, blogging, hacks, ideas, technology - 13 comments

As you may know, I recently moved my blog from Blogger to WordPress. Since I started blogging way back in 2004, most of my posts were untagged. I have used Blogger labels ever since they were introduced, but almost 400 of my posts were still untagged. I had two options:

  • Manually edit posts and tag / categorize them
  • Auto tag posts through some plug-in and then fine tune the post tags over the next few months

I chose the second option and promptly installed the Simple Tags plugin. Even though Simple Tags can automatically tag all your posts, you still have to provide the list of tags to use. I could create the list by typing out everything I think I have been blogging about (e.g. IIM, CAT, Indore, software, excel, data, business, advertising, IRIS, Utsaha, marketing), but over the years I have blogged about so many things that just recalling them all would take a lot of time, and I could still miss some tags. So I wanted to analyze my posts (400+ of them) to generate the tags.

  • My first job was to convert the blog posts into plain text so that they are easy to analyze. So I exported the entire blog using WP’s incredible XML export option [go to WordPress blog dashboard > Manage > Export and select the author whose posts you want to export]. Then I opened the file in Notepad++ to see its structure. As you can see below, the post contents are enclosed in content:encoded tags.
    [Image: WordPress XML export file structure]
  • Then I wrote a small Java program to extract all the post contents and count the words. The logic is simple:
    1. Find the first "content:encoded" tag
    2. While the next word is not "/content:encoded":
    3. increase that word's frequency by 1
    4. get the next word
    5. Get the next "content:encoded" tag and repeat
    6. Send the output in the form of word: frequency to a txt file
  • Once I had the words and their frequencies in a text file, I used my favorite tool, that’s right, Excel, and sorted the list in descending order of frequency. There were 25,000+ unique words in my blog posts. But the list needed a bit of cleaning: since I had used StringTokenizer to parse words, it returned tokens like “<a”, “@something”, “hello.” and “because,”. After cleaning and removing all the unnecessary symbols and special characters, I was left with at least 23,000 words and their frequencies.
  • I ignored all the words with a frequency above 175 (an arbitrary cutoff I picked) or below 15.
    [Image: blog posts word frequencies]
  • I took the remaining list and scanned it manually to pick out the distinct words. This way I could knock off most of the synonyms and variants (like advertising and ads, marketing and selling, b-school and bschool).
  • After all this I had a list of about 100 words. I edited the words for clarity and pasted them into the Simple Tags plugin page. Then I ran the auto-tag feature and all my posts were tagged.
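
For anyone recreating the word-count step, here is a minimal Java sketch of the logic above. It is not the original program. The class and method names (WordFrequency, countWords, candidates) are made up for illustration, and it assumes the export wraps each post body in <content:encoded><![CDATA[ ... ]]></content:encoded>. Unlike the steps above, it strips HTML tags and surrounding punctuation while counting, so the cleanup does not have to happen in Excel, and it applies the 15 to 175 frequency band before printing.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.StringTokenizer;
import java.util.TreeMap;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class WordFrequency {
    // Matches the body of each content:encoded element (assumed export layout).
    private static final Pattern CONTENT = Pattern.compile(
            "<content:encoded>(.*?)</content:encoded>", Pattern.DOTALL);

    /** Counts how often each cleaned, lower-cased word occurs across all post bodies. */
    static Map<String, Integer> countWords(String exportXml) {
        Map<String, Integer> counts = new TreeMap<>();
        Matcher posts = CONTENT.matcher(exportXml);
        while (posts.find()) {
            // Drop the CDATA wrapper and HTML tags so tokens like "<a" never show up.
            String body = posts.group(1)
                    .replace("<![CDATA[", " ")
                    .replace("]]>", " ")
                    .replaceAll("<[^>]*>", " ");
            StringTokenizer words = new StringTokenizer(body);
            while (words.hasMoreTokens()) {
                // Trim leading and trailing symbols: "hello." becomes "hello".
                String word = words.nextToken().toLowerCase(Locale.ROOT)
                        .replaceAll("^[^a-z0-9]+|[^a-z0-9]+$", "");
                if (!word.isEmpty()) {
                    counts.merge(word, 1, Integer::sum);
                }
            }
        }
        return counts;
    }

    /** Keeps words whose frequency lies between min and max (inclusive), most frequent first. */
    static List<Map.Entry<String, Integer>> candidates(
            Map<String, Integer> counts, int min, int max) {
        List<Map.Entry<String, Integer>> kept = new ArrayList<>();
        for (Map.Entry<String, Integer> e : counts.entrySet()) {
            if (e.getValue() >= min && e.getValue() <= max) {
                kept.add(e);
            }
        }
        kept.sort((a, b) -> Integer.compare(b.getValue(), a.getValue()));
        return kept;
    }

    public static void main(String[] args) throws IOException {
        // args[0] is the path to the exported XML file.
        String xml = new String(Files.readAllBytes(Paths.get(args[0])), StandardCharsets.UTF_8);
        for (Map.Entry<String, Integer> e : candidates(countWords(xml), 15, 175)) {
            System.out.println(e.getKey() + ": " + e.getValue());
        }
    }
}
```

Run it with the path to the export file as the only argument and redirect the output to a text file. The manual pass for synonyms and the final pick of about 100 tags still has to be done by hand.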

Now most of my previous posts are tagged, so I can automatically locate related posts when I write something.

Well, in case you have migrated your blog from Blogger to WordPress and are trying to tag your posts, please drop a comment here; I can send you my Java program, which generates the word frequencies from the XML export. You can then analyze them and find out the tags for your posts.
Update: The program is no longer in a reusable state (I have edited it so much that it is pretty much useless to everyone except me). If you want to write it yourself, follow the logic above and you should be able to do it. Good luck.


Comments
Online Shopping Network March 22, 2008

I would love to have a copy of your Java program. Would you mind posting it for download?

Ram March 28, 2008

Hi chandoo

great post. lemme try it. we at http://www.findnearyou.com, have tons of content and are looking at some neat way to autotag all of them.

Chandoo March 28, 2008

@Ram – thanks for the comments
@Online Sho… – Will send you the program in email

Ajay April 10, 2008

gr8 post
never know before this
thanks

Agel May 22, 2008

I am really thankful to you for posting this article; I never knew about this.

sunish June 17, 2008

I recently moved from blogger to wordpress.

Did some manual tagging for old posts.

Can you send me the java program?

Regards
S

MinPin August 19, 2008

This looks really helpful. I’d love to get your program too!

JonnieCrunch December 21, 2008

Hey, this looks really helpful.. can you please send me the java app?

Albert February 6, 2009

Hi P.H.D. – could you possibly send me a copy of that java app? That would rock. Thanks.

Shawn May 21, 2009

Hey, great idea. I’m in the same situation and was wondering if I could get a copy of your java program?

Rebecca May 30, 2009

Hate to add a ‘me too!’ post, but I’d also love a copy of the program. Thanks!

Chandoo June 1, 2009

@all.. I am sorry to say this, but the program is no longer in a reusable condition. But it is a fairly simple program, you can follow the logic mentioned in the post to recreate it yourself.

all the best. :)
