Auto-tagging your old blog posts
As you may know, I recently moved my blog from Blogger to WordPress. I started blogging way back in 2004, long before Blogger had labels, so most of my posts were untagged. I have used Blogger labels ever since they were introduced, but almost 400 of my posts still had no tags. I had two options:
- Manually edit posts and tag / categorize them
- Auto tag posts through some plug-in and then fine tune the post tags over the next few months
I chose the second option and promptly installed the Simple Tags plugin. Simple Tags can automatically tag all your posts, but you still have to give it the list of tags to use. I could have built that list by typing out everything I think I have blogged about (e.g. IIM, CAT, Indore, software, excel, data, business, advertising, IRIS, Utsaha, marketing), but over the years I have blogged about so many things that recalling them all would take a long time, and I would certainly miss some. So I decided to analyze my posts (400+ of them) to generate the tags.
- My first job was to convert the blog posts into plain text so that they would be easy to analyze. I exported the entire blog using WP's incredible XML export option [go to the WordPress dashboard > Manage > Export and select the author whose posts you want to export]. Then I opened the file in Notepad++ to see its structure. The post contents are included in the "content:encoded" tags.
- Then I wrote a small Java program to extract all the post contents and count the words. The logic is simple:
1. Find the first "content:encoded" opening tag
2. While the next word is not the closing "/content:encoded" tag
3. increase that word's frequency by 1
4. get the next word
5. Move to the next "content:encoded" opening tag and repeat from step 2
6. Once all posts are processed, write the output in the form word: frequency to a text file
- Once I had the words and their frequencies in a text file, I used my favorite tool (that's right, Excel) and sorted the list in descending order of frequency. There were 25,000+ unique words in my blog posts. The list needed some cleaning, though: because I had used StringTokenizer to split the text, it returned "words" like “<a “, “@something”, “hello.” and “because,”. After removing the unnecessary symbols and special characters I was left with at least 23,000 words and their frequencies.
- I dropped all words with a frequency above 175 (just an arbitrary cutoff I picked) or below 15.
- I took the remaining list and scanned it manually to pick out the unique words. This way I could knock off most of the synonyms and variants (like advertising and ads, marketing and selling, b-school and bschool).
- After all this I had a list of about 100 words. I edited the words to make them clearer and pasted them into the Simple Tags plug-in page. Then I ran the auto-tag feature and all my posts were tagged.
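If you want to try the counting and filtering steps yourself, here is a minimal Java sketch of the logic described above. It is not my original program: the class and method names (TagCandidates, countWords, keepMidFrequency) are made up for illustration, it uses a regular expression instead of StringTokenizer so that most of the punctuation cleanup happens while counting, and it assumes the export file fits in memory and that every post body sits between plain "content:encoded" opening and closing tags. HTML entities such as &amp; are not decoded, so expect a little manual cleanup anyway.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.Locale;
import java.util.Map;
import java.util.TreeMap;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class TagCandidates {
    // Captures the body of each content:encoded element; DOTALL lets a body span lines.
    private static final Pattern POST_BODY =
            Pattern.compile("<content:encoded>(.*?)</content:encoded>", Pattern.DOTALL);

    // Counts how often each word occurs across all post bodies in the export.
    static Map<String, Integer> countWords(String exportXml) {
        Map<String, Integer> counts = new TreeMap<>();
        Matcher posts = POST_BODY.matcher(exportXml);
        while (posts.find()) {
            String body = posts.group(1)
                    .replace("<![CDATA[", " ")   // strip the CDATA wrapper, if any
                    .replace("]]>", " ")
                    .replaceAll("<[^>]*>", " "); // drop HTML tags such as <a href=...>
            // Split on anything that is not a letter, digit, apostrophe or hyphen.
            for (String token : body.toLowerCase(Locale.ROOT).split("[^\\p{L}\\p{N}'-]+")) {
                String word = token.replaceAll("^['-]+|['-]+$", ""); // trim stray quotes/hyphens
                if (!word.isEmpty()) {
                    counts.merge(word, 1, Integer::sum);
                }
            }
        }
        return counts;
    }

    // Keeps only words whose frequency lies in [min, max], both ends inclusive.
    static Map<String, Integer> keepMidFrequency(Map<String, Integer> counts, int min, int max) {
        Map<String, Integer> kept = new TreeMap<>();
        for (Map.Entry<String, Integer> e : counts.entrySet()) {
            int n = e.getValue();
            if (n >= min && n <= max) {
                kept.put(e.getKey(), n);
            }
        }
        return kept;
    }

    public static void main(String[] args) throws IOException {
        if (args.length < 1) {
            System.err.println("usage: java TagCandidates <wordpress-export.xml>");
            return;
        }
        String xml = new String(Files.readAllBytes(Paths.get(args[0])), StandardCharsets.UTF_8);
        // 15 and 175 are the arbitrary cutoffs mentioned in the post.
        Map<String, Integer> candidates = keepMidFrequency(countWords(xml), 15, 175);
        candidates.entrySet().stream()
                .sorted(Map.Entry.<String, Integer>comparingByValue().reversed())
                .forEach(e -> System.out.println(e.getKey() + ": " + e.getValue()));
    }
}
```

Redirect the output to a text file and you get the same word: frequency list, already sorted by descending frequency. The synonym merging and the final pick of about 100 tags is still a manual job.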
Now that most of my previous posts are tagged, related posts can be located automatically whenever I write something new.
Well, in case you have migrated your blog from Blogger to WordPress and are trying to tag your posts, please drop a comment here; I can send you my Java program, which generates the word frequencies from the XML export. You can then analyze the output and find the tags for your posts.
The program is no longer in a reusable state (I have edited it so much that it is pretty much useless for everyone except me). If you want to write your own, follow the logic above and you should be able to do this. Good luck.