Auto tagging your old blog posts
As you may know, I recently moved my blog from Blogger to WordPress. Since I started blogging way back in 2004, most of my posts were untagged. I have used Blogger labels ever since they were introduced, but almost 400 of my posts were still untagged. I had two options:
- Manually edit posts and tag / categorize them
- Auto-tag posts through a plugin and then fine-tune the post tags over the next few months
I chose the second option and promptly installed the Simple Tags plugin. Even though Simple Tags can automatically tag all your posts, you still have to provide the list of tags to use. I could create the list by typing out everything I think I have been blogging about (e.g. IIM, CAT, Indore, software, excel, data, business, advertising, IRIS, Utsaha, marketing), but over the years I have blogged about so many things that just recalling them all would take a lot of time, and I could easily miss some tags. So I wanted to analyze my posts (400+ of them) to generate the tags.
- My first job was to convert the blog posts into plain text so that they are easy to analyze. So I exported the entire blog using WP’s incredible XML export option [go to the WordPress blog dashboard > Manage > Export and select the author whose posts you want to export]. Then I opened the file in Notepad++ to see its structure. The post contents are enclosed in content:encoded tags.
- Then I wrote a small Java program to extract all the post contents and count the words. The logic is simple:
1. Find the first "content:encoded" tag
2. While the next word is not "/content:encoded"
3. increase that word's frequency by 1
4. get the next word
5. Move to the next "content:encoded" tag and repeat from step 2
6. Write the output to a text file in the form word: frequency
- Once I had the words and their frequencies in a text file, I used my favorite tool (that's right, Excel) and sorted the list in descending order of frequency. There were 25,000+ unique words in my blog posts. But the list needed a bit of cleaning: since I had used StringTokenizer to parse words, it returned tokens like “<a”, “@something”, “hello.” and “because,”. After cleaning up and removing all the unnecessary symbols and special characters, I was left with at least 23,000 words and their frequencies.
- I dropped all the words with a frequency of more than 175 (just an arbitrary cutoff I picked) or less than 15.
- I took the remaining list and manually scanned it to pick out the unique words. This way I could knock off most of the synonyms and variants (like advertising and ads, marketing and selling, b-school and bschool).
- After all this I had a list of about 100 words. I then edited the words for clarity and pasted them into the Simple Tags plugin page. Finally I ran the auto-tag feature and all my posts were tagged.
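The counting and filtering steps above could be sketched roughly as follows. This is not the original program; it is a minimal Java sketch under a few assumptions: the export wraps each post body in content:encoded tags (possibly with a CDATA wrapper), the class and method names (TagCandidates, countWords, keepBand) are made up for illustration, and the symbol cleanup that was done by hand in Excel is replaced here by a regular expression instead of StringTokenizer.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;
import java.util.TreeMap;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical class name; a sketch of the logic, not the original program.
class TagCandidates {

    // Assumption: each post body sits between <content:encoded> and </content:encoded>.
    private static final Pattern CONTENT =
            Pattern.compile("<content:encoded>(.*?)</content:encoded>", Pattern.DOTALL);

    /** Counts the words found inside the content:encoded elements of the export. */
    static Map<String, Integer> countWords(String exportXml) {
        Map<String, Integer> counts = new HashMap<>();
        Matcher posts = CONTENT.matcher(exportXml);
        while (posts.find()) {
            String text = posts.group(1)
                    .replace("<![CDATA[", " ")     // drop CDATA wrappers, if present
                    .replace("]]>", " ")
                    .replaceAll("<[^>]*>", " ");   // drop HTML tags such as <a href=...>
            for (String token : text.split("\\s+")) {
                // Trim leading/trailing symbols so "hello." and "because," count as words.
                String word = token.toLowerCase(Locale.ROOT)
                        .replaceAll("^[^\\p{L}\\p{N}]+|[^\\p{L}\\p{N}]+$", "");
                if (!word.isEmpty()) {
                    counts.merge(word, 1, Integer::sum);
                }
            }
        }
        return counts;
    }

    /** Keeps only the words whose frequency lies between min and max (inclusive). */
    static Map<String, Integer> keepBand(Map<String, Integer> counts, int min, int max) {
        Map<String, Integer> kept = new TreeMap<>();
        for (Map.Entry<String, Integer> e : counts.entrySet()) {
            if (e.getValue() >= min && e.getValue() <= max) {
                kept.put(e.getKey(), e.getValue());
            }
        }
        return kept;
    }

    public static void main(String[] args) throws IOException {
        if (args.length < 1) {
            System.err.println("usage: java TagCandidates <export.xml>");
            return;
        }
        // args[0] is the path to the exported XML file (placeholder name).
        String xml = new String(Files.readAllBytes(Paths.get(args[0])), StandardCharsets.UTF_8);
        // 15 and 175 are the cutoffs mentioned in the post.
        for (Map.Entry<String, Integer> e : keepBand(countWords(xml), 15, 175).entrySet()) {
            System.out.println(e.getKey() + ": " + e.getValue());
        }
    }
}
```

After compiling with javac, something like `java TagCandidates export.xml > frequencies.txt` (both file names are placeholders) should give a word: frequency list that can be opened in Excel. The manual scan for synonyms and the final editing of the tag list still have to be done by hand.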
Now most of my previous posts are tagged, so related posts can be located automatically whenever I write something.
Well, in case you have migrated your blog from Blogger to WordPress and are trying to tag your posts, please drop a comment here. I can send you my Java program, which generates the word frequencies from the XML export. You can then analyze them and find the tags for your posts.
The program is no longer in a reusable state (I have edited it so much that it is pretty much useless for everyone except me). If you are trying to write your own, follow the logic above and you should be able to do this. Good luck.
16 Responses to “Auto tagging your old blog posts”
[...] my first challenge was to organize the posts for better discovery and search. I have done that by automatically generating tags and tagging posts using a bit of analysis and a bit of [...]
I would love to have a copy of your Java program. Would you mind posting it for download?
Hi chandoo
great post. lemme try it. we at http://www.findnearyou.com, have tons of content and are looking at some neat way to autotag all of them.
@Ram - thanks for the comments
@Online Sho... - Will send you the program in email
gr8 post
never knew this before
thanks
I am really thankful to you for posting this article; I never knew about this.
I recently moved from blogger to wordpress.
Did some manual tagging for old posts.
Can you send me the java program?
Regards
S
This looks really helpful. I'd love to get your program too!
Hey, this looks really helpful.. can you please send me the java app?
Hi P.H.D. - could you possibly send me a copy of that java app? That would rock. Thanks.
Hey, great idea. I'm in the same situation and was wondering if I could get a copy of your java program?
Hate to add a 'me too!' post, but I'd also love a copy of the program. Thanks!
@all.. I am sorry to say this, but the program is no longer in a reusable condition. But it is a fairly simple program, you can follow the logic mentioned in the post to recreate it yourself.
all the best. 🙂
[...] have used excel to generate and clean a list of tags for this [...]
Hi Chandoo,
Athif here. BTW, which wordpress related posts plugin do you use?
thank you