Pranav Gupta's answer to How can I use machine learning to propose tags for content?

CSE Undergrad at IITG · 12y ·


Algorithms:

1. As many have suggested, I would also recommend LDA (Latent Dirichlet Allocation).
Advantages:

It's unsupervised, so you don't need labeled data to start off. You will, however, need a corpus of unlabeled documents to train the model initially.
Though the number of topics in LDA is fixed, the number of tags can vary. For each post, LDA gives you the proportion of each topic. To assign tags, you then pick the most probable words under those topics, and how many you pick is up to you.
LDA is scalable in that you can keep adding new incoming posts to your original corpus.
It is also extensible in two senses. First, it is a proper generative model, so you are less likely to get stuck if you decide to extend your model later. Second, it is a sought-after method that has seen steady development, so you will most likely benefit from improvements without much work of your own.

Disadvantages:

A major disadvantage is that it may turn out to be much slower than most of the alternatives.
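
To make the "topic proportions, then top words as tags" idea concrete, here is a minimal sketch in Python. It fits LDA with a small collapsed Gibbs sampler, which is one common way to train the model (the choice of sampler is my assumption, not something the method requires). The names fit_lda and propose_tags, the hyperparameters alpha and beta, and the toy corpus are all hypothetical and only for illustration; for real data you would want an established LDA library.

```python
import random
from collections import Counter


def fit_lda(docs, num_topics, iters=200, alpha=0.1, beta=0.01, seed=0):
    """Fit LDA on tokenized docs (lists of words) with collapsed Gibbs sampling.

    Returns per-document topic counts and per-topic word counts.
    """
    rng = random.Random(seed)
    vocab_size = len({w for doc in docs for w in doc})
    doc_topic = [[0] * num_topics for _ in docs]          # words of doc d in topic k
    topic_word = [Counter() for _ in range(num_topics)]   # count of word w in topic k
    topic_total = [0] * num_topics                        # total words in topic k
    assign = []

    # Start from a random topic assignment for every word occurrence.
    for d, doc in enumerate(docs):
        zs = []
        for w in doc:
            k = rng.randrange(num_topics)
            zs.append(k)
            doc_topic[d][k] += 1
            topic_word[k][w] += 1
            topic_total[k] += 1
        assign.append(zs)

    for _ in range(iters):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                # Remove the current assignment, then resample it.
                k = assign[d][i]
                doc_topic[d][k] -= 1
                topic_word[k][w] -= 1
                topic_total[k] -= 1
                weights = [
                    (doc_topic[d][t] + alpha)
                    * (topic_word[t][w] + beta)
                    / (topic_total[t] + vocab_size * beta)
                    for t in range(num_topics)
                ]
                k = rng.choices(range(num_topics), weights=weights)[0]
                assign[d][i] = k
                doc_topic[d][k] += 1
                topic_word[k][w] += 1
                topic_total[k] += 1
    return doc_topic, topic_word


def propose_tags(d, doc_topic, topic_word, alpha=0.1, max_tags=3):
    """Return (topic proportions, tags) for document d.

    Tags are the most frequent words of the document's topics, taken in
    order of topic weight, so max_tags can vary per post.
    """
    counts = doc_topic[d]
    total = sum(counts) + alpha * len(counts)
    mixture = [(c + alpha) / total for c in counts]
    tags = []
    for k in sorted(range(len(mixture)), key=lambda t: -mixture[t]):
        for word, n in topic_word[k].most_common():
            if n > 0 and word not in tags:
                tags.append(word)
                if len(tags) == max_tags:
                    return mixture, tags
    return mixture, tags


# Toy corpus, already tokenized and stripped of stop words (an assumption).
docs = [
    "python code function python compiler".split(),
    "compiler code bug function".split(),
    "guitar chord melody guitar song".split(),
    "song melody chord band".split(),
]
doc_topic, topic_word = fit_lda(docs, num_topics=2)
mixture, tags = propose_tags(0, doc_topic, topic_word, max_tags=3)
print(mixture, tags)
```

Note that the tags come from the topic's vocabulary, so a post can receive a tag word that never appears in the post itself.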

2. If the size of the list of tags is flexible, you can use TextRank.

It is an older algorithm (proposed by Mihalcea et al. in 2004) for keyphrase extraction and sentence extraction.

Advantages:

It is unsupervised. Not only do you not need labeled data, you don't need any unlabeled data either. It takes a piece of text and automatically suggests the keyphrases in it, with a score for each. The keyphrases with the highest scores can serve as tags.
Unlike LDA, it gives you a systematic way to pick phrases (n-grams) rather than just single words (unigrams).
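
Here is a minimal Python sketch of the keyword-scoring part of TextRank: words that co-occur within a small window are linked in a graph, and a PageRank-style iteration scores them. It is a simplification under stated assumptions: a small hand-written stop-word list stands in for the PoS filter, and the step that merges adjacent keywords into multi-word phrases is left out. The function name textrank_keywords, the STOPWORDS set, and the sample text are hypothetical.

```python
import re
from collections import defaultdict

# Assumption: a tiny stop-word list replaces the part-of-speech filter.
STOPWORDS = {
    "a", "an", "the", "and", "or", "to", "of", "for", "from", "in", "on",
    "is", "are", "be", "it", "that", "this", "with", "use", "make",
}


def textrank_keywords(text, window=2, damping=0.85, iters=50, top_n=5):
    """Score words by running a PageRank-style iteration on a co-occurrence graph."""
    words = [w for w in re.findall(r"[a-z]+", text.lower()) if w not in STOPWORDS]

    # Undirected edge between distinct words that fall inside the same window.
    neighbors = defaultdict(set)
    for i, w in enumerate(words):
        for j in range(i + 1, min(i + window, len(words))):
            if words[j] != w:
                neighbors[w].add(words[j])
                neighbors[words[j]].add(w)

    scores = {w: 1.0 for w in neighbors}
    for _ in range(iters):
        # Each word receives an equal share of every neighbor's score.
        scores = {
            w: (1 - damping)
            + damping * sum(scores[u] / len(neighbors[u]) for u in neighbors[w])
            for w in neighbors
        }
    return sorted(scores.items(), key=lambda kv: -kv[1])[:top_n]


sample = (
    "Machine learning models learn patterns from data. Tagging systems use "
    "machine learning to propose tags for content, and good tags make "
    "content easier to find."
)
for word, score in textrank_keywords(sample, top_n=3):
    print(word, round(score, 3))
```

Because it only needs the one piece of text, there is no training step at all; the trade-off is that every score is relative to that text alone.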

Disadvantages:

The tags are extracted from the text itself. There is no direct way to relate the tags of one document to those of another. So you need a strategy for reusing tags already obtained from previous documents instead of generating new tags each time.
The number of tags can grow very large, so you also need a strategy to keep the list small.

But it works really well.
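
One possible strategy for both problems, sketched in Python: keep a shared tag vocabulary, map each extracted keyphrase to an existing tag when their normalized forms match, and admit a brand-new tag only when its score clears a threshold. This is just one illustrative approach of my own, not something from the TextRank paper; the class TagVocabulary, the normalize helper, and the threshold value are all hypothetical.

```python
def normalize(phrase):
    """Lowercase a phrase and crudely strip a plural 's' from each word."""
    words = phrase.lower().split()
    return " ".join(
        w[:-1] if len(w) > 3 and w.endswith("s") and not w.endswith("ss") else w
        for w in words
    )


class TagVocabulary:
    """Shared tag list that prefers reusing known tags over creating new ones."""

    def __init__(self, new_tag_threshold=1.0):
        self.tags = {}  # normalized form -> tag as first stored
        self.new_tag_threshold = new_tag_threshold

    def assign(self, candidates, max_tags=3):
        """Pick tags for one document from (phrase, score) candidates."""
        chosen = []
        for phrase, score in sorted(candidates, key=lambda c: -c[1]):
            key = normalize(phrase)
            if key in self.tags:
                tag = self.tags[key]            # reuse an existing tag
            elif score >= self.new_tag_threshold:
                tag = phrase.lower()            # strong enough to become a new tag
                self.tags[key] = tag
            else:
                continue                        # weak and unknown: drop it
            if tag not in chosen:
                chosen.append(tag)
            if len(chosen) == max_tags:
                break
        return chosen


vocab = TagVocabulary(new_tag_threshold=1.0)
print(vocab.assign([("Neural Networks", 1.4), ("gradient", 0.6)]))
print(vocab.assign([("neural network", 0.7), ("optimizer", 0.5)]))
```

The threshold controls how fast the tag list grows: a higher value means fewer new tags and more reuse.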

Here is the paper: http://acl.ldc.upenn.edu/acl2004/emnlp/pdf/Mihalcea.pdf
Here is my own naive implementation in C++ (it just generates the list of keywords with their scores; you will have to generate the tags yourself, using the method described in the paper above): Summarizer
It requires a PoS tagger to work, which you can find here: POS-Tagger

Available tools for Automated Tagging

If you settle on a supervised approach, you can use the following tools. Both are good, but I would suggest the second one.

Maui is actually built on top of KEA. Both use the Naive Bayes algorithm after extracting an extensive set of features for all candidate tags.
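
To give a feel for the supervised approach, here is a minimal Python sketch of a Naive Bayes classifier that scores candidate tags from discretized features. It is a generic illustration, not the actual code or feature set of KEA or Maui: the class CandidateNaiveBayes, the two features ("tfidf" and "first"), and the training rows are all made up for the example.

```python
import math
from collections import Counter, defaultdict


class CandidateNaiveBayes:
    """Categorical Naive Bayes with add-one smoothing over candidate features."""

    def fit(self, rows, labels):
        """rows: list of {feature: value} dicts; labels: True if the candidate is a tag."""
        self.class_counts = Counter(labels)
        self.feature_counts = defaultdict(Counter)  # (label, feature) -> value counts
        self.values = defaultdict(set)              # feature -> values seen in training
        for row, y in zip(rows, labels):
            for f, v in row.items():
                self.feature_counts[(y, f)][v] += 1
                self.values[f].add(v)
        return self

    def prob_is_tag(self, row):
        """Posterior probability that a candidate with these features is a tag."""
        total = sum(self.class_counts.values())
        log_scores = {}
        for y, n in self.class_counts.items():
            s = math.log(n / total)  # class prior
            for f, v in row.items():
                smoothed = (self.feature_counts[(y, f)][v] + 1) / (n + len(self.values[f]))
                s += math.log(smoothed)
            log_scores[y] = s
        # Normalize in log space for numerical stability.
        top = max(log_scores.values())
        exp = {y: math.exp(s - top) for y, s in log_scores.items()}
        return exp[True] / sum(exp.values())


# Hypothetical training data: one row per candidate phrase from labeled documents.
rows = [
    {"tfidf": "high", "first": "early"},
    {"tfidf": "high", "first": "late"},
    {"tfidf": "low", "first": "early"},
    {"tfidf": "low", "first": "late"},
    {"tfidf": "low", "first": "late"},
]
labels = [True, True, False, False, False]
model = CandidateNaiveBayes().fit(rows, labels)
print(model.prob_is_tag({"tfidf": "high", "first": "early"}))
```

At tagging time you would compute the same features for every candidate phrase in a new document and keep the candidates with the highest probability.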
