statistics - Blindly classifying new trends in incoming data
News aggregators like Google News automatically classify and rank documents about emerging topics such as "Obama's 2011 Budget".
How do you classify?
I've got a pile of articles, such as baseball news (thanks, openclaim), annotated with entities such as player names and their relevance to each article, and would like to build a Google News-style interface that ranks and displays posts as they come in, especially on emerging topics. I think a naive Bayes classifier could be trained with some static categories, but that isn't really up to tracking trends like "This player was just traded to this team, and these other players were also involved."
No doubt Google News uses other tricks as well, but one relatively cheap approach, computationally, to inferring topics from free text takes advantage of the NLP notion that a word gets its meaning only in relation to the other words around it.
An algorithm capable of discovering new topic categories from multiple documents could be outlined as follows:
- POS (part-of-speech) tag the text. We probably want to focus more on nouns, and possibly even more on named entities (such as Obama or New England)
- Normalize the text. In particular, replace inflected words by their common stem. Some adjectives may also be replaced by the corresponding named entity or noun (ex: Parisian ==> Paris, legal ==> law).
- Identify some words from a manually maintained list of "current / recurring hot words" (Superbowl, elections, scandal ...). This can be used in later stages to give more weight to some N-grams.
- Enumerate all N-grams found in each document (where N is 1 to, say, 4 or 5)
- Tally, separately, the number of occurrences of each N-gram within a given document and the number of documents that cite a given N-gram
- The most frequently cited N-grams (i.e. the ones cited in the most documents) are probably the topics.
- Identify the existing topics (from a list of known topics)
- [Optionally] Review new topics manually
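The enumeration and tallying steps above can be sketched roughly as follows. This is only a minimal illustration, not a tuned implementation: the tokenizer is a plain regex, the stop-word list is a tiny stand-in, stemming and POS tagging are skipped, and the names `tokenize`, `ngrams`, `tally` and `candidate_topics` (plus the `min_docs` / `min_n` thresholds) are made up for this example.

```python
import re
from collections import Counter

# Tiny illustrative stop-word list (an assumption; a real one would be larger).
STOP_WORDS = {"the", "a", "an", "of", "to", "in", "and", "was", "is", "for", "on"}

def tokenize(text):
    """Lowercase, keep alphabetic tokens only, and drop noise words."""
    return [w for w in re.findall(r"[a-z']+", text.lower()) if w not in STOP_WORDS]

def ngrams(tokens, max_n=4):
    """Yield every n-gram (as a tuple of words) for n = 1..max_n."""
    for n in range(1, max_n + 1):
        for i in range(len(tokens) - n + 1):
            yield tuple(tokens[i:i + n])

def tally(documents, max_n=4):
    """Return (per-document occurrence counts, number of documents citing each n-gram)."""
    per_doc = []
    doc_freq = Counter()
    for text in documents:
        counts = Counter(ngrams(tokenize(text), max_n))
        per_doc.append(counts)
        doc_freq.update(counts.keys())  # each document counts once per n-gram
    return per_doc, doc_freq

def candidate_topics(doc_freq, min_docs=2, min_n=2):
    """N-grams of at least min_n words that are cited in at least min_docs documents."""
    hits = [(g, c) for g, c in doc_freq.items() if c >= min_docs and len(g) >= min_n]
    # Most widely cited first; longer n-grams break ties.
    return sorted(hits, key=lambda gc: (-gc[1], -len(gc[0]), gc[0]))

docs = [
    "Obama budget sent to Congress",
    "Reaction to the Obama budget in Congress",
    "The Red Sox traded a pitcher",
]
per_doc, doc_freq = tally(docs)
topics = candidate_topics(doc_freq)
```

With real data, the known-topics list would then be used to separate existing topics from new candidates, and the remaining n-grams could go to manual review.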
This general recipe can also be altered to take advantage of other attributes of the documents and of the text within them. For example, the document's origin (CNN / Sports vs. CNN / Politics ...) can be used to select domain-specific dictionaries. As another example, the process can put more or less emphasis on words / expressions from the document title (or from other areas of the text with a particular mark-up).
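As a rough sketch of the weighting idea, the snippet below gives extra weight to N-grams that appear in the title and boosts N-grams containing a hot word. The constants `TITLE_WEIGHT` and `HOT_WORD_BOOST`, the `HOT_WORDS` set and the function names are all hypothetical placeholders chosen for illustration; sensible values would have to be tuned on real data.

```python
from collections import Counter

# Hypothetical weights and hot-word list; placeholders to be tuned.
TITLE_WEIGHT = 3.0
HOT_WORD_BOOST = 2.0
HOT_WORDS = {"superbowl", "scandal"}

def word_ngrams(text, max_n=3):
    """All n-grams (as tuples) of the whitespace-split, lowercased text."""
    tokens = text.lower().split()
    return [tuple(tokens[i:i + n])
            for n in range(1, max_n + 1)
            for i in range(len(tokens) - n + 1)]

def score_document(title, body):
    """Score n-grams: 1 per body occurrence, TITLE_WEIGHT per title occurrence."""
    scores = Counter()
    for gram in word_ngrams(body):
        scores[gram] += 1.0
    for gram in word_ngrams(title):
        scores[gram] += TITLE_WEIGHT
    # Boost any n-gram that contains a manually maintained hot word.
    for gram in list(scores):
        if any(word in HOT_WORDS for word in gram):
            scores[gram] *= HOT_WORD_BOOST
    return scores

scores = score_document("Superbowl ticket scandal", "a ticket scandal hits the league")
```

A domain-specific dictionary could be plugged in the same way, by choosing `HOT_WORDS` according to the document's origin.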