How do you efficiently implement a document similarity search system?


How do you implement a "similar items" system for objects described by a set of tags?

In my database, I have three tables: articles, article tags and tags. Each article is related to many tags through a many-to-many relationship. For each article, I want to find the five most similar articles in order to implement an "if you like this article, you will also like these" system.

I am familiar with cosine similarity, and that algorithm works very well. But it is far too slow: for each article, I have to iterate over all the articles, calculate the cosine similarity for each article pair, and then select the five articles with the highest similarity ratings.
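For reference, the brute-force approach described above might look roughly like the following. This is only a sketch: it assumes each article's tags are held as a Python set, treats the tags as binary vectors (no weighting), and the names `cosine_similarity`, `most_similar_brute_force` and the toy data are made up for illustration.

```python
import heapq
import math


def cosine_similarity(tags_a, tags_b):
    """Cosine similarity of two tag sets treated as binary vectors."""
    if not tags_a or not tags_b:
        return 0.0
    shared = len(tags_a & tags_b)
    return shared / math.sqrt(len(tags_a) * len(tags_b))


def most_similar_brute_force(article_id, article_tags, k=5):
    """Compare one article against every other article: O(N) work per query."""
    query = article_tags[article_id]
    scored = (
        (cosine_similarity(query, tags), other_id)
        for other_id, tags in article_tags.items()
        if other_id != article_id
    )
    # nlargest keeps only the k best (score, id) pairs instead of sorting everything.
    return [(other_id, score) for score, other_id in heapq.nlargest(k, scored)]


# Hypothetical toy data: article ID -> set of tag names.
article_tags = {
    1: {"python", "search", "indexing"},
    2: {"python", "search"},
    3: {"cooking", "pasta"},
    4: {"search", "indexing", "ranking"},
}
print(most_similar_brute_force(1, article_tags, k=2))
```

Every query touches every article, including the many that share no tag with the query at all, which is where most of the time goes on a large corpus.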

With 200k articles and 30k tags, it takes me half a minute to calculate the similar articles for a single article. So I need another algorithm that produces results roughly as good as cosine similarity, but that can be run in real time and does not require me to iterate over the entire document corpus every time.

Is there an off-the-shelf solution for this? Most of the search engines I looked at do not support document similarity search.

Some questions:

  • How is the article tags table different from tags? Or is it the M2M mapping table?
  • Can you sketch out how you implemented the cosine matching algorithm?
  • Why don't you store your document tags in some kind of in-memory data structure, and use it only to retrieve document IDs? That way, you only hit the database at retrieval time.
  • Depending on how frequently articles are added or re-tagged, this structure can be designed for faster or slower updates.
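One way to read the in-memory suggestion is as an inverted index from each tag to the IDs of the articles carrying it. The sketch below is an assumption about what such a structure could look like, not something taken from the question: `TagIndex` and its methods are hypothetical names, and tags are again treated as unweighted sets. A query only scores articles that share at least one tag with it, so the cosine values match the brute-force ones for every article with a non-zero score.

```python
import heapq
import math
from collections import Counter, defaultdict


class TagIndex:
    """In-memory inverted index: tag -> set of article IDs (hypothetical sketch)."""

    def __init__(self):
        self.articles_by_tag = defaultdict(set)
        self.tags_by_article = {}

    def add(self, article_id, tags):
        """Insert or replace an article; cost is proportional to its tag count."""
        self.remove(article_id)
        tag_set = set(tags)
        self.tags_by_article[article_id] = tag_set
        for tag in tag_set:
            self.articles_by_tag[tag].add(article_id)

    def remove(self, article_id):
        """Drop an article from every posting list it appears in."""
        for tag in self.tags_by_article.pop(article_id, ()):
            self.articles_by_tag[tag].discard(article_id)

    def most_similar(self, article_id, k=5):
        """Score only the articles that share at least one tag with the query."""
        query = self.tags_by_article[article_id]
        shared = Counter()
        for tag in query:
            for other_id in self.articles_by_tag[tag]:
                if other_id != article_id:
                    shared[other_id] += 1
        scored = (
            (count / math.sqrt(len(query) * len(self.tags_by_article[other_id])), other_id)
            for other_id, count in shared.items()
        )
        return [(other_id, score) for score, other_id in heapq.nlargest(k, scored)]


# Hypothetical toy data, loaded once (for example at startup, from the database).
index = TagIndex()
for article_id, tags in {
    1: {"python", "search", "indexing"},
    2: {"python", "search"},
    3: {"cooking", "pasta"},
    4: {"search", "indexing", "ranking"},
}.items():
    index.add(article_id, tags)
print(index.most_similar(1, k=2))
```

Updates touch only the posting lists of the tags involved, so adding or re-tagging an article stays cheap. The query cost depends on how long those posting lists are, so a tag attached to a large share of the articles would still pull in many candidates.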

Initial intuition towards an answer: I would say an online clustering algorithm (perhaps a principal component analysis on the co-occurrence matrix, which would approximate a k-means clustering?).
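To make the clustering idea a little more concrete, here is a rough sketch. Note the substitutions: it uses a plain batch spherical k-means over tag sets rather than PCA, and it is not online. The function names, the `seed_ids` parameter (which articles start out as centroids) and the toy data are all assumptions for illustration.

```python
import math
from collections import defaultdict


def normalize(vector):
    """Scale a sparse tag -> weight vector to unit length."""
    norm = math.sqrt(sum(w * w for w in vector.values()))
    return {tag: w / norm for tag, w in vector.items()} if norm else {}


def cosine_to_centroid(tags, centroid):
    """Cosine between a binary tag set and a unit-length centroid."""
    if not tags:
        return 0.0
    return sum(centroid.get(tag, 0.0) for tag in tags) / math.sqrt(len(tags))


def cluster_articles(article_tags, seed_ids, iterations=10):
    """Batch spherical k-means; returns article ID -> cluster number."""
    centroids = [normalize({t: 1.0 for t in article_tags[s]}) for s in seed_ids]
    assignment = {}
    for _ in range(iterations):
        # Assign every article to the centroid it is most similar to.
        new_assignment = {
            aid: max(range(len(centroids)),
                     key=lambda c: cosine_to_centroid(tags, centroids[c]))
            for aid, tags in article_tags.items()
        }
        if new_assignment == assignment:
            break  # converged
        assignment = new_assignment
        # Recompute each centroid as the normalized sum of its members' unit vectors.
        sums = [defaultdict(float) for _ in centroids]
        for aid, c in assignment.items():
            tags = article_tags[aid]
            for tag in tags:
                sums[c][tag] += 1.0 / math.sqrt(len(tags))
        centroids = [normalize(s) if s else old for s, old in zip(sums, centroids)]
    return assignment


# Hypothetical toy data: article ID -> set of tag names.
article_tags = {
    1: {"python", "search", "indexing"},
    2: {"python", "search"},
    3: {"cooking", "pasta"},
    4: {"search", "indexing", "ranking"},
    5: {"pasta", "recipes"},
}
clusters = cluster_articles(article_tags, seed_ids=[1, 3])
print(clusters)
```

A query would then run the cosine comparison only against articles in the same cluster. That trades some recall for speed, since a similar article that landed in a neighbouring cluster is never considered.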

Cheers.

