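Reading notes: Understanding Pins through keyword extraction
I read this a while ago and am only now writing up the notes. It is traditional machine learning: extract candidate tags from the multiple text sources of a Pin, then use a classification model to decide whether each tag is relevant to the Pin. Image information is not used, apart from text extracted from the image.
Pinterest mainly understands text through annotations. Annotations are keywords or short phrases of 1 to 6 words that describe the subject of a Pin. Besides the text, each annotation carries a confidence score and a language tag (28 languages in total), for example: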
- (EN, sloth sanctuary, 0.99)
- (EN, sloths, 0.95)
- (EN, costa rica, 0.90)
- (EN, caribbean, 0.85)
- (EN, animals, 0.80)
- (EN, travel, 0.80)
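Use cases
Annotations are used as machine learning features in many of their products, with very good results.
Search. Annotations are used for retrieval.
Related Pins (recommendation). Cosine similarity is computed between annotation vectors.
Safe content filtering (classification).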
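To make the Related Pins use case concrete, here is a minimal sketch of cosine similarity between two Pins, assuming each Pin is represented as a sparse vector that maps annotation text to its confidence score. The function name `cosine_similarity` and the sample Pins are hypothetical, and the article does not say how Pinterest actually builds or weights these vectors.

```python
import math


def cosine_similarity(a: dict[str, float], b: dict[str, float]) -> float:
    """Cosine similarity between two sparse annotation vectors.

    Each vector maps annotation text to a confidence score (an assumed
    representation, not one confirmed by the article).
    """
    # Only annotations present in both Pins contribute to the dot product.
    dot = sum(score * b[term] for term, score in a.items() if term in b)
    norm_a = math.sqrt(sum(s * s for s in a.values()))
    norm_b = math.sqrt(sum(s * s for s in b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # An empty vector is treated as similar to nothing.
    return dot / (norm_a * norm_b)


# Hypothetical Pins for illustration only.
pin_a = {"sloths": 0.95, "costa rica": 0.90, "travel": 0.80}
pin_b = {"sloths": 0.90, "animals": 0.85}
pin_c = {"recipes": 0.95, "dinner": 0.80}

sim_ab = cosine_similarity(pin_a, pin_b)  # shares "sloths"
sim_ac = cosine_similarity(pin_a, pin_c)  # shares nothing
```

Pins that share high-confidence annotations score higher, while Pins with no annotation in common score exactly zero.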
How annotations are generated
Annotations dictionary
Annotations are limited to a finite vocabulary known internally as the Dictionary. The advantage of using such a dictionary over allowing annotations to be arbitrary ngrams is that it guarantees the annotations will be valid and useful phrases instead of misspellings (e.g., “recipies”), stopwords (e.g., “the”), fragments (e.g., “of liberty”) and generic phrases (e.g., “ideas”, “things”).
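The effect of restricting annotations to a dictionary can be sketched as follows. This is a toy illustration, not Pinterest's implementation: the helper names `candidate_ngrams` and `filter_by_dictionary`, the three-entry dictionary, and the whitespace tokenization are all assumptions. The 6-word cap comes from the annotation length mentioned earlier in these notes.

```python
def candidate_ngrams(text: str, max_len: int = 6) -> list[str]:
    """Enumerate all ngrams of 1..max_len words (naive whitespace tokens)."""
    tokens = text.lower().split()
    grams = []
    for n in range(1, max_len + 1):
        for i in range(len(tokens) - n + 1):
            grams.append(" ".join(tokens[i:i + n]))
    return grams


def filter_by_dictionary(text: str, dictionary: set[str]) -> list[str]:
    """Keep only ngrams that appear in the dictionary, without duplicates."""
    seen = set()
    kept = []
    for gram in candidate_ngrams(text):
        if gram in dictionary and gram not in seen:
            seen.add(gram)
            kept.append(gram)
    return kept


# Toy dictionary for illustration; the real one is large and curated.
DICTIONARY = {"statue of liberty", "new york", "travel"}

result = filter_by_dictionary("The statue of liberty in New York", DICTIONARY)
```

Because only dictionary entries survive, the stopword "the" and the fragment "of liberty" are dropped even though both occur in the text as ngrams.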
The dictionary initially started with popular topics that were manually entered by users, but it has grown to include additional sources of terms such as search queries, hashtags, etc. A significant amount of human curation has gone into building the dictionary to ensure its quality is maintained, and we periodically use heuristics to trim out bad terms and use a spell checker to remove misspellings. We...