Personal blogs report on the experiences and interests of individuals, the objects they surround themselves with, and the activities they engage in. Our working hypothesis is that a blogger's writings reveal her commercial taste. In this paper we zoom in on books: given a blog, our aim is to generate a list of suggested categories of books the blogger may want to buy. We begin by mining a blog for indicators of book interests; we then use external resources to match these indicators with actual products, aggregating their category information to arrive at a book profile. We evaluate the profiles thus generated by comparing them against wishlists created by the bloggers themselves.
Overall, our goal is to show that valuable insights can be mined from blogs at the individual blog level, using a combination of text analysis and powerful external resources.
For the product extraction method (our baseline), we identify interest indicators by locating explicit references the blogger makes to books. Identifying names of books (and other products) is known to be difficult, so our approach is deliberately simple: we tag the text with a general named-entity tagger and apply heuristics to the results to identify possible book titles. These heuristics include searching for entities in close proximity to a small set of book-related keywords (``read'', ``book''), discarding ``location'' entities, matching patterns such as ``<ENTITY> by <PERSON>'', etc. Extracted entities are scored on a combination of their recurrence in the blog, their NE-tagger confidence score, and a score derived from the heuristic used to select them; the top-scoring entities populate the indicator list.
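The heuristic scoring described above can be sketched as follows. This is a minimal illustration, not the paper's implementation: the entity-tuple input format, the 40-character proximity window, and the weights are all assumptions, and the keyword set is only seeded with the two keywords mentioned in the text.

```python
import re

# Book-related keywords; "read" and "book" come from the paper,
# the rest are illustrative additions.
BOOK_KEYWORDS = {"read", "reading", "book", "novel", "author"}

def score_candidates(text, entities):
    """Rank possible book titles among NE-tagger output.

    entities: (surface, label, tagger_confidence) tuples, assumed to
    come from an off-the-shelf named-entity tagger.
    """
    scores = {}
    for surface, label, conf in entities:
        if label == "LOCATION":                  # discard location entities
            continue
        if surface not in scores:
            heur = 0.0
            # "<ENTITY> by <PERSON>" pattern hints at title + author
            if re.search(re.escape(surface) + r"\s+by\s+[A-Z][a-z]+", text):
                heur += 1.0
            # proximity to a book-related keyword (window is illustrative)
            for m in re.finditer(re.escape(surface), text):
                window = text[max(0, m.start() - 40):m.end() + 40].lower()
                if BOOK_KEYWORDS & set(re.findall(r"[a-z]+", window)):
                    heur += 0.5
                    break
            scores[surface] = [0, conf, heur]
        scores[surface][0] += 1                  # recurrence in the blog
    # combine recurrence, tagger confidence, and heuristic score
    ranked = sorted(scores.items(),
                    key=lambda kv: kv[1][0] * kv[1][1] + kv[1][2],
                    reverse=True)
    return [surface for surface, _ in ranked]
```

For example, in a post mentioning ``Wolverine: Origin'' twice near the word ``read'', that entity would outrank a single low-evidence mention, while a ``Paris'' location entity would be dropped outright.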
For the keyword extraction method, we use the log-likelihood corpus-comparison method [2] to identify terms that are distinctive to a blogger: word n-grams she uses often compared to other bloggers. To filter out irrelevant terms extracted this way (typically recurring proper names related to the blogger, such as those of family members), we discard terms that do not appear as a noun in WordNet. All distinctive n-grams with a log-likelihood score above a threshold are taken as interest indicators.
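The corpus-comparison statistic itself is standard and can be computed directly from a term's frequency in the blogger's posts versus a background blog corpus. A sketch (the variable names and the example threshold are ours; the paper does not specify its threshold):

```python
import math

def log_likelihood(a, b, c, d):
    """Corpus-comparison log-likelihood for one term.

    a = term frequency in the blogger's posts (corpus of c tokens),
    b = term frequency in the background corpus (d tokens).
    """
    e1 = c * (a + b) / (c + d)   # expected frequency in blogger corpus
    e2 = d * (a + b) / (c + d)   # expected frequency in background
    ll = 0.0
    if a > 0:
        ll += a * math.log(a / e1)
    if b > 0:
        ll += b * math.log(b / e2)
    return 2 * ll
```

A term used in the same proportion in both corpora scores 0; terms scoring above, say, 10.83 (the chi-square critical value for p < 0.001 with one degree of freedom) could be kept as ``distinctive''.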
Table 1: Indicators extracted from a sample blog and the product profiles derived from them.

Product extraction method:
| Possible book titles in text | Supergirl and the Legion of Super Heroes, Golden Age Superman and Earth, ... |
| Derived product profile      | Superheroes (14), Economics (13), Business (13), Juvenile Fiction (11), ... |

Keyword extraction method:
| Keywords                | wolverine, replica, discretion, hulk, pencils, ... |
| Relevant Amazon books   | Wolverine: Origin, The Chaos Engine : Book 1 (X-Men: Doctor Doom), ... |
| Derived product profile | Superheroes (46), Graphic Novels (39), Fantasy (38), Economics (37), ... |
Similarly, the lower part of Table 1 shows the keywords extracted from the blog, the top books returned by Amazon for queries containing these words, and the generated model.
Table 2: A blogger's wishlist and the product profile derived from it.

| Wishlist                | amazon.com/gp/registry/17G9XYDK5GEGG |
| Books in wishlist       | The Big Book of Conspiracies, Buffy the Vampire Slayer: Origin |
| Blogger product profile | Games (61), Role Playing (45), Superheroes (42), Comics (42), ... |
Next, we applied both advice-model construction methods, yielding two models per blog: one based on products and one based on keywords. To compare these with the ``golden'' models built from the bloggers' wishlists, we measured the overlap between the top-3 categories of the two models: if two of the three categories in a model constructed by one of our methods also appear in the golden model, the overlap is 2/3, and so on; in the example in Table 2 the overlap is 1/3 for both constructed models. In the experiments reported here we did not take the hierarchical structure of Amazon's categorization scheme into account; doing so would have yielded higher scores--e.g., in the example, the category ``Graphic Novels'' is a parent of the ``golden'' category ``Superheroes.'' The average overlap over all blogs was 0.14 for the product-based method and 0.31 for the keyword-based method; combinations of the two methods did not yield further improvements.
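The overlap measure reduces to a simple set intersection over the top-k category lists. A sketch, using the top categories from Tables 1 and 2 (the ordering within each list is as printed in the tables):

```python
def topk_overlap(model, golden, k=3):
    """Fraction of the top-k categories of a constructed model that
    also appear among the top-k categories of the golden model."""
    return len(set(model[:k]) & set(golden[:k])) / k

# Golden top-3 from the wishlist profile in Table 2,
# constructed top-3s from the profiles in Table 1.
golden = ["Games", "Role Playing", "Superheroes"]
product_model = ["Superheroes", "Economics", "Business"]
keyword_model = ["Superheroes", "Graphic Novels", "Fantasy"]
```

Both constructed models share only ``Superheroes'' with the golden top 3, giving the 1/3 overlap reported for this example; a hierarchy-aware variant would additionally credit ``Graphic Novels'' as a parent of ``Superheroes''.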
These initial results are encouraging: despite its simplicity, the keyword method performs fairly well, correctly identifying about a third of the categories the blogger is most interested in, out of a hierarchy of hundreds of categories. An examination of failures (blogs for which there is no overlap between the models) shows that most are diary-like, highly personal blogs with little topical substance. Often this is computationally discernible: e.g., for these blogs the keyword extraction phase yields short lists, since only a few nouns exceed the minimal log-likelihood value required to be considered ``distinctive.'' A possible extension of our method could identify such blogs and assign confidence values to the generated models.
In future work, we intend to investigate ``commercial contexts'' in the blog--sections that are likely to relate to the blogger's desired products (such as plans for future purchases)--and to take the sentiment expressed by the blogger into account. Additionally, we plan to make our evaluation more robust by using the hierarchical structure of the categories, allowing for more than the exact matches used here. Finally, we plan to make this novel dataset publicly available to encourage further work in this direction.