Automatically Organizing Bookmarks per Contents

Yoelle S. Maarek
IBM Haifa Research Laboratory
MATAM, Haifa 31905, ISRAEL
yoelle@haifa.vnet.ibm.com

Israel Z. Ben-Shaul
Dept. of Electrical Engineering
Technion, Israel Institute of Technology
Haifa 32000, ISRAEL
issy@ee.technion.ac.il

Outline

  1. Motivation

  2. Approach

  3. Cluster Analysis and Information Retrieval

  4. A Typical Scenario

  5. Conclusion

1. Motivation

2. Approach

  • A hybrid approach for bookmark organization embodied in a tool: the Bookmark Organizer (BO)

    Freezing Nodes

    A frozen (F) node denotes a boundary between its sub-tree and the rest of the tree.
    Frozen (F) vs. non-frozen (NF) nodes:
    • An F node can have either F or NF descendants
    • An NF node can have NF descendants
    • An NF node cannot have an F descendant
    • (An F node cannot have an NF ancestor)

  • Manually-generated nodes are by default F
  • Automatically-generated nodes are by default NF
  • The root of the tree is always F
  • Leaves (URLs) are always NF
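
    A minimal Python sketch of this data model (names such as BookmarkNode are illustrative, not BO's actual implementation):

    from dataclasses import dataclass, field
    from typing import List, Optional

    @dataclass
    class BookmarkNode:
        title: str
        url: Optional[str] = None            # leaves carry a URL, folders do not
        frozen: bool = False                 # F = user-controlled boundary, NF = reorganizable
        children: List["BookmarkNode"] = field(default_factory=list)

        @property
        def is_leaf(self) -> bool:
            return self.url is not None

    def check_freezing_invariant(node: BookmarkNode) -> bool:
        """An NF node may not have an F child anywhere below it."""
        for child in node.children:
            if not node.frozen and child.frozen:
                return False
            if not check_freezing_invariant(child):
                return False
        return True

    # Defaults from the rules above: the root and manually created folders are
    # frozen; automatically generated clusters and leaves (URLs) are not.
    root = BookmarkNode("Bookmarks", frozen=True)
    java = BookmarkNode("Java", frozen=True)                                     # manual folder
    java.children.append(BookmarkNode("JavaSoft", url="http://java.sun.com/"))   # illustrative URL
    root.children.append(java)
    assert check_freezing_invariant(root)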

    BO Algorithms

    Traversal algorithm:

    While the current node N is not empty do:
    • If N's direct descendants have been modified:
      1. Find the nearest frozen ancestor of N, call it FA
      2. Apply the Frozen Node Algorithm to FA
      3. Let the new current node be FA's successor in preorder traversal
    • Else, let the new current node be N's successor in preorder traversal

    Frozen Node Algorithm

    1. For each child C of FA do:
      1. If C is a leaf, add C to a document list DL
      2. If C is non-frozen, collapse C's subtree and add all its leaves to DL
      3. If C is frozen, put it in a separate list FDL
    2. Apply the Bookmark Clustering Algorithm to DL to form a new cluster hierarchy rooted at a new node NEW-FA
    3. Apply the Frozen Node Algorithm to all elements of FDL and add them as children of NEW-FA
    4. Replace FA with NEW-FA
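
    A rough Python sketch of both algorithms, reusing BookmarkNode from the sketch above. The preorder walk over modified nodes is simplified here to reorganizing from the root, and bookmark_clustering is a stand-in for the Bookmark Clustering Algorithm of Section 3; all names are illustrative.

    from typing import List

    def collect_leaves(node: "BookmarkNode") -> List["BookmarkNode"]:
        """Collapse a subtree into the flat list of its leaves (URLs)."""
        if node.is_leaf:
            return [node]
        leaves = []
        for child in node.children:
            leaves.extend(collect_leaves(child))
        return leaves

    def bookmark_clustering(docs: List["BookmarkNode"]) -> "BookmarkNode":
        # Stand-in: the real algorithm builds a cluster hierarchy (see Section 3).
        return BookmarkNode("cluster", children=list(docs))

    def frozen_node_algorithm(fa: "BookmarkNode") -> "BookmarkNode":
        dl, fdl = [], []                          # document list / frozen-child list
        for child in fa.children:
            if child.is_leaf:
                dl.append(child)                  # 1.1: leaves go straight to DL
            elif not child.frozen:
                dl.extend(collect_leaves(child))  # 1.2: collapse NF subtrees into DL
            else:
                fdl.append(child)                 # 1.3: keep frozen children apart in FDL
        new_fa = bookmark_clustering(dl)          # 2: recluster the documents
        new_fa.title, new_fa.frozen = fa.title, fa.frozen
        for frozen_child in fdl:                  # 3: recurse on frozen children
            new_fa.children.append(frozen_node_algorithm(frozen_child))
        return new_fa                             # 4: caller replaces FA with NEW-FA

    reorganized = frozen_node_algorithm(root)     # simplified stand-in for the traversal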

3. Cluster Analysis and Information Retrieval

    Purpose of document clustering:

    Requirements:

    Hierarchical Clustering

    Typical cluster hierarchy (or dendrogram)

    Requires measure of similarity between documents' profiles (Salton's vector space model)
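
    A minimal sketch of such a similarity measure, i.e. the cosine between weighted profile vectors; the profiles below are toy dictionaries, not actual BO output:

    import math

    def cosine_similarity(p1: dict, p2: dict) -> float:
        """Cosine of the angle between two {indexing unit: weight} profiles."""
        dot = sum(p1[t] * p2[t] for t in p1.keys() & p2.keys())
        n1 = math.sqrt(sum(w * w for w in p1.values()))
        n2 = math.sqrt(sum(w * w for w in p2.values()))
        return dot / (n1 * n2) if n1 and n2 else 0.0

    doc_a = {"book softcopy": 0.346, "cd rom": 0.336, "ibm library": 0.206}
    doc_b = {"cd rom": 0.300, "ibm library": 0.250, "host system": 0.139}
    print(cosine_similarity(doc_a, doc_b))        # closer to 1.0 = more similar profiles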

    Precise and Informative Clustering

    single word    norm. freq  |  lexical affinity      norm. freq
    book           0.399       |  book softcopy         0.346
    softcopy       0.394       |  cd rom                0.336
    ibm            0.343       |  collection kit        0.263
    library        0.218       |  ibm softcopy          0.217
    cd             0.161       |  ibm library           0.206
    rom            0.154       |  book library          0.201
    bookmanager    0.147       |  ibm program           0.191
    pc             0.142       |  library softcopy      0.181
    bookshelf      0.129       |  bookmanager ibm       0.170
    collection     0.129       |  book bookshelf        0.165
    program        0.127       |  kit softcopy          0.139
    kit            0.127       |  host system           0.139
    information    0.115       |  collection softcopy   0.134
    system         0.100       |  library program       0.108
    host           0.093       |  library reader        0.103
    product        0.090       |  book print            0.103
    picture        0.080       |  cd softcopy           0.098
    Single word profile vs. LA profile for "IBM softcopy library package"
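
    For illustration only, one way such pair-based indexing units could be extracted: collect word pairs that co-occur within a small sliding window and normalize their counts. The window size and normalization below are assumptions, not the indexing scheme BO actually relies on.

    import re
    from collections import Counter

    def lexical_affinities(text: str, window: int = 5) -> dict:
        """Unordered word pairs co-occurring within +/- `window` words,
        with counts normalized by the most frequent pair (toy scheme)."""
        words = re.findall(r"[a-z]+", text.lower())
        pairs = Counter()
        for i, w in enumerate(words):
            for v in words[i + 1 : i + 1 + window]:
                if v != w:
                    pairs[tuple(sorted((w, v)))] += 1
        top = max(pairs.values(), default=1)
        return {" ".join(p): count / top for p, count in pairs.items()}

    print(lexical_affinities("the ibm softcopy library is a softcopy book collection kit"))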

    Controlling the size of the hierarchy

    Slicing method
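
    The slides do not spell out the slicing method at this point; as one possible reading, the sketch below collapses runs of binary merges whose similarities fall into the same band into a single n-ary node. The band boundaries and tree representation are illustrative assumptions.

    from dataclasses import dataclass, field
    from typing import List, Optional

    @dataclass
    class Dendro:
        sim: float = 1.0                      # similarity at which this node was formed
        leaf: Optional[str] = None            # leaf label; None for internal nodes
        children: List["Dendro"] = field(default_factory=list)

    def band(sim: float, cuts=(0.6, 0.3)) -> int:
        """Index of the similarity band ("slice") a merge falls into."""
        for i, cut in enumerate(cuts):
            if sim >= cut:
                return i
        return len(cuts)

    def slice_dendrogram(node: Dendro) -> Dendro:
        """Absorb children merged within the same band as their parent."""
        if node.leaf is not None:
            return node
        flat = Dendro(sim=node.sim)
        for child in map(slice_dendrogram, node.children):
            if child.leaf is None and band(child.sim) == band(node.sim):
                flat.children.extend(child.children)      # same slice: flatten
            else:
                flat.children.append(child)
        return flat

    d = Dendro(sim=0.7, children=[
        Dendro(sim=0.65, children=[Dendro(leaf="a"), Dendro(leaf="b")]),
        Dendro(leaf="c")])
    print(len(slice_dendrogram(d).children))              # 3: the nested same-band merge collapses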

    Enriching the hierarchy

    Problem: Cluster hierarchies are difficult to interpret (defined in extension, rather than in intension).

    We propose to add:

    Bookmarks Clustering Algorithm

    1. For each URL in the collection, Access the source of the corresponding HTML document
    2. Index each document and generate an LA-based profile
    3. Compute the pairwise similarity of all the documents
    4. Cluster by applying the HAC method to produce a binary dendrogram
    5. Collapse the dendrogram into a non-binary cluster hierarchy by applying the slicing technique
    6. Enrich the cluster hierarchy with conceptual information
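
    A rough end-to-end sketch of the pipeline on toy in-memory documents: no real URL fetching, single-word profiles instead of lexical affinities, complete-link HAC via SciPy as an assumed linkage choice, and slicing/enrichment left as later steps.

    import re
    from collections import Counter
    from itertools import combinations

    import numpy as np
    from scipy.cluster.hierarchy import linkage
    from scipy.spatial.distance import squareform

    docs = {                                      # step 1 stand-in: already-fetched page text
        "http://example.com/a": "ibm softcopy library collection kit",
        "http://example.com/b": "softcopy book library cd rom",
        "http://example.com/c": "java applet programming tutorial",
    }

    def profile(text: str) -> Counter:            # step 2: index into a term profile
        return Counter(re.findall(r"[a-z]+", text.lower()))

    def cosine(p: Counter, q: Counter) -> float:  # step 3: pairwise similarity
        dot = sum(p[t] * q[t] for t in p.keys() & q.keys())
        return dot / (np.sqrt(sum(v * v for v in p.values())) *
                      np.sqrt(sum(v * v for v in q.values())))

    urls = list(docs)
    profiles = [profile(docs[u]) for u in urls]
    dist = np.zeros((len(urls), len(urls)))
    for i, j in combinations(range(len(urls)), 2):
        dist[i, j] = dist[j, i] = 1.0 - cosine(profiles[i], profiles[j])

    dendrogram = linkage(squareform(dist), method="complete")   # step 4: HAC, binary dendrogram
    print(dendrogram)                             # steps 5-6 (slicing, enrichment) would follow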

4. A Typical Scenario

    User with a small bookmark collection gets interested in Java and adds related sites

    Typical Scenario: manual stage

    User decides to get organized and creates two high-level folders

    Typical Scenario: automatic stage

    User decides to invoke BO and freezes the two high-level folders to get the Java-related URLs organized

    Implementation Notes

5. Conclusion