Automatically Organizing Bookmarks per Contents

Yoelle S. Maarek
IBM Haifa Research Laboratory
MATAM, Haifa 31905, ISRAEL
yoelle@haifa.vnet.ibm.com

Israel Z. Ben-Shaul
Dept. of Electrical Engineering
Technion, Israel Institute of Technology
Haifa 32000, ISRAEL
issy@ee.technion.ac.il

Outline

Motivation
Approach
Cluster Analysis and Information Retrieval
A Typical Scenario
Conclusion

1. Motivation

What are Bookmarks?
- Personal repository of URLs, saved for re-visit
Large repositories require structuring facilities:
- Early browsers: flat list
- Advanced browsers: directory tree, drag-and-drop GUI
Manual URL organization is difficult
- too many bookmarks to classify at once
  - Legacy, Import, Search results, Procrastination
- lack of informativeness of title, URL

2. Approach

Automatic Organization by URL contents
- Based on cluster analysis and IR techniques
- (Re)generate a (sub)tree hierarchy from a list of URLs
Non-intrusive policy. Allow users to:
- Specify "frozen" sub-repositories
- Selectively apply automatic organization on sub-repositories
- Fine granularity, both in specification and in application

A Hybrid approach for bookmark organization embodied in a tool: the Bookmark Organizer (BO)

Freezing Nodes

A frozen (F) node denotes a boundary between its sub-tree and the rest of the tree:

Automatic organization from outside (ancestor) cannot penetrate inside subtree
Automatic organization from inside (descendant) cannot penetrate outside subtree

Frozen (F) vs. non-frozen (NF) nodes:

An F node can have either F or NF descendants
An NF node can have NF descendants
An NF node cannot have an F descendant
(An F node cannot have an NF ancestor)

Manually-generated nodes are by default F

Automatically-generated nodes are by default NF

The root of the tree is always F

Leaves (URLs) are always NF

BO Algorithms

Traversal algorithm:

While N the current node is not empty Do:

If N's direct descendance has been modified:

Find the nearest frozen ancestor, call it FA
Apply the Frozen Node Algorithm to FA
Let the new current node be FA's successor in preorder traversal

Else let the new current node be N's successor in preorder traversal

Frozen Node Algorithm

For each child, C of FA do:
1. If C is a leaf, Add C to a document list DL
2. If C is non-frozen, Collapse C's subtree and Add all leaves to DL
3. If C is frozen, Put it in a separate list, FDL.
Apply the Bookmark Clustering Algorithm on DL to form a new cluster hierarchy rooted at a new node NEW-FA
Apply the Frozen Node Algorithm on all elements of FDL and add them as children of NEW-FA
Replace FA by NEW-FA

3. Cluster Analysis and Information Retrieval

Purpose of document clustering:

primary use: "visualization of underlying structures of explanation purposes"
IR: efficiency and effectiveness of retrieval [Rasmussen 92]

Requirements:

more informative
more precise

Hierarchical Clustering

Typical cluster hierarchy (or dendogram)

Start with a set of singleton clusters, each contains one object (document)
Repeat the following steps iteratively until there is only one cluster
- Identify the two clusters that are the most similar
- Merge them together into a single cluster

Requires measure of similarity between documents' profiles (Salton's vector space model)

Precise and Informative Clustering

More precise:
Better measure of documents' profile similarity by using a less ambiguous indexing unit: the lexical affinity (LA)
More informative:
Associate to each cluster formed a list of key concepts and a pseudo-title derived from the LA-based profiles

single word	norm. freq	lexical affinity	norm. freq
book	0.399	book softcopy	0.346
softcopy	0.394	cd rom	0.336
ibm	0.343	collection kit	0.263
library	0.218	ibm softcopy	0.217
cd	0.161	ibm library	0.206
rom	0.154	book library	0.201
bookmanager	0.147	ibm program	0.191
pc	0.142	library softcopy	0.181
bookshelf	0.129	bookmanager ibm	0.170
collection	0.129	book bookshelf	0.165
program	0.127	kit softcopy	0.139
kit	0.127	host system	0.139
information	0.115	collection softcopy	0.134
system	0.100	library program	0.108
host	0.093	library reader	0.103
product	0.090	book print	0.103
picture	0.080	cd softcopy	0.098

Single word profile vs. LA profile for "IBM softocopy library package"

Controlling the size of the hierarchy

Slicing method

Enriching the hierarchy

Problem: Cluster hierarchies are difficult to interpret (defined in extension, rather than in intention).

We propose to add:

Qualitative information
key concepts and pseudo-title expressed as LAs

e.g.,
(java,language)
(application,development), (development, kit), (application kit)
Quantitative information
measure of intra-cluster similarity (expressed as %)

Bookmarks Clustering Algorithm

For each URL in the collection, Access the source of the corresponding HTML document
Index each document and generate an LA-based profile
Compute the pairwise similarity of all the documents
Cluster by applying the HAC method in order to produce a binary dendogram
Collapse the dendogram into a non-binary cluster hierarchy by applying the slicing technique
Enrich the cluster hierarchy with conceptual information

4. A Typical Scenario

User w/ small bookmarks collection, gets interested in Java and adds related sites

Typical Scenario: manual stage

User decides to get organized and creates two high-level folders

Typical Scenario: automatic stage

User decides to invoke BO, freezes the two high-level folders to get Java-related URLs organized

Implementation Notes

BO is an external facility
Typically invoked after several changes to the bookmark repository
Use checksum to discover new/moved/deleted URLs/folders
Keep annotations (frozen and checksum status) inside bookmark repository
Implemented in C under AIX using Netscape Navigator 2.0 Bookmarks' format

5. Conclusion

Hierarchical organization is useful, but only if documents are properly classified
Manual classification of URLs is difficult
Automatic classification based on conceptual similarity is a powerful browsing aid, but may be intrusive
BO combines manual and automatic classification:
- Allow to protect prior manual classification
- Apply automatic classification selectively:
  - Use Lexical Affinities as indexing unit
  - Generate pseudo-titles
  - Provide measure of similarity