Wide Web contains vast amount of valuable information, but there is not enough guidance for users to find information that is appropriate to their reading ability levels. When a user raises a query, existing search engines mainly return semantically-related materials, but whether the user has sufficient ability to understand the materials is often overlooked.
To enhance user experience, we investigate the use of readability assessment, which is a method to measure the difficulty of a piece of text material [5], in Web information retrieval. We propose a bilingual (English and Chinese) Web page and site readability assessment scheme based on textual features of the pages. As English and Chinese pages cover over 70% of the total Web pages [2], our scheme will have a high impact on the Internet community.
Web Page Readability. For Web page readability, we adopt two existing assessments, Flesch [4] formula and Yang [6] formula, proposed in educational field originally as a baseline evaluation. The resulting page readability formula is as follow:
Web Site Readability. Web site readability is an indicator of overall difficulty level of a site, and it is defined over the proposed page readability. We first describe the following terms before continuing the discussion. Web site is a group of Web pages with the same domain name in the URLs. Root page of a Web site, is a user-specified page where the crawling of the site starts. Page level of a Web page within a site is the minimum number of traversal reaching it starting from the root page of the Web site through hyperlinks.
We propose three site readability assessments in terms of pages at different page levels: (1) Exact-Level, (2) In-Level, and (3) Out-Level (depicted in Figure 1), aiming at describing the site difficulty from different angles of pages composition.
Exact-Level Readability indicates the average readability of pages at a particular level. By using this metric, Web authors can decide how the readability should change with levels. Take Online Teaching Site as an example. Teachers may want to teach some simpler things at the beginning, and then increase the difficulty level gradually. By analyzing the changes of Exact-Level readability along levels, teachers can then prepare and arrange materials in the proper order.
In-Level Readability of a site gives the average readability of Web pages starting from root page up to pages at specified level. This is an overall indicator of a site difficulty. By using this metric, users can get a general idea of whether the site is suitable to them before start browsing it.
Out-Level Readability of a site gives the average readability of Web pages starting one level upper than the specified one, up to pages with maximum available level, which is determined by the depth of crawling. In another words, it is a difficulty indicator of remaining pages after browsing a site for some times. Users may make use of this metric as one of factors in deciding whether he or she should continue browsing the site.
Experiment 1: Web Page Readability. Figure 2(a) shows the Web page readability distribution of the two Web sites being tested. The first observation is that both CSE and HKGOV have large portion of pages with scores ranged from 0 to 5. To investigate this phenomenon, we manually examine those pages and find that they mainly belongs index pages, which are introductory pages having hyperlinks to internal pages. Index pages generally receive low readability score than passage pages. It is because only index terms will remain after extracting raw texts from those pages, and a long sentence will form because there are no separators such as full stop to delimit the index terms. As a result, a long sentence will reduce the readability scores of index pages, although they should not be that difficult apparently.
The second observation is that CSE has larger portion of high-scored (less difficult) pages than HKGOV. It is because CSE contains personal pages of staffs and students, in which the contents are easier to be comprehended than formal articles published in HKGOV.
Conclusion: In addition to measuring difficulty of pages, page readability can also be a good indicator for figuring out low textual content pages such as index pages.
Experiment 2: Web Site Readability. Figure 2(b) and (c) show the site readability against levels of the two Web sites. The two graphs exhibit two types of readability variation: CSE has a dramatic change on readability, while HKGOV has a gradual change on it.
For CSE, we find that Exact-Level score gradually increases from level 2 to 5, and has a drop in level 6. It is because the pages located in level 4 and 5 are personal pages. Following our explanation in last section that they are easier to be comprehended, the site readability increases. For level 6, we find that as this level is right after personal pages, authors would like to put more non-textual information such as images, videos etc., causing the drop in score. In-Level score generally follows the trend of Exact-Level, but with a smoother variation due to accumulation effect of scores starting from level 0. For Out-Level score, although level 6 has relatively low Exact-Level and In-Level scores, it has high Out-Level score, which indicates the contents at later levels will be relatively easier. This may motivate users to continue browsing the site.
For HKGOV, it has a gradual change in Exact-Level score, indicating that pages at different levels have similar difficulty. It is because as a government organization, its Web site is well-organized to give pages with similar difficulties. Due to the gradual change of Exact-Level score, In-Level and Out-Level scores behave nearly the same.
Conclusion: By analyzing the Web site readability variation graph, we can get the content structure and distribution of a site. Web authors can then make use of this information to make the site organization better.
[1] Department of C.S.E., CUHK. http://www.cse.cuhk.edu.hk .
[2] Global Internet statistics: sources and references. http://global-reach.biz/globstats/refs.php3 .
[3] The Hong Kong Government. http://www.gov.hk.
[4] R. F. Flesch. A new readability yardstick. Journal of Applied Psychology, 32:221-233, 1948.
[5] T. S. Hansell. Readability, syntactic transformations, and generative semantics. Journal of Reading, 19(7):557-562, 1976.
[6] S. J. Yang. A readability for Chinese language. Ph.D. Thesis for Mass Communication, University of Wisconsin, 1971.