1. Introduction
Weblogs (or
blogs as they are more commonly known) constitute
a fascinating artifact within the evolving web. Early hand-edited collections
of blogs consisted of any page containing sequences of dated entries.
Nowadays, most people think of blogs as pages with reverse chronological
sequences of dated entries, usually containing a persistent sidebar containing
profile information (and often other blogs read by the author) and usually
maintained and published by one of the common variants of public-domain
blog software. They tend to be quirky, highly personal, often consumed
by regular repeat visitors and highly interwoven into a network of small
but active micro-communities. In short, blogs are perhaps the most significant
recent movement in end-user content creation on the web. We refer to the
collection of blogs with their links as
Blogspace.
There are at least two important reasons for the systematic study
of Blogspace:
- Sociological reasons: Blogspace differs
from traditional web pages structurally because blogs represent concatenations
of messages, as within newsgroups and bulletin boards, but authored by
a single individual. However the more significant differences are more than
structural: the culture of Blogspace focuses heavily on local community
interactions between a small number (say, between 3 and 20) of bloggers.
Members of the informal community might list one another's blogs in a ``blogroll''
and might read, link to, and respond to content on other community members'
blogs. Often, these sequences of responses take place during a brief burst
of heavy activity as an interesting topic arises, jumps to prominence, and
then recedes. Naturally, this leads to the question: can we experimentally
observe and model this highly dynamic, temporal community structure?
- Technical reasons: Traditional studies of
the web and the web graph make use of a static snapshot derived from a crawl.
All such work raises the natural question: what happens over time? A number
of works have begun to address this question through creation and analysis
of a series of snapshots of the data [ 4 , 15 , 8 , 3 ]. The development of tools
and methods to analyze these snapshots is therefore a timely endeavor. However,
Blogspace offers an additional technical advantage over such approaches--if
data is recrawled with a certain frequency, there is no notion of the
precise point in time a page or link was created/updated. In contrast,
Blogspace offers us a ready-made view of evolution in continuous
time: as each blog adds an entry (together with links),
there is a time stamp associated with that event. By automatically extracting
these time stamps we can piece together a view of Blogspace evolving continuously
from the beginning of blog archiving to the present. We should stress that
time is absolute (not merely relative as in a sequence of crawls). Our work
focuses on connectivity evolution and on temporally concentrated bursts
(in the sense of Kleinberg [ 11 ]) in this evolution of Blogspace.
- We introduce a combinatorial object we call a
time graph (Section
3.1 ) for
the study of graphs that evolve in continuous time. We build the
blog graph--the time graph for Blogspace--by automatically
extracting dates from blog page entries ( Section 4 ).
- We define a notion of communities in Blogspace and extend Kleinberg's
notion of temporal bursts in a sequence of documents [ 11 ] to sets of blogs and the
links between them, developing a notion of bursty communities of blogs
that are topically and temporally focused
( Section 3 ).
- We conduct a series of experiments that develop properties and
views of the graph induced by Blogspace as a function of time, showing
the development of macroscopic and microscopic community structure, and
the evolution of burstiness (Section
5 ).
- We show that Blogspace underwent a transition behavior around
the end of 2001, and has been rapidly expanding over the past year, not
just in metrics of scale, but also in metrics of community structure and connectedness.
This expansion shows no sign of abating, although measures of connectedness
must plateau within two years ( Section
5 ).
2. Background
In this section we first provide some background material on graphs,
communities, and burst analysis of events (
Section 2.1 ). Next we review the
world of blogs and argue that blog communities and web communities are
different (
Section 2.2 ).
2.1. Preliminaries
2.1.1. Graphs
A
directed graphG = (
V ,
E) consists of a set
V of
nodes and a
set
E of
edges, where each edge is an ordered pair of
nodes. The
in-degree of a node
u is the number of nodes
v such that
; the
out-degree of
u is the number of nodes
v
such that
. There is a
directed path from
u to
v in
G if there is a sequence of nodes
such that
for
. A
strongly connected component (SCC) of
G is a
subset of nodes such that for any ordered pair of nodes in the subset,
there is a directed path from the former to the latter. An
undirected
graph G = (
V,
E)
consists of a set
V of nodes and a set
E of edges, where
each edge is an unordered pair of nodes.
The notion of communities in the web graph (called
web communities
) was defined in [
13 ] and the problem
of extracting web communities was studied in [
13 ,
12 ,
9 ]. Kumar et al. [
13 ] detected communities by enumerating all
bipartite cliques (up to a certain size) in the web graph. This approach
was motivated by the co-citation phenomenon rampant on the web [
10 ]. Their hypothesis was that any topically
focused community on the web is likely to contain a dense bipartite subgraph
(the
signature ) and almost every occurrence of the signature
corresponds to a web community. Flake et al. [
9 ] adopted a more sophisticated definition of
a web community based on network flow.
Section
3.2 describes our approach to the community extraction problem
for Blogspace.
2.1.3. Bursts
We provide a brief review of Kleinberg's recent work on identifying
bursts in a stream of events [
11 ].
An event might correspond (for instance) to the appearance of an email
containing particular keywords (such as ``NSF grant''--a running example
in [
11 ]). The crucial step is to model
such bursts so that they can be identified efficiently. Care must be
taken to avoid identifying a large number of short spurious bursts or
fragmenting long bursts into many smaller bursts. Kleinberg's approach
is to model the generation of events by an automaton that is in one of
two states, ``low'' and ``high.'' The time gaps between consecutive events
are distributed independently according to an exponential distribution
whose parameter depends on the current state. Thus the high state is hypothesized
as generating bursts of events. There is a cost associated with any state
transition to discourage short bursts. Given an event stream we seek to
find a low cost state sequence that is likely to generate that stream.
Finding an optimal solution to this problem can be accomplished by dynamic
programming. One final extension is required. Consider the case in which
each event in a sequence is either
relevant or
irrelevant
. Kleinberg extends his basic two-state model to this case as well. This
augmented model generates events with a particular mix of relevant and
irrelevant events according to a binomial distribution. A sequence of events
is considered bursty if the fraction of relevant events alternates between
periods in which it is large and long periods in which is small. Kleinberg
defines a measure of weight associated with each such burst and solves the
problem of enumerating all the bursts by order of weight.
2.2. Blogs
According to
slashdot
, blogs are ``... a new, personal, and determinedly non-hostile evolution
of the electric community. They are also the freshest example of how
people use the Net to make their own, radically different new media.''
Historically blogs date back to 1996, but they exploded into popularity
during 1999 with the emergence of
blogger
and other easy-to-use publishing tools. During 2000, they caught the
public eye and articles began to appear in e-zines and forward-looking
publications. Most recently in 2002, a
Newsweek article
appeared estimating the number of weblogs to be half a million, and
discussing the emerging culture of blogspace:
While weblogs had always included a mix of links, commentary,
and personal notes, in the post-Blogger explosion increasing numbers of
weblogs eschewed this focus on the web-at-large in favor of a sort of short-form
journal. These blogs, often updated several times a day, were instead a
record of the blogger's thoughts: something noticed on the way to work, notes
about the weekend, a quick reflection on some subject or another. Links
took the reader to the site of another blogger with whom the first was
having a public conversation or had met the previous evening, or to the
site of a band he had seen the night before. Full-blown conversations were
carried on between three or five blogs, each referencing the other in their
agreement or rebuttal of the other's positions.
It is precisely this type of intense interaction that is at the
core of Blogspace and that we wish to analyze algorithmically.
For example, there is a ``
blogathon '' once a year in which
people blog for 24 hours straight for charity. Sponsors donate money and
then during the blogathon, bloggers update their blogs every 30 minutes
for an entire day.
2.2.1. Bursty communities of blogs
At first blush, blog communities appear similar to web communities
studied in earlier work [
13 ,
9 ]. But there is a distinctly different flavor
to blog communities, both qualitatively and (as we develop in subsequent
sections) quantitatively: these communities exhibit striking temporal characteristics.
Within a community of interacting bloggers, a given topic may become
the subject of intense debate for a period of time, then fade away. These
bursts of activity are typified by heightened hyperlinking amongst the
blogs involved--
within a time interval. Thus it no longer suffices
(as in [
13 ,
9
]) to extract subgraphs that are signatures of communities; rather, we must
extract such signatures while simultaneously identifying a time interval
within which this hyperlinking is concentrated. Note that a subgraph indicative
of a community of interest (in the traditional sense) may exist amongst
a set of blogs, without ever achieving this temporal focus. Conversely,
heavy linkage within a short period may appear less significant when viewed
over a long time span--suggesting that the criterion for inferring that
a pattern of links is a community be less stringent than for a static graph.
Identifying such temporal bursts is inspired by Kleinberg's recent work
that was outlined in
Section 2.1.3 .
In
Section 3 we extend Kleinberg's
work to envelop
sets of blogs inducing a bursty community through
temporal bursts of hyperlinking. While deferring this formal development
and experimentation to
Section 3 , we now
give two examples of the bursty phenomena unearthed by our algorithms.
In a community of blog poets, a burst occurs when one member
Firda posts a series
of daily poems about other bloggers in the community and includes links
to their blogs. This burst occurs from March-April of 2000--for example
http://www.wannabegirl.org/2000_03_01_log-archive.php
from March 2000 contains poems and links to http://trenchant.org/webloglog
, http://www.premiumpolar.com/blog
, and http://www.swallowingtacks.com
, all members of the community.
In another community (this community may not be suitable
for all ages), a blogger Dawn
hosts a poll to determine the funniest and sexiest blogger. She conducts
interviews with other bloggers in the community, of course listing their
sites (see
http://up_yours.blogspot.com/2002_05_19_up_yours_archive.html ). She
then becomes obsessed with one of the other bloggers Jim, which spurs comments
by many others in the community (see
http://jimspot.blogspot.com/2002_07_28_jimspot_archive.html ).
3. Approach
In this section we first define the notion of time graphs which
will be be the basis for studying Blogspace. Time graphs can also be used
to study different evolving graphs such as the web, e-mail graphs, call
graphs, newsgroup graphs and so on. We anticipate that they will prove
to be of considerable independent interest for many other mathematical
and algorithmic studies. Next we focus on tracking bursty communities on
the time graph induced by blogs, henceforth the
blog graph. We
accomplish this by adopting a two-step approach:
- Community extraction ( Section 3.2 ): We extract
dense subgraphs from the blog graph; these correspond to all potential
communities (whether or not bursty).
- Burst analysis ( Section 3.3 ): Building
on the work of Kleinberg [ 11
] on bursts in event streams, we perform a burst analysis
of each subgraph obtained in step 1 to identify and rank bursts in these
communities.
The reason for this two-step approach is that the problem we wish
to solve is somewhat harder than that addressed by Kleinberg. Whereas
the elementary events he considers have simple, local characterizations
(e.g., does an email contain a given keyword?), our setting does not afford
such locality. A bursty community is not characterizable in terms of a
single blog or edge in the time graph. Rather, it entails an analysis
of the entire blog graph. Ideally, we must simultaneously identify subsets
of blogs as communities together with bursts in the events relevant to
this subset. We instead break this down into our two-step sequence; avoiding
this two-step process remains a challenging open problem.
3.1. Time graphs
We now introduce what appears to be a novel combinatorial object:
the time graph. A
time graphG = (
V,
E) consists
of
- A set V of nodes where each
node
has an associated interval D
(v) on the time axis (called the
duration
of v ).
- A set E of edges where each
is a triple (u,
v,t) where
u and v are nodes in
V and t is a point in time
in the interval
.
A node
v is said to be
alive at time
t if
. The interpretation is that each edge is created at a point in time
at which its two end-points are alive. The definition naturally allows
for directed time graphs. Note that in contrast to the well-established
algorithmic study of dynamic graphs (see, for instance, [
5 ]), edge events have real-valued time stamps.
Let
G = (
V,
E) be a time
graph. The
prefix of
G at time
t is also a time graph
G
t = (
Vt,
Et
) where
. Likewise,
.
3.2. Algorithms for community extraction
In the context of time graphs and blogs in particular, we adopt
a more relaxed definition of communities than in [
13 ]. There are at least two motivations
for doing this:
- Compared to the web, blogs are not characterized by the
strong distinction between ``authority-type'' and ``hub-type'' pages [ 10 ]. Every node in the
blog graph corresponds to a `human being'; this is in contrast with the
web where pages can be loosely classified as `people' (the hubs) and `topics'
(the authorities).
- In contrast with the web-scale experiments of [ 13 , 12 ], the scale of our work
here permits us to operate entirely in memory without streaming the data
from disk. As a result, it is feasible for us to seek dense (rather than
complete) subgraphs as community signatures.
We therefore consider the undirected version of the blog graph and
say that a dense subgraph is a signature of a blog community. We will make
the notion of a dense subgraph more precise later (see
Figure 1 for a simple example).
Figure 1: A typical signature of a blog community.
Unfortunately, finding the densest subgraph in an undirected graph
is NP-hard and appears notoriously difficult to even solve approximately [
7 ]. We therefore resort to heuristics that are
simple, efficient, and effective. The blog graphs we deal with are small
enough that we can perform all the operations in memory, in contrast to
the earlier work of Kumar et al. [
13
].
Preprocessing. First, following [
13 ], we adopt the notion that pages linked-to
by an enormous number of other pages are too well-known for the type
of communities we seek to discover; so, we summarily remove all pages
that contain more than a certain number of in-links. Next, we remove
templates from the graph--this has implications for the extraction of
communities and also for the burst analysis described below. The details
of template identification and removal specific to blog graphs are presented
in
Section 5 .
Our algorithm consists of two steps--pruning and expansion. Pruning
corresponds to identifying the seed of a community and expansion corresponds
to growing the seed to a dense subgraph that forms the signature of a
community. We adopt the convention that that a node can participate in
at most one community.
Pruning. We adopt the following algorithm for pruning, based
roughly on the original work of Agrawal et al. [
1 ] and the approach of Kumar et al. [
13 ]. The graph is first scanned for all vertices
of degree at most two. Vertices of degree zero and one are removed, and
vertices of degree two are checked to determine whether they participate
in a
K3 ; that is, whether their two neighbors are connected.
If so, they are passed through as a seed to the expansion step (described
below) and the resulting community is output and removed from the graph,
if it passes a certain threshold. After the entire graph has been processed
in this manner, certain vertices that previously had degree three or more
will now have degree two or less; hence, the pruning step is repeated
several times (specifically, three times in our case). Following the pruning
passes the graph is processed greedily as follows. An arbitrary edge of
the graph is extracted and then grown into a community according to the
expansion algorithm given below. If the resulting community passes a size
threshold, then it is output. In either case, it is removed from the graph
and the process is repeated until there are no remaining edges. Deletion
of vertices is performed by appending the vertex to a ``delete list,'' and
checking this list whenever edges are extracted from the edge data structure.
Once the delete list becomes sufficiently large, it is ``garbage collected''
back into the graph.
Expansion. The aim of the expansion step is to grow the seed
into a set of nodes that constitute a potential community. First, it determines
the vertex that contains most links to the current community. If that
vertex contains at least
tk such links where
t k is a threshold depending only on the
size
k of the community grown so far, then it is added to the
current community and the process repeats.
3.3. Burst analysis
In our context, the goal is to identify communities that are bursty
in the blog graph. There is a natural interpretation of arrivals of edges
in the blog graph as an event stream. Recall from
Section 3.2 that a community
corresponds to an undirected dense subgraph. Given a specific community
C = (
V C
,
EC), the relevant events correspond to the
arrivals of the edges in
EC. Then, applying
Kleinberg's algorithm [
11 ] (see
Section 2.1.3 ), we can obtain the weight
of every burst of
C. We apply this algorithm for each extracted
community in the graph.
4. Methodology
We collected the data for the blog graph from the following seven
popular blog sites:
Blogger ,
Memepool ,
Globe of blogs ,
Metafilter ,
blogs in Salon ,
Blogtree , and
Web_Logs subtree of Yahoo ! Some of the above sites list members
explicitly in a directory while others categorize members by articles or
topics. We crawled these sites to obtain a list consisting of 24,109 urls
corresponding to blog member homepages. For each of the blog members, we
crawled both their homepages and their archives; while the homepages represent
the latest entries in the blog, the archives contain historical entries.
Thus, for each blog, we are able to extract the entire detailed history
of every link ever added to the blog, with the exact time at which it was
added. We used a very simple heuristic to identify if an out-link from the
blog member's homepage was indeed an archive link: the url must contain
the prefix `archiv' or some indication of date and must have an indication
of the blog member (name, id number, etc.). The out-links of a blog member
are identified to be the (multiset) union of the out-links from the homepage
of the blog member and each of the archive page. The
blog graph
is now defined to contain nodes that correspond to blog members and a link
from node
p to node
q if blog
member
p created a link to blog member
q at some point in time. The resulting graph consisted of 22,299
nodes, 70,472 unique edges, and 777,653 edges counting multiplicity. Observe
that the average edge multiplicity of Blogspace is 11, a reflection of the
highly interactive nature of linking. Note that we have not associated time
information with each edge as yet; this is done in the next step.
The subject of extracting specified entities from documents is a
subject of on-going research (see, for instance, [
14 ]). In our case, the entities correspond
to valid dates. Because we were focused on date extraction from blogs in
particular, we adopted a blog-specific scheme. Most blogs are published
using a blog publishing software package (say, blogger that is available
from
http://www.blogger.com ), and
therefore specify dates in a uniform format. However, there will also be
additional dates that occur textually within the blog, and we must be
careful not to be misled into believing that these dates represent a new
journal entry. We created a broad set of date patterns based on various
numeric and alphanumeric date formats, including only partially-specified
dates (i.e., ``September 1'', which does not contain a year). We applied
these patterns to the text of each blog page, and for each spotted occurrence
we noted the family of pattern that matched, and the surrounding context
of the reference. We then processed the entire sequence of extracted dates
to determine whether a particular pattern family and/or context was repeated
frequently enough. If so, we adopted the dominant scheme as the dates inserted
by the blog publication software. We introduced some special logic to match
contexts for well-known blog publishing tools such as blogger. Finally,
we back-filled missing year information into partially-specified dates
to complete them. We did not implement the additional heuristic of checking
that all dates believed to be journal entries form a decreasing or possibly
increasing sequence, but this heuristic might have increased our confidence.
Using the existing algorithm, we were able to associate dates with about 90%
of the links extracted from post-template-removal blog pages. The remaining
10% of edges occur due to various template links and ``blogrolls''--a list
of links to fellow bloggers' sites. We assigned a time of 0 for these links.
We consider a blog to be alive since the earliest non-zero time tag of its
out-links. Bucketing the time in terms of the number of months since January
1999 lets us construct a sequence of 47 prefixes of the blog graph. We
refer to these 47 graphs as
prefix graphs.
We now discuss the specific settings of various parameters while
running our algorithms on the blog graph.
Detecting templates in web pages is in itself an active area of
research (see, for instance, [
2 ]). It
is particularly acute in the case of blogs for the following reason.
Bloggers often modify a profile that appears as a template on all archive
pages corresponding to the blog, and may include a series of links (Recently,
the
Rich Site Summary (RSS) XML format has become a popular means
to announce ``What's new'' in the blogging community.) We wish to avoid
our algorithm being misled into thinking that the archive corresponding
to a particular date includes those links, when in fact the archive may
be several years old and the template may have been created only yesterday.
We adopt the following simple heuristic for removing templates: remove
any sequence of three or more links occurring in a blog for three or more
days. To be conservative, we also removed links with time 0. As in [
13 ], we also removed any node with in-degree
more than 1000 from consideration in the community identification. In
the expansion step of the community extraction, the thresholds
tk
were determined heuristically as follows: edges must grow to triangles;
communities of size up to six will only grow vertices that link to all
but one vertex; communities of size up to nine will only grow vertices that
link to all but two vertices; communities up to size 20 will grow only
vertices that link to 70% of the community; and larger communities will
grow only vertices that link to at least 60% of the community. (It is possible
that many vertices have at least
t k links
to the current community. The algorithm as specified will add only the
best such vertex, but for much larger datasets, it may be necessary to
expand the current community by more than one vertex at a time. In such cases,
we recommend no more than doubling the size of the current community at
each step, to avoid adding large numbers of disjoint pages linked to a
small central core.)
In burst analysis, we identified the links in each community as
relevant events (as in
Section 2.1.3
) and divided them into chronological batches according to the week each
link was added. For each community identified in the previous step, we
calculated the number of links created between blog members in the community
during each week and the total number of links between pages in the community,
to use as input to a two-state automaton. Each state of our automaton
corresponds to a different fraction of relevant link generation: a lower
rate during calm periods and a higher rate during bursty periods. By adjusting
the scaling parameter which determines how much the high rate differs
from the low rate, we were able to control the length of typical bursts.
We experimented with various values for this parameter, and chose a value
which resulted in the majority of bursts ranging from one week to several
months.
5. Results
We begin in
Section 5.1
with an analysis of the evolution of structural properties of the
time graph, as shown by analysis of the sequence of prefix graphs (as
defined in
Section 4 ). This analysis
shows surprising behavior: around the end of 2001, the macroscopic structure
(as measured by the formation of a giant component) and the microscopic
structure (as measured by the formation of a large number of small communities)
of the graph began to change dramatically. In
Section 5.2 we argue that the
change cannot be explained simply through the size, density, and link
arrival pattern of the graph, but in fact arises only because of the particular
linking decisions made by bloggers. In light of this observation Section
Section 5.3 then presents
our analysis of bursty behavior within the blog communities we have extracted,
and shows that this burstiness is a fundamental property of link creation
in blogspace.
5.1. Analysis of prefix graphs
Degree distributions. We first study the degree distributions
of prefixes of the blog graph. The results are shown in
Figure 2 . Each line in the figure represents
a prefix graph of the full time graph corresponding to a snapshot of
the graph at a particular point in time. Higher lines correspond to more
recent snapshots. The upper graph gives the in-degree distribution; that
is, the number of pages in the time graph with a given in-degree. The lower
graph gives the out-degree distribution. As the figure illustrates, the
distributions remains fairly stable over time, increasing uniformly in
y value due to the growth in size of the graph, but retaining the
same overall shape. The later curves also become smoother, and it is possible
to note that the tail of the curves (to the right of the graph, corresponding
to nodes of higher degree) tracks fairly well to the power law curve with
exponent -2.1.
Figure 2: Evolving degree distributions.
Connectivity. We also study the evolution of the strongly
connected component (SCC) in the prefix graphs. The results are shown
in
Figure 3 . For each of the three
largest strongly connected components, the figure shows what
fraction
of the nodes in the prefix graph are part of that SCC at each point in
time. The results here are quite dramatic. In January of 1999, at the beginning
of our study, the number of blog pages is significant but there is no
strongly connected component of more than a few nodes. Around the beginning
of 2000, a few components representing 1-2% of the nodes in the graph
appear, and maintain roughly the same relative size for the next year.
But up to this point, blogspace is not a coherent entity--the overall
size has grown but the inter-connectedness is not significant. At the start
of 2001, the largest component begins to grow in size relative to the
rest of the graph, and by the end of 2001 it contains about 3% of all
nodes. In 2002, however, a threshold behavior arises, and the size of the
component increases dramatically, to over 20% by the present day. This
giant component still appears to be expanding rapidly, doubling in size approximately
every three months. Clearly this growth cannot continue and must plateau
within two years.
Figure 3: SCC evolution.
Communities. We now turn to an analysis of how many pages
take part in a ``community,'' according to the definition implied by the
community extraction algorithm of
Section
3.2 .
Table 1 shows the number
of communities of each size discovered by the algorithm on the underlying
graph.
Table 1: Distribution of community sizes in the blog
graph
Size |
3 |
4 |
5 |
6 |
7 |
8 |
9 |
# |
143 |
165 |
79 |
14 |
2 |
1 |
5 |
Figure 4 shows the results of
applying the same algorithm to the prefix graphs. The upper figure plots
for each time interval the total number of communities in the prefix
graph, and the total number of nodes that participate in one of those
communities. The lower figure plots for each time interval the fraction
of nodes that belong to some community. If this fraction is large, one can
view the space of all blogs at that time as consisting as a set of small
inter-networking communities, rather than a set of standalone pages. These
graph show the same striking pattern as earlier graphs in this section: a
period of minimal structure during 1999 and 2000, slow growth in 2001, and
then rapid growth in 2002.
Figure 4: Community evolution.
To conclude, the degree distributions match earlier observations
from many communities, and do not represent a surprise. The analysis of
the largest SCC (a macroscopic phenomenon) and of communities (a microscopic
phenomenon) does represent a surprise: by both measures, a fundamental
change occurred in blogspace approximately one year ago, and we are still
experiencing the results of that transition. To assess whether this observed
behavior does in fact stem from the sociology of blogspace, we must first
study the alternate possibility: namely, that the emergence of a giant
component and a strong community structure would occur naturally when the
graph reached a certain size and density. We now address this question.
5.2. How random is the blog graph?
We wish to determine whether the prefix graph behavior we have seen
is caused by a phenomenon similar to that of a time-evolving version of
Erdös-Rényi random graphs [
6 ],
but tailored to produce graphs that match Blogspace in the source and
arrival time of every edge. To study this, we create a graph derived from
Blogspace called
randomized Blogspace. This graph is identical
to the blog time graph, except that the destination of every edge is replaced
by a uniformly chosen random node. Thus, the arrival time of each edge,
the number of edges at each time, and the exact profile of when a page chooses
to add a link is left untouched. The only difference is the destination
of each new link. To be precise, we sort the edges of the blog graph according
to time (ties are broken arbitrarily). We scan the edges sequentially
and change each destination to be a node uniformly chosen from among the
nodes that have already been seen. Note that this preserves the times
at which links arrives at all sources, and modifies only the destinations
of those links.
Figure 5: SCC evolution in randomized Blogspace.
Figure 5 and
Figure 6 plot the same quantities
as
Figure 3 and
Figure 4 for randomized Blogspace
instead of the original blog graph. Since we included the time 0 edges,
there is a substantial SCC to begin with. As time progresses, this SCC
seems to have a threshold growth as in blog graph case (this is how a random
graph would behave as well). For completeness, we also evaluated the growth
of the giant SCC without the time 0 edges; initially it was of course much
smaller, but it exhibited a similar threshold behavior and became a larger
fraction of the overall graph during the last timestep than in Blogspace.
However, comparing to
Figure 4 , we see
that the number of nodes in communities for randomized Blogspace is an
order of magnitude smaller than for Blogspace, indicating that the community
formation in Blogspace is not simply an emergent property of the growth
of the graph. On the other hand, comparing
Figure 3 to
Figure 5
shows that the SCC in randomized Blogspace grows much faster than
in the original blog graph.
Figure 6: Community evolution in randomized Blogspace.
So randomized Blogspace actually attains a large strongly connected
component faster than Blogspace does; however, it does not attain significant
community structure. If bloggers added links to other blogs without reference
to topicality, the graph would still become well-connected, but it would
not exhibit the striking community focus that characterizes Blogspace.
5.3. Burstiness in blog communities
Figure 7 shows the number
of communities that are in the high state during each time interval, as
described in
Section 2.1.3 . The
x
axis again shows time, but stops several months before our most recently
crawled pages, because we cannot effectively evaluate the number of bursts
occurring during the present without context about the near future. Again,
consistent with our earlier observations, there is a spurt of bursty activity
toward the end of 2001 that continues through to the present.
Figure 7: Burstiness of communities.
Interestingly, the increase in number of bursts is not explained
by the increase in number of communities alone. Not only have the number
of communities in Blogspace been growing over the last year, the average
burstiness of each individual community has also been growing. This suggests
that the transition behavior we have observed in all our temporal analyses
is in fact correlated with a change in the behavior of the bloggers themselves.
For whatever reason, perhaps because of the richer set of available communities,
or the change in the population of Blogspace, content creators have increased
their participation in bursty community activity over the last year,
and the trend shows no sign of slowing.
5.4. Anecdotal evidence for the effectiveness of community and burst
extraction
We explored a large number of communities and of bursts within communities
using a web-based tool we created for the purpose. The communities found
by the extraction algorithm are almost universally on topic. In all cases
we examined, the communities contained many internal links, and it was
usually clear what bound the members together: be it an interest in Flash,
the law, or library science. Periods of bursty behavior require a deeper
investigation into blog content. In some cases, a burst occurs due to
a spate of activity by one or two bloggers during the time period, as
when during August and September of 2002 blogger
Karen started linking to
her
sister's blog several times
a day. Other bursts are the result of many members of the community contributing
new links to each other. Although we are not always able to determine the
reason for the intense period of linkage, in most cases there is a clear
identifiable event or set of events. As the following closing example shows,
the amount of information in a blog burst, and the window it gives into the
lives of the bloggers, can be startling:
Alicia is part of a group
of artists in Seattle who form a blogger community. She's involved in
fringe theater, and some of the other members are in a band together (see
June 28, 2002 on
http://www.articulatebabble.org/archives/2002_06.php
). Several events surround the bursty link behavior during the four months
from June-Oct in 2002. Alicia decides to connect with old high school
friends (see June 24, 2002 on
http://www.aliciadawn.com/blog/archives/2002_06.html
). She asks two members of the community to set up blogs for them (see June
10, 2002 on
http://www.aliciadawn.com/blog/archives/2002_06.html
), which they do (see July 13, 2002 on
http://melody.asc-soft.com/ enigma/blog/archives/2002_07.html
). The event generates a mini-burst of blogging. She then convinces two
high school friends to visit Seattle on two different weekends. Lots of
blogging covers what to show them when they visit (see August 5, 2002 on
http://www.aliciadawn.com/blog/archives/2002_08.html
), waiting at the airport (see July 14, 2002 on
http://melody.asc-soft.com/ enigma/blog/archives/2002_07.html
), picking them up at the airport, their reaction to Alicia's theater performance,
and so on. A third event during this same period occurs when two members
of the community get engaged. There's discussion about the engagement,
and the beautiful kids they'll have (see June 28, 2002 on
http://www.jetlin.com/blog/archives/2002_06.html
and
http://www.aliciadawn.com/blog/archives/2002_06.html ).
In analyzing the space of weblogs, we have presented a detailed
picture of a web publishing phenomenon in the midst of explosive growth.
Around the end of 2001, Blogspace began a dramatic increase in connectedness,
and in local-scale community structure. Within those local communities,
it also began to exhibit dramatic increases in the occurrence of bursty
link creation behavior. Blogspace represents a clean, detailed, and measurable
instance of a hyperlinked corpus in evolution, and is thus an excellent
testbed for exploring evolutionary analysis, in addition to being of significant
interest in its own right. The tools we have developed are applicable
to other evolving hyperlinked corpora, including sequences of snapshots
of the web.
We thank Jon Kleinberg for providing the burst analysis code. We
thank Amit Kumar for bringing to our attention the popularity of RSS XML
among bloggers.
- 1
- R. Agrawal and R. Srikant.
Fast algorithms for mining association rules in large databases.
In Proc. 20th Intl. Conf. on Very Large Data Bases, pages
487-499, 1994.
- 2
- Z. Bar-Yossef and S. Rajagopalan.
Template detection via data mining and its applications.
In Proc. 11th Intl. World-Wide Web Conference, pages 580-591,
2002.
- 3
- K. Bharat, B. Chang, M. Henzinger, and M. Ruhl.
Who links to whom: Mining linkage between web sites.
In IEEE International Conference on Data Mining, 2001.
- 4
- B. E. Brewington and G. Cybenko.
Keeping up with the changing web.
Computer, 33(5):52-58, 2000.
- 5
- D. Eppstein, Z. Galil, and G. Italiano.
Dynamic graph algorithms.
In CRC Handbook of Algorithms and Theory of Computation, Chapter
22 . CRC Press, 1997.
- 6
- P. Erdös and A. Rényi.
On the evolution of random graphs.
Magy. Tud. Akad. Mat. Kut. Intez. Kozl.
, 5:17-61, 1960.
- 7
- U. Feige, D. Peleg, and G. Kortsarz.
The dense k-subgraph problem.
Algorithmica, 29(3):410-421, 2001.
- 8
- D. Fetterly, M. Manasse, M. Najork, and J. Wiener.
Crawling towards light: A large scale study of the evolution of
web pages.
In Proc. 1st Workshop on Algorithms for the Web, 2002.
- 9
- G. W. Flake, S. Lawrence, and C. L. Giles.
Efficient identification of web communities.
In Proc. 6th ACM SIGKDD Intl. Conf. on Knowledge Discovery and
Data Mining, pages 150-160, 2000.
- 10
- J. Kleinberg.
Authoritative sources in a hyperlinked environment.
J. ACM, 46(5):604-632,
2000.
- 11
- J. Kleinberg.
Bursty and hierarchical structure in streams.
In Proc. 8th ACM SIGKDD Intl. Conf. on Knowledge Discovery and
Data Mining, 2002.
- 12
- R. Kumar, P. Raghavan, S. Rajagopalan, and A. Tomkins.
Extracting large scale knowledge bases from the web.
In Proc. 27th Intl. Conf. on Very Large Data Bases, pages
639-650, 1999.
- 13
- R. Kumar, P. Raghavan, S. Rajagopalan, and A. Tomkins.
Trawling the web for cyber communities.
WWW8/Computer
Networks, 31(11-16):1481-1493, 1999.
- 14
- T. Mitchell.
Machine
Learning .
McGraw Hill, 1997.
- 15
- The Internet Archive .
Copyright is held by the author/owner(s).
WWW2003, May 20-24, 2003,
Budapest, Hungary.
ACM 1-58113-680-3/03/0005