WEB DYNAMICS

The wide variety of information available on the Web is highly dynamic. Web pages are created, modified, and eventually deleted by their owners. A major challenge is therefore to locate the most relevant and up-to-date information on a rapidly evolving Web. Search engines and news alerting and publishing services have to cope with this challenge.

The goal of this project is to analyze the dynamic behavior of Web content in order to assess how often, and to what extent, the content of a site changes. We monitored the MSNBC news Web site for 19 weeks, starting in mid-November 2004. We collected a snapshot of the site every 15 minutes by downloading the pages belonging to five major news categories: business, entertainment, health, news, and sports. To model the evolution of the site, and in particular its growth, we applied numerical fitting techniques to its daily changes, that is, the creation of new Web pages and the updates of existing pages. We also investigated the evolution of the site in terms of content changes, applying techniques typical of the Information Retrieval domain, such as the cosine distance.
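
As an illustration of the cosine distance mentioned above, the following minimal sketch compares two snapshots of the same page. It is not the project's actual code: the function names, the tokenization rule, and the use of raw term frequencies as vector weights are all assumptions made for this example, since the description does not state which weighting scheme was used.

```python
import math
import re
from collections import Counter


def term_frequencies(text):
    """Lowercase the text and count its alphanumeric tokens (assumed tokenizer)."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine_distance(text_a, text_b):
    """Return 1 minus the cosine similarity of the two term-frequency vectors."""
    tf_a = term_frequencies(text_a)
    tf_b = term_frequencies(text_b)
    # Dot product over the terms of the first snapshot; a Counter
    # returns 0 for terms that are missing from the second snapshot.
    dot = sum(count * tf_b[term] for term, count in tf_a.items())
    norm_a = math.sqrt(sum(c * c for c in tf_a.values()))
    norm_b = math.sqrt(sum(c * c for c in tf_b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        # Convention assumed here: two empty snapshots are identical,
        # an empty and a non-empty snapshot are maximally distant.
        return 0.0 if norm_a == norm_b else 1.0
    return 1.0 - dot / (norm_a * norm_b)


if __name__ == "__main__":
    # Hypothetical snapshots of one page taken 15 minutes apart.
    earlier = "Stocks rise as markets open higher"
    later = "Stocks fall as markets close lower"
    print(cosine_distance(earlier, later))
```

Under these assumptions, a distance near 0 means that two snapshots share almost the same vocabulary in the same proportions, while a distance of 1 means that they have no terms in common.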

Publications