1. Creation of Metadata

Data for the experiments is collected on the client side because several factors that affect personalization are not correctly reflected on the server side. For example, the page-view time recorded in server logs is distorted by network delay, and cache hits are not recorded accurately in server logs [3]. In addition, it is hard to identify a specific user's log entries on the server side, and scrolling speed cannot be traced from the server side at all [3, 5]. After collection from the client side, the search data is pre-processed to extract the factors that affect personalization. The following information is drawn out of the search data: i) user queries, ii) Web pages visited, iii) scrolling speed, iv) click-through and v) page size.
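As a sketch, one pre-processed client-side record covering these five factors could be represented as follows (the class and field names are illustrative assumptions, not part of the original system):

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class PageVisit:
    """One pre-processed client-side record (field names are illustrative)."""
    query: str                # user query taken from the input textbox
    url: str                  # Web page visited
    scroll_speeds: List[float] = field(default_factory=list)  # recorded samples
    clicks: int = 0           # click-throughs (address-box changes)
    page_size: int = 0        # page size, e.g. in bytes
    time_spent: float = 0.0   # seconds spent on the page
```

Grouping the factors per visit like this makes the later per-page statistics (average scrolling speed, Time/PageSize ratio) straightforward to compute.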
User queries, which are direct indications of the user's interest, are extracted from the input textbox [refer Figure 1]. The Web pages visited by the user are parsed, and the extracted text content is linguistically pre-processed for stop-word removal and stemming. The resulting root words are used to index that Web page in the database.
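The stop-word removal and stemming step might look like the following minimal sketch. The stop-word list is a tiny illustrative sample, and the suffix stripper is a crude stand-in for a real stemmer such as Porter's; neither is taken from the original system:

```python
import re

# Tiny illustrative stop-word list; a real system would use a fuller one.
STOP_WORDS = {"the", "a", "an", "is", "are", "of", "to", "in", "and", "for"}

def simple_stem(word):
    # Crude suffix stripping standing in for a real stemmer (e.g. Porter).
    for suffix in ("ingly", "edly", "ing", "ed", "es", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

def preprocess(text):
    # Lowercase, tokenize, drop stop words, then reduce to root forms.
    tokens = re.findall(r"[a-z]+", text.lower())
    return [simple_stem(t) for t in tokens if t not in STOP_WORDS]
```

The list of root words returned by `preprocess` is what would be stored to index the page in the database.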
Each page visited by the user has a set of scrolling speeds recorded by the browser [5]. The average, maximum and minimum scrolling speed on the page are computed from this set. A click-through on a page is counted whenever the content of the browser's address box changes, which happens whenever the user clicks a link/URL on the page currently being visited. The Time/PageSize ratio is derived from the list of pages visited and the time the user spent on each page. Sample raw data and pre-processed data are shown in Figure 2 and Figure 3 respectively.
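The per-page summaries described above reduce to a few one-liners; this sketch assumes scrolling speeds arrive as a list of numeric samples and that the ratio divides time spent by page size (the exact units are not specified in the text):

```python
from statistics import mean

def scroll_summary(speeds):
    """Average, maximum and minimum scrolling speed for one page visit."""
    return {"avg": mean(speeds), "max": max(speeds), "min": min(speeds)}

def time_per_size(time_spent, page_size):
    """Time/PageSize ratio, e.g. seconds spent per unit of page size."""
    return time_spent / page_size
```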
Data collection involved 10 distinct users and the search data (524 pages) they visited during their search sessions. A summary of the collected data is shown in Table 1; the table also lists the average number of search sessions, search queries issued and Web pages visited per user.
2. Statistical Analysis of Pre-processed Search Data

The pre-processed user search data are analyzed with the SPSS (Statistical Package for the Social Sciences) and StatPlus tools. The collected data contains randomness as well as uncertainty due to drift in the users' search process; this is tested and measured through a frequency test and various statistical measures. A hypothesis test uses sample data to test a hypothesis about the population from which the sample was taken, making inferences about one or more populations when sample data are available. A hypothesis test on the time spent on a Web page is formulated in terms of statistical measures (mean, median, mode, etc.) using H0 and H1, the null and alternative hypotheses respectively, stated as follows:

H0: The collected data exhibits randomness as well as uncertainty.
H1: H0 is not true.

The results show that uncertainty exists in the users' browsing data. To resolve this, a fuzzy approach is incorporated into the model to effectively perform interest-label-based classification of the users. The randomness and uncertainty in the time spent on a single page visited by a user across various sessions are shown in Table 2.
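The statistical measures underpinning the hypothesis test can be sketched as below. This only illustrates the descriptive side (mean, median, mode, spread) of the analysis done in SPSS/StatPlus; the sample values are invented and the function name is an assumption:

```python
from statistics import mean, median, mode, stdev

def time_spent_profile(times):
    """Descriptive measures of time spent per page across sessions.

    A large spread (stdev) relative to the mean, and a mean that differs
    markedly from the median/mode, are the kind of evidence used to
    support H0 (randomness and uncertainty in the data).
    """
    return {
        "mean": mean(times),
        "median": median(times),
        "mode": mode(times),
        "stdev": stdev(times),
    }
```

On a skewed sample such as `[30, 45, 30, 120, 15]` (seconds), the mean (48) sits well above the median and mode (both 30), hinting at the kind of irregularity the hypothesis test formalizes.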