In the following sections, we will demonstrate the application
of the proposed BI process to a major ISP company
in Taiwan. This company was originally the nation’s sole
enterprise in telecommunication and was only recently
privatized. Its management is struggling to compete with
newly established companies that are eroding its market
share, and it is very keen to develop a service strategy based on
an understanding of the “needs” of users.
4.1. Knowledge identification
The management of this company is very much interested
in developing a service management strategy, which
can boost business revenue through providing value-added
products to its customers. They believe very strongly that
personalized service is the way to grow business revenue
further, because it will foster long-term customer loyalty,
which will then lead to increased sales of value-added
products. They further identified the knowledge they will
need, which includes: network usage patterns of individual
customers, network usage patterns of the region, revenue
contribution of customers, and network facilities utilization.
The network usage patterns of individual customers
should reveal network usage across the 7 days of the week and
the 24 h of the day, along with the usage intensity. This usage
pattern will allow management to develop the knowledge
of VIP status of users and initiate meaningful business dialogue
with individuals. The usage patterns of customers of
a region should reveal the grouping of users and behaviors
of each group, which can help management formulate marketing
strategies by targeting selected groups. In addition,
it will form the basis for understanding the potential revenue
contribution of each group. Lastly, the facility utilization
among geographical regions will give management an
important piece of knowledge for achieving cost
effectiveness.
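To make the 7-day by 24-h usage pattern concrete, the following minimal Python sketch builds such a profile for one customer. It is only an illustration: the function name `weekly_profile`, the input format (a timestamp paired with a 0/1 activity flag, such as a thresholded usage record), and the sample values are assumptions, not part of the company's system.

```python
from datetime import datetime

def weekly_profile(records):
    """Build a 7 x 24 matrix of usage intensity for one customer.

    `records` is an iterable of (timestamp, flag) pairs, where flag is
    1 if the customer was active in that interval and 0 otherwise.
    Each cell holds the fraction of observed intervals in that
    weekday/hour slot that were active (0.0 if nothing was observed).
    """
    active = [[0] * 24 for _ in range(7)]
    total = [[0] * 24 for _ in range(7)]
    for ts, flag in records:
        day, hour = ts.weekday(), ts.hour  # Monday is day 0
        total[day][hour] += 1
        active[day][hour] += flag
    return [
        [active[d][h] / total[d][h] if total[d][h] else 0.0 for h in range(24)]
        for d in range(7)
    ]

# Hypothetical records for a single Monday
records = [
    (datetime(2005, 1, 3, 9, 0), 1),
    (datetime(2005, 1, 3, 9, 5), 0),
    (datetime(2005, 1, 3, 21, 0), 1),
]
profile = weekly_profile(records)
```

Rows of `profile` index the day of the week and columns index the hour, so a row with high values in the evening columns would suggest a customer who is mainly active after working hours.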
4.2. Data collection
ISP customers’ raw data consists of socio-demographic
data, records of call data, IP traffic log, logging authorization
data, application system data, and system record. Socio-demographic
data is recorded at the time customers fill in
the application form. Records of call data contain the source
and destination IP addresses, TCP port number, URL
address, etc. The IP traffic log contains the switch-router IP
address, customer account number, and input and output
traffic logged every five minutes. The logging authorization data
includes customer account number, log-in and log-out
time, name of the log-in facility, and IP address. Application
system data is generated when customers make use of
WWW, e-mail, FTP service, etc. Finally, the system record
is generated by routers. With the cooperation of the company,
we selected a region of southern Taiwan that consists
of several districts for this study. The IP flows in K-bytes
were collected every 5 s over 7 weeks using MRTG (Kemper,
1997). We used a timer program to load the
network flow records into an SQL database. Table 2 shows
the contents of the database, which contains the fields
ADSL_phone, Log time, the average input K-bytes per five
minutes (Avg_In), the average output K-bytes per five minutes
(Avg_Out), the largest input K-bytes in the interval
(Max_In), and the largest output K-bytes in the interval
(Max_Out). The final data set contains 41.7
million records.
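The loading step can be sketched as follows. This is a minimal illustration under stated assumptions, not the timer program used in the study: the raw sample format, the table name `traffic`, the helper names, the use of SQLite, and the sample values are all hypothetical. It groups raw samples into five-minute intervals and stores the average and maximum input and output per interval.

```python
import sqlite3
from collections import defaultdict
from datetime import datetime

def bucket_start(ts, minutes=5):
    """Floor a timestamp to the start of its five-minute interval."""
    return ts.replace(minute=ts.minute - ts.minute % minutes,
                      second=0, microsecond=0)

def aggregate(samples):
    """Turn (adsl_phone, timestamp, in_kb, out_kb) samples into interval rows."""
    groups = defaultdict(list)
    for phone, ts, in_kb, out_kb in samples:
        groups[(phone, bucket_start(ts))].append((in_kb, out_kb))
    rows = []
    for phone, start in sorted(groups):
        values = groups[(phone, start)]
        ins = [v[0] for v in values]
        outs = [v[1] for v in values]
        rows.append((phone, start.isoformat(sep=" "),
                     sum(ins) / len(ins), sum(outs) / len(outs),
                     max(ins), max(outs)))
    return rows

def load(conn, rows):
    """Create the (assumed) traffic table and insert the interval rows."""
    conn.execute("""CREATE TABLE IF NOT EXISTS traffic (
        ADSL_phone TEXT, LogTime TEXT,
        Avg_In REAL, Avg_Out REAL, Max_In REAL, Max_Out REAL,
        PRIMARY KEY (ADSL_phone, LogTime))""")
    conn.executemany("INSERT INTO traffic VALUES (?, ?, ?, ?, ?, ?)", rows)
    conn.commit()

# Hypothetical raw samples for one subscriber
samples = [
    ("07-0000001", datetime(2005, 1, 3, 9, 0, 5), 10.0, 2.0),
    ("07-0000001", datetime(2005, 1, 3, 9, 4, 55), 30.0, 4.0),
    ("07-0000001", datetime(2005, 1, 3, 9, 5, 0), 8.0, 1.0),
]
conn = sqlite3.connect(":memory:")
rows = aggregate(samples)
load(conn, rows)
```

The first two samples fall into the 09:00 interval and the third into the 09:05 interval, so two rows are stored.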
4.3. Data preprocessing
Given the large volume of raw data, we need to preprocess
it to ensure its validity for later use. Using the
socio-demographic data provided by the administration,
we found 10.3 million valid records. These data
must be normalized to avoid inconsistency during the mining
process, because different users may subscribe to different
schemes and hence have different bandwidths. We apply the formula
defined in Eq. (4) to normalize the data.
In the formula, we take the ratio of the
customer’s IP flow to the bandwidth stipulated by his/her scheme,
Customer_NetUsage/Customer_Bandwidth, and compare
it with a selected Threshold_rate, which can be set at 1%,
5%, 10%, or other rates, as shown in Table 3. The setting
of the Threshold_rate depends on the conceptual purpose
in the modeling phase. The technical personnel of the company
indicate that a threshold rate of 1% is sufficient to
indicate a customer’s intention to use network facilities.
IF (Customer_NetUsage / Customer_Bandwidth) >= Threshold_rate
THEN Threshold_rate_record = 1
ELSE Threshold_rate_record = 0                                  (4)
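As an illustration of Eq. (4), the rule can be sketched in Python as below. The function names `threshold_record` and `threshold_flags` and the sample values are hypothetical, and the sketch assumes that usage and bandwidth are expressed in the same units so that their ratio is dimensionless.

```python
def threshold_record(net_usage, bandwidth, threshold_rate=0.01):
    """Return 1 if usage/bandwidth reaches the threshold rate, else 0 (Eq. 4)."""
    if bandwidth <= 0:
        raise ValueError("bandwidth must be positive")
    return 1 if (net_usage / bandwidth) >= threshold_rate else 0

def threshold_flags(net_usage, bandwidth, rates=(0.01, 0.05, 0.10)):
    """Evaluate Eq. (4) for several candidate threshold rates at once."""
    return {rate: threshold_record(net_usage, bandwidth, rate) for rate in rates}

# Hypothetical (usage, subscribed bandwidth) pairs, in the same units
samples = [(3.0, 512.0), (8.0, 512.0), (30.0, 2048.0)]
flags = [threshold_record(usage, bw, 0.01) for usage, bw in samples]
multi = threshold_flags(30.0, 512.0)
```

Because the ratio is taken against each customer's own bandwidth, a customer on a faster scheme needs proportionally more traffic to be flagged as active, which is the purpose of the normalization.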
4.4. Modeling
With the normalized records, we construct a multi-dimensional
data warehouse to facilitate the analysis of
customers’ behavior. We then apply a SOM network to
segment customers into different homogeneous clusters
and select the one that best exhibits customers’ behavior
patterns. We further modify the RFM model to evaluate the
potential value of the customers in each cluster. The detailed
modeling processes are as follows.
4.4.1. Multi-dimensional modeling
In developing a multi-dimensional model, we adopt the
method proposed by Kimball (1996): first, define the business
process and grain, and then define the dimension
tables (time, user, and location) and the facts. The relational
model and the dimension tables are shown in Fig. 2.
Each of the tables is described in the following:
(a) Network Traffic Fact Table contains customer network
usage attributes: ADSL_Phone, LogTime,
Max_In, Max_Out, In_P (01/05/10/30/50), and
Out_P (01/05/10/30/50). The primary key is composed
of ADSL_Phone and LogTime, and the observed
values include Max_In, Max_Out, and the different
threshold values.
(b) District Dimension Table contains ZIP_Code, City_Name,
and District_Name, where ZIP_Code is the
primary key. The hierarchical relationship among
attributes is defined in order as City_Name →
District_Name.
(c) Bandwidth Dimension Table contains attributes of
Bandwidth_ID and Bandwidth.
(d) Router Dimension Table contains both Router_ID
and Router_Name.
(e) User Dimension Table contains HN, ADSL_Phone,
Profile, ZIP_Code, Address, Bandwidth_ID, and
Router_ID. The primary key is HN and the foreign
keys are ZIP_Code, Bandwidth_ID, and
Router_ID.
(f) Time Dimension Table has attributes consisting of
LogTime, Year, Quarter, Month, Week, Day, and
Hour. The primary key is LogTime. The hierarchical
relationship among attributes is Year → Quarter →
Month → Week → Day → Hour.
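The tables (a) to (f) can be sketched as a star schema as follows. This is a minimal illustration using SQLite from Python, not the warehouse built in the study. The table names ending in `_Dim` and `_Fact`, the column types, the expanded `In_Pxx`/`Out_Pxx` column names, the primary keys of the bandwidth and router tables, the join on ADSL_Phone between the fact and user tables, and all sample rows are assumptions.

```python
import sqlite3

# Table names, column types, and In_Pxx / Out_Pxx names are assumptions.
SCHEMA = """
CREATE TABLE District_Dim (
    ZIP_Code TEXT PRIMARY KEY, City_Name TEXT, District_Name TEXT);
CREATE TABLE Bandwidth_Dim (
    Bandwidth_ID INTEGER PRIMARY KEY, Bandwidth INTEGER);
CREATE TABLE Router_Dim (
    Router_ID INTEGER PRIMARY KEY, Router_Name TEXT);
CREATE TABLE User_Dim (
    HN TEXT PRIMARY KEY, ADSL_Phone TEXT, Profile TEXT,
    ZIP_Code TEXT REFERENCES District_Dim(ZIP_Code),
    Address TEXT,
    Bandwidth_ID INTEGER REFERENCES Bandwidth_Dim(Bandwidth_ID),
    Router_ID INTEGER REFERENCES Router_Dim(Router_ID));
CREATE TABLE Time_Dim (
    LogTime TEXT PRIMARY KEY, Year INTEGER, Quarter INTEGER,
    Month INTEGER, Week INTEGER, Day INTEGER, Hour INTEGER);
CREATE TABLE Network_Traffic_Fact (
    ADSL_Phone TEXT, LogTime TEXT REFERENCES Time_Dim(LogTime),
    Max_In REAL, Max_Out REAL,
    In_P01 INTEGER, In_P05 INTEGER, In_P10 INTEGER,
    In_P30 INTEGER, In_P50 INTEGER,
    Out_P01 INTEGER, Out_P05 INTEGER, Out_P10 INTEGER,
    Out_P30 INTEGER, Out_P50 INTEGER,
    PRIMARY KEY (ADSL_Phone, LogTime));
"""

conn = sqlite3.connect(":memory:")
conn.executescript(SCHEMA)

# Hypothetical sample rows
conn.execute("INSERT INTO District_Dim VALUES ('800', 'City A', 'District X')")
conn.execute("INSERT INTO Bandwidth_Dim VALUES (1, 512)")
conn.execute("INSERT INTO Router_Dim VALUES (1, 'router-1')")
conn.execute("INSERT INTO User_Dim VALUES "
             "('HN001', '07-0000001', 'home', '800', 'n/a', 1, 1)")
conn.executemany("INSERT INTO Time_Dim VALUES (?, ?, ?, ?, ?, ?, ?)", [
    ("2005-01-03 09:00:00", 2005, 1, 1, 1, 3, 9),
    ("2005-01-03 09:05:00", 2005, 1, 1, 1, 3, 9),
])
conn.executemany(
    "INSERT INTO Network_Traffic_Fact "
    "(ADSL_Phone, LogTime, Max_In, Max_Out, In_P01, Out_P01) "
    "VALUES (?, ?, ?, ?, ?, ?)", [
        ("07-0000001", "2005-01-03 09:00:00", 30.0, 4.0, 1, 0),
        ("07-0000001", "2005-01-03 09:05:00", 2.0, 1.0, 0, 0),
    ])

# Roll up along the district and time hierarchies: number of intervals per
# district and hour in which inbound usage reached the 1% threshold.
result = conn.execute("""
    SELECT d.City_Name, d.District_Name, t.Hour, SUM(f.In_P01)
    FROM Network_Traffic_Fact f
    JOIN User_Dim u ON u.ADSL_Phone = f.ADSL_Phone
    JOIN District_Dim d ON d.ZIP_Code = u.ZIP_Code
    JOIN Time_Dim t ON t.LogTime = f.LogTime
    GROUP BY d.City_Name, d.District_Name, t.Hour
""").fetchall()
```

Grouping by a coarser attribute of a hierarchy (for example Month instead of Hour, or City_Name alone) gives the corresponding roll-up without changing the fact table.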