TSINGHUA SCIENCE AND TECHNOLOGY ISSN 1007-0214 10/10 pp95-101 Volume 19, Number 1, February 2014
Mobile Internet Big Data Platform in China Unicom
Wenliang Huang, Zhen Chen, Wenyu Dong, Hang Li, Bin Cao, and Junwei Cao
Abstract: China Unicom, the largest WCDMA 3G operator in China, is meeting the demands of the historic Mobile Internet Explosion, the surge of Mobile Internet traffic from mobile terminals. According to China Unicom’s internal statistics, mobile user traffic has increased rapidly, with a Compound Annual Growth Rate (CAGR) of 135%. China Unicom currently stores more than 2 trillion records per month, with a monthly data volume of over 525 TB, and the highest data volume has reached a peak of 5 PB. Since October 2009, as part of a long-term strategy to make full use of this Big Data, China Unicom has been developing a home-brewed big data storage and analysis platform based on the open source Hadoop Distributed File System (HDFS). All Mobile Internet traffic is well served by this big data platform. Currently, the writing speed has reached 1390000 records per second, and the record retrieval time in a table that contains trillions of records is less than 100 ms. To take advantage of this opportunity to become a Big Data Operator, China Unicom has developed new functions and multiple innovations to solve the space and time constraint challenges presented in data processing. In this paper, we introduce our big data platform in detail. On top of this platform, China Unicom is building an industry ecosystem based on Mobile Internet Big Data, and considers that a telecom-operator-centric ecosystem can be formed that is critical to prosperity in the modern communications business.
Key words: big data platform; China Unicom; 3G wireless network; Hadoop Distributed File System (HDFS); mobile Internet; network forensics; data warehouse; HBase
Wenliang Huang is with China Unicom Groups, No. 21 Financial Street, Xicheng District, Beijing 100140, China.
Zhen Chen and Junwei Cao are with the Research Institute of Information Technology (RIIT) and Tsinghua National Lab for Information Science and Technology (TNList), Tsinghua University, Beijing 100084, China. E-mail: {zhenchen, jcao}@tsinghua.edu.cn.
Wenyu Dong and Bin Cao are with the Department of Computer Science and Technology, Tsinghua University, Beijing 100084, China.
Hang Li is with the Department of Computer Science and Technology, PLA Univ. of Info. & Eng., Zhengzhou 450001, China.
To whom correspondence should be addressed.
Manuscript received: 2014-01-09; accepted: 2014-01-10
1 Introduction
Users of the Mobile Internet[1] can access any content, anytime, and anywhere. This convenience produces a large volume of individual user network traffic on the telecom operator side, which is referred to as the Mobile Traffic Deluge. According to Mary Meeker’s report[2] on Mobile Internet Trends, more and more PC software is migrating to Mobile Internet devices. It is also predicted that mobile traffic will double every 14 months and that the volume of Internet traffic will quadruple between 2011 and 2016, reaching 1.3 ZB per year in 2016, as indicated by Cisco VNI[3]. China Unicom, the largest 3G operator in China, is prepared to meet this “Mobile Internet Explosion”.
According to statistics from China Unicom, which had approximately 250 million client users in 2012, mobile user traffic is increasing rapidly, with a Compound Annual Growth Rate (CAGR) of 135%. Mobile Internet traffic characteristics have also been investigated in Ref. [3], and a traffic prediction model based on ARMA and FARIMA has been proposed to capture the multi-fractal spectra in mobile traffic.
China Unicom’s big data platform, under development since October 2009, records more than 2 trillion records per month, with a monthly data volume of over 525 TB, and the maximum data volume recorded has reached a peak of 5 PB. The overall writing speed has reached 1390000 records per second, and the record retrieval time in a table that contains trillions of records is less than 100 ms.
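As a rough sanity check on these figures, the monthly totals can be converted into an average record size and an average ingest rate. The sketch below assumes a 30-day month and decimal terabytes (1 TB = 10^12 bytes); neither assumption comes from the paper, and the function names are purely illustrative. Because the reported values are lower bounds ("more than", "over"), the results are only indicative.

```python
# Back-of-the-envelope check of the reported platform figures.
# Assumptions (not from the paper): 30-day month, 1 TB = 10**12 bytes.

RECORDS_PER_MONTH = 2e12           # "more than 2 trillion records" per month
BYTES_PER_MONTH = 525e12           # "over 525 TB" per month
REPORTED_WRITE_RATE = 1_390_000    # records per second, as reported
SECONDS_PER_MONTH = 30 * 24 * 3600


def average_record_size(bytes_per_month: float, records_per_month: float) -> float:
    """Average number of bytes stored per record."""
    return bytes_per_month / records_per_month


def average_ingest_rate(records_per_month: float,
                        seconds: int = SECONDS_PER_MONTH) -> float:
    """Average records per second if the monthly load were spread evenly."""
    return records_per_month / seconds


if __name__ == "__main__":
    size = average_record_size(BYTES_PER_MONTH, RECORDS_PER_MONTH)
    rate = average_ingest_rate(RECORDS_PER_MONTH)
    print(f"average record size: {size:.1f} bytes")
    print(f"average ingest rate: {rate:,.0f} records/s")
    print(f"reported write speed / average rate: {REPORTED_WRITE_RATE / rate:.2f}")
```

Under these assumptions the average record is about 262 bytes and the average ingest rate is about 772000 records per second, so the reported writing speed is roughly 1.8 times the monthly average.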
2 Related Work
Network traffic recording or archiving is widely applied in network forensics, network troubleshooting, and user behavior analysis. All inbound and outbound traffic from a certain vantage point can be recorded to restore the original condition at a later time if necessary.
Because of storage limits, usually only network flow data or statistics are recorded, containing only source and destination IP addresses, ports, protocols, and timestamps.
The actual flow contents are usually neglected, as they would otherwise require a huge repository to accommodate. In addition, there are legal debates over Deep Packet Inspection (DPI) of flow contents concerning user privacy issues. Nevertheless, this information is sometimes useful for quickly identifying phishing[4], spammers, and other types of cyber-attacks.
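The flow-level recording described above can be illustrated with a minimal sketch that aggregates packets into flow records keyed by source and destination address, ports, and protocol, keeping only timestamps and counters and discarding payloads. The `FlowRecord` type, the packet tuple layout, and the packet and byte counters are assumptions made for this illustration; they are not taken from any of the cited systems.

```python
from dataclasses import dataclass
from typing import Dict, Iterable, Tuple

# Flow key: (src_ip, dst_ip, src_port, dst_port, protocol)
FlowKey = Tuple[str, str, int, int, str]
# Packet: (timestamp, src_ip, dst_ip, src_port, dst_port, protocol, size_bytes)
Packet = Tuple[float, str, str, int, int, str, int]


@dataclass
class FlowRecord:
    """Per-flow statistics; the packet payload itself is never stored."""
    first_seen: float
    last_seen: float
    packets: int = 0
    bytes: int = 0


def aggregate(packets: Iterable[Packet]) -> Dict[FlowKey, FlowRecord]:
    """Collapse a packet stream into one record per flow key."""
    flows: Dict[FlowKey, FlowRecord] = {}
    for ts, src, dst, sport, dport, proto, size in packets:
        key = (src, dst, sport, dport, proto)
        rec = flows.get(key)
        if rec is None:
            rec = flows[key] = FlowRecord(first_seen=ts, last_seen=ts)
        rec.first_seen = min(rec.first_seen, ts)
        rec.last_seen = max(rec.last_seen, ts)
        rec.packets += 1
        rec.bytes += size
    return flows
```

Each flow costs a fixed, small amount of storage regardless of how much payload it carried, which is why flow-only recording is so much cheaper than full content capture.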
CNSMS[5] and TIFAflow[6] are used for traffic acquisition and aggregation for forensic analysis. CNSMS is an architecture that uses TIFAflow for traffic acquisition and its UTM appliance for traffic aggregation, supporting forensic analysis in a cloud-computing-based security center. TIFAflow is a software-based probe that combines TIFA[7-9] with Fastbit[10] indexing to provide granular data storage. It may be operated as an independent probe or integrated into CNSMS’s UTM appliance.
Deri and Fusco[11,12] also proposed MicroCloud-based flow aggregation for fixed and mobile networks. This architecture is used to provide real-time traffic monitoring and correlation in large distributed environments. Their system is deployed in the VIVACOM (Bulgarian Telecom) mobile network and is used for monitoring the .it DNS ccTLD and a large 3G mobile network.
Other works, such as those of Lee et al.[13,14] and Qian et al.[15], use a similar platform for network data analysis.
For any mobile network operator, even recording only network flow data could easily produce a repository at the terabyte level each year. If all mobile traffic data is recorded for forensic analysis, the volume of the data could easily reach the petabyte level. This remains a major challenge for a mobile network operator that must accommodate and index such big data for further analysis.
3 Mobile Traffic Acquisition at China Unicom
3.1 Traffic data acquisition
At China Unicom, traffic acquisition is performed at each Gn point of the GGSN in the 3G WCDMA mobile network, which represents the vantage point of the mobile Internet in each province; more than one hundred GGSNs are used to cover all service areas. Traffic acquisition captures all the IP packets and aggregates the packets from each user.
The principle of the aggregation is that a user’s valid behavior data must not be lost, while invalid data should be reduced for efficiency. Each file is then produced in less than five minutes and is smaller than 200 MB. Every file contains approximately 700000 records. The detailed deployment of the traffic probes is shown in Fig. 1.
All traffic types are resolved once the traffic is captured. The captured traffic is transmitted after being packaged using a private format that is designed according to China Unicom’s uniform Internet records query and analysis system. The detailed format for a traffic record is shown in Table 1.
Notes on the important fields shown in Table 1 are given in the following six rules:
(1) The field shown in bold in the table must be captured but is not stored in the first stage; all other fields must be captured and placed in storage.
(2) Fields of traffic data packets without related information are set to null.
(3) In the detailed record files, multiple CDRs are separated by a Carriage Return (CR) symbol and a newline symbol.
(4) To ensure that the information is available for querying within 30 minutes, an intermediary log is generated every five minutes for all the protocols. The records of IM traffic (such as QQ, WeChat, Fetion, XMPP) are merged by user login ID. The traffic records of RTSP, FTP, SIP, and other traffic types with separate control and data channels are merged across those channels, and the merged record is identified by the control channel port. The traffic records of other traffic that uses multiple IPs and channels are merged, and the merged record is identified by the first IP and port.
(5) Collect the WAP and HTTP information that contains a complete URL field, including the “http://” prefix and the host domain information; if there is no such information, the field must be filled with a null string.
(6) Traffic type coding uses 3 digits. A vertical bar is used as the separator between fields in a traffic record. The interval of traffic file generation is 5 minutes by default and can be modified on demand. The size of a single file is limited to less than 200 MB. In each time interval, a traffic record file is generated, and writing to the file ends when either the time limit or the file size limit is reached. If the data for one interval exceeds 200 MB, multiple files are produced to guarantee that each file stays below the threshold, and the additional related files are identified by appending a hexadecimal number such as [nnnnn]x.
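Rules (2), (3), and (6) can be sketched as follows: fields are joined with a vertical bar, records end with CR plus newline, a missing value becomes an empty field, and a new file is started whenever the time limit or the size limit is reached. This is only an illustrative sketch. The field contents are hypothetical because Table 1 is not reproduced here, representing null as an empty field is an assumption, and the file naming with the hexadecimal suffix is deliberately left out because the exact scheme is not specified. The time window here starts at the first record rather than on a fixed clock boundary.

```python
from typing import Iterable, Iterator, List, Optional, Sequence, Tuple

FIELD_SEP = "|"                      # rule (6): vertical bar between fields
RECORD_SEP = "\r\n"                  # rule (3): CR followed by newline
MAX_FILE_BYTES = 200 * 1024 * 1024   # rule (6): each file stays below 200 MB
INTERVAL_SECONDS = 5 * 60            # rule (6): default 5-minute interval

Fields = Sequence[Optional[object]]


def encode_record(fields: Fields) -> str:
    """Serialize one record; None is written as an empty field (assumed null)."""
    return FIELD_SEP.join("" if f is None else str(f) for f in fields) + RECORD_SEP


def decode_records(text: str) -> List[List[Optional[str]]]:
    """Parse a file body back into records; empty fields become None."""
    return [
        [f if f != "" else None for f in line.split(FIELD_SEP)]
        for line in text.split(RECORD_SEP)
        if line
    ]


def rotate(records: Iterable[Tuple[float, Fields]],
           interval: float = INTERVAL_SECONDS,
           max_bytes: int = MAX_FILE_BYTES) -> Iterator[bytes]:
    """Yield one byte string per output file.

    `records` is an iterable of (timestamp, fields) in time order. A file is
    closed when the interval has elapsed or when adding the next record would
    reach the size limit.
    """
    buf: List[bytes] = []
    size = 0
    window_start: Optional[float] = None
    for ts, fields in records:
        data = encode_record(fields).encode("utf-8")
        if window_start is None:
            window_start = ts
        time_up = ts - window_start >= interval
        too_big = size + len(data) >= max_bytes
        if buf and (time_up or too_big):
            yield b"".join(buf)
            buf, size = [], 0
            if time_up:
                window_start = ts    # a new interval begins with this record
        buf.append(data)
        size += len(data)
    if buf:
        yield b"".join(buf)
```

A size-triggered rotation keeps the current time window, which mirrors the rule that one interval may produce several related files.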
3.2 Traffic data warehouse
The files are transmitted via FTP to twenty-four FTP servers located in Beijing. Two small provinces normally share one FTP server, while a large province normally requires two FTP servers. To reduce the transmission bandwidth, all files are compressed with the bzip2 compression algorithm before they are uploaded to Beijing from each province.
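The compress-before-upload step can be sketched with the `bz2` module from the Python standard library. The function names and the sample record content are hypothetical, and the FTP transfer itself is omitted; the sketch only shows the round trip between the provincial side and the warehousing side.

```python
import bz2


def compress_for_upload(raw: bytes) -> bytes:
    """Provincial side: bzip2-compress a record file before transmission."""
    return bz2.compress(raw, compresslevel=9)


def decompress_on_arrival(blob: bytes) -> bytes:
    """Warehouse side: restore the original record file."""
    return bz2.decompress(blob)


if __name__ == "__main__":
    # Hypothetical, highly repetitive record data, typical of log-like files.
    sample = b"user-0001|10.0.0.1|80|http://example.com/\r\n" * 1000
    packed = compress_for_upload(sample)
    assert decompress_on_arrival(packed) == sample
    print(f"{len(sample)} bytes -> {len(packed)} bytes")
```

Record files of this kind repeat many field values, so they tend to compress well, which is the reason compression pays off on inter-province links.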
The warehousing program also runs on the FTP servers, and reads the files transmitted using FTP protocol. After being decompressed, t
