4.1 Data Storage and Management
Our problem requires ad-hoc navigation of time series at different aggregation granularities, as well as fast access to sentiments for demographic groups. We therefore organize the data storage around a time-indexed, aggregating structure named the Demographics Tree (DTree), whose nodes provide access to aggregated sentiment values via the demographics lattice. We show this structure in Figure 3 and describe it below.
DTree is a hierarchically organized balanced tree, where the levels of the hierarchy correspond to years, months, weeks, and days. Each node in the tree corresponds to one of these intervals, and is connected to its parent and children nodes in the hierarchy, as well as to the adjacent nodes at the same level. For its specific time interval, each DTree node stores statistical aggregations of sentiments as a triple (count, sum, sum of squares) for each topic t ∈ T. These aggregations allow us to reconstruct sentiment mean, variance, volume, and their derivatives. They are also incrementally maintainable, which makes it easy to update the DTree as new data come in. In addition, DTree nodes store top-k correlations for the particular time interval and topic, in order to facilitate query answering. We provide more details on the construction of top-k correlations in Section 4.3.
DTree nodes maintain physical aggregations only for the top-level demographic groups for each topic (e.g., only for group (1.1) in Figure 3). Detailed aggregations for all individual groups are accessible by following a pointer to a separate structure, the sequential file storage for lattices. This pointer indicates an offset in the file that contains the demographics lattice snapshot with the aggregations for all demographic groups for the particular topic and time interval. By traversing this sequential file storage structure, we can simultaneously reconstruct the sentiment time series for all demographic groups for a particular topic and time aggregation level.
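To illustrate why the (count, sum, sum of squares) triple is sufficient, the following minimal sketch shows how such an aggregate could be updated incrementally, merged from child intervals into a parent interval, and used to recover the mean (sum / count) and the population variance (sum of squares / count minus the squared mean). The class name `Aggregate` and its method names are assumptions made for illustration only; they do not come from the described system.

```python
from dataclasses import dataclass


@dataclass
class Aggregate:
    """Hypothetical per-topic aggregate held by one DTree node."""
    count: int = 0
    total: float = 0.0      # sum of sentiment values
    sq_total: float = 0.0   # sum of squared sentiment values

    def add(self, value: float) -> None:
        # Incremental maintenance: one new sentiment value arrives.
        self.count += 1
        self.total += value
        self.sq_total += value * value

    def merge(self, other: "Aggregate") -> "Aggregate":
        # A parent interval (e.g. a week) is the component-wise sum
        # of its child intervals (e.g. days).
        return Aggregate(
            self.count + other.count,
            self.total + other.total,
            self.sq_total + other.sq_total,
        )

    def mean(self) -> float:
        return self.total / self.count if self.count else 0.0

    def variance(self) -> float:
        # Population variance: E[x^2] - (E[x])^2, clamped at zero
        # to guard against small negative rounding errors.
        if self.count == 0:
            return 0.0
        m = self.mean()
        return max(self.sq_total / self.count - m * m, 0.0)


# Two daily aggregates rolled up into a coarser interval.
day1 = Aggregate()
for v in (1.0, -1.0):
    day1.add(v)
day2 = Aggregate()
for v in (1.0, 1.0):
    day2.add(v)
rollup = day1.merge(day2)
```

Because merging is a component-wise addition, a coarser node never needs to revisit the raw sentiment values, which is what makes the aggregates cheap to maintain as new data arrive.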
Thanks to this layout, a time index with high-level aggregates and pointers remains compact and can be kept in main memory (Figure 3, left), while sentiment time series can be organized as a collection of individual files (Figure 3, right). An additional benefit of this organization is that it ensures fast sequential access to the time series of sentiments, compared to relational databases.
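The split between a compact in-memory index and sequential lattice files can be sketched as follows. This is only an illustration under stated assumptions: the fixed-size record layout, the constant `NUM_GROUPS`, the date-string keys, and the helper names `append_snapshot`, `read_snapshot`, and `scan_series` are all hypothetical, and an in-memory `io.BytesIO` buffer stands in for a file on disk.

```python
import io
import struct

# Assumed record layout per demographic group:
# count (int64), sum (float64), sum of squares (float64).
RECORD = struct.Struct("<qdd")
NUM_GROUPS = 3  # hypothetical number of groups in the lattice


def append_snapshot(f, groups):
    """Append one lattice snapshot; return its byte offset (the 'pointer')."""
    f.seek(0, io.SEEK_END)
    offset = f.tell()
    for count, total, sq_total in groups:
        f.write(RECORD.pack(count, total, sq_total))
    return offset


def read_snapshot(f, offset):
    """Random access: follow a pointer from the index to one snapshot."""
    f.seek(offset)
    data = f.read(RECORD.size * NUM_GROUPS)
    return [RECORD.unpack_from(data, i * RECORD.size) for i in range(NUM_GROUPS)]


def scan_series(f, num_snapshots):
    """Sequential scan: rebuild the mean-sentiment series of every group."""
    f.seek(0)
    series = [[] for _ in range(NUM_GROUPS)]
    for _ in range(num_snapshots):
        for g in range(NUM_GROUPS):
            count, total, _sq = RECORD.unpack(f.read(RECORD.size))
            series[g].append(total / count if count else 0.0)
    return series


f = io.BytesIO()   # stands in for the per-topic lattice file
index = {}         # in-memory time index: interval -> file offset
index["day-1"] = append_snapshot(f, [(4, 2.0, 4.0), (3, 1.0, 3.0), (0, 0.0, 0.0)])
index["day-2"] = append_snapshot(f, [(2, 1.0, 1.0), (0, 0.0, 0.0), (1, -1.0, 1.0)])

snapshot = read_snapshot(f, index["day-2"])
series = scan_series(f, num_snapshots=2)
```

In this sketch only the `index` dictionary needs to stay in memory; a single pass over the file yields the series for all groups at once, which mirrors the sequential-access benefit described above.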
We note that the number of sentiment values monotonically decreases as we navigate down the demographics lattice and down the DTree levels, so many of the demographics leaf nodes will contain zeroes at lower time granularities. This allows storing sentiment values in a more compact way, by storing only non-zero values (e.g., using run-length encoding methods).
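As a rough illustration of the compaction idea, the sketch below applies plain run-length encoding to a sequence of per-group counts in which most entries are zero. The function names `rle_encode` and `rle_decode` and the sample values are assumptions for illustration; the text does not specify which encoding variant is actually used.

```python
def rle_encode(values):
    """Collapse consecutive equal values into (value, run_length) pairs."""
    runs = []
    for v in values:
        if runs and runs[-1][0] == v:
            runs[-1][1] += 1
        else:
            runs.append([v, 1])
    return [(v, n) for v, n in runs]


def rle_decode(runs):
    """Expand (value, run_length) pairs back into the original sequence."""
    out = []
    for v, n in runs:
        out.extend([v] * n)
    return out


# Hypothetical counts for the leaf groups of one daily lattice snapshot.
counts = [0, 0, 0, 5, 0, 0, 2, 2]
encoded = rle_encode(counts)
decoded = rle_decode(encoded)
```

The longer the runs of zeroes at fine time granularities, the fewer pairs are needed, while decoding still restores the original layout exactly.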
