2021 | Information Systems Frontiers | Citations: 0
Authors: Truică, Ciprian-Octavian; Apostol, Elena-Simona; Darmont, Jérôme; Assent, Ira
Abstract: Extracting top-k keywords and documents using weighting schemes is a popular technique employed in text mining and machine learning for different analysis and retrieval tasks. The weights are usually computed in the data preprocessing step, as it is costly to update them and to keep track of all the modifications performed on the dataset. Furthermore, calculation errors are introduced when analyzing only subsets of the dataset, i.e., wrong weights are computed, as weighting schemes use the number of documents for scoring keywords and documents. Therefore, in a Big Data context, it is crucial to lower the runtime of computing weighting schemes without hindering the analysis process and the accuracy of the machine learning algorithms. To address this requirement for the task of computing top-k keywords and documents (which largely relies on weighting schemes), it is customary to design benchmarks that compare weighting schemes within various configurations of distributed frameworks and database management systems. Thus, we propose TextBenDS - a generic document-oriented benchmark for storing textual data and constructing weighting schemes. Our benchmark offers a generic data model designed with a multidimensional approach for storing text documents. We also propose using aggregation queries with various complexities and selectivities for constructing term weighting schemes that are used in extracting top-k keywords and documents. We evaluate the computing performance of the queries on several distributed environments set within the Apache Hadoop ecosystem. Our experimental results provide interesting insights. As an example, MongoDB shows the best overall performance, while Spark’s execution time remains almost constant regardless of weighting schemes.
Semantic filters:
HiveQL
Topics:
information retrieval; MongoDB; database system; distributed system; MapReduce
Methods:
term frequency–inverse document frequency; Okapi BM25; experiment; design science; qualitative content analysis
Can Social Media Support Public Health? Demonstrating Disease Surveillance using Big Data Analytics
2015 | Americas Conference on Information Systems | Citations: 0
Abstract: Rapid growth of the Internet has paved the way for millions of people across the globe to access social media platforms such as Facebook and Twitter. These social media platforms enable people to share information instantaneously. The large volume of information shared on these platforms can be leveraged to identify outbreaks of various epidemics. This will help health professionals to provide timely intervention, which in turn could help save lives and millions of dollars. Analysis of information shared on social media is complicated by its sheer volume, varied formats, and velocity of collection. We have addressed this problem by making use of a big data analytics platform capable of handling large quantities of streaming data. In this paper, we demonstrate how data from social media can be effectively used in the surveillance of disease conditions.
Semantic filters:
HiveQL
Topics:
Twitter; social media; Hadoop Distributed File System; Apache Hadoop; big data
Methods:
case study; cluster analysis; natural language processing; part of speech tagging; computational algorithm