Journal of the American Society for Information Science and Technology

Wiley InterScience: Journal of the American Society for Information Science and Technology
Updated: 1 hour 17 minutes ago

SpamED: A spam E-mail detection approach based on phrase similarity

Fri, 10/03/2008 - 12:17
E-mail messages are unquestionably one of the most popular communication media these days. Not only are they fast and reliable, but they are also generally free. Unfortunately, a significant number of the e-mail messages that users receive daily are spam. This is annoying, since users waste time reviewing and deleting spam messages. In addition, spam messages consume resources such as storage, bandwidth, and computer-processing time. Many attempts have been made in the past to eradicate spam; however, none has proven highly effective. In this article, we propose a spam e-mail detection approach, called SpamED, which uses the similarity of phrases in messages to detect spam. Experiments verify not only that SpamED, using trigrams in e-mail messages, is capable of minimizing false positives and false negatives in spam detection, but also that it outperforms a number of existing e-mail filtering approaches, with a 96% accuracy rate.
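The abstract does not spell out how SpamED scores phrase similarity, so the following is only a minimal sketch of the general idea of comparing word trigrams between an incoming message and known spam. The function names, the overlap ratio, and the 0.5 threshold are all illustrative assumptions, not the authors' method.

```python
def word_trigrams(text):
    """Return the set of consecutive three-word sequences in text."""
    words = text.lower().split()
    return {tuple(words[i:i + 3]) for i in range(len(words) - 2)}


def trigram_similarity(message, known_spam):
    """Fraction of the message's trigrams that also occur in known_spam."""
    msg = word_trigrams(message)
    if not msg:
        return 0.0
    return len(msg & word_trigrams(known_spam)) / len(msg)


def looks_like_spam(message, spam_corpus, threshold=0.5):
    """Flag a message that is similar enough to any known spam message.

    The threshold is a hypothetical value chosen for illustration.
    """
    return any(trigram_similarity(message, s) >= threshold for s in spam_corpus)
```

For instance, with a one-message spam corpus `["buy cheap pills online now"]`, the message "please buy cheap pills online now" shares three of its four trigrams with the known spam and would be flagged, while an unrelated message shares none.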

A global map of science based on the ISI subject categories

Fri, 10/03/2008 - 12:04
The decomposition of scientific literature into disciplinary and subdisciplinary structures is one of the core goals of scientometrics. How can we achieve a good decomposition? The ISI subject categories classify journals included in the Science Citation Index (SCI). The aggregated journal-journal citation matrix contained in the Journal Citation Reports can be aggregated on the basis of these categories. This leads to an asymmetrical matrix (citing versus cited) that is much more densely populated than the underlying matrix at the journal level. Exploratory factor analysis of the matrix of subject categories suggests a 14-factor solution. This solution could be interpreted as the disciplinary structure of science. The nested maps of science (corresponding to 14 factors, 172 categories, and 6,164 journals) are available online. Presumably, inaccuracies in the attribution of journals to the ISI subject categories average out so that the factor analysis reveals the main structures. The mapping of science could, therefore, be comprehensive and reliable on a large scale, albeit imprecise in terms of the attribution of journals to the ISI subject categories.
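The aggregation step described above can be sketched as follows. This is a minimal illustration, not the authors' procedure: the data layout and the function name are assumptions, and a journal assigned to several categories is counted in full for each of them, which is one of several possible conventions.

```python
from collections import defaultdict


def aggregate_by_category(citations, category_of):
    """Sum journal-to-journal citation counts into a category-level matrix.

    citations:   dict mapping (citing_journal, cited_journal) -> count
    category_of: dict mapping journal -> list of subject categories
    Returns a dict mapping (citing_category, cited_category) -> count.
    """
    matrix = defaultdict(int)
    for (citing, cited), count in citations.items():
        for citing_cat in category_of[citing]:
            for cited_cat in category_of[cited]:
                matrix[(citing_cat, cited_cat)] += count
    return dict(matrix)
```

Because the citing and cited directions are kept separate, the resulting matrix is asymmetrical in general, and it is denser than the journal-level matrix because many journal pairs collapse into each category pair.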

New relations between similarity measures for vectors based on vector norms

Fri, 10/03/2008 - 10:39
The well-known similarity measures Jaccard, Salton's cosine, Dice, and several related overlap measures for vectors are compared. Although general relations cannot be proved, we study these measures on the "trajectories" of the form ||X|| = a||Y||, where a > 0 is a constant and ||·|| denotes the Euclidean norm of a vector. In this case, direct functional relations between these measures are proved. For Jaccard, we prove that it is a convexly increasing function of Salton's cosine measure, but always smaller than or equal to the latter, thereby explaining a curve found experimentally by Leydesdorff. All the other measures have a linear relation with Salton's cosine, reducing even to equality in the case a = 1. Hence, for equally normed vectors (e.g., for normalized vectors), we essentially have only Jaccard's measure and Salton's cosine measure, since all the other measures are equal to the latter.
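A small numerical check may make these relations concrete. The sketch below assumes the usual vector generalizations of the measures (Jaccard as X·Y / (||X||² + ||Y||² − X·Y), Dice as 2X·Y / (||X||² + ||Y||²)) and assumes the trajectory condition reads ||X|| = a||Y||; the abstract itself does not state the definitions, so treat both as assumptions. Under them, substituting X·Y = cos · a||Y||² gives Dice = 2a·cos / (a² + 1) and Jaccard = a·cos / (a² + 1 − a·cos).

```python
import math


def dot(x, y):
    """Inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(x, y))


def norm(x):
    """Euclidean norm."""
    return math.sqrt(dot(x, x))


def cosine(x, y):
    """Salton's cosine measure."""
    return dot(x, y) / (norm(x) * norm(y))


def jaccard(x, y):
    """Vector form of Jaccard (assumed definition, see lead-in)."""
    return dot(x, y) / (dot(x, x) + dot(y, y) - dot(x, y))


def dice(x, y):
    """Vector form of Dice (assumed definition, see lead-in)."""
    return 2 * dot(x, y) / (dot(x, x) + dot(y, y))
```

For the equally normed pair (3, 4) and (4, 3), both cosine and Dice come out as 24/25, while Jaccard is 24/26, which equals cos / (2 − cos) and is smaller than the cosine. Doubling the first vector to (6, 8) puts the pair on the trajectory with a = 2, where Dice drops to 2a·cos / (a² + 1) while the cosine is unchanged.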

Computational methods in authorship attribution

Fri, 09/26/2008 - 06:02
Statistical authorship attribution has a long history, culminating in the use of modern machine learning classification methods. Nevertheless, most of this work suffers from the limitation of assuming a small closed set of candidate authors and essentially unlimited training text for each. Real-life authorship attribution problems, however, typically fall short of this ideal. Thus, following detailed discussion of previous work, three scenarios are considered here for which solutions to the basic attribution problem are inadequate. In the first variant, the profiling problem, there is no candidate set at all; in this case, the challenge is to provide as much demographic or psychological information as possible about the author. In the second variant, the needle-in-a-haystack problem, there are many thousands of candidates for each of whom we might have a very limited writing sample. In the third variant, the verification problem, there is no closed candidate set but there is one suspect; in this case, the challenge is to determine if the suspect is or is not the author. For each variant, it is shown how machine learning methods can be adapted to handle the special challenges of that variant.

Erratum re: "The DCI-index: Discounted cumulated impact-based research evaluation", Journal of the American Society for Information Science and Technology, 59(9), 1433-1440

Fri, 09/26/2008 - 05:51
The article by K. Järvelin & O. Persson published in JASIST 59(9), "The DCI-Index: Discounted Cumulated Impact-Based Research Evaluation" (pp. 1433-1440), contains an unfortunate error in one of its formulas. The present paper gives the correction and an example of impact analysis based on the corrected formula.

The Matthew effect defined and tested for the 100 most prolific economists

Thu, 09/25/2008 - 11:36
The Matthew effect holds that recognition is bestowed on researchers of already high repute. If recognition is measured by citations, this means that often-cited papers or authors are cited more often. I use the statistical theory of the growth of firms to test whether the fame of papers and authors indeed exhibits increasing returns to scale, and confirm this hypothesis for the 100 most prolific economists.

Author-choice open-access publishing in the biological and medical literature: A citation analysis

Thu, 09/25/2008 - 11:23
In this article, we analyze the citations to articles published in 11 biological and medical journals from 2003 to 2007 that employ author-choice open-access models. Controlling for known explanatory predictors of citations, only 2 of the 11 journals show positive and significant open-access effects. Analyzing all journals together, we report a small but significant increase in article citations of 17%. In addition, there is strong evidence to suggest that the open-access advantage is declining by about 7 percentage points per year, from 32% in 2004 to 11% in 2007.

HotMap: Supporting visual exploration of Web search results

Fri, 09/19/2008 - 10:11
Although information retrieval techniques used by Web search engines have improved substantially over the years, the results of Web searches have continued to be represented in simple list-based formats. Although the list-based representation makes it easy to evaluate a single document for relevance, it does not support the users in the broader tasks of manipulating or exploring the search results as they attempt to find a collection of relevant documents. HotMap is a meta-search system that provides a compact visual representation of Web search results at two levels of detail, and it supports interactive exploration via nested sorting of Web search results based on query term frequencies. An evaluation of the search results for a set of vague queries has shown that the re-sorted search results can provide a higher portion of relevant documents among the top search results. User studies show an increase in speed and effectiveness and a reduction in missed documents when comparing HotMap to the list-based representation used by Google. Subjective measures were positive, and users showed a preference for the HotMap interface. These results provide evidence for the utility of next-generation Web search results interfaces that promote interactive search results exploration.
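The abstract describes the sorting only as "nested sorting of Web search results based on query term frequencies". One plausible reading, sketched below under that assumption, is to order results by the frequency of the first query term and to break ties with the frequencies of the later terms. The function names and the whitespace tokenization are illustrative and are not taken from HotMap.

```python
def term_counts(text, query_terms):
    """Count how often each query term occurs in text (case-insensitive)."""
    words = text.lower().split()
    return [words.count(term.lower()) for term in query_terms]


def nested_sort(results, query_terms):
    """Sort by the first term's frequency, breaking ties with later terms.

    Lists of counts compare lexicographically, so the first query term
    dominates and each later term only matters among equal earlier counts.
    """
    return sorted(
        results,
        key=lambda text: term_counts(text, query_terms),
        reverse=True,
    )
```

Reordering the query terms changes the nesting, which is how a user could re-sort the same result set around whichever term matters most.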

The publication and citation impact profiles of Angewandte Chemie and the Journal of the American Chemical Society based on the sections of Chemical Abstracts: A case study on the limitations of the Journal Impact Factor

Fri, 09/19/2008 - 06:52
The Journal Impact Factor (JIF) published by Thomson Reuters is often used to evaluate the significance and performance of scientific journals. Besides methodological problems with the JIF, the critical issue is whether a single measure is sufficient for characterizing the impact of journals, particularly the impact of multidisciplinary and wide-scope journals that publish articles in a broad range of research fields. Taking Angewandte Chemie International Edition and the Journal of the American Chemical Society as examples, we examined the two journals' publication and impact profiles across the sections of Chemical Abstracts and compared the results with the JIF. The analysis was based primarily on Communications published in Angewandte Chemie International Edition and the Journal of the American Chemical Society during 2001 to 2005. The findings show that the information available in the Science Citation Index is a rather unreliable indication of the document type and is therefore inappropriate for comparative analysis. The findings further suggest that the composition of the journal in terms of contribution types, the length of the citation window, and the thematic focus of the journal in terms of the sections of Chemical Abstracts have a significant influence on the overall journal citation impact. Therefore, a single measure of journal citation impact such as the JIF is insufficient for characterizing the significance and performance of wide-scope journals. For the comparison of journals, more sophisticated methods such as publication and impact profiles across subject headings of bibliographic databases (e.g., the sections of Chemical Abstracts) are valuable.

Natural language processing versus content-based image analysis for medical document retrieval

Thu, 09/18/2008 - 11:25
One of the most significant recent advances in health information systems has been the shift from paper to electronic documents. While research on automatic text and image processing has taken separate paths, there is a growing need for joint efforts, particularly for electronic health records and biomedical literature databases. This work aims at comparing text-based versus image-based access to multimodal medical documents using state-of-the-art methods of processing text and image components. A collection of 180 medical documents containing an image accompanied by a short text describing it was divided into training and test sets. Content-based image analysis and natural language processing techniques are applied individually and combined for multimodal document analysis. The evaluation consists of an indexing task and a retrieval task based on the "gold standard" codes manually assigned to corpus documents. The performance of text-based and image-based access, as well as combined document features, is compared. Image analysis proves more adequate for both the indexing and retrieval of the images. In the indexing task, multimodal analysis outperforms both independent image and text analysis. This experiment shows that text describing images can be usefully analyzed in the framework of a hybrid text/image retrieval system.

Training a hierarchical classifier using inter document relationships

Fri, 09/12/2008 - 11:13
Text classifiers automatically classify documents into appropriate concepts for different applications. Most classification approaches use flat classifiers that treat each concept as independent, even when the concept space is hierarchically structured. In contrast, hierarchical text classification exploits the structural relationships between the concepts. In this article, we explore the effectiveness of hierarchical classification for a large concept hierarchy. Since the quality of the classification is dependent on the quality and quantity of the training data, we evaluate the use of documents selected from subconcepts to address the sparseness of training data for the top-level classifiers and the use of document relationships to identify the most representative training documents. By selecting training documents using structural and similarity relationships, we achieve a statistically significant relative improvement of 39.8% (from 54.5% to 76.2%) in the accuracy of the hierarchical classifier over that of the flat classifier for a large, three-level concept hierarchy.

Business stakeholder analyzer: An experiment of classifying stakeholders on the Web

Fri, 09/12/2008 - 11:00
As the Web is used increasingly to share and disseminate information, business analysts and managers are challenged to understand stakeholder relationships. Traditional stakeholder theories and frameworks employ a manual approach to analysis and do not scale up to accommodate the rapid growth of the Web. Unfortunately, existing business intelligence (BI) tools lack analysis capability, and research on BI systems is sparse. This research proposes a framework for designing BI systems to identify and to classify stakeholders on the Web, incorporating human knowledge and machine-learned information from Web pages. Based on the framework, we have developed a prototype called Business Stakeholder Analyzer (BSA) that helps managers and analysts to identify and to classify their stakeholders on the Web. Results from our experiment involving algorithm comparison, feature comparison, and a user study showed that the system achieved better within-class accuracies in widespread stakeholder types such as partner/sponsor/supplier and media/reviewer, and was more efficient than human classification. The student and practitioner subjects in our user study strongly agreed that such a system would save analysts' time and help to identify and classify stakeholders. This research contributes to a better understanding of how to integrate information technology with stakeholder theory, and enriches the knowledge base of BI system design.

A Google Scholar h-index for journals: An alternative metric to measure journal impact in economics and business

Fri, 09/12/2008 - 07:01
We propose a new data source (Google Scholar) and metric (Hirsch's h-index) to assess journal impact in the field of economics and business. A systematic comparison between the Google Scholar h-index and the ISI Journal Impact Factor for a sample of 838 journals in economics and business shows that the former provides a more accurate and comprehensive measure of journal impact.
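For readers unfamiliar with the metric, Hirsch's h-index is the largest number h such that h items each have at least h citations. The sketch below applies that definition to a list of per-article citation counts for a journal. How the counts are gathered from Google Scholar is outside its scope, and the function name is illustrative.

```python
def h_index(citation_counts):
    """Largest h such that h items have at least h citations each."""
    ranked = sorted(citation_counts, reverse=True)
    h = 0
    for rank, citations in enumerate(ranked, start=1):
        if citations >= rank:
            h = rank  # the top `rank` items all have >= rank citations
        else:
            break
    return h
```

A journal whose articles have 10, 8, 5, 4, and 3 citations has an h-index of 4: four articles have at least four citations each, but there are not five with at least five.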

Exploring the h-index at patent level

Mon, 09/08/2008 - 11:31
As an acceptable proxy for innovative activity, patents have become increasingly important in recent years. Patents and patent citations have been used for the construction of technology indicators. This article presents an alternative to other citation-based indicators, i.e., the patent h-index, which is borrowed from bibliometrics. We conduct the analysis on a sample of the world's top 20 firms in the semiconductor area, ranked by total patents granted in the period 1996-2005, from the Derwent Innovations Index. We also investigate the relationships between the patent h-index and three other indicators, i.e., patent counts, citation counts, and the mean family size (MFS). The findings show that the patent h-index is indeed an effective indicator for evaluating the technological importance and quality, or impact, of an assignee. In addition, the MFS indicator correlates negatively and not significantly with the patent h-index, which indicates that the "social value" of a patent is in disagreement with its "private value." The two indicators, patent h-index and MFS, both provide an overview of the value of patents, but from two different angles.

Intellectual structure of human resources management research: A bibliometric analysis of the journal Human Resource Management, 1985-2005

Fri, 09/05/2008 - 06:34
The multidisciplinary character of the theories supporting research in the discipline of human resources management (HRM), the increasing importance of a more rigorous approach to HRM studies by academics, and the impact of HRM on the competitive advantage of firms are just some of the indicators demonstrating the relevance of this discipline in the broader field of the social sciences. These developments explain why a quantitative analysis of HRM studies based on bibliometric techniques is particularly opportune. The general objective of this article is to analyze the intellectual structure of the HRM discipline; this can be divided into two specific objectives. The first is to identify the most frequently cited studies, with the purpose of identifying the key topics of research in the HRM discipline. The second objective is to represent the networks of relationships between the most-cited studies, grouping them under common themes, with the object of providing a diagrammatic description of the knowledge base constituted by accumulated works of research in the HRM field. The methodology utilized is based on the bibliometric techniques of citation analysis.

Scholarly hyperwriting: The function of links in academic weblogs

Fri, 09/05/2008 - 06:30
Weblogs are gaining momentum as one of the most versatile tools for online scholarly communication. Since academic weblogs tend to be used by scholars to position themselves in a disciplinary blogging community, links are essential to their construction. The aim of this article is to analyze the reasons for linking in academic weblogs and to determine how links are used for distribution of information, collaborative construction of knowledge, and construction of the blog's and the blogger's identity. For this purpose I analyzed types of links in 15 academic blogs, considering both sidebar links and in-post links. The results show that links are strategically used by academic bloggers for several purposes, among others to seek their place in a disciplinary community, to engage in hypertext conversations for collaborative construction of knowledge, to organize information in the blog, to publicize their research, to enhance the blog's visibility, and to optimize blog entries and the blog itself.

Presentation bias is significant in determining user preference for search results - A user study

Thu, 09/04/2008 - 06:23
We describe the results of an experiment designed to study user preferences for different orderings of search results from three major search engines. In the experiment, 65 users were asked to choose the best ordering from two different orderings of the same set of search results: Each pair consisted of the search engine's original top-10 ordering and a synthetic ordering created from the same top-10 results retrieved by the search engine. This process was repeated for 12 queries and nine different synthetic orderings. The results show that there is a slight overall preference for the search engines' original orderings, but the preference is rarely significant. Users' choice of the "best" result from each of the different orderings indicates that placement on the page (i.e., whether the result appears near the top) is the most important factor used in determining the quality of the result, not the actual content displayed in the top-10 snippets. In addition to the placement bias, we detected a small bias due to the reputation of the sites appearing in the search results.

User satisfaction with an internet-based portal: An asymmetric and nonlinear approach

Thu, 09/04/2008 - 06:06
Past research in information systems (IS) user satisfaction primarily adopted a conventional "key-driver analysis" approach assuming that independent variables symmetrically and linearly affect user satisfaction. However, recent studies suggest that relationships in IS satisfaction models are more complex. Relying solely on symmetric and linear models runs the risk of systematically misestimating the impact of independent variables on user satisfaction. Building upon previous work, we empirically tested the asymmetric and nonlinear IS user satisfaction model in the context of Internet-based portals. Results show that negative perceived performance on three of the four information-quality attributes has a greater impact on overall satisfaction than does positive perceived performance. In addition, user satisfaction appears to display diminishing sensitivity to information quality in the domain of negative perceived performance but not in that of positive perceived performance. We expect that this study will generate interest in this new but important area of research.
