Wiley Online Library » Statistical Analysis and Data Mining
Statistical Analysis and Data Mining addresses the broad area of data analysis, including statistical approaches, machine learning, data mining, and applications. Topics include statistical and computational approaches for analyzing massive and complex datasets, novel statistical and/or machine learning methods and theory, and state-of-the-art applications with high impact.
13h ago
Statistical Analysis and Data Mining: The ASA Data Science Journal, Volume 17, Issue 3, June 2024
1w ago
Abstract
Ordinal data frequently occur in fields such as knowledge level assessment, credit rating, clinical disease diagnosis, and psychological evaluation. Classic models, including cumulative logistic regression and probit regression, are often used to model such data. However, these approaches describe only the conditional mean of the response variable given a set of predictors, which often yields non-robust estimates. As an alternative, the composite quantile regression (CQR) approach is usually employed to gain more robust and re ...
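To make the CQR idea above concrete, here is a minimal sketch of the standard composite quantile regression objective for a continuous response: one shared slope vector, one intercept per quantile level, and the check loss summed over all levels. This is an illustration only, not the ordinal-data method of the abstract; the function names `check_loss` and `cqr_objective` are hypothetical.

```python
def check_loss(u, tau):
    """Quantile check loss: tau * u for u >= 0, (tau - 1) * u for u < 0."""
    return u * (tau - (1.0 if u < 0 else 0.0))


def cqr_objective(X, y, beta, intercepts, taus):
    """Composite quantile objective (illustrative, continuous response).

    The slope vector `beta` is shared across all quantile levels in `taus`,
    while each level has its own intercept in `intercepts`.
    """
    total = 0.0
    for b_k, tau in zip(intercepts, taus):
        for x_i, y_i in zip(X, y):
            fitted = b_k + sum(b * x for b, x in zip(beta, x_i))
            total += check_loss(y_i - fitted, tau)
    return total
```

Minimizing this objective over `beta` and `intercepts` pools information from several quantiles rather than relying on the mean alone, which is the usual source of the robustness the abstract refers to.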
2w ago
Abstract
Transfer learning, which borrows information to address limited sample sizes, has gained increasing attention in recent years. Our method aims to use data from other population groups as a complement to enhance risk factor discernment and failure time prediction among underrepresented subgroups. However, a gap exists in the literature on effective knowledge transfer from the source to the target for risk assessment with interval-censored data while accommodating population incomparability and privacy constraints. Our objective is to bridge this gap by developing a transfer ...
2w ago
Abstract
We introduce a simple variant of a purely random forest, called an absolute random forest (ARF), for use in clustering. At every node, units are split using a randomly chosen feature and a random threshold drawn from a uniform distribution whose support, the range of the selected feature in the root node, does not change. This enables closed-form estimators of parameters, such as pairwise proximities, to be obtained without having to grow a forest. The probabilistic structure corresponding to an ARF is called a treeless absolute random forest (TARF). With high probability, th ...
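The split rule described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the helper names `root_ranges` and `absolute_random_split` are hypothetical, the data are a plain list of rows, and the closed-form proximity estimators mentioned in the abstract are not reproduced here.

```python
import random


def root_ranges(X):
    """Per-feature (min, max), computed once on the root node."""
    n_features = len(X[0])
    return [
        (min(row[j] for row in X), max(row[j] for row in X))
        for j in range(n_features)
    ]


def absolute_random_split(X, indices, ranges, rng):
    """Split the units in `indices` at one node.

    A feature is chosen at random, and the threshold is drawn uniformly
    from that feature's ROOT range, which never shrinks with depth.
    """
    j = rng.randrange(len(ranges))
    lo, hi = ranges[j]
    t = rng.uniform(lo, hi)
    left = [i for i in indices if X[i][j] <= t]
    right = [i for i in indices if X[i][j] > t]
    return j, t, left, right
```

Because the threshold support is fixed at the root range, a split deep in a tree may fall outside the values present at that node and leave one child empty. The split distribution therefore does not depend on which units reached the node.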
2w ago
Abstract
With the increasing availability in clinical research of biomedical data collected from the same patients across multiple platforms, such as epigenomics, gene expression, and clinical features, there is a growing need for statistical methods that can jointly analyze data from different platforms to provide complementary information for clinical studies. In this paper, we propose a two-stage hierarchical Bayesian model that integrates high-dimensional biomedical data from diverse platforms to select biomarkers associated with clinical outcomes of interest. In the first stage, we use Expectation Maxi ...
2w ago
Abstract
In many scientific experiments, multiarmed bandits are used as an adaptive data collection method. However, this adaptive process can lead to a dependence that renders many commonly used statistical inference methods invalid. An example of this is the sample mean, which is a natural estimator of the mean parameter but can be biased. This can cause test statistics based on this estimator to have an inflated type I error rate, and the resulting confidence intervals may have significantly lower coverage probabilities than their nominal values. To address this issue, we propose an alterna ...
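The bias of the sample mean under adaptive collection can be seen in a small simulation. The sketch below is an assumption-laden illustration, not the setup or the estimator of the abstract: it uses a purely greedy two-armed bandit with standard normal rewards on both arms (true mean 0), and the function names are hypothetical.

```python
import random


def greedy_two_arm_sample_mean(horizon, rng):
    """Run one greedy two-armed bandit with N(0, 1) rewards on both arms
    and return the final sample mean of arm 0."""
    sums = [rng.gauss(0.0, 1.0), rng.gauss(0.0, 1.0)]  # one initial pull per arm
    counts = [1, 1]
    for _ in range(horizon - 2):
        means = [sums[a] / counts[a] for a in (0, 1)]
        arm = 0 if means[0] >= means[1] else 1  # greedy: pull the current leader
        sums[arm] += rng.gauss(0.0, 1.0)
        counts[arm] += 1
    return sums[0] / counts[0]


def estimate_bias(n_runs=4000, horizon=20, seed=0):
    """Average the final sample mean of arm 0 over many independent runs.

    The true mean is 0, so this average is a Monte Carlo estimate of the bias.
    """
    rng = random.Random(seed)
    total = sum(greedy_two_arm_sample_mean(horizon, rng) for _ in range(n_runs))
    return total / n_runs
```

Under this greedy rule the average tends to fall below the true value of 0: an arm that starts with an unlucky low draw is rarely pulled again, so its low sample mean is never corrected, while a lucky high start is pulled repeatedly and regresses toward 0.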
2w ago
Abstract
Nuclear data are fundamental inputs to radiation transport codes used for reactor design and criticality safety. The design of experiments to reduce nuclear data uncertainty has been a challenge for many years, but advances in the sensitivity calculations of radiation transport codes within the last two decades have made optimal experimental design possible. The design of integral nuclear experiments poses numerous challenges not emphasized in classical optimal design, in particular, constrained design spaces (in both a statistical and engineering sense), severely under-determined sys ...
3w ago
Abstract
In the last two decades, we have witnessed a huge increase in valuable big data structured in the form of graphs or networks. To apply traditional machine learning and data analytic techniques to such data, it is necessary to transform graphs into vector-based representations that preserve the most essential structural properties of graphs. For this purpose, a large number of graph embedding methods have been proposed in the literature. Most of them produce general-purpose embeddings suitable for a variety of applications such as node clustering, node classification, graph visualizatio ...
1M ago
Abstract
Advances in high-throughput sequencing technologies have stimulated intensive research interest in identifying specific microbial taxa that are associated with disease conditions. Such knowledge is invaluable both for understanding biology and for therapeutic development, as the microbiome is inherently modifiable. Despite the availability of massive data, analysis of microbiome compositional data remains difficult. The fact that the relative abundances of all components of a microbial community sum to one poses challenges for statistical a ...
1M ago
Abstract
Class imbalance is a common and critical challenge in machine learning classification problems, resulting in low prediction accuracy. While numerous methods, especially data augmentation methods, have been proposed to address this issue, a method that works well on one dataset may perform poorly on another. To the best of our knowledge, there is still no single best approach for handling class imbalance that can be applied uniformly. In this paper, we propose an approach named smart data augmentation (SDA), which aims to augment imbalanced data in an optimal way to maximize downst ...