Research
Overview
The goal of my research is to provide a theoretical and algorithmic framework for information science that can lead to efficient strategies for assessing, gathering, extracting, and exploiting information. In the era of the data deluge, we want to fully utilize the large volume and richness of data sets to efficiently infer the real-world phenomena behind the data. Information-theoretic concepts and tools are useful in data science, especially for establishing fundamental limits and exploring trade-offs in extracting information from data sets. To deal with new challenges that originate from practical concerns in engineering information processors for big data, we also need new techniques and concepts beyond the classical information-theoretic solutions.
My research focuses on developing a theoretical framework for data science that copes with practical concerns such as timeliness in decision making, efficient use of limited sensing resources, and computational efficiency in data processing. More specifically, I study questions such as: How can we design sensing strategies to acquire the most relevant observations for estimating an unknown target variable at the lowest cost? How can we quantify the value of information and develop strategies to extract the most valuable information given limited sensing resources? How can we design efficient information-recovery procedures for large amounts of noisy observations? How can we design distributed querying over a crowd of workers with unknown reliabilities to efficiently collect useful observations? I develop algorithms for these data acquisition and information recovery problems and provide performance guarantees for these algorithms by using tools from probability theory, information theory, and stochastic analysis.
Recent papers
Algorithms and Theory for Data Science and Machine Learning
Rank-1 Matrix Completion with Gradient Descent and Small Random Initialization, NeurIPS 2023
Recovering Top-Two Answers and Confusion Probability in Multi-Choice Crowdsourcing, ICML 2023
A Worker-Task Specialization Model for Crowdsourcing: Efficient Inference and Fundamental Limits, IEEE Trans. Information Theory 2024
Binary Classification with XOR Queries: Fundamental Limits and An Efficient Algorithm, IEEE Trans. Information Theory 2021
Detection of Signal in the Spiked Rectangular Models, ICML 2021
Robust Hypergraph Clustering via Convex Relaxation of Truncated MLE, IEEE JSAIT 2020
Efficient Deep Learning, Robust and Trustworthy AI
BWS: Best Window Selection Based on Sample Scores for Data Pruning across Broad Ranges, ICML 2024
SelMatch: Effectively Scaling Up Dataset Distillation via Selection-Based Initialization and Partial Updates by Trajectory Matching, ICML 2024
Data Valuation without Training of a Model, ICLR 2023
Test-Time Adaptation via Self-Training with Nearest Neighbor Information, ICLR 2023
Self-Diagnosing GAN: Diagnosing Underrepresented Samples in Generative Adversarial Networks, NeurIPS 2021