RMT-KD: Random Matrix Theoretic Causal Knowledge Distillation

Davide Ettori¹, Nastaran Darabi¹, Sureshkumar Senthilkumar¹, and Amit Ranjan Trivedi¹
Abstract

Large deep learning models such as BERT and ResNet achieve state-of-the-art performance but are costly to deploy at the edge due to their size and compute demands. We present RMT-KD, a compression method that leverages Random Matrix Theory (RMT) for knowledge distillation to iteratively reduce network size. Instead of pruning or heuristic rank selection, RMT-KD preserves only informative directions identified via the spectral properties of hidden representations. RMT-based causal reduction is applied layer by layer with self-distillation to maintain stability and accuracy. On GLUE, AG News, and CIFAR-10, RMT-KD achieves up to 80% parameter reduction with only 2% accuracy loss, delivering 2.8× faster inference and nearly halved power consumption. These results establish RMT-KD as a mathematically grounded approach to network distillation.

I Introduction and Prior Works

Recent advances in natural language processing (NLP) and computer vision rely on increasingly large deep learning models such as BERT [1] and convolutional networks such as ResNet [2]. While highly accurate, these models incur substantial inference latency, memory footprint, and carbon emissions [3]. Classical compression methods include knowledge distillation (KD), which transfers predictions from a frozen teacher to a smaller student [4], with refinements such as DistilBERT [5] and progressive shrinking [6]; pruning, which removes unimportant weights or channels [7]; and low-rank factorization, which compresses convolutional filters [8] and adaptation matrices in large language models (LLMs) [9]. Yet these techniques often depend on heuristic thresholds, yield hardware-unfriendly sparsity, or lack a principled statistical rule.

In this paper, we ask: Is there a principled statistical basis for model reduction? We argue that Random Matrix Theory (RMT) provides such a foundation [10]. In high-dimensional settings, the eigenvalue spectrum of activation covariances typically separates into a bulk that reflects random noise and a small number of outliers that capture structured, causal features [11, 12]. This separation, formalized by the Marchenko–Pastur (MP) law [13], offers a rigorous criterion for identifying informative directions.

Building on this insight, we propose an iterative distillation framework: once the network has learned meaningful features, we extract a calibration subset, estimate activation covariances, and project onto the causal subspace defined by outlier eigenvalues. The resulting models are narrower yet fully dense and deployable on standard hardware without sparse operations. After each reduction, self-distillation from the previous checkpoint restores accuracy and prevents catastrophic forgetting. Repeating this cycle across layers enables progressive, RMT-guided compression.

Figure 1: Architecture of RMT-KD for iterative distillation. At each stage, hidden layer activations are analyzed with RMT principles to identify causal directions, followed by projection and self-distillation. This process is repeated across layers until the benefits saturate.

This RMT-guided distillation is scalable across architectures and modalities, adapting projections to the spectral structure of each layer. Compression can be terminated based on validation accuracy, retained outliers, or target reductions. Our contributions are: (i) a principled, causal rule for layer reduction via RMT-based eigenvalue filtering, thereby avoiding heuristic rank selection; (ii) a model-agnostic algorithm alternating RMT projection with self-distillation; and (iii) an empirical study demonstrating state-of-the-art accuracy–efficiency trade-offs on GLUE and CIFAR-10, with significant parameter and energy savings.

II RMT-KD: Methodology

We propose an iterative self-distillation method that compresses neural networks by progressively reducing their width under RMT guidance. Unlike standard knowledge distillation, where a student learns from a fixed teacher, our approach uses a single model that distills itself at intermediate checkpoints. Each iteration treats the current model as the teacher and its reduced counterpart as the student, trained with a combined loss:

L = \alpha\,\mathrm{CE}_{\text{task}} + (1-\alpha)\,\mathrm{KL}(p_{\text{old}}\,\|\,p_{\text{new}}),

where $\mathrm{CE}_{\text{task}}$ is the cross-entropy loss and $\mathrm{KL}(p_{\text{old}}\,\|\,p_{\text{new}})=\sum_{i}p_{\text{old}}(i)\log\tfrac{p_{\text{old}}(i)}{p_{\text{new}}(i)}$ is the distillation loss. This regularizes training, enforcing similarity between consecutive models and mitigating catastrophic forgetting.
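For concreteness, the listing below gives a minimal PyTorch sketch of this objective (our own illustration, not the authors' released code; the function name, the `alpha` default, and the absence of a softmax temperature are assumptions):

```python
import torch.nn.functional as F

def self_distillation_loss(student_logits, teacher_logits, labels, alpha=0.5):
    """L = alpha * CE_task + (1 - alpha) * KL(p_old || p_new).

    The previous checkpoint (teacher) supplies p_old; the reduced model
    (student) supplies p_new. Teacher logits are detached so gradients
    flow only through the student.
    """
    ce = F.cross_entropy(student_logits, labels)
    teacher_logits = teacher_logits.detach()
    p_old = F.softmax(teacher_logits, dim=-1)
    log_p_old = F.log_softmax(teacher_logits, dim=-1)
    log_p_new = F.log_softmax(student_logits, dim=-1)
    # KL(p_old || p_new) = sum_i p_old(i) * (log p_old(i) - log p_new(i))
    kl = (p_old * (log_p_old - log_p_new)).sum(dim=-1).mean()
    return alpha * ce + (1.0 - alpha) * kl
```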

Figure 2: The empirical eigenvalue distribution of the covariance matrices computed on the calibration set for BERT-base trained on SST.

RMT-Guided Projection: Training proceeds until validation accuracy exceeds a threshold, after which a calibration subset $\mathcal{D}_{\text{cal}}$ (10% of the training set) is used to extract hidden activations $X\in\mathbb{R}^{d\times n}$ from a target layer. This subset provides a stable snapshot of the model’s learned features without requiring the full dataset. The empirical covariance $\Sigma=\tfrac{1}{n}XX^{\top}$ yields the spectrum $\{\lambda_{i}\}_{i=1}^{d}$, which under the spiked covariance model [14, 15] follows the MP law [13]:

\rho_{\text{MP}}(\lambda)=\frac{1}{2\pi\lambda q\sigma^{2}}\sqrt{(\lambda_{+}-\lambda)(\lambda-\lambda_{-})},\qquad\lambda\in[\lambda_{-},\lambda_{+}],

with bulk edges $\lambda_{\pm}=\sigma^{2}(1\pm\sqrt{d/n})^{2}$, where $q=d/n$. The noise variance $\sigma^{2}$ is initialized as the median eigenvalue and refined by minimizing the $\ell_{2}$ distance between the empirical histogram and the MP distribution. Adjusting the initialization quantile controls compression aggressiveness: higher quantiles increase $\lambda_{+}$ and prune more directions.
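As an illustration of this fitting step, the sketch below estimates the bulk edge $\lambda_{+}$ from a layer's eigenvalue spectrum by a simple grid refinement of $\sigma^{2}$ around the chosen quantile (a minimal NumPy sketch under our own assumptions about bin count and search range, which the paper does not prescribe):

```python
import numpy as np

def mp_density(lam, sigma2, q):
    """Marchenko-Pastur density for aspect ratio q = d/n and noise variance sigma2."""
    lam_minus = sigma2 * (1 - np.sqrt(q)) ** 2
    lam_plus = sigma2 * (1 + np.sqrt(q)) ** 2
    rho = np.zeros_like(lam)
    inside = (lam > lam_minus) & (lam < lam_plus)
    rho[inside] = np.sqrt((lam_plus - lam[inside]) * (lam[inside] - lam_minus)) / (
        2.0 * np.pi * lam[inside] * q * sigma2
    )
    return rho, lam_plus

def fit_bulk_edge(eigvals, q, init_quantile=0.5, n_candidates=50):
    """Initialize sigma^2 at a quantile of the spectrum, then refine it by
    minimizing the l2 distance between the empirical histogram and the MP density."""
    hist, edges = np.histogram(eigvals, bins=100, density=True)
    centers = 0.5 * (edges[:-1] + edges[1:])
    sigma2_init = np.quantile(eigvals, init_quantile)
    best_sigma2, best_err = sigma2_init, np.inf
    for s2 in np.linspace(0.5 * sigma2_init, 1.5 * sigma2_init, n_candidates):
        rho, _ = mp_density(centers, s2, q)
        err = np.linalg.norm(hist - rho)
        if err < best_err:
            best_sigma2, best_err = s2, err
    _, lam_plus = mp_density(centers, best_sigma2, q)
    return best_sigma2, lam_plus
```

Raising `init_quantile` above the median shifts $\sigma^{2}$, and hence $\lambda_{+}$, upward, pruning more directions, which matches the aggressiveness control described above.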

Eigenvalues $\lambda_{i}>\lambda_{+}$ identify informative causal directions [16]. Their eigenvectors form a projection $P\in\mathbb{R}^{k\times d}$, defining a lower-dimensional causal subspace. A fixed linear layer applies $P$, and downstream layers are resized to $k<d$. Training then resumes with the loss $L$ defined above. This cycle repeats across layers until compression targets are met or validation accuracy falls below the threshold. In BERT, projections are applied to embeddings; in ResNet, to convolutional channels. Fig. 1 illustrates the workflow: train $\rightarrow$ analyze $\rightarrow$ project $\rightarrow$ fine-tune, producing compact, dense models that avoid sparse operations and remain efficient for GPU acceleration.
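The following sketch shows how such a projection layer might be constructed once $\lambda_{+}$ is known (a hypothetical helper under our assumptions; the resizing of downstream layers to width $k$ is omitted):

```python
import numpy as np
import torch
import torch.nn as nn

def build_projection(activations, lam_plus):
    """Build a frozen linear layer implementing P in R^{k x d}.

    activations: (d, n) array of hidden features from the calibration set.
    Keeps only eigenvectors whose eigenvalues exceed the MP bulk edge lam_plus.
    """
    d, n = activations.shape
    cov = activations @ activations.T / n          # Sigma = (1/n) X X^T
    eigvals, eigvecs = np.linalg.eigh(cov)         # eigenvalues in ascending order
    keep = eigvals > lam_plus                      # causal (outlier) directions
    P = eigvecs[:, keep].T                         # shape (k, d)
    proj = nn.Linear(d, P.shape[0], bias=False)
    with torch.no_grad():
        proj.weight.copy_(torch.from_numpy(P).float())
    proj.weight.requires_grad_(False)              # projection stays fixed
    return proj
```

The returned module is dense, so it maps directly onto standard GPU kernels; training then resumes on the narrowed network with the combined loss $L$.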

Complexity and Scalability Analysis: RMT-KD adds only lightweight spectral analysis. For a layer of width $d$ and calibration set size $n$, forming the covariance $\Sigma\in\mathbb{R}^{d\times d}$ costs $\mathcal{O}(nd^{2})$ and its eigendecomposition $\mathcal{O}(d^{3})$. Since $n$ is a small fraction (10%) of the training data and $d$ is bounded by the hidden width, this cost is negligible relative to training. The projection adds a fixed linear layer of size $k\times d$, similar to a pruning mask but without sparsity.

Overall, the loop scales as $\mathcal{O}(Ld^{3})$ for $L$ layers, dominated by decompositions that parallelize well on GPUs. Unlike iterative pruning, which requires repeated masks and sparse kernels, RMT-KD preserves dense tensors and hardware efficiency. The cubic cost remains tractable because only a few layers are reduced per iteration, with gains compounding across layers. The method thus applies to both convolutional backbones (ResNet-50) and transformers (BERT-base), and can extend to billion-parameter LLMs via block-wise decomposition or randomized eigensolvers.
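For very wide layers, the full eigendecomposition can be replaced by an iterative solver that returns only the leading eigenpairs; the snippet below is one possible drop-in (our assumption, since the paper does not name a specific solver):

```python
from scipy.sparse.linalg import eigsh

def top_spikes(cov, k_max=64, lam_plus=None):
    """Approximate only the k_max largest eigenpairs of a large covariance
    matrix, avoiding the full O(d^3) decomposition when d is very large."""
    vals, vecs = eigsh(cov, k=k_max, which="LA")   # largest algebraic eigenvalues
    if lam_plus is not None:
        keep = vals > lam_plus                     # retain only detached spikes
        vals, vecs = vals[keep], vecs[:, keep]
    return vals, vecs
```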

Experimental Setup: We trained all models from scratch: a 12-layer Transformer identical to BERT-base (139M parameters), its 6-layer TinyBERT-style counterpart (44M parameters, referred to as BERT-tiny below), and ResNet-50 (23M parameters). Experiments ran on a single NVIDIA RTX 6000 GPU with CUDA, with GPU power measured using vendor tools. Datasets followed standard splits: GLUE and AG News for language tasks, and CIFAR-10 for vision. Tokenization and preprocessing followed the standard recipes for each dataset.

Why RMT-based Model Distillation is Theoretically Justified: Although embeddings from large language and vision–language models are generated by deterministic networks, their covariance structures in high dimensions exhibit statistical regularities that can be described by RMT [17]. The eigenvalue spectrum of empirical covariance matrices typically consists of a bulk, following the MP distribution, and a few isolated spikes corresponding to structured, task-relevant signals. This separation provides a principled criterion for distinguishing meaningful representations from random variation, which forms the basis of our method.

For symmetric random matrices with i.i.d. entries of mean zero and variance $\sigma^{2}$, the Wigner semicircle law [18] describes the eigenvalue density and defines the noise floor of random fluctuations. For sample covariance matrices $\mathbf{C}=\tfrac{1}{n}\mathbf{X}^{\top}\mathbf{X}$ with $\mathbf{X}\in\mathbb{R}^{n\times p}$ and variance $\sigma^{2}$, the eigenvalue distribution converges to the MP law with support $[\lambda_{-},\lambda_{+}]=[\sigma^{2}(1-\sqrt{c})^{2},\,\sigma^{2}(1+\sqrt{c})^{2}]$, where $c=p/n$. Eigenvalues inside this bulk reflect random variation, while outliers indicate structure. The spiked covariance model [14] formalizes this separation, showing that if the signal strength exceeds the Baik–Ben Arous–Péché (BBP) threshold $\lambda_{\text{BBP}}=\sigma^{2}(1+\sqrt{c})$ [19], some eigenvalues detach from the MP bulk and their eigenvectors align with meaningful, data-dependent directions [20]. These spikes correspond to causal or semantically structured dimensions that can be isolated.
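A small simulation illustrates this transition (purely illustrative; the matrix sizes, seed, and planted spike strength are our own choices):

```python
import numpy as np

rng = np.random.default_rng(0)
n, p, sigma2 = 4000, 1000, 1.0
c = p / n

# Pure-noise data: the eigenvalues of C stay inside the MP bulk.
X = rng.normal(scale=np.sqrt(sigma2), size=(n, p))

# Plant one spike along a random direction u with strength well above the
# BBP threshold sigma^2 * (1 + sqrt(c)); its sample eigenvalue detaches.
u = rng.normal(size=p)
u /= np.linalg.norm(u)
strength = 3.0 * sigma2 * (1.0 + np.sqrt(c))
X += rng.normal(size=(n, 1)) * np.sqrt(strength) * u

C = X.T @ X / n
eigvals = np.linalg.eigvalsh(C)                    # ascending order
lam_plus = sigma2 * (1.0 + np.sqrt(c)) ** 2        # MP bulk edge
print(f"bulk edge {lam_plus:.2f}, top eigenvalue {eigvals[-1]:.2f}, "
      f"outliers above bulk: {int((eigvals > lam_plus).sum())}")
```

With the spike removed, essentially no eigenvalues escape the bulk (apart from edge fluctuations), which is exactly the distinction RMT-KD exploits when deciding which directions to keep.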

In modern deep networks, hidden representations are extremely high-dimensional, yet many directions do not contribute to task-relevant information [21, 22]. By retaining only eigenvectors linked to outlier eigenvalues, one constructs a projection operator mapping activations to a lower-dimensional, signal-dominant subspace. Unlike PCA, where the cutoff is based on heuristics such as explained variance ratios, RMT provides principled thresholds for separating signal from noise. This RMT-guided filtering discards noisy or redundant dimensions while preserving essential features, yielding narrower yet dense layers. Embedded in an iterative knowledge-distillation loop, it enables compressed models to adapt to the reduced space, maintaining accuracy while reducing memory, latency, and energy consumption.
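To make the contrast with PCA concrete, the sketch below compares the two cutoffs on the same spectrum (a hypothetical helper; the 95% explained-variance figure is a common PCA default, not taken from the paper):

```python
import numpy as np

def keep_counts(eigvals_desc, lam_plus, pca_variance=0.95):
    """Number of retained directions under a PCA explained-variance heuristic
    versus the RMT rule that keeps only eigenvalues above the MP bulk edge.

    eigvals_desc: eigenvalues sorted in descending order.
    """
    ratios = np.cumsum(eigvals_desc) / eigvals_desc.sum()
    k_pca = int(np.searchsorted(ratios, pca_variance)) + 1
    k_rmt = int((eigvals_desc > lam_plus).sum())
    return k_pca, k_rmt
```

On spectra dominated by the MP bulk, the RMT count is typically much smaller than the PCA count, consistent with the width reductions reported below.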

Figure 3: (a, top) Accuracy vs. parameter reduction and (b, bottom) power consumption vs. inference speedup on GLUE datasets ($\sigma^{2}=$ median eigenvalue, initial quantile = 50%).

III Results and Discussion

Fig. 2 shows the evolution of empirical eigenvalue spectra of covariance matrices computed on calibration data at different depths of BERT-base trained on SST (GLUE). After the validation accuracy threshold is reached, activations encode structured information rather than random noise, and the spectra deviate markedly from the Wigner semicircle law. In the embedding block, most eigenvalues cluster near zero, with only a few outliers exceeding the Marchenko–Pastur bulk (red dashed line), indicating that most directions are noise-dominated while a small subset carries meaningful signal. Deeper layers exhibit broader spectra with smoother decay and a larger fraction of eigenvalues above the bulk edge, reflecting the accumulation and propagation of task-relevant information. This trend is consistent with the idea that later representations are more specialized and semantically structured. Importantly, the bulk cutoff is computed adaptively from the data using RMT rather than with a fixed threshold as in PCA, ensuring that only statistically significant causal directions are retained at each step. This adaptive criterion supports our iterative approach: early layers can be compressed aggressively, while deeper layers with richer spectra require more conservative reduction, as their distributions deviate increasingly from the Wigner semicircle law.

Fig. 3a compares accuracy and parameter counts before and after RMT-based iterative compression. Across all model–dataset pairs, accuracy remains within 2–3 percentage points of the baseline despite substantial reduction, and in cases such as BERT-base on SST, performance even improves slightly—likely due to removal of noisy or redundant parameters. Parameter counts drop sharply, with BERT-base reduced by about 80% while retaining near-original accuracy, highlighting its overparameterization and the ability of RMT-based reduction with self-distillation to preserve informative directions. BERT-tiny still shows 58% reduction, suggesting redundancy even in smaller models, though with less dramatic impact. ResNet-50 exhibits the smallest reduction, around 48%, reflecting its more compact convolutional structure. Overall, the results indicate that larger, more overparameterized models benefit most from statistically guided compression, while smaller architectures are closer to their limits.

Figure 4: Comparison of memory on disk and energy efficiency for all models and GLUE datasets ($\sigma^{2}=$ median eigenvalue, initial quantile = 50%).

Fig. 3b shows inference speedup and average power consumption after compression. BERT-base achieves the largest gains, with nearly 3× faster inference on SST and QNLI and slightly lower but still substantial gains on QQP. These improvements stem from the sharp parameter reduction and smaller intermediate representations, which cut both computation and data movement. BERT-tiny shows more modest speedups of 1.3–1.4×, reflecting fixed overheads that dominate smaller models. ResNet-50, already compact, improves only marginally (≈1.03×), indicating limited latency benefits from further reduction. Power consumption consistently decreases across models, most notably for BERT-base, where lower compute demand reduces sustained power draw. For BERT-tiny and ResNet-50 the drop is less pronounced, likely because fixed hardware and memory access costs contribute a larger share of energy use.

Fig. 4 compares memory footprint and total energy consumption before and after compression. All models show substantial memory savings, with BERT-base dropping from 532 MB to just over 100 MB (≈80%) and BERT-tiny halving to 72 MB. These reductions mirror the parameter savings, as fewer weights translate directly to smaller model files. ResNet-50, starting from a smaller size with less redundant structure, shows more modest gains. Energy consumption during inference also decreases consistently, most sharply for BERT-base, which achieves over 5× reduction due to shorter execution and lower sustained power draw. BERT-tiny follows the same trend with smaller absolute savings, while ResNet-50 shows the least change.

Figure 5: Accuracy–reduction tradeoff for BERT-base (GLUE) and ResNet-50 (CIFAR-10) as a function of the eigenvalue quantile used to initialize $\sigma^{2}$. The x-axis shows the quantile; the y-axis shows accuracy (decreasing) and parameter reduction (increasing). The best balance occurs near 40%.

Ablation Study: Fig. 5 shows the trade-off between accuracy and parameter reduction when varying the quantile used to initialize $\sigma^{2}$ in BERT (GLUE) and ResNet (CIFAR-10). At low quantiles, accuracy remains close to the baseline but reductions are limited; at high quantiles, aggressive pruning causes sharp accuracy loss. The best balance occurs around the 40%–50% quantile, near the median, where both architectures retain high accuracy with substantial compression. This regime is particularly effective for reducing inference time and energy while preserving performance. ResNet offers greater reduction potential, while BERT is constrained by fixed embedding and token projection layers, which dominate beyond the 40% quantile and limit compression without accuracy degradation.

Model                  Method       Param. Red.   Δ Acc.
BERT-base (GLUE)       RMT-KD       80.9%         +1.8%
                       DistilBERT   42.7%         +0.2%
                       Theseus      48.3%         +0.6%
                       PKD          40.5%         -1.0%
BERT-tiny (GLUE)       RMT-KD       58.8%         +1.4%
                       DistilBERT   54.8%         +0.4%
                       Theseus      53.0%         +0.1%
                       PKD          50.1%         -0.8%
ResNet-50 (CIFAR-10)   RMT-KD       47.7%         +0.7%
                       AT           42.2%         +0.4%
                       FitNet       40.6%         +0.2%
                       CRD          45.4%         +0.6%
Table I: Comparison of KD methods. Param. Red. denotes the parameter reduction and Δ Acc. the accuracy change relative to the uncompressed baseline. BERT results are on GLUE; ResNet-50 results are on CIFAR-10. Theseus = BERT-of-Theseus, PKD = Patient Knowledge Distillation, AT = Attention Transfer, FitNet = Hints for Thin Deep Nets, CRD = Contrastive Representation Distillation.

Table I compares RMT-KD with state-of-the-art distillation and compression methods across NLP (BERT-base, BERT-tiny on GLUE) and CV (ResNet-50 on CIFAR-10). RMT-KD achieves substantial parameter reductions (up to 80.9%) while consistently improving accuracy, outperforming specialized baselines such as DistilBERT, Theseus, and CRD. This advantage arises from its ability to dynamically identify and retain only the most causal components of hidden representations. Unlike heuristic pruning or rank selection, RMT-KD leverages Random Matrix Theory: eigenvalues beyond the Marchenko–Pastur threshold mark informative directions, enabling compression driven by rigorous statistical principles rather than ad hoc cutoffs.

IV Conclusion

We introduce RMT-KD, an iterative distillation framework that combines layer-wise RMT analysis with self-distillation to compress deep models while preserving accuracy. Unlike pruning or heuristic truncation, RMT-KD offers a statistically grounded, data-driven rule for dimensionality reduction that retains dense causal structure. In experiments on BERT (GLUE, AG News) and ResNet-50 (CIFAR-10), it achieves up to 80% parameter reduction with only 2% accuracy loss, 2.8× faster inference, and 80% lower energy use. The models remain hardware-efficient since projections are dense and avoid sparse kernels. Gains are largest for overparameterized transformers, while smaller models like ResNet-50 show modest improvements, reflecting proximity to their efficiency frontier. Performance also depends on calibration subset quality, which may limit robustness under distribution shift.

Acknowledgment

This work was supported in part by gift funding from Intel, by CogniSense, one of the seven SRC/DARPA JUMP 2.0 centers, and by NSF CAREER Award #2046435.

References

  • [1] J. Devlin, M.-W. Chang, K. Lee, and K. Toutanova, “BERT: Pre-training of deep bidirectional transformers for language understanding,” in Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics, 2019.
  • [2] K. He, X. Zhang, S. Ren, and J. Sun, “Deep residual learning for image recognition,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2016.
  • [3] E. Strubell, A. Ganesh, and A. McCallum, “Energy and policy considerations for deep learning in NLP,” arXiv preprint arXiv:1906.02243, 2019.
  • [4] G. Hinton, O. Vinyals, and J. Dean, “Distilling the knowledge in a neural network,” arXiv preprint arXiv:1503.02531, 2015.
  • [5] V. Sanh, L. Debut, J. Chaumond, and T. Wolf, “DistilBERT, a distilled version of BERT: smaller, faster, cheaper and lighter,” in Proceedings of the Fifth Workshop on Energy Efficient Machine Learning and Cognitive Computing, 2019.
  • [6] H. Cai, C. Gan, T. Wang, Z. Zhang, and S. Han, “Once-for-All: Train one network and specialize it for efficient deployment,” in Proceedings of the 8th International Conference on Learning Representations (ICLR), 2020, arXiv preprint arXiv:1908.09791. [Online]. Available: https://arxiv.org/abs/1908.09791
  • [7] S. Han, H. Mao, and W. J. Dally, “Deep compression: Compressing deep neural networks with pruning, trained quantization and Huffman coding,” in International Conference on Learning Representations, 2016.
  • [8] M. Jaderberg, A. Vedaldi, and A. Zisserman, “Speeding up convolutional neural networks with low rank expansions,” in British Machine Vision Conference, 2014.
  • [9] E. J. Hu, Y. Shen, S. Wallis, H. Li, P. Yang, Z. Wang, and W. Chen, “LoRA: Low-rank adaptation of large language models,” in International Conference on Learning Representations, 2022.
  • [10] J. Bun, J.-P. Bouchaud, and M. Potters, “Cleaning large correlation matrices: tools from random matrix theory,” Physics Reports, vol. 666, pp. 1–109, 2017.
  • [11] V. Papyan, X. Han, and D. Donoho, “The power law spectrum of deep network gradients,” Nature Communications, vol. 11, pp. 1–12, 2020.
  • [12] C. H. Martin and M. W. Mahoney, “Implicit self-regularization in deep neural networks: Evidence from random matrix theory and implications for learning,” Journal of Machine Learning Research, vol. 22, pp. 1–73, 2021.
  • [13] V. A. Marchenko and L. A. Pastur, “Distribution of eigenvalues for some sets of random matrices,” Matematicheskii Sbornik, vol. 114, no. 4, pp. 507–536, 1967.
  • [14] I. M. Johnstone, “On the distribution of the largest eigenvalue in principal components analysis,” The Annals of Statistics, vol. 29, no. 2, pp. 295–327, 2001.
  • [15] D. Paul, “Asymptotics of sample eigenstructure for a large dimensional spiked covariance model,” Statistica Sinica, vol. 17, no. 4, pp. 1617–1642, 2007. [Online]. Available: https://www.jstor.org/stable/24307846
  • [16] N. Darabi, D. Naik, S. Tayebati, D. Jayasuriya, R. Krishnan, and A. R. Trivedi, “Eigenshield: Causal subspace filtering via random matrix theory for adversarially robust vision-language models,” 2025. [Online]. Available: https://arxiv.org/abs/2502.14976
  • [17] G. Zumbach, “Empirical properties of large covariance matrices,” Quantitative Finance, vol. 11, no. 7, pp. 1091–1102, 2011. [Online]. Available: https://arxiv.org/abs/0903.1525
  • [18] E. P. Wigner, “On the distribution of the roots of certain symmetric matrices,” Annals of Mathematics, vol. 67, no. 2, pp. 325–327, 1958.
  • [19] J. Baik, G. Ben Arous, and S. Péché, “Phase transition of the largest eigenvalue for nonnull complex sample covariance matrices,” The Annals of Probability, vol. 33, no. 5, pp. 1643–1697, 2005.
  • [20] F. Benaych-Georges and R. R. Nadakuditi, “Eigenvectors of spiked random matrices and outliers of the singular values,” Electronic Journal of Probability, vol. 16, pp. 1621–1677, 2011.
  • [21] M. Raghu, J. Gilmer, J. Yosinski, and J. Sohl-Dickstein, “SVCCA: Singular vector canonical correlation analysis for deep learning dynamics and interpretability,” in Advances in Neural Information Processing Systems (NeurIPS), vol. 30, 2017.
  • [22] T. Serre, A. M. Saxe, and A. S. Morcos, “A PCA-based analysis of deep neural networks,” Journal of Machine Learning Research, vol. 23, no. 84, pp. 1–35, 2022.