Astronomical observatories are physically remote, so data transmission limits severely bottleneck throughput; observatories can often collect far more data than they can downlink. Improvements in lossless compression could unlock millions of dollars' worth of additional science by enabling more observations, unleashing the potential of existing instruments.
Traditional lossless methods for compressing astrophysical data are manually designed. Neural data compression can learn end-to-end from data and outperform classical techniques by leveraging the unique spatial, temporal, and wavelength structures of astronomical images.
This paper introduces AstroCompress: a neural compression challenge for astrophysics data, featuring four new datasets with raw, 16-bit unsigned integer imaging data in various modes: space-based, ground-based, multi-wavelength, and time-series imaging. We benchmark seven lossless compression methods and show that neural methods can enhance data collection.
Astronomy data volumes are growing at a superexponential rate, increasing by a factor of nearly 1,000 every 10 years. This rapid growth places an immense burden on our ability to transmit and store data, threatening to stall scientific progress. The development of exascale supercomputers is a direct response to this data deluge.
Advanced compression is not merely a convenience: it is critical to ensuring we can continue to collect and analyze the vast amounts of data from radio astronomy, satellite imagery, and other large-scale scientific endeavors. Unlike internet media, where better compression means faster load times for a user, better telescope compression means more data collected overall. The Nancy Grace Roman Space Telescope, scheduled to launch in 2027, faces "unprecedented" data-scale challenges that could result in permanent data loss if transmission bottlenecks are not solved: "The rate of data generation significantly exceeds Roman's very large download capability (1.4 TB/day) even after on board compression; therefore downloading all individual readouts is currently not feasible." [StSci] For future space-based telescopes where every bit of downlink bandwidth is precious, compression improvements can dramatically increase scientific output and discovery potential, turning data that would otherwise be lost into preserved science.
Compression is important for ground-based telescopes as well. The Square Kilometre Array alone will produce 62 exabytes per year starting in the late 2020s. The infrastructure costs of storing, transmitting, and processing this data represent a substantial portion of total project budgets, so compression improvements compound across the entire data pipeline.
Due to the high costs of these instruments, even a one to few percent improvement in compression ratio can deliver significant scientific value.
We find that existing advanced non-neural codecs such as JPEG-XL and JPEG-LS achieve significantly better compression than the current astronomy standard, RICE. These algorithms are also highly efficient and compatible with most existing hardware -- both can compress a full 4K Hubble image in under a second, while JPEG-XL at its maximum effort setting takes about 1.5 minutes in exchange for further gains. We believe that future space- and ground-based optical telescopes should consider adopting one of these algorithms.
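As a rough illustration of this kind of benchmarking, the sketch below measures lossless compression ratios of these codecs on a single FITS frame. It assumes the `astropy` and `imagecodecs` packages are installed, that your `imagecodecs` build includes the JPEG-LS and JPEG-XL codecs, and that the file path and extension index are placeholders to adapt to your data.

```python
# Minimal sketch: lossless compression ratios for one 16-bit FITS frame.
# Assumes astropy and imagecodecs (with JPEG-LS / JPEG-XL enabled) are installed;
# "example_hst_frame.fits" and the extension index are placeholders.
import numpy as np
from astropy.io import fits
import imagecodecs

with fits.open("example_hst_frame.fits") as hdul:
    # Science image extension; adjust for your file layout. Assumes the frame is
    # already stored as 16-bit unsigned integers, as in the AstroCompress datasets.
    data = np.ascontiguousarray(hdul[1].data, dtype=np.uint16)

raw_bytes = data.nbytes

candidates = {
    "JPEG-LS": lambda a: imagecodecs.jpegls_encode(a),
    "JPEG-XL (lossless)": lambda a: imagecodecs.jpegxl_encode(a, lossless=True),
}
# (The "max" JPEG-XL setting mentioned above corresponds to a higher encoder
# effort level, which trades encode time for a better ratio.)

for name, encode in candidates.items():
    compressed = encode(data)
    print(f"{name}: ratio = {raw_bytes / len(compressed):.2f}")
```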
Additionally, we explore neural methods and show their potential for the medium to long term, once the necessary ML-acceleration hardware can be installed on space- or ground-based telescopes. Our results demonstrate that neural compression methods can match or exceed classical state-of-the-art algorithms like JPEG-XL while learning end-to-end from the data itself.
Unlike traditional hand-crafted compression algorithms, neural compression learns directly from astronomical data, automatically discovering the unique patterns and structures present in space-based and ground-based imagery. By leveraging the spatial correlations of celestial objects, temporal dependencies in time-series observations, and spectral relationships across wavelengths, neural methods might achieve superior compression ratios.
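For context on how a learned model becomes a codec (this is the standard information-theoretic link, not a formula specific to the paper): an entropy coder driven by a learned density model $p_\theta$ spends about $-\log_2 p_\theta(x)$ bits to encode an image $x$ losslessly, so for 16-bit data

$$
\text{ratio} \approx \frac{16}{\text{bits per pixel}}, \qquad \text{bits per pixel} = \frac{-\log_2 p_\theta(x)}{\text{number of pixels}}.
$$

Better learned models of astronomical structure therefore translate directly into higher compression ratios.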
| Dataset ID | Telescope/Survey | Location | Wavelength (Å) | Filters/Bands | Resolution (arcsec/pix) | Dimensions (pixels) | Arrays | Size (GB) | Key Artifacts |
|---|---|---|---|---|---|---|---|---|---|
| GBI-16-2D | W.M. Keck Observatory LRIS | Hawaii, USA (Ground) | 4500-8400 | B, V, R, I | 0.135 | 2248×2048, 3768×2520 | 137 | 1.5 | Atmospheric seeing, variable conditions |
| SBI-16-2D | Hubble Space Telescope ACS | Low Earth Orbit (Space) | 5926 | F606W (V-band) | 0.05 | 4144×2068 | 2140 | 69 | Cosmic ray hits, charge transfer inefficiency |
| SBI-16-3D | James Webb Space Telescope NIRCam | L2 Lagrange Point (Space) | 19840 | F200W (H-band IR) | 0.031 | 2048×2048×T (T=5-10) | 1273 | 90 | Up-the-ramp sampling, saturation effects |
| GBI-16-4D | Sloan Digital Sky Survey | New Mexico, USA (Ground) | 3543, 4770, 6231, 7625, 9134 | u, g, r, i, z (ugriz) | 0.396 | 800×800×5×T (T≤90) | 500 | 158 | Atmospheric variations, multi-epoch gaps |
| GBI-16-2D-Legacy | Las Cumbres Observatory | Multiple Sites (Ground) | Variable | B, V, r', i' | 0.58 | 3136×2112 | 9 | 0.1 | Benchmark comparison dataset |
Total Dataset: ~320 GB across 6,201 arrays | Bit Depth: 16-bit unsigned integer | Access: HuggingFace Datasets
The AstroCompress benchmark provides the astronomical community with 320 GB of carefully curated datasets spanning the full diversity of modern astronomical observations, from Hubble's cosmic ray-affected images to JWST's infrared time-series and SDSS's multi-wavelength surveys.
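For orientation, the snippet below shows one way the data could be pulled with the Hugging Face `datasets` library. The repository ID, split name, and `image` field are assumptions based on the dataset IDs above; consult https://huggingface.co/AstroCompress for the actual layout.

```python
# Sketch: streaming one AstroCompress dataset from Hugging Face.
# "AstroCompress/SBI-16-2D", split="train", and the "image" field are assumed names;
# check the AstroCompress Hugging Face page for the exact repository layout.
from datasets import load_dataset
import numpy as np

ds = load_dataset("AstroCompress/SBI-16-2D", split="train", streaming=True)

for example in ds:
    frame = np.asarray(example["image"], dtype=np.uint16)  # one 16-bit Hubble frame
    print(frame.shape, frame.dtype)
    break
```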
Compression Ratios Achieved: the uncompressed size divided by the compressed size; higher values indicate better compression.
IDF, L3C, PixelCNN++, and VDM* are neural methods; JPEG-XL (max), JPEG-XL, JPEG-LS, JPEG-2000, and RICE are non-neural.

| Dataset | IDF | L3C | PixelCNN++ | VDM* | JPEG-XL (max) | JPEG-XL | JPEG-LS | JPEG-2000 | RICE |
|---|---|---|---|---|---|---|---|---|---|
| **Single-Frame Datasets** | | | | | | | | | |
| LCO (Ground) | 2.83 | 1.67 | 2.02 | 3.64 | 2.98 | 2.78 | 2.81 | 2.80 | 2.65 |
| Keck (Ground) | 2.04 | 1.89 | 2.08 | 2.11 | 2.01 | 1.97 | 1.97 | 1.96 | 1.84 |
| Hubble (Space) | 2.94 | 2.90 | 3.13 | 3.33 | 3.26 | 2.92 | 2.86 | 2.67 | 2.64 |
| JWST-2D (Space) | 1.44 | 1.38 | 1.44 | 1.44 | 1.38 | 1.33 | 1.35 | 1.37 | 1.24 |
| SDSS-2D (Ground) | 2.91 | 2.36 | 3.35 | 3.27 | 3.38 | 3.14 | 3.16 | 3.20 | 2.96 |
| **Multi-Frame/Multi-Spectral Datasets** | | | | | | | | | |
| JWST-2D-Res | 3.14 | 2.91 | 2.80 | — | 3.35 | 2.37 | 3.24 | 1.69 | 3.08 |
| SDSS-3Dλ | 3.05 | 2.29 | 2.88 | — | 3.49 | 3.23 | 3.24 | 3.28 | 3.05 |
| SDSS-3DT | 3.03 | 2.59 | 3.02 | — | 3.48 | 3.23 | 3.24 | 3.29 | 3.05 |
Our analysis confirms a fundamental insight about astronomical image compression, first explained by Pence et al. (2009): 98.5% of pixels in typical astronomical images have a signal-to-noise ratio below 3, meaning they are dominated by background noise rather than actual astronomical sources. This finding has important implications for compression algorithm design.
We demonstrate a strong correlation between background noise levels and the bitrate required to compress each image across all datasets: images with a higher background noise standard deviation require more bits per pixel, in line with the coding cost of noise predicted by Shannon's source coding theorem.
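To make the link concrete (an illustrative calculation, not a figure from the paper): if background pixels behave like Gaussian noise with standard deviation $\sigma$ counts (for $\sigma$ well above one count), their entropy after quantization to integer counts is approximately

$$
H \approx \tfrac{1}{2}\log_2\!\left(2\pi e \,\sigma^2\right) = \log_2 \sigma + 2.05 \ \text{bits per pixel},
$$

so every doubling of the background noise costs roughly one extra bit per pixel, regardless of the lossless codec used.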
Our Hubble dataset analysis shows an inverse relationship between telescope exposure time and achievable compression ratios. Longer exposures accumulate more photon noise and detector artifacts, reducing compressibility. This relationship plateaus at very long exposures as CCD pixels approach saturation limits.
Neural compression models trained on the diverse Keck dataset (with varied wavelength filters and observing conditions) demonstrate superior generalization performance compared to models trained on more homogeneous datasets. The Keck-trained Integer Discrete Flow model achieved better compression on SDSS data than models trained directly on SDSS itself.
Though lossless compression is the gold standard, lossy and near-lossless compression can achieve significantly higher compression ratios. For astronomy, where most pixels are dominated by noise, lossy compression is particularly promising. Collaborations between astronomers and machine learning experts could lead to advanced lossy compression algorithms that selectively discard non-essential data while preserving scientifically valuable information.
One approach might mask astronomical sources and prioritize the accuracy of source pixels. Another approach could involve near-lossless compression, which ensures a strict user-defined upper bound on the error of every individual reconstructed pixel. Such an approach would be attractive to astronomers, who may desire a robust error measurement for uncertainty propagation.
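A minimal sketch of the near-lossless idea (an illustration, not a method from the paper): uniformly quantize each pixel with step 2δ+1, losslessly compress the quantized indices, and every reconstructed pixel is then guaranteed to be within δ counts of the original.

```python
# Sketch of near-lossless coding: a strict per-pixel error bound of `delta` counts.
# The quantized indices would then go through any lossless codec (e.g. JPEG-LS).
import numpy as np

def quantize(image: np.ndarray, delta: int) -> np.ndarray:
    """Uniform quantization with step 2*delta + 1; reconstruction error <= delta."""
    step = 2 * delta + 1
    return (image.astype(np.int64) + delta) // step

def dequantize(indices: np.ndarray, delta: int) -> np.ndarray:
    step = 2 * delta + 1
    return indices * step

# Toy check on random 16-bit data (stand-in for a real frame).
rng = np.random.default_rng(0)
image = rng.integers(0, 2**16, size=(256, 256), dtype=np.uint16)

delta = 2
recon = dequantize(quantize(image, delta), delta)
assert int(np.abs(recon - image.astype(np.int64)).max()) <= delta
```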
Other domains in astrophysics also need compression benchmarks, notably radio astronomy (e.g., the Square Kilometre Array) and neutrino detectors (e.g., IceCube). Beyond astronomy, domains such as biological imaging, genome sequencing, and satellite data all face fundamental bandwidth limitations that make compression a critical bottleneck.
The next critical step is deploying new compression algorithms to observatories, starting with JPEG-XL or JPEG-LS; today, RICE is the standard method in use. To this end, we are working on a PR to Astropy integrating JPEG-XL and JPEG-LS compression [Astropy PR]. We hope to follow this with a PR to the [CFITSIO] library, which hosts the fpack tool widely used to compress FITS files. Finally, we hope that in the near future a codec like JPEG-XL or JPEG-LS can be deployed to a ground-based or space-based telescope.
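For reference, the snippet below shows how FITS tile compression is invoked today through Astropy, which wraps the same CFITSIO machinery used by fpack; adding JPEG-XL or JPEG-LS would mean exposing new `compression_type` options along these lines (the proposed option name at the end is hypothetical).

```python
# Current practice: RICE tile compression of a 16-bit image via astropy.
import numpy as np
from astropy.io import fits

data = np.random.default_rng(0).integers(0, 2**16, size=(2048, 2048), dtype=np.uint16)

hdul = fits.HDUList([
    fits.PrimaryHDU(),
    fits.CompImageHDU(data=data, compression_type="RICE_1"),
])
hdul.writeto("rice_compressed.fits", overwrite=True)

# A future Astropy/CFITSIO integration could expose the newer codecs through the
# same interface, e.g. compression_type="JPEGLS" -- this option name is hypothetical.
```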
Astronomy prioritizes compression ratio first, followed by manageable encoding times; slow decoding on the ground is a non-issue. Traditional codecs instead optimize for fast decoding on consumer devices, so future work should explore trading decoding speed for higher compression. This asymmetry favors autoregressive methods, which can yield higher ratios at the cost of slow, sequential decoding; Transformer-based models are natural candidates.
Time-series imaging has the potential for better compression ratios by exploiting correlations across time. We achieved extremely high compression ratios on the JWST-2D-Res dataset because its frames were collected back-to-back in time, and we encourage further exploration of dataset construction and compression for back-to-back time-series imagery. Such compression will be critical for operating next-generation wide-field time-domain surveys from space. By contrast, spectral image compression, or temporal images taken far apart in time (such as our SDSS dataset), may remain difficult because noise weakens the correlation in the less significant bits.
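As a toy illustration of why back-to-back frames compress so well (an illustration with a generic byte-level coder, not the paper's method): differencing consecutive exposures before entropy coding leaves mostly noise, which costs far fewer bits than the raw frames.

```python
# Sketch: temporal delta encoding for a back-to-back image cube of shape (T, H, W).
# When consecutive exposures are highly correlated, the frame-to-frame residuals
# have far lower entropy than the raw frames; any lossless coder can then encode them.
import numpy as np
import zlib

def direct_bytes(cube: np.ndarray) -> int:
    """Compressed size when the cube is coded without using the time order."""
    return len(zlib.compress(cube.tobytes(), 9))

def temporal_delta_bytes(cube: np.ndarray) -> int:
    """Compressed size when storing frame 0 plus successive differences (invertible)."""
    cube32 = cube.astype(np.int32)
    stream = np.concatenate([cube32[:1], np.diff(cube32, axis=0)]).tobytes()
    return len(zlib.compress(stream, 9))

# Toy stand-in for back-to-back exposures: a fixed scene plus small per-frame noise.
rng = np.random.default_rng(0)
scene = rng.integers(0, 60000, size=(256, 256))
cube = np.stack(
    [scene + rng.integers(0, 4, size=scene.shape) for _ in range(6)]
).astype(np.uint16)

print("direct:", direct_bytes(cube), "bytes")
print("delta :", temporal_delta_bytes(cube), "bytes")
```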
@inproceedings{truong2025astrocompress,
title={AstroCompress: A Benchmark Dataset for Multi-Purpose Compression of Astronomical Imagery},
author={Truong, Tuan and Sudharsan, Rithwik and Yang, Yibo and Ma, Peter Xiangyuan and Yang, Ruihan and Mandt, Stephan and Bloom, Joshua S.},
booktitle={International Conference on Learning Representations},
year={2025},
url={https://huggingface.co/AstroCompress}
}