AstroCompress:

A Benchmark Dataset for Multi-Purpose Compression of Astronomical Imagery

ICLR 2025
*Equal contribution first authors. Email questions jointly to rithwik@berkeley.edu and tuannt2@uci.edu.
AstroCompress Dataset Overview

AstroCompress provides large-scale astronomical imaging datasets spanning space-based and ground-based telescopes, enabling scientifically informed ML for astronomy research. We focus on neural compression for improved data transmission and storage.

Abstract

Astronomical observatories are physically remote, leading to data transmission limits that severely bottleneck throughput. Often, observatories can collect much more data than they can downlink. Any improvements in lossless compression could unlock millions of dollars worth of additional science by enabling more observations and thus unleashing the potential of existing instruments.

Traditional lossless methods for compressing astrophysical data are manually designed. Neural data compression can learn end-to-end from data and outperform classical techniques by leveraging the unique spatial, temporal, and wavelength structures of astronomical images.

This paper introduces AstroCompress: a neural compression challenge for astrophysics data, featuring four new datasets with raw, 16-bit unsigned integer imaging data in various modes: space-based, ground-based, multi-wavelength, and time-series imaging. We benchmark seven lossless compression methods and show that neural methods can enhance data collection.

[Figure: Webb's orbit at L2]

The Challenge: An Explosion of Astronomical Data

Astronomy data volumes are growing superexponentially, by a factor of nearly 1000 every 10 years. This rapid growth places an immense burden on our ability to transmit and store data, threatening to stall scientific progress. The development of exascale supercomputers is a direct response to this data deluge.

Space Telescopes

Advanced compression is not just a convenience: it is a critical necessity for continuing to collect and analyze the vast data volumes from radio astronomy, satellite imagery, and other large-scale scientific endeavors. Unlike internet media, where better compression means faster load times for a user, telescope data compression leads to more data collected overall.

The Nancy Grace Roman Space Telescope, scheduled to launch in 2027, faces "unprecedented" data-scale challenges that could result in permanent data loss if transmission bottlenecks are not solved. "The rate of data generation significantly exceeds Roman's very large download capability (1.4 TB/day) even after on board compression; therefore downloading all individual readouts is currently not feasible." [STScI] For future space-based telescopes, where every bit of downlink bandwidth is precious, compression improvements can dramatically increase scientific output and discovery potential, transforming what would otherwise be lost data into preserved science.

Ground Telescopes

Compression is important for ground-based telescopes as well. The Square Kilometre Array alone will produce 62 exabytes per year starting in the late 2020s. The infrastructure costs of storing, transmitting, and processing this data represent a substantial portion of total project budgets, so compression improvements have compound economic benefits across the entire data pipeline.

Due to the high costs of these instruments, even a one to few percent improvement in compression ratio can deliver significant scientific value.

Essential Findings for Future Telescope Design

Non-neural Methods

We find that existing advanced non-neural codecs such as JPEG-XL and JPEG-LS achieve significantly better compression than the current astronomy standard, RICE. These algorithms are also highly efficient and compatible with most existing hardware: both can compress a full 4K Hubble image in under a second, while JPEG-XL at its maximum effort setting takes about 1.5 minutes in exchange for improved compression. We believe future space and ground optical telescopes should consider adopting one of these algorithms.

Neural Methods

Additionally, we explore neural methods and demonstrate their potential for the medium to long term when the necessary ML-acceleration hardware can be installed on space or ground-based telescopes. Our research demonstrates that neural compression methods can match or exceed classical state-of-the-art algorithms like JPEG-XL while learning end-to-end from the data itself.

Unlike traditional hand-crafted compression algorithms, neural compression learns directly from astronomical data, automatically discovering the unique patterns and structures present in space-based and ground-based imagery. By leveraging the spatial correlations of celestial objects, temporal dependencies in time-series observations, and spectral relationships across wavelengths, neural methods might achieve superior compression ratios.
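As a toy illustration of how exploiting spatial correlation helps, the sketch below shows that residuals from a simple left-neighbor predictor have lower empirical entropy than the raw pixels of a synthetic sky image. The gradient-plus-noise image model and the predictor are illustrative assumptions, not the methods benchmarked in the paper:

```python
import math
import random

def entropy_bits(values):
    """Empirical Shannon entropy (bits per symbol) of a sequence."""
    counts = {}
    for v in values:
        counts[v] = counts.get(v, 0) + 1
    n = len(values)
    return -sum(c / n * math.log2(c / n) for c in counts.values())

# Synthetic 16-bit "sky": a smooth background gradient plus small read noise.
random.seed(0)
W = H = 64
img = [[1000 + 4 * x + 4 * y + random.randint(-3, 3) for x in range(W)]
       for y in range(H)]

# Raw pixel stream vs. residuals from a left-neighbor predictor.
raw = [img[y][x] for y in range(H) for x in range(W)]
residuals = [img[y][x] - (img[y][x - 1] if x > 0 else 0)
             for y in range(H) for x in range(W)]

print(f"raw:      {entropy_bits(raw):.2f} bits/pixel")
print(f"residual: {entropy_bits(residuals):.2f} bits/pixel")
```

An entropy coder applied to the residual stream would approach the lower figure; neural models generalize this idea with learned, context-dependent predictions.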

AstroCompress Dataset Specifications

GBI-16-2D: W.M. Keck Observatory LRIS (Ground, Hawaii, USA)
  Wavelength: 4500-8400 Å; Filters: B, V, R, I; Resolution: 0.135 arcsec/pix
  Dimensions: 2248×2048 and 3768×2520; Arrays: 137; Size: 1.5 GB
  Key artifacts: atmospheric seeing, variable conditions

SBI-16-2D: Hubble Space Telescope ACS (Space, Low Earth Orbit)
  Wavelength: 5926 Å; Filter: F606W (V-band); Resolution: 0.05 arcsec/pix
  Dimensions: 4144×2068; Arrays: 2140; Size: 69 GB
  Key artifacts: cosmic ray hits, charge transfer inefficiency

SBI-16-3D: James Webb Space Telescope NIRCam (Space, L2 Lagrange Point)
  Wavelength: 19840 Å; Filter: F200W (H-band IR); Resolution: 0.031 arcsec/pix
  Dimensions: 2048×2048×T (T = 5-10); Arrays: 1273; Size: 90 GB
  Key artifacts: up-the-ramp sampling, saturation effects

GBI-16-4D: Sloan Digital Sky Survey (Ground, New Mexico, USA)
  Wavelengths: 3543, 4770, 6231, 7625, 9134 Å; Filters: u, g, r, i, z; Resolution: 0.396 arcsec/pix
  Dimensions: 800×800×5×T (T ≤ 90); Arrays: 500; Size: 158 GB
  Key artifacts: atmospheric variations, multi-epoch gaps

GBI-16-2D-Legacy: Las Cumbres Observatory (Ground, Multiple Sites)
  Wavelength: variable; Filters: B, V, r', i'; Resolution: 0.58 arcsec/pix
  Dimensions: 3136×2112; Arrays: 9; Size: 0.1 GB
  Purpose: benchmark comparison dataset

Total Dataset: ~320 GB across 6,201 arrays | Bit Depth: 16-bit unsigned integer | Access: HuggingFace Datasets

The AstroCompress benchmark provides the astronomical community with 320 GB of carefully curated datasets spanning the full diversity of modern astronomical observations, from Hubble's cosmic ray-affected images to JWST's infrared time-series and SDSS's multi-wavelength surveys.

Comprehensive Compression Results

Compression Ratios Achieved: Higher values indicate better compression.

Single-frame datasets

Dataset           IDF   L3C   PixelCNN++  VDM*  JPEG-XL (max)  JPEG-XL  JPEG-LS  JPEG-2000  RICE
LCO (Ground)      2.83  1.67  2.02        3.64  2.98           2.78     2.81     2.80       2.65
Keck (Ground)     2.04  1.89  2.08        2.11  2.01           1.97     1.97     1.96       1.84
Hubble (Space)    2.94  2.90  3.13        3.33  3.26           2.92     2.86     2.67       2.64
JWST-2D (Space)   1.44  1.38  1.44        1.44  1.38           1.33     1.35     1.37       1.24
SDSS-2D (Ground)  2.91  2.36  3.35        3.27  3.38           3.14     3.16     3.20       2.96

Multi-frame / multi-spectral datasets (VDM* not reported)

Dataset           IDF   L3C   PixelCNN++  JPEG-XL (max)  JPEG-XL  JPEG-LS  JPEG-2000  RICE
JWST-2D-Res       3.14  2.91  2.80        3.35           2.37     3.24     1.69       3.08
SDSS-3Dλ          3.05  2.29  2.88        3.49           3.23     3.24     3.28       3.05
SDSS-3DT          3.03  2.59  3.02        3.48           3.23     3.24     3.29       3.05

🏆 Key Findings

  • JPEG-XL (max) establishes new state-of-the-art among non-neural methods
  • Neural methods achieve competitive performance with minimal architectural modifications
  • VDM estimates show significant room for future neural compression improvements

📊 Performance Notes

  • *VDM: Theoretical upper bound estimates (impractical runtime)
  • Multi-frame: Higher ratios exploit temporal/spectral correlations

Scientific Analysis: What Makes Astronomical Images Compressible?

Background Noise Dominates Compression Requirements

Our analysis confirms a fundamental insight about astronomical image compression, first explained by Pence et al. (2009): 98.5% of pixels in typical astronomical images fall below a signal-to-noise ratio of 3, meaning they are dominated by background noise rather than actual astronomical sources. This finding has important implications for compression algorithm design.

We demonstrate a strong correlation between background pixel noise levels and compression bitrate across all datasets: images with a higher background noise standard deviation require more bits per pixel, consistent with Shannon's source coding theorem, under which each doubling of the noise standard deviation costs roughly one extra bit per pixel.
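For integer-quantized Gaussian noise with standard deviation sigma (in detector counts), the Shannon lower bound on lossless bitrate is roughly 0.5·log2(2πe·sigma²) bits per pixel when sigma is well above one count. A small illustrative helper (our own sketch, not code from the paper):

```python
import math

def gaussian_bits_per_pixel(sigma):
    """Approximate entropy of integer-quantized Gaussian noise,
    valid when sigma >> 1 count: H ~ 0.5 * log2(2*pi*e*sigma^2)."""
    return 0.5 * math.log2(2 * math.pi * math.e * sigma ** 2)

# Each doubling of the background noise costs one extra bit per pixel.
for sigma in (2, 8, 32):
    print(f"sigma = {sigma:2d} -> ~{gaussian_bits_per_pixel(sigma):.2f} bits/pixel")
```

No lossless codec, classical or neural, can beat this bound on the noise-dominated pixels; gains must come from the structured (source and correlated-noise) components.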

Exposure Time vs. Compression Performance

Our Hubble dataset analysis shows an inverse relationship between telescope exposure time and achievable compression ratios. Longer exposures accumulate more photon noise and detector artifacts, reducing compressibility. This relationship plateaus at very long exposures as CCD pixels approach saturation limits.

Cross-Dataset Generalization Insights

Neural compression models trained on the diverse Keck dataset (with varied wavelength filters and observing conditions) demonstrate superior generalization performance compared to models trained on more homogeneous datasets. The Keck-trained Integer Discrete Flow model achieved better compression on SDSS data than models trained directly on SDSS itself.

Promising Future Directions

〰️ Lossy Compression for Astronomy

Though lossless compression is the gold standard, lossy and near-lossless compression can achieve significantly higher compression ratios. For astronomy, where most pixels are dominated by noise, lossy compression is particularly promising. Collaborations between astronomers and machine learning experts could lead to advanced lossy compression algorithms that selectively discard non-essential data while preserving scientifically valuable information.

One approach might mask astronomical sources and prioritize the accuracy of source pixels. Another approach could involve near-lossless compression, which ensures a strict user-defined upper bound on the error of every individual reconstructed pixel. Such an approach would be attractive to astronomers, who may desire a robust error measurement for uncertainty propagation.
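A near-lossless guarantee of this kind can be sketched with a uniform quantizer: bins of width 2·eps + 1 ensure every reconstructed pixel is within eps counts of the original. The functions below are an illustrative sketch, not a deployed codec:

```python
def quantize(x, eps):
    """Map a pixel value to a bin index; bins have width 2*eps + 1."""
    return (x + eps) // (2 * eps + 1)

def dequantize(q, eps):
    """Reconstruct the bin centre, guaranteeing |x - x_hat| <= eps."""
    return q * (2 * eps + 1)

eps = 2
for x in range(1000):
    assert abs(x - dequantize(quantize(x, eps), eps)) <= eps
print(f"alphabet shrinks by a factor of {2 * eps + 1} with max error {eps}")
```

The shrunken alphabet is then handed to any lossless coder; the bound eps could be set per instrument from the read-noise floor, giving astronomers a hard error guarantee for uncertainty propagation.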

📡 Other Data Types

Other domains in astrophysics also need compression benchmarks, such as radio astronomy (e.g., the Square Kilometre Array) and neutrino detectors (e.g., IceCube). Even beyond astronomy, domains like biological imaging, genome sequencing, and satellite data face the same fundamental bandwidth limitations that make compression a critical bottleneck.

🚀 Practical Deployment

The next critical step is actually deploying new compression algorithms to observatories, starting with JPEG-XL or JPEG-LS. Today, RICE compression is the standard method in use. To this end, we are working on a PR to Astropy, integrating JPEG-XL and JPEG-LS compression methods [Astropy PR]. We hope to follow this up with a PR to the [CFITSIO] library, which hosts the fpack tool widely used to compress FITS files. Finally, we hope that in the near future, we can deploy a codec like JPEG-XL or JPEG-LS to a ground-based or space-based telescope.

⚡ Encoding-Decoding Time Tradeoffs

Astronomy prioritizes compression performance, followed by manageable encoding times. Slow decoding on the ground is a non-issue. Traditional compression methods instead optimize for fast decoding on consumer devices, so future work should explore trading decoding speed for higher compression ratios. This asymmetry favors autoregressive methods that can yield higher compression ratios but slow sequential decoding, with Transformer models being potential candidates.
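This asymmetry can be demonstrated even with a classical codec: raising encoder effort buys a better ratio while decoding stays cheap at every setting. zlib and the synthetic payload below stand in purely for illustration:

```python
import time
import zlib

# Synthetic payload with telescope-like redundancy (repeating structure).
data = b"".join(b"pix%06d " % (i % 1000) for i in range(100_000))

for level in (1, 6, 9):
    t0 = time.perf_counter()
    comp = zlib.compress(data, level)
    enc_ms = (time.perf_counter() - t0) * 1e3
    t0 = time.perf_counter()
    assert zlib.decompress(comp) == data  # decode cost is low at every level
    dec_ms = (time.perf_counter() - t0) * 1e3
    print(f"level {level}: ratio {len(data) / len(comp):.1f}, "
          f"encode {enc_ms:.1f} ms, decode {dec_ms:.1f} ms")
```

Autoregressive neural codecs push this tradeoff much further: encoding (and especially decoding) becomes sequential and slow, which is acceptable when decoding happens once, on the ground.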

📊 Temporal Image Compression

Time-series imaging has the potential for better compression ratios by exploiting correlations across time. We achieved extremely high compression ratios on the JWST-2D-Res dataset, as its frames were collected back-to-back in time. We encourage more exploration of dataset construction and compression for back-to-back time-series imagery; such compression will be critical for the operation of next-generation wide-field time-domain surveys from space. By contrast, spectral image compression, or compression of temporal images taken far apart in time (such as our SDSS dataset), may be difficult because noise weakens correlations in the less significant bits.
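A minimal sketch of why back-to-back frames compress so well: encode the frame-to-frame residual instead of the raw frame. The scene and noise model below are illustrative assumptions, not the JWST pipeline:

```python
import random
import zlib

random.seed(1)
N = 128 * 128
# Two back-to-back exposures of the same scene with independent read noise.
scene = [random.randrange(200, 60000) for _ in range(N)]
frame1 = [s + random.randint(-2, 2) for s in scene]
frame2 = [s + random.randint(-2, 2) for s in scene]

def to_bytes(frame, signed=False):
    """Serialize a frame as big-endian 16-bit values."""
    return b"".join(v.to_bytes(2, "big", signed=signed) for v in frame)

direct = zlib.compress(to_bytes(frame2), 9)
diff = [b - a for a, b in zip(frame1, frame2)]  # residuals in [-4, 4]
residual = zlib.compress(to_bytes(diff, signed=True), 9)

print(f"direct frame: {len(direct)} bytes, temporal residual: {len(residual)} bytes")
```

Because consecutive exposures share the scene, the residual alphabet collapses to a few values; that advantage fades when frames are separated by hours or nights and noise dominates the frame-to-frame differences.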

BibTeX

@inproceedings{truong2025astrocompress,
  title={AstroCompress: A Benchmark Dataset for Multi-Purpose Compression of Astronomical Imagery},
  author={Truong, Tuan and Sudharsan, Rithwik and Yang, Yibo and Ma, Peter Xiangyuan and Yang, Ruihan and Mandt, Stephan and Bloom, Joshua S.},
  booktitle={International Conference on Learning Representations},
  year={2025},
  url={https://huggingface.co/AstroCompress}
}