AstroCompress:

A Benchmark Dataset for Multi-Purpose Compression of Astronomical Imagery

ICLR 2025
*Equal contribution first authors. Email questions jointly to rithwik@berkeley.edu and tuannt2@uci.edu.
AstroCompress Dataset Overview

AstroCompress provides large-scale astronomical imaging datasets spanning space-based and ground-based telescopes, enabling scientifically informed ML for astronomy research. We focus on neural compression for improved data transmission and storage.

Abstract

Astronomical observatories are physically remote, leading to data transmission limits that severely bottleneck throughput. Often, observatories can collect much more data than they can downlink. Any improvements in lossless compression could unlock millions of dollars worth of additional science by enabling more observations and thus unleashing the potential of existing instruments.

Traditional lossless methods for compressing astrophysical data are manually designed. Neural data compression can learn end-to-end from data and outperform classical techniques by leveraging the unique spatial, temporal, and wavelength structures of astronomical images.

This paper introduces AstroCompress: a neural compression challenge for astrophysics data, featuring four new datasets with raw, 16-bit unsigned integer imaging data in various modes: space-based, ground-based, multi-wavelength, and time-series imaging. We benchmark seven lossless compression methods and show that neural methods can enhance data collection.

[Figure: Webb's orbit at L2]

The Challenge: An Explosion of Astronomical Data

Astronomy data volumes are growing superexponentially, by a factor of nearly 1000 every 10 years. This rapid growth places an immense burden on our ability to transmit and store data, threatening to stall scientific progress. The development of exascale supercomputers is a direct response to this data deluge.

Space Telescopes

Advanced compression is not just a convenience: it is a critical necessity for continuing to collect and analyze the vast data volumes from radio astronomy, satellite imagery, and other large-scale scientific endeavors. Unlike internet media, where better compression means faster load times for a user, telescope data compression leads to more data collected overall.

The Nancy Grace Roman Space Telescope, scheduled to launch in 2027, faces "unprecedented" data-scale challenges that could result in permanent data loss if transmission bottlenecks are not solved. "The rate of data generation significantly exceeds Roman's very large download capability (1.4 TB/day) even after on board compression; therefore downloading all individual readouts is currently not feasible." [STScI] For future space-based telescopes, where every bit of downlink bandwidth is precious, compression improvements can dramatically increase scientific output and discovery potential, transforming what would otherwise be lost data into preserved science.

Ground Telescopes

Compression is important for ground-based telescopes as well. The Square Kilometre Array alone will produce 62 exabytes per year starting in the late 2020s. The infrastructure costs of storing, transmitting, and processing this data represent a substantial portion of total project budgets, so compression improvements have compound economic benefits across the entire data pipeline.

Due to the high costs of these instruments, even a one to few percent improvement in compression ratio can deliver significant scientific value.

Essential Findings for Future Telescope Design

Non-neural Methods

We find that existing advanced non-neural codecs such as JPEG-XL and JPEG-LS achieve significantly better compression than the current astronomy standard, RICE. These algorithms are also highly efficient and compatible with most existing hardware: both can compress a full 4K Hubble image in under a second, while JPEG-XL at its maximum effort setting takes about 1.5 minutes in exchange for improved compression. We believe future space and ground optical telescopes should consider adopting one of these algorithms.

Neural Methods

Additionally, we explore neural methods and demonstrate their potential for the medium to long term when the necessary ML-acceleration hardware can be installed on space or ground-based telescopes. Our research demonstrates that neural compression methods can match or exceed classical state-of-the-art algorithms like JPEG-XL while learning end-to-end from the data itself.

Unlike traditional hand-crafted compression algorithms, neural compression learns directly from astronomical data, automatically discovering the unique patterns and structures present in space-based and ground-based imagery. By leveraging the spatial correlations of celestial objects, temporal dependencies in time-series observations, and spectral relationships across wavelengths, neural methods might achieve superior compression ratios.
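As a toy illustration of how exploiting spatial correlation helps, the sketch below shows that residuals from a simple left-neighbor predictor have lower empirical entropy than the raw pixels of a synthetic sky image. The gradient-plus-noise image model and the predictor are illustrative assumptions, not the methods benchmarked in the paper:

```python
import math
import random

def entropy_bits(values):
    """Empirical Shannon entropy (bits per symbol) of a sequence."""
    counts = {}
    for v in values:
        counts[v] = counts.get(v, 0) + 1
    n = len(values)
    return -sum(c / n * math.log2(c / n) for c in counts.values())

# Synthetic 16-bit "sky": a smooth background gradient plus small read noise.
random.seed(0)
W = H = 64
img = [[1000 + 4 * x + 4 * y + random.randint(-3, 3) for x in range(W)]
       for y in range(H)]

# Raw pixel stream vs. residuals from a left-neighbor predictor.
raw = [img[y][x] for y in range(H) for x in range(W)]
residuals = [img[y][x] - (img[y][x - 1] if x > 0 else 0)
             for y in range(H) for x in range(W)]

print(f"raw:      {entropy_bits(raw):.2f} bits/pixel")
print(f"residual: {entropy_bits(residuals):.2f} bits/pixel")
```

An entropy coder applied to the residual stream would approach the lower figure; neural models generalize this idea with learned, context-dependent predictions.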

AstroCompress Dataset Specifications

GBI-16-2D: W.M. Keck Observatory LRIS (Ground, Hawaii, USA)
  Wavelength: 4500-8400 Å; Filters: B, V, R, I; Resolution: 0.135 arcsec/pix
  Dimensions: 2248×2048 and 3768×2520; Arrays: 137; Size: 1.5 GB
  Key artifacts: atmospheric seeing, variable conditions

SBI-16-2D: Hubble Space Telescope ACS (Space, Low Earth Orbit)
  Wavelength: 5926 Å; Filter: F606W (V-band); Resolution: 0.05 arcsec/pix
  Dimensions: 4144×2068; Arrays: 2140; Size: 69 GB
  Key artifacts: cosmic ray hits, charge transfer inefficiency

SBI-16-3D: James Webb Space Telescope NIRCam (Space, L2 Lagrange Point)
  Wavelength: 19840 Å; Filter: F200W (H-band IR); Resolution: 0.031 arcsec/pix
  Dimensions: 2048×2048×T (T = 5-10); Arrays: 1273; Size: 90 GB
  Key artifacts: up-the-ramp sampling, saturation effects

GBI-16-4D: Sloan Digital Sky Survey (Ground, New Mexico, USA)
  Wavelengths: 3543, 4770, 6231, 7625, 9134 Å; Filters: u, g, r, i, z; Resolution: 0.396 arcsec/pix
  Dimensions: 800×800×5×T (T ≤ 90); Arrays: 500; Size: 158 GB
  Key artifacts: atmospheric variations, multi-epoch gaps

GBI-16-2D-Legacy: Las Cumbres Observatory (Ground, Multiple Sites)
  Wavelength: variable; Filters: B, V, r', i'; Resolution: 0.58 arcsec/pix
  Dimensions: 3136×2112; Arrays: 9; Size: 0.1 GB
  Purpose: benchmark comparison dataset

Total Dataset: ~320 GB across 6,201 arrays | Bit Depth: 16-bit unsigned integer | Access: HuggingFace Datasets

The AstroCompress benchmark provides the astronomical community with 320 GB of carefully curated datasets spanning the full diversity of modern astronomical observations, from Hubble's cosmic ray-affected images to JWST's infrared time-series and SDSS's multi-wavelength surveys.

Comprehensive Compression Results

Compression Ratios Achieved: Higher values indicate better compression.

Single-frame datasets

Dataset           IDF   L3C   PixelCNN++  VDM*  JPEG-XL (max)  JPEG-XL  JPEG-LS  JPEG-2000  RICE
LCO (Ground)      2.83  1.67  2.02        3.64  2.98           2.78     2.81     2.80       2.65
Keck (Ground)     2.04  1.89  2.08        2.11  2.01           1.97     1.97     1.96       1.84
Hubble (Space)    2.94  2.90  3.13        3.33  3.26           2.92     2.86     2.67       2.64
JWST-2D (Space)   1.44  1.38  1.44        1.44  1.38           1.33     1.35     1.37       1.24
SDSS-2D (Ground)  2.91  2.36  3.35        3.27  3.38           3.14     3.16     3.20       2.96

Multi-frame / multi-spectral datasets (VDM* not reported)

Dataset           IDF   L3C   PixelCNN++  JPEG-XL (max)  JPEG-XL  JPEG-LS  JPEG-2000  RICE
JWST-2D-Res       3.14  2.91  2.80        3.35           2.37     3.24     1.69       3.08
SDSS-3Dλ          3.05  2.29  2.88        3.49           3.23     3.24     3.28       3.05
SDSS-3DT          3.03  2.59  3.02        3.48           3.23     3.24     3.29       3.05

🏆 Key Findings

  • JPEG-XL (max) establishes new state-of-the-art among non-neural methods
  • Neural methods achieve competitive performance with minimal architectural modifications
  • VDM estimates show significant room for future neural compression improvements

📊 Performance Notes

  • *VDM: Theoretical upper bound estimates (impractical runtime)
  • Multi-frame: Higher ratios exploit temporal/spectral correlations

Scientific Analysis: What Makes Astronomical Images Compressible?

Background Noise Dominates Compression Requirements

Our analysis confirms a fundamental insight about astronomical image compression, first explained by Pence et al. (2009): 98.5% of pixels in typical astronomical images fall below a signal-to-noise ratio of 3, meaning they are dominated by background noise rather than actual astronomical sources. This finding has important implications for compression algorithm design.

We demonstrate a strong correlation between background pixel noise levels and compression bitrate across all datasets: images with a higher background noise standard deviation require more bits per pixel, consistent with Shannon's source coding theorem, under which each doubling of the noise standard deviation costs roughly one extra bit per pixel.
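For integer-quantized Gaussian noise with standard deviation sigma (in detector counts), the Shannon lower bound on lossless bitrate is roughly 0.5·log2(2πe·sigma²) bits per pixel when sigma is well above one count. A small illustrative helper (our own sketch, not code from the paper):

```python
import math

def gaussian_bits_per_pixel(sigma):
    """Approximate entropy of integer-quantized Gaussian noise,
    valid when sigma >> 1 count: H ~ 0.5 * log2(2*pi*e*sigma^2)."""
    return 0.5 * math.log2(2 * math.pi * math.e * sigma ** 2)

# Each doubling of the background noise costs one extra bit per pixel.
for sigma in (2, 8, 32):
    print(f"sigma = {sigma:2d} -> ~{gaussian_bits_per_pixel(sigma):.2f} bits/pixel")
```

No lossless codec, classical or neural, can beat this bound on the noise-dominated pixels; gains must come from the structured (source and correlated-noise) components.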

Exposure Time vs. Compression Performance

Our Hubble dataset analysis shows an inverse relationship between telescope exposure time and achievable compression ratios. Longer exposures accumulate more photon noise and detector artifacts, reducing compressibility. This relationship plateaus at very long exposures as CCD pixels approach saturation limits.

Cross-Dataset Generalization Insights

Neural compression models trained on the diverse Keck dataset (with varied wavelength filters and observing conditions) demonstrate superior generalization performance compared to models trained on more homogeneous datasets. The Keck-trained Integer Discrete Flow model achieved better compression on SDSS data than models trained directly on SDSS itself.

Promising Future Directions

〰️ Lossy Compression for Astronomy

Though lossless compression is the gold standard, lossy and near-lossless compression can achieve significantly higher compression ratios. For astronomy, where most pixels are dominated by noise, lossy compression is particularly promising. Collaborations between astronomers and machine learning experts could lead to advanced lossy compression algorithms that selectively discard non-essential data while preserving scientifically valuable information.

One approach might mask astronomical sources and prioritize the accuracy of source pixels. Another approach could involve near-lossless compression, which ensures a strict user-defined upper bound on the error of every individual reconstructed pixel. Such an approach would be attractive to astronomers, who may desire a robust error measurement for uncertainty propagation.
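A near-lossless guarantee of this kind can be sketched with a uniform quantizer: bins of width 2·eps + 1 ensure every reconstructed pixel is within eps counts of the original. The functions below are an illustrative sketch, not a deployed codec:

```python
def quantize(x, eps):
    """Map a pixel value to a bin index; bins have width 2*eps + 1."""
    return (x + eps) // (2 * eps + 1)

def dequantize(q, eps):
    """Reconstruct the bin centre, guaranteeing |x - x_hat| <= eps."""
    return q * (2 * eps + 1)

eps = 2
for x in range(1000):
    assert abs(x - dequantize(quantize(x, eps), eps)) <= eps
print(f"alphabet shrinks by a factor of {2 * eps + 1} with max error {eps}")
```

The shrunken alphabet is then handed to any lossless coder; the bound eps could be set per instrument from the read-noise floor, giving astronomers a hard error guarantee for uncertainty propagation.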

📡 Other Data Types

Other domains in astrophysics also need compression benchmarks, such as radio astronomy (e.g., the Square Kilometre Array) and neutrino detectors (e.g., IceCube). Even beyond astronomy, domains like biological imaging, genome sequencing, and satellite data face the same fundamental bandwidth limitations that make compression a critical bottleneck.

🚀 Practical Deployment

The next critical step is actually deploying new compression algorithms to observatories, starting with JPEG-XL or JPEG-LS. Today, RICE compression is the standard method in use. To this end, we are working on a PR to Astropy, integrating JPEG-XL and JPEG-LS compression methods [Astropy PR]. We hope to follow this up with a PR to the [CFITSIO] library, which hosts the fpack tool widely used to compress FITS files. Finally, we hope that in the near future, we can deploy a codec like JPEG-XL or JPEG-LS to a ground-based or space-based telescope.

⚡ Encoding-Decoding Time Tradeoffs

Astronomy prioritizes compression performance, followed by manageable encoding times. Slow decoding on the ground is a non-issue. Traditional compression methods instead optimize for fast decoding on consumer devices, so future work should explore trading decoding speed for higher compression ratios. This asymmetry favors autoregressive methods that can yield higher compression ratios but slow sequential decoding, with Transformer models being potential candidates.
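This asymmetry can be demonstrated even with a classical codec: raising encoder effort buys a better ratio while decoding stays cheap at every setting. zlib and the synthetic payload below stand in purely for illustration:

```python
import time
import zlib

# Synthetic payload with telescope-like redundancy (repeating structure).
data = b"".join(b"pix%06d " % (i % 1000) for i in range(100_000))

for level in (1, 6, 9):
    t0 = time.perf_counter()
    comp = zlib.compress(data, level)
    enc_ms = (time.perf_counter() - t0) * 1e3
    t0 = time.perf_counter()
    assert zlib.decompress(comp) == data  # decode cost is low at every level
    dec_ms = (time.perf_counter() - t0) * 1e3
    print(f"level {level}: ratio {len(data) / len(comp):.1f}, "
          f"encode {enc_ms:.1f} ms, decode {dec_ms:.1f} ms")
```

Autoregressive neural codecs push this tradeoff much further: encoding (and especially decoding) becomes sequential and slow, which is acceptable when decoding happens once, on the ground.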

📊 Temporal Image Compression

Time-series imaging has the potential for better compression ratios by exploiting correlations across time. We achieved extremely high compression ratios on the JWST-2D-Res dataset, as its frames were collected back-to-back in time. We encourage more exploration of dataset construction and compression for back-to-back time-series imagery; such compression will be critical for the operation of next-generation wide-field time-domain surveys from space. By contrast, spectral image compression, or compression of temporal images taken far apart in time (such as our SDSS dataset), may be difficult because noise weakens correlations in the less significant bits.
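A minimal sketch of why back-to-back frames compress so well: encode the frame-to-frame residual instead of the raw frame. The scene and noise model below are illustrative assumptions, not the JWST pipeline:

```python
import random
import zlib

random.seed(1)
N = 128 * 128
# Two back-to-back exposures of the same scene with independent read noise.
scene = [random.randrange(200, 60000) for _ in range(N)]
frame1 = [s + random.randint(-2, 2) for s in scene]
frame2 = [s + random.randint(-2, 2) for s in scene]

def to_bytes(frame, signed=False):
    """Serialize a frame as big-endian 16-bit values."""
    return b"".join(v.to_bytes(2, "big", signed=signed) for v in frame)

direct = zlib.compress(to_bytes(frame2), 9)
diff = [b - a for a, b in zip(frame1, frame2)]  # residuals in [-4, 4]
residual = zlib.compress(to_bytes(diff, signed=True), 9)

print(f"direct frame: {len(direct)} bytes, temporal residual: {len(residual)} bytes")
```

Because consecutive exposures share the scene, the residual alphabet collapses to a few values; that advantage fades when frames are separated by hours or nights and noise dominates the frame-to-frame differences.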

BibTeX

@inproceedings{truong2025astrocompress,
  title={AstroCompress: A Benchmark Dataset for Multi-Purpose Compression of Astronomical Imagery},
  author={Truong, Tuan and Sudharsan, Rithwik and Yang, Yibo and Ma, Peter Xiangyuan and Yang, Ruihan and Mandt, Stephan and Bloom, Joshua S.},
  booktitle={International Conference on Learning Representations},
  year={2025},
  url={https://huggingface.co/AstroCompress}
}