Big Data in Astroinformatics

Modern astronomy has undergone a paradigm shift: from hypothesis-driven science focused on investigating a single class of objects to data-driven research based on exploratory analysis of petabyte-scale surveys of the Universe. High-performance digital detectors at today's observatories generate petabytes of raw data per night. The data are pan-spectral, ranging from radio through visible light to X-ray and gamma-ray frequencies. New domains are emerging, such as particle astrophysics (neutrinos) and gravitational-wave astronomy.

People

prof. Ing. Pavel Tvrdik, CSc.: tvrdik@fit.cvut.cz; Project leader …
RNDr. Petr Skoda, Ph.D.: skoda@sunstel.asu.cas.cz; Detail info about Petr Skoda …
Ing. Jiri Nadvornik: nadvornik.ji@gmail.com; Detail info about Jiri Nadvornik …
Ing. Ondrej Podsztavek: podszond@fit.cvut.cz; Detail info about Ondrej Podsztavek …

Motivation

Most astronomical data are publicly available through the Astronomical Virtual Observatory, a sophisticated network of federated, interoperable data archives built on common standards for data storage, querying, and transfer.

The requirements to pre-process, store, and analyze these big data push current information technology to its limits:

High-throughput pre-processing algorithms, running on massively parallel GPU platforms and distributed with frameworks such as Dask or Spark, are needed to reduce the stored data to a sustainable size.
Advanced visualisation tools have become part of many astronomical projects and result in an increasing amount of multimedia content in the online volumes of major refereed astronomical journals.
The heterogeneity, multidimensionality, and sparsity of increasingly complex astronomical datasets call for special storage formats (e.g., Parquet, HDF5, ASDF) that allow rapid searching, filtering, and data mining.
Analysis of big sky surveys is now often done in distributed cloud environments called science platforms (e.g., SciServer, Astro Data Lab), where interactive data mining and visualisation experiments run through a dedicated web GUI or in JupyterHub instances launched directly in the data centers that store the big data archives.

Goals

We are developing a special hierarchical semi-sparse cube architecture to store such data. Our aim is to facilitate processing of these data using cutting-edge technologies, such as Map-Reduce frameworks, GPU-accelerated farms, or HPC supercomputers.
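The idea of a hierarchical, semi-sparse layout can be illustrated with a minimal sketch. It is not the architecture under development in this project, only one possible reading of the term: a sparse top level (a dictionary keyed by coarse cell indices) holding small dense blocks that are allocated on first write. The class name `SemiSparseCube`, the block size of 4, and the `total` aggregation are all hypothetical and chosen for illustration.

```python
from functools import reduce

BLOCK = 4  # hypothetical edge length of one dense block


class SemiSparseCube:
    """Sparse dictionary of dense BLOCK**3 blocks (illustrative only)."""

    def __init__(self, block=BLOCK):
        self.block = block
        # coarse index (bx, by, bz) -> flat list of block**3 values
        self.blocks = {}

    def _locate(self, x, y, z):
        b = self.block
        coarse = (x // b, y // b, z // b)
        offset = ((x % b) * b + (y % b)) * b + (z % b)
        return coarse, offset

    def set(self, x, y, z, value):
        coarse, offset = self._locate(x, y, z)
        # A dense block is allocated only when a cell inside it is written.
        block = self.blocks.setdefault(coarse, [0.0] * self.block ** 3)
        block[offset] = value

    def get(self, x, y, z):
        coarse, offset = self._locate(x, y, z)
        block = self.blocks.get(coarse)
        return 0.0 if block is None else block[offset]

    def total(self):
        # Map step: sum each block independently.
        # Reduce step: combine the partial sums.
        partial_sums = map(sum, self.blocks.values())
        return reduce(lambda a, b: a + b, partial_sums, 0.0)


cube = SemiSparseCube()
cube.set(1, 2, 3, 5.0)
cube.set(1000, 2000, 3000, 2.5)
print(len(cube.blocks), cube.get(1, 2, 3), cube.total())
```

Only two blocks are allocated here although the coordinates span a very large volume. Because each block is summed independently, the same per-block step could in principle be handed to a Map-Reduce framework or a GPU kernel.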

Our research group cooperates closely with teams representing large-scale astroinformatics projects, such as IVOA, LSST, or the Heidelberg Institute for Theoretical Studies. We are seeking an enthusiastic researcher with a computer science background and an interest in astronomy to work on the design of the hierarchical cube architecture and of processing pipelines scalable to petabyte volumes.

Open Position

A candidate should have a solid background in algorithms and data structures, database and networking technologies, and parallel and distributed algorithms, as well as programming skills in Python and C and experience with big data frameworks (Map-Reduce, Spark, Dask…). Basic knowledge of cloud environments using Docker, JupyterHub, JupyterLab, or Google Colab, and an interest in astronomy, are an asset.

Pavel Tvrdik
