Accepted paper at CCGRID’21: “Living on the Edge: Efficient Handling of Large Scale Sensor Data” by Roman Karlstetter et al.

The paper investigates various aspects of efficiently handling (e.g., processing, encoding, and compressing) large-scale sensor data.

Real-time sensor monitoring is critical in many industrial applications: it is used, for example, to model and predict operating conditions in order to optimize operations and to prevent damage to machinery and systems.

In many cases, this data is generated by a myriad of sensors and stored or transmitted for post-processing by data analysts. Handling this data near its origin, on the edge, poses significant challenges for storage and compression: the data must be stored in a format suitable for large-scale data analytics, which in most cases means columnar storage, and it must be compressed efficiently to allow economical storage and transmission. Existing solutions do not address these challenges sufficiently.
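To illustrate why columnar storage matters for analytics, here is a minimal sketch (the sensor channels and values are hypothetical, not from the paper): readings arrive row by row, one record per timestamp, but analytics queries typically scan one channel at a time, so a contiguous per-channel layout avoids reading unrelated data.

```python
# Readings arrive row-wise: one record per timestamp across all channels.
# Channel names and values are illustrative only.
rows = [
    (0.0, 20.1, 1013.2),   # (time_s, temp_c, pressure_hpa)
    (0.1, 20.2, 1013.1),
    (0.2, 20.4, 1013.0),
]

# Columnar layout: one contiguous sequence per channel, so a query that
# touches only temp_c never has to read timestamps or pressures.
time_s, temp_c, pressure_hpa = (list(col) for col in zip(*rows))

assert temp_c == [20.1, 20.2, 20.4]
```

The transposition itself is cheap; the challenge the paper addresses is doing it continuously, at ingestion rate, on streaming data.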

In this work, we present a holistic approach for fast streaming of large-scale sensor data directly into columnar storage, integrated with a proven compression scheme. Our approach combines a pipelined scheme for streaming and transposing the data layout with a byte-level transformation of the data representation prior to compression, which we evaluate in comprehensive experiments. As a result, our approach enables the transformation of large-scale sensor data streams into an efficient, analytics-friendly format already at the sensor site, i.e., on the edge, at data ingestion time. By implementing our optimized approach in Apache Parquet, an open and widely used columnar storage format, and by upstreaming parts of this implementation, we ensure its accessibility to the community.
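The paper's exact byte-level transformation is not reproduced here, but a well-known instance of the idea is Parquet's BYTE_STREAM_SPLIT encoding: regrouping the i-th byte of every floating-point value into its own stream places the slowly varying exponent and high-mantissa bytes of neighboring sensor readings next to each other, which general-purpose compressors exploit far better than the interleaved original. A minimal pure-Python sketch of that encoding (not the authors' implementation):

```python
import struct

WIDTH = 4  # bytes per float32 value

def byte_stream_split(values):
    """Regroup the i-th byte of every float32 into its own stream."""
    raw = struct.pack(f"<{len(values)}f", *values)  # little-endian float32
    return b"".join(raw[i::WIDTH] for i in range(WIDTH))

def byte_stream_merge(blob, count):
    """Inverse transform: re-interleave the byte streams."""
    streams = [blob[i * count:(i + 1) * count] for i in range(WIDTH)]
    return list(struct.unpack(f"<{count}f",
                              bytes(b for group in zip(*streams) for b in group)))

# Values chosen to be exactly representable in float32, so the
# round trip reproduces them bit-for-bit.
readings = [20.125, 20.25, 20.375, 20.5]
assert byte_stream_merge(byte_stream_split(readings), len(readings)) == readings
```

The transform is lossless and size-preserving; only the byte order changes, before a standard compressor is applied to the result. In pyarrow, for example, `pyarrow.parquet.write_table` exposes a `use_byte_stream_split` option that enables the analogous Parquet encoding (availability depends on the pyarrow version).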