This post discusses the strategies for handling errors in Apache Flink applications, the general principles discussed here apply to stream processing applications at large.
Before we can talk about how to handle errors in our consumer applications, we first need to consider the two most common types of errors that we encounter: transient and nontransient.
Retries are mechanisms used to handle transient errors by reprocessing messages that initially failed due to temporary issues.
DLQs are intended to handle nontransient errors affecting individual messages, not system-wide issues, which require a different approach. Additionally, the use of DLQs might impact the order of messages being processed.
Using side outputs in Apache Flink, you can direct specific parts of your data stream to different logical streams based on conditions, enabling the efficient management of multiple data flows within a single job.
After successfully routing problematic messages to a DLQ using side outputs, the next step is determining how to handle these messages downstream.
Another effective strategy is to store dead letter messages externally from the stream, such as in an Amazon Simple Storage Service (Amazon S3) bucket.
In this post, we looked at how you can leverage concepts such as retries and dead letter sinks for maintaining the integrity and efficiency of your streaming applications.