Considerations for Batch Jobs
ETL, batch, and scheduled jobs are a frequent occurrence in any enterprise solution.
There are some aspects that need to be carefully thought through, designed, and coded when implementing such functionality.
Read Optimization
Read and write in repeated, bounded chunks so the application maintains a predictable memory footprint and never runs the risk of unbounded memory requirements.
For example, run iterations of read/transform/write (a minimal sketch follows the list):
- Read 100 records
- Transform
- Write 100 or fewer records
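A minimal sketch of this chunked loop in Java, assuming hypothetical fetchPage, transform, and writeBatch helpers that stand in for the real source and destination access code:

import java.util.ArrayList;
import java.util.List;

public class ChunkedJob {

    private static final int CHUNK_SIZE = 100;

    public void run() {
        List<SourceRecord> page;
        do {
            page = fetchPage(CHUNK_SIZE);               // read at most CHUNK_SIZE records
            List<TargetRecord> out = new ArrayList<>();
            for (SourceRecord record : page) {
                out.add(transform(record));             // per-record transformation
            }
            writeBatch(out);                            // write CHUNK_SIZE or fewer records
        } while (!page.isEmpty());                      // memory stays bounded by CHUNK_SIZE
    }

    // Hypothetical helpers; real implementations would query the source and write to the destination.
    private List<SourceRecord> fetchPage(int size) { return new ArrayList<>(); }
    private TargetRecord transform(SourceRecord r) { return new TargetRecord(); }
    private void writeBatch(List<TargetRecord> out) { }

    record SourceRecord() {}
    record TargetRecord() {}
}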
Write Optimization
Use bulk/batch mode when saving data into the database (a JDBC sketch follows this list).
Advantage(s)
- The DB handles a few larger writes instead of a rapid stream of single-record calls
- Job will execute faster
Disadvantage(s)
- Memory footprint would be higher, since the records must be collected before the bulk write
- A large chunk of data could be lost in case of an error mid-batch
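As a rough illustration, JDBC's batching API collects the statements locally and sends them to the database in a single round trip; the orders table and its columns below are made up for the example:

import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.SQLException;
import java.util.List;

public class BulkWriter {

    // Hypothetical destination table "orders" with columns (id, status).
    private static final String SQL = "INSERT INTO orders (id, status) VALUES (?, ?)";

    public void writeBatch(Connection conn, List<Order> orders) throws SQLException {
        try (PreparedStatement ps = conn.prepareStatement(SQL)) {
            for (Order o : orders) {
                ps.setLong(1, o.id());
                ps.setString(2, o.status());
                ps.addBatch();          // queue the row locally
            }
            ps.executeBatch();          // one round trip for the whole chunk
        }
    }

    record Order(long id, String status) {}
}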
Error Handling
Give error handling clear thought and direction: the more time you spend during the design phase ironing out the details, the more robust your job will be.
Decide whether the application (or the current loop) can continue running after an error, or must stop.
Ensure logs are precise. The application will be sifting through a lot of data, and if something goes wrong we want to know exactly which record had issues, to reduce debugging time.
Introduce retries around API and DB calls so intermittent network issues are resolved automatically (a minimal retry sketch follows).
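A minimal retry helper with a fixed number of attempts and a simple linear backoff, assuming the wrapped call is idempotent (safe to repeat); in production a library such as Spring Retry or Resilience4j would usually be preferable:

import java.util.concurrent.Callable;

public class Retry {

    // Retries the call up to maxAttempts times, sleeping between attempts.
    // Only use this for idempotent operations.
    public static <T> T withRetry(Callable<T> call, int maxAttempts, long backoffMillis)
            throws Exception {
        Exception last = null;
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            try {
                return call.call();
            } catch (Exception e) {
                last = e;                                    // remember the latest failure
                if (attempt < maxAttempts) {
                    Thread.sleep(backoffMillis * attempt);   // linear backoff
                }
            }
        }
        throw last;                                          // all attempts exhausted
    }
}

For example, withRetry(() -> client.fetchPage(100), 3, 500) would attempt the (hypothetical) API call up to three times before giving up.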
Would the next execution of the job be able to continue from the same point and attempt to move forward?
- If the source data supports a timestamp/bookmark from which data can be requested, then it is possible to restart from the last completed timestamp/bookmark (a sketch follows).
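One way to sketch this, assuming a hypothetical bookmark store (for example a small control table or key-value entry) that persists the last completed position:

import java.time.Instant;
import java.util.List;
import java.util.Optional;

public class CheckpointedJob {

    public void run() {
        // Load the last completed bookmark; start from the epoch on the first run.
        Instant bookmark = loadBookmark().orElse(Instant.EPOCH);

        while (true) {
            List<SourceRecord> page = fetchSince(bookmark, 100);  // records newer than the bookmark
            if (page.isEmpty()) {
                break;                                            // nothing left to process
            }
            writeBatch(page);                                     // process the chunk
            bookmark = page.get(page.size() - 1).updatedAt();     // advance the bookmark
            saveBookmark(bookmark);                               // persist only after a successful write
        }
    }

    // Hypothetical helpers: a bookmark store plus source/destination access.
    private Optional<Instant> loadBookmark() { return Optional.empty(); }
    private void saveBookmark(Instant bookmark) { }
    private List<SourceRecord> fetchSince(Instant since, int limit) { return List.of(); }
    private void writeBatch(List<SourceRecord> page) { }

    record SourceRecord(Instant updatedAt) {}
}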
Consider skipping records that breach the error thresholds, so the job can move ahead instead of getting stuck at a particular problematic record.
Parallel Processing
Evaluate whether various parts of the job can run in parallel to speed up execution times (a sketch follows this list), e.g.
- Processing the batches
- Records within a batch
- Steps within each record
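As an illustration of parallelising the records within a batch, here is a sketch using a fixed thread pool; transform is a hypothetical per-record step and must be thread-safe for this to work:

import java.util.List;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

public class ParallelBatch {

    private final ExecutorService pool = Executors.newFixedThreadPool(8);

    // Transforms all records of one batch in parallel, then waits for completion.
    public List<TargetRecord> process(List<SourceRecord> batch) {
        List<CompletableFuture<TargetRecord>> futures = batch.stream()
                .map(record -> CompletableFuture.supplyAsync(() -> transform(record), pool))
                .toList();
        return futures.stream()
                .map(CompletableFuture::join)      // propagates any per-record failure
                .toList();
    }

    private TargetRecord transform(SourceRecord r) { return new TargetRecord(); }

    record SourceRecord() {}
    record TargetRecord() {}
}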
Execution
We can have code for multiple jobs in a single repo if they share certain common dependencies, e.g. the source and destination are the same for a set of jobs.
However, ensure that during execution each job is an independent process (ideally a Kubernetes CronJob).
Why is this useful?
- Reduced resource utilization (application is running only at scheduled time)
- New container for each execution
- Dedicated execution space — No side effects from other jobs
Bringing it all together
do
{
    get batch data from source [e.g. DB]: stop job in case of failure
    for each record
        transform: decide whether to stop or continue in case of failure
        collect for batch write
    write batch data to destination: stop job in case of failure
} while ( there are more records in source [e.g. DB] )
Avoid re-inventing the wheel
Use frameworks like Spring Batch if they fit your use case; they are robust, manage the boilerplate, and let you focus on the business logic.
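For instance, a chunk-oriented Spring Batch step maps directly onto the read/transform/write loop above, with retry and skip handling built in. This is only a rough sketch: the reader, processor, and writer beans are placeholders, and the exact builder API varies between Spring Batch versions:

import org.springframework.batch.core.Step;
import org.springframework.batch.core.repository.JobRepository;
import org.springframework.batch.core.step.builder.StepBuilder;
import org.springframework.batch.item.ItemProcessor;
import org.springframework.batch.item.ItemReader;
import org.springframework.batch.item.ItemWriter;
import org.springframework.context.annotation.Bean;
import org.springframework.context.annotation.Configuration;
import org.springframework.transaction.PlatformTransactionManager;

@Configuration
public class EtlStepConfig {

    // A chunk-oriented step: read 100 records, transform each, write them as one batch.
    @Bean
    Step etlStep(JobRepository jobRepository,
                 PlatformTransactionManager txManager,
                 ItemReader<SourceRecord> reader,
                 ItemProcessor<SourceRecord, TargetRecord> processor,
                 ItemWriter<TargetRecord> writer) {
        return new StepBuilder("etlStep", jobRepository)
                .<SourceRecord, TargetRecord>chunk(100, txManager)  // bounded memory per chunk
                .reader(reader)
                .processor(processor)
                .writer(writer)
                .faultTolerant()
                .retryLimit(3).retry(Exception.class)               // retry transient failures
                .skipLimit(10).skip(Exception.class)                // skip problematic records
                .build();
    }

    record SourceRecord() {}
    record TargetRecord() {}
}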