Batch Job in Cloud Native World
Re-platforming Batch Job during Application Modernization/Migration
Context:
In this article I will explain the different design/implementation patterns available when you need to design and run scheduled batch jobs on a cloud native platform.
The examples, scenarios, and solutions below are mostly on the AWS platform, with applications written in .NET Core, but the concepts and patterns apply equally well to other cloud platforms and languages with their respective offerings/services.
Before we start on the actual topic, it is important to describe my understanding/definition of a few terms used in this article, as these may mean different things to different people, e.g. Cloud Native Platform, Batch Job, etc.
What is a Cloud Native Platform? A platform that helps run cloud native apps (following the Twelve-Factor App principles): a computing platform running at the PaaS layer or serverless. For example, container platforms (EKS/ECS/Fargate) or Lambda are cloud native platforms to me. Application servers running on EC2/VMs at the IaaS layer, maintained by you, are not a cloud native platform to me, though this is debatable.
What is a Batch Job? Many people have different understandings of a batch job. To me, a batch job is a software process/module performing the same task in bulk, at a scheduled time, in a recurring manner, e.g. sending birthday-greeting emails to all eligible customers at midnight every day, or checking every hour whether a new file is in the staging area to process.
Any batch job has two parts: scheduling (when to execute) and processing, which contains the business logic to execute.
The core concepts of batch jobs remain the same in the current cloud native world, with one major difference. In the classic VM or IaaS world a batch job is a background process that "will never die", but on cloud native PaaS/serverless platforms it is a background process that "will die and be reborn".
Some assumptions and out-of-scope :
Below are some points we will not discuss, as they are out of scope for this article.
- Whether a batch pattern is the right solution for a given use case is debatable. There are alternative approaches such as the worker pattern, the pub-sub model, near-real-time streaming/messaging, etc. We can debate the necessity of batch jobs, but we cannot disagree that some use cases definitely require them.
- This article will not cover batch jobs at the database tier, e.g. replicating/transferring data from one database table to another.
- This article covers batch job implementation at the app tier, mostly for handling business events/functionality.
Some sample use cases of batch jobs at the app tier:
Below are some sample use cases for app-tier batch jobs, to help you picture which scenarios the implementations below are meant to cover.
- Generating and sending reports once a day.
- Processing files, images, etc. to extract data or perform actions on them, perhaps every two hours.
- Orchestrating data across multiple systems within one logical user data flow.
- Applications sending email in bulk.
- An application generating files and putting them on a file server via SFTP in batch mode.
- An application whose 2–3 batch jobs run regularly and also orchestrate among themselves, as they are interdependent.
Problem statement:
When you run a scheduled batch job on a physical machine or VM, you have full control of the operating system to schedule a batch job written in any programming language. You can use the Windows Task Scheduler/Service or a Linux cron job to schedule and invoke a program module that performs a certain task, and your machine/system/OS/compute is always available to perform it. Sometimes we schedule the job in Control-M and execute the code on the VMs or physical machines.
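For example, on a Linux VM the nightly job is often just a single crontab entry (the script path here is illustrative):

```
# Run the report generator every day at midnight
# (crontab fields: minute hour day-of-month month day-of-week)
0 0 * * * /opt/myapp/send-birthday-greetings.sh >> /var/log/myapp/batch.log 2>&1
```

The OS guarantees the scheduler is always there; it is exactly this guarantee that disappears on ephemeral cloud native platforms.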
But when you migrate to the PaaS layer, or create a new application to be hosted on a cloud native PaaS platform, you no longer have access to a physical machine/OS on which to schedule or run an app-tier batch job.
You also have to keep in mind the ephemeral and disposable nature of cloud native PaaS offerings.
Possible solution patterns:
Below are some possible options for implementing batch jobs on cloud native platforms, based on your needs and acceptable constraints.
We will not consider the "AWS Batch" service as an option here, as AWS Batch targets HPC (High Performance Computing) use cases.
Details of how to implement the options below are out of scope for this article.
Option 1 : .net-core worker service (IHostedService)
Description: Both the batch job and its scheduling can be implemented in .NET Core code (as a .NET Core worker service application). Here you don't need an external scheduler or any additional AWS service; the scheduler and the execution code are written in .NET Core (using the IHostedService interface) and run within a container.
You might create a separate .NET Core worker service application/project to host all batch jobs of a specific application and run it in a separate container tagged as a background container.
We use .NET as the reference language here, but a similar approach can be implemented in Java using ScheduledExecutorService / ServletContextListener or similar.
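A minimal sketch of this pattern in .NET Core, assuming a worker-service project; the job name and interval are illustrative:

```csharp
using System;
using System.Threading;
using System.Threading.Tasks;
using Microsoft.Extensions.Hosting;

// Hosted service that wakes up on a fixed interval and runs the batch logic.
// BackgroundService is the framework base class implementing IHostedService.
public class ReportBatchJob : BackgroundService
{
    private static readonly TimeSpan Interval = TimeSpan.FromHours(1);

    protected override async Task ExecuteAsync(CancellationToken stoppingToken)
    {
        while (!stoppingToken.IsCancellationRequested)
        {
            try
            {
                // Business logic of the batch job goes here.
                Console.WriteLine($"Batch run at {DateTime.UtcNow:o}");
            }
            catch (Exception ex)
            {
                // Log and continue; one failed run must not kill the scheduler loop.
                Console.Error.WriteLine(ex);
            }
            await Task.Delay(Interval, stoppingToken);
        }
    }
}

// Registered at startup, e.g. in Program.cs:
// Host.CreateDefaultBuilder(args)
//     .ConfigureServices(s => s.AddHostedService<ReportBatchJob>());
```

Note that the loop only runs while the container is alive, which is exactly the trade-off listed in the cons below.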
UseCase Fitment:
- Small (less resource-intensive) batch jobs that run frequently in a recurring manner during the day; jobs that start and finish quickly.
- The batch job's functionality is scoped within the application.
- No orchestration among multiple batch jobs required.
- No major monitoring and status reporting required.
- No auto/manual retry is required for failure scenarios, and the batch job itself has a mechanism to pick up not-completed tasks/items from the last run.
- The batch job itself has the capability/code to write log entries for success and failure, for external monitoring on demand.
Pros:
- Build and host the batch job alongside the application; no external tooling is required to schedule the batch.
- Developers have full control over scheduling, processing, and exception handling.
- Simple implementation.
- The same implementation works on any cloud vendor's platform; the same code will run on any container platform.
Cons:
- The container hosting these batch jobs has to stay running even when no job is executing, which will increase your cloud bill.
- You need to size the container's resources properly, as auto scaling is not possible.
- No proper dashboard for status/history/audit checking of the batch jobs.
- Error tracking and retry mechanisms need separate design and effort.
- Any scheduling change requires a code change.
- No auto-healing: if the container goes down, your batch job dies with it.
Option 2: Hangfire
Description: Hangfire is an open-source framework that helps you create, process, and manage your background jobs.
The Hangfire .NET Core library helps you create your own batch jobs of different types.
https://www.hangfire.io/overview.html
Here we use .NET as the reference language, but a similar approach can be implemented in Java using JobRunr.
https://www.jobrunr.io/en/
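A recurring job in Hangfire takes only a few lines. The sketch below assumes the Hangfire NuGet packages are installed and a storage provider is configured; the job class and cron schedule are illustrative:

```csharp
using Hangfire;

// At startup, after configuring storage, e.g. (connection string illustrative):
// GlobalConfiguration.Configuration.UsePostgreSqlStorage(connectionString);

// Register a recurring job: run NewsletterService.SendAll (illustrative)
// every day at 02:00, identified by a stable job id.
RecurringJob.AddOrUpdate(
    "nightly-newsletter",
    () => NewsletterService.SendAll(),
    "0 2 * * *");   // standard cron expression

// Fire-and-forget and delayed jobs are also supported:
// BackgroundJob.Enqueue(() => NewsletterService.SendAll());
```

Hangfire persists the job definitions and execution state in the configured storage, which is what enables its retries and monitoring UI.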
Usecase Fitment:
- Mass notifications/newsletters; firing off webhooks; purging temporary files; batch exports to XML, CSV, JSON.
- A good fit when many batch jobs run in a single container to maximize utilization of the resources allocated to that container.
Pros:
- No Windows Service or separate process is required, and no external tooling is required to schedule the batch.
- Free and open for commercial use.
- Supports all kinds of background tasks — short-running and long-running, CPU intensive and I/O intensive, one shot and recurring.
- Backed by persistent storage (filesystem, MSSQL, PostgreSQL).
- Automatic retries.
- Integrated monitoring UI.
- Hangfire uses distributed locks to handle synchronization issues.
- Same implementation across any cloud vendor platform.
Cons:
- No dedicated support from Hangfire on the free tier.
- A small learning curve for developers.
- Check your company's open-source software usage policy, in case special approval is required.
- The container hosting these batch jobs has to stay running even when no job is executing, which will increase your cloud bill.
- You need to size the container's resources properly, as auto scaling is not possible.
- No auto-healing: if the container goes down, your batch job dies with it.
Option 3: CloudWatch + Lambda
Description: At specified times, a CloudWatch Events (EventBridge) schedule invokes a Lambda function that contains the execution code.
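Wiring this up needs only a schedule rule, a target, and an invoke permission. A sketch using the AWS CLI; all names, ARNs, and account IDs are illustrative:

```
# Rule that fires every day at midnight UTC
aws events put-rule \
  --name nightly-report \
  --schedule-expression "cron(0 0 * * ? *)"

# Point the rule at the Lambda function holding the batch logic
aws events put-targets \
  --rule nightly-report \
  --targets '[{"Id":"1","Arn":"arn:aws:lambda:us-east-1:123456789012:function:nightly-report"}]'

# Allow EventBridge to invoke the function
aws lambda add-permission \
  --function-name nightly-report \
  --statement-id nightly-report-rule \
  --action lambda:InvokeFunction \
  --principal events.amazonaws.com \
  --source-arn arn:aws:events:us-east-1:123456789012:rule/nightly-report
```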
Usecase Fitment:
- Short-running, less resource-intensive batch jobs.
- Small and simple code.
- The batch job runs one or more times a day.
- No orchestration among multiple batch jobs is required.
- No major requirement for retries or a fancy dashboard.
- No requirement for parallel processing.
Pros:
- Purely serverless, pay as you use, without managing/maintaining any container image.
- Easy to apply DevOps practices to provision new batches.
- Provides a way to implement alerting/notification on invocation failure (using CloudWatch).
- The end-to-end solution is cloud native.
Cons:
- Lambda limitations apply, including the deployment package size limit and the 15-minute execution time limit.
- No execution history; needs a custom setup using logging from code.
- No parallel execution.
- No orchestration among multiple batch jobs.
Option 4: CloudWatch + Container in Fargate
Description: Place the processing logic/code within a container as a Fargate task, and invoke the Fargate task from the CloudWatch Events (EventBridge) scheduler.
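The same EventBridge rule mechanism can target an ECS Fargate task directly. A sketch of the target definition; the rule name, ARNs, role, and subnet are illustrative:

```
aws events put-targets \
  --rule nightly-batch \
  --targets '[{
    "Id": "1",
    "Arn": "arn:aws:ecs:us-east-1:123456789012:cluster/batch-cluster",
    "RoleArn": "arn:aws:iam::123456789012:role/ecsEventsRole",
    "EcsParameters": {
      "TaskDefinitionArn": "arn:aws:ecs:us-east-1:123456789012:task-definition/nightly-batch:1",
      "TaskCount": 1,
      "LaunchType": "FARGATE",
      "NetworkConfiguration": {
        "awsvpcConfiguration": {
          "Subnets": ["subnet-0abc1234"],
          "AssignPublicIp": "DISABLED"
        }
      }
    }
  }]'
```

`TaskCount` is what gives the parallel-tasks capability mentioned in the pros below.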
Usecase Fitment:
- Long-running, resource-intensive batch jobs.
- Batch jobs that run one or more times a day.
- No orchestration among multiple batch jobs is required.
- No major requirement for retries or a fancy dashboard.
Pros:
- The container need not run all the time; CloudWatch starts and stops the task, so billing is lower.
- Provides some execution history in the AWS ECS console for Fargate task executions.
- Provides some execution metrics in the CloudWatch console (under Metrics).
- Easy to apply DevOps practices to provision new batches.
- You can specify the desired number of tasks to run in parallel; CloudWatch allows multiple parallel tasks via the task count.
- Provides a way to implement alerting/notification on invocation failure (using CloudWatch).
Cons:
- No built-in dashboard for status/error reporting.
- No built-in retry feature.
- It is a fire-and-forget model: CloudWatch triggers the batch job by invoking a task and does not track its outcome.
Option 5: CloudWatch + StepFunction + Container in Fargate/EKS
Description: A CloudWatch Events (EventBridge) schedule invokes an AWS Step Functions state machine, which in turn invokes either an ECS Fargate task or an EKS job (based on your container platform selection). A single Step Functions state machine can invoke multiple batch jobs.
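A sketch of the state machine definition (Amazon States Language) that runs a Fargate task synchronously, with a retry policy and a failure notification; all ARNs are illustrative, and a real Fargate task would also need a network configuration:

```json
{
  "StartAt": "RunBatchTask",
  "States": {
    "RunBatchTask": {
      "Type": "Task",
      "Resource": "arn:aws:states:::ecs:runTask.sync",
      "Parameters": {
        "Cluster": "arn:aws:ecs:us-east-1:123456789012:cluster/batch-cluster",
        "TaskDefinition": "arn:aws:ecs:us-east-1:123456789012:task-definition/nightly-batch:1",
        "LaunchType": "FARGATE"
      },
      "Retry": [
        { "ErrorEquals": ["States.TaskFailed"], "IntervalSeconds": 60, "MaxAttempts": 2, "BackoffRate": 2.0 }
      ],
      "Catch": [
        { "ErrorEquals": ["States.ALL"], "Next": "NotifyFailure" }
      ],
      "End": true
    },
    "NotifyFailure": {
      "Type": "Task",
      "Resource": "arn:aws:states:::sns:publish",
      "Parameters": {
        "TopicArn": "arn:aws:sns:us-east-1:123456789012:batch-alerts",
        "Message": "Nightly batch failed"
      },
      "End": true
    }
  }
}
```

The `.sync` integration makes Step Functions wait for the task to finish, which is what enables the completion notification and retry behavior described below.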
Usecase Fitment:
- You need orchestration among multiple (small/lightweight) batch jobs; this is possible from within Step Functions.
- The batch job can be light or heavy (long-running).
- You need notification on completion of the batch job.
- You need advanced exception handling, retry, and alerting mechanisms around the batch job.
Pros:
- Provides a way to implement alerting/notification on invocation failure (using CloudWatch).
- The end-to-end solution is cloud native.
- Advanced exception handling, retry, and alerting mechanisms.
Cons:
- Only a basic dashboard for status/error reporting.
- A little additional cost for Step Functions on top of the ECS/EKS billing, but it is very minimal.
Option 6: EKS CronJob
Description: The built-in CronJob feature of a Kubernetes cluster. You define a CronJob object, which executes a container image through a K8s Job object.
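A sketch of the CronJob manifest; the name, image, and schedule are illustrative:

```yaml
apiVersion: batch/v1
kind: CronJob
metadata:
  name: nightly-report
spec:
  schedule: "0 0 * * *"        # every day at midnight (cluster time zone)
  concurrencyPolicy: Forbid    # don't start a new run while one is still in flight
  jobTemplate:
    spec:
      backoffLimit: 2          # built-in basic retry on failure
      template:
        spec:
          restartPolicy: OnFailure
          containers:
            - name: nightly-report
              image: 123456789012.dkr.ecr.us-east-1.amazonaws.com/nightly-report:latest
```

`backoffLimit` and `restartPolicy` provide the basic retry/self-healing mentioned in the pros below.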
Usecase Fitment:
- Long-running batch jobs; short-running batch jobs are also fine.
- No orchestration among multiple batch jobs required.
- No major monitoring and status reporting required.
- No auto/manual retry is required for failure scenarios, and the batch job itself has a mechanism to pick up not-completed tasks/items from the last run.
Pros:
- The same implementation works on any cloud vendor's platform; cloud vendor neutral.
- Built-in basic retry mechanism and basic self-healing.
- Easy to maintain and configure.
Cons:
- No dashboard for execution history/status reporting.
- No orchestration among multiple batch jobs is possible.
- The scheduling and processing modules are tightly coupled on the same platform.
Below is a summary matrix of the different options described above and their respective scores on different architectural concerns.
So many options! Is any single one recommended?
With so many solutions available, you may be confused about which one to choose. I recommend you start with CloudWatch + Step Functions + Fargate/EKS/Lambda. The simple guideline is as follows:
- If the batch job is simple and its execution time is within 10 minutes, use CloudWatch Events (EventBridge) + Step Functions + Lambda.
- If the batch job is long-running and heavy in execution, use either "CloudWatch Events (EventBridge) + Step Functions + Fargate" or "CloudWatch Events (EventBridge) + Step Functions + EKS", based on your container platform (ECS Fargate / EKS).
Please remember, AWS Step Functions provides good support for retry, exception handling, alerting (notification), concurrency, and orchestration at very minimal additional cost. Step Functions also allows you to wait for the end of execution of the batch job via Fargate/EKS/Lambda and capture its response/outcome.
Note: Opinions and approaches expressed in this article are solely my own and do not express the views or opinions of my employer, AWS, Microsoft, Oracle, or any other organization.
Some of the product names, logos, brands, and diagrams are the property of their respective owners.
Please post your comments to express where you agree or disagree, and to provide suggestions.