Saga-distributed-transactions

Two-phase commit (2PC) has many limitations for distributed transactions. Saga is the recommended pattern for transactions that span multiple microservices, each with its own database. A saga is a sequence of local transactions. If one of the local transactions fails, compensating transactions are used to undo each of the previous steps.

Context and problem

Typically, in a monolithic application that uses an RDBMS, the ACID properties of the database can be used to create the illusion that transactions execute sequentially (or that each transaction has exclusive access to the data), eliminating problems like lost updates and dirty reads.

If data written by one transaction is overwritten by another transaction (without reading it first), it is called a lost update. If it is read by another transaction before the writer has finished, and so could still be rolled back, it is called a dirty read.

An RDBMS can provide isolated transactions, but in microservices, where each service has its own database, transactions are not isolated: multiple sagas can run concurrently, leading to problems like lost updates and dirty reads.
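The two anomalies can be reproduced with a few lines of Python. This is only a sketch: the `store` dictionary stands in for a hypothetical wallet service database, and the interleaving of the sagas is written out by hand rather than produced by real concurrency.

```python
# Hypothetical in-memory store for a wallet service; nothing isolates the sagas.
store = {"balance": 100}

# Lost update: saga A and saga B both read 100 before either of them writes.
a_read = store["balance"]
b_read = store["balance"]
store["balance"] = a_read - 30   # saga A debits 30, balance becomes 70
store["balance"] = b_read + 50   # saga B credits 50 from its stale read, balance becomes 150
lost_update_result = store["balance"]  # A's debit has been overwritten (120 was expected)

# Dirty read: saga C commits a local step that its saga later compensates.
store["balance"] = 100
store["balance"] += 40           # saga C: credit is committed locally and visible at once
seen_by_d = store["balance"]     # saga D reads the credited balance and acts on it
store["balance"] -= 40           # saga C fails later and runs its compensation
dirty_read_gap = seen_by_d - store["balance"]  # money D believed was there
```

In both cases every individual write is valid on its own; the problem only appears because the steps of different sagas interleave.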

The two-phase commit protocol is an atomic commitment protocol for distributed systems. XA transactions (two-phase commit) have many limitations and are not used much in modern applications. EJB, for instance, supports distributed transactions that can span multiple EJBs deployed in different EJB containers on the same or different servers.

Many NoSQL databases, like MongoDB, do not support XA.
Message brokers like Kafka do not support XA.
Distributed transactions are a form of synchronous IPC, which reduces availability: for a distributed transaction to commit, all participating services must be available.
Chatty: numerous messages are exchanged, with retries.
Reduced performance due to locks.
- In a microservices architecture where a request can be handled by multiple services, distributed transactions holding locks can cause serious performance issues and increase the chance of deadlocks.
The more services there are, the more availability becomes an issue. In modern microservices architectures, where the number of services can be large, XA is not an option. Note that as per the CAP theorem (https://en.wikipedia.org/wiki/CAP_theorem), when a network partition occurs, the system must decide whether to
- cancel the operation and thus decrease the availability but ensure consistency or to
- proceed with the operation and thus provide availability but risk inconsistency.

That is, there is a choice between consistency and availability, and modern architectures favor availability. You either design a consistency-oriented system (reads return an error or the most recent write) or an availability-oriented system (requests receive a non-error response, but the most recent write is not guaranteed).
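The earlier point that availability suffers as more services take part in a distributed transaction can be put into rough numbers. The sketch below assumes that services fail independently and that each has the same availability; both are simplifying assumptions, and the 99.9% figure is purely illustrative.

```python
def all_up_probability(per_service_availability: float, service_count: int) -> float:
    """Chance that every participant is up at the same moment.

    Assumes independent failures, which real systems only approximate.
    """
    return per_service_availability ** service_count


# A synchronous distributed commit needs all participants at once.
for count in (1, 5, 10, 50):
    print(count, round(all_up_probability(0.999, count), 4))
```

Under these assumptions the combined availability only ever goes down as participants are added, which is why a commit that needs all services at once scales poorly.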

Solution

The saga pattern can be used to manage transactions that span microservices.

A saga is a sequence of local atomic transactions. (Instead of one global distributed transaction, there is a sequence of local ACID transactions.)
Each service that takes part in the distributed transaction is a saga participant.
Each saga participant updates its database and atomically publishes an event to trigger the next local transaction in the saga. (Sagas are typically message driven.)
- Note that the local database update and the publishing of the event must happen atomically, i.e. reliably publishing an event whenever state changes is a critical problem to solve in an event-driven microservices architecture.
- Event stores can solve this problem, as they store events and reliably publish messages.
- Another way to achieve this is a transactional outbox plus a message relay. (Event stores can also use this technique internally.)
  - The service uses the atomic commit capability of its database (RDBMS/NoSQL) to perform the local updates and put the message in the transactional outbox (a database table used as a message queue) in a single atomic transaction.
  - The message relay pulls events from the outbox and publishes them to the message broker. The relay could use a polling-based approach, which is somewhat expensive for the database and can add latency. Alternatively, an inter-thread communication approach could be used to notify a thread pool whenever there is a new entry in the outbox. Idempotency is important, as the message relay could push a message to the broker twice. The polling-based approach also has edge conditions, such as keeping track of unhandled messages. Deleting sent messages is one solution. Using a status column to track sent messages is another. Note that the one-up ID approach, where you track the highest processed ID, can run into the edge condition where the transaction with message ID 4 commits after the transaction with message ID 5.
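The transactional outbox and polling relay described above can be sketched with Python's built-in `sqlite3` module. The table layout, the `FakeBroker` class and the `crash_before_mark` flag are assumptions made for illustration; a real relay would publish to an actual broker, and consumers would de-duplicate in their own database.

```python
import json
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE orders (id INTEGER PRIMARY KEY, status TEXT NOT NULL);
    CREATE TABLE outbox (
        id INTEGER PRIMARY KEY AUTOINCREMENT,
        event_id TEXT NOT NULL UNIQUE,
        payload TEXT NOT NULL,
        sent INTEGER NOT NULL DEFAULT 0
    );
""")


def create_order(order_id: int) -> None:
    # One local transaction: the order row and the outbox row commit together or not at all.
    with conn:
        conn.execute("INSERT INTO orders (id, status) VALUES (?, 'PENDING')", (order_id,))
        conn.execute(
            "INSERT INTO outbox (event_id, payload) VALUES (?, ?)",
            (f"order-created-{order_id}", json.dumps({"order_id": order_id})),
        )


class FakeBroker:
    """Stand-in for a real broker; the consumer side de-duplicates on event_id."""

    def __init__(self):
        self.delivered = []   # every publish attempt, duplicates included
        self.processed = {}   # event_id -> payload, handled only once

    def publish(self, event_id, payload):
        self.delivered.append(event_id)
        self.processed.setdefault(event_id, payload)  # idempotent handling


def relay_once(broker, crash_before_mark=False):
    """Polling relay: publish unsent rows, then flag them using the status column."""
    rows = conn.execute(
        "SELECT id, event_id, payload FROM outbox WHERE sent = 0 ORDER BY id"
    ).fetchall()
    for row_id, event_id, payload in rows:
        broker.publish(event_id, payload)
        if crash_before_mark:
            return  # simulate a crash between publishing and marking the row
        with conn:
            conn.execute("UPDATE outbox SET sent = 1 WHERE id = ?", (row_id,))


broker = FakeBroker()
create_order(1)
relay_once(broker, crash_before_mark=True)  # published, but not marked as sent
relay_once(broker)                          # published again, then marked
```

The second `relay_once` call delivers the same event twice, which is why the consumer has to be idempotent. Selecting on the `sent` flag rather than on the highest processed ID also sidesteps the out-of-order commit problem mentioned above.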
Sagas are event driven, and the advantage is that even if one of the saga participant services fails, the saga will eventually complete (with success or failure). The key advantage of sagas is that they permit greater availability than XA.
Saga distributed transactions do not provide
- Automatic rollback: Every local transaction in a saga that updates/inserts/deletes has a corresponding compensating transaction. Hence, if a local transaction fails, the saga executes a series of compensating transactions that undo the writes made by all the preceding local transactions executed by saga participants. Note that
  - Compensatable transactions are transactions that can be rolled back via a compensating transaction.
  - Pivot transaction: if the pivot transaction succeeds, the saga will run to completion. All transactions after the pivot transaction are retryable transactions.
  - Retryable transactions run after the pivot transaction and are guaranteed to succeed.
- Isolation.
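The compensation mechanism can be illustrated with a small saga runner. This is a minimal in-process sketch, not a production executor: the step names, the `state` dictionary and the failing delivery step are all hypothetical, and a real saga would run each step in a different service.

```python
from typing import Callable, List, Tuple

# (name, action, compensation) for one local transaction of the saga
Step = Tuple[str, Callable[[], None], Callable[[], None]]


def run_saga(steps: List[Step], log: List[str]) -> bool:
    """Run steps in order; on failure, compensate completed steps in reverse order."""
    completed: List[Step] = []
    for step in steps:
        name, action, _ = step
        try:
            action()
        except Exception:
            log.append(f"failed:{name}")
            for done_name, _, compensate in reversed(completed):
                compensate()
                log.append(f"compensated:{done_name}")
            return False
        log.append(f"done:{name}")
        completed.append(step)
    return True


state = {"order": None, "wallet": 100}


def create_order():
    state["order"] = "PENDING"


def reject_order():
    state["order"] = "REJECTED"


def debit_wallet():
    state["wallet"] -= 30


def credit_wallet():
    state["wallet"] += 30


def schedule_delivery():
    raise RuntimeError("delivery service rejected the request")


log: List[str] = []
succeeded = run_saga(
    [
        ("create_order", create_order, reject_order),
        ("debit_wallet", debit_wallet, credit_wallet),
        ("schedule_delivery", schedule_delivery, lambda: None),
    ],
    log,
)
```

Note that the compensation for `create_order` does not delete the order; it moves it to a `REJECTED` state. Compensations undo a step semantically rather than restoring the exact previous bytes.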
There are two common implementation patterns
- Choreography
  - There is no centralized point of control. Each saga participant executes a local transaction and atomically emits an event that triggers the next local transaction. Choreography is recommended only for very simple transactions because, in the absence of a point of control, the transaction logic is difficult to understand, track and test.
- Orchestration.
  - The saga orchestrator manages the saga and tells the participants which operations to perform based on events. This approach is better for complex workflows where new participants are added over time, and code readability is good because there is a centralized point of control. Saga orchestrators behave like state machines: they send out commands and, on receiving a response, transition to the next state. Note that the orchestrator can be implicit (one of the existing services takes on the additional responsibility of being the saga orchestrator) or explicit, where the orchestrator is a separate service that invokes the saga participants. Each saga participant subscribes to a command channel and sends its reply over a reply channel.
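An orchestrator modelled as a state machine might look like the sketch below. The state names, the `TRANSITIONS` table and the in-memory `deque` channels are assumptions made for illustration; in practice the command and reply channels would be broker topics or queues, and the orchestrator state would be persisted.

```python
from collections import deque

# state -> (participant, command, next state on success, next state on failure)
# Retryable and compensating steps loop on themselves until they succeed.
TRANSITIONS = {
    "DEBITING_WALLET": ("wallet", "debit", "SCHEDULING_DELIVERY", "REJECTING_ORDER"),
    "SCHEDULING_DELIVERY": ("delivery", "schedule", "CONFIRMING_ORDER", "REFUNDING_WALLET"),
    "CONFIRMING_ORDER": ("order", "confirm", "COMPLETED", "CONFIRMING_ORDER"),
    "REFUNDING_WALLET": ("wallet", "refund", "REJECTING_ORDER", "REFUNDING_WALLET"),
    "REJECTING_ORDER": ("order", "reject", "FAILED", "REJECTING_ORDER"),
}


class Orchestrator:
    def __init__(self, command_channel):
        self.state = "DEBITING_WALLET"
        self.history = [self.state]
        self.command_channel = command_channel

    def start(self):
        self._send_command()

    def _send_command(self):
        participant, command, _, _ = TRANSITIONS[self.state]
        self.command_channel.append((participant, command))

    def on_reply(self, success):
        # A reply moves the state machine forward and triggers the next command.
        _, _, on_success, on_failure = TRANSITIONS[self.state]
        self.state = on_success if success else on_failure
        self.history.append(self.state)
        if self.state in TRANSITIONS:  # COMPLETED and FAILED are terminal
            self._send_command()


def run_saga(handlers):
    """Drive one saga; handlers maps a participant name to a function(command) -> bool."""
    commands, replies = deque(), deque()
    saga = Orchestrator(commands)
    saga.start()
    while commands:
        participant, command = commands.popleft()
        replies.append(handlers[participant](command))  # participant answers on the reply channel
        saga.on_reply(replies.popleft())
    return saga.history


def always_ok(command):
    return True


def always_fail(command):
    return False


happy_path = run_saga({"wallet": always_ok, "delivery": always_ok, "order": always_ok})
failed_path = run_saga({"wallet": always_ok, "delivery": always_fail, "order": always_ok})
```

Because the whole flow lives in one table, adding a participant means adding rows to `TRANSITIONS` rather than changing event subscriptions in several services.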
Note that the service that creates the saga must atomically perform its local database changes and create the saga. This is hard to do atomically for a service that uses a NoSQL database with limited transactional capabilities.

Handling isolation related issues in sagas.

As already mentioned, lost updates and dirty reads can happen because of the lack of isolation in sagas.

Various techniques can be used to handle isolation issues.

Semantic lock: A compensatable saga transaction sets a flag in the records it creates or updates to indicate that the record is NOT committed and could still be rolled back. (This is not required for retryable transactions.) The flag is cleared by a retryable transaction if the saga completes successfully, or by a compensating transaction if the saga fails. While the data is NOT committed, another saga that tries to read or write it can be blocked or rejected. This provides isolation and protects against both dirty reads and lost updates. Note that this approach is similar to what an RDBMS does (pessimistic locking). While isolation issues are addressed with this approach, in complex workflows there is a chance of deadlocks (very similar to what happens in an RDBMS).
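A semantic lock can be sketched as a state flag on the record. The `orders` dictionary, the `locked_by` field and the function names below are hypothetical, and this version rejects a conflicting saga rather than blocking it.

```python
class RecordLocked(Exception):
    """Raised when another saga still holds the semantic lock on a record."""


orders = {}  # hypothetical in-memory order table


def create_order(order_id, saga_id):
    # Compensatable step: the pending state plus owner is the semantic lock.
    orders[order_id] = {"state": "APPROVAL_PENDING", "locked_by": saga_id}


def _release(order_id, saga_id, new_state):
    order = orders[order_id]
    if order["locked_by"] != saga_id:
        raise RecordLocked(f"order {order_id} is not locked by {saga_id}")
    order["state"] = new_state
    order["locked_by"] = None


def approve_order(order_id, saga_id):
    # Retryable step at the end of a successful saga clears the flag.
    _release(order_id, saga_id, "APPROVED")


def reject_order(order_id, saga_id):
    # Compensating transaction clears the flag when the saga fails.
    _release(order_id, saga_id, "REJECTED")


def cancel_order(order_id, saga_id):
    # A different saga is rejected while the record is still locked.
    order = orders[order_id]
    if order["locked_by"] is not None:
        raise RecordLocked(f"order {order_id} is locked by {order['locked_by']}")
    order["state"] = "CANCELLED"


create_order(1, "saga-A")
try:
    cancel_order(1, "saga-B")
    rejected_while_locked = False
except RecordLocked:
    rejected_while_locked = True
approve_order(1, "saga-A")
cancel_order(1, "saga-B")  # succeeds now that the lock has been released
```

Rejecting instead of blocking avoids the deadlock risk at the cost of pushing retry logic to the caller.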
Reread value: Here a saga rereads a record before writing it, to make sure no other saga has modified the data in the meantime. For example, in a movie seat booking app, before finally booking a seat, a saga may want to confirm that the booking request has not already been cancelled by another transaction.
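One way to implement the reread is a version number checked at write time. The sketch below assumes a hypothetical `bookings` table held in memory; in a real database the check and the write would have to happen inside one local transaction, for example as a conditional update on the version column.

```python
class StaleRead(Exception):
    """Raised when the record changed after the saga first read it."""


bookings = {"seat-12A": {"status": "REQUESTED", "version": 1}}


def read_booking(seat):
    return dict(bookings[seat])  # snapshot taken earlier in the saga


def cancel_booking(seat):
    current = bookings[seat]
    current["status"] = "CANCELLED"
    current["version"] += 1


def confirm_booking(seat, snapshot):
    # Reread the record and compare it with what the saga saw before writing.
    current = bookings[seat]
    if current["version"] != snapshot["version"]:
        raise StaleRead(f"{seat} changed since version {snapshot['version']}")
    current["status"] = "BOOKED"
    current["version"] += 1


snapshot = read_booking("seat-12A")
cancel_booking("seat-12A")  # another saga cancels in between
try:
    confirm_booking("seat-12A", snapshot)
    confirmed = True
except StaleRead:
    confirmed = False  # the booking saga should now abort or restart
```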
Pessimistic view: Reorder the steps of a saga so that critical data updates happen in retryable transactions, which are guaranteed to succeed. Consider the following example.
- The steps of an order creation saga in an e-commerce application, where a consumer can place an order if sufficient wallet balance is available:
  - the order is created in a pending state in the order service
  - the wallet balance is checked in the wallet service and, if sufficient, debited
  - if the balance was available, an entry is created in the delivery service
  - the order is confirmed.
- The steps of the order cancellation saga:
  - increase the wallet balance
  - cancel the delivery in the delivery service
  - cancel the order in the order service

In the order cancellation saga, if the delivery cancellation fails because shipping has already happened, the order cancellation saga has to be rolled back. In that case the increase in wallet balance has to be rolled back too, but before it is, an order creation saga may already have read the wallet balance, leading to a dirty read and a situation where an order is approved although the wallet balance is not actually there. If the steps of the order cancellation saga are reordered as follows, the dirty read problem is solved.

- The reordered steps of the order cancellation saga:
  - cancel the order in the order service
  - cancel the delivery in the delivery service
  - increase the wallet balance
  - Note that increasing the wallet balance is now a retryable local transaction, not a compensatable one.
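The effect of the reordering can be simulated by placing a concurrent order creation saga at the point where it would observe the wallet. Everything in this sketch is hypothetical: the amounts, the `make_buyer` helper, and the hand-placed interleaving that stands in for real concurrency.

```python
def make_buyer(price):
    """A concurrent order creation saga that buys if the wallet looks sufficient."""
    def try_buy(wallet):
        if wallet["balance"] >= price:
            wallet["balance"] -= price
            wallet["approved"] = True
    return try_buy


def cancel_credit_first(wallet, refund, delivery_cancellable, concurrent_saga):
    wallet["balance"] += refund      # compensatable step runs first
    concurrent_saga(wallet)          # the other saga sees the provisional credit
    if not delivery_cancellable:
        wallet["balance"] -= refund  # compensation arrives too late
        return False
    return True


def cancel_credit_last(wallet, refund, delivery_cancellable, concurrent_saga):
    concurrent_saga(wallet)          # the other saga sees only settled money
    if not delivery_cancellable:
        return False                 # nothing in the wallet needs undoing
    wallet["balance"] += refund      # retryable step, runs once the saga cannot fail
    return True


original_order = {"balance": 50, "approved": False}
cancel_credit_first(original_order, refund=50, delivery_cancellable=False,
                    concurrent_saga=make_buyer(80))

reordered = {"balance": 50, "approved": False}
cancel_credit_last(reordered, refund=50, delivery_cancellable=False,
                   concurrent_saga=make_buyer(80))
```

With the credit first, the concurrent purchase is approved against money that is later taken back and the balance ends up negative. With the credit last, the purchase is declined and the wallet is untouched.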

Commutative updates: Design updates so that they can be applied in any order without the risk of lost updates. Debits and credits to an account are a good example, provided each change is applied as a relative amount to the current balance rather than written as an absolute value.
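The difference between commutative and non-commutative updates can be checked by applying the same set of changes in every possible order. This is a small illustration with made-up amounts, not a model of a real ledger.

```python
from itertools import permutations


def apply_deltas(balance, deltas):
    # Relative changes (debit or credit): the order of application does not matter.
    for delta in deltas:
        balance += delta
    return balance


def apply_absolute_writes(balance, values):
    # Absolute writes: the last writer wins, so the order decides the outcome.
    for value in values:
        balance = value
    return balance


delta_outcomes = {apply_deltas(100, order) for order in permutations([-30, 50, -20])}
write_outcomes = {apply_absolute_writes(100, order) for order in permutations([70, 150, 80])}
```

All six orderings of the deltas give the same balance, while the absolute writes give a different result depending on which one lands last.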

Disadvantages of sagas

Each local transaction in a saga is ACID compliant, but the saga as a whole lacks isolation, so dirty read and lost update anomalies can occur. The moment a saga participant commits a local transaction, its changes are visible and can be read by other transactions. If these changes are later rolled back, we have a problem, because another transaction has already read them. Handling this means you have to consider many more edge conditions in your business logic, e.g. an order that is not yet approved might be cancelled. (In a monolith's RDBMS transaction, the isolation property at the right isolation level prevents inconsistent states from being read.)
Sagas can complicate API design: when an HTTP request initiates a saga, when should the HTTP response be sent?
- Option 1 is to send the HTTP response when the saga completes, but
  - this reduces availability, since all the services have to be up before the response can be sent, and
  - a thread is blocked in the service waiting for the saga to complete (in Java especially, this has significant overhead).
  - This can be handled using the Reactive Interaction Gateway. Read more here: https://accenture.github.io/reactive-interaction-gateway/docs/intro.html
- Option 2 is to create the saga and return the response immediately. The response does not give the outcome of processing. If the client is using HTTP, it has to poll. If the client is using WebSocket, it can be notified when order processing is complete. This approach has the advantage of improving availability. The user experience can also stay the same even though everything executes asynchronously.
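Option 2 can be sketched without any web framework by modelling the handlers as plain functions that return a status code and a body. The function names, the `/orders/status/...` path and the in-memory `sagas` dictionary are assumptions for illustration only.

```python
import uuid
from typing import Dict, Tuple

sagas: Dict[str, Dict[str, str]] = {}  # hypothetical saga status store


def post_order(request: dict) -> Tuple[int, dict]:
    """Create the saga and answer at once with 202 Accepted, without the outcome."""
    saga_id = str(uuid.uuid4())
    sagas[saga_id] = {"status": "PENDING"}
    return 202, {"saga_id": saga_id, "status_url": f"/orders/status/{saga_id}"}


def get_status(saga_id: str) -> Tuple[int, dict]:
    """Polling endpoint for HTTP clients."""
    saga = sagas.get(saga_id)
    if saga is None:
        return 404, {}
    return 200, dict(saga)


def on_saga_finished(saga_id: str, outcome: str) -> None:
    # Called when the saga ends; a WebSocket push to the client could also happen here.
    sagas[saga_id]["status"] = outcome


code, body = post_order({"item": "book"})
status_before = get_status(body["saga_id"])
on_saga_finished(body["saga_id"], "APPROVED")
status_after = get_status(body["saga_id"])
```

The request handler never waits on another service, so it stays available even when a participant is down; the price is that the client needs a second interaction to learn the result.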

	Nishant Anand: I am a technical architect with around 18 years of experience in the telecom, banking and health care domains. I was part of the founding team of Drishti and am currently CTO at affordplan.