[DATAREDIS-834] Redis connection using LettuceConnectionFactory locks for a long time when the redis server is down. Created: 15/May/18  Updated: 07/Jun/18  Resolved: 07/Jun/18

Status: Closed
Project: Spring Data Redis
Component/s: Lettuce Driver
Affects Version/s: 1.8.13 (Ingalls SR13)
Fix Version/s: None

Type: Improvement Priority: Minor
Reporter: Ugur Alpay Cenar Assignee: Mark Paluch
Resolution: Works as Designed Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Last updater: Mark Paluch

 Description   

I am trying to set up a Spring CacheManager using RedisTemplate and a LettuceConnectionFactory connected to a Redis Sentinel.

Below is the code I used (slightly reduced for simplicity):

@Bean
public CacheManager cacheManager(RedisTemplate redisTemplate){
    RedisCacheManager redisCacheManager = new RedisCacheManager(redisTemplate);
    return redisCacheManager;
}

@Bean
public RedisTemplate redisTemplate(LettuceConnectionFactory lcf){
    RedisTemplate redisTemplate = new RedisTemplate();
    redisTemplate.setConnectionFactory(lcf);
    return redisTemplate;
}

@Bean
public LettuceConnectionFactory lcf(LettucePool lettucePool){
    LettuceConnectionFactory lcf = new LettuceConnectionFactory(lettucePool);
    return lcf;
}

@Bean
public LettucePool lettucePool() {
    DefaultLettucePool lettucePool = new DefaultLettucePool(
            new RedisSentinelConfiguration().master("mymaster").sentinel(new RedisNode(<name>, 26379)));
    lettucePool.afterPropertiesSet();
    return lettucePool;
}

In a production environment the Redis server cannot be expected to be up and working 100% of the time, so I tried to simulate the Redis server going down by restarting the Redis Sentinel. The problem is that while the Sentinel was down, the application either stopped responding to requests or took 30-50 seconds to respond. It seems that the LettuceConnectionFactory locks while trying to reconnect and builds up a large queue of pending requests, which then takes a very long time as it retries every request in the queue (I am not sure this is exactly what happens, but it is what I understood from reading the documentation).

After a lot of debugging I was able to fix this problem by copying and modifying the afterPropertiesSet method of DefaultLettucePool:

public void afterPropertiesSet() {
    if (clientResources != null) {
        this.client = RedisClient.create(clientResources, getRedisURI());
    } else {
        this.client = RedisClient.create(getRedisURI());
    }

    /** Custom code **/
    this.client.setOptions(ClientOptions.builder()
            .autoReconnect(true)
            .cancelCommandsOnReconnectFailure(true)
            .disconnectedBehavior(ClientOptions.DisconnectedBehavior.REJECT_COMMANDS)
            .socketOptions(SocketOptions.builder()
                    .connectTimeout(200, MILLISECONDS)
                    .build())
            .build());
    /** Custom code end **/

    client.setDefaultTimeout(timeout, MILLISECONDS);
    this.internalPool = new GenericObjectPool(new CustomLettucePool.LettuceFactory(client, dbIndex), poolConfig);
}

I was therefore wondering if you could add an option to set the ClientOptions of the RedisClient when configuring the DefaultLettucePool, or find another solution to the locking problem by adding more configuration options. It is very important for the application to keep running as usual, without the cache, when the Redis server is down.

I tried to set the reconnectDelay and connectionTimeout, but nothing helped; this was the only solution that actually worked. I also tried to set it up without using the LettucePool.

I also tried JedisConnectionFactory, but its problem is that it does not reconnect to the Redis Sentinel after a Redis server restart.



 Comments   
Comment by Mark Paluch [ 15/May/18 ]

This sounds like a Lettuce issue and not necessarily a Spring Data Redis issue. In general, if your server is not available, you can either fail fast with an exception or (what Lettuce is doing) wait a certain time until Redis comes back online. Pooling won't help here: if Redis is down, all pooled connections are disconnected.

What would you expect should happen if your server is not available?

Comment by Ugur Alpay Cenar [ 15/May/18 ]

I forgot to explain, but I also overrode the CacheErrorHandler so that it just logs the error without doing anything else. The application then runs as usual without using the cache. I expect the application not to fail when Redis is unavailable and to reconnect when it becomes available again. After my modification to the LettucePool, the Redis CacheManager works as expected.
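
A minimal sketch of such an error handler (assuming SLF4J for logging; the class name and log messages are illustrative, not my exact code) could look like this:

import org.slf4j.Logger;
import org.slf4j.LoggerFactory;
import org.springframework.cache.Cache;
import org.springframework.cache.annotation.CachingConfigurerSupport;
import org.springframework.cache.annotation.EnableCaching;
import org.springframework.cache.interceptor.CacheErrorHandler;
import org.springframework.context.annotation.Configuration;

@Configuration
@EnableCaching
public class CacheConfig extends CachingConfigurerSupport {

    private static final Logger log = LoggerFactory.getLogger(CacheConfig.class);

    @Override
    public CacheErrorHandler errorHandler() {
        // Log cache failures instead of propagating them, so the application
        // keeps serving requests (without the cache) when Redis is down.
        return new CacheErrorHandler() {

            @Override
            public void handleCacheGetError(RuntimeException e, Cache cache, Object key) {
                log.warn("Cache GET failed for cache '{}'", cache.getName(), e);
            }

            @Override
            public void handleCachePutError(RuntimeException e, Cache cache, Object key, Object value) {
                log.warn("Cache PUT failed for cache '{}'", cache.getName(), e);
            }

            @Override
            public void handleCacheEvictError(RuntimeException e, Cache cache, Object key) {
                log.warn("Cache EVICT failed for cache '{}'", cache.getName(), e);
            }

            @Override
            public void handleCacheClearError(RuntimeException e, Cache cache) {
                log.warn("Cache CLEAR failed for cache '{}'", cache.getName(), e);
            }
        };
    }
}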

Comment by Mark Paluch [ 15/May/18 ]

I expect the application not to fail when Redis is unavailable and to reconnect when it becomes available again.

These are two separate aspects, and not failing when Redis is not available is quite a broad statement: RedisCache operations propagate exceptions if there is an issue with Redis I/O. Consuming exceptions with your own CacheErrorHandler means they are not propagated any further, which gives you a point at which you can suppress them.

You get the reconnect feature from Lettuce, and by tweaking timeout options and the disconnected behavior you can tailor the actual behavior in unavailability scenarios.

I was therefore wondering if you could add an option to set the ClientOptions of RedisClient when configuring the DefaultLettucePool

We revised LettuceConnectionFactory in version 2.0 with specific client configurations where you can set ClientOptions directly. We don't have plans to update the API for the 1.8.x versions, so please upgrade to a newer version.
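
Roughly along these lines, as a sketch for 2.0 (the sentinel host and the timeout values below are placeholders, not recommendations):

import java.time.Duration;

import io.lettuce.core.ClientOptions;
import io.lettuce.core.SocketOptions;

import org.springframework.context.annotation.Bean;
import org.springframework.data.redis.connection.RedisSentinelConfiguration;
import org.springframework.data.redis.connection.lettuce.LettuceClientConfiguration;
import org.springframework.data.redis.connection.lettuce.LettuceConnectionFactory;

@Bean
public LettuceConnectionFactory lettuceConnectionFactory() {

    RedisSentinelConfiguration sentinelConfig = new RedisSentinelConfiguration()
            .master("mymaster")
            .sentinel("sentinel-host", 26379);

    // In 2.0, ClientOptions are set through LettuceClientConfiguration;
    // no custom pool subclass is required.
    LettuceClientConfiguration clientConfig = LettuceClientConfiguration.builder()
            .clientOptions(ClientOptions.builder()
                    .autoReconnect(true)
                    .cancelCommandsOnReconnectFailure(true)
                    .disconnectedBehavior(ClientOptions.DisconnectedBehavior.REJECT_COMMANDS)
                    .socketOptions(SocketOptions.builder()
                            .connectTimeout(Duration.ofMillis(200))
                            .build())
                    .build())
            .commandTimeout(Duration.ofMillis(500))
            .build();

    return new LettuceConnectionFactory(sentinelConfig, clientConfig);
}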

Comment by Ugur Alpay Cenar [ 19/May/18 ]

Sorry for the late answer, I was busy the last couple of days. I tried to tweak the timeout/reconnect options, but nothing helped. The only solution was to modify the RedisClient configuration, which is not possible through the current LettuceConnectionFactory in 1.8.13. I just find it strange that the application stopped responding when Redis was unavailable. I was expecting the ConnectionFactory to stop trying to reconnect after throwing an exception, which was not the case. I will try to upgrade to version 2.0 when I get the chance.

Comment by Mark Paluch [ 07/Jun/18 ]

Okay, I'm closing this ticket as we currently cannot do anything further here.
