Sitecore IP Geolocation Service resolving GeoIP information on the first request - using Circuit Breaker design pattern

A key part of Sitecore's personalisation offering is the ability to gear content towards a user's location. Sitecore provide plentiful information to help you set up your application to retrieve geo data for each originating IP address. Once you have this set up, Sitecore will call a web service behind the scenes whenever it encounters a request from an IP address which it hasn't seen before. The result of this call is cached in a MongoDB collection, so that each IP address is only ever looked up via the web service once. Once this process has completed, the Tracker.Current.Interaction.GeoData object is populated with the relevant geo data, and the out-of-the-box location-based personalisation rules will be ready to fire.

However, in the out-of-the-box setup of the Sitecore IP Geolocation Service, the service is configured to run asynchronously. In practice, this means that if your Sitecore application is requested from an IP address which has not been seen before, the geo information is not available until the asynchronous web service call has completed - application requests are not blocked by the web service call. So it is likely that the user's initial request (or even the first few requests) will not be able to take advantage of any geo-based personalisation. If you want to ensure that geo data is available from the first application request, you can amend Sitecore to force application requests to block (for a specified time interval) until geo data is resolved - "Solution 1" on this page shows exactly how Sitecore suggest you do this.

But - any readers of Michael Nygard's book might be concerned about the possibility of introducing a cascading failure scenario to our application, as the above approach makes our application dependent on the Sitecore IP Geolocation Service (at least in some situations). Issues with a recent client show that this concern is indeed valid: although there was no issue with the geolocation web service itself, our client's firewall software was preventing calls through to the web service, and the web service calls were not failing fast - they were timing out. With low site traffic consisting mostly of repeat visitors, there was no obvious impact on site performance (except that location-based personalisation would not have been working) - however, a mailshot sent out to customers one day triggered a flood of new visitors to the site, causing a site outage of around 30 minutes.

Decompiling the code used for "Solution 1", it is clear what happened during the above site outage. There is no concept of 'availability' of the Geolocation Service - the code naively assumes that it is always available (whereas ideally, we would rather not bother making the web service call if the web service appears to be unavailable). In the above 'cascading failure' scenario, many requests hit the application from previously unseen IP addresses, causing the web service call to be triggered. However, the web service is unavailable but does not 'fail fast' - therefore each application request waits 5 extra seconds for information which will never arrive. All these lengthened requests combined mean a lot of extra work for the application server, which then starts to struggle.

Circuit breaker design pattern to the rescue! As mentioned in Release It!, we can introduce some more intelligence into our web service client code, so that the concept of 'availability' is now present. Briefly, the circuit breaker concept uses some heuristic to work out whether a resource is available or not. If the resource is unavailable, we stop attempting to use it for some period of time (to prevent unnecessary work by our application - plus the resource in question gets a bit of a break too) - i.e. the circuit is broken, or 'open'. After this period of rest has passed, we decide whether to close the circuit or to keep it open based on the first few attempts to use the resource. For the geolocation web service call, we have decided that 5 exceptions from our web service call should cause us to stop calling the web service for the next 30 minutes.
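
To make the idea concrete, here is a minimal, self-contained sketch of a count-based circuit breaker. It is purely illustrative and is not the implementation used later in this post: the names SimpleCircuitBreaker and CircuitOpenException are hypothetical, and the injectable clock is only there to make the behaviour easy to test.

```csharp
using System;

// Hypothetical names throughout; an illustration of the pattern, not production code.
public class CircuitOpenException : Exception
{
    public CircuitOpenException() : base("Circuit is open; the call was not attempted.") { }
}

public class SimpleCircuitBreaker
{
    private readonly int _maxFailures;
    private readonly TimeSpan _breakDuration;
    private readonly Func<DateTime> _utcNow;   // injectable clock, handy for tests
    private readonly object _sync = new object();
    private int _consecutiveFailures;
    private DateTime? _openedAtUtc;            // null while the circuit is closed

    public SimpleCircuitBreaker(int maxFailures, TimeSpan breakDuration, Func<DateTime> utcNow = null)
    {
        _maxFailures = maxFailures;
        _breakDuration = breakDuration;
        _utcNow = utcNow ?? (() => DateTime.UtcNow);
    }

    public T Execute<T>(Func<T> action)
    {
        lock (_sync)
        {
            // While the break is in force, fail fast without touching the resource.
            if (_openedAtUtc.HasValue && _utcNow() - _openedAtUtc.Value < _breakDuration)
            {
                throw new CircuitOpenException();
            }
        }

        try
        {
            T result = action();
            lock (_sync)
            {
                // A success closes the circuit and resets the failure count.
                _consecutiveFailures = 0;
                _openedAtUtc = null;
            }
            return result;
        }
        catch (Exception)
        {
            lock (_sync)
            {
                _consecutiveFailures++;
                // Open (or re-open after a failed trial call) once the threshold is reached.
                if (_consecutiveFailures >= _maxFailures)
                {
                    _openedAtUtc = _utcNow();
                }
            }
            throw;
        }
    }
}
```

With new SimpleCircuitBreaker(5, TimeSpan.FromMinutes(30)), the sixth call after five consecutive failures throws CircuitOpenException straight away instead of invoking the action. Note that this sketch cuts corners: once the break has elapsed, it lets any number of concurrent trial calls through, which is one of the subtleties that makes a well-tested library a safer choice.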

 

Sitecore's IP Geolocation Service configuration provides us with enough extensibility points to make applying the circuit breaker pattern straightforward. The root of the geolocation functionality is usually provided by Sitecore.CES.GeoIp.SitecoreProvider, the main method of which encapsulates all the work to get a WhoIsInformation object from the provided IP address:

public override WhoIsInformation GetInformationByIp(string ip)

 

An inspection of the decompiled source for this class shows that if for any reason the geo data cannot be retrieved from the web service (and it's not a valid 'failure' case, such as a 404 response for a particular IP address) then exceptions are thrown from the above method. This is very useful, as it means we can easily determine exactly in which situations the web service should be deemed 'unavailable'. The following replacement Provider class inherits from Sitecore.CES.GeoIp.SitecoreProvider, and simply wraps the base GetInformationByIp call with some circuit breaker code (more on this shortly) to ensure that after 5 exceptions are seen, we 'open' the circuit for 30 minutes. In practice this means that when the circuit is open we will 'fail fast' and throw an exception instantly within our GetInformationByIp method, rather than calling the base GetInformationByIp method. The client code of our provider doesn't have to wait 5 seconds before receiving an exception or timing out. The new Provider class looks something like:

using RedMoon.Infrastructure.Resilience; // ICircuitBreaker
using Sitecore.CES.GeoIp;                // SitecoreProvider
using Sitecore.Configuration;            // Settings
// Usings for WhoIsInformation, ResourceConnector<T>, EndpointSource and DIResolver are omitted here.

namespace RedMoon.Providers.GeoLocation
{
  public class CircuitBreakerLookupProvider : SitecoreProvider
  {
    private readonly bool _circuitBreakerEnabled;
    private readonly ICircuitBreaker _circuitBreaker;

    public CircuitBreakerLookupProvider(ResourceConnector<WhoIsInformation> geoIpConnector)
      : base(geoIpConnector)
    {
      _circuitBreakerEnabled = CircuitBreakerEnabled;
      _circuitBreaker = ResolveCircuitBreaker();
    }

    public CircuitBreakerLookupProvider(ResourceConnector<WhoIsInformation> geoIpConnector, EndpointSource endpointSource)
      : base(geoIpConnector, endpointSource)
    {
      _circuitBreakerEnabled = CircuitBreakerEnabled;
      _circuitBreaker = ResolveCircuitBreaker();
    }

    public override WhoIsInformation GetInformationByIp(string ip)
    {
      if (_circuitBreakerEnabled)
      {
        // While the circuit is open this throws immediately, without calling the web service.
        return _circuitBreaker.Execute(() => base.GetInformationByIp(ip));
      }
      else
      {
        return base.GetInformationByIp(ip);
      }
    }

    // Allows the circuit breaker to be switched off in config; disabled by default.
    private bool CircuitBreakerEnabled
    {
      get
      {
        return Settings.GetBoolSetting("Analytics.PerformLookup.CircuitBreaker.Enabled", false);
      }
    }

    // The DI container hands back the single shared circuit breaker instance.
    private ICircuitBreaker ResolveCircuitBreaker()
    {
      return DIResolver.Resolve<ICircuitBreaker>();
    }
  }
}

We can patch our new Provider in as follows:

<lookupManager defaultProvider="default">
  <providers>
    <clear/>
    <add type="Sitecore.CES.GeoIp.SitecoreProvider, Sitecore.CES.GeoIp">
      <patch:delete/>
    </add>
    <add name="default" type="RedMoon.Providers.GeoLocation.CircuitBreakerLookupProvider, RedMoon">
      <param ref="GeoIpConnector" />
    </add>
  </providers>
</lookupManager>

Now let's have a quick look at the implementation of ICircuitBreaker. As hand-rolling your own circuit breaker implementation is fraught with concurrency-based risks, I decided to use the Polly exception and fault handling library, which seemed to have rave reviews. Note that Polly also enables more sophisticated circuit breaker techniques, such as measuring the rate of exceptions thrown, rather than just maintaining a count of consecutive exceptions - however, I've started with the simple approach. It's key that the same Policy object is used to wrap all requests to the resource which we are protecting, and although I don't show this here, I use a DI container to enforce that there is only ever a single instance of the CircuitBreaker class:

using System;
using Polly;
using Sitecore.Configuration;

namespace RedMoon.Infrastructure.Resilience
{
  public class CircuitBreaker : ICircuitBreaker
  {
    // The policy holds the circuit state, so one instance must be shared by all callers.
    private readonly Policy _circuitBreaker;

    public CircuitBreaker()
    {
      var maxExceptionsBeforeBreaking = Settings.GetIntSetting("Analytics.PerformLookup.CircuitBreaker.MaxExceptionsBeforeBreaking", 5);
      var durationOfBreakMinutes = Settings.GetIntSetting("Analytics.PerformLookup.CircuitBreaker.DurationOfBreakMinutes", 30);

      // Any exception counts as a failure; open the circuit once the threshold is reached.
      _circuitBreaker = Policy
        .Handle<Exception>()
        .CircuitBreaker(maxExceptionsBeforeBreaking, TimeSpan.FromMinutes(durationOfBreakMinutes));
    }

    public TResult Execute<TResult>(Func<TResult> action)
    {
      return _circuitBreaker.Execute(action);
    }
  }
}
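
For completeness, here is a rough sketch of the two pieces not shown above: the ICircuitBreaker interface (inferred from how the provider and the CircuitBreaker class use it) and one container-free way of guaranteeing a single shared instance. CircuitBreakerInstance and PassThroughCircuitBreaker are hypothetical names; the pass-through class is only a stand-in so that the sketch compiles without Polly or Sitecore, and in the real application a DI container's singleton registration of the CircuitBreaker class plays this role.

```csharp
using System;
using System.Threading;

// Inferred from its usage in this post; the original interface is not shown.
public interface ICircuitBreaker
{
    TResult Execute<TResult>(Func<TResult> action);
}

// Stand-in implementation so the sketch compiles on its own.
// The real implementation wraps a Polly policy instead.
public class PassThroughCircuitBreaker : ICircuitBreaker
{
    public TResult Execute<TResult>(Func<TResult> action)
    {
        return action();
    }
}

// Hypothetical holder: every caller receives the same instance, and therefore
// shares the same circuit state.
public static class CircuitBreakerInstance
{
    private static readonly Lazy<ICircuitBreaker> Shared =
        new Lazy<ICircuitBreaker>(
            () => new PassThroughCircuitBreaker(),
            LazyThreadSafetyMode.ExecutionAndPublication);

    public static ICircuitBreaker Current
    {
        get { return Shared.Value; }
    }
}
```

The important property is that two separately constructed providers calling CircuitBreakerInstance.Current see the same object. If each provider created its own breaker, each would keep its own failure count and the circuit would open much later than intended, if at all.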

It looks like we're almost done - however, I happened to notice that the provided Sitecore.CES.Client.WebRequestFactory class creates web requests which don't appear to set a timeout - this means that a request would potentially wait for 100 seconds (the default timeout) before failing! Although application requests are limited to waiting for 5 seconds (changeable in config) for the geolocation data to be available (before deciding to carry on rendering the page without this data), behind the scenes Sitecore will still be waiting on this request to complete for much longer, wasting more precious server resources. Also, although this doesn't create a problem for our circuit breaker logic, it does mean that it would potentially take the circuit a couple of minutes to realise it needs to break, rather than a few seconds. Luckily it's easy to add a replacement WebRequestFactory which uses a much smaller timeout - here I've decided to re-use the same 5 second interval which the application uses when delaying the page response whilst waiting for the geolocation web service:

using System.Net;
using Sitecore.Configuration;
using Sitecore.Diagnostics;

namespace RedMoon.Providers.GeoLocation
{
  public class WebRequestFactory : Sitecore.CES.Client.WebRequestFactory
  {
    private const int MillisecondsPerSecond = 1000;

    public override WebRequest Create(string requestUri)
    {
      Assert.ArgumentNotNullOrEmpty(requestUri, "requestUri");

      // Re-use the interval for which page requests wait for geo data (5 seconds by default).
      var timeoutSeconds = Settings.GetIntSetting("Analytics.PerformLookup.CreateVisitInterval", 5);
      WebRequest webRequest = WebRequest.Create(requestUri);
      webRequest.Timeout = timeoutSeconds * MillisecondsPerSecond; // Timeout is in milliseconds
      webRequest.Headers.Add("X-ScS-Nexus-Auth", AuthHeaderValue);

      return webRequest;
    }
  }
}
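
If you'd like to convince yourself of the default timeout, the following standalone sketch compares an untouched request with one whose timeout has been shortened. It doesn't involve Sitecore at all: TimeoutCheck and CreateWithTimeout are hypothetical names, the URL is a placeholder, and I'm assuming the classic WebRequest API of the .NET Framework (newer .NET versions flag it as obsolete).

```csharp
using System;
using System.Net;

public static class TimeoutCheck
{
    // Builds a request the same way the factory above does, minus the Sitecore-specific header.
    public static WebRequest CreateWithTimeout(string requestUri, int timeoutSeconds)
    {
        WebRequest request = WebRequest.Create(requestUri);
        request.Timeout = timeoutSeconds * 1000; // Timeout is expressed in milliseconds
        return request;
    }

    // Returns the timeout of a request on which nothing has been configured.
    public static int DefaultTimeoutMilliseconds(string requestUri)
    {
        // No network traffic happens here; the request is constructed but never sent.
        return WebRequest.Create(requestUri).Timeout;
    }
}
```

Calling TimeoutCheck.DefaultTimeoutMilliseconds("http://example.com/") should report the 100 second default (100000 milliseconds), whereas TimeoutCheck.CreateWithTimeout("http://example.com/", 5).Timeout comes back as 5000.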

The new WebRequestFactory is then patched into the application as follows:

<GeoIpWebRequestFactory type="RedMoon.Providers.GeoLocation.WebRequestFactory, RedMoon" singleInstance="true"
 patch:instead="GeoIpWebRequestFactory[@type='Sitecore.CES.Client.WebRequestFactory, Sitecore.CES']" />

To prove the concept, I created a fake geolocation web service, which returns a failure response after waiting for a very long time. After 5 exceptions, the logs show the circuit breaking as expected:

Exception: Polly.CircuitBreaker.BrokenCircuitException
Message: The circuit is now open and is not allowing calls.
Source: Polly .................

To try and ascertain how some of this new functionality might behave at scale, I set up some BlazeMeter load tests which simulated a ramp-up to 2000 virtual users. A first test used my fake broken geolocation web service, but with the circuit breaker disabled - this simulated the catastrophic site outages we saw on our client's production environment:

If we keep the web service broken but switch on the circuit breaker, the results are dramatically different (and an inspection of the logs reveals that the circuit breaker logic is triggered within seconds):

A final test ensures that the circuit breaker is not triggering when using the real web service. As we can see, there is no impact on performance, and an inspection of the logs shows that the circuit breaker is not triggered:

 

As an aside, it's a bit tricky to ensure that each of the load test requests seems to be coming from a new IP address (which is a prerequisite for an accurate load testing scenario). This is actually quite easy to fake but probably beyond the scope of this post - drop me an email if you'd like to know how!

 

One final caveat with the things I've outlined in this post - as I vaguely implied earlier, Sitecore will always first look to its MongoDB cache of geo data before calling the geolocation web service. The downside of this is that if you have suffered from similar problems to the above - that is, firewall settings blocking connections to the Sitecore geolocation web service - the failed IP lookups will have been cached. People who requested the application while the firewall issues were present will therefore continue to not see any location-based personalisation, even once the firewall issues are solved. Luckily there's a simple solution - Sitecore have said that it is fine to manually clear the GeoIps collection in the Analytics MongoDB database, to force the web service to be called. There is also a cache in memory, so the application will need to be restarted to clear this after you've cleared the MongoDB collection.


By James at 16 Jul 2016, 21:11 PM

