Pattern analysis for data analysis

I seem to have been only like a boy playing on the seashore, and diverting myself in now and then finding a smoother pebble or a prettier shell than ordinary, whilst the great ocean of truth lay all undiscovered before me.

Isaac Newton

One of my goals this year is to finish the Advent of Code 2019. I started in December and realised I simply was not going to finish before the end of the month: after so long spent on business problems, my brain was out of training for this kind of problem solving. However, it has been extremely satisfying coming back to these problems to work on them. As of writing I am working on the second part of Day 17.

In this post, I’d like to talk about Day 12, which I can count as the first day where I doubled down on the wrong solution. The second half of the Day 12 puzzle was, essentially: simulate a simplified four-body problem and find the period before a single state (positions and velocities) repeats.

The answer for my puzzle input was 376203951569712 iterations, or \(\sim 3.8 \times 10^{14}\), which in order to be simulated in less than 1 minute would require each iteration to be solved in about 0.16 picoseconds.
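For the curious, the arithmetic behind that figure:

\[
\frac{60\ \text{s}}{3.76 \times 10^{14}\ \text{iterations}} \approx 1.6 \times 10^{-13}\ \text{s} \approx 0.16\ \text{ps per iteration}
\]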

Clearly, the brute force approach is not the way … but I didn’t read between the lines and understand the size of this problem.

I had attacked the other problems using Python, and so started this one in Python too. I laid out the problem simply: four objects, each with six states (positions and velocities in three dimensions), and a simulate function which moved the system forward one timestep.
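A minimal sketch of that layout (the class and function names here are illustrative, not my original code):

```python
# Four bodies, each with a 3D position and velocity, stepped one tick at a time.
from itertools import combinations

class Moon:
    def __init__(self, x, y, z):
        self.pos = [x, y, z]
        self.vel = [0, 0, 0]

def simulate(moons):
    """Advance the whole system by one timestep."""
    # Gravity: on each axis, each pair of bodies pulls one step towards each other.
    for a, b in combinations(moons, 2):
        for axis in range(3):
            if a.pos[axis] < b.pos[axis]:
                a.vel[axis] += 1
                b.vel[axis] -= 1
            elif a.pos[axis] > b.pos[axis]:
                a.vel[axis] -= 1
                b.vel[axis] += 1
    # Velocity: positions move by the updated velocities.
    for m in moons:
        for axis in range(3):
            m.pos[axis] += m.vel[axis]
```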

It could do the smaller example in 5 or 6 minutes, but the larger example didn’t finish in under an hour. I brought up the profiler and started optimising: lists instead of objects, unrolled loops, chasing down anything that looked like a bottleneck.
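For reference, the standard-library profiler is enough for this kind of hunt; the call below reuses the hypothetical Moon and simulate from the sketch above, with arbitrary starting positions:

```python
# Profile a batch of simulation steps, sorted by cumulative time to
# surface the hottest functions.
import cProfile

moons = [Moon(-1, 0, 2), Moon(2, -10, -7), Moon(4, -8, 8), Moon(3, 5, -1)]
cProfile.run("for _ in range(100000): simulate(moons)", sort="cumtime")
```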

I came to the conclusion that Python itself was the bottleneck, so I left my program running and reimplemented the simulation in C.

This ran fast, about 40x faster, but it still wasn’t enough. I had assumed the solution would be near the largest example in size; it was much larger. After unrolling loops and using better data structures, I managed to simulate the largest example in about an hour. That example had a solution at 4686774924 iterations, or \(\sim 4.7 \times 10^{9}\), and I found a trick to working out the final solution from the halfway point.
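One way such a halfway-point trick can work (a sketch of the time-reversal argument, not necessarily the exact version I used): the update rule is reversible and every velocity starts at zero, so the first later state in which all velocities are zero again falls exactly halfway through the full cycle.

```python
# Sketch of the halfway-point check: with a reversible update rule and
# all-zero starting velocities, the first subsequent all-zero-velocity
# state marks half of the full period.
def at_half_period(moons):
    return all(v == 0 for m in moons for v in m.vel)
```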

At that rate, the fastest my solution would be able to get to the answer was about 80,000 hours, or 9.16 years.

It was at this point I realised I had gone way, way down a rabbit hole, and decided to start looking at the data. Here is a sample of some velocities and positions over a few hundred timesteps.

[Figure: a sample of data channels over 100 timesteps]

Quite obviously, each data channel repeats with some period. These channels repeat over quite a short period, but in my final solution some data channels had periods of around 9,000 timesteps.

Suddenly, this became a problem of finding the time \(t\) at which every period completes a cycle at the same moment. Or, rephrased: we needed to find the least common multiple of all the data channel periods.

This, as it turned out, was the fast way to solve the problem.
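A sketch of that approach, assuming channels holds a recorded history of each position and velocity component (names illustrative, and each history long enough to contain at least two full cycles):

```python
# Find each data channel's period, then combine the periods with the LCM.
from math import gcd
from functools import reduce

def channel_period(history):
    """Smallest p > 0 such that the recorded sequence repeats with period p."""
    n = len(history)
    for p in range(1, n // 2 + 1):
        if all(history[i] == history[i + p] for i in range(n - p)):
            return p
    raise ValueError("no full period found in this history")

def lcm(a, b):
    return a * b // gcd(a, b)

def combined_period(channels):
    """Time at which every channel completes a whole number of cycles."""
    return reduce(lcm, (channel_period(c) for c in channels), 1)
```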

The point of this story is that we are really good at identifying patterns. However, there is sometimes a bit of stigma attached to just putting the data in a spreadsheet and plotting it. Sometimes it is worth offloading visualisation to a program designed for exactly that, to help your brain manipulate the concept from a grounded position.

Discussion Points

  • Which problems have led you down the rabbit hole?
  • What’s a good technique for catching this habit sooner?
  • When has chasing a problem led to an interesting insight?
