Often, when building complex models from data, it can be useful to start with a least-squares-optimised model fitted to experimental data as a baseline.

I will consider a fairly straightforward example of instrument data that appears to fit a mixture model comprising three Gaussian distributions and a single Chi-squared distribution. The choice of these distributions is specific to the domain being considered; in general, you could use any mixture of functions relevant to your dataset.

The instrument data looks like this,

*Instrument data*

The plot looks quite asymmetric with a long tail beyond 1.5. The black vertical lines along the x-axis are a rug plot that can be instructive in understanding the spread of the data over the x domain.

In this example, we assume that this dataset can be fitted by three Gaussians, which can largely account for the narrow peak just after x = 0, and a Chi-squared distribution (with a degrees-of-freedom value > 1), which may account for the long tail beyond x = 1.5. Let’s set this up,

```
# f1 = k1*np.exp(-(x-m1)**2 / (2*s1**2))
# f2 = k2*np.exp(-(x-m2)**2 / (2*s2**2))
# f3 = k3*np.exp(-(x-m3)**2 / (2*s3**2))
# f4 = k4*chi2.pdf(x, 4, m4, s4)
```

I’ll be using the least_squares function from scipy.optimize to perform the least squares fitting of this model. This requires setting up the model as follows,

```
import numpy as np
from scipy.stats import chi2

def objective(z):
    y = np.exp(logprob)  # instrument data (x and logprob come from the earlier KDE step)
    k1, m1, s1 = z[0], z[1], z[2]
    k2, m2, s2 = z[3], z[4], z[5]
    k3, m3, s3 = z[6], z[7], z[8]
    k4, m4, s4 = z[9], z[10], z[11]
    f1 = k1*np.exp(-(x-m1)**2 / (2*s1**2))
    f2 = k2*np.exp(-(x-m2)**2 / (2*s2**2))
    f3 = k3*np.exp(-(x-m3)**2 / (2*s3**2))
    f4 = k4*chi2.pdf(x, 4, m4, s4)
    return y - (f1+f2+f3+f4)
```

In this context, y is the instrument data from the plot above (here, the KDE values np.exp(logprob)). The objective function thus defined will be subjected to a least-squares minimization that attempts to reduce the Euclidean distance between y and f1+f2+f3+f4 (the model).

The parameters k1 through s4 (twelve in total) will be adjusted by the least_squares function to obtain the smallest distance between the data and the model.

To begin the process, we will need to specify some initial conditions or a “best guess” for the parameters k1 through s4, as well as some bounds. I’ve added some reasonable bounds, and included where I think the parameters should lie (in order).

```
initial_values = [1.2, 0.2, 0.1, 0.3, 0.4, 0.1, 0.3, 0.7, 0.2, 0.5, 0.5, 0.5]
bounds=([0,0,0,0,0,0,0,0,0,0,0,0],
[np.inf,np.inf,np.inf,np.inf,np.inf,np.inf,np.inf,np.inf,np.inf,np.inf,np.inf,np.inf])
```

The least_squares function in scipy has a number of input parameters and settings you can tweak depending on the performance you need, as well as other factors. It is worth considering which algorithm the function uses to perform the minimization, as the best choice depends on the type of problem you are working on; it is always useful to check the latest documentation prior to implementation in code. Note the use of a tuple for the bounds variable (a min array and a max array combined to form a tuple).

For my problem, I used the Trust Region Reflective algorithm, which is suitable for large sparse problems with bounds. Calling least_squares,

`result = least_squares(objective,initial_values,method='trf',bounds=bounds)`
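least_squares returns an OptimizeResult, and it can be worth checking result.success and result.cost before trusting result.x. Here is a tiny self-contained sketch on a toy linear problem (the data and parameter names are invented for illustration):

```
import numpy as np
from scipy.optimize import least_squares

# Toy problem: fit y = a*x + b to exact data
x = np.linspace(0, 1, 50)
y = 2.0 * x + 1.0

def residuals(z):
    return y - (z[0] * x + z[1])

result = least_squares(residuals, [0.0, 0.0], method='trf',
                       bounds=([-10, -10], [10, 10]))
print(result.success, np.round(result.x, 3))  # convergence flag and fitted [a, b]
```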

I like to write my results into variables so that I may plot the optimized model against the data to visually inspect how closely my model resembles the experimental/instrument data.

```
k1 = result.x[0]
m1 = result.x[1]
s1 = result.x[2]
k2 = result.x[3]
m2 = result.x[4]
s2 = result.x[5]
k3 = result.x[6]
m3 = result.x[7]
s3 = result.x[8]
k4 = result.x[9]
m4 = result.x[10]
s4 = result.x[11]
f1 = k1*np.exp(-(x-m1)**2 / (2*s1**2))
f2 = k2*np.exp(-(x-m2)**2 / (2*s2**2))
f3 = k3*np.exp(-(x-m3)**2 / (2*s3**2))
f4 = k4*chi2.pdf(x, 4, m4, s4)
```

Plotting the model against the data,

```
plt.plot(x,f1,label='Component [1]',color='darkorange')
plt.plot(x,f2,label='Component [2]',color='yellow')
plt.plot(x,f3,label='Component [3]',color='green')
plt.plot(x,f4,label='Component [4]',color='red')
plt.plot(x,(f1+f2+f3+f4),label='Combined Components',color='black')
plt.fill_between(x, np.exp(logprob), alpha=0.2, label='KDE')
plt.xlabel('RUWE', fontsize=12)
plt.ylabel('Frequency',fontsize=12)
plt.legend()
```

Looks like a reasonable fit,

*Model vs data*

This is a fairly typical workflow for this kind of problem. If the model does not fit the data, it is often due to poor model selection, poor selection of initial conditions and bounds, or both! In the case of poor model selection, it makes sense to go back and examine the context or domain knowledge of the problem in order to suggest a more robust model. In the case of poor initial values and bounds, you may need to adjust them by trial and error until a reasonable fit is achieved.
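To make the whole loop concrete, here is a compact, self-contained sketch of the workflow on synthetic data — one Gaussian plus one chi-squared component rather than the full four-component model, and all values invented for the demo:

```
import numpy as np
from scipy.optimize import least_squares
from scipy.stats import chi2

rng = np.random.default_rng(0)
x = np.linspace(0.0, 5.0, 500)

# Synthetic "instrument" curve: one Gaussian plus one chi-squared component
y = (1.0 * np.exp(-(x - 0.5)**2 / (2 * 0.2**2))
     + 0.4 * chi2.pdf(x, 4, 0.0, 0.8)
     + rng.normal(0.0, 0.01, x.size))

def objective(z):
    k1, m1, s1, k4, m4, s4 = z
    f1 = k1 * np.exp(-(x - m1)**2 / (2 * s1**2))
    f4 = k4 * chi2.pdf(x, 4, m4, s4)
    return y - (f1 + f4)

z0 = [0.8, 0.4, 0.3, 0.3, 0.1, 1.0]          # rough initial guesses
bounds = ([0, 0, 1e-3, 0, 0, 1e-3], [np.inf] * 6)
result = least_squares(objective, z0, method='trf', bounds=bounds)
print(result.success, np.round(result.x, 2))
```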

I hope that this post was instructional. See you on the next one!


This post combines a few basic techniques in order to generate some simulated data that follow the distribution of a given probability density function (p.d.f).

Let’s assume that we have a p.d.f of the form,

```
import numpy as np

def pdf(x):
    F = np.exp(-x**2/2)
    return F
```

Plotting this to verify,

```
x = np.linspace(-8,8,100)
plt.plot(x,pdf(x))
```

The choice of function is arbitrary; I’m using a Gaussian p.d.f here for simplicity, as a toy problem.

The goal now is to generate some simulated data points that follow the same distribution as, say, the p.d.f above (or equivalent). Let’s figure out how to do this!

Inverse transform sampling (ITS)¹ is a robust technique that can generate a set of samples based on a given p.d.f, thus delivering a simulated dataset that follows the parent distribution.

ITS involves taking the integral of the given p.d.f and using a fairly intuitive method to generate the required samples.
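For a quick feel for how this works, here is a minimal numerical ITS sketch for the toy p.d.f above: build the CDF on a grid, normalise it, and invert it by interpolation (the grid size and sample count are arbitrary choices):

```
import numpy as np

def pdf(x):
    return np.exp(-x**2 / 2)  # the toy p.d.f from above (unnormalised)

# Build the CDF numerically on a grid, normalise it, then invert by interpolation
grid = np.linspace(-8, 8, 2001)
cdf = np.cumsum(pdf(grid))
cdf /= cdf[-1]

rng = np.random.default_rng(42)
u = rng.uniform(size=10_000)        # uniform samples on [0, 1]
samples = np.interp(u, cdf, grid)   # inverse-CDF lookup

print(round(samples.mean(), 2), round(samples.std(), 2))  # ≈ 0 and ≈ 1
```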

However, as far as I’m aware, popular Python packages such as SciPy and Numpy do not have an ITS implementation that comes out of the box in the form of a Python class or function.

As an alternative, there are some pretty neat ITS implementations built on SciPy and Numpy. For example, Peter Will’s implementation is great. I’ve tested it out on a few cases over the last couple of weeks and it worked really well for me. Peter’s code is well documented and getting the package set up using his Github repo takes less than five minutes. Peter has also provided a helpful getting started guide in the form of a Jupyter/ipython notebook.

I also came across some great conversations about ITS on Github as well as a feature request to SciPy. Have a look.

If you don’t want to use ITS and integrate your way through the problem, here’s a great alternative: rejection sampling.

This involves generating a large number of random data points in a 2D plane (assuming a 2D problem) covering the given p.d.f, then taking each random point, checking its y-value against the y-value of the p.d.f at the same x, and storing the corresponding x-value if y(random point) < y(p.d.f).

Doing this over the full set of points will generate a set of x-values that follow the p.d.f. No integration required!

Let’s implement this in Python.

Let’s begin by specifying our p.d.f. I’m taking a slightly more complicated example for this implementation but again, you can use a p.d.f of choice.

```
import numpy as np

def pdf(x):
    k1 = 120.92391503
    k2 = 74.32971291
    k3 = 10.6030466
    k4 = 6.57300503
    K = 434.34648701
    f1 = k1*np.exp(-(x-8.16388650e-01)**2 / (2*(2.55854313e+00)**2))
    f2 = k2*np.exp(-(x-1.03579461e+00)**2 / (2*(1.14453980e+00)**2))
    f3 = k3*np.exp(-(x-3.15216216e+00)**2 / (2*(2.99104497e+00)**2))
    f4 = k4*np.exp(-(x-3.72954147e+00)**2 / (2*(6.86102440e-01)**2))
    F = K*(2*np.exp(x)/np.sqrt(2*np.pi))*np.exp((-1/2)*np.exp(x)**2)
    return F+f1+f2+f3+f4
```

Plotting once again,

```
x = np.linspace(-10,10,100)
plt.plot(x,pdf(x))
```

The next step is to generate some random points that can cover the entire grid over which the p.d.f is defined in the plot above.

I found that the Python package Shapely can be leveraged to generate these points, given a bounded rectangular region as the grid above. Shapely has polygon objects which can be manipulated to generate points within any shape you can think of! Shapely is available via pip or Anaconda’s package manager. Importing Shapely and defining the rectangular space (polygon) over which to generate random points,

```
import numpy as np
from shapely.geometry import Polygon, Point
poly = Polygon([(-10, 0), (10, 0), (10, 400),(-10, 400)])
min_x, min_y, max_x, max_y = poly.bounds
```

Developing a function that generates random points within this polygon,

```
import random

def random_points_within(poly, num_points):
    min_x, min_y, max_x, max_y = poly.bounds
    points = []
    while len(points) < num_points:
        random_point = Point([random.uniform(min_x, max_x), random.uniform(min_y, max_y)])
        if random_point.within(poly):
            points.append(random_point)
    return points
```

Let’s generate 100,000 random points in this space,

```
points = random_points_within(poly, 100000)
#sampled data points
xs = [point.x for point in points]
ys = [point.y for point in points]
```

Plotting to verify,

```
plt.scatter(xs, ys,alpha=0.2, label='Random Data Points')
x = np.linspace(-10,10,100000)
plt.plot(x,pdf(x),color='r',label='PDF')
plt.xlabel('X', fontsize=12)
plt.ylabel('Frequency',fontsize=12)
plt.legend()
```

I like to work with data frames, so I’m going to implement the core logic that selects the relevant data points using Pandas. Feel free to use whatever data structure you like at this point,

```
import pandas as pd

df = pd.DataFrame({'xs': xs, 'ys': ys}, index=None)
#using a list
l = []
for i in range(len(df)):
    if df.loc[i,'ys'] < pdf(df.loc[i,'xs']):
        l.append(df.loc[i,'xs'])
under_curve = np.asarray(l)
#plotting the results
plt.hist(under_curve, bins=100, alpha=0.5, label='Simulated Data')
plt.xlabel('X', fontsize=12)
plt.ylabel('Frequency', fontsize=12)
plt.legend()
```

And that’s it! Since the accepted points are uniformly distributed under the p.d.f curve, the simulated x-values generated using this method follow the given p.d.f.
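For reference, the same accept/reject logic can be written in a few lines of pure NumPy, without Shapely or Pandas. A sketch using the earlier toy Gaussian p.d.f (the domain and the bounding height of 1 are chosen by hand):

```
import numpy as np

def pdf(x):
    return np.exp(-x**2 / 2)  # toy p.d.f (unnormalised standard normal)

rng = np.random.default_rng(1)
n = 100_000
xs = rng.uniform(-8, 8, n)        # random x over the p.d.f's domain
ys = rng.uniform(0, 1.0, n)       # random y up to the p.d.f's maximum (1 here)
under_curve = xs[ys < pdf(xs)]    # keep x where the point falls under the curve

print(under_curve.size, round(under_curve.mean(), 2), round(under_curve.std(), 2))
```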

[1] Olver, Sheehan, and Alex Townsend. “Fast inverse transform sampling in one and two dimensions.” *arXiv preprint arXiv:1307.1223* (2013).

I’ve been working in the log domain over the last couple of weeks, specifically using the natural logarithm, denoted by “ln”. Life has been easier this way.

The dataset I’m working on has a component driven by a random variable X that is distributed normally (a Gaussian distribution) with mean = 0 and standard deviation = 1.

In my last post, I transformed this dataset entirely to the log domain by applying the transformation x → ln(x) with numpy and Python,

```
import numpy as np
df_log = np.log(df) # df = my dataframe
```

In case you missed it, the resulting discussion around this can be found here.

The component that makes up the bulk of the samples in this data is a good old fashioned Gaussian/normal distribution whose probability density function (p.d.f) is given by,

*The OG*

The mean, mu, and standard deviation, sigma, are 0 and 1, which gives us the simplest version of the normal distribution, the so-called *standard* normal distribution,

*Standard normal distribution*

Now, my random variable X has been transformed via the log operator to ln(X). What impact does this have on the distribution of X?

Well, for starters ln(X) represents a *transformed* random variable and the probability density function associated with the original random variable, X will be transformed into a *new probability density function*, which will look quite different to its progenitor. This is a crucial point.

So the question becomes: if X is represented by the p.d.f N(0,1), *then what is the p.d.f of ln(X)?*


There are many ways to answer this question but I used a fairly simple “analytical” approach which I thought would be super useful for anyone going through a similar process (no Taylor expansions here!).

So here goes!

We know that the integral of the p.d.f of X over its domain equals 1. Thus we write,

$$\int_{-\infty}^{\infty} \frac{1}{\sqrt{2\pi}} e^{-x^{2}/2} \, dx = 1$$

I’m going to consider x > 0 for this derivation, without loss of any generality. Note that half of the area under the p.d.f curve is the integral from 0 to infinity (x > 0), and the other half comes from x < 0. Since integrating the p.d.f is equivalent to finding the area under the curve, for x > 0 we have, in simple terms **(1)**,

$$\int_{0}^{\infty} \frac{1}{\sqrt{2\pi}} e^{-x^{2}/2} \, dx = \frac{1}{2}$$

Let’s consider the transformation. We need to transform both the variable of integration and the limits of the integration. Let Y be a new random variable where Y = ln(X), i.e. X = e^Y. Therefore,

$$x = e^{y}, \qquad dx = e^{y} \, dy$$

When x → 0, y → −∞, and when x → ∞, y → ∞.

Substituting in **(1)** above, we get,

$$\int_{-\infty}^{\infty} \frac{1}{\sqrt{2\pi}} \, e^{-e^{2y}/2} \, e^{y} \, dy = \frac{1}{2}$$

Rearranging a bit **(2)**,

$$\int_{-\infty}^{\infty} \frac{2 e^{y}}{\sqrt{2\pi}} \, e^{-e^{2y}/2} \, dy = 1$$

(I know, I know, I should have rationalised the denominator, but I’m an engineer doing physics and I don’t want to redo the LaTeX and my python code so maybe let me off the hook please?)

What does this imply? If you look at **(2)** closely, you’ll notice that the function of y under the integral integrates to 1 over the entire domain of y. This means that this function must be the p.d.f of y!

In other words, since Y represents the log-transformed random variable ln(X), the p.d.f of ln(X) must be,

$$f_Y(y) = \frac{2 e^{y}}{\sqrt{2\pi}} \, e^{-e^{2y}/2}$$

I’ve played around with the notation here and “substituted” y with the symbol x for increased readability, giving **(3)**,

$$f(x) = \frac{2 e^{x}}{\sqrt{2\pi}} \, e^{-e^{2x}/2}$$

That’s a rad looking p.d.f! It’s got exponents e-*verywhere!* 😂

So what does this p.d.f look like when plotted? If the normal distribution looks like a bell curve, what might we expect from **(3)**?

To answer this question, we turn to python and matplotlib.

```
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns #makes plots pretty
x = np.linspace(df_log.min(), df_log.max(), 1000)
# pick any sensible domain here; I'm using the domain of my dataset for simplicity
f = (2*np.exp(x)/np.sqrt(2*np.pi))*np.exp((-1/2)*np.exp(x)**2)
# this is expression (3) above, i.e. the p.d.f of ln(x)
plt.plot(x,f,label='p.d.f')
plt.ylabel('Frequency',fontsize=12)
plt.legend()
```

And voila!
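As a numerical sanity check on the derivation, we can compare expression (3) with a histogram of ln|X| for simulated X ~ N(0, 1) — taking the absolute value corresponds to the x > 0 folding used above (the sample size and grid are arbitrary choices):

```
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(0.0, 1.0, 200_000)
y = np.log(np.abs(x))  # log of the magnitude, matching the x > 0 folding

edges = np.linspace(-6, 2, 201)
hist, _ = np.histogram(y, bins=edges, density=True)
centres = 0.5 * (edges[:-1] + edges[1:])

# Expression (3): the derived p.d.f of ln|X|
f = (2 * np.exp(centres) / np.sqrt(2 * np.pi)) * np.exp(-0.5 * np.exp(centres)**2)

print(round(float(np.max(np.abs(hist - f))), 3))  # should be small
```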

If anyone is curious, this is the pen and paper version of the derivation above. Note that the pen and paper version is for a more general case of a normal distribution. See you on the next one!

*Pen and paper, more general version*

Many thanks to Nuzhi Meyen for his insights and Kasun Fernando for spotting an error in my LaTeX.

I’ve been pretty busy working with some data from an experiment. I’m trying to fit a subset of the data to a model distribution/distributions where one of the functions follows a normal distribution (in linear space). Sounds pretty simple right? Based on the domain knowledge of this problem, I also know that the data can probably be fitted by a mixture model and more specifically a Gaussian mixture model. Brilliant you say! Why not try something like,

```
from sklearn.mixture import GaussianMixture
model = GaussianMixture(*my arguments/params*)
model.fit(*my arguments/params*)
```

But try as I might, I couldn’t find parameters that would model the underlying processes that generated the data. I had all sorts of issues, from overfitting the data to nonsensical standard deviation values. Finally, after a lot of munging, reading and advice from my supervisor, I figured out how to make this problem work for me and move on to the next step. In this post I want to focus on why the log domain can be useful in understanding the underlying structure of the data, and how it can aid data exploration when used in conjunction with kernel density estimation (KDE) and KDE plots. Let’s look at this dataset in a bit more detail. Importing some useful libraries for later,

```
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
#plot settings
plt.rcParams["figure.figsize"] = [16,9]
sns.set_style('darkgrid')
```

I’ve made the plots a little bigger and I’m using seaborn which enables me to manage the plots a little better and simultaneously make them look good! Reading the CSV with the data and getting to the subset that’s relevant to this project,

```
df = pd.read_csv("my_path/my_csv.csv")
df_sig = df[df['astrometric_excess_noise_sig']>1e-4]
df_sig = df_sig['astrometric_excess_noise_sig']
#describing the data
df_sig.describe()
```

As you can see, the maximum is close to 7000 while the minimum is of the order 1e-4. This is a fairly large range as the difference between the smallest and the largest value in this data frame is of the order 1e+7. This is where I had a bit of a moment/brain fart. Let’s walk through my moment!

I tried a fairly naive plot of this data and realised that it looks like this,

```
plt.hist(df_sig, bins=150)
plt.xlabel('Astrometric Excess Noise Sigma', fontsize=12)
plt.ylabel('Frequency', fontsize=12)
```

So this is with something like 150 bins, and it should have been my first clue! The maximum value, and values that extend beyond a few hundred, have relatively few samples compared to the values that are much closer to zero or even in the tens.

After a LOT of blind alleys, I switched to the log domain.

Yes everyone, the log domain!

(If you want to know all the blind alleys I went down drop me a DM on Twitter and I’ll explain. I’m going to focus on the solution here instead!)

Why the log domain? (Specifically log of base e, or the natural logarithm.) If you look at the (hideous) histogram above, you’ll notice that the count is not “sensitive” enough to pick up the low-frequency, high-value samples that extend beyond a few tens on the x-axis. Furthermore, the domain knowledge indicated that this data might be due to three underlying processes and can potentially be explained by a mixture model of three components that map onto these processes. Sadly, this structure is not visible in the linear domain due to the massive spread in the data and the low frequency of some of the samples (which is to be expected in this kind of experiment).

Switching to the log domain (or log10 if you like!) can address this problem.

```
df_sig_log = np.log(df_sig)
#converting to a numpy ndarray here because I'll need it later
df_sig_log_np = df_sig_log.to_numpy()
plt.hist(df_sig_log_np, bins=150)
plt.xlabel('Astrometric Excess Noise Sigma', fontsize=12)
plt.ylabel('Frequency', fontsize=12)
plt.grid(True)
```

Ah, structure! I see you now. It looks like the data has a lot more structure than the linear plot was able to reveal. The domain knowledge indicates that one of the components of the Gaussian mixture model is a mean-zero Gaussian with a standard deviation of 1. This means that by about x = 3 on the linear scale the Gaussian should taper down significantly and approach zero (about 99.7% of the samples of such a Gaussian lie within |x| < 3). Converting to the log domain, ln(3) is approximately 1.0986, which means the samples contributing to this Gaussian should largely end by about 1.1 on the log domain plot.
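Both of those numbers are quick to verify (a one-off check, not part of the analysis pipeline):

```
import numpy as np
from scipy.stats import norm

print(round(norm.cdf(3) - norm.cdf(-3), 4))  # fraction of N(0, 1) within |x| < 3
print(round(np.log(3), 4))                   # where x = 3 lands on the log axis
```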

So what’s all this other stuff beyond 1.09…?

Well, some of it is the tail of the N(0,1) Gaussian but it looks like there’s a whole bunch of other data points with reasonably high frequencies.

*Drumroll*

These are the other components of the Gaussian mixture model as noted by the domain knowledge! *Voila!*

Now we’re ready to start exploring the data and understand the behaviour of each component and how they may fit this experimental dataset.

When fitting a mixture model, a kernel density estimation is a great way to start. Jake Vanderplas probably has one of the best writeups on how to perform a KDE in Python. The code may be a little dated, as we have moved on to newer versions of numpy, scipy etc. since his post was published, but it’s still an amazing resource as long as you tweak the code a little bit. His book is also highly recommended!

So what is a KDE? There are a whole bunch of resources you can use to understand what a KDE is and how it works; Jake Vanderplas, for example, states,

Kernel density estimation (KDE) is in some senses an algorithm which takes the mixture-of-Gaussians idea to its logical extreme: it uses a mixture consisting of one Gaussian component per point, resulting in an essentially non-parametric estimator of density.

This video is also pretty neat and gets to the point pretty quickly.

This method is great at illustrating how a KDE works. I’ve also added a rug plot to indicate the locations of the samples/experimental data. This is helpful when comparing a KDE, a histogram and the spread of the samples across the domain.

```
from scipy.stats import norm
data = df_sig_log_np
#I'm keeping the domain restricted to the domain of the data
x_d = np.linspace(df_sig_log.min(), df_sig_log.max(), 1000)
density = sum(norm(xi).pdf(x_d) for xi in data)
plt.fill_between(x_d, density, alpha=0.5)
#this is called a rug plot and indicates the location of the data on the x axis
plt.plot(data, np.full_like(df_sig_log_np, -0.1), '|k', markeredgewidth=1)
plt.xlabel('Astrometric Excess Noise Sigma Log - KDE', fontsize=12)
plt.ylabel('Frequency',fontsize=12)
```

*KDE using scipy.stats norm*

The KDE looks a lot like our original log domain histogram! You’ll notice that the peak values of the plots don’t agree with each other. This is a result of the bandwidth of the KDE and/or the bin size of the histogram and is definitely not a deal-breaker!
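The bandwidth effect is easy to see with scipy.stats.gaussian_kde, whose bw_method argument scales the kernel width (the two-component synthetic data here is purely illustrative):

```
import numpy as np
from scipy.stats import gaussian_kde

rng = np.random.default_rng(7)
data = np.concatenate([rng.normal(0.0, 1.0, 500), rng.normal(4.0, 0.5, 200)])

grid = np.linspace(-4, 7, 300)
narrow = gaussian_kde(data, bw_method=0.1)(grid)  # small bandwidth: sharp peaks
wide = gaussian_kde(data, bw_method=1.0)(grid)    # large bandwidth: smoothed out

print(round(float(narrow.max()), 3), round(float(wide.max()), 3))
```

A narrower bandwidth produces taller, sharper peaks; a wider one smears the components together, which is why the KDE peak heights above need not match the histogram.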

Let’s superimpose both plots,

```
from scipy.stats import norm
data = df_sig_log_np
x_d = np.linspace(df_sig_log.min(), df_sig_log.max(), 1000)
density = sum(norm(xi).pdf(x_d) for xi in data)
plt.fill_between(x_d, density, alpha=0.5, label='KDE')
plt.plot(data, np.full_like(df_sig_log_np, -0.1), '|k', markeredgewidth=1,label='Observational Samples')
plt.hist(df_sig_log_np, bins=25, color='coral', alpha=0.5, label='Observational Hist')
plt.xlabel('Astrometric Excess Noise Sigma Log', fontsize=12)
plt.ylabel('Frequency',fontsize=12)
plt.legend()
```

There you have it, folks: if you have a dataset (often from an experiment) with samples across a large domain AND low-frequency values at the higher (or lower) end of the domain, converting to the log domain may be for you!

Tune in for my next post as I explore KDE methods in detail and look at a few other popular approaches to visualise the KDE, all in the log domain of course!

See you on the next one!

Our exploration starts with some publicly available data via the CNEOS API; specifically, I will be examining the following dataset.

Setting the following values, “Observed anytime”, “Any impact probability”, “Any Palermo scale” and “Any H”, returns a database query which at the time of this writing produces a dataset with 990 rows. I have converted the dataset to CSV and connected it to a Google Colab notebook, making a runnable Python 3.7 notebook with minimal effort. I’ll put a link to the CSV file I used here. Feel free to use a Python 3.7 setup of your choice!

The dataset that has been converted to CSV is a specialised subset of NEO data that focuses on Impact Risk (the Sentry dataset). Since I’m using Google Colab for this example, I had to use the file upload feature to upload the file from my PC to Colab prior to reading it as a CSV and storing it in a Pandas data frame.

```
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from google.colab import files
uploaded = files.upload()
```

Once the upload tool has completed its task, it’s time to convert the CSV into a Pandas data frame. I like to keep the upload separate from the CSV reading in case I have to modify this approach later. On a new cell run,

```
sentry_data = pd.read_csv('cneos_sentry_summary_data.csv')
sentry_data.head()
```

Let’s look at the results

*Fig 1 — First five rows of the Sentry Data*

The “Object Designation” column can be considered the unique identifier, and the “Year Range” column shows that the data is predictive, providing a forward-looking time frame.

Examining the next column, it becomes clear that the data provides predictive information regarding impacts that may occur during the specified time period. The next column is the Impact Probability. While this measure doesn’t singlehandedly inform us of the materiality of a potential impact, it does indicate the probability of potential impacts within the specified timeframe, and it appears to be a cumulative measure. For example, it can be inferred that for the object with designation 101955 Bennu, 78 impacts may occur between the years 2175–2199 with an extremely low cumulative probability of 0.00037.

The Impact Probabilities of the five objects considered seem to be fairly low so it would be interesting to order the data to determine which objects have relatively high probabilities of impact, particularly in year ranges much closer to 2020. The CNEOS website provides a neat legend for the columns under consideration.

*Fig 2 — CNEOS Sentry Data Legend*
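Ordering the data by impact probability is a one-line sort_values call. A sketch with a made-up miniature frame (the designations are real objects, but the probability and year values here are invented, and the real CSV’s column names may differ slightly):

```
import pandas as pd

# Toy stand-in for the Sentry data; values are made up for illustration
df = pd.DataFrame({
    'Object Designation': ['101955 Bennu', '2010 RF12', '29075 (1950 DA)'],
    'Impact Probability (cumulative)': [0.00037, 0.049, 0.00029],
    'Year Range': ['2175-2199', '2095-2122', '2880-2880'],
})

ranked = df.sort_values('Impact Probability (cumulative)', ascending=False)
print(ranked['Object Designation'].iloc[0])  # object with the highest probability
```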

Continuing our exploration, we use,

`sentry_data.info()`

It looks like there’s some missing data or NaNs in the Torino Scale column.

*Fig 3 — Data types summary*

Apart from that, there doesn’t seem to be any surprises. The Year Range column could do with some cleaning up, it would have been more useful if this range was broken into two columns as “Year From” and “Year To” as opposed to using a single string object to denote a range.
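That split is a one-liner with pandas string methods, assuming the range is delimited by a plain hyphen (the real file may use a different dash character, and the sample rows here are made up):

```
import pandas as pd

# Split a "Year Range" string column into numeric "Year From"/"Year To" columns
df = pd.DataFrame({'Year Range': ['2175-2199', '2095-2122']})
df[['Year From', 'Year To']] = df['Year Range'].str.split('-', expand=True).astype(int)
print(df['Year From'].tolist(), df['Year To'].tolist())
```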

The numerical values are of a float data type with integers being used for the number of potential impacts. Looking more closely at the descriptive statistics,

`sentry_data.describe()`

*Fig 4 — Descriptive stats*

There’s a lot to take in here. It’s worth noting that the maximum cumulative impact probability seems to be 0.049 or 4.9%. There also seems to be an object that may potentially impact Earth 1091 times. The Year Range and other non-numerical data are not included in this table as they are string objects.

Let’s plot some histograms. I’m setting the fig size to be fairly reasonable as I suspect that we may miss seeing some data on a smaller plot.

`sentry_data.hist(figsize=(25,25))`

*Fig 5 — Histograms*

The plots provide a richer view of the data than just descriptive statistics. A really cool point to note is that we can barely see the Maximum Cumulative Impact Probability (4.9%) on the histogram in this plot as the count may be too small (a few events).

Okay so we did some basic descriptive stats and got to know the data types and the context a little bit using the CNEOS legend and some basic Python functions.

In my next post, we will be exploring the dataset further and will gain a deeper understanding of measuring the materiality of potential impacts and other interesting concepts that come up in this type of astronomical work.

The notebook for this post can be found here.

Our Solar System is a strange place and there’s a lot we don’t know and don’t fully understand. There’s no better way to reflect on this point than to take a historical perspective. The discovery and characterisation of the planets and other bodies in our Solar System can serve as a great starting point. I’ve been spending some time on Solar System dynamics and thought I’d take a close look at asteroids, specifically the Main Asteroid Belt between Mars and Jupiter as well as Near Earth Asteroids (NEAs). There’s much to learn here so let’s dive in!

*Fig 2 — Asteroids are more abundant in the inner regions of the Solar System, (image credit, Murray, Carl D., and Stanley F. Dermott. Solar System Dynamics. Cambridge: Cambridge University Press, 2000)*

Taking a modern perspective, asteroids are minor planets and, as the classification suggests, they tend to be significantly smaller than planets or planetary moons. The term is more commonly used to describe the group of small objects that inhabit the inner portions of our Solar System. Around 200 years ago, very little was known about these small objects. The first asteroid to be discovered, Ceres (or 1 Ceres), was found only in 1801.

Ceres is a beautiful object, nearly spherical to the point where it was initially considered to be a planet. Ceres was subsequently classified as an asteroid in the mid-1800s and in 2006 was re-classified as a “dwarf planet”. The distinction is somewhat important, as Ceres has a considerably higher mass and radius compared to its smaller asteroid neighbours. The taxonomy can be a little confusing and for a long time some of these terms were used interchangeably; to this day it can be quite challenging to pin down a concrete definition that works for all situations and objects. Asteroids are not to be confused with comets, which have highly eccentric (“elongated”) orbits around the Sun, have spectral properties that differ from those of asteroids, and are known to originate in the outer regions of the Solar System. We will look at comets in a little more detail when we examine the Near-Earth Object (NEO) data in posts to come.

It is worth mentioning that the Titius-Bode “Law” played an important role in the discovery of asteroids, but Giuseppe Piazzi, who discovered Ceres, did so by accident without the use of this empirical rule. The Titius-Bode “Law” probably deserves a post of its own, as it has a colourful and rich history. This “empirical rule” has since been debunked, most notably by the excellent statistical analysis by Dermott in 1972/1973³, but alas, that is for another post and another time. If you’ve got the inclination and the background, I highly recommend reading the paper as it is a masterclass in statistical thinking and analysis. Since the discovery of Ceres, much progress has been made in understanding the dynamics and composition of the asteroids¹, and there has been significant commercial interest in mining these objects for resources², as asteroids are known to contain carbon, silicon, iron, nickel as well as other metals and rocky minerals. Astronomers estimate that there may be billions of asteroids with dimensions > 100 m in our Solar System alone.

Several metrics can be used to characterise asteroids¹ and this data can, in turn, be used to classify them. The criteria that are most commonly used to classify asteroids are;

- Orbital characteristics
- Spectral (reflectance) features

Studying the orbital characteristics can be an excellent starting point in gaining a deeper understanding of asteroids.

*Fig 3 — This image depicts the two areas where most of the asteroids in the Solar System are found. The binary asteroid 288P is part of the asteroid belt. Credit: ESA/Hubble, M. Kornmesser*

An important point to bear in mind when considering the orbital characteristics of the objects in the Main Asteroid Belt is that the belt has structure. Much like the rings of Saturn, the Main Belt has zones or gaps that are virtually free of objects. These gaps are known as Kirkwood Gaps⁴ and are a result of Jupiter’s strong gravitation. There’s a fascinating relationship between these gaps and the orbital periods/radii in comparison to Jupiter (see the ratios in Fig 2 above), but we will not explore these features in this post. There’s a common misconception (perpetuated by science fiction) that asteroids in the Main Belt tend to be densely packed. In reality, asteroids are fairly sparsely distributed, with distances of the order of millions of kilometres separating them from each other. Astronomers have also grouped asteroids into families based on their orbital characteristics, and it is hypothesised that certain families were formed as a result of the destruction of larger composite objects, as these families also tend to have similar compositions⁵.

Another misconception is that all asteroids are solid, rocky objects. However, analysis of the data provided by the Hayabusa spacecraft on the asteroid Itokawa has indicated that asteroids can be loosely coupled collections of debris, with the overall structure of the asteroid having a lower than expected density⁶.

This post was largely inspired by some data I’ve been looking at, and I realised that it can serve as an excellent starting point for anyone who wants to study asteroids. We will explore a specialist dataset and try to gain an understanding of the objects being considered using exploratory data analysis (EDA).

There has been considerable interest in detecting and studying Near-Earth Objects (NEOs) and Near-Earth Asteroids (NEAs), which make up the majority of NEOs. Near-Earth Asteroids are a special group of asteroids whose orbits can overlap or coincide with the orbit of the Earth. More specifically, these objects have a perihelion distance of less than 1.3 AU and, yes, this means that they may pose a significant threat of colliding with the Earth!
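The perihelion criterion above is simple enough to code up directly. A minimal sketch (the orbital elements for 433 Eros and 1 Ceres in the comments are approximate values, quoted for illustration):

```python
# Perihelion distance for an orbit with semi-major axis a (in AU) and
# eccentricity e is q = a * (1 - e). The NEA criterion described above
# counts an object as near-Earth when q < 1.3 AU.

NEA_PERIHELION_LIMIT_AU = 1.3

def perihelion(a, e):
    """Perihelion distance in AU for semi-major axis a (AU), eccentricity e."""
    return a * (1 - e)

def is_near_earth(a, e):
    """True if the orbit satisfies the q < 1.3 AU NEA criterion."""
    return perihelion(a, e) < NEA_PERIHELION_LIMIT_AU

# 433 Eros (a ~ 1.458 AU, e ~ 0.223) qualifies as an NEA;
# 1 Ceres (a ~ 2.77 AU, e ~ 0.08) does not.
```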

While open data has been made available to the wider community, the sources can be difficult to navigate if you don’t have a background in astronomy, databases and APIs. JPL’s Center for Near-Earth Object Studies (CNEOS) is a bit of an exception in this regard. This excellent resource hosts several publicly available datasets which can be used to further our understanding of asteroids, albeit only the ones that get close to Earth and its neighbours.

In the next post of this series, we will examine a subset of the observational data recorded and catalogued by CNEOS. Many excellent software packages have been made available to the community; however, I will assume that the reader will be using a fairly standard Python 3.7 distribution (Anaconda, for example) without any specialised packages or environments.

[1] Bell, Jeffrey F., Donald R. Davis, William K. Hartmann, and Michael J. Gaffey. “Asteroids-The big picture.” In Asteroids II, pp. 921–945. 1989.

[2] Badescu, Viorel, ed. Asteroids: Prospective energy and material resources. Springer Science & Business Media, 2013.

[3] Dermott, S. F. “Bode’s law and the preference for near-commensurability among pairs of orbital periods in the Solar System.” (1972).

[4] Wisdom, Jack. “The origin of the Kirkwood gaps-A mapping for asteroidal motion near the 3/1 commensurability.” The Astronomical Journal 87 (1982): 577–593.

[5] Lazzaro, Daniela, Thaı́s Mothé-Diniz, Jorge M. Carvano, Cláudia A. Angeli, Alberto S. Betzler, Marcos Florczak, Alberto Cellino et al. “The Eunomia family: a visible spectroscopic survey.” Icarus 142, no. 2 (1999): 445–453.

[6] Fujiwara, Akira, J. Kawaguchi, D. K. Yeomans, M. Abe, T. Mukai, T. Okada, J. Saito et al. “The rubble-pile asteroid Itokawa as observed by Hayabusa.” Science 312, no. 5778 (2006): 1330–1334.

Digital Ocean (DO) is a public cloud service provider and a good alternative to the more popular AWS, Azure and Google Cloud. I came across DO thanks to their excellent documentation, which, in my opinion, blows AWS out of the water. While AWS will try to sell you on getting their training and accreditation, DO assumes that you just want to learn straight away and start implementing.

Having said that, I hadn’t really implemented anything on DO; I’d just played around with something my friend Giles had put together for his Shiny server. Giles and I have been working on Programming for Policy, an online programming course aimed at policy professionals and analysts from around the world. While I’m more partial to Python when it comes to data analytics, I do believe that R can be a great starting point for policy analysts who are new to programming and want to work on statistical analysis, as opposed to programming for machine learning or software engineering. The good news is that we have had a LOT of interest over the last couple of weeks, especially since the in-person short course at Microsoft Reactor, Sydney.

One of the key features of this course is that we plan to be data-driven from the get-go. This means that we need to monitor engagement and other metrics so that we can course-correct (no pun intended!) if required. We decided to evaluate a number of platforms, both proprietary and open source, that could potentially host our content. More importantly, we needed to monitor engagement and other data points in real time. Finally, it came down to just two options: Open edX and Moodle. As a user and student, I was familiar with both platforms. Moodle is one of the pioneering open source learning management system (LMS) platforms; I had encountered Moodle back in my undergraduate days at the University of Moratuwa, and edX through all the MOOC courses I had taken.

After a bit more reading, we decided to trial Open edX since, at that point in time (2020), it had a bit of an edge over Moodle.

It’s no secret that Open edX is notoriously challenging to set up and manage. Additionally, since our course would be offered free of charge, we had a bunch of budget constraints to consider. For us, this meant that we needed to make Open edX work on Digital Ocean! Not an easy task by any measure.

Given that Open edX is difficult to install and manage, a number of developers have come up with some great solutions. Bitnami’s AWS image is a good example, and Regis Behmo has developed a Docker image that can make installing Open edX relatively painless.

The Docker option looked super promising, and Regis is a very nice guy who’s always happy to help. Now, I’m pretty good at deploying Docker images, but I’ve always had a problem with how opaque the process can be, especially for complex applications that rely on multiple services (the irony!). After trying multiple times to get the image up and running on a DO droplet, I had to call it quits. Most of the errors had to do with setting up the SQL database, and I couldn’t figure out why the process failed; the whole installation was fairly opaque. Forums were helpful up to a point. Honestly, I wouldn’t recommend this approach if you’re on DO.

The AWS image from Bitnami wasn’t going to work for us either, because we don’t have AWS credits to run the server. Since DO had given us some credits, I had to make Open edX work on it. I had to find a middle ground between installing Open edX from scratch manually (a nightmare) and running a Docker image (not enough control).

Challenge accepted!

*image courtesy lifelib*

I’ve been researching new actuarial models (as one does) for life insurance, in order to better understand what’s been done in the past and where we are heading in terms of novel actuarial and underwriting models. A key aspect of this work involves evaluating new tools and experimenting with them, with a particular focus on building APIs where possible.

Recently, I came across some work by Fumito Hamamura in Japan. Hamamura is the author of modelx and lifelib, two open source libraries that allow you to build actuarial models using Python 3.

While modelx itself is pretty interesting, particularly for actuaries who are transitioning from the traditional spreadsheet universe, lifelib is what excites me at the moment.

The great thing about the package is that it allows for greater model integrity right out of the box, as compared to, say, a spreadsheet that has to change hands multiple times between stakeholders and business units within the insurance value chain. The package can be set up on suitable centralized/local or cloud-based infrastructure. User access can then be moderated by a technology layer of choice. It goes without saying that version control and governance of the models written in lifelib can be more robust.

The models are also more extensible, as they take advantage of object orientation in Python.

From a philosophical perspective, I’m totally on board with pythonic actuarial science. Much has been done in the R community in terms of building packages that support the various needs of the actuarial community. It’s great to see attention now being directed towards building Python libraries that will unlock some of the best aspects of software engineering and enable actuaries and other stakeholders in the insurance value chain to build more robust and extensible models.

Quick tip: if you (like me) love Anaconda, make sure to create a new Python 3.7 environment to experiment in, using

```
>> conda create --name myenv python=3.7
>> conda activate myenv
```

(replace myenv with a name of your choice)

Also note that with the Anaconda package manager, the following command will not pull the lifelib package into your env:

`>> conda install lifelib`

Instead, within myenv, use

`>> pip install lifelib`

I’ll be experimenting with lifelib over the next few weeks. Wish me luck!

Public APIs for climate data have become increasingly popular as journalists, academics, researchers, and enthusiasts try to make sense of historical climate data and forecasted climate data in the context of rapidly accelerating climate change.

In addition to academic papers from around the world, the most authoritative data on the subject of climate change comes from the Intergovernmental Panel on Climate Change (IPCC). While the IPCC website carries a complete dataset that includes comprehensive results from the numerous experiments and modeling exercises, navigating the site can be a little cumbersome for the uninitiated.

The World Bank Climate API is a great alternative that can be used to obtain broad forecasted statistics, by country, for temperature and precipitation. I think this API is a great place for anyone who wants to get started with public climate data APIs and practice putting datasets together and analyzing them to look for basic statistical patterns.

My personal computer needed to be upgraded to Python 3.7, and I thought this would be a great opportunity to write a script to make accessing data from this API a lot easier to manage, particularly when it comes to fetching the data with a GET request and writing it to a file, whether JSON or CSV.

The script currently allows a user to write the data to a CSV file or JSON file. Since the API supports XML, I may update the code with an XML option as well.
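To give a feel for the shape of such a script, here is a minimal sketch. Note that the base URL, endpoint pattern and variable codes (`tas` for temperature, `pr` for precipitation) are assumptions based on the World Bank Climate API’s historical documentation, and the helper names are my own:

```python
import csv
import json
import urllib.request

# Assumed base URL for the World Bank Climate API (based on its
# historical documentation; treat as illustrative, not authoritative).
BASE_URL = "http://climatedataapi.worldbank.org/climateweb/rest/v1/country"

def build_url(stat, var, start, end, iso3, fmt="json"):
    """Assemble a request URL, e.g. annual averages ("annualavg") of
    temperature ("tas") or precipitation ("pr") for an ISO3 country code."""
    return f"{BASE_URL}/{stat}/{var}/{start}/{end}/{iso3}.{fmt}"

def fetch_json(url):
    """Perform the GET request and decode the JSON payload."""
    with urllib.request.urlopen(url) as resp:
        return json.loads(resp.read().decode("utf-8"))

def write_json(records, path):
    """Write the records out as pretty-printed JSON."""
    with open(path, "w") as f:
        json.dump(records, f, indent=2)

def write_csv(records, path):
    """Flatten a list of dicts (one per record) into a CSV file."""
    if not records:
        return
    with open(path, "w", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=list(records[0].keys()))
        writer.writeheader()
        writer.writerows(records)

# Example usage (requires network access):
# data = fetch_json(build_url("annualavg", "tas", 1980, 1999, "AUS"))
# write_json(data, "aus_tas.json")
# write_csv(data, "aus_tas.csv")
```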

The script can be found here. Happy gathering.

Medium is, and has been, a pretty great tool for getting up and running quickly. What I love most about Medium is its simplicity, especially when it comes to putting up a post after writing something on my phone using the Medium app.

After Medium started closing down some features like embedded posts, I became less interested and took down my content to figure out what to do next.

Josh Nicholas introduced me to Blot recently and I loved experimenting with it. I’m happy to announce that I’ve since connected my domain to Blot where I will be posting about some of the more interesting projects I’ve done in the past that most people may not be aware of. I’ve also got a bit of writing I want to put out there and hopefully, I’ll get a chance to do that too.

Blot is great. I can write posts in markdown on VSCode while I’m working and JotterPad on my phone when I’m traveling. The content can be “hosted” on Dropbox or GitHub. There are some great templates, Google Analytics support and a “write drafts before you publish” feature. Blot does not support custom server-side code and is probably best suited for blogging and setting up a site to highlight your work. For me, right now Blot is the perfect platform to host https://www.praveenjayasuriya.com/

We took a three-pronged approach based on the MIT Model of Innovation (Market, Implementation and Technology) and were able to take this product to market in the US within a relatively short 12-month period. I served as the lead engineer and focused on platform and product development. The patented platform was licensed to THINX and is now sold under the SPEAX brand.

A detailed description of this project can be found on my Work page.

Co-team member Tharindu Athauda and I developed early hypotheses for the product, including developing technology specifications, testing those specifications by working with technology and research partners such as Holst Centre, and developing early commercialization strategies. The innovation consultancy 4iNNO played an advisory role on this project, and the project was supervised by Dr. Dilruk Yahathugoda at MAS Innovation.

Access to the platform and related intellectual property was provided to key brand partners via a B2B licensing model. Subsequently, the platform was used to create a direct-to-consumer (B2C) brand called Illumio, owned by MAS Holdings, the parent company of MAS Innovation.

A detailed description of this project can be found on my Work page.

*Note: due to the commercially sensitive nature of this project, key details have been intentionally omitted in line with disclosure obligations.*

With a research and development phase lasting roughly a year, this project aimed to build software that could synchronize two or more DICOM image clients over a low-bandwidth network, enabling remote collaboration.

The user experience goal was to provide a Google Docs like experience to health professionals using the platform.

The synchronization platform was built to work with the open source Java based image processing tool ImageJ.

The work was published in The 15th International Conference on Biomedical Engineering (4th to 7th December 2013, Singapore), and the paper can be accessed here, via Springer.

A detailed description of this project can be found on my Work page.

A detailed description of this project can be found on my Work page.
