**Unsupervised Learning Using Principal Component Analysis (PCA) and Clustering with K-means and Gaussian Mixture Models (GMM)**

**Udacity Project: Finding Customer Segments**

In this project, I analyzed a dataset from a wholesale distributor containing data on various customers’ annual spending amounts on 6 different product categories. I attempted to find some internal structure in the data in order to best describe the variation in the different types of customers that a wholesale distributor interacts with. Doing so would equip the distributor with insight into how to best structure their delivery service to meet the needs of each customer.

**Data Exploration and Preprocessing.**

I began by looking at some basic statistics for each of the spending categories. One of the first things I noticed was that all of the spending categories had a rightward skew, with means significantly higher than the medians (the 50% row in Fig 1).

Fig 1. Statistics for the different spending categories; spending amounts are in monetary units.
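The mean-versus-median check described above can be sketched in pandas. The values below are synthetic stand-ins for the dataset, chosen only to illustrate the pattern:

```python
import pandas as pd

# Synthetic right-skewed spending values standing in for the real dataset.
data = pd.DataFrame({
    "Fresh":   [1500, 2000, 2500, 3000, 30000],
    "Milk":    [800, 900, 1000, 1200, 20000],
    "Grocery": [500, 700, 900, 1100, 15000],
})

stats = data.describe()
# A mean well above the median (the "50%" row) suggests a rightward skew.
skewed = stats.loc["mean"] > stats.loc["50%"]
print(skewed)
```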

Next, I wanted to see whether there was any relationship between the spending categories, so I made a scatter plot matrix. It was apparent that some of the variables were significantly correlated, so I ran a linear regression on each pair of variables and noted the pairs with large R² values.

- Grocery – Detergents_Paper: R² = 0.855
- Grocery – Milk: R² = 0.530
- Detergents_Paper – Milk: R² = 0.438
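The pairwise R² check can be sketched as follows. Since each regression here is simple (one predictor), R² equals the squared Pearson correlation coefficient. The data below is a synthetic stand-in for two correlated categories, not the actual dataset:

```python
import numpy as np
import pandas as pd

# Synthetic correlated categories (stand-ins for Grocery and Detergents_Paper).
rng = np.random.default_rng(0)
grocery = rng.lognormal(mean=8, sigma=1, size=400)
detergents = 0.5 * grocery + rng.normal(scale=200, size=400)

df = pd.DataFrame({"Grocery": grocery, "Detergents_Paper": detergents})

# For simple linear regression, R^2 is the squared correlation coefficient.
r = df["Grocery"].corr(df["Detergents_Paper"])
print(f"R^2 = {r**2:.3f}")
```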

Fig 2. Matrix of scatter plots that lets us visually check for correlation between spending in different categories.

On the main diagonal of the scatter plot matrix we have histograms of each spending category. Looking at these histograms we see that there is a significant rightward skew to the data since customers do not spend less than zero. To remove this skew we take the natural logarithm of the data.

Fig 3. Scatter matrix of the natural log of the original data.
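The log transform can be sketched as follows, using synthetic right-skewed data standing in for a spending category:

```python
import numpy as np
import pandas as pd

# Lognormal values are right-skewed, like the raw spending data.
rng = np.random.default_rng(1)
fresh = pd.Series(rng.lognormal(mean=8, sigma=1, size=500), name="Fresh")

# Taking the natural logarithm removes the rightward skew.
log_fresh = np.log(fresh)

print(fresh.skew(), log_fresh.skew())
```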

Looking at the log transformed histograms we can see that some of the features like the Detergents_Paper category are bimodal. This is an indicator that there probably are different categories of customers with different spending habits.

**Principal Component Analysis (PCA)**

Since not all of the spending categories were independent, I decided to perform a principal component analysis to reduce the dimensionality of the dataset. This was implemented using sklearn, and the results can be seen in the table below.

Fig 4. Results of the Principal Component Analysis.

I chose to use the first two PCA dimensions, since together they explained 71.9% of the variance in the data, and using two dimensions made it easy to create clustering visualizations. If I had wanted more precise clustering, I could have used the first four dimensions, which explained 93.1% of the variance.
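The dimensionality reduction can be sketched with sklearn's `PCA` on synthetic data (the real project used the log-transformed customer data; the 440 × 6 shape mirrors the dataset, but the values here are made up):

```python
import numpy as np
from sklearn.decomposition import PCA

# Synthetic data: 440 samples, 6 features, mostly 2-dimensional structure.
rng = np.random.default_rng(2)
base = rng.normal(size=(440, 2))
X = base @ rng.normal(size=(2, 6)) + 0.1 * rng.normal(size=(440, 6))

pca = PCA(n_components=6).fit(X)

# Cumulative explained variance guides how many components to keep;
# in the project, the first two explained 71.9% and the first four 93.1%.
cumulative = np.cumsum(pca.explained_variance_ratio_)
print(cumulative)

# Project the data onto the first two principal components.
reduced = PCA(n_components=2).fit_transform(X)
print(reduced.shape)
```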

The first component weights heavily on Milk, Grocery, and Detergents_Paper, with smaller weights on Fresh and Frozen food. A customer scoring high in this component could be a convenience store or grocery store. A customer scoring low would look more like a restaurant, buying fresh and frozen food and fewer Grocery and Detergents_Paper items.

The second component could represent supermarkets since a customer that ranked high in this component would sell lots of everything and especially lots of fresh and frozen food relative to a convenience store. Another interpretation is that this component represents the size of the customer since it is positive in all features.

**Clustering**

I compared two different methods to cluster the data. The first clustering method used was **k-means** and the procedure for k-means is listed below.

1. Each cluster center is initialized randomly.
2. Each data point is assigned to the nearest cluster center using Euclidean distance.
3. Each cluster center is moved to the mean of all the data points assigned to it.
4. Steps 2 and 3 are repeated until no data point changes its cluster assignment.
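The steps above can be sketched as a minimal NumPy implementation (the project itself used sklearn's `KMeans`; this is for illustration only):

```python
import numpy as np

def kmeans(X, k, seed=0, max_iter=100):
    rng = np.random.default_rng(seed)
    # Step 1: initialize centers at k randomly chosen data points.
    centers = X[rng.choice(len(X), size=k, replace=False)]
    labels = np.full(len(X), -1)
    for _ in range(max_iter):
        # Step 2: assign each point to the nearest center (Euclidean distance).
        dists = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        new_labels = dists.argmin(axis=1)
        # Step 4: stop when no assignment changes.
        if np.array_equal(new_labels, labels):
            break
        labels = new_labels
        # Step 3: move each center to the mean of its assigned points
        # (keep the old center if a cluster ends up empty).
        centers = np.array([
            X[labels == j].mean(axis=0) if np.any(labels == j) else centers[j]
            for j in range(k)
        ])
    return labels, centers

# Sanity check on two well-separated synthetic blobs.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 0.5, (20, 2)), rng.normal(10, 0.5, (20, 2))])
labels, centers = kmeans(X, 2)
print(labels)
```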

The **Gaussian Mixture Model (GMM)** assumes that the data is generated by a mixture of Gaussian distributions. As opposed to the hard assignments of k-means, the GMM produces the probability that a given data point came from each underlying distribution. The number of underlying distributions is roughly analogous to the number of clusters in k-means.
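The soft assignments can be sketched with sklearn's `GaussianMixture`; the two synthetic groups below stand in for customers in the 2-D PCA space:

```python
import numpy as np
from sklearn.mixture import GaussianMixture

# Two synthetic groups of points (illustrative, not the project data).
rng = np.random.default_rng(3)
X = np.vstack([rng.normal(-2, 1, (100, 2)), rng.normal(2, 1, (100, 2))])

gmm = GaussianMixture(n_components=2, random_state=0).fit(X)

# Unlike k-means' hard labels, predict_proba gives the probability that
# each point was generated by each underlying Gaussian; rows sum to 1.
probs = gmm.predict_proba(X)
print(probs[0])
```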

To determine the appropriate number of clusters and to compare the effectiveness of k-means and GMM I used the average Silhouette Score of all the data points. From the sklearn website “The Silhouette Coefficient is calculated using the mean intra-cluster distance (a) and the mean nearest-cluster distance (b) for each sample. The Silhouette Coefficient for a sample is (b – a) / max(a, b).”
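A comparison loop like the one used to produce the table below can be sketched as follows, again on synthetic 2-D data standing in for the PCA-reduced customers:

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.mixture import GaussianMixture
from sklearn.metrics import silhouette_score

# Two well-separated synthetic blobs in place of the PCA-reduced data.
rng = np.random.default_rng(4)
X = np.vstack([rng.normal(-3, 1, (150, 2)), rng.normal(3, 1, (150, 2))])

km_scores, gmm_scores = {}, {}
for n in range(2, 5):
    km_labels = KMeans(n_clusters=n, n_init=10, random_state=0).fit_predict(X)
    gm_labels = GaussianMixture(n_components=n, random_state=0).fit_predict(X)
    # Average silhouette score over all samples, for each method.
    km_scores[n] = silhouette_score(X, km_labels)
    gmm_scores[n] = silhouette_score(X, gm_labels)
    print(n, km_scores[n], gmm_scores[n])
```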

| Number of Clusters | K-means Score | GMM Score |
| --- | --- | --- |
| 2 | 0.4191660832 | 0.3980290036 |
| 3 | 0.3934731915 | 0.3862818427 |
| 4 | 0.3303215508 | 0.3477241544 |
| 5 | 0.346912915 | 0.211466356 |
| 6 | 0.3605691213 | 0.2397653042 |
| 7 | 0.3656849207 | 0.3149063883 |
| 8 | 0.3556267606 | 0.1759933829 |
**Fig 5.** Average silhouette scores for the two clustering methods at different numbers of clusters.

**Clustering Results**

Looking at the silhouette scores, both k-means and GMM scored highest when clustering into two groups. K-means scored higher than GMM, so I would use k-means when making a discrete decision to place a customer into one category or the other.

Fig 6. Cluster assignments using 2 clusters and k-means.

Fig 7. Actual customer category assignments from the distributor.

As seen in figures 6 and 7 above, the classification computed by the k-means algorithm matched the distributor's actual customer assignments about as closely as could be expected. It makes sense that the retailers form the cluster that is high in the first PCA component, since that component is dominated by spending in the Grocery and Detergents_Paper categories. It is also evident, in both my cluster assignments and those provided by the distributor, that the second component made little difference to cluster assignment.