Monday, October 9, 2017

Big Data as a Tool for Better Master Data

High-quality Master Data is often decisive for the success of Analytics and Big Data projects. An analysis that starts with incomplete or incorrect Master Data will usually yield incorrect results. One can rightly claim that Master Data lays the foundation for Big Data. This relationship is fairly obvious and has been described many times (see [1], [2], and [3]).

The reverse connection is less obvious at first glance: analytical methods and Big Data can help to improve the quality of your Master Data. In the following, we would like to give you an example of this.

Relevance of Attributes for the Customer's Purchase Decision


With hundreds of thousands of item records, manual revision and enrichment quickly become very time-consuming. This makes it all the more important to focus on the relevant attributes. It would be pointless to put effort into improving the data quality of attributes that later turn out to be insignificant. If you succeed in identifying the relevant attributes, you can achieve a greater gain in quality with the same effort. But how do you know which attributes are the really important ones?

When assessing the relevance of attributes, you would normally draw on the knowledge and experience of the stakeholders by involving them directly. In e-commerce, however, involving stakeholders is rather difficult, as the customer is the stakeholder. And hardly any customer wants to take part in a survey asking whether this or that piece of information was the more important one for their purchase decision.

Find Relevance by Means of Choice-based Conjoint Analysis


Here statistics comes to your aid. As early as the 1970s, psychologists and market researchers developed a method called Choice-based Conjoint Analysis (CBCA), which estimates the "perceived benefit" of a single product characteristic from the customer's point of view. This method still works even if many other product features go into the purchase decision. By applying the CBCA, you can derive the benefit of each individual product feature from the total value of the product.

The method is based on the purchasing decisions of customers who have to choose between different but comparable products (e.g. smartphones with different amounts of memory, processor speed, etc.). Such a choice situation can be set up in an online shop by simple means, and the customer's behaviour can be reconstructed by analyzing the log files. Based on a statistical benefit and decision model, a formula with a number of free parameters is derived. This formula gives the probability with which the customer will decide on each of the products offered in the choice situation. The parameters are then adjusted iteratively until the calculated probabilities match the customers' actual behaviour as closely as possible (maximum likelihood method). In the end, each of the inferred parameters reflects the partial benefit (often called the part-worth) of a certain product characteristic. From the partial benefits, we can then deduce the relevance of an attribute in relation to the other attributes.
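To make the procedure more tangible, here is a minimal sketch in plain Python. It rests on assumptions of ours that the description above leaves open: the decision model is taken to be a multinomial logit with a linear benefit function, the parameters are fitted by simple gradient ascent on the log-likelihood, and relevance is measured as each attribute's share of the absolute partial benefits. The function names and the tiny data set (two yes/no attributes, "more memory" and "faster processor") are hypothetical and serve only as an illustration.

```python
import math

def choice_probabilities(alternatives, beta):
    """Assumed logit model: P(j) = exp(u_j) / sum_k exp(u_k), u_j = beta . x_j."""
    utilities = [sum(b * x for b, x in zip(beta, alt)) for alt in alternatives]
    top = max(utilities)  # subtract the maximum for numerical stability
    weights = [math.exp(u - top) for u in utilities]
    total = sum(weights)
    return [w / total for w in weights]

def log_likelihood(situations, beta):
    """Sum of log-probabilities of the alternatives the customers actually chose."""
    return sum(
        math.log(choice_probabilities(alts, beta)[chosen])
        for alts, chosen in situations
    )

def fit_part_worths(situations, n_features, steps=2000, learning_rate=0.5):
    """Maximum likelihood by gradient ascent on the averaged log-likelihood."""
    beta = [0.0] * n_features
    n = len(situations)
    for _ in range(steps):
        gradient = [0.0] * n_features
        for alts, chosen in situations:
            probs = choice_probabilities(alts, beta)
            for k in range(n_features):
                expected = sum(p * alt[k] for p, alt in zip(probs, alts))
                # observed attribute value minus the value the model expects
                gradient[k] += alts[chosen][k] - expected
        beta = [b + learning_rate * g / n for b, g in zip(beta, gradient)]
    return beta

# Hypothetical choice situations: (offered products, index of the chosen one).
# Each product is described by [more_memory, faster_processor].
memory, processor, basic = [1.0, 0.0], [0.0, 1.0], [0.0, 0.0]
situations = (
    [([memory, basic], 0)] * 8 + [([memory, basic], 1)] * 2
    + [([processor, basic], 0)] * 6 + [([processor, basic], 1)] * 4
    + [([memory, processor], 0)] * 7 + [([memory, processor], 1)] * 3
)

beta = fit_part_worths(situations, n_features=2)

# For yes/no attributes the range of the partial benefit is simply |beta_k|.
importance = [abs(b) / sum(abs(x) for x in beta) for b in beta]

print("partial benefits:", beta)
print("relative relevance:", importance)
```

In this toy data the memory option is picked more often than the processor option, so the fitted partial benefit of memory comes out larger and memory receives the higher relevance share. With real attributes that have more than two levels, the relevance would be based on the range of partial benefits across the levels of each attribute.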

Big Data: Implementation with Apache Spark


This procedure, especially when applied to e-commerce, quickly produces data volumes on the order of terabytes. Classic software for Conjoint Analysis is not necessarily the best choice in this setting. For one of our customers, we used a Spark/Hadoop cluster to assess the relevance of attributes. We were able to implement the maximum likelihood method relatively easily using the Apache Spark Machine Learning Library (MLlib).
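Why does a maximum likelihood fit lend itself to a cluster at all? Under a logit decision model (an assumption of ours, not a detail of the project), the gradient of the log-likelihood is a plain sum over independent choice situations. Each partition of the log data can therefore compute its partial sum locally, and only small vectors have to be combined. The sketch below illustrates this idea in plain Python without Spark; it is not the MLlib-based implementation mentioned above, and all function names and data are hypothetical.

```python
import math
from functools import reduce

def situation_gradient(alternatives, chosen, beta):
    """Gradient contribution of one choice situation (assumed logit model)."""
    utilities = [sum(b * x for b, x in zip(beta, alt)) for alt in alternatives]
    top = max(utilities)  # subtract the maximum for numerical stability
    weights = [math.exp(u - top) for u in utilities]
    total = sum(weights)
    probs = [w / total for w in weights]
    return [
        alternatives[chosen][k]
        - sum(p * alt[k] for p, alt in zip(probs, alternatives))
        for k in range(len(beta))
    ]

def partition_gradient(partition, beta):
    """'Map' step: sum the contributions of all situations in one partition."""
    total = [0.0] * len(beta)
    for alternatives, chosen in partition:
        contribution = situation_gradient(alternatives, chosen, beta)
        total = [t + c for t, c in zip(total, contribution)]
    return total

def combine(left, right):
    """'Reduce' step: add two partial gradients element by element."""
    return [a + b for a, b in zip(left, right)]

# Hypothetical log data, split into two partitions as a cluster might hold it.
partitions = [
    [([[1.0, 0.0], [0.0, 1.0]], 0), ([[1.0, 0.0], [0.0, 0.0]], 1)],
    [([[0.0, 1.0], [0.0, 0.0]], 0), ([[1.0, 1.0], [1.0, 0.0], [0.0, 0.0]], 2)],
]
beta = [0.3, -0.2]  # arbitrary current parameter values

distributed = reduce(combine, (partition_gradient(p, beta) for p in partitions))
all_situations = [s for p in partitions for s in p]
single_machine = partition_gradient(all_situations, beta)

print(distributed)
print(single_machine)
```

Both computations yield the same gradient up to floating-point rounding, which is the property that lets an iterative optimizer run one such pass over the distributed data per step.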

Conclusion

With Choice-based Conjoint Analysis, the relevance of attributes can be determined on the basis of customers' buying decisions in an online shop. This makes it possible to improve the quality of Master Data in a targeted manner.