Turning questions to answers, not tables to tables


A few words about Aleksandra Mozejko's presentation at DataWorkshop Club Conf 2018.

Aleksandra Mozejko shared her insights after two years of working as a data scientist at Sigmoidal. The reality is much more complex than what you see on Kaggle or in online courses. You won't simply get a data set, do a little cleaning and then run an effective model. Every model deployed to production influences people's decisions and so shapes reality. Keeping business goals in mind, Aleksandra showed what data science looks like when a real client comes to the office.

What are we missing?

It all starts with the data that clients bring to the table, and sometimes they come empty-handed. That was the case with a US drone manufacturer who wanted to build a construction site monitoring system. The problem was well stated, but we had to work together to generate the data from scratch. You always get some data to start with, right? Here the whole process was inverted: we had to figure out which clips, shot at what angle and showing which objects, we actually needed to make it work.

Always on alert

Even when data sets land on your desk, stay cautious, despite all the client's assurances. We were working on a recommendation engine for online articles and were given 16k neatly labelled samples. At first we thought: awesome, let's get into it! But then we discovered that the article content labelled by analysts as interesting held some peculiar data. There were plenty of cookie notifications! That could have skewed our model if we hadn't spotted it in time. Get to know the data you are given inside out, otherwise you may end up going sideways.
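A check like this can be automated in a few lines. The sketch below is only an illustration, not the code used in the project: the phrase list, the function names and the toy samples are all assumptions made up for the example. It measures how often a cookie notice shows up in the text of each label, so that a suspicious imbalance becomes visible before training.

```python
import re

# Hypothetical phrases; a real list would come from reading the actual data.
COOKIE_PATTERNS = [
    r"\bthis (web)?site uses cookies\b",
    r"\bwe use cookies\b",
    r"\bcookie policy\b",
    r"\baccept (all )?cookies\b",
]
COOKIE_RE = re.compile("|".join(COOKIE_PATTERNS), re.IGNORECASE)


def has_cookie_notice(text: str) -> bool:
    """Return True if the text contains one of the cookie-notice phrases."""
    return COOKIE_RE.search(text) is not None


def cookie_rate_by_label(samples):
    """Map each label to the share of its texts that contain a cookie notice.

    `samples` is an iterable of (text, label) pairs.
    """
    totals, hits = {}, {}
    for text, label in samples:
        totals[label] = totals.get(label, 0) + 1
        hits[label] = hits.get(label, 0) + int(has_cookie_notice(text))
    return {label: hits[label] / totals[label] for label in totals}


# Toy data invented for this example.
samples = [
    ("We use cookies to improve your experience. Markets rallied today.", "interesting"),
    ("This website uses cookies. Accept all cookies to continue.", "interesting"),
    ("The council approved the new budget on Monday.", "not_interesting"),
    ("Local team wins the regional final.", "not_interesting"),
]
rates = cookie_rate_by_label(samples)
print(rates)
```

If one label shows a much higher rate than the other, the model may learn the notice instead of the article, which is a good reason to strip such text first.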

In tune with the client

Here is another NLP example: a big law firm carrying out economic investigations and risk management. As part of their operations they screen executive-level candidates for fraud or misconduct that could greatly damage a company if such a candidate were hired. Our task was challenging on multiple levels. We had to build a machine learning model to assist the analysts' assessments. It was supposed to work fast on noisy, sensitive data and generate a ranking of significant articles. At that point it all came down to choosing the right metric. We chose the NDCG score to pick interesting articles, but another problem appeared. It was imperative for the client not to miss any detail, yet when a given person was involved in a big scandal, our model filled most of the top spots with that one story. The analysts were losing focus while scrolling, and sustaining their attention was another essential objective. So we started grouping similar articles into clusters, trying out dozens of embeddings, but the client did not find that helpful either. At that point we realized that perhaps our solution was not in tune with the client's problem. We sent the client a sample batch of articles to see how they would group them themselves. Indeed, we had got it all wrong. Instead of adjustable groups, the client wanted fixed labels such as 'legal cases' or 'PR scandals', and that's it. The real problem was much easier to solve than we had assumed.
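For readers who have not met NDCG (normalized discounted cumulative gain) before, here is a minimal sketch of how it can be computed. It assumes the common variant with linear gain and a log2(rank + 1) discount; the talk did not say which variant or library was used, and the function names and relevance grades below are made up for the example.

```python
import math


def dcg(relevances):
    """Discounted cumulative gain: each relevance is divided by log2(rank + 1)."""
    return sum(rel / math.log2(rank + 1)
               for rank, rel in enumerate(relevances, start=1))


def ndcg(relevances, k=None):
    """NDCG@k: DCG of the given order divided by the DCG of the ideal order.

    `relevances` lists the true relevance of each item in the order the
    model ranked them. With k=None the whole list is used.
    """
    ranked = list(relevances)[:k]
    ideal = sorted(relevances, reverse=True)[:k]
    ideal_dcg = dcg(ideal)
    return dcg(ranked) / ideal_dcg if ideal_dcg > 0 else 0.0


# Hypothetical relevance grades (3 = very significant, 0 = irrelevant).
good_order = [3, 2, 1, 0]   # most significant articles first
bad_order = [0, 1, 2, 3]    # most significant articles last

print(ndcg(good_order))
print(ndcg(bad_order))
```

Note that the metric only looks at the relevance of each item separately. A top list filled with near-identical articles about one scandal can still score perfectly, which is consistent with the problem described above.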

Lesson for you

If you're learning data science, imagine that you're a client. Try to solve a problem for yourself instead of just picking a data set. This is a good way to go through the entire process: collecting data, posing questions, searching for answers and suitable machine learning algorithms, verifying solutions and proving the performance of the model you've built. For example, build a recommendation system for yourself based on your own interests.

Let us know if you have any questions about DataWorkshop Club Conf 2019: conf(at)dataworkshop.eu

Follow DataWorkshop on Twitter or Facebook

Ula
Event manager

DataWorkshop.eu
