Data scientists and product owners have a lot of great ideas. But these ideas often lack the data needed to answer the questions at hand and build a solution around them. We talked to ML Conference speakers Markus Nutz and Thomas Pawlitzki about how to build a data pipeline starting from “zero data”.
Find out how to solve the cold start problem!
JAXenter: Databases need maintenance, we know that. But over the years impenetrable data thickets have grown in many companies. In your session you talk about unraveling the chaos, but where do you start?
Markus Nutz: Fortunately, Freeyou hasn’t been around for that long, so we’ve been able to keep track of everything so far. The answer is probably pretty boring: documentation. Documentation involves all parties, which means that the requirements of product owners, data scientists and data architects all carry equal weight. We are aware that the data is our basis for differentiating ourselves from other insurers.
Thomas Pawlitzki: I have nothing to add to that. Our own database is still manageable. The development team talks a lot about features and changes, so the individual team members are aware of database changes. You don’t have to explain anything to data gurus like Markus.
In the last few weeks, I have also looked at various frameworks that we can use in the development of our API. Some of them already offer features for data migration: schema changes to relational databases can be stored as code, applied, and rolled back again. Perhaps we will soon use such solutions to test the whole thing at an early stage.
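To illustrate the idea of schema changes as code, here is a minimal sketch assuming Alembic, a Python migration tool; the revision IDs, table and column names are purely hypothetical and not from Freeyou’s code base.

```python
# A minimal sketch of a versioned schema migration, assuming Alembic.
# Table and column names are illustrative placeholders.
import sqlalchemy as sa
from alembic import op

# Revision identifiers used by Alembic (values are placeholders).
revision = "0002_add_bike_value"
down_revision = "0001_initial"


def upgrade():
    # Forward migration: add a column for the insured bicycle value.
    op.add_column(
        "claims",
        sa.Column("bike_value_eur", sa.Numeric(10, 2), nullable=True),
    )


def downgrade():
    # Rollback: remove the column again, restoring the previous schema.
    op.drop_column("claims", "bike_value_eur")
```

The point is that both directions live in version control alongside the application code, so a schema change can be reviewed, applied and undone like any other change.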
JAXenter: How can we solve the “cold start problem”?
Markus Nutz: In general, keep your eyes open for where and what kind of data is available. Statistics about traffic accidents, for example, can often be found in the answers to minor inquiries in the state parliaments. This was quite surprising to me. Pictures for a first image classifier are available online. Customer inquiries arise all by themselves!
Thomas Pawlitzki: You should also consider when it makes sense to create your own model or which “ready-made” model to take. For example, we also use an API for image recognition. These APIs are very easy to integrate and do a really good job with general problems. We’d rather put our energy into providing solutions to problems which general APIs can’t solve. We still have very little data here. Fortunately Markus knows enough tricks to polish small data sets and still come up with usable models.
Markus Nutz: Data augmentation, e.g. altering images, inserting spelling mistakes into words, translating mails into English and back again, or window slicing on time series data: these are all strategies for making the most of the “few” data points that already exist! When it comes to models for images and text, transfer learning is of course the way to go; here we are particularly interested in TensorFlow Hub, a library from Google for reusable machine learning modules.
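As a rough sketch of two of these augmentation strategies (spelling mistakes in text and window slicing on time series), here is some illustrative Python; the function names and parameters are ours, not the speakers’.

```python
# Illustrative augmentation helpers: typo injection and window slicing.
import random
import numpy as np


def add_typos(text: str, rate: float = 0.05) -> str:
    """Randomly swap adjacent characters to simulate spelling mistakes."""
    chars = list(text)
    for i in range(len(chars) - 1):
        if random.random() < rate:
            chars[i], chars[i + 1] = chars[i + 1], chars[i]
    return "".join(chars)


def window_slices(series: np.ndarray, window: int, step: int = 1) -> np.ndarray:
    """Cut one long time series into many overlapping training windows."""
    return np.stack([series[i:i + window]
                     for i in range(0, len(series) - window + 1, step)])


# Example: one labelled email and one sensor trace become many samples.
print(add_typos("The bicycle was stolen from the train station"))
print(window_slices(np.arange(10), window=4, step=2).shape)  # (4, 4)
```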
In general, we also pay attention to using models that suit our existing data and don’t require huge amounts of data to work well. Logistic regression or random forests are simply super!
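As a hedged illustration of that “simple models first” idea, the following scikit-learn sketch compares logistic regression and a random forest on a small synthetic data set standing in for scarce real data.

```python
# Comparing two simple baselines on a small data set with scikit-learn.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# A few hundred rows is already enough to compare simple baselines.
X, y = make_classification(n_samples=300, n_features=10, random_state=42)

for model in (LogisticRegression(max_iter=1000),
              RandomForestClassifier(n_estimators=100, random_state=42)):
    scores = cross_val_score(model, X, y, cv=5)
    print(type(model).__name__, scores.mean().round(3))
```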
JAXenter: In connection with the construction of a data pipeline you speak of “zero data”. Please give us a concrete example.
Markus Nutz: Oh, that was misleadingly described then. We chose “Zero Data” because data, trite as it sounds, exists everywhere around us and is available to us. We can evaluate initial ideas with data sets from Kaggle or the relatively new Google Dataset Search, consult official statistics, and use OpenStreetMap data. This incredibly detailed data allows us, for example, to estimate the risk of vandalism for bicycles and cars at a given location, or to find a good route from bicycle dealer to bicycle dealer for our sales team. It’s a free lunch, so to speak.
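To show how accessible this kind of data is, here is a small sketch that queries OpenStreetMap via the public Overpass API for bicycle shops around a point (central Berlin in this example); the endpoint and query language are real, but the query itself is only an illustration, not the speakers’ actual pipeline.

```python
# Querying OpenStreetMap via the Overpass API for nearby bicycle shops.
import requests

OVERPASS_URL = "https://overpass-api.de/api/interpreter"
query = """
[out:json];
node["shop"="bicycle"](around:5000,52.5200,13.4050);
out;
"""

response = requests.post(OVERPASS_URL, data={"data": query}, timeout=60)
shops = response.json()["elements"]
print(f"Found {len(shops)} bicycle shops")
for shop in shops[:5]:
    print(shop.get("tags", {}).get("name", "unnamed"), shop["lat"], shop["lon"])
```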
Thomas Pawlitzki: Yes, that was really surprising and enjoyable when we ran into the problem of theft data in a workshop on our bike insurance. We had briefly considered how we could get access to a good data source and whether we should approach the various police stations. However, a five-minute search on the net showed that (at least for the location we examined) a daily newspaper offers up-to-date data. We were surprised and of course very happy about that.
JAXenter: How do you maintain your data pipeline?
Markus Nutz: Phew! I’d like to have a good answer to that, but we don’t have a good recipe yet. I’d say: testing. What helps in any case is that we, as an organization, have a common understanding: data is what enables us to offer a better product and set ourselves apart from the market. That’s why we’re all very motivated to make this happen!
Thomas Pawlitzki: Yes, sometimes we are a bit “casual” and there’s still room for improvement. Nevertheless, the whole thing works surprisingly well, probably due to the great commitment of all the team members.
Thank you very much!
Markus Nutz and Thomas Pawlitzki will be delivering a talk at ML Conference in Berlin on Wednesday, December 5 about their experience with the cold start problem and building a data pipeline. Starting from “zero data”, how do they arrive at a data pipeline built on open, found and collected data? Their data pipeline enables building data products that help customers in their daily lives.
Carina Schipper has been an editor at Java Magazine, Business Technology and JAXenter since 2017. She studied German and European Ethnology at the Julius-Maximilians-University Würzburg.