Connecting Disinformation with tidygraph

I recently participated in a hackathon organised by the EU’s anti-disinformation task force, which gave us access to its database. The database consists of all disinformation cases the group has collected since it started in 2015. The data can also be browsed online at www.euvsdisinfo.eu. It contains more than 7000 cases of disinformation, mostly news articles and videos, that were collected and debunked by the EUvsDisinfo project. »

House-Cleaning: Getting rid of outliers II

In the previous post, we tried to clean rental offerings of outliers. We first just had a look at our data and tried to clean it by simply using thresholds derived from our own knowledge about flats, with minor success. We got slightly better results by using the IQR rule and learned two things: First, the IQR rule works better if our data is normally distributed and, if it’s not, transforming it can work wonders. »

House-Cleaning: Getting rid of outliers I

Working with real-world data presents many challenges that sanitized textbook data doesn’t have. One of them is how to handle outliers. Outliers are defined as points that differ significantly from other data points, and they are especially common in data obtained through manual input processes. For example, on an online listing site, someone might accidentally press the zero key a bit too often and suddenly the small rental flat is as expensive as a palace. »

Conference Time: Predictive Analytics World 2019

In the last two years, I pretty much only went to very technical conferences, such as the PyData Berlin, the PyCon or the SatRday. They’re all great conferences, organized by awesome people, and I will definitely go again, but this fall I decided to try out a new conference and check out the Predictive Analytics World in Berlin. First, because it’s always good to try out new things, and also because in the last months I was wondering a lot how data teams can be made more useful and more aligned with the business challenges, a topic that frankly doesn’t come up much in Python talks about how to deploy machine learning models. »

Reproducible (Data) Science with Docker and R

In my data team at work, we’ve been using Docker for a while now. At least, the engineers in our team have been using it; we data scientists have been very reluctant to pick it up so far. Why bother with a new tool (that seems complicated) when you don’t see the reason, right? That was until I was about to hold my Houseprice Talk again, wanted to make some small changes to my xaringan slides, and nothing worked. »