Data science at Jodel: tech setup and data products

Stay tuned!

Tech setup

We don't really have any ML platforms or complex data science infrastructure in place.

Since the beginning of the data science team, we had some basic AWS EC2 instances.

One was mainly for RStudio and larger in-memory computations, another one to run Jupyter notebooks and one more with high GPU power for training of complex models.

We have also used EC2 instances for fast deployment of data products. That gave us a great head start because we could move fast, but it was messy because we didn't stick to a proper DevOps culture, and it was also expensive and not optimised.

Since then, with more emphasis within Engineering on DevOps and costs, we have taken a step forward and moved to Docker.

Our data products are now containerised and can be deployed anywhere.

This way we can share the ownership of the infrastructure and deployment with our SRE team and build good self-contained data products.

Lately, we added something new to our stack to become more independent with our image filter product.

We use the Google Cloud Platform for this, leveraging their ML Engine, VMs and Cloud Functions, etc.

It is harder to evaluate this part.

Getting into Docker has helped us greatly in improving the quality and reliability of our products and made it easier for other engineers to understand them, because a Dockerfile is a pretty good base of documentation on its own.

Having no “ML platform” (apart from Google Cloud, which we use in one instance) has pros and cons: we are less independent as a team and carry more operational work ourselves, but on the other hand we are more resource-efficient, we learn the DevOps side of data science as well, and we avoid vendor lock-in.

Data products

Over the past two years, we’ve worked on many data products.

We give a description of the most important ones, ordered approximately by the time they were developed.

Experimentation Platform is a tool created to automate one of the most time-consuming tasks: analysing and preparing controlled experiments.

After doing that for months and progressing at a slow pace, we wanted to put the full A/B testing pipeline into the hands of the product managers.

Through meetings with the teams and different iterations, we tried to build an app that could be understood and used by non-technical users and provide meaningful and statistically sound insights about the experiments that we run.

The analyses that used to take the data science team up to three days are now done by the product team in minutes, and we keep improving the app to make things more automated and reliable and to get the best insights we possibly can.
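To make the "statistically sound" part concrete, here is a minimal sketch of the kind of check such a platform automates for a conversion experiment: a two-proportion z-test plus confidence intervals per group. The numbers and the library choice (statsmodels) are illustrative assumptions, not our actual pipeline, which also covers experiment setup, segmentation and reporting.

```python
# Minimal sketch of a two-proportion significance test, the kind of analysis
# an experimentation platform automates. Numbers are made up for illustration.
from statsmodels.stats.proportion import proportions_ztest, proportion_confint

# conversions and sample sizes for control and variant (hypothetical data)
conversions = [480, 530]
samples = [10_000, 10_000]

z_stat, p_value = proportions_ztest(count=conversions, nobs=samples)
print(f"z = {z_stat:.2f}, p = {p_value:.4f}")

# confidence intervals per group, useful for reporting effect sizes to PMs
for label, conv, n in zip(["control", "variant"], conversions, samples):
    low, high = proportion_confint(conv, n, alpha=0.05)
    print(f"{label}: {conv / n:.2%} (95% CI {low:.2%} to {high:.2%})")
```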

User Clustering is an exploratory analysis started at a time when the overview about our users was still very blurred.

Applying clustering on our user base and doing some further analysis helped us to discover different engagement groups in our app with different behaviours and properties.

The output of this analysis helped us define important KPIs and focus on retaining and understanding specific audiences.
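For illustration, a minimal sketch of the approach: standardise a few engagement features and run k-means over them. The features, the synthetic data and the number of clusters below are assumptions; the real analysis involved much more iteration and interpretation.

```python
# Minimal sketch of clustering users by engagement features.
# Feature choices and data are illustrative, not our production pipeline.
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.cluster import KMeans

rng = np.random.default_rng(42)
# hypothetical per-user features: sessions per week, posts, replies, votes
X = rng.gamma(shape=2.0, scale=2.0, size=(1_000, 4))

X_scaled = StandardScaler().fit_transform(X)
kmeans = KMeans(n_clusters=4, n_init=10, random_state=42).fit(X_scaled)

# inspect cluster sizes and mean behaviour per cluster
for label in range(kmeans.n_clusters):
    members = X[kmeans.labels_ == label]
    print(f"cluster {label}: {len(members)} users, "
          f"mean features {members.mean(axis=0).round(2)}")
```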

Image Filter is based on a deep learning model, with the goal of preventing sexually explicit pictures posted on Jodel from reaching users.

This is our most obvious cost-saving service and has been super useful.

We have a system in place where unclear cases go to our human reviewers, which gives us a cost-effective way to protect our users well.
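The routing idea is conceptually simple; here is a hypothetical sketch with made-up thresholds, where only the unclear middle band of model scores is sent to human review.

```python
# Hypothetical sketch of threshold-based routing: auto-allow clearly safe
# images, auto-block clearly explicit ones, and send the unclear middle band
# to human reviewers. Thresholds and the scoring model are assumptions.
def route_image(explicit_score: float,
                allow_below: float = 0.2,
                block_above: float = 0.9) -> str:
    """Decide what to do with an image given a model score in [0, 1]."""
    if explicit_score >= block_above:
        return "block"
    if explicit_score <= allow_below:
        return "allow"
    return "human_review"

for score in (0.05, 0.5, 0.97):
    print(score, "->", route_image(score))
```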

Jodel Text Analytics was our first baby step into the world of natural language processing.

Since Jodel is an app where people post mostly textual content, it was very important to start collecting some quantifiable knowledge about that content.

It started with basics like language, noun and entity detection.

Later, we partnered with another company to create a content-labelling model that sorts our content into the 7 buckets we defined as the main content types on Jodel.
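As a rough illustration of the basics (language, noun and entity detection), here is a sketch using langdetect and spaCy as stand-ins; this is not our actual stack, and the 7-bucket labelling model is not shown.

```python
# Minimal sketch of basic text analytics: language detection plus noun and
# entity extraction. Library choices are stand-ins, not our production stack.
# Requires: pip install langdetect spacy && python -m spacy download en_core_web_sm
from langdetect import detect
import spacy

nlp = spacy.load("en_core_web_sm")

text = "Anyone up for a study group at the TU Berlin library tonight?"

print("language:", detect(text))  # e.g. 'en'

doc = nlp(text)
print("nouns:", [chunk.text for chunk in doc.noun_chunks])
print("entities:", [(ent.text, ent.label_) for ent in doc.ents])
```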

X-PRO is our framework, inspired by a concept from Spotify, which uses machine learning to detect user behaviours that directly influence a specific top-line KPI.

We used this to get closer to the quantitative “Aha moment” of our users.

And we repeatedly use it to find out which user behaviours to focus on to lift the right numbers up.
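We won't go into the internals of X-PRO here, but the underlying idea can be sketched generically: fit a simple model that predicts the KPI from early user behaviours and look at which behaviours carry the most weight. Everything below (feature names, synthetic data, logistic regression) is an illustrative assumption, not the framework itself.

```python
# Generic illustration of linking early user behaviours to a top-line KPI
# (here, a made-up "retained" flag). Not the actual X-PRO implementation.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
features = ["posts_in_week_1", "replies_in_week_1", "channels_joined", "votes_cast"]
X = rng.poisson(lam=[2, 5, 1, 8], size=(5_000, 4)).astype(float)

# synthetic KPI: retention made more likely by replies and channels joined
logits = -1.5 + 0.3 * X[:, 1] + 0.8 * X[:, 2]
y = (rng.random(5_000) < 1 / (1 + np.exp(-logits))).astype(int)

model = LogisticRegression().fit(StandardScaler().fit_transform(X), y)

# behaviours with the largest coefficients are the ones worth focusing on
for name, coef in sorted(zip(features, model.coef_[0]), key=lambda t: -abs(t[1])):
    print(f"{name}: {coef:+.2f}")
```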

Personalisation and recommendation are widespread in the social media industry.

We have created systems to personalise the experience for our users, with the goal of serving them better and allowing more diverse user groups to use the product.

To that end, we have worked (and are still working) on two main projects.

Channel Recommender: Channels at Jodel are something like Facebook groups; they are an area where people with common interests meet to share and discuss opinions, ideas, etc.

Our system was built on the general metrics and activity of our users in channels to give them the opportunity to find the content they are looking for.
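As a toy illustration of an activity-based recommender, the sketch below suggests channels that are frequently followed together with the ones a user already follows. Both the data and the simple co-occurrence approach are hypothetical simplifications, not our production system.

```python
# Hypothetical sketch of a simple channel recommender: suggest channels that
# frequently co-occur with the ones a user already follows.
from collections import Counter
from itertools import combinations

# made-up (user -> followed channels) data
follows = {
    "u1": {"berlin", "memes", "studying"},
    "u2": {"berlin", "memes", "football"},
    "u3": {"memes", "football", "gaming"},
    "u4": {"studying", "berlin"},
}

# count how often two channels are followed by the same user
co_counts = Counter()
for channels in follows.values():
    for a, b in combinations(sorted(channels), 2):
        co_counts[(a, b)] += 1
        co_counts[(b, a)] += 1

def recommend(user, top_n=3):
    """Score unseen channels by co-occurrence with the user's channels."""
    seen = follows[user]
    scores = Counter()
    for followed in seen:
        for (a, b), count in co_counts.items():
            if a == followed and b not in seen:
                scores[b] += count
    return [channel for channel, _ in scores.most_common(top_n)]

print(recommend("u4"))  # ['memes', 'football']
```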

Personalised Feed: The landing feed of Jodel is chronological to show users the newest updates from their community.

Even though this is a legit way of sorting content that has worked well for us, we see an opportunity in a more tailored approach.

Being connected to the people by location puts you in a position where you can read any type of content, which is something that is difficult to control by design (unlike Facebook with friends or Instagram with followers).

This makes it harder to get the best content for the user.

We have got some great experimental results but are still working on it to make it even better.
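To give an idea of what "a more tailored approach" can look like, here is a hypothetical sketch that re-ranks a chronological feed by blending recency decay with a per-user relevance score. The weights, half-life and relevance scores are assumptions for illustration, not what we actually ship.

```python
# Hypothetical sketch of re-ranking a chronological feed by mixing recency
# with a precomputed per-user relevance score. All parameters are assumptions.
import math
import time

def rank_feed(posts, now=None, half_life_hours=6.0, relevance_weight=0.6):
    """Sort posts by a blend of exponential recency decay and relevance.

    Each post is a dict with 'created_at' (unix seconds) and 'relevance'
    (a model score in [0, 1], assumed to be precomputed per user).
    """
    now = now or time.time()

    def score(post):
        age_hours = (now - post["created_at"]) / 3600
        recency = math.exp(-math.log(2) * age_hours / half_life_hours)
        return relevance_weight * post["relevance"] + (1 - relevance_weight) * recency

    return sorted(posts, key=score, reverse=True)

posts = [
    {"id": 1, "created_at": time.time() - 600, "relevance": 0.2},   # fresh, low relevance
    {"id": 2, "created_at": time.time() - 7200, "relevance": 0.9},  # older, high relevance
]
print([p["id"] for p in rank_feed(posts)])  # [2, 1]
```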

Conclusion

We do have some critical thoughts on our overall progress, even though we are generally really satisfied.

First, from the product point of view, make sure things are aligned with the product teams’ priorities.

We did a few things because we felt they were necessary, but then sometimes lacked the resources from product teams to bring them fully to production, or there was even something data-related with higher priority from the product standpoint.

Second, share knowledge about these projects better inside the team.

At the start, we created big knowledge silos and thought it had to be that way because we moved faster and everyone had their own expertise that the others would need to understand first.

This was not true: if we do proper code reviews and actually work collaboratively on projects, we make better products, we are happier, we are more knowledgeable and we are faster.

Last, be really production-oriented.

Think about how the product will be used and make sure it performs well and fits into our product and systems.

There is no time to produce things that are not going to be used.

