One notorious problem with deep learning and deep neural networks (DNNs) is that they can become **black boxes**. Let's say that we have fitted a network with good test performance on a given classification problem. However, we are now stuck. We cannot make sense of the final weights that have been learned or adequately **visualize the problem space**! Another issue arises in the real world. In practical applications of neural networks we often fall back to **training ensembles of networks**. We use the averaged output of many models. This can be more powerful than the output of one single…

In this post, we will build a machine learning pipeline using multiple optimizers and use the power of Bayesian Optimization to arrive at the **best configuration for all our parameters**. All we need is the sklearn Pipeline and Skopt.

You can use your favorite ML models, as long as they have a sklearn wrapper (looking at you, XGBoost and NGBoost).

The critical point for finding the best models that can solve a problem is *not* just the models. We need to **find the optimal parameters** to make our model perform as well as possible on the given dataset. This is called finding or…

One profound claim made by the media is that the rate of suicides among younger people in the UK rose from the 1980s to the 2000s. You might find it in the news, in publications, or simply as an accepted truth among the population. But how can you make this measurable?

To make this claim testable, we look for data and find an overview of suicide rates, specifically for England and Wales, at the Office for National Statistics (UK), together with an overall visualization.

https://www.ons.gov.uk/visualisations/dvc661/suicides/index.html

Generally, one type of essential question to ask…

How do you handle missing data, gaps in your data-frames or noisy parameters?

You have spent hours at work, in the lab or in the wild to generate or curate a dataset for an interesting research question or hypothesis. Then, to your dismay, you find that some of the **measurements** for a parameter **are missing**!

Another case that might throw you off is **unexpected noise** that was introduced at some point in the experiment and has doomed some of your measurements to be **extreme outliers**. …
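As a rough illustration of both situations, here is a minimal sketch using only the Python standard library. It is not the method from the article itself: median imputation, the 1.5 × IQR rule, the function names and the toy measurements are all assumptions chosen for the example.

```python
import statistics


def impute_median(values):
    """Replace None entries with the median of the observed values."""
    observed = [v for v in values if v is not None]
    if not observed:
        raise ValueError("cannot impute: no observed values")
    median = statistics.median(observed)
    return [median if v is None else v for v in values]


def iqr_outlier_mask(values, k=1.5):
    """Return True for values outside [Q1 - k*IQR, Q3 + k*IQR]."""
    q1, _, q3 = statistics.quantiles(values, n=4)
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr
    return [v < low or v > high for v in values]


# Toy measurements (made up): two missing entries and one extreme value.
measurements = [4.1, 3.9, None, 4.3, 4.0, 40.0, None, 4.2]

filled = impute_median(measurements)   # gaps replaced by the median
outliers = iqr_outlier_mask(filled)    # flags the extreme measurement

for value, is_outlier in zip(filled, outliers):
    print(f"{value:6.2f}  {'outlier' if is_outlier else ''}")
```

The median is used for filling gaps because, unlike the mean, it is barely affected by the extreme value that is still in the data at that point.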

We should always aim to create better Data Science workflows.

But in order to achieve that, we should find out what is lacking.

Classical Machine Learning pipelines work great. The usual workflow looks like this:

- Have a use-case or research question with a potential hypothesis,
- build and curate a dataset that relates to the use-case or research question,
- build a model,
- train and validate the model,
- maybe even cross-validate, while grid-searching hyper-parameters,
- test the fitted model,
- deploy the model for the use-case,
- answer the research question or hypothesis you posed.
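The modeling steps of this workflow (train, cross-validate while grid-searching, test) can be sketched in plain Python. This is only a toy under stated assumptions: the 1-D k-nearest-neighbour classifier, the synthetic dataset, the grid and all function names are made up for illustration. In practice you would typically reach for a library tool such as scikit-learn's `GridSearchCV`.

```python
import random


def knn_predict(train, x, k):
    """Majority vote among the k training points closest to x."""
    nearest = sorted(train, key=lambda point: abs(point[0] - x))[:k]
    votes = sum(label for _, label in nearest)
    return 1 if 2 * votes > k else 0


def accuracy(train, test, k):
    """Fraction of test points that are classified correctly."""
    hits = sum(knn_predict(train, x, k) == y for x, y in test)
    return hits / len(test)


def cross_val_score(data, k, folds=5):
    """Mean validation accuracy over `folds` disjoint splits."""
    scores = []
    for i in range(folds):
        val = data[i::folds]
        train = [p for j, p in enumerate(data) if j % folds != i]
        scores.append(accuracy(train, val, k))
    return sum(scores) / folds


# Toy dataset (made up): class 1 tends to have larger x than class 0.
rng = random.Random(0)
data = [(rng.gauss(0.0, 1.0), 0) for _ in range(100)]
data += [(rng.gauss(2.0, 1.0), 1) for _ in range(100)]
rng.shuffle(data)

# Hold out a test set that is never touched during model selection.
train_set, test_set = data[:150], data[150:]

# Grid-search the hyper-parameter k with cross-validation on the training set.
grid = [1, 3, 5, 9, 15]
cv_scores = {k: cross_val_score(train_set, k) for k in grid}
best_k = max(cv_scores, key=cv_scores.get)

# Test the selected model once on the held-out data.
test_accuracy = accuracy(train_set, test_set, best_k)
print(f"best k = {best_k}, test accuracy = {test_accuracy:.2f}")
```

Keeping the test set out of the grid search is the important design choice here: the cross-validation scores pick the hyper-parameter, and the held-out data is used only for the final check.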

As you might have noticed, one severe shortcoming is…

Single parameter models are an excellent way to get started with the topic of probabilistic modeling. These models consist of one parameter that influences our observation and which we can infer from the given data. In this article we look at the performance of two well-established frameworks and compare them: the statistical language STAN and the Pyro Probabilistic Programming Language (PPL).

One old and established dataset covers the cases of kidney cancer in the U.S. from 1980–1989 and is available here (see [1]). It lists U.S. counties, their total population and the reported cancer deaths. Our task is to…

There is a popular idea that practicing something for over 10,000 hours lets you acquire proficiency in the subject. The concept is based on the book Outliers by M. Gladwell. Those 10k hours are the time you spend practicing or studying a subject until you have a firm grasp of it and can be called proficient. Though this number of hours is somewhat arbitrary, we will take a look at how that many hours can be spent to gain proficiency in the field of Data Science.

Imagine this as a learning budget in your Data-apprenticeship journey. If I were…

When you want to gain more insight into your data, you rely on programming frameworks that allow you to work with probabilities. All you have in the beginning is a collection of data-points. It is just a glimpse into the underlying distribution from which your data comes. However, in the end you want more than simple data-points. What you want are elaborate, talkative density distributions with which you can perform tests. For this, you use probabilistic frameworks like TensorFlow Probability, Pyro or STAN to compute posterior probabilities. As we will see, this computation is not always feasible…

Many of the models that the average Data Scientist or ML engineer uses daily are driven by numerical optimization methods. Studying the optimization and performance of different functions helps to gain a better understanding of how the process works. The challenge we face on a daily basis is that someone gives us a model of how they think the world or their problem works. Now, **you** as a Data Scientist **have to find the optimal solution to the problem**. For example, you look at an energy-function and want to find the absolute, global minimum for your tool to work…
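To make the idea of searching for a global minimum concrete, here is a minimal sketch of gradient descent restarted from several starting points. The energy function, learning rate, step count and starting points are all made up for illustration, and restarting from multiple points is only a heuristic: it does not guarantee that the global minimum is found in general.

```python
def energy(x):
    """Toy energy function (made up) with one local and one global minimum."""
    return x**4 - 3 * x**2 + x


def energy_grad(x):
    """Analytical derivative of the toy energy function."""
    return 4 * x**3 - 6 * x + 1


def gradient_descent(x0, lr=0.01, steps=2000):
    """Follow the negative gradient from x0 for a fixed number of steps."""
    x = x0
    for _ in range(steps):
        x -= lr * energy_grad(x)
    return x


# Different starting points end up in different minima.
starts = [-2.0, -1.0, 0.0, 1.0, 2.0]
candidates = [gradient_descent(x0) for x0 in starts]

# Keep the candidate with the lowest energy as the best guess.
best_x = min(candidates, key=energy)

for x0, x in zip(starts, candidates):
    print(f"start {x0:5.1f} -> x = {x:8.4f}, energy = {energy(x):8.4f}")
print(f"best candidate: x = {best_x:.4f}")
```

A single run started at `1.0` or `2.0` gets stuck in the shallower minimum on the right, which is why the sketch compares the results of several runs instead of trusting one.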

Have you ever wondered how to account for uncertainties in time-series forecasts? Have you ever thought there should be a way to generate data-points from previously seen data and make judgement calls about certainties? *I know I have.* If you want to build models that capture probabilities and hold confidences, we recommend using a probabilistic programming framework like Pyro. In a previous article we looked at NGBoosting and applied it to the M5 forecasting challenge on Kaggle. As a quick recap: the M5 forecasting challenge asks us to predict how the sales of Walmart items will develop…

I am a Data Scientist and M.Sc. student in Bioinformatics at the University of Copenhagen. You can find more content on my weekly blog http://laplaceml.com/blog