Data Analysis

Ready-to-run code covering preprocessing, parameter tuning, model training and evaluation.

Image by Buffik from Pixabay

In this short tutorial I illustrate a complete data analysis process based on the scikit-learn Python library. The process includes:

  • preprocessing, which includes feature selection, normalization and balancing
  • model selection with parameter tuning
  • model evaluation
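
The three stages above can be sketched as a single scikit-learn pipeline. This is only an illustrative sketch, not the tutorial's actual code: it uses the iris dataset, SelectKBest, MinMaxScaler and a decision tree as stand-ins, and omits balancing (which typically requires an extra library such as imbalanced-learn):

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.preprocessing import MinMaxScaler
from sklearn.feature_selection import SelectKBest
from sklearn.pipeline import Pipeline
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Preprocessing (feature selection + normalization) and the model, chained
pipe = Pipeline([
    ('select', SelectKBest(k=2)),
    ('scale', MinMaxScaler()),
    ('model', DecisionTreeClassifier(random_state=0)),
])

# Parameter tuning via cross-validated grid search
grid = GridSearchCV(pipe, {'model__max_depth': [2, 3, 4]}, cv=5)
grid.fit(X_train, y_train)

# Model evaluation on the held-out test set
print(grid.score(X_test, y_test))
```

Wrapping the steps in a Pipeline keeps the tuning honest: the grid search refits selection and scaling inside each cross-validation fold.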

The code of this tutorial can be downloaded from my GitHub repository.

Load Dataset

Firstly, I load the dataset through the Python pandas library. I exploit the heart.csv dataset, provided by the Kaggle repository.

import pandas as pd

df = pd.read_csv('source/heart.csv')
df.head()


Data Science Teaching

A quick tutorial for beginners to get started with the very popular software for statistics and data analysis.

Image by Author

R is very popular software for statistical computing and graphics. It provides many packages that can also be used for data science, especially for data analysis.

This article belongs to the series R for Beginners, which helps beginners get started with the R software. In my previous article, I dealt with vectors. In this article, I deal with matrices and, in particular, I focus on the following aspects:

  • create a matrix
  • assign names to rows and columns
  • select items
  • expand the matrix with new rows or columns
  • basic statistics.

1 Create a Matrix

A matrix is a multidimensional…


Data Visualisation

A quick tutorial to build an interactive choropleth map with the popular JavaScript library

Image by Author

Many JavaScript libraries exist for building and animating maps, such as Leaflet.js and Highcharts. In this article I exploit the well-known Data-Driven Documents (D3) library (version 5), which is much more than a simple graph library.

D3 is a JavaScript library for manipulating documents based on data.

In this tutorial I will build a choropleth map which shows the population of each country in the world. I have modified the original code, adapting it to D3 v5 and enriching it with interactivity and annotations.

The full code can be downloaded from my Github Repository.

Setup

Firstly, I…


Data Visualisation

A ready-to-run notebook which exploits the recent sqlite3 features provided by Observablehq

Photo by Luke Chesser on Unsplash

Recently, the Observablehq team has released a new feature which permits importing sqlite3 databases into a notebook. This feature is very powerful, since it allows the dataset to be queried dynamically through classic SQL syntax. The original tutorial provided by Mike Bostock is available at this link.

In this tutorial, I exploit the new sqlite3 feature to build a simple bar chart which updates dynamically according to the user's selection.

As example dataset, I use the Generic Food Database, provided by data.world and available at this link. The following table shows a snapshot of the Generic Food Database:


Technology Discussions

It seems that the Semantic Web has almost gone. Is this true? In this article we retrace the history of the Semantic Web and identify new challenges and opportunities where it could become trending again.

Image by Adina Voicu from Pixabay

In the early 2000s, one of the most popular topics was the Semantic Web. The Semantic Web, also known as Web of Data or Web 3.0, tried to give a structure to the content of Web pages such that they were understandable not only by humans but also by machines.

The Semantic Web is the Web of Data, in contrast to the previous versions of the Web, which were Webs of documents.

The main technologies associated with the Semantic Web are:

  • the Web Ontology Language (OWL), which permits defining a common vocabulary to represent data
  • the Resource Description Framework (RDF)…

Data Collection

Ready-to-run code which exploits the read_html() function of the Python Pandas library

Image by Goumbik from Pixabay

Almost all Data Scientists working in Python know the Pandas library, and almost all of them know the read_csv() function. However, only a few of them know the read_html() function.

The read_html() function permits extracting the tables contained in HTML pages very quickly. The basic version of this function extracts all the tables contained in the HTML page, while some specific parameters allow the extraction of one specific table.
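
To illustrate the idea before turning to the real page, here is a minimal sketch that parses an inline HTML string (a stand-in for a downloaded page; read_html() also accepts a URL directly, and requires an HTML parser such as lxml to be installed):

```python
import pandas as pd
from io import StringIO

# A minimal HTML page with a single table (stand-in for a real web page)
html = """
<html><body>
<table>
  <tr><th>Group</th><th>Team</th></tr>
  <tr><td>A</td><td>Italy</td></tr>
  <tr><td>A</td><td>Switzerland</td></tr>
</table>
</body></html>
"""

# read_html() returns a list with one DataFrame per table found
tables = pd.read_html(StringIO(html))
print(tables[0])
```

Because the return value is a list, a page with several tables yields several dataframes; parameters such as match and attrs can narrow the extraction to one of them.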

In this tutorial, I focus on the following HTML page, containing the groups of the Euro 2020 football competition:


A quick overview and ready-to-run code to understand the (P, D, Q, m) seasonal order of the SARIMA model of the Python statsmodels library.

Image by romnyyepez from Pixabay

Some months ago, I wrote an article, which described the full process to build a SARIMA model for time series forecasting. In that article, I explained how to tune the p, d and q order of a SARIMA model and I evaluated the performance of the trained model in terms of NRMSE.

One comment about that article was that the proposed model was basically an ARIMA model, since it did not consider the seasonal order. I thanked the comment's author and investigated this aspect.

And now I am here to explain an interesting aspect that I discovered and…


Machine Learning

Ready-to-run Python code which implements the K-Neighbours Classifier in scikit-learn, from data preprocessing to production.

Image by Author

In this tutorial, I illustrate how to implement a classification model exploiting the K-Neighbours Classifier. The full code is implemented as a Jupyter Notebook and can be downloaded from my Github repository.

As an example dataset, I exploit the Titanic dataset provided in the Kaggle challenge Titanic - Machine Learning from Disaster. The objective of this challenge is to build a model that predicts whether a passenger survived the Titanic disaster, given some of the passenger's features.
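
Before diving into the dataset, here is a minimal sketch of the K-Neighbours workflow on synthetic data standing in for the Titanic features (the scaler, neighbour count and split are illustrative choices, not the tutorial's tuned settings):

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score

# Synthetic binary-classification data (stand-in for survived / not survived)
X, y = make_classification(n_samples=500, n_features=6, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

# Scaling matters for distance-based models such as K-Neighbours:
# fit the scaler on the training set only, then reuse it everywhere
scaler = StandardScaler().fit(X_train)
model = KNeighborsClassifier(n_neighbors=5)
model.fit(scaler.transform(X_train), y_train)

y_pred = model.predict(scaler.transform(X_test))
print(accuracy_score(y_test, y_pred))
```

The same fit-transform-predict shape carries over once the real train.csv features are cleaned and encoded.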

Load Dataset

The dataset is composed of three files:

  • train.csv — which contains the dataset used to train the model
  • test.csv — which contains…


Data Preprocessing

Ready-to-run code with different techniques to sample a dataset in Python Pandas

Image by Peggy und Marco Lachmann-Anke from Pixabay

It may happen that you need only some rows of your Python dataframe. You can achieve this result through different sampling techniques.

In this tutorial, I illustrate the following techniques to perform row sampling with Python Pandas:

  • random sampling
  • sampling with condition
  • sampling at a constant rate

The full code can be downloaded from my Github repository.

Load Dataset

In this tutorial, I exploit the iris dataset, provided by the scikit-learn library, and convert it to a pandas dataframe:

from sklearn.datasets import load_iris
import pandas as pd
data = load_iris()
df = pd.DataFrame(data.data, columns=data.feature_names)
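
The three techniques can then be sketched as follows (the iris loading is repeated so the snippet is self-contained; the 6 cm threshold and the sampling rates are arbitrary illustrative choices):

```python
from sklearn.datasets import load_iris
import pandas as pd

data = load_iris()
df = pd.DataFrame(data.data, columns=data.feature_names)

# Random sampling: 10 rows chosen at random (seeded for reproducibility)
random_rows = df.sample(n=10, random_state=1)

# Sampling with a condition: keep rows whose sepal length exceeds 6 cm
cond_rows = df[df['sepal length (cm)'] > 6.0]

# Sampling at a constant rate: keep every 10th row via slicing
rate_rows = df[::10]

print(len(random_rows), len(cond_rows), len(rate_rows))
```

Random sampling gives a representative subset, the condition gives a targeted one, and constant-rate slicing preserves the original row order, which matters for ordered data such as time series.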


Data Visualisation

A quick tutorial to make wonderful HTML pages with your Observable graphs.

Image by Author

Observablehq is a very popular notebook environment for writing code that exploits the D3.js library. Thanks to the many examples and tutorials available on the Web, you can fork already-built notebooks and customise them for your needs.

However, once a graph is built, it is not immediately easy to embed it into another Web site.

In this tutorial, I propose two strategies to embed a graph into a Web site:

  • through iframe
  • producing the JavaScript of the graph.

In both cases, you must first publish your notebook by clicking the publish button.

In addition, in both cases, you should follow the following…

Angelica Lo Duca

I’m a computer scientist with experience in the field of Web applications, Data Science, Data Journalism, Blockchain and Semantic Web.
