Monthly Summary

A Summary of My July Articles: from Python/R Tutorials to Open Discussions

Here is a quick recap of the articles I wrote in July

The articles I published in July can be classified into the following groups:

  • Discussions
  • Data Structures
  • Data Collection
  • Data Analysis
  • Data Visualisation
  • Other Topics

In this article, I give a quick recap of my July articles. I also include a link to each original article, in case you want to explore the topic further.

1 Discussions

1.1 Adversarial Machine Learning

Adversarial Machine Learning is a technique that tries to modify an existing Machine Learning model in order to introduce errors into its predictions.

An adversary can attack an ML model at two levels:

  • Training: the attacker tries to perturb the model or the dataset at training time, for example by injecting fake data or modifying data in the dataset;
  • Testing (or Inference): this kind of attack is performed when the model has already been trained.

To defend an ML system against Adversarial ML attacks, the following steps should be followed:

  • identify the potential vulnerabilities of the ML system
  • design and implement the corresponding attacks and evaluate their impact on the system
  • propose some countermeasures to protect the ML system against the identified attacks.
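
As a rough illustration of a training-time attack and of how its impact could be evaluated, here is a minimal sketch of label flipping. It is not taken from the original article: the synthetic dataset, the `flip_labels` helper, the 40% flip fraction and the choice of logistic regression are all assumptions made for the example.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Synthetic binary classification data (illustrative only).
X, y = make_classification(n_samples=600, n_features=10, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0
)


def flip_labels(labels, fraction, seed=0):
    """Hypothetical helper: flip a fraction of binary labels (0 <-> 1)."""
    rng = np.random.default_rng(seed)
    poisoned = labels.copy()
    n_flip = int(fraction * len(labels))
    idx = rng.choice(len(labels), size=n_flip, replace=False)
    poisoned[idx] = 1 - poisoned[idx]
    return poisoned, n_flip


# The attacker modifies part of the training labels.
y_poisoned, n_flipped = flip_labels(y_train, fraction=0.4)

# Train one model on clean labels and one on poisoned labels.
clean_model = LogisticRegression(max_iter=1000).fit(X_train, y_train)
poisoned_model = LogisticRegression(max_iter=1000).fit(X_train, y_poisoned)

# Evaluate the impact of the attack on the same untouched test set.
clean_acc = clean_model.score(X_test, y_test)
poisoned_acc = poisoned_model.score(X_test, y_test)
print(f"flipped labels: {n_flipped}")
print(f"accuracy with clean labels:    {clean_acc:.3f}")
print(f"accuracy with poisoned labels: {poisoned_acc:.3f}")
```

Comparing the two accuracies gives a first estimate of how sensitive the model is to this specific attack. A real assessment would repeat the experiment with several flip fractions and random seeds.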

1.2 Some considerations on Semantic Web

Before moving into the data science field, I used to work in the field of the Semantic Web. In this article, I ask whether the Semantic Web is really dead, or whether there is still room for this old research field.

My conclusion is that the Semantic Web, and especially the Linked Data initiative, is still alive in the Cultural Heritage sector.

2 Data Structures

2.1 Learning R: Matrices

I deal with matrices and, in particular, I focus on the following aspects:

  • create a matrix
  • assign names to rows and columns
  • select items
  • expand the matrix with new rows or columns
  • basic statistics.

2.2 Sampling a Dataframe in Python Pandas

In this tutorial, I illustrate the following techniques to sample rows with Python Pandas:

  • random sampling: given a dataframe with N rows, random sampling extracts X random rows from the dataframe, with X ≤ N.
  • sampling with condition: extract only the rows which satisfy a given condition.
  • sampling at a constant rate: keep a constant distance between two adjacent samples, for example one row every n rows.
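
The three techniques above can be sketched as follows. This is a minimal example on a toy dataframe, not the code of the original tutorial: the column names, the condition and the step are assumptions chosen for illustration.

```python
import pandas as pd

# Toy dataframe with N = 20 rows (hypothetical data, for illustration only).
df = pd.DataFrame({"id": range(20), "value": [i * 10 for i in range(20)]})

# Random sampling: extract X = 5 random rows, with X <= N.
# random_state makes the selection reproducible.
random_rows = df.sample(n=5, random_state=42)

# Sampling with condition: keep only the rows that satisfy a condition.
conditional_rows = df[df["value"] >= 150]

# Sampling at a constant rate: keep one row every `step` rows.
step = 4
constant_rate_rows = df.iloc[::step]

print(random_rows)
print(conditional_rows)
print(constant_rate_rows)
```

`DataFrame.sample` also accepts `frac` instead of `n` when you prefer to express the sample size as a fraction of the rows.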

3 Data Collection

3.1 HTML Scraping with Python Pandas

In this tutorial, I describe a simple mechanism to extract tables from HTML pages with Python Pandas. This can be achieved through the read_html() function, which is very simple and fast. In most cases, the scraped tables need some cleaning.
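
Here is a minimal sketch of the idea, assuming an HTML parser supported by Pandas (for example lxml) is installed. The HTML snippet, the column names and the cleaning step are made up for the example and do not come from the original tutorial. In practice, you could pass the URL of a page to read_html() instead of an in-memory string.

```python
from io import StringIO

import pandas as pd

# A tiny HTML page with one table (hypothetical content, for illustration only).
html = """
<html><body>
<table>
  <thead><tr><th>Item</th><th>Price</th></tr></thead>
  <tbody>
    <tr><td>Apple</td><td>1.20 EUR</td></tr>
    <tr><td>Bread</td><td>2.50 EUR</td></tr>
  </tbody>
</table>
</body></html>
"""

# read_html() returns a list with one dataframe per table found in the page.
tables = pd.read_html(StringIO(html))
df = tables[0]

# Cleaning step: the Price column is read as text because of the currency
# suffix, so strip the suffix and convert the values to numbers.
df["Price"] = df["Price"].str.replace(" EUR", "", regex=False).astype(float)

print(df)
```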

4 Data Analysis

4.1 How to speed up a scikit-learn classification task

In this tutorial, I measure the time needed to fit a classifier on all the default classification datasets provided by the scikit-learn library, varying the n_jobs parameter from 1 to the maximum number of CPUs. As an example, I use a K-Neighbors Classifier with Grid Search with Cross Validation.
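
The measurement can be sketched as follows. This is a simplified version under my own assumptions: it uses only the iris dataset, a small hypothetical parameter grid, and the values 1 and 2 for n_jobs instead of the full range up to the number of CPUs.

```python
import time

from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)

# Hypothetical parameter grid, chosen only to give the search some work to do.
param_grid = {"n_neighbors": [3, 5, 7, 9], "weights": ["uniform", "distance"]}


def time_grid_search(n_jobs):
    """Fit a grid search with the given n_jobs and return the elapsed seconds."""
    search = GridSearchCV(KNeighborsClassifier(), param_grid, cv=5, n_jobs=n_jobs)
    start = time.perf_counter()
    search.fit(X, y)
    return time.perf_counter() - start


timings = {n_jobs: time_grid_search(n_jobs) for n_jobs in (1, 2)}
for n_jobs, seconds in timings.items():
    print(f"n_jobs={n_jobs}: {seconds:.3f} s")
```

On a dataset this small, the overhead of starting the worker processes may outweigh the gain from parallelism, so a benefit from a larger n_jobs is more likely to show up on bigger datasets or grids.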

4.2 Time Series forecasting with SARIMA model

A SARIMA model can be tuned with two kinds of orders:

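  • (p,d,q) order, which refers to the non-seasonal part of the time series: autoregressive order p, degree of differencing d and moving average order q. This order is also used in the ARIMA model (which does not consider seasonality);
  • (P,D,Q,M) seasonal order, which refers to the order of the seasonal component of the time series, where M is the number of time steps in one seasonal period.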

In this article, I focus on the importance of the seasonal order.
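
To give an intuition of what the seasonal order adds, here is a small sketch that illustrates only the differencing terms (D with period M, then d) on a synthetic series. It does not fit a SARIMA model, and the series, the period M = 12 and the variable names are assumptions made for the example.

```python
import numpy as np

M = 12                                       # seasonal period (assumed)
t = np.arange(5 * M)                         # five full seasons of observations
seasonal_pattern = np.tile(np.arange(M), 5)  # pattern that repeats every M steps
series = 0.5 * t + seasonal_pattern          # linear trend + seasonality

# D = 1: seasonal differencing, y[t] - y[t - M], cancels the repeating pattern.
# What remains here is a constant equal to the trend slope times M.
seasonal_diff = series[M:] - series[:-M]

# d = 1: ordinary differencing, y[t] - y[t - 1], removes the remaining constant.
fully_diff = np.diff(seasonal_diff)

print(seasonal_diff[:5])
print(fully_diff[:5])
```

Ordinary differencing alone would leave the repeating pattern in the series, which is one way to see why the seasonal part of the order matters.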

4.3 K-Neighbours Classification

In this tutorial for beginners, I illustrate how to set up, train and finalise a K-Neighbours Classifier using the scikit-learn library. The following steps should be followed:

  • data preprocessing
  • model training
  • model testing
  • model finalisation
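
The four steps can be sketched as follows. This is a minimal example under my own assumptions: the iris dataset, standard scaling as the preprocessing step, k = 5, and "finalisation" read as refitting the model on all the available data.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.preprocessing import StandardScaler

X, y = load_iris(return_X_y=True)

# 1. Data preprocessing: split the data, then scale the features using
#    statistics computed on the training set only.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0, stratify=y
)
scaler = StandardScaler().fit(X_train)
X_train_scaled = scaler.transform(X_train)
X_test_scaled = scaler.transform(X_test)

# 2. Model training.
model = KNeighborsClassifier(n_neighbors=5).fit(X_train_scaled, y_train)

# 3. Model testing on data the model has never seen.
accuracy = model.score(X_test_scaled, y_test)
print(f"test accuracy: {accuracy:.3f}")

# 4. Model finalisation: refit the scaler and the model on all the data,
#    so the final model can be used on new samples.
final_scaler = StandardScaler().fit(X)
final_model = KNeighborsClassifier(n_neighbors=5).fit(final_scaler.transform(X), y)
```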

4.4 Three tricks to speed up and optimise your Python

In this article, I illustrate three tricks to optimise your Python code:

  • if you need to run scientific computations, you can exploit the numba package
  • if you need to deal with large datasets, you can exploit the pyspark package
  • whenever possible, you can also downcast the column datatypes to reduce memory usage.
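
The datatype trick can be sketched as follows. This is one possible way to do it, using pd.to_numeric with the downcast parameter. The dataframe and its column names are hypothetical, and the original article may use a different approach.

```python
import numpy as np
import pandas as pd

# Toy dataframe stored with the default 64-bit types (hypothetical data).
df = pd.DataFrame(
    {
        "count": np.arange(100, dtype="int64"),
        "ratio": np.linspace(0.0, 1.0, 100),
    }
)
before = df.memory_usage(deep=True).sum()

# Downcast each column to the smallest numeric type that can hold its values.
df["count"] = pd.to_numeric(df["count"], downcast="integer")
df["ratio"] = pd.to_numeric(df["ratio"], downcast="float")
after = df.memory_usage(deep=True).sum()

print(df.dtypes)
print(f"memory usage: {before} bytes -> {after} bytes")
```

Keep in mind that a smaller float type also means lower precision, so this trick is appropriate only when the reduced precision is acceptable for your analysis.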

5 Data Visualisation

5.1 Using sqlite3 in Observablehq

In this tutorial, I exploit the new sqlite3 feature to build a simple bar chart, which updates dynamically according to the user's selection.

As an example dataset, I use the Generic Food Database, provided by data.world and available at this link. I build a dynamic bar chart which shows the number of items in each subgroup of a given main group. The main group is chosen through a dropdown menu.

5.2 D3.js for Beginners: Maps

In this tutorial, I build a choropleth map which shows the population of each country in the world.

5.3 How to insert a Graph drawn in Observablehq into an HTML page

In this tutorial, I propose two strategies to embed a graph into a Web site:

  • through an iframe
  • by producing the Javascript of the graph.

In both cases, you must first publish your notebook by clicking the publish button.

Then, in both cases, you should follow these steps:

  • download the embedding code from Observable
  • insert the code into your HTML page.

5.4 How to Run Animation in Altair

In this tutorial, I illustrate a mechanism which combines the power of Streamlit with Altair, in order to render an animated line chart.

6 Other Topics

6.1 Preserving the layout of a manipulated document

While many tutorials exist on how to manipulate text, I have not found any complete tutorial on how to export the manipulated text to a document with the same layout as the original one.

In this short tutorial, I describe how to achieve this objective with fewer than 10 lines of Python code!

6.2 Build a Readme file

Here, I present a simple online tool, called readme.so, which is specifically designed to build a Readme file very quickly.

It is completely free and it takes just a few minutes to understand how it works. Readme.so supports many languages, including Italian, French, Spanish and many others.

6.3 How to spend your time when you are waiting for a Data Analysis Output

In this article, I suggest two possible ways to fill the waiting time:

  • Focus on your project: you can try to improve your project.
  • Open your mind: you can try to improve your skills and knowledge in different ways, such as attending webinars, online courses and much more.

Summary

In this article, I have given a quick summary of the articles I published in July. If you want to stay up to date, you can follow me and read my new publications.

Stay tuned :)

If you want to stay updated on my research and other activities, you can follow me on Twitter, YouTube and GitHub.

Top 1000 Medium Writer in May, June and July 2021. I write on Data Science, Python, Tutorials and, occasionally, Web Applications.