Coding for Data Science and Data Management
This course is part of the MSc in Data Science for Economics, University of Milan, Italy. Academic year 2019/2020.
The course provides technical skills in coding and scripting for data analysis, and in managing the persistent storage of the data sources and results involved in an analysis. On the one hand, the Python programming language and the R framework are introduced, with a focus on the essential data structures and control structures of both languages. On the other hand, the course presents the core notions of relational databases, such as keys, integrity, and primary/foreign key constraints, as well as the SQL language for data definition, manipulation, and querying. Recent NoSQL solutions are also discussed, with a special focus on MongoDB, a document-oriented system.
The course is divided into three modules:
- R (Emanuele Guidotti)
- Python (Nicolò Cesa-Bianchi)
- Databases (Stefano Montanelli)
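As a small illustration of the database notions named in the course description (primary and foreign keys, referential integrity, and SQL for data definition, manipulation, and querying), here is a minimal sketch using Python's standard-library `sqlite3` module. The table names, column names, and sample rows are hypothetical and chosen only for illustration; they are not course material.

```python
import sqlite3


def build_demo_db():
    """Create an in-memory database with two related tables (hypothetical schema)."""
    conn = sqlite3.connect(":memory:")
    # SQLite enforces foreign keys only when this pragma is switched on.
    conn.execute("PRAGMA foreign_keys = ON")
    # Data definition: each table has a primary key; topic.module_id is a foreign key.
    conn.execute(
        "CREATE TABLE module (id INTEGER PRIMARY KEY, name TEXT NOT NULL UNIQUE)"
    )
    conn.execute(
        """CREATE TABLE topic (
               id INTEGER PRIMARY KEY,
               module_id INTEGER NOT NULL REFERENCES module(id),
               title TEXT NOT NULL
           )"""
    )
    # Data manipulation: insert a few illustrative rows.
    conn.executemany(
        "INSERT INTO module (id, name) VALUES (?, ?)",
        [(1, "R"), (2, "Python"), (3, "Databases")],
    )
    conn.executemany(
        "INSERT INTO topic (module_id, title) VALUES (?, ?)",
        [(1, "R Shiny"), (3, "SQL"), (3, "MongoDB")],
    )
    conn.commit()
    return conn


def topics_per_module(conn):
    """Query: count the topics of each module, including modules with none."""
    rows = conn.execute(
        """SELECT m.name, COUNT(t.id)
           FROM module AS m
           LEFT JOIN topic AS t ON t.module_id = m.id
           GROUP BY m.id, m.name
           ORDER BY m.name"""
    ).fetchall()
    return dict(rows)


def insert_violates_integrity(conn):
    """Return True if a topic pointing at a non-existent module is rejected."""
    try:
        conn.execute(
            "INSERT INTO topic (module_id, title) VALUES (?, ?)", (99, "Orphan")
        )
    except sqlite3.IntegrityError:
        return True
    return False
```

Calling `topics_per_module(build_demo_db())` returns one count per module. `insert_violates_integrity` shows the foreign key constraint refusing a row whose `module_id` matches no existing module.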
Syllabus (R module)
Introduction to the R framework and RStudio
- Variable assignment
- Extension packages
- R help, StackOverflow, Google
Basic Data Types
Basic Data Structures
- Code quality and modularity
- How to implement R functions
Speeding up the code
- References to parallel computing
- Local & Remote files
- R packages
- RESTful APIs
- Web scraping
- Base plotting system
Building interactive interfaces, documents and websites
- R Shiny
Building R packages
- Base packages
- Rcpp packages
- R Shiny packages
If time allows…
- Debugging in RStudio
- Connect R to Python: the
- Connect R to databases: the
Exam (R module)
Create an R package that deals with one or more tasks of your choice. The aim of the project is to assess your ability to work with the R programming language. You are free to choose the task(s) you like most.
The project is individual, although you are strongly encouraged to work in groups and help each other. Please do take advantage of online resources, and note that copying code snippets is totally fine. On the other hand, copying the whole project or most of it is (obviously) not allowed.
Deadline: 16 February 2020, 23:59.
Passing grade (18/30)
The package must be installed with the following command:
ID is the ID of your repository. If the package cannot be installed or raises errors, the exam is failed.
Additional points (9/30)
The package must deal with some of the following topics. Choose the topics you like most. In any case, at most 9 points can be earned.
| Topic | Description | Points |
|---|---|---|
| Data Acquisition | Import data via local/remote files, R packages, APIs or web scraping. | 3 |
| Data Visualization | Visualize data with | 3 |
| Data Analysis | Use R packages to perform some kind of data analysis. The complexity of the analysis is not relevant. The aim is to show your ability to work with R packages. | 3 |
| Interactive Interface | Build an interactive interface with | 3 |
| Code Optimization | Speed up the code using | 3 |
Additional points (3/30)
Up to 3 additional points are assigned based on:
| Criterion | Description | Points |
|---|---|---|
| README | Use the Markdown syntax to provide a high quality | 1 |
| Documentation | Document your functions using | 1 |
| Code quality | Coding style and modularity. | 1 |
Laude (30/30 with honors)
If you achieve 30/30 in the previous steps, honors can be awarded based on code quality.
To set up the repository, create an account on GitHub and email me your GitHub username. The email must be sent from your University address and use “DSE Coding Midterm” as the subject. On the other hand, I suggest creating your GitHub account with your personal email address.