Can Python Displace R for Data Science?

R and Python are two of the most popular data science languages, but which one is better? And will Python replace R in the near future? Let’s find out!

R vs. Python: the Basics

First, some history.

R first appeared in 1993; it was derived from S, a programming language developed for statisticians. It was (and still is) commonly used in educational settings and is a favorite among biostatisticians.

At its core, R is built for one thing: statistics. It does not cover the wide range of general-purpose tasks that Python does, yet many data scientists still choose R for their work.

Python, like R, was also released in the 1990s (in 1991), but the language’s core philosophy is much broader than just statistics. Unlike R, Python is a general-purpose programming language, so it can also be used for software development and embedded programming.

The main motivation for Python was creating a small core language with a large standard library and an easily extensible interpreter. At its core, Python is a high-level language that abstracts away many unnecessary programming details to make it easier for programmers to use and understand its syntax.

Today, both R and Python have large, helpful communities with many open-source libraries and packages; they’re also both admired by data scientists. So which one is better? And which one should you choose to start your data science journey?

Why not learn both?

Personally, I’m a long-time R programmer. I first picked up R at one of my first jobs; it was just one of many open-source languages at the time. Back then, my department used commercial products like SAS and DB2. I was a big fan of SAS, but I was also curious to learn something new. R programming became my “thing,” something that I practiced outside work in my spare time.

Later, this proved to be a great decision because the new programming knowledge I gained brought me new job opportunities and offers—in short, I became more competitive on the market. And to make things even better, we finally started using R at my company—so learning it really did pay off!

Of course, change is the one constant in programming—it’s never a good idea to just stick to one language and not learn anything new. With new projects at work, I found I had to develop my skills and use new tools. So over the past several months, I started to learn Python.

Currently, I’m working on a project that involves training and developing a neural network. Since these learning algorithms often involve lots of similar (and parallel) calculations, performance has been a big concern.

To speed things up, those kinds of calculations are moved to the GPU (graphics processing unit). TensorFlow is a framework that makes this kind of architecture possible; its primary API is in Python, so we’re not using R for this project.
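To see why this kind of workload suits a GPU, here is a minimal sketch in plain Python (not TensorFlow). It assumes a single, made-up dense layer with a sigmoid activation; the names `neuron_output` and `dense_layer` and all the numbers are hypothetical. The point is only that every neuron's output is an independent calculation of the same shape, so the work can be split up freely.

```python
import math
from concurrent.futures import ThreadPoolExecutor


def neuron_output(weights, bias, inputs):
    """Weighted sum of the inputs plus a bias, passed through a sigmoid."""
    total = sum(w * x for w, x in zip(weights, inputs)) + bias
    return 1.0 / (1.0 + math.exp(-total))


def dense_layer(weight_rows, biases, inputs):
    """Compute every neuron of one layer, one after another."""
    return [neuron_output(w, b, inputs) for w, b in zip(weight_rows, biases)]


# Made-up inputs and parameters for a layer with three neurons.
inputs = [1.0, 2.0]
weight_rows = [[0.0, 0.0], [1.0, -0.5], [0.5, 0.5]]
biases = [0.0, 0.0, 0.0]

sequential = dense_layer(weight_rows, biases, inputs)

# No neuron depends on another, so the same calls can be handed out to
# workers in any order. Threads are used here only to show that
# independence; they are not a realistic way to speed this up in Python.
with ThreadPoolExecutor() as pool:
    parallel = list(
        pool.map(neuron_output, weight_rows, biases, [inputs] * len(biases))
    )
```

A GPU takes advantage of the same property, but runs thousands of these small, identical calculations at once in hardware.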

I’ve really come to appreciate the decision to work in Python. Python’s biggest strength is arguably its simplicity: you can write neat code without much effort. It’s a language with a very gentle learning curve.

R: messy code and data frames

I have to admit that I was a bit confused when I first started learning R. I was an experienced programmer, sure, but this was an entirely different language than what I was used to.

I remember that one thing I had trouble understanding was the notion of a “data frame,” as well as the fact that R is a heavily vectorized language with a strong functional programming flavor.
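For readers who haven't met these ideas, here is a rough sketch of both in plain Python. It is a toy model, not how R (or any data frame library) is implemented: the table is just a dictionary of equally long columns, and the helper `add_column`, the column names, and the values are all made up for illustration.

```python
# A toy, column-oriented "data frame": each key is a column name and each
# value is a list holding that column's entries (all of equal length).
frame = {
    "name": ["Ana", "Ben", "Cara"],
    "height_cm": [170, 182, 165],
    "weight_kg": [60.0, 81.0, 55.0],
}


def add_column(df, name, values):
    """Return a copy of df with a new column, checking the length matches."""
    n_rows = len(next(iter(df.values())))
    if len(values) != n_rows:
        raise ValueError("column length must match the number of rows")
    return {**df, name: list(values)}


# "Vectorized" thinking: describe the calculation over whole columns at
# once. In R this would read roughly: weight_kg / (height_cm / 100)^2
bmi = [
    round(w / (h / 100) ** 2, 1)
    for w, h in zip(frame["weight_kg"], frame["height_cm"])
]
frame = add_column(frame, "bmi", bmi)

# Selecting rows with a boolean mask, similar in spirit to
# df[df$bmi > 21, ] in R.
mask = [value > 21 for value in frame["bmi"]]
filtered = {
    column: [v for v, keep in zip(values, mask) if keep]
    for column, values in frame.items()
}
```

Once you start thinking in columns and masks rather than in loops over rows, a lot of R code becomes much easier to read.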

Readability is also really important in programming, but when it comes to R, you need to be aware that you’ll be working with some messy code at the beginning. But never fear—once you get used to the language, programming in R practically becomes second nature.

R is used by enterprise solutions

Although R is traditionally used in academic settings, some companies are trying to make custom R packages with commercial solutions and support for those tools. Oracle, Microsoft, and IBM are just a few of the many companies that are developing R packages for use with their existing services and databases. So if you learn R, you’ll open yourself to new and exciting job opportunities!

R has some cool visualization packages for exploratory analysis

So what are R’s strengths? Well, R excels when it comes to exploratory analysis and descriptive/inferential statistics. It has many open-source packages that you can install directly from the R console.

Keep in mind that R is one of the leading “academic” languages, so most open-source algorithms and functions related to statistics, data mining and data science will first be implemented in R and then later in other languages. If you want to use brand-new data science algorithms, then R is the way to go.

Also, R has some wonderful data visualization packages, like ggplot2 and plotly. For fast prototyping and interactive visualization, the R Shiny package is an excellent choice. When it comes to data visualization, R takes the cake.

Python: neat code and simplicity

As a high-level language, Python has very simple syntax and is easy to use. The difference is immediately noticeable, especially if you’re coming from more verbose languages like Java or C++.

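As a small illustration of that simplicity, here is a sketch that summarizes a list of numbers using only Python's standard library. The sales figures are made up for this example.

```python
from statistics import mean, median

# Hypothetical monthly sales figures.
sales = [120, 95, 140, 110, 135]

# Keep only the months that beat the average.
average = mean(sales)
above_average = [value for value in sales if value > average]

# Collect a few descriptive statistics in one place.
summary = {"mean": average, "median": median(sales), "max": max(sales)}
```

There are no type declarations, no class definitions, and no boilerplate around the logic; the code reads almost like a description of the task.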
Python has a much broader purpose

As mentioned earlier, R focuses on statistics, data analysis, and exploratory analysis, so its usage is largely limited to these disciplines. Python, on the other hand, has a much broader purpose: as a general-purpose language, it is widely used in applications, web development, and even game development. In short, Python goes beyond just data science.

Tools created for integration

What about integrating R and Python? Is it possible to use these two languages on a single project? The answer is yes: there are tools (like the feather package) that enable us to exchange data between R and Python and integrate code into a single project.
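Feather needs third-party packages on both sides, so as a simpler stand-in, here is a sketch of the same idea using a plain CSV file, which both languages can read out of the box. This is not the feather format; the file name, the helpers `write_table` and `read_table`, and the data are all hypothetical.

```python
import csv
import os
import tempfile

# Made-up measurements to hand over from Python to R.
measurements = [
    {"species": "setosa", "petal_length": 1.4},
    {"species": "virginica", "petal_length": 5.1},
]


def write_table(path, rows):
    """Write a list of dictionaries to a CSV file with a header row."""
    with open(path, "w", newline="", encoding="utf-8") as handle:
        writer = csv.DictWriter(handle, fieldnames=list(rows[0].keys()))
        writer.writeheader()
        writer.writerows(rows)


def read_table(path):
    """Read a CSV file back into a list of dictionaries (values are strings)."""
    with open(path, newline="", encoding="utf-8") as handle:
        return list(csv.DictReader(handle))


with tempfile.TemporaryDirectory() as folder:
    path = os.path.join(folder, "measurements.csv")
    write_table(path, measurements)
    # On the R side, the same file could be loaded with:
    #   df <- read.csv("measurements.csv")
    loaded = read_table(path)
```

CSV loses type information (everything comes back as text here), which is one reason a dedicated format like feather is a better fit for real projects.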

Always competing

When it comes to data analysis and data science, most things that you can do in R can also be done in Python, and vice versa. Usually, new data science algorithms are implemented in both languages. But performance, syntax, and implementations may differ between the two languages for certain algorithms.

It’s up to you which language you choose for your learning path. Usually, statisticians or data analysts start with R and developers with Python. In any case, before you start to learn one or the other, you should identify the goals you want to achieve and how you see yourself applying these languages either in your life or on the job.

If you’re interested in learning Python, be sure to also check out the Introduction to Python for Data Science course (data analysis in Python, no IT background needed).

R or Python?

As a data scientist, you’ll need to choose the right tools for the job. And in terms of programming languages, it comes down to our two contenders: R and Python.

In general, you should use R if you need to:

  • Produce good data visualizations.
  • Conduct inferential statistics and analyses, especially in academic settings.
  • Work with exploratory analysis.

On the other hand, you should use Python if you:

  • Need to work with deep learning or computations that rely on GPUs.
  • Want to develop desktop apps, web apps, or video games.
  • Prefer to write short, clean, and legible code.

At the end of the day, neither of these languages is “better” than the other—each has its strengths and weaknesses.

R and Python are the two most popular and powerful data science languages on the market—so if you want to pursue a career in data science, you need to learn at least one of them. And remember—start with one, but eventually learn them both. You really can’t go wrong with either language!

Marija Ilic

Marija works as a data scientist in the banking industry. She specializes in big data platforms (Cloudera and Hadoop) with software and technologies such as Hive/Impala, Python and PySpark, Kafka, and R. Marija has an extensive background in DWH/ETL development in the banking industry. Her main interests are predictive modeling, real-time decision-making, and social network analysis. Outside of work, Marija enjoys listening to her favorite LPs on her old gramophone—and never grows tired of its soothing crackle.