I’ve written a lot about data analysis with Python recently, and I wanted to explain why it’s become my language of choice. Here are some of the reasons I find Python so easy to use, yet so powerful.

Python Offers Quick Interactive Calculations

Python lets me run statistical calculations much faster than I could ever do by hand. When I started my statistics course back in college, I had to calculate basic descriptive statistics like mean, median, and standard deviation. The length of the datasets made this unwieldy, even with the scientific calculator I had; I was never sure I was entering the data correctly. I quickly switched over to my TI graphing calculator, and the textbook I was using happened to show the statistical functions for that model.

I think I still have my graphing calculator somewhere, but I don’t need it with Python. The standard joke is that you can use Python as a desk calculator, and the interpreter on its own is easy to use for basic calculations. Upgrading to IPython adds a lot of creature comforts that make interactive Python much easier, such as history recall.
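For instance, a quick session in the interpreter looks something like this (the numbers are made up):

```python
>>> 2 ** 32
4294967296
>>> sum([4, 8, 15, 16, 23, 42]) / 6
18.0
```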

Running descriptive statistics on an array of random numbers in Python.

While I do like messing around with my Casio scientific calculator occasionally, since I like the tactile buttons, using Python is so much easier. I can generate a list of random numbers and find descriptive statistics immediately, even on lists of 100 items or more, which would be painful by hand or with a calculator.
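A minimal sketch, using only the standard library:

```python
import random
import statistics

# Generate 100 random values between 0 and 100.
values = [random.uniform(0, 100) for _ in range(100)]

print(statistics.mean(values))
print(statistics.median(values))
print(statistics.stdev(values))
```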

Lots of Libraries to Make Things Easier

One of Python’s best features is the number of libraries you can use with the language. Not only does Python ship with lots of libraries of its own, an approach known as “batteries included,” but there are many more libraries that I can tap into for data analysis.

The built-in math and statistics libraries are helpful for basic calculations, but there are even better ones that I routinely set up in my Mamba environment.

Linear regression of tips vs. total bill, made with Seaborn.

One foundation is NumPy. This is a library for creating and manipulating large numerical arrays efficiently, but it also comes with a lot of functions I can run on them, covering areas such as linear algebra and statistics. It’s the latter that will be the focus of this article.

I can take the mean of a list (the values in the sketch below are made up, since the original array isn’t shown):
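```python
import numpy as np

# An illustrative array of values.
a = np.array([12, 15, 9, 22, 17, 30, 8, 14])
print(np.mean(a))  # arithmetic mean
```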

I can also take the standard deviation of the same array:
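```python
import numpy as np

a = np.array([12, 15, 9, 22, 17, 30, 8, 14])  # same made-up array as above
print(np.std(a))  # population standard deviation (ddof=0 by default)
```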

These are some basic operations. It’s the other libraries that keep me in Python.

Graphics are a powerful way of exploring your data, and Seaborn, which I’ve written about previously, is one of the best tools for it. A good example is visualizing scatter plots with fitted linear regressions.

Python Seaborn restaurant bar plot by day.

Let’s use an example from Seaborn. We’ll load a dataset of restaurant bills in New York City, along with tips and other variables such as the number of people at the table and whether they were smokers or not. Seaborn bundles this dataset, so loading it takes one line:
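```python
import seaborn as sns

# Seaborn ships several example datasets, including "tips".
tips = sns.load_dataset("tips")
print(tips.head())
```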

Let’s plot the tip amount vs. the total bill. A minimal sketch (one of several ways to draw this in Seaborn):
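```python
import seaborn as sns
import matplotlib.pyplot as plt

tips = sns.load_dataset("tips")  # reloaded so the snippet stands alone

# Scatter plot of tip vs. total bill, with a fitted regression line.
sns.regplot(data=tips, x="total_bill", y="tip")
plt.show()
```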

I can also view box plots and bar charts with Seaborn. Let’s look at the total bills in the restaurant over several days; here’s a sketch using a box plot (sns.barplot with the same arguments gives the bar-chart version):
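```python
import seaborn as sns
import matplotlib.pyplot as plt

tips = sns.load_dataset("tips")

# Distribution of total bills for each day of the week.
sns.boxplot(data=tips, x="day", y="total_bill")
plt.show()
```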

I can also examine data more formally with Pingouin, a library for conducting statistical tests. Again, let’s look at the tips dataset. The number to pay attention to is the correlation coefficient, which tells us how well the line fits the data; its square is listed in the “r2” column of Pingouin’s linear regression output. A minimal sketch:
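```python
import pingouin as pg
import seaborn as sns

tips = sns.load_dataset("tips")

# Regress tip amount on total bill; the output includes an r2 column.
result = pg.linear_regression(tips["total_bill"], tips["tip"])
print(result[["names", "coef", "pval", "r2"]])
```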

Python Seaborn restaurant tips linear regression result.

Other times, you want to look at how a value changes over categories. For that, analysis of variance, or ANOVA, is useful.

Let’s go from restaurant bills to penguin bills. This code will examine a dataset of three penguin species: Adélie, Chinstrap, and Gentoo. We’ll see if the difference in bill length in millimeters is significant across species. A minimal sketch, again using a dataset bundled with Seaborn:
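```python
import pingouin as pg
import seaborn as sns

penguins = sns.load_dataset("penguins")

# One-way ANOVA: does mean bill length differ across the three species?
aov = pg.anova(data=penguins, dv="bill_length_mm", between="species")
print(aov)
```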


The p-value is 0.0 (effectively zero), which means that the differences are significant.

More Intuitive Method Names

R has been the traditional champion among programming languages for data analysis, but Python’s method names seem more mnemonic to me, and that’s what puts Python on top.

I don’t have a problem with R itself. I’ve used it before, and it’s a great language for data analysis, but maybe just not so great for me. For example, instead of the equals sign, assigning to a vector (R’s list of numbers, similar to a NumPy array) uses the “<-” operator; the Python equivalent is sketched below.

Tab completion in IPython makes entering commands so much easier than retyping anything, and I can use the history mechanism to get back to previously typed commands.
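In Python, by contrast, assignment is just the equals sign, whether it’s a plain list or a NumPy array:

```python
import numpy as np

# Assignment is just "=" in Python, for lists and arrays alike.
numbers = [6.2, 7.1, 5.9, 8.3]
arr = np.array(numbers)

print(arr.mean())  # method names read the way they sound
```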

Notebook price data in a Jupyter notebook.

Lots of Data to Play With

I’d long been interested in data analysis, especially after hearing how Python and its libraries were increasingly being used for it. I’d had some familiarity with Python, and I wanted to get more involved. But where was I going to find data?

Fortunately, it’s easy to find data. As mentioned earlier, a lot of the libraries let you access public datasets so you can learn how these libraries work and validate their results. We saw this in the earlier examples.

The other way to get data is to generate it randomly. This is another method that’s useful for learning how a library works. The only downside of random data is that it changes every time.
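One workaround is to seed the generator so the numbers come out the same on every run; a minimal sketch with NumPy:

```python
import numpy as np

# Seeding the generator makes the "random" data reproducible.
rng = np.random.default_rng(seed=42)
data = rng.normal(loc=50, scale=10, size=100)

print(data.mean(), data.std())
```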

I can also create my own datasets. This is what I did for a recent project looking at phone and tablet battery life.

There are also a lot of datasets other people have generated on sites like Kaggle that I can download and explore using the tools I’ve mentioned above. I can even download data from government websites.

Interactive Operations Give Immediate Feedback

One thing I like about Python in data analysis is that the interactive mode gives me immediate feedback. I don’t have to wait for a compile cycle or write a script before I can see my results. When I run a calculation in IPython, I will see the result in the terminal window. If I’m making a plot, it will pop up in another window.

Often, when I’m working through data, I’ll get ideas for other operations I want to perform. I can then act on them. That’s the real power of interactive programming.

Jupyter Notebooks Let Me Record My Calculations

As useful as IPython is, a lot of the operations I perform are gone after the session, once I close the window. I can save any plots, but the calculations will likely be lost. I can set up a log, but I have to remember to do that, and if I want to remember how I did something, I’ll have to sift through the command history.

This might be why the developers of IPython created the Jupyter notebook interface. Jupyter lets me build interactive notebooks. I typically create one when I explore a dataset and want to record the results, and I can come back to it whenever I want. The other good thing is that I can share Jupyter notebooks with other people, such as on my GitHub account.

Jupyter has become a de facto standard in scientific computing and data science, for good reason. It’s easy enough for scientists of all stripes to record their findings and share them with their colleagues.

These are the reasons that Python is my programming language of choice, especially for data analysis. It’s an easy language to get started with, and it’s grown with me as I’ve learned more and expanded what I can do.