Pds-fall-2013 Setting up your Programming Environment

An Environment for Data Science

This course will rely on two primary tools: the Python programming language and the Unix command line.

Python is a high-level programming language that has become widely used in a variety of settings; for example, it is increasingly used in production systems at companies such as Google. Python’s design favors readability and clarity over flexibility. This makes it much easier to share code among a group of developers, to understand what third-party code does, and to debug. Python is interpreted rather than compiled, which gives a faster turnaround in the development cycle.

Importantly, Python has an active user base. This makes it relatively easy to find others who can help with development problems, and it means there is a rich online literature describing others’ experiences engineering Python systems of all kinds. Another important consequence of Python’s popularity is its wide variety of libraries. There are mature open-source libraries for numerical and statistical computation, data analysis, web programming, data processing, interacting with databases, and just about any other task a data scientist is likely to encounter.

A common criticism of Python is that installing additional libraries can be difficult. Fortunately, Anaconda provides a self-contained, easy-to-use installer that offers a quick ramp-up to Python and most of the data-oriented libraries used in this course.
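Once Python is installed, a quick way to confirm that the libraries are available is to ask the interpreter whether it can locate each one. The sketch below is only an illustration: the names in `LIBRARIES` are an assumed set of common data libraries, not a list taken from the course, and `check_libraries` is a hypothetical helper.

```python
import importlib.util
import sys

# Assumed list of data-oriented libraries; adjust it to match what the
# course actually uses.
LIBRARIES = ["numpy", "scipy", "pandas", "matplotlib"]


def check_libraries(names):
    """Return a dict mapping each top-level module name to True if it can be found."""
    return {name: importlib.util.find_spec(name) is not None for name in names}


if __name__ == "__main__":
    # Report the interpreter version, then the status of each library.
    print("Python", sys.version.split()[0])
    for name, found in check_libraries(LIBRARIES).items():
        print(f"{name}: {'installed' if found else 'missing'}")
```

Saving this as a script and running it with the newly installed interpreter lists each library as installed or missing, without actually importing any of them.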

The rise of data science and the big data age has produced a variety of tools and libraries for statistical modeling and machine learning, data visualization, and data processing. One common theme in the development of data science is an increased reliance on open-source tools and Unix-like operating environments. An ever-increasing number of production systems run in a Unix-like environment, and these environments are becoming more widely used for the various tasks of the data scientist. Familiarity with Unix is a powerful skill: as one’s proficiency with the Unix shell increases, so too does one’s efficiency in completing and automating many tasks.

Below are some installation notes for the Windows and OS X operating systems.

Windows

OS X

Additionally, we will use Git, GitHub, and MySQL throughout this course. The following software is optional for this course, but useful.

Windows

OS X