An Environment for Data Science
This course will rely on two primary tools: the Python programming language and the Unix command line.
Python is a high-level programming language that has become widely used in a variety of settings; for example, it is increasingly used in production systems at companies such as Google. Python’s design favors readability and clarity over flexibility, which makes sharing code among a group of developers much easier, helps a user readily understand what third-party code does, and eases the pain of debugging. Python is interpreted rather than compiled, giving a faster turnaround in the development cycle.
Importantly, Python has an active user base. This makes it relatively easy to find others who can help with development problems, and means there is a rich body of online literature illustrating others’ experiences engineering Python systems of all kinds. Another important result of Python’s popularity is its wide variety of libraries. There are currently mature open source libraries for numerical and statistical computation, data analysis, web programming, data processing, interacting with databases, and just about any other task a data scientist is likely to encounter.
A common criticism of Python is that installing additional libraries can be difficult. Fortunately, Anaconda provides a self-contained, easy-to-use installer, giving a quick ramp-up to both the Python programming language and most of the data-oriented libraries used in this course.
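Once the installer has finished, a short script like the following can confirm that Python can find the libraries you expect. This is a minimal sketch: the library names in `LIBRARIES` are an assumption chosen for illustration, not a list taken from the course, so adjust them as needed.

```python
import importlib.util
import sys

# Assumed, illustrative list of data-oriented libraries; edit to suit the course.
LIBRARIES = ["numpy", "scipy", "pandas", "matplotlib"]


def check_libraries(names):
    """Return a dict mapping each library name to True if Python can locate it."""
    return {name: importlib.util.find_spec(name) is not None for name in names}


if __name__ == "__main__":
    # Report the interpreter version, then the status of each library.
    print("Python version:", sys.version.split()[0])
    for name, found in check_libraries(LIBRARIES).items():
        print(f"{name}: {'installed' if found else 'missing'}")
```

The script only checks whether each library can be located, without importing it, so it runs quickly even when some libraries are missing.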
The rise of data science and the big data age has resulted in a variety of tools and libraries for performing statistical modeling and machine learning, data visualization, and data processing. One common theme in the development of data science is an increased reliance on open source tools and Unix-like operating environments. An ever-increasing number of production systems run in a Unix-like environment, and these environments are becoming more widely used for the various tasks of the data scientist. Being familiar with Unix is a powerful skill: as one’s proficiency with the Unix shell increases, so too does efficiency in completing and automating many tasks.
Below are some installation notes for the Windows and OS-X operating systems.
Windows
- Cygwin, a Unix command line emulator for Windows. Includes notes for configuring with Anaconda.
- Installing Anaconda Python, configuring with Cygwin. (official documentation)
OS-X
- Terminal or iTerm2 for a Unix command line environment
- Installing Anaconda Python (official documentation)
Additionally, throughout this course we will be using Git, GitHub, and MySQL. The following software is optional for this course, but useful.
Windows
- Git installation guide on Windows. Note: Git comes pre-installed with Cygwin when using the installation guide above.
- MySQL setup on Windows. Setting up a MySQL database on Windows.
- MySQL Workbench. An interactive front end to MySQL databases. Often convenient for storing queries and visualizing results.
OS-X
- Installing Git on OS-X
- Installing MySQL on OS-X
- Sequel Pro. A nice application front end to MySQL databases
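After installing the optional tools above, one way to check that they are reachable from the command line is to look them up on the system PATH from Python. This is a rough sketch: the executable names `git` and `mysql` are assumptions and may differ depending on how the software was installed.

```python
import shutil

# Assumed executable names; these may differ on your system.
TOOLS = ["git", "mysql"]


def find_tools(names):
    """Map each tool name to its full path, or None if it is not on the PATH."""
    return {name: shutil.which(name) for name in names}


if __name__ == "__main__":
    for name, path in find_tools(TOOLS).items():
        print(f"{name}: {path or 'not found on PATH'}")
```

A tool reported as not found may still be installed; it may simply need its install directory added to the PATH.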