Python has become an increasingly common language for data scientists, back-end engineers, and front-end engineers, providing a unifying platform for the range of disciplines found on an engineering team. One benefit of Python is its large ecosystem of well-maintained packages that developers can pick up and use. For example, a data scientist may use pandas for data manipulation, NumPy for matrix computation, matplotlib for plotting, SciPy for mathematical modeling, and scikit-learn for machine learning. Another benefit is that developers can contribute their own packages to the community or share a library with other Python programmers. At Endgame, library sharing across projects is very common and supports agile product development. For example, a new clustering algorithm implemented as a Python library can be used in multiple products with minimal adaptation. This tutorial covers the basic steps and recommended practices for structuring a Python project, packaging the code, distributing it through a Git repository (GitHub or a private Git repository), and installing the package via pip.
For busy readers, I’ve developed a workflow diagram, below, so that you can quickly glance at the steps that I’ll outline in more detail throughout the post. Feel free to look back at the workflow diagram anytime you need a reminder of how the process works.
Step One: Setup
Let’s suppose we are going to develop a new Python package that will include some exciting machine learning functionality. We decide to name the package "egclustering" to indicate that it contains functions for clustering. If we later develop a new set of functions for classification, we could create a new package called "egclassification". In this way, functions designed for different purposes are organized into different buckets. We will name the project folder on the local computer "eglearning". In the end, the whole project will be version controlled via Git and hosted on a remote Git repository, either GitHub or a private remote repository. Anyone who wants to use the library just needs to install the package from the remote repository.
Term Definitions
Before we dig into the details, let’s define some terms:
- Python Module: A Python module is a .py file that contains classes, functions and/or other Python definitions and statements. More detailed information can be found in the Python documentation on modules.
- Python Package: A Python package includes a collection of modules and an __init__.py file. Packages can be nested at any depth, provided that each sub-directory contains its own __init__.py file.
- Distribution: A distribution is one level higher than a package. A distribution may contain one or multiple packages. In file systems, a distribution is the folder that includes the folders of packages and a dedicated setup.py file.
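To make the module/package distinction concrete, here is a minimal sketch that builds a throwaway package in a temporary folder and imports it. The package name "egdemo" and the function "double" are made up for illustration and are not part of the project described in this post.

```python
import importlib
import sys
import tempfile
from pathlib import Path

# Build a throwaway package on disk; all names below are illustrative.
root = Path(tempfile.mkdtemp())
pkg_dir = root / "egdemo"            # hypothetical package name
pkg_dir.mkdir()

# A module: one .py file holding definitions.
(pkg_dir / "helpers.py").write_text(
    "def double(x):\n"
    "    return 2 * x\n"
)

# The __init__.py marks the folder as a package; here it also re-exports a name.
(pkg_dir / "__init__.py").write_text("from .helpers import double\n")

sys.path.insert(0, str(root))        # make the parent folder importable
egdemo = importlib.import_module("egdemo")
helpers = importlib.import_module("egdemo.helpers")

print(egdemo.double(21))             # function reached through the package
print(helpers.__name__)              # fully qualified module name
```

The first print shows the function reached through the package itself, because __init__.py re-exported it; the second shows that the module lives under the package's dotted name.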
Step Two: Project Structure
A clearly defined project structure is critically important when creating a Python code package. Not only will it present your work in an organized way and help users find valuable information easily, but it will also be much easier to add new packages or files in the future if the project scales.
I will take the recommendation from "Repository Structure and Python" to structure a new project, only adding a new file called README.md which is an introductory file used on GitHub, as shown below.
README.rst
README.md
LICENSE
setup.py
requirements.txt
egclustering/
    __init__.py
    clusteringModuleLight.py (This py file contains the code.)
    helpers.py
docs/
    conf.py
    index.rst
tests/
    test_basic.py
    test_advanced.py
The project structure is well explained on the page referenced above. Still, it might be helpful to emphasize a few points here:
- setup.py is the file that tells a distribution tool, such as Distutils or Setuptools, how to install and configure the package. It is a must-have.
- egclustering is the actual package name. How would we (or a distribution tool) know that? Because it contains an __init__.py file. The __init__.py file can be empty, or it can contain statements that run when the package is initialized.
- clusteringModuleLight.py is the core file that defines the classes and functions. A single py file like that is called a module. A package may contain multiple modules. A package may also contain other packages, namely sub-packages, as long as each package folder includes an __init__.py. A project may contain multiple packages as well. For instance, we may create a new folder alongside "egclustering" called "egclassification" and put a new __init__.py in it.
- Once you find a structure you like, it can serve as a template for future structures. You only need to copy and paste the whole project folder and give it a new project name. More advanced users can try using some template tools, for example, cookiecutter.
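If copying and pasting a template folder feels error-prone, the skeleton can also be generated by a short script. The following is only a sketch: the function name create_skeleton is my own invention, and it creates empty placeholder files matching the layout listed above.

```python
from pathlib import Path
import tempfile

# Relative paths taken from the layout above; every entry is a file.
LAYOUT = [
    "README.rst",
    "README.md",
    "LICENSE",
    "setup.py",
    "requirements.txt",
    "egclustering/__init__.py",
    "egclustering/clusteringModuleLight.py",
    "egclustering/helpers.py",
    "docs/conf.py",
    "docs/index.rst",
    "tests/test_basic.py",
    "tests/test_advanced.py",
]


def create_skeleton(project_root):
    """Create empty placeholder files for every entry in LAYOUT."""
    root = Path(project_root)
    created = []
    for relative in LAYOUT:
        target = root / relative
        target.parent.mkdir(parents=True, exist_ok=True)
        target.touch(exist_ok=True)   # leaves existing files untouched
        created.append(target)
    return created


if __name__ == "__main__":
    demo_root = Path(tempfile.mkdtemp()) / "eglearning"
    for path in create_skeleton(demo_root):
        print(path.relative_to(demo_root))
```

The demo writes into a temporary folder so that it cannot clobber a real project; point create_skeleton at your own project folder once you are happy with the layout.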
Step Three: Set Up Git and a GitHub (or Private GitHub) Repository
Press Ctrl+Alt+T to open a new terminal (on Ubuntu), and type the following two commands to install Git on your computer, if you haven't done so already.
sudo apt-get update
sudo apt-get install git
If the remote repository will be on GitHub (or any other source code host, such as bitbucket.org), open a web browser, go to github.com, sign up for an account, and create a new repository; mine is named 'peterpan'. If the remote repository will be on a private GitHub instance, create a new repository in a similar way. In either case, you will need to give GitHub your public key so that you can use the ssh protocol to access the repository.
To generate a new pair of ssh keys (private and public), type the commands in the terminal:
ssh-keygen -t rsa -C "your_email@example.com"
eval "$(ssh-agent -s)"
ssh-add ~/.ssh/id_rsa
Then go to the settings page of your GitHub account and paste the contents of the public key file (~/.ssh/id_rsa.pub) into a new key. More details on generating ssh keys can be found in GitHub's help pages.
You should now have a new repository on GitHub ready to go. Click on the link of the repo and it will open the repo's webpage. At the moment, you only have a master branch. We need to create a new branch called "develop" so that all the development will happen on the "develop" branch. Once the code reaches a level of maturity, we put it on "master" branch for release.
To do that, click "branch", and in the blank field, type "develop". When that's done, a new branch will be created.
Step Four: Initiate the Local Git and Sync with the Remote Repository
So far, we have installed Git locally to control the source code version, created the skeleton structure of the project, and set up the remote repository that will be linked with the local Git. Now, open a terminal window and change the directory (command ‘cd’) to the project folder (in my case, it is ~/workspace/peterpan). Type:
git init
git add .
The period “.” after “git add” tells Git to stage everything in the current folder, which places those files under Git's control.
If you haven't done so already, you will need to tell Git who you are. Type:
git config --global user.name "your name"
git config --global user.email "your email address"
Now let's tell the local Git which remote repository it will be associated with. Before doing that, we need the URL of the remote repository so that the local Git knows where to find it. In your browser, open the remote Git repository webpage, either on GitHub or your private GitHub. At the bottom of the right-side panel, you will see the URL for each available protocol: https, SSH, or Subversion. If you're using GitHub and your repository is public, you may choose to use the https URL. Otherwise, use the SSH URL. Click the "copy to clipboard" button to copy the link.
In the same terminal, type:
git remote -v
to check what remote repositories you currently have. There should be nothing.
Now use the copied URL (which in my case is git@github.com:richardxy/peterpan.git) to construct the command below. "peterpanssh" is the name I gave to this specific remote repository which helps the local Git to identify which remote repository we deal with.
git remote add peterpanssh git@github.com:richardxy/peterpan.git
When you type in the command “git remote -v” again, you should see that the new remote repository has been registered with the local Git. You can add more remote repositories in this way with the “git remote add” command. If you would like to delete a remote repository, which basically means "break the link between the local Git and the remote repository", you can use git remote rm (repository name), such as:
git remote rm peterpanssh
If you don't like the current name of a repository, you can rename it with git remote rename (oldname) (newname), such as:
git remote rename peterpanssh myrepo
At the moment, the local Git repository has only one branch. Use “git branch” to check, and you will see “master” only. A better practice is to create a “develop” branch and develop your work there. To do this, type:
git checkout -b develop
Now type “git branch” again and press Enter, and you will see the branch “develop” preceded by an asterisk, which means that “develop” is the current working branch.
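If you ever need the current branch name inside a script, the asterisk convention is easy to parse. The helper below is a small sketch of my own (current_branch is not a Git or Python built-in), and the sample text is typed by hand to mimic the "git branch" output rather than captured from a real run.

```python
def current_branch(branch_output):
    """Return the branch marked with '*' in `git branch` output, or None."""
    for line in branch_output.splitlines():
        if line.startswith("* "):
            return line[2:].strip()
    return None


# Illustrative output, typed by hand rather than captured from a real run.
sample = "* develop\n  master\n"
print(current_branch(sample))  # develop
```

Recent Git versions (2.22 and later) can also print just the name with "git branch --show-current", which avoids parsing altogether.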
Now that we have linked a remote Git repository with the local Git, we can start synchronizing them. When you created the new repository on the remote Git (GitHub or your company's private Git repository), you may have opted to add a .gitignore file. At the moment, the .gitignore file exists only in the remote repository, not in the local Git repository. So we need to pull it to the local repository and merge it with what we have there. To do that, we use the command below:
git pull peterpanssh develop
Of course, peterpanssh is the name of the remote repository registered with the local git. You may use your own name.
“git pull” works fine in small and simple projects like this. But when working on a project that has many branches in its repository, the separate commands "git fetch" and "git merge" are recommended. More advanced material can be found in the git-pull Documentation and on Mark's blog.
Once the local Git repository has everything the remote Git repository has (and more), we can commit and push the contents in the local Git to the remote Git.
The reason for committing to Git is to put the source code under Git's version control. The workflow related to committing usually includes:
Modify code -> Stage code -> Commit code
So, before we actually commit the code, we need to stage the modified files. We do this to tell Git what changes should be kept and put under version control. The easiest way to stage the changes is to use:
git add -p
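That will bring up an interactive session that presents each change and lets you decide whether to stage it. Note that "git add -p" only covers files Git already tracks; brand-new files have to be added by name or with "git add ." as we did earlier. As we haven't made many changes so far, this interactive session should be short. Now we can enter: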
git commit -m "initial commit"
The letter "m" means "message", and the string after "-m" is the message to describe the commit.
After committing, the staged changes (by the "git add" command) are now placed in the local Git repository. The next step is to push it to the remote repository. Using the command below will do this:
git push peterpanssh HEAD:develop
In this case, "peterpanssh" is the remote repository name registered with the local Git, and "develop" is the branch that you would like to push the code to.
Step Five: Develop the Software Package
So far, we have built the entire infrastructure for hosting the local project, controlling the software versions both locally and remotely. Now it's time to work on the code in the package. To put the changes under version control (when you’re done with the project, or any time you think it’s needed), use:
git add -p
git commit -m "messages"
git push repo_name HEAD:repo_branch
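Once this cycle becomes routine, you may want to script it. Below is a rough sketch, not a recommended tool: the function names are hypothetical, and it swaps the interactive "git add -p" for the non-interactive "git add -u" (stage changes to already-tracked files), which is my substitution so that the script can run unattended. By default it only prints the commands.

```python
import shlex
import subprocess


def publish_commands(remote, branch, message):
    """Build the stage/commit/push commands as argument lists."""
    return [
        ["git", "add", "-u"],                      # stage changes to tracked files
        ["git", "commit", "-m", message],
        ["git", "push", remote, f"HEAD:{branch}"],
    ]


def publish(remote, branch, message, dry_run=True):
    """Print the commands; run them only when dry_run is False."""
    for argv in publish_commands(remote, branch, message):
        print(shlex.join(argv))
        if not dry_run:
            subprocess.run(argv, check=True)       # stop at the first failure


# Dry run: shows what would be executed without touching the repository.
publish("peterpanssh", "develop", "describe the change")
```

Passing dry_run=False inside a real working copy would execute the three commands in order and stop if any of them fails.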
Step Six: Write setup.py
When your code package has reached a certain level of maturity, you can consider releasing it for distribution. A distribution may contain one or multiple packages that are meant to be installed at the same time. A designated setup.py file is required to be present in the folder that contains the package(s) to be distributed. Earlier, when we created the project structure, we already created an empty setup.py file. Now it's time to populate it with content.
A setup.py file contains at least the following information:
from setuptools import setup, find_packages

setup(name='eglearning',
      packages=find_packages()
      )
There are a few distribution tools in Python. The original standard-library tool for packaging was distutils (since deprecated, and removed from the standard library in Python 3.12), and setuptools is an upgrade of distutils with more features. In the setup() function, the minimum information we need to supply is the name of the distribution and which packages are to be included. The function find_packages() will recursively go through the current folder and its sub-folders to collect package information, as long as an __init__.py is found in each package folder.
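To show roughly what that recursive search does, here is a simplified stand-in written with os.walk. It is my own illustration, not the setuptools implementation, and the function name find_packages_sketch is made up; the real find_packages() also supports include and exclude patterns.

```python
import os
import tempfile
from pathlib import Path


def find_packages_sketch(where="."):
    """Return dotted names of folders under `where` that hold an __init__.py.

    A simplified stand-in for illustration, not setuptools' implementation.
    """
    where = os.fspath(where)
    found = []
    for dirpath, dirnames, filenames in os.walk(where):
        if dirpath == where:
            continue                  # the project root itself is not a package
        if "__init__.py" not in filenames:
            dirnames[:] = []          # not a package: do not descend further
            continue
        relative = os.path.relpath(dirpath, where)
        found.append(relative.replace(os.sep, "."))
    return sorted(found)


# Build a tiny project in a temporary folder to try it on.
root = Path(tempfile.mkdtemp())
for rel in ["egclustering/__init__.py",
            "egclustering/sub/__init__.py",
            "docs/conf.py"]:
    target = root / rel
    target.parent.mkdir(parents=True, exist_ok=True)
    target.touch()

print(find_packages_sketch(root))  # ['egclustering', 'egclustering.sub']
```

The docs folder is skipped because it has no __init__.py, while the nested folder is reported under its dotted name.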
It is also helpful to provide the metadata for the distribution, such as version, a description of what the distribution does, and author information. If the distribution has dependencies, it is recommended to include the installation requirements in setup.py. Therefore, it may end up looking like this:
from setuptools import setup, find_packages

setup(name='eglearning',
      version='0.1a',
      description='a machine learning package developed at Endgame',
      packages=find_packages(),
      install_requires=[
          'Pandas>=0.14',
          'Numpy>=1.8',
          'scikit-learn>=0.13',
          'elasticsearch',
          'pyes',
      ],
      )
To write a more advanced setup.py, the Python documentation and this web page are good resources.
When you are done with setup.py, commit the change and push it to the remote repository by typing the following commands:
git add -p
git commit -m 'modified setup.py'
git push peterpanssh HEAD:develop
Step Seven: Merge Branch Develop to Master
According to Python engineer Vincent Driessen, "we consider origin/master to be the main branch where the source code of HEAD always reflects a production-ready state." When the code in the develop branch enters the production-ready state, it should be merged into the master branch. To do this, simply type in the terminal under the project directory:
git checkout master
git merge develop
Now we can push the master branch to the remote repository:
git push peterpanssh master
Step Eight: Install the Distribution from the Remote Repository
The Python package management tool "pip" supports the installation of a package distribution from a remote repository such as GitHub, or a private remote repository. pip supports cloning over several protocols, including https and ssh, with "git+" placed in front of the URL scheme. Here we will use ssh.
You may choose to install from a specific commit (identified by its SHA-1 hash) or from the latest commit in a branch. To specify a commit for cloning, type:
pip install -e git+ssh://git@github.com/richardxy/peterpan.git@4e476e99ce2649a679828cf01bb6b3fd7856281f#egg=eglearning
In this case, "git@github.com/richardxy/peterpan.git" is the ssh clone URL with the ":" after ".com" replaced with "/", and "git+ssh://" is added in front of it. This is tricky, and it won't work if you omit the replacement. The "egg" parameter is also required; its value should be the distribution name from setup.py, here "eglearning". Run the command inside a virtualenv, or prefix it with sudo for a system-wide install; be aware that under sudo, ssh runs as root and may not find your key.
If you opt to clone the latest version in a branch (e.g. the “develop” branch), type:
pip install -e git+ssh://git@github.com/richardxy/peterpan.git@develop#egg=eglearning
You only need to specify the branch name after "@" and before the "#egg" parameter. This is my preferred method.
Then pip will check whether the installation requirements are met and install the dependencies and the package for you. Once it's done, type:
pip freeze
to find the newly installed package. You will see something like this:
-e git+ssh://git@github.com/richardxy/peterpan.git@2251f3b9fd1b26cb41526f394dad81016d099b03#egg=eglearning
Here, 2251f3b9fd1b26cb41526f394dad81016d099b03 is the SHA-1 hash of the latest commit.
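Because the ":" to "/" rewrite of the ssh clone URL is easy to get wrong by hand, here is a small sketch that does it for you. The function name pip_url_from_ssh_clone is hypothetical, and it assumes the clone URL has the "user@host:path" shape that GitHub shows for SSH; it produces the "git+ssh://" form that pip documents for installing over SSH.

```python
def pip_url_from_ssh_clone(clone_url, ref, egg):
    """Turn 'user@host:owner/repo.git' into a pip-installable git+ssh URL."""
    user_host, sep, path = clone_url.partition(":")
    if not sep or "@" not in user_host:
        raise ValueError(f"not a user@host:path clone URL: {clone_url!r}")
    # Replace the ':' with '/', then append the ref and the egg name.
    return f"git+ssh://{user_host}/{path}@{ref}#egg={egg}"


print(pip_url_from_ssh_clone(
    "git@github.com:richardxy/peterpan.git", "develop", "eglearning"))
# git+ssh://git@github.com/richardxy/peterpan.git@develop#egg=eglearning
```

The ref argument can be a branch name, as here, or a full commit hash when you want to pin an exact version.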
Type the command below to create a requirements document that registers all of the installed packages and versions.
pip freeze > requirements.txt
Then open requirements.txt, replace the commit hash with the branch name, such as “develop”, and save it. The reason for doing this is that by the next time a user tries to install the package, there might be a new commit, and the hash would therefore have changed. Using the branch name will always point to the latest commit in that branch.
One caveat: if virtualenv is used, the pip freeze command should look like this so that only the configurations in the virtual environment will be captured:
pip freeze -l > requirements.txt
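The manual edit of requirements.txt can also be done with a short script. This is only a sketch under the assumption that the line contains a full 40-character commit hash directly before "#egg="; the function name pin_to_branch is made up, and the sample line is illustrative, written in the style of the pip freeze output shown earlier.

```python
import re

# A full Git commit ID is 40 hexadecimal characters.
COMMIT_ID = re.compile(r"@[0-9a-f]{40}(?=#egg=)")


def pin_to_branch(requirement_line, branch):
    """Replace '@<commit id>' with '@<branch>' in a VCS requirement line."""
    return COMMIT_ID.sub(lambda _match: "@" + branch, requirement_line)


# Illustrative line in the style of the pip freeze output.
line = ("-e git+ssh://git@github.com/richardxy/peterpan.git"
        "@2251f3b9fd1b26cb41526f394dad81016d099b03#egg=eglearning")
print(pin_to_branch(line, "develop"))
# -e git+ssh://git@github.com/richardxy/peterpan.git@develop#egg=eglearning
```

Lines for ordinary packages such as "numpy==1.8.0" contain no commit hash and pass through unchanged, so the function can be applied to every line of the file.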
Conclusion
This tutorial covers the most fundamental and essential procedures for creating a Python project, applying version control during development, packaging the code, distributing it through code-sharing repositories, and installing the package by cloning the source code. Following this process can help data scientists without formal computer science training get more comfortable using Python alongside well-known collaborative tools such as Git for software development and distribution.
Read more blog posts by Richard Xie.