Python Basics
Johann Mitlöhner, 2012
Why use Python?
- open source programming language, free to install and use
- huge user community (number 8 in programming language popularity at the time of writing), lots of help online
- clear syntax
- easy to learn
- fast to program
- large set of modules for wide array of application domains
- available for all major operating systems
Python focuses on productivity and low maintenance cost.
However, there are situations when not to use Python:
- Maximum performance needed, use C
- Statistics, use R
History and Characteristics
- Version 0.9.0 published in 1991 by Guido van Rossum
- Interpreted language
- Reasonably fast execution, but still slower than C by a factor of at least 10 to 20
- Minimal syntax, 'executable pseudo code'
- Block structure marked by indentation
- No variable or type declarations
Python is well suited for data preparation, writing various file formats, and
automating repeated tasks with other software packages, such as statistics systems.
Accessing databases is very easy, and due to its simple and clear syntax, the language
can be learned in a short time. Therefore, Python is often used as a 'glue' between
different systems, such as retrieving data from a database, formatting it in specific ways,
and calling other software for further processing, such as statistical analysis.
Some examples (Unix, Python 2)
Python can be used interactively, but it is usually run from a script.
Use a text editor to create the file hello.py, e.g. pico hello.py:
#!/usr/bin/python
print "Hello!"
Save the file and leave the editor. Then,
chmod +x hello.py
./hello.py
- the first line states that this file will be run by the Python interpreter found in directory /usr/bin
- the filename suffix .py is not necessary on Unix, but generally considered good practice
There are other options to make this script run:
- env
  - instead of specifying the path to the interpreter you could also use #!/usr/bin/env python
  - this will search for a python binary in your path; this is usually, but not always, what you want
  - enter echo $PATH to see the directories in your path
- call interpreter
  - drop the first line completely and call the python interpreter on the command line:
  - python hello.py instead of ./hello.py
Indentation is used to mark blocks, such as loops:
for i in range(10):
    print "i:", i
- the range(n) function returns the numbers 0 to n-1
- the print command puts blanks between the arguments
  - use "i:"+str(i) to avoid the blank
  - note that the integer i must be converted to a string in this case
- use blanks to indent, not tabs
  - tab spacing varies with editor/terminal/computing environment
  - one blank is always exactly one character
  - be consistent, e.g. always 2 blanks per level
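As a small illustration of the points above, the following sketch builds the labels by explicit string concatenation. It is written with print() as a function call, which goes beyond the Python 2 examples in this text, so that it also runs under Python 3; the name labels is made up for this example.

```python
# Build "i:0", "i:1", ... without the blank that print puts between arguments.
labels = []
for i in range(3):
    # str(i) converts the integer to a string before concatenation
    labels.append("i:" + str(i))

# The loop body is marked by indentation only.
for label in labels:
    print(label)
```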
Some other operations:
- initialising a counter: cnt = 0
- incrementing a counter: cnt += 1
- initialising a string: s = "" (avoid the name str, which would hide the built-in str() function)
- or: s = '' (use straight single quotes, not slanted ones)
- adding to a string: s += " next"
- quotes within a string: s = "Hugo's home"
- multiple assignment: (x,y) = (42,43)
- all items in a list starting at index i: lst[i:]
- all items up to, but not including, index n: lst[:n]
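The operations listed above can be tried out in one short script. This is only a sketch; the variable names text, tail and head and the list contents are chosen for illustration, and print() is used as a function so the code also runs under Python 3.

```python
cnt = 0            # initialise a counter
cnt += 1           # increment it

text = ""          # initialise a string
text += " next"    # append to it

(x, y) = (42, 43)  # multiple assignment

lst = [10, 20, 30, 40, 50]
tail = lst[2:]     # all items starting at index 2
head = lst[:2]     # all items before index 2

print(tail)
print(head)
```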
Python supports object oriented programming, but this
text will not cover more than necessary to use the standard tools.
>>> x = "ab,cd,efg".split(",")
>>> x
['ab', 'cd', 'efg']
>>> len(x)
3
>>> x[2]
'efg'
- A string is split into several strings; the result is a list of strings.
- Indexing starts with 0 in Python.
Reading and Writing Files
The following program will read a file; if you put this code
in a file called fileread.py, it will read itself!
f = open("fileread.py")
for rec in f:
    print rec.strip()
f.close()
- open() returns a file object
- strip() removes leading and trailing whitespace (blanks, tabs, newlines)
- open files should be closed automatically when a program terminates; however,
  it is good practice to call close() explicitly
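One way to make sure a file is closed is the with statement, which is not used elsewhere in this text but is available in Python 2.6 and later. The following is a minimal sketch; the scratch file demo.txt in a temporary directory is an assumption made so that the example does not depend on existing files.

```python
import os
import tempfile

# Hypothetical scratch file in a fresh temporary directory.
path = os.path.join(tempfile.mkdtemp(), "demo.txt")

with open(path, "w") as f:
    f.write("  first line  \n")
    f.write("second line\n")

lines = []
with open(path) as f:      # f is closed automatically when the block ends
    for rec in f:
        lines.append(rec.strip())

print(lines)
print(f.closed)
```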
The following program will write to a file called tmp.dat:
from random import randint
f = open("tmp.dat", "w")
f.write("x y\n")
for i in range(1,11):
    x = randint(15000, 25000)
    y = randint(10, 50)
    f.write(str(x) + " " + str(y) + "\n")
f.close()
- the import statement makes additional modules accessible
  - use import random to import the whole module
  - then, refer to its functions as e.g. random.randint()
  - it is usually preferable to import only the functions that are actually needed
- the second argument "w" of open() indicates that the file will be
  - created if it does not exist
  - overwritten if it already exists
- newline \n starts a new line in the file
  - in a string, newline is written as the two characters \ and n
  - in Unix, newline is actually one character
  - newline is represented differently in some other operating systems
  - however, the same Python code will produce a newline in each operating system
- randint(a,b) returns a random integer in the closed interval [a,b]
After running this program the file tmp.dat will contain something like this:
x y
20779 23
20163 42
19763 10
17358 11
24389 47
19770 19
21764 48
22383 37
17829 50
17884 12
The format of this file allows it to be used as a data file in the statistics
system R.
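Reading such a file back in Python could look like the following sketch. It writes a shortened copy of the sample data to a temporary path first (the path and the shortened data are assumptions for the sake of a self-contained example) and then splits each record into its two integer fields.

```python
import os
import tempfile

# Write a small file in the same layout as tmp.dat; the three records are
# taken from the sample output above, the temporary path is an assumption.
path = os.path.join(tempfile.mkdtemp(), "tmp.dat")
with open(path, "w") as f:
    f.write("x y\n20779 23\n20163 42\n19763 10\n")

xs = []
ys = []
with open(path) as f:
    header = f.readline().split()   # the column names from the first line
    for rec in f:
        fields = rec.split()        # split on whitespace
        xs.append(int(fields[0]))
        ys.append(int(fields[1]))

mean_y = sum(ys) / float(len(ys))
print(header)
print(mean_y)
```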
Functions and Modules
Well-structured code is much easier to maintain. Function definition is a must
for every non-trivial program. For large applications, defining separate modules
usually achieves better structure.
#!/usr/bin/python
# use modulo % to check even/odd
def odd(x):
    if (x % 2) == 1: return True
    else: return False

def main():
    for i in range(10):
        if odd(i):
            print i, "is odd!"

if __name__ == '__main__': main()
- the # character starts a comment in Python
- the keyword def introduces a definition
- the values True and False are pre-defined boolean values
- since the result of odd() is used in main(), it should return a result value
The last line decides whether to start the main() function:
- if this file is executed as a script, this main() function will be called
- if this file is imported by another python program, this main() function will not be executed
- this is useful when you want to write modules containing their own main() functions for testing
Coding style:
- if there is only one statement in a block, that statement may follow on the same line (as in odd())
- however, many consider it better style to put each statement on a separate line (as in main())
Your code will be easier to read and debug if you stick with a particular coding style, e.g. for
- the number of spaces for indent
- using blanks between values and operators or not, e.g. x + y versus x+y
- naming - abbreviations, upper/lower case, underscores
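As a further style variant of the odd() function shown earlier: the comparison already yields a boolean, so it can be returned directly. This is only a sketch of one possible style, not a required change; the list comprehension and the name odd_numbers are additions for this example.

```python
def odd(x):
    # x % 2 == 1 is already True or False, so it can be returned directly
    return x % 2 == 1

# collect the odd numbers below 10
odd_numbers = [i for i in range(10) if odd(i)]
print(odd_numbers)
```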
The following module should go into a file called tools.py;
it contains its own main() function:
#!/usr/bin/python
def prime(x):
    # 0 and 1 are not prime
    if x < 2: return False
    # x is prime if no i in 2..x-1 divides it
    for i in range(2, x):
        if (x % i) == 0: return False
    return True

def main():
    print "main() from tools"

if __name__ == '__main__': main()
However, when imported in the following code, the tools main() function will
not be executed.
#!/usr/bin/python
import tools

def main():
    for i in range(1,10):
        if tools.prime(i):
            print i, "is prime!"

if __name__ == '__main__': main()
Instead, the main() function in this file is called.
- alternatively, we could have imported just the prime() function with from tools import prime
- note that the first character in tools.py is #, which in Python starts a comment line;
  therefore the #!/usr/bin/python line is conveniently ignored
- the filename of the second script does not matter; however,
  the tools module should reside in the same directory as the program importing it
- alternatively, module files can be placed in any directory listed in the environment variable PYTHONPATH
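To see where Python actually looks for modules, the search path can be printed. The sketch below uses the standard sys.path list, which typically contains the directory of the running script as well as any PYTHONPATH entries; the variable name pythonpath is chosen for this example.

```python
import os
import sys

# sys.path is the list of directories searched by import statements.
for directory in sys.path:
    print(directory)

# PYTHONPATH may be unset, in which case get() returns the default "".
pythonpath = os.environ.get("PYTHONPATH", "")
print(pythonpath)
```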
Accessing Databases
The following program will access a Postgres database, execute
an SQL select statement, and print the result. Obviously, this will
only work on a computer where
- a Postgres database management system is running
- a database as described in the file db.txt exists,
  with the table PROD containing a column color
- the package python-psycopg2 is installed (Python-Postgres access)
#!/usr/bin/python
import psycopg2
import os

def testpg():
    conn = psycopg2.connect(dbstr())
    cur = conn.cursor()
    cur.execute("select color, count(*) from PROD group by color limit 5;")
    for rec in cur:
        (color, cnt) = rec
        print color, cnt
    cur.close()
    conn.close()

def main():
    testpg()

def dbstr():
    return open("db.txt").read()

if __name__ == '__main__': main()
- when accessing databases, keep the confidential information in a separate file
- for postgres, the file db.txt contains just one line which has to look like this:
  dbname=mydbname user=myuserid password=mypasswd
- do a chmod go-r db.txt, and other users will not be able to read the file
- the cursor object can be used to execute SQL commands
- for select, the result can be accessed by looping over the cursor
- each record is composed of fields, which can be split into several values
  - a multiple assignment statement like (x,y) = ... achieves the splitting
- open cursors and DB connections should be closed automatically when a program terminates; however,
  it is good practice to close cursor and connection explicitly
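The cursor pattern described above can be tried without a Postgres server. The following sketch swaps in the sqlite3 module from the standard library and an in-memory database; this substitution, the column pid and the three sample rows are assumptions for illustration, and sqlite3 uses ? as placeholder where psycopg2 uses %s.

```python
import sqlite3

# In-memory SQLite database standing in for Postgres (an assumption made
# so that the sketch runs without a database server).
conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.execute("create table PROD (pid integer, color text)")
cur.executemany("insert into PROD values (?, ?)",
                [(1, "red"), (2, "blue"), (3, "red")])

cur.execute("select color, count(*) from PROD group by color order by color")
result = []
for rec in cur:
    (color, cnt) = rec          # split the record into its fields
    result.append((color, cnt))
    print(color + " " + str(cnt))

cur.close()
conn.close()
```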
When inserting values into database tables, use the following code snippet
as a template:
...
id = 10
name = "Smith"
salary = 20000
cur.execute("insert into EMP (eid, name, salary) values (%s, %s, %s)", (id, name, salary))
...
cur.close()
conn.commit()
conn.close()
- obviously, you need a table EMP with the stated columns for this to work
- note how the number of %s placeholders matches the number of values inserted
- without the commit() the changes will not be made permanent
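The effect of commit() can be observed in a small experiment. The sketch below again uses sqlite3 in place of Postgres (an assumption, as are the database file demo.db and the second employee Jones); the first insert is committed, the second is not, and only the first survives closing and reopening the connection.

```python
import os
import sqlite3
import tempfile

# Hypothetical database file; SQLite stands in for Postgres here.
path = os.path.join(tempfile.mkdtemp(), "demo.db")

conn = sqlite3.connect(path)
conn.execute("create table EMP (eid integer, name text, salary integer)")
# sqlite3 uses ? as placeholder where psycopg2 uses %s
conn.execute("insert into EMP (eid, name, salary) values (?, ?, ?)",
             (10, "Smith", 20000))
conn.commit()                   # make the first insert permanent
conn.execute("insert into EMP (eid, name, salary) values (?, ?, ?)",
             (11, "Jones", 30000))
conn.close()                    # closed without commit: second insert is lost

conn = sqlite3.connect(path)
names = [row[0] for row in conn.execute("select name from EMP order by eid")]
conn.close()
print(names)
```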
Various Python-related
Interpreters and Compilers
- CPython
  - standard Python interpreter
  - supports the full language and all additional packages
  - moderate performance for a wide array of programming tasks
- Pypy
  - new just-in-time implementation of the Python interpreter
  - interesting for a variety of reasons, aside from performance
  - promises dramatic speedup once the project reaches maturity
  - currently, a factor of no more than 2 is typical, compared to CPython
  - unfortunately, some widely used packages are not yet fully supported, most importantly numpy
  - get it from pypy.org
  - to use it
    - change the #!/usr/bin/python header to your pypy binary; or
    - use #!/usr/bin/env pypy
    - the second form will look for a pypy binary in your path
- Psyco
  - older just-in-time compilation approach
  - package python-psyco is available in various Linux distros, but the project is no longer maintained
  - unfortunately, it only works on 32 bit systems, which rules out server applications typically running on 64 bit
  - to use it, install the package and just include the following two lines at the start of the program:
    import psyco
    psyco.full()
- Shedskin
  - Python to C++ compiler, can achieve C-level performance
  - developed by Mark Dufour; shows great promise, but not yet at a mature state
  - many features of Python are not yet supported
  - small, simple programs are likely to work
  - download e.g. from google code and give it a try
Performance
Python was developed with programmer performance in mind, not computer performance.
Note that programmer time is much more expensive than computer time. However, some
applications demand certain response times, and Python may not be able to cope.
Performance benchmarks are notoriously difficult and misleading. Here, the Quicksort
algorithm has been implemented in several languages, using the same approaches everywhere
as far as the language permits. The table shows the execution time in seconds for sorting
n random integers, run on a 32 bit Intel Core 2 Duo E8400, 3 GHz, 6 MB cache, 4 GB RAM.
n         | C     | Java  | Python | Pypy  | Psyco
10,000    | 0.003 | 0.084 | 0.056  | 0.187 | 0.032
100,000   | 0.018 | 0.092 | 0.545  | 0.906 | 0.175
1,000,000 | 0.123 | 0.216 | 6.256  | 4.066 | 1.814
- The C language still shows the best performance.
- Java suffers from the runtime startup; however, with sufficient problem size, this effect
  becomes less dramatic, compared to C.
- The standard CPython interpreter (Python 2.6.5) is slower than C by a factor of roughly 20 to 50.
- The new Pypy Python interpreter manages to be faster than CPython only with sufficient
  problem size.
- The Psyco just-in-time compiler manages a speedup of roughly 2 to 3.5 over CPython here. Factors up to 20
  can be achieved in some circumstances.
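The factors quoted above follow directly from the table. As a sketch, the ratios can be recomputed like this; the dictionary layout and the names slowdown and speedup are chosen for this example, while the numbers are copied from the 32 bit table.

```python
# Execution times in seconds, copied from the 32 bit table above.
times = {
    10000:   {"C": 0.003, "Python": 0.056, "Psyco": 0.032},
    100000:  {"C": 0.018, "Python": 0.545, "Psyco": 0.175},
    1000000: {"C": 0.123, "Python": 6.256, "Psyco": 1.814},
}

slowdown = {}   # CPython time divided by C time
speedup = {}    # CPython time divided by Psyco time
for n in sorted(times):
    row = times[n]
    slowdown[n] = row["Python"] / row["C"]
    speedup[n] = row["Python"] / row["Psyco"]
    print(str(n) + ": " + str(round(slowdown[n], 1)) + " " + str(round(speedup[n], 2)))
```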
Here is the same benchmark on a server running 64 bit Linux, a virtual machine with 6 cores,
Intel Xeon E7310 1.6 GHz, 2 MB cache, 8 GB RAM.
n         | C     | Java  | Python | Pypy  | Psyco
10,000    | 0.005 | 0.163 | 0.103  | 0.374 | -
100,000   | 0.021 | 0.256 | 0.888  | 1.227 | -
1,000,000 | 0.187 | 0.437 | 10.208 | 1.978 | -
- The 64 bit Java runtime lags further behind C than the 32 bit version (both HotSpot, Java 1.6.0).
- Python (2.6.6) is again much slower than C, and slower than Java for the larger problem sizes.
- Pypy (version 1.9 on both machines) is faster than CPython by a factor of about 5 for n=1,000,000.
- There is no Psyco for 64 bit.
Some notes of caution for interpreting and drawing conclusions from these benchmarks:
- Sorting integers represents only a tiny fraction of application domains.
- In many situations, computation inside the local CPU will not be the bottleneck.
- Other operations, such as network transfer or database access, will be much slower.
- Therefore, Python's lack of performance is irrelevant for many applications.
- Performance can often be hugely improved when Python's built-in functions are used in a smart way.
- Many large-scale applications have been developed in Python, e.g. Dropbox.
- Python is used for various tasks in large systems, e.g. Google, YouTube.
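The point about built-in functions can be checked with the timeit module. The sketch below compares a simple hand-written quicksort with the built-in sorted(); this quicksort is an illustrative version and not necessarily the one used for the benchmark tables, and the problem size of 10,000 and the three repetitions are arbitrary choices. Actual timings depend on the machine, so none are stated here.

```python
import random
import timeit

def quicksort(items):
    # simple quicksort that is not in-place; returns a sorted list
    if len(items) <= 1:
        return items
    pivot = items[len(items) // 2]
    smaller = [v for v in items if v < pivot]
    equal = [v for v in items if v == pivot]
    larger = [v for v in items if v > pivot]
    return quicksort(smaller) + equal + quicksort(larger)

random.seed(0)
data = [random.randint(0, 1000000) for _ in range(10000)]

# total time in seconds for three runs of each approach
t_quick = timeit.timeit(lambda: quicksort(data), number=3)
t_builtin = timeit.timeit(lambda: sorted(data), number=3)

print(quicksort(data) == sorted(data))
print(t_quick)
print(t_builtin)
```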
External Code:
Python can call external code, such as C programs. If the performance bottleneck can
be tracked to a small number of simple functions, those pieces can be written in C with acceptable
effort and integrated by normal Python function calls. There are several alternatives; one of them
is ctypes.
The interface is not trivial, but its use is feasible for moderately experienced developers.
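As a minimal sketch of what a ctypes call looks like, the following loads the C standard library and calls its abs() function. It assumes a Unix-like system where the C library can be located with find_library; calling your own compiled C code would follow the same pattern with the path to your shared library.

```python
import ctypes
import ctypes.util

# Locate and load the C standard library (assumes a Unix-like system).
libc = ctypes.CDLL(ctypes.util.find_library("c"))

# Declare the C signature: int abs(int)
libc.abs.argtypes = [ctypes.c_int]
libc.abs.restype = ctypes.c_int

result = libc.abs(-42)
print(result)
```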
Other Resources
The primary source for everything concerning Python is python.org.
A good source for solutions to common programming problems is stackoverflow.com.