Pandas is an amazing data analysis toolkit for Python. It is designed to operate on relational or labeled data and gives you tools to slice and dice as you please.
In this two-part tutorial, you’ll learn about the fundamental data structures of Pandas: the series and the data frame. You’ll also learn how to select data, deal with missing values, manipulate your data, merge your data, group your data, work with time series, and even plot data.
Installation
To install, just run pip install pandas. It will take care of installing NumPy too if you don’t have it installed.
Series
Pandas series are typed and labeled 1-D arrays. This means that each element can be accessed by its label in addition to its integer position.
Here is a series of integers where the labels are Roman numerals. You can index and slice using the labels or integer positions (the .iloc accessor makes positional access explicit). Unlike with regular Python list slicing, when using labels the last item is included!
>>> import numpy as np
>>> import pandas as pd
>>> s = pd.Series(np.arange(1, 6), ['I', 'II', 'III', 'IV', 'V'])
>>> s['III']
3
>>> s.iloc[0]
1
>>> s['II':'V']
II     2
III    3
IV     4
V      5
>>> s.iloc[1:5]
II     2
III    3
IV     4
V      5
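The difference between label slicing and positional slicing is easy to trip over, so here is a minimal sketch that puts the two side by side. It assumes the same Roman-numeral series with five values; the variable names by_label and by_position are only illustrative.

```python
import numpy as np
import pandas as pd

# Five values, five labels (the lengths must match).
s = pd.Series(np.arange(1, 6), index=['I', 'II', 'III', 'IV', 'V'])

# Label-based slice: both endpoints are included.
by_label = s.loc['II':'IV']

# Position-based slice: the stop position is excluded, as with a list.
by_position = s.iloc[1:3]

print(by_label.tolist())     # [2, 3, 4]
print(by_position.tolist())  # [2, 3]
```

Using .loc and .iloc explicitly avoids any ambiguity about whether a value is meant as a label or as a position.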
If you don’t provide an index then a 0-based integer index is automatically created:
>>> s = pd.Series((50, 7, 88, 9))
>>> s
0    50
1     7
2    88
3     9
Now, here is a little secret for you. Pandas series are a wrapper around NumPy’s arrays.
>>> s.values
array([50,  7, 88,  9])
>>> type(s.values)
<class 'numpy.ndarray'>
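One practical consequence of this design is that NumPy functions can usually be applied to a series directly. The sketch below is a small illustration of that, assuming the same four-element series; to_numpy() is used here as an alternative to the values attribute.

```python
import numpy as np
import pandas as pd

s = pd.Series((50, 7, 88, 9))

# The underlying data as a NumPy array.
arr = s.to_numpy()
print(type(arr).__name__)    # ndarray

# NumPy functions accept a series directly; the result is again a
# series that keeps the original index.
roots = np.sqrt(s)
print(type(roots).__name__)  # Series
print(list(roots.index))     # [0, 1, 2, 3]
```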
Unlike Python lists or NumPy arrays, operations on series align on the index. If the indexes don’t match, the result uses the union of the indexes, with NaN wherever a label is missing from either operand. Here are a few examples using dicts as data so the keys become the series index:
>>> s1 = pd.Series(dict(a=1, b=2, c=3))
>>> s2 = pd.Series(dict(a=4, b=5, c=6, d=7))
>>> s1 + s2
a    5.0
b    7.0
c    9.0
d    NaN
>>> s1[1:] * s2[:-1]
a     NaN
b    10.0
c    18.0
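If the NaN values produced by alignment are not what you want, the method form of the operator lets you supply a default. This is a minimal sketch using Series.add() with fill_value on the same two series; treating a missing value as 0 is an assumption made for the example, not something the data dictates.

```python
import pandas as pd

s1 = pd.Series(dict(a=1, b=2, c=3))
s2 = pd.Series(dict(a=4, b=5, c=6, d=7))

# Plain addition leaves NaN where a label exists on only one side.
plain = s1 + s2

# Series.add with fill_value treats the missing side as 0 instead.
filled = s1.add(s2, fill_value=0)

print(plain.isna().tolist())  # [False, False, False, True]
print(filled.to_dict())       # {'a': 5.0, 'b': 7.0, 'c': 9.0, 'd': 7.0}
```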
Data Frames
Data frames are the primary pandas data structure. They represent tables of data where each column is a series. Data frames have an index too, which serves as a row label. A data frame also has column labels. Here is how to declare a data frame using a dict.
>>> df = pd.DataFrame(dict(a=[1, 2, 3],
...                        b=[4, 5, 6],
...                        c=pd.Timestamp('20170902'),
...                        d=pd.Categorical(['red', 'green', 'blue'])))
>>> df
   a  b          c      d
0  1  4 2017-09-02    red
1  2  5 2017-09-02  green
2  3  6 2017-09-02   blue
Note that an integer index (row label) was created automatically. You can of course provide your own index:
>>> df.index = ('I II III'.split())
>>> df
     a  b          c      d
I    1  4 2017-09-02    red
II   2  5 2017-09-02  green
III  3  6 2017-09-02   blue
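The row labels can also be supplied when the data frame is created, rather than assigned afterwards. Here is a short sketch of that, using a reduced two-column version of the frame to keep it compact:

```python
import pandas as pd

df = pd.DataFrame(
    dict(a=[1, 2, 3], b=[4, 5, 6]),
    index=['I', 'II', 'III'],   # row labels supplied up front
)

print(list(df.index))    # ['I', 'II', 'III']
print(list(df.columns))  # ['a', 'b']

# Each column is itself a series that shares the frame's index.
col = df['a']
print(type(col).__name__)  # Series
print(list(col.index))     # ['I', 'II', 'III']
```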
Importing and Exporting Data
Data frames can be constructed from a very wide variety of sources:
- dict of 1-D ndarrays, lists, dicts, or series
- 2-D numpy.ndarray
- structured or record ndarray
- another DataFrame
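To make a few of these sources concrete, here is a minimal sketch that builds data frames from a dict of series, a 2-D NumPy array, and another DataFrame. All column names and values are made up for illustration.

```python
import numpy as np
import pandas as pd

# From a dict of series: the frame's index is the union of the
# series' indexes, with NaN where a series has no value.
from_series = pd.DataFrame({
    'x': pd.Series([1, 2], index=['a', 'b']),
    'y': pd.Series([3, 4], index=['b', 'c']),
})

# From a 2-D NumPy array, with column names given explicitly.
from_array = pd.DataFrame(np.zeros((2, 3)), columns=['p', 'q', 'r'])

# From another DataFrame (copy=True makes the new frame independent).
from_frame = pd.DataFrame(from_array, copy=True)

print(from_series.shape)  # (3, 2)
print(from_array.shape)   # (2, 3)
```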
You can also import or load data from many file formats and databases such as:
- CSV
- Excel
- HTML
- HDFStore
- SQL
Here is how to read a CSV file:
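data.csv
--------
I,1,4,2017-09-02,red
II,2,5,2017-09-02,green
III,3,6,2017-09-02,blue

>>> pd.read_csv('data.csv')
     I  1  4  2017-09-02    red
0   II  2  5  2017-09-02  green
1  III  3  6  2017-09-02   blue

Note that read_csv() treats the first line of the file as the header by default, which is why the first data row shows up as the column names here.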
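Since data.csv has no header row, you would normally tell read_csv() so. The sketch below shows one way to do that with header=None, names, index_col, and parse_dates. The CSV text is held in memory with io.StringIO so the example is self-contained, and the column names (including 'label') are hypothetical.

```python
import io
import pandas as pd

# Same content as data.csv, held in memory so the sketch is self-contained.
csv_text = (
    "I,1,4,2017-09-02,red\n"
    "II,2,5,2017-09-02,green\n"
    "III,3,6,2017-09-02,blue\n"
)

df = pd.read_csv(
    io.StringIO(csv_text),
    header=None,                          # the file has no header row
    names=['label', 'a', 'b', 'c', 'd'],  # hypothetical column names
    index_col='label',                    # use the first column as row labels
    parse_dates=['c'],                    # parse column c as datetimes
)

print(list(df.index))  # ['I', 'II', 'III']
print(df.shape)        # (3, 4)
```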
Here is the complete list of read_*() functions (the exact list depends on your pandas version):
>>> read_functions = [a for a in dir(pd) if a.startswith('read_')]
>>> print('\n'.join(read_functions))
read_clipboard
read_csv
read_excel
read_feather
read_fwf
read_gbq
read_hdf
read_html
read_json
read_msgpack
read_pickle
read_sas
read_sql
read_sql_query
read_sql_table
read_stata
read_table
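There are corresponding methods on the data frame object itself for exporting the data to many formats and databases. Here is how you export to JSON and msgpack (note that to_msgpack() was deprecated in pandas 0.25 and removed in 1.0, so the second call only works on older versions):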
>>> df.to_json()
'{"a":{"I":1,"II":2,"III":3},
"b":{"I":4,"II":5,"III":6},
"c":{"I":1504310400000,"II":1504310400000,"III":1504310400000},
"d":{"I":"red","II":"green","III":"blue"}}'
>>> df.to_msgpack()
b'\x84\xa3typ\xadblock_manager\xa5klass\xa9DataFrame\xa4axes
\x92\x86\xa3typ\xa5index\xa5klass\xa5Index\xa4name\xc0\xa5dtype
\xa6object\xa4data\x94\xa1a\xa1b\xa1c\xa1d\xa8compress\xc0\x86
\xa3typ\xa5index\xa5klass\xa5Index\xa4name\xc0\xa5dtype
\xa6object\xa4data\x93\xa1I\xa2II\xa3III\xa8compress\xc0
\xa6blocks\x93\x86\xa4locs\x86\xa3typ\xa7ndarray\xa5shape\x91
\x02\xa4ndim\x01\xa5dtype\xa5int64\xa4data\xd8\x00\x00\x00\x00
\x00\x00\x00\x00\x00\x01\x00\x00\x00\x00\x00\x00\x00\xa8compress
\xc0\xa6values\xc70\x00\x01\x00\x00\x00\x00\x00\x00\x00\x02\x00
\x00\x00\x00\x00\x00\x00\x03\x00\x00\x00\x00\x00\x00\x00\x04
\x00\x00\x00\x00\x00\x00\x00\x05\x00\x00\x00\x00\x00\x00\x00
\x06\x00\x00\x00\x00\x00\x00\x00\xa5shape\x92\x02\x03\xa5dtype
\xa5int64\xa5klass\xa8IntBlock\xa8compress\xc0\x86\xa4locs\x86
\xa3typ\xa7ndarray\xa5shape\x91\x01\xa4ndim\x01\xa5dtype
\xa5int64\xa4data\xd7\x00\x02\x00\x00\x00\x00\x00\x00\x00
\xa8compress\xc0\xa6values\xc7\x18\x00\x00\x00\xed\xafVb\xe0
\x14\x00\x00\xed\xafVb\xe0\x14\x00\x00\xed\xafVb\xe0\x14
\xa5shape\x92\x01\x03\xa5dtype\xaedatetime64[ns]\xa5klass
\xadDatetimeBlock\xa8compress\xc0\x86\xa4locs\x86\xa3typ
\xa7ndarray\xa5shape\x91\x01\xa4ndim\x01\xa5dtype\xa5int64
\xa4data\xd7\x00\x03\x00\x00\x00\x00\x00\x00\x00\xa8compress
\xc0\xa6values\x87\xa3typ\xa8category\xa5klass\xabCategorical
\xa4name\xc0\xa5codes\x86\xa3typ\xa7ndarray\xa5shape\x91\x03
\xa4ndim\x01\xa5dtype\xa4int8\xa4data\xc7\x03\x00\x02\x01\x00
\xa8compress\xc0\xaacategories\x86\xa3typ\xa5index\xa5klass
\xa5Index\xa4name\xc0\xa5dtype\xa6object\xa4data\x93\xa4blue
\xa5green\xa3red\xa8compress\xc0\xa7ordered\xc2\xa8compress
\xc0\xa5shape\x91\x03\xa5dtype\xa8category\xa5klass
\xb0CategoricalBlock\xa8compress\xc0'
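Exporting and importing are meant to be symmetric, so a round trip is a handy sanity check. As a rough sketch, the following writes a small frame to CSV text with to_csv() and reads it back with read_csv(). The reduced two-column frame is an assumption made to keep the example short.

```python
import io
import pandas as pd

df = pd.DataFrame(dict(a=[1, 2, 3], b=[4, 5, 6]), index=['I', 'II', 'III'])

# With no path argument, to_csv returns the CSV text as a string.
csv_text = df.to_csv()
print(csv_text.splitlines()[0])  # ,a,b

# Read it back, using the first column as the index.
restored = pd.read_csv(io.StringIO(csv_text), index_col=0)

print(list(restored.index))   # ['I', 'II', 'III']
print(restored['b'].tolist()) # [4, 5, 6]
```

The first header cell is empty because the index of the original frame has no name.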
Metadata and Stats
Pandas gives a lot of information about data frames. Check out these attributes and methods:
>>> df.index
Index(['I', 'II', 'III'], dtype='object')
>>> df.columns
Index(['a', 'b', 'c', 'd'], dtype='object')
>>> df.describe()
         a    b
count  3.0  3.0
mean   2.0  5.0
std    1.0  1.0
min    1.0  4.0
25%    1.5  4.5
50%    2.0  5.0
75%    2.5  5.5
max    3.0  6.0
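A few more inspection tools are worth knowing alongside index, columns, and describe(). The sketch below uses a small made-up frame with an integer, a float, and a string column to show shape, dtypes, and the include='all' option of describe().

```python
import pandas as pd

df = pd.DataFrame(dict(a=[1, 2, 3], b=[4.0, 5.5, 6.0], d=['red', 'green', 'blue']))

print(df.shape)   # (3, 3): rows, columns
print(df.dtypes)  # the type of each column

# For a frame with numeric columns, describe() summarises only those
# by default; include='all' also reports on the non-numeric ones.
numeric_only = df.describe()
everything = df.describe(include='all')

print(list(numeric_only.columns))  # ['a', 'b']
print(list(everything.columns))    # ['a', 'b', 'd']
```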
Selecting Data
Data frames let you select data. If you want to select a row by its index label, you need to use the loc attribute. To select columns, you simply use the column name. Here is how to select individual rows, individual columns, a slice of rows, a slice of columns, and last but not least, a rectangular section (subset of rows and subset of columns from these rows):
Single row
----------
>>> df.loc['II']
a                      2
b                      5
c    2017-09-02 00:00:00
d                  green

Multiple rows using integer index (no 'loc')
--------------------------------------------
>>> df[:2]
    a  b          c      d
I   1  4 2017-09-02    red
II  2  5 2017-09-02  green

Single column
-------------
>>> df['b']
I      4
II     5
III    6

Multiple columns
----------------
>>> df.loc[:, 'b':'c']
     b          c
I    4 2017-09-02
II   5 2017-09-02
III  6 2017-09-02

Rectangular section
-------------------
>>> df.loc[:'II', 'b':'c']
    b          c
I   4 2017-09-02
II  5 2017-09-02

Using integer index (when actual index is not integer)
------------------------------------------------------
>>> df.iloc[:2, 1:3]
    b          c
I   4 2017-09-02
II  5 2017-09-02
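Slices cover contiguous ranges, but loc also accepts lists of labels, and there are dedicated accessors for single cells. The following is a minimal sketch of both, assuming a trimmed version of the frame without the timestamp column:

```python
import pandas as pd

df = pd.DataFrame(
    dict(a=[1, 2, 3], b=[4, 5, 6], d=['red', 'green', 'blue']),
    index=['I', 'II', 'III'],
)

# A list of labels selects arbitrary (not necessarily adjacent)
# rows or columns.
picked = df.loc[['I', 'III'], ['a', 'd']]
print(picked.shape)  # (2, 2)

# .at and .iat fetch a single cell by label and by position respectively.
by_label = df.at['II', 'b']
by_position = df.iat[1, 1]
print(by_label, by_position)  # 5 5
```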
In addition to selecting by label or position, you can also select based on values. For example, you can select only rows with even values in column b:
>>> df[df.b % 2 == 0]
     a  b          c     d
I    1  4 2017-09-02   red
III  3  6 2017-09-02  blue
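Value-based selection works because the condition evaluates to a boolean series, which is then used as a mask. Here is a small sketch showing the mask itself, how to combine conditions, and membership tests with isin(). The frame is a trimmed version of the one above and the specific conditions are only examples.

```python
import pandas as pd

df = pd.DataFrame(
    dict(a=[1, 2, 3], b=[4, 5, 6], d=['red', 'green', 'blue']),
    index=['I', 'II', 'III'],
)

# The condition itself is a boolean series aligned with the frame's index.
mask = df['b'] % 2 == 0
print(mask.tolist())  # [True, False, True]

# Combine conditions with & (and), | (or), ~ (not); wrap each
# condition in parentheses.
both = df[(df['b'] % 2 == 0) & (df['a'] > 1)]
print(list(both.index))  # ['III']

# isin() tests membership in a collection of values.
selected = df[df['d'].isin(['red', 'green'])]
print(list(selected.index))  # ['I', 'II']
```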
Sorting Data
Pandas gives you sorting too. Let’s sort the following data frame by index (rows) and by column. Multiple-level indexing is supported too:
>>> index = ['one', 'two', 'three', 'four', 'five']
>>> df = pd.DataFrame(np.random.randn(5, 2), index=index, columns=['a', 'b'])

Sort by index (alphabetically and descending)
---------------------------------------------
>>> df.sort_index(ascending=False)
              a         b
two   -0.689523  1.411403
three  0.332707  0.307561
one   -0.042172  0.374922
four   0.426519 -0.425181
five  -0.161095 -0.849932

Sort by column
--------------
>>> df.sort_values(by='a')
              a         b
two   -0.689523  1.411403
five  -0.161095 -0.849932
one   -0.042172  0.374922
three  0.332707  0.307561
four   0.426519 -0.425181
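Sorting by several columns and sorting a multiple-level index follow the same pattern. The sketch below uses a small made-up frame with fixed values (rather than random ones) so the results are reproducible; the column names group, score, and value are hypothetical.

```python
import pandas as pd

df = pd.DataFrame(
    dict(group=['x', 'y', 'x', 'y'], score=[2, 1, 1, 2], value=[10, 20, 30, 40])
)

# Sort by several columns; ascending takes one flag per column.
by_cols = df.sort_values(by=['group', 'score'], ascending=[True, False])
print(by_cols['value'].tolist())  # [10, 30, 40, 20]

# Build a two-level index and sort on both levels in order.
multi = df.set_index(['group', 'score']).sort_index()
print(multi['value'].tolist())    # [30, 10, 20, 40]
```

Both sort_values() and sort_index() return a new data frame and leave the original untouched.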
Conclusion
In this part of the tutorial, we covered the basic data types of Pandas: the series and the data frame. We imported and exported data, selected subsets of data, worked with metadata, and sorted the data. In part two, we’ll continue our journey and deal with missing data, data manipulation, data merging, data grouping, time series, and plotting. Stay tuned.