Question

I have a pandas DataFrame with a string column. The frame has over 2 million rows, so looping to extract the elements I need is a poor choice. My current code looks like this:

for i in range(len(table["series_id"])):
    table["state_code"] = table["series_id"][i][2:4]
    table["area_code"] = table["series_id"][i][5:9]
    table["supersector_code"] = table["series_id"][i][11:12]

where "series_id" is a string containing multiple information fields that I want to split into separate columns. Here is an example of the data:

columns:

 [series_id, year, month, value, footnotes]

The data:

[['SMS01000000000000001' '2006' 'M01' 1966.5 '']
 ['SMS01000000000000001' '2006' 'M02' 1970.4 '']
 ['SMS01000000000000001' '2006' 'M03' 1976.6 '']

series_id is the column of interest that I am struggling with. I have looked at the str functions for Python, and specifically for pandas.

http://pandas.pydata.org/pandas-docs/stable/basics.html#testing-for-strings-that-match-or-contain-a-pattern

has a section describing each of the string functions; get and slice are the ones I would like to use. Ideally I could envision a solution like so:

table["state_code"] = table["series_id"].str.get(1:3)

or

table["state_code"] = table["series_id"].str.slice(1:3)

or

table["state_code"] = table["series_id"].str.slice([1:3])

When I try any of the calls above, I get an invalid syntax error for the ":", and I cannot figure out the proper way to perform a vectorized substring operation on a pandas DataFrame column.

Thank you

Solution

I think I would use str.extract with some regex (which you can tweak for your needs):

In [11]: s = pd.Series(["SMU78000009092000001"])

In [12]: s.str.extract(r'^.{2}(?P<state_code>.{3}).{1}(?P<area_code>\d{4}).{2}(?P<supersector_code>.{2})')
Out[12]: 
  state_code area_code supersector_code
0        U78      0000               92

This reads as: the string starts (^) with any two characters (which are ignored); the next three characters (any) are state_code; then comes one character (ignored); the next four digits are area_code; ...
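
If the fields always sit at fixed positions, plain vectorized slicing may be simpler than a regex. The attempts in the question fail because the `1:3` slice syntax is only valid inside square brackets; `str.slice` takes start and stop as separate arguments instead. The following is a minimal sketch, assuming the sample rows from the question and the offsets used in the question's loop (those offsets are the asker's and are not checked against the actual series_id layout):

```python
import pandas as pd

# Sample rows copied from the question; a real table would have millions of rows.
table = pd.DataFrame(
    {
        "series_id": ["SMS01000000000000001"] * 3,
        "year": ["2006", "2006", "2006"],
        "month": ["M01", "M02", "M03"],
        "value": [1966.5, 1970.4, 1976.6],
    }
)

# .str[start:stop] and .str.slice(start, stop) are equivalent and both
# operate on the whole column at once, so no Python-level loop is needed.
# Assumption: the offsets below are taken as-is from the loop in the question.
table["state_code"] = table["series_id"].str[2:4]
table["area_code"] = table["series_id"].str.slice(5, 9)
table["supersector_code"] = table["series_id"].str.slice(11, 12)

print(table[["series_id", "state_code", "area_code", "supersector_code"]])
```

Slicing keeps each field as a string, so leading zeros survive; convert with astype only if a numeric column is actually wanted.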

Licensed under: CC-BY-SA with attribution