Separate number from unit in a string in Python

https://stackoverflow.com/questions/2240303

19-09-2019
|

Question

I have strings containing numbers with their units, e.g. 2GB, 17ft, etc. I would like to separate the number from the unit and create 2 different strings. Sometimes, there is a whitespace between them (e.g. 2 GB) and it's easy to do it using split(' ').

When they are together (e.g. 2GB), I would test every character until I find a letter, instead of a number.

s='17GB'
number=''
unit=''
for c in s:
    if c.isdigit():
        number+=c
    else:
        unit+=c

Is there a better way to do it?

Thanks

Solution

s='17GB'
for i,c in enumerate(s):
    if not c.isdigit():
        break
number=int(s[:i])
unit=s[i:]

OTHER TIPS

You can break out of the loop when you find the first non-digit character

for i,c in enumerate(s):
    if not c.isdigit():
        break
number = s[:i]
unit = s[i:].lstrip()

If you have negative and decimals:

numeric = '0123456789-.'
for i,c in enumerate(s):
    if c not in numeric:
        break
number = s[:i]
unit = s[i:].lstrip()

You could use a regular expression to divide the string into groups:

>>> import re
>>> p = re.compile('(\d+)\s*(\w+)')
>>> p.match('2GB').groups()
('2', 'GB')
>>> p.match('17 ft').groups()
('17', 'ft')

tokenize can help:

>>> import StringIO
>>> s = StringIO.StringIO('27GB')
>>> for token in tokenize.generate_tokens(s.readline):
...   print token
... 
(2, '27', (1, 0), (1, 2), '27GB')
(1, 'GB', (1, 2), (1, 4), '27GB')
(0, '', (2, 0), (2, 0), '')

You should use regular expressions, grouping together what you want to find out:

import re
s = "17GB"
match = re.match(r"^([1-9][0-9]*)\s*(GB|MB|KB|B)$", s)
if match:
  print "Number: %d, unit: %s" % (int(match.group(1)), match.group(2))

Change the regex according to what you want to parse. If you're unfamiliar with regular expressions, here's a great tutorial site.

>>> s="17GB"
>>> ind=map(str.isalpha,s).index(True)
>>> num,suffix=s[:ind],s[ind:]
>>> print num+":"+suffix
17:GB

This uses an approach which should be a bit more forgiving than regexes. Note: this is not as performant as the other solutions posted.

def split_units(value):
    """
    >>> split_units("2GB")
    (2.0, 'GB')
    >>> split_units("17 ft")
    (17.0, 'ft')
    >>> split_units("   3.4e-27 frobnitzem ")
    (3.4e-27, 'frobnitzem')
    >>> split_units("9001")
    (9001.0, '')
    >>> split_units("spam sandwhiches")
    (0, 'spam sandwhiches')
    >>> split_units("")
    (0, '')
    """
    units = ""
    number = 0
    while value:
        try:
            number = float(value)
            break
        except ValueError:
            units = value[-1:] + units
            value = value[:-1]
    return number, units.strip()

How about using a regular expression

http://python.org/doc/1.6/lib/module-regsub.html

For this task, I would definitely use a regular expression:

import re
there = re.compile(r'\s*(\d+)\s*(\S+)')
thematch = there.match(s)
if thematch:
  number, unit = thematch.groups()
else:
  raise ValueError('String %r not in the expected format' % s)

In the RE pattern, \s means "whitespace", \d means "digit", \S means non-whitespace; * means "0 or more of the preceding", + means "1 or more of the preceding, and the parentheses enclose "capturing groups" which are then returned by the groups() call on the match-object. (thematch is None if the given string doesn't correspond to the pattern: optional whitespace, then one or more digits, then optional whitespace, then one or more non-whitespace characters).

A regular expression.

import re

m = re.match(r'\s*(?P<n>[-+]?[.0-9])\s*(?P<u>.*)', s)
if m is None:
  raise ValueError("not a number with units")
number = m.group("n")
unit = m.group("u")

This will give you a number (integer or fixed point; too hard to disambiguate scientific notation's "e" from a unit prefix) with an optional sign, followed by the units, with optional whitespace.

You can use re.compile() if you're going to be doing a lot of matches.

SCIENTIFIC NOTATION This regex is working well for me to parse numbers that may be in scientific notation, and is based on the recent python documentation about scanf: https://docs.python.org/3/library/re.html#simulating-scanf

units_pattern = re.compile("([-+]?(\d+(\.\d*)?|\.\d+)([eE][-+]?\d+)?|\s*[a-zA-Z]+\s*$)")
number_with_units = list(match.group(0) for match in units_pattern.finditer("+2.0e-1 mm"))
print(number_with_units)
>>>['+2.0e-1', ' mm']

n, u = number_with_units
print(float(n), u.strip())
>>>0.2 mm

try the regex pattern below. the first group (the scanf() tokens for a number any which way) is lifted directly from the python docs for the re module.

import re
SCANF_MEASUREMENT = re.compile(
    r'''(                      # group match like scanf() token %e, %E, %f, %g
    [-+]?                      # +/- or nothing for positive
    (\d+(\.\d*)?|\.\d+)        # match numbers: 1, 1., 1.1, .1
    ([eE][-+]?\d+)?            # scientific notation: e(+/-)2 (*10^2)
    )
    (\s*)                      # separator: white space or nothing
    (                          # unit of measure: like GB. also works for no units
    \S*)''',    re.VERBOSE)
'''
:var SCANF_MEASUREMENT:
    regular expression object that will match a measurement

    **measurement** is the value of a quantity of something. most complicated example::

        -666.6e-100 units
'''

def parse_measurement(value_sep_units):
    measurement = re.match(SCANF_MEASUREMENT, value_sep_units)
    try:
        value = float(measurement[0])
    except ValueError:
        print 'doesn't start with a number', value_sep_units
    units = measurement[5]

    return value, units

Licensed under: CC-BY-SA with attribution

Not affiliated with StackOverflow