Question

Using Python 3.3 on Win8. I would consider myself a novice at scripting. I am trying to clean up dates in an Excel spreadsheet that were entered without leading zeros. The month comes first, the day is in the middle, and the year is always the last 2 digits. I can extract the Excel column into a file of its own. Below are some examples of what I may run into, with thousands of lines to go through and convert into a recognizable format:

  • 1188 (mdyy)
  • 11188 (problem date)
  • 12188 (problem date)
  • 13188 (mddyy)
  • 21188 (mddyy)
  • 111188 (mmddyy)

I guess my question has 2 parts: (1) What type of file is easiest to modify with Python (e.g. XLSX, XLS, CSV, TXT)? (2) Any tips on coding the logic below in Python, such as functions to use?

Below is the logic I would like to apply. I know there is no way to tell which date is meant when only 5 digits are present and they start with "11" or "12", so I want to output ERROR for those and we can go back and fix them manually. The less manual labor, the better.

  • The year is always the last 2 digits, so it needs to be parsed out first, leaving the remaining digits
    • IF year digits from "00" to "30" THEN attach a leading "20" to form a 4 digit year
    • Else attach a leading "19" to form a 4 digit year
  • Count number of digits left after taking away year digits
    • IF total digits left = 2 THEN parse out first and second digits, AND add leading zero to both digits
    • ElseIF total digits left = 3 THEN
      • IF first two numbers are "11" or "12", print final results as "ERROR"
      • ElseIF first two numbers are "10" THEN parse out as is AND add leading zero to third digit
      • Else parse out first digit AND add leading zero THEN parse out remaining 2 digits as is
    • Else total digits left = 4, THEN do nothing
  • Make sure date is put back together in new format for final result
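
The steps above can be sketched roughly as follows. This is only a minimal sketch working on strings rather than integers; the names `fix_date` and `pivot` are made up for illustration, and the output format `mmddyyyy` is an assumption about what "new format" means here.

```python
def fix_date(raw, pivot=30):
    """Return the date as 'mmddyyyy', or 'ERROR' if it cannot be resolved.

    `fix_date` and `pivot` are illustrative names, not part of the post.
    """
    digits = raw.strip()
    if not digits.isdigit() or not 4 <= len(digits) <= 6:
        return "ERROR"

    # The last two digits are always the year.
    md, yy = digits[:-2], digits[-2:]
    year = ("20" if int(yy) <= pivot else "19") + yy

    if len(md) == 2:
        # One digit each for month and day: pad both.
        month, day = md[0].zfill(2), md[1].zfill(2)
    elif len(md) == 3:
        if md[:2] in ("11", "12"):
            # Could be m/dd or mm/d: leave for manual review.
            return "ERROR"
        if md[:2] == "10":
            # October with a single-digit day.
            month, day = md[:2], md[2].zfill(2)
        else:
            # Single-digit month followed by a two-digit day.
            month, day = md[0].zfill(2), md[1:]
    else:
        # Four digits left (mmdd): nothing to pad.
        month, day = md[:2], md[2:]

    return month + day + year


for sample in ("1188", "11188", "13188", "21188", "111188"):
    print(sample, "->", fix_date(sample))
```

Note that this follows the bullets literally, so "110yy" and "120yy" are also flagged as ERROR even though only one reading (January 10 or January 20) is a real date.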

Thanks so much for any help and a kick start to go off on my own!

MY JOURNEY

Initially I needed help getting my logic into Python, then battled the following but was successful in the end with time, research and helpful people at stackoverflow: reading/writing/appending CSV files, fill leading zeros, fill leading digits for year, syntax, incorrect data types, etc...THANKS TO ALL THAT HELPED!!!

FINALIZED CODE BELOW!!!!!!!

import csv
# Change to location of CSV file
with open('c:\\Users\\Weez\\Desktop\\csv_test.csv', newline='') as csvfile:
    csvreader = csv.reader(csvfile)
    for line in csvreader:
        baddate = line[0]
        year = int(baddate) % 100
        md = int(baddate) // 100
# Check year values
        if year < 10:
            year = str(200)+str(year)
        elif year <= 50:
            year = str(20)+str(year)
        else:
            year = str(19)+str(year)
# Check month and day values
        if md < 100:
            month = md // 10
            month = str(month).zfill(2)
            day = md % 10
            day = str(day).zfill(2)
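        elif md >= 1000:
            # 4 digits left (mmdd): split as is, otherwise month and day
            # are undefined or left over from the previous row
            month = md // 100
            day = str(md % 100).zfill(2)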
        elif md <= 109:
            month = md // 10
            day = md % 10
            day = str(day).zfill(2)
        elif md == 110:
            month = md // 100
            month = str(month).zfill(2)
            day = md % 100
        elif md == 120:
            month = md // 100
            month = str(month).zfill(2)
            day = md % 100
        elif md <= 129:
            month = str("XX")
            day = str("XX")
        else:
            month = md // 100
            month = str(month).zfill(2)
            day = md % 100
        dateresult = str(month)+str(day)+str(year)
        print(dateresult)
# modes 'a' = append, 'w' = write, 'r' = read and other modes
        with open('c:\\Users\\Weez\\Desktop\\csv_test_output.csv', 'a') as csvoutput:
            csvoutput.write(dateresult)
            csvoutput.write('\n')
print('\n')
print('\n')
str(input("Process complete!  Press Enter to finish!"))

Solution 2

Since the year is always two digits, you can eliminate that part of the problem right away.

year = date % 100
md = date // 100

Now you can eliminate the 2-digit and 4-digit cases:

if md < 100:
    month = md // 10  # integer division; plain / returns a float in Python 3
    day = md % 10
elif md >= 1000:
    month = md // 100
    day = md % 100

Now you're down to detecting the potential problem areas and resolving the ambiguity.

elif md <= 109:
    month = 10
    day = md % 10
elif md == 110:
    month = 1
    day = 10
elif md <= 129:
    month = None # ambiguous
    day = None
else:
    month = md // 100  # integer division; plain / returns a float in Python 3
    day = md % 100

You'll need to do some additional checking to make sure the month and day are within bounds.
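
One possible way to do that bounds check is to let `datetime.date` validate the values for you. This is a minimal sketch; `is_valid_date` is a made-up helper name, and it assumes the year has already been expanded to 4 digits.

```python
from datetime import date


def is_valid_date(year, month, day):
    """Return True if year/month/day form a real calendar date.

    Hypothetical helper; expects a 4-digit year and treats None
    (the "ambiguous" marker used above) as invalid.
    """
    if month is None or day is None:
        return False
    try:
        date(year, month, day)  # raises ValueError when out of range
    except ValueError:
        return False
    return True


print(is_valid_date(1988, 2, 29))  # leap year
print(is_valid_date(1987, 2, 29))  # not a leap year
```

This also catches things a simple range check would miss, such as February 30.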

Other suggestions

For #1, you can use the csv module, though I don't have any experience with other modules :(.
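
A minimal sketch of the csv route, assuming the date column has been saved as the first column of a CSV file. The function name `read_first_column` and the file name in the usage comment are made up for illustration.

```python
import csv


def read_first_column(path):
    """Return the first cell of every non-empty row in a CSV file."""
    # newline='' is what the csv docs recommend when opening the file
    with open(path, newline="") as handle:
        return [row[0] for row in csv.reader(handle) if row]


# Hypothetical usage:
# raw_dates = read_first_column("csv_test.csv")
```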

For #2, you can use the built-in datetime module:

>>> from datetime import datetime

>>> date_unpadded_month = '1188'
>>> date_padded_month = '01188'
>>> date_2_digit_month = '11188'
>>> date_format = '%m%d%y'

>>> parsed = datetime.strptime(date_unpadded_month, date_format)
>>> parsed
datetime.datetime(1988, 1, 1, 0, 0)

>>> parsed = datetime.strptime(date_padded_month, date_format)
>>> parsed
datetime.datetime(1988, 1, 1, 0, 0)

>>> parsed = datetime.strptime(date_2_digit_month, date_format)
>>> parsed
datetime.datetime(1988, 11, 1, 0, 0)
>>> parsed.month
11
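
Two things to keep in mind with this approach. First, `strptime` quietly picks the two-digit-month reading for '11188' (November 1) and gives no hint that January 11 was also possible. Second, `%y` uses its own cutoff: 00 to 68 become 20xx and 69 to 99 become 19xx. The sketch below writes the result back out as `mmddyyyy` and re-applies a different cutoff; `reformat` and `pivot` are made-up names, and the cutoff of 30 is taken from the question.

```python
from datetime import datetime


def reformat(date_str, pivot=30):
    """Parse an mdy string with strptime and return it as 'mmddyyyy'.

    Hypothetical helper: does NOT detect ambiguous inputs, it simply
    returns whichever interpretation strptime chooses.
    """
    parsed = datetime.strptime(date_str, "%m%d%y")
    # Override strptime's 68/69 cutoff with the caller's pivot.
    yy = parsed.year % 100
    century = 2000 if yy <= pivot else 1900
    return parsed.replace(year=century + yy).strftime("%m%d%Y")


print(reformat("1188"))
print(reformat("123140"))
```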

If the dates in the spreadsheet are in order, you may be able to go back and resolve ambiguous dates with a high success rate. For instance, if you have

123087, 11188, 22288

The first and last dates are non-ambiguous (Dec-30-'87 & Feb-22-'88), and the middle date is either Jan-11-'88 or Nov-1-'88, but can be solved if you know that the three dates are in order.

Edit: here's some code for achieving this:

from datetime import datetime

data = '123087', '1188', '11188', '22288', '11188' # some 4, 5 and 6 digit dates
fmt = '%m%d%y'
results = []
# parse possible dates from data
for date_str in data:
    alt_date_str = ('0' + date_str)[-6:]
    dates = (datetime.strptime(d, fmt) for d in (date_str, alt_date_str))
    results.append(set(dates)) # make sure dates are unique

# iterate forwards, removing anything older than the earlier entries
# (>= keeps repeated dates; assumes the rows really are in date order)
oldest = datetime.min
for i in range(len(results)):
    results[i] = [d for d in results[i] if d >= oldest]
    oldest = min(results[i])

# iterate backwards, removing anything newer than the later entries
newest = datetime.max
for i in reversed(range(len(results))):
    results[i] = [d for d in results[i] if d <= newest]
    newest = max(results[i])

# show dates, error if still ambiguous (Python 3 print function)
for dates in results:
    if len(dates) > 1:
        print('ERROR:', dates)
    else:
        print(dates[0])
Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow