My input is a string:
text = '''10 February 2023
abc
def
23 March 2023
ghi
jkl'''
I wrote the code below (using regex, but I'm open to any other alternatives):
import re
import pandas as pd

data = []
for m in re.finditer(r'(\d+ \w+ \w+)\n(.*)', text, flags=re.I|re.S):
    data.append([m.group(1), m.group(2).splitlines()])
df = pd.DataFrame(data, columns=['date', 'letters']).explode('letters')
My code gives me this strange result:
               date        letters
0  10 February 2023            abc
0  10 February 2023            def
0  10 February 2023  23 March 2023
0  10 February 2023            ghi
0  10 February 2023            jkl
when I was expecting this:
               date letters
0  10 February 2023     abc
0  10 February 2023     def
1     23 March 2023     ghi
1     23 March 2023     jkl
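How can I fix my code? Also, do you have any other suggestions? I'd love to learn from them.
Answer
A regex-free option that takes advantage of pd.to_datetime to recognize the date lines: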
df = pd.DataFrame({'letters': text.splitlines()})
m = pd.to_datetime(df['letters'], errors='coerce').notna()
out = df.assign(date=df['letters'].where(m).ffill()
).loc[~m, ::-1].reset_index(drop=True)
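To make the masking idea easier to follow, here is a minimal sketch that prints the intermediate steps. The names is_date, dates and steps are introduced only for illustration, and the explicit format='%d %B %Y' is an assumption (dates always written like "10 February 2023") that avoids relying on format inference:

```python
import pandas as pd

text = '''10 February 2023
abc
def
23 March 2023
ghi
jkl'''

s = pd.Series(text.splitlines())

# Assumption: every date line looks like "10 February 2023".
# Lines that do not parse become NaT, so notna() flags the date lines.
is_date = pd.to_datetime(s, format='%d %B %Y', errors='coerce').notna()

# Keep only the date lines (everything else becomes NaN), then carry
# each date forward over the non-date lines that follow it.
dates = s.where(is_date).ffill()

# Side-by-side view of the original line, the mask and the filled date.
steps = pd.DataFrame({'line': s, 'is_date': is_date, 'date': dates})
print(steps)
```

Dropping the rows where is_date is True then leaves one row per letter line, each paired with the most recent date above it.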
Alternative syntax:
s = pd.Series(text.splitlines())
m = pd.to_datetime(s, errors='coerce').notna()
df = pd.DataFrame({'date': s.where(m).ffill(), 'letters': s}
)[~m].reset_index(drop=True)
Output:
               date letters
0  10 February 2023     abc
1  10 February 2023     def
2     23 March 2023     ghi
3     23 March 2023     jkl
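As for the regex in the question: with re.S the dot also matches newlines, so the greedy (.*) swallows the rest of the text after the first date and finditer yields a single match. One possible fix, sketched under the assumption that every date line has the shape "day month-name four-digit-year", is to make the group lazy and stop it with a lookahead at the next date line or at the end of the string. The date_pat name is hypothetical, and re.I is dropped because the pattern contains no literal letters:

```python
import re
import pandas as pd

text = '''10 February 2023
abc
def
23 March 2023
ghi
jkl'''

# Assumed shape of a date line: "10 February 2023".
date_pat = r'\d+ \w+ \d{4}'

data = []
# (.*?) is lazy; the lookahead ends each block just before the next
# date line, or at the end of the string for the last block.
for m in re.finditer(rf'({date_pat})\n(.*?)(?=\n{date_pat}|\Z)', text, flags=re.S):
    data.append([m.group(1), m.group(2).splitlines()])

df = pd.DataFrame(data, columns=['date', 'letters']).explode('letters')
print(df)
```

Because explode keeps the original row label for every exploded element, the index repeats per date (0, 0, 1, 1), which matches the expected output in the question; add .reset_index(drop=True) for a running index instead.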