My input is a string:

text = '''10 February 2023
abc
def
23 March 2023
ghi
jkl'''

I wrote the code below (using a regex, but I'm open to any other alternative):

import re
import pandas as pd

data = []

for m in re.finditer(r'(\d+ \w+ \w+)\n(.*)', text, flags=re.I|re.S):
    data.append([m.group(1), m.group(2).splitlines()])
    
df = pd.DataFrame(data, columns=['date', 'letters']).explode('letters')

My code gives me this strange result:

               date        letters
0  10 February 2023            abc
0  10 February 2023            def
0  10 February 2023  23 March 2023
0  10 February 2023            ghi
0  10 February 2023            jkl

Whereas I was expecting this:

               date        letters
0  10 February 2023            abc
0  10 February 2023            def
1     23 March 2023            ghi
1     23 March 2023            jkl

How can I fix my code? Also, do you have any other suggestions? I'd love to learn from them.

Answer

A regex-free option that takes advantage of pd.to_datetime to identify the date lines:

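# one row per line of the input
df = pd.DataFrame({'letters': text.splitlines()})
# True where the line parses as a date, False otherwise
m = pd.to_datetime(df['letters'], errors='coerce').notna()

# keep only the date lines, forward-fill them, then drop the date rows themselves
out = df.assign(date=df['letters'].where(m).ffill()
                ).loc[~m, ::-1].reset_index(drop=True)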
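
To see what each step contributes, the intermediate values can be inspected one at a time. This is only an illustrative sketch: the names parsed, is_date and dates are mine, and text is the sample string from the question.

```python
import pandas as pd

text = '''10 February 2023
abc
def
23 March 2023
ghi
jkl'''

s = pd.Series(text.splitlines())

# Lines that cannot be parsed as a date become NaT instead of raising
parsed = pd.to_datetime(s, errors='coerce')

# Boolean mask: True on the date lines, False on the letter lines
is_date = parsed.notna()

# Blank out the non-date lines, then carry each date down to the rows below it
dates = s.where(is_date).ffill()

print(is_date.tolist())
print(dates.tolist())
```

The mask is used twice: once to build the date column, and once, negated, to drop the date rows themselves.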

Alternative syntax:

s = pd.Series(text.splitlines())
m = pd.to_datetime(s, errors='coerce').notna()

# build both columns from the same Series, then filter out the date rows
df = pd.DataFrame({'date': s.where(m).ffill(), 'letters': s}
                  )[~m].reset_index(drop=True)

Output:

               date letters
0  10 February 2023     abc
1  10 February 2023     def
2     23 March 2023     ghi
3     23 March 2023     jkl
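
As for the original regex: with re.S, the dot also matches newlines, so the greedy (.*) swallows everything after the first date line, including the second date. That is why only one match is found. A minimal sketch of one possible repair follows, assuming every date line has the form day, month name, four-digit year. The tightened date pattern and the lookahead are my assumptions, not something taken from the question.

```python
import re
import pandas as pd

text = '''10 February 2023
abc
def
23 March 2023
ghi
jkl'''

# Assumed shape of a date line: digits, a word, a four-digit year
date = r'\d+ \w+ \d{4}'

# Capture a date line, then lazily capture what follows, stopping
# just before the next date line or at the end of the string
pattern = rf'({date})\n(.*?)(?=\n{date}|\Z)'

data = [[m.group(1), m.group(2).splitlines()]
        for m in re.finditer(pattern, text, flags=re.S)]

df = pd.DataFrame(data, columns=['date', 'letters']).explode('letters')
print(df)
```

This keeps the repeated index (0, 0, 1, 1) shown in the expected output of the question. Append .reset_index(drop=True) if a running index is preferred.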