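I have a dataframe with 3 columns of classes and 5 rows of students. Some of the students are repeated. I want to list the most common student names across all classes in descending order, along with the number of times each appears and the classes they appear in.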
df = pd.DataFrame({
'biology': ['ryan', 'sarah', 'tom', 'ed', 'jackson'],
'statistics': ['sarah', 'ed', 'jacob', 'ryan', 'de'],
'ecology': ['austin', 'ryan', 'tom', 'sam', 'sarah']
})
   biology statistics ecology
0     ryan      sarah  austin
1    sarah         ed    ryan
2      tom      jacob     tom
3       ed       ryan     sam
4  jackson         de   sarah
I would like the output to look like this:
ryan, 3 classes, (biology, statistics, ecology)
sarah, 3 classes, (biology, statistics, ecology)
tom, 2 classes, (biology, ecology)
ed, 2 classes, (biology, statistics)
jackson, 1 class, (biology)
jacob, 1 class, (statistics)
de, 1 class, (statistics)
austin, 1 class, (ecology)
...and so on
Any help would be appreciated. I'm a beginner and have been at this for hours. My brain is fried. Thanks!
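Answer
We can melt the dataframe to get it into long form, then groupby aggregate with named aggregation to get the number of classes and the class names for each student. Finally, we can sort_values so the students with the highest frequency come first: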
output_df = (
df.melt(var_name='class name', value_name='student name')
.groupby('student name', as_index=False)
.agg(class_count=('class name', 'count'),
classes=('class name', tuple))
.sort_values('class_count', ascending=False, ignore_index=True)
)
output_df:
  student name  class_count                         classes
0         ryan            3  (biology, statistics, ecology)
1        sarah            3  (biology, statistics, ecology)
2           ed            2           (biology, statistics)
3          tom            2              (biology, ecology)
4       austin            1                      (ecology,)
5           de            1                   (statistics,)
6      jackson            1                      (biology,)
7        jacob            1                   (statistics,)
8          sam            1                      (ecology,)
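If the chained version is hard to follow, the same steps can be run one at a time. This is only a sketch of the approach above: the intermediate names long_df and summary are made up for illustration, and kind='stable' is an optional addition that keeps tied students in the order groupby produced them.

```python
import pandas as pd

df = pd.DataFrame({
    'biology': ['ryan', 'sarah', 'tom', 'ed', 'jackson'],
    'statistics': ['sarah', 'ed', 'jacob', 'ryan', 'de'],
    'ecology': ['austin', 'ryan', 'tom', 'sam', 'sarah']
})

# Step 1: wide -> long. Each row becomes one (class name, student name) pair,
# so the 3 x 5 frame turns into 15 rows and 2 columns.
long_df = df.melt(var_name='class name', value_name='student name')
print(long_df.head())

# Step 2: one row per student, with the number of rows (classes) and the
# class names collected into a tuple.
summary = long_df.groupby('student name', as_index=False).agg(
    class_count=('class name', 'count'),
    classes=('class name', tuple),
)

# Step 3: most frequent students first. kind='stable' (an assumption, not in
# the original answer) keeps ties in the order groupby returned them.
summary = summary.sort_values(
    'class_count', ascending=False, kind='stable', ignore_index=True
)
print(summary)
```

Note that 'count' counts rows, so a student listed twice in the same class would be counted twice; 'nunique' may be a better fit if that can happen in the real data.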
We can then conditionally append "class" or "classes" to class_count and write the result out with to_csv:
# Conditionally Add Classes/Class
output_df['class_count'] = output_df['class_count'].astype(str) + np.where(
output_df['class_count'].eq(1),
' class',
' classes'
)
# Write to CSV
output_df.to_csv('output.csv', index=False, header=False)
output.csv:
ryan,3 classes,"('biology', 'statistics', 'ecology')"
sarah,3 classes,"('biology', 'statistics', 'ecology')"
ed,2 classes,"('biology', 'statistics')"
tom,2 classes,"('biology', 'ecology')"
austin,1 class,"('ecology',)"
de,1 class,"('statistics',)"
jackson,1 class,"('biology',)"
jacob,1 class,"('statistics',)"
sam,1 class,"('ecology',)"
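The CSV above stores each tuple as a quoted string, which differs slightly from the format asked for in the question. As a possible alternative, the lines can be built by hand with str.join. This is a sketch, not part of the original answer: the names summary and lines are illustrative, and groupby(sort=False) together with kind='stable' is an assumption used to keep tied students in order of first appearance, which appears to be the order in the question.

```python
import pandas as pd

df = pd.DataFrame({
    'biology': ['ryan', 'sarah', 'tom', 'ed', 'jackson'],
    'statistics': ['sarah', 'ed', 'jacob', 'ryan', 'de'],
    'ecology': ['austin', 'ryan', 'tom', 'sam', 'sarah']
})

summary = (
    df.melt(var_name='class name', value_name='student name')
    # sort=False keeps students in order of first appearance
    .groupby('student name', as_index=False, sort=False)
    .agg(class_count=('class name', 'count'),
         classes=('class name', tuple))
    # stable sort so ties keep that first-appearance order
    .sort_values('class_count', ascending=False, kind='stable',
                 ignore_index=True)
)

# Build one text line per student; join the class names so no tuple
# quoting or trailing comma ends up in the output.
lines = [
    f"{name}, {count} {'class' if count == 1 else 'classes'}, ({', '.join(classes)})"
    for name, count, classes in summary.itertuples(index=False, name=None)
]

print('\n'.join(lines))
```

The resulting lines list can be written to a text file with an ordinary open(...).write(...) call if a file is needed instead of printed output.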
Setup and imports:
import numpy as np
import pandas as pd
df = pd.DataFrame({
'biology': ['ryan', 'sarah', 'tom', 'ed', 'jackson'],
'statistics': ['sarah', 'ed', 'jacob', 'ryan', 'de'],
'ecology': ['austin', 'ryan', 'tom', 'sam', 'sarah']
})