-
Notifications
You must be signed in to change notification settings - Fork 22
/
Copy pathfake-data-frame.html
186 lines (152 loc) · 7.01 KB
/
fake-data-frame.html
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
<meta name="viewport" content="width=device-width, initial-scale=1.0">
<head>
<title>leontrolski - FakeDataFrame</title>
<style>
body {margin: 5% auto; background: #fff7f7; color: #444444; font-family: -apple-system, BlinkMacSystemFont, "Segoe UI", Roboto, Helvetica, Arial, sans-serif; font-size: 16px; line-height: 1.8; max-width: 63%;}
@media screen and (max-width: 800px) {body {font-size: 14px; line-height: 1.4; max-width: 90%;}}
pre {width: 100%; border-top: 3px solid gray; border-bottom: 3px solid gray;}
a {border-bottom: 1px solid #444444; color: #444444; text-decoration: none;text-shadow: 0 1px 0 #ffffff;}
a:hover {border-bottom: 0;}
</style>
<link href="https://unpkg.com/[email protected]/themes/prism-vs.css" rel="stylesheet" />
<script src="https://cdnjs.cloudflare.com/ajax/libs/prism/1.20.0/components/prism-core.min.js"></script>
<script src="https://cdnjs.cloudflare.com/ajax/libs/prism/1.20.0/plugins/autoloader/prism-autoloader.min.js"></script>
</head>
<body class="language-python">
<a href="index.html"><img style="height:2em" src="pic.png"/>⇦</a>
<p><i>2019-11-13</i></p>
<h1>How do pandas DataFrames work? (kinda)</h1>
<p>
When you're used to plain ol' <code>dict</code>s, <code>int</code>s, <code>list</code>s etc, <code>pandas.DataFrame</code>s exhibit some weirdo behaviour, particulary concerning assignment and operators. This page is a short walk-through of how some of these things happen (and a quick intro to Python's magic methods), you can see the outcome <a href="https://github.com/leontrolski/fake-data-frame">here</a>.
</p>
<p>
<em>Disclaimer:</em> the things presented here are not <em>entirely</em> as the <code>pandas</code> <code>DataFrame</code>s work, they are more intended as a guide to how they do.
</p>
<p>
The below examples use python type hints to help keep things a bit clearer, at the <a href="https://github.com/leontrolski/fake-data-frame/blob/master/fake_data_frame.py">top</a>, we have:
</p>
<pre><code>from typing import Any, Dict, List</code></pre>
<p>
<code>DataFrame</code>s are a collection of <code>Series</code> (AKA columns), let's start with a really dumb <code>FakeSeries</code>.
</p>
<pre><code>class FakeSeries:
def __init__(self, name: str, data: Dict[int, Any]):
self.name = name
self.data = data
def __repr__(self) -> str:
return f'<FakeSeries: {self.name} {self.data}>'</code></pre>
<pre><code>>>> my_series = FakeSeries("some_column_name", {0: 5, 1: 7, 2: 9})
<FakeSeries: some_column_name {0: 5, 1: 7, 2: 9}></code></pre>
<ul>
<li>Note how the <code>__repr__</code> method is used by <code>print()</code></li>
<li>There a list of all the magic methods you can override on a class <a href="https://docs.python.org/3/reference/datamodel.html#basic-customization">here</a></li>
<li>Note how we are storing the series as a map of indices (0, 1, 2) to values (5, 7, 9)</li>
</ul>
<p>
Now we will define our <code>FakeDataFrame</code>, it similarly has a useful <code>__init__</code> and <code>__repr__</code> (although this is only fully fleshed out in the <a href="https://github.com/leontrolski/fake-data-frame/blob/master/fake_data_frame.py">original</a>). On initialisation, it sets <code>self.series_map</code> which is a map of series names to series.
</p>
<pre><code>class FakeDataFrame:
def __init__(self, d: Dict[str, List[Any]]):
self.series_map = {
k: FakeSeries(k, {i: v for i, v in enumerate(l)})
for k, l in d.items()
}
self.length = len(list(d.values())[0])
def __repr__(self):
width = 5
...
return '\n'.join((headers, divider) + rows) + '\n'</code></pre>
<p>
Already, we can see the beginnings of a <code>pandas</code>-like <code>DataFrame</code> interface.
</p>
<pre><code>>>> df = FakeDataFrame({
'a': [4, 5, 6],
'b': [7, 8, 9],
})
a | b
-------------
4 | 7
5 | 8
6 | 9</code></pre>
<p>
Now the clever stuff begins, lets add two methods to <code>FakeDataFrame</code> so that we can retreive and set its <code>Series</code>.
</p>
<pre><code> # handle []
def __getitem__(self, key: str) -> FakeSeries:
return self.series_map[key]
# handle [] =
def __setitem__(self, key: str, value: FakeSeries) -> None:
if key not in self.series_map:
self.series_map[key] = FakeSeries(key, {})
for i, v in value.data.items():
self[key].data[i] = v</code></pre>
<p>
Let's retreive a series.
</p>
<pre><code>>>> df['b']
<FakeSeries: b {0: 7, 1: 8, 2: 9}></code></pre>
<p>
And let's set one.
</p>
<pre><code>>>> df['b'] = FakeSeries("not_b", {1: 'foo', 2: 'bar'})
>>> df
a | b
-------------
4 | 7
5 | foo
6 | bar</code></pre>
<p>
Note how that the name of the series didn't need to align with "b", and that we were able to assign to series <code>b</code> at only indices 1 and 2.
</p>
<p>
Now to add some more smarts to our <code>FakeSeries</code>.
</p>
<pre><code> # handle *
def __mul__(self, other: int) -> 'FakeSeries':
return FakeSeries(
self.name,
{i: v * other for i, v in self.data.items()},
)
# handle >;
def __gt__(self, other: int) -> 'FakeSeries':
return FakeSeries(
self.name,
{i: v > other for i, v in self.data.items()},
)
# handle []
def __getitem__(self, key: 'FakeSeries') -> 'FakeSeries':
return FakeSeries(
self.name,
{i: v for i, v in self.data.items() if key.data.get(i, False)},
)</code></pre>
<ul>
<li><code>__mul__</code> takes an integer and returns a new <code>FakeSeries</code> with each of the values multiplied by it</li>
<li><code>__gt__</code> takes an integer and returns a new <code>FakeSeries</code> where each of the values is greater than it</li>
<li><code>__getitem__</code> takes another <code>FakeSeries</code> called <code>key</code> and returns a new <code>FakeSeries</code> with each of the values that had an index value contained in <code>key</code>'s index</li>
</ul>
<p>
We can now do some super <code>pandas</code>-y stuff, let's remind ourselves of the DataFrame we're working with.
</p>
<pre><code> a | b
-------------
4 | 7
5 | 8
6 | 9</code></pre>
<pre><code>>>> df['b'] > 7
<FakeSeries: b {0: False, 1: True, 2: True}></code></pre>
<pre><code>>>> df['a'][df['b'] > 7]
<FakeSeries: a {1: 5, 2: 6}></code></pre>
<p>
And to put it all together.
</p>
<pre><code>>>> df['mult'] = df['a'][df['b'] > 7] * 2
>>> df
a | b | mult
---------------------
4 | 7 | NaN
5 | 8 | 10
6 | 9 | 12</code></pre>
<p>
Pretty cool huh!
</p>
</body>