# TFRecorder

TFRecorder makes it easy to create [TFRecords](https://www.tensorflow.org/tutorials/load_data/tfrecord) from [Pandas DataFrames](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.html) or CSV files. TFRecorder reads data, transforms it using [TensorFlow Transform](https://www.tensorflow.org/tfx/transform/get_started), stores it in the TFRecord format using [Apache Beam](https://beam.apache.org/) and optionally [Google Cloud Dataflow](https://cloud.google.com/dataflow). Most importantly, TFRecorder does this without requiring the user to write an Apache Beam pipeline or TensorFlow Transform code.

TFRecorder can convert any Pandas DataFrame or CSV file into TFRecords. If your data includes images, TFRecorder can also serialize those into TFRecords. By default, TFRecorder expects your DataFrame or CSV file to be in the same ['Image CSV'](https://cloud.google.com/vision/automl/docs/prepare) format that Google Cloud Platform's AutoML Vision product uses; however, you can also specify an input data schema using TFRecorder's flexible schema system.

Using the TFRecord storage format is important for optimal machine learning pipelines and getting the most from your hardware (in the cloud or on-prem). The TFRecorder project started inside [Google Cloud AI Services](https://cloud.google.com/consulting) when we realized we were writing TFRecord conversion code over and over again.

When to use TFRecords:
* Your model is input bound (reading data is impacting training time).
* Anytime you want to use tf.data.Dataset.
* When your dataset can't fit into memory.

## Installation

Specify the project, region, and path to the tfrecorder wheel for remote execution.

*Cloud Dataflow Requirements*
* The output_dir must be a Google Cloud Storage location.
* The image files specified in an image_uri column must be located in Google Cloud Storage.
* If running from your local machine, you must be [authenticated to use Google Cloud](https://cloud.google.com/docs/authentication/getting-started).

```python
import pandas as pd
import tfrecorder

df = pd.read_csv(...)
df.tensorflow.to_tfr(
    output_dir='gs://my/bucket',
    runner='DataflowRunner',
    project='my-project',
    region='us-central1',
    tfrecorder_wheel='/path/to/my/tfrecorder.whl')
```
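
The tfrecorder_wheel argument points to a locally built wheel file. As a sketch (assuming the project uses a standard setuptools setup.py; check the repository for the exact build steps), the wheel can be built from the repo root with:

```bash
# Builds tfrecorder's wheel into ./dist/, assuming a standard setup.py.
python setup.py bdist_wheel
```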

Using Python interpreter:

```python
import tfrecorder

tfrecorder.create_tfrecords(
    source='/path/to/data.csv',
    output_dir='gs://my/bucket')
```

Using the command line:

```bash
tfrecorder create-tfrecords \
    --output_dir=gs://my/bucket
```

#### From an image directory

```python
import tfrecorder

tfrecorder.create_tfrecords(
    source='/path/to/image_dir',
    output_dir='gs://my/bucket'
)
```

The image directory should have the following general structure:

```
image_dir/
  <dataset split>/
    <label>/
      <image file>
```

Example:

```
images/
  TRAIN/
    cat/
      cat001.jpg
    dog/
      dog001.jpg
  VALIDATION/
    cat/
      cat002.jpg
    dog/
      dog002.jpg
  ...
```

### Verifying data in TFRecords generated by TFRecorder

The default 'Image CSV' format looks like a Pandas DataFrame or CSV formatted as:

| split | image_uri | label |
|-------|-----------|-------|
where:
* `split` can take on the values TRAIN, VALIDATION, and TEST.
* `image_uri` specifies a local or Google Cloud Storage location for the image file.
* `label` can be either a text-based label that will be integerized, or an integer.
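
For example, a minimal DataFrame in this format could be constructed like this (the image paths are hypothetical):

```python
import pandas as pd

# A small, hypothetical example of the default 'Image CSV' format.
df = pd.DataFrame({
    'split': ['TRAIN', 'VALIDATION'],
    'image_uri': ['gs://my/bucket/cat001.jpg', 'gs://my/bucket/cat002.jpg'],
    'label': ['cat', 'cat'],
})
```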

## Flexible Schema

TFRecorder's flexible schema system allows you to use any schema you want for your input data. To support an arbitrary input schema, provide a schema map to TFRecorder. A TFRecorder schema_map creates a mapping between your DataFrame column names and their types in the resulting TFRecord.

### Creating and using a schema map

A schema map is a Python dictionary that maps DataFrame column names to [supported TFRecorder types](#supported-types).

For example, the default image CSV input can be defined like this:

```python
from tfrecorder import schema

image_csv_schema = {
    'split': schema.split_key,
    'image_uri': schema.image_uri,
    'label': schema.string_label
}
```

Once created, a schema_map can be passed to TFRecorder:

```python
import pandas as pd
from tfrecorder import schema
import tfrecorder

df = pd.read_csv(...)
df.tensorflow.to_tfr(
    output_dir='gs://my/bucket',
    schema_map=schema.image_csv_schema,
    runner='DataflowRunner',
    project='my-project',
    region='us-central1')
```

### Supported types

TFRecorder's schema system supports several types, all listed below. You can use these types by referencing them in the schema map. Each type informs TFRecorder how to treat your DataFrame columns. For example, the schema mapping `my_split_key: schema.SplitKeyType` tells TFRecorder to treat the column `my_split_key` as type `schema.SplitKeyType` and create dataset splits based on its contents.

#### schema.ImageUriType
* Specifies the path to an image. When specified, TFRecorder will load the image and store it as a [base64 encoded](https://docs.python.org/3/library/base64.html) [tf.string](https://www.tensorflow.org/tutorials/load_data/unicode) under the key 'image', along with the image height, width, and number of channels as integers under the keys 'image_height', 'image_width', and 'image_channels'.
* A schema can contain only one ImageUriType.
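
As an illustration (this is not TFRecorder's own API), records written from an image column could plausibly be parsed back with a feature spec built from the keys described above; verify the keys against your actual output:

```python
import tensorflow as tf

# Hypothetical feature spec using the keys described above.
features = {
    'image': tf.io.FixedLenFeature([], tf.string),
    'image_height': tf.io.FixedLenFeature([], tf.int64),
    'image_width': tf.io.FixedLenFeature([], tf.int64),
    'image_channels': tf.io.FixedLenFeature([], tf.int64),
}

def parse_fn(serialized_example):
    # Parses one serialized tf.train.Example into a dict of tensors.
    return tf.io.parse_single_example(serialized_example, features)
```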

#### schema.SplitKeyType
* A split key is required for TFRecorder at this time.
* Only one split key is allowed.
* Specifies the key that TFRecorder will use to partition the input dataset.
* Allowed values are 'TRAIN', 'VALIDATION', and 'TEST'.

Note: If you do not want your data to be partitioned, include a split_key and set all rows to TRAIN.
#### schema.IntegerInputType
* Specifies an integer input.
* Will be scaled to mean 0, variance 1.

#### schema.FloatInputType
* Specifies a float input.
* Will be scaled to mean 0, variance 1.
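
This scaling is a standard z-score transform; in TensorFlow Transform terms it is equivalent to something like the following sketch (not TFRecorder's internal code):

```python
import tensorflow_transform as tft

def preprocessing_fn(inputs):
    # z-score scaling, (x - mean) / std, as applied to numeric input types.
    return {'x_scaled': tft.scale_to_z_score(inputs['x'])}
```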

#### schema.CategoricalInputType
* Specifies a string input.
* Vocabulary computed and output integerized.

#### schema.IntegerLabelType
* Specifies an integer target.
* Not transformed.

#### schema.StringLabelType
* Specifies a string target.
* Vocabulary computed and output integerized.

### Flexible Schema Example

Imagine that you have a dataset that you would like to convert to TFRecords that looks like this:

| split | x    | y  | label |
|-------|------|----|-------|
| TRAIN | 0.32 | 42 | 1     |

First, define a schema map:

```python
from tfrecorder import schema

schema_map = {
    'split': schema.SplitKeyType,
    'x': schema.FloatInputType,
    'y': schema.IntegerInputType,
    'label': schema.IntegerLabelType
}
```

Now call TFRecorder with the specified schema_map:

```python
import pandas as pd
import tfrecorder

df = pd.read_csv(...)
df.tensorflow.to_tfr(
    output_dir='gs://my/bucket',
    schema_map=schema_map,
    runner='DataflowRunner',
    project='my-project',
    region='us-central1')
```

After calling TFRecorder's to_tfr() function, TFRecorder will create an Apache Beam pipeline, either locally or, in this case, using Google Cloud's Dataflow runner. This Beam pipeline will use the schema map to identify the types you've associated with each data column and will process your data using [TensorFlow Transform](https://www.tensorflow.org/tfx/transform/get_started) and TFRecorder's image processing functions to convert the data into TFRecords.
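
To sanity-check the output, you can read the generated files back with a plain tf.data pipeline. A minimal sketch, assuming gzip-compressed TFRecord files in the output directory (adjust the file pattern and compression type to match what you actually find there):

```python
import tensorflow as tf

# Hypothetical file pattern; list your output directory for the real names.
files = tf.data.Dataset.list_files('gs://my/bucket/*.tfrecord.gz')
dataset = tf.data.TFRecordDataset(files, compression_type='GZIP')

for raw_record in dataset.take(1):
    # Each record is a serialized tf.train.Example proto.
    example = tf.train.Example()
    example.ParseFromString(raw_record.numpy())
    print(example)
```
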
## Contributing

Pull requests are welcome. Please see our [code of conduct](docs/code-of-conduct.md) and [contributing guide](docs/contributing.md).

## Why TFRecorder?

In our work at [Google Cloud AI Services](https://cloud.google.com/consulting) we wanted to help our users spend more of their time writing AI/ML applications, and less time converting data.