This repository was archived by the owner on Jul 31, 2023. It is now read-only.

Commit 8bd63b7

Authored by cfezequiel, dependabot[bot], and mbernico
Release/1.0 (#33)
* Bump tensorflow from 2.3.0 to 2.3.1. Bumps [tensorflow](https://github.com/tensorflow/tensorflow) from 2.3.0 to 2.3.1.
  - [Release notes](https://github.com/tensorflow/tensorflow/releases)
  - [Changelog](https://github.com/tensorflow/tensorflow/blob/master/RELEASE.md)
  - [Commits](tensorflow/tensorflow@v2.3.0...v2.3.1)
  Signed-off-by: dependabot[bot] <support@github.com>
* Flexible schema (#20)
  * Beginning of flexible schema support.
  * Fixed linting issue.
  * Added support for flexible schema in start of pipeline.
  * Daily code check-in.
  * Flexible schema functional.
  * Updates for newer pylint.
  * Updated namedTuple for Python 3.6 support.
  * Updated documentation.
  * Addressed the comments made by Kim and Carlos.
  * Addressed linting issues.
  * Daily check-in; resolving code review issues.
  * Addressed code review issues.
  * Addressed all comments in code review.
  * Addressed code review comments.
  * Fixed linting error in schema.py.
  * Addressed nits.
* Add image directory-to-dataframe parser.
* Bump version to 1.0 for release.
* Update TF to 2.3.1 in requirements.txt

Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
Co-authored-by: Mike Bernico <mikebernico@google.com>
1 parent 6fa13e4 · commit 8bd63b7

18 files changed: +778 −230 lines

README.md

+208-16
@@ -1,22 +1,43 @@
 # TFRecorder
 
-TFRecorder makes it easy to create TFRecords from images and labels in
-Pandas DataFrames or CSV files.
-Today, TFRecorder supports data stored in 'image csv format' similar to
-GCP AutoML Vision.
-In the future TFRecorder will support converting any Pandas DataFrame or CSV
-file into TFRecords.
+TFRecorder makes it easy to create [TFRecords](https://www.tensorflow.org/tutorials/load_data/tfrecord) from [Pandas DataFrames](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.html) or CSV files. TFRecorder reads data, transforms it using [TensorFlow Transform](https://www.tensorflow.org/tfx/transform/get_started), and stores it in the TFRecord format using [Apache Beam](https://beam.apache.org/) and, optionally, [Google Cloud Dataflow](https://cloud.google.com/dataflow). Most importantly, TFRecorder does this without requiring the user to write an Apache Beam pipeline or TensorFlow Transform code.
+
+TFRecorder can convert any Pandas DataFrame or CSV file into TFRecords. If your data includes images, TFRecorder can also serialize those into TFRecords. By default, TFRecorder expects your DataFrame or CSV file to be in the same ['Image CSV'](https://cloud.google.com/vision/automl/docs/prepare) format that Google Cloud Platform's AutoML Vision product uses; however, you can also specify an input data schema using TFRecorder's flexible schema system.
 
 !['TFRecorder CI/CD Badge'](https://github.com/google/tensorflow-recorder/workflows/TFRecord%20CICD/badge.svg)
 
+[Release Notes](RELEASE.md)
+
+## Why TFRecorder?
+Using the TFRecord storage format is important for optimal machine learning pipelines and getting the most from your hardware (in the cloud or on premises). The TFRecorder project started inside [Google Cloud AI Services](https://cloud.google.com/consulting) when we realized we were writing TFRecord conversion code over and over again; we wanted to help our users spend their time writing AI/ML applications and less time converting data.
+
+When to use TFRecords:
+* Your model is input bound (reading data is impacting training time).
+* You want to use tf.data.Dataset.
+* Your dataset can't fit into memory.
+
 ## Installation
 
-From the top directory of the repo, run the following command:
+### Install from GitHub
+
+1. Clone this repo.
 
 ```bash
-pip install tfrecorder
+git clone https://github.com/google/tensorflow-recorder.git
+```
+
+2. From the top directory of the repo, run the following command:
+
+```bash
+python setup.py install
 ```
 
+### Install from PyPI
+
+```bash
+pip install tfrecorder
+```
+
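As a quick sanity check that either install worked, the following minimal snippet only confirms that the package imports and exposes the entry point used throughout this README:

```python
# Minimal install check: imports the package and prints the
# signature/docstring of its public entry point.
import tfrecorder

help(tfrecorder.create_tfrecords)
```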
 ## Example usage
 
 ### Generating TFRecords
### Generating TFRecords
@@ -30,7 +51,7 @@ import pandas as pd
 import tfrecorder
 
 df = pd.read_csv(...)
-df.tensorflow.to_tfr(output_dir='gs://my/bucket')
+df.tensorflow.to_tfr(output_dir='/my/output/path')
 ```
 
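To make the local path concrete, here is a hedged end-to-end sketch; the image paths are hypothetical and the columns follow the default 'Image CSV' schema described later in this README:

```python
import pandas as pd
import tfrecorder

# A tiny DataFrame in the default 'Image CSV' layout; these image
# paths are placeholders and must point at real files.
df = pd.DataFrame({
    'split': ['TRAIN', 'VALIDATION'],
    'image_uri': ['images/cat001.jpg', 'images/dog001.jpg'],
    'label': ['cat', 'dog'],
})

# With no runner argument, the default DirectRunner executes the
# Beam pipeline locally and writes TFRecords under output_dir.
df.tensorflow.to_tfr(output_dir='/tmp/tfrecorder_output')
```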
 ##### Running on Cloud Dataflow
@@ -51,14 +72,19 @@ To build from source/git:
 Step 2:
 Specify the project, region, and path to the tfrecorder wheel for remote execution.
 
+*Cloud Dataflow requirements*
+* The output_dir must be a Google Cloud Storage location.
+* The image files specified in an image_uri column must be located in Google Cloud Storage.
+* If running from your local machine, you must be [authenticated to use Google Cloud](https://cloud.google.com/docs/authentication/getting-started).
+
 ```python
 import pandas as pd
 import tfrecorder
 
 df = pd.read_csv(...)
 df.tensorflow.to_tfr(
     output_dir='gs://my/bucket',
-    runner='DataFlowRunner',
+    runner='DataflowRunner',
     project='my-project',
     region='us-central1',
     tfrecorder_wheel='/path/to/my/tfrecorder.whl')
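The accessor also accepts a dataflow_options dictionary (see the accessor.py diff below). A sketch of how it might be used follows; max_num_workers is a standard Dataflow pipeline option, but its use here is an illustration rather than a tested configuration:

```python
import pandas as pd
import tfrecorder

df = pd.read_csv(...)

# dataflow_options forwards extra settings to Dataflow; the key
# shown is a standard Beam/Dataflow pipeline option.
df.tensorflow.to_tfr(
    output_dir='gs://my/bucket',
    runner='DataflowRunner',
    project='my-project',
    region='us-central1',
    tfrecorder_wheel='/path/to/my/tfrecorder.whl',
    dataflow_options={'max_num_workers': 10})
```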
@@ -72,7 +98,7 @@ Using Python interpreter:
 import tfrecorder
 
 tfrecorder.create_tfrecords(
-    input_data='/path/to/data.csv',
+    source='/path/to/data.csv',
     output_dir='gs://my/bucket')
 ```
 

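Because the accessor diff below shows client.create_tfrecords() being called with a DataFrame directly, an in-memory DataFrame should also work as the source here; this is an inference from the code, not a documented guarantee:

```python
import pandas as pd
import tfrecorder

# Passing a DataFrame rather than a CSV path (inferred from
# accessor.py, which forwards self._df to the same client call).
df = pd.read_csv('/path/to/data.csv')
tfrecorder.create_tfrecords(
    source=df,
    output_dir='gs://my/bucket')
```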
@@ -83,6 +109,42 @@ tfrecorder create-tfrecords \
     --output_dir=gs://my/bucket
 ```
 
+#### From an image directory
+
+```python
+import tfrecorder
+
+tfrecorder.create_tfrecords(
+    source='/path/to/image_dir',
+    output_dir='gs://my/bucket')
+```
+
+The image directory should have the following general structure:
+
+```
+image_dir/
+  <dataset split>/
+    <label>/
+      <image file>
+```
+
+Example:
+
+```
+images/
+  TRAIN/
+    cat/
+      cat001.jpg
+    dog/
+      dog001.jpg
+  VALIDATION/
+    cat/
+      cat002.jpg
+    dog/
+      dog002.jpg
+  ...
+```
+
 ### Verifying data in TFRecords generated by TFRecorder
 
 Using Python interpreter:
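The Python verification snippet itself is unchanged by this commit and elided from the diff. For completeness, here is a hedged sketch of peeking at generated records with plain TensorFlow; the shard naming pattern and compression setting are assumptions:

```python
import tensorflow as tf

# Inspect raw records independently of TFRecorder's own checker.
# Pass compression_type='GZIP' if the records were written with
# compression='gzip'.
files = tf.io.gfile.glob('/tmp/tfrecorder_output/*.tfrecord*')
dataset = tf.data.TFRecordDataset(files)
for raw_record in dataset.take(1):
    example = tf.train.Example.FromString(raw_record.numpy())
    print(list(example.features.feature.keys()))
```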
@@ -107,10 +169,10 @@ tfrecorder check-tfrecords \
     --output_dir=/tmp/output
 ```
 
-## Input format
+## Default Schema
 
-TFRecorder currently expects data to be in the same format as
-[AutoML Vision](https://cloud.google.com/vision/automl/docs/prepare).
+If you don't specify an input schema, TFRecorder expects data to be in the same format as
+[AutoML Vision input](https://cloud.google.com/vision/automl/docs/prepare).
 This format looks like a Pandas DataFrame or CSV formatted as:
 
 | split | image_uri | label |
@@ -119,9 +181,139 @@ This format looks like a Pandas DataFrame or CSV formatted as:
 
 where:
 * `split` can take on the values TRAIN, VALIDATION, and TEST
-* `image_uri` specifies a local or google cloud storage location for the image file.
+* `image_uri` specifies a local or Google Cloud Storage location for the image file.
 * `label` can be either a text-based label that will be integerized, or an integer
 
+## Flexible Schema
+
+TFRecorder's flexible schema system allows you to use any schema you want for your input data: provide a schema map, and TFRecorder will map your DataFrame column names to their types in the resulting TFRecord.
+
+### Creating and using a schema map
+A schema map is a Python dictionary that maps DataFrame column names to [supported TFRecorder types](#Supported-types).
+
+For example, the default image CSV input can be defined like this:
+
+```python
+from tfrecorder import schema
+
+image_csv_schema = {
+    'split': schema.split_key,
+    'image_uri': schema.image_uri,
+    'label': schema.string_label
+}
+```
+Once created, a schema map can be passed to TFRecorder:
+
+```python
+import pandas as pd
+from tfrecorder import schema
+import tfrecorder
+
+df = pd.read_csv(...)
+df.tensorflow.to_tfr(
+    output_dir='gs://my/bucket',
+    schema_map=schema.image_csv_schema,
+    runner='DataflowRunner',
+    project='my-project',
+    region='us-central1')
+```
+
+### Supported types
+TFRecorder's schema system supports the types listed below. Use them by referencing them in the schema map; each type tells TFRecorder how to treat a DataFrame column. For example, the schema mapping `my_split_key: schema.SplitKeyType` tells TFRecorder to treat the column `my_split_key` as type `schema.SplitKeyType` and create dataset splits based on its contents. A small illustration of the split-key note follows this list.
+
+#### schema.ImageUriType
+* Specifies the path to an image. When specified, TFRecorder will load the specified image and store it as a [base64 encoded](https://docs.python.org/3/library/base64.html) [tf.string](https://www.tensorflow.org/tutorials/load_data/unicode) under the key 'image', along with the height, width, and number of channels as integers under the keys 'image_height', 'image_width', and 'image_channels'.
+* A schema can contain only one ImageUriType.
+
+#### schema.SplitKeyType
+* A split key is required by TFRecorder at this time.
+* Only one split key is allowed.
+* Specifies the key TFRecorder will use to partition the input dataset.
+* Allowed values are 'TRAIN', 'VALIDATION', and 'TEST'.
+
+Note: If you do not want your data to be partitioned, include a split key anyway and set all rows to TRAIN.
+
+#### schema.IntegerInputType
+* Specifies an integer input.
+* Will be scaled to mean 0, variance 1.
+
+#### schema.FloatInputType
+* Specifies a float input.
+* Will be scaled to mean 0, variance 1.
+
+#### schema.CategoricalInputType
+* Specifies a string input.
+* Vocabulary computed and output integerized.
+
+#### schema.IntegerLabelType
+* Specifies an integer target.
+* Not transformed.
+
+#### schema.StringLabelType
+* Specifies a string target.
+* Vocabulary computed and *output integerized.*
+
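As the split-key note above suggests, a dataset with no natural partition can satisfy the requirement like this (a sketch; column names are hypothetical):

```python
import pandas as pd
from tfrecorder import schema

df = pd.read_csv('/path/to/data.csv')

# No natural partition: add a constant split column so the
# required SplitKeyType column exists, per the note above.
df['split'] = 'TRAIN'

schema_map = {
    'split': schema.SplitKeyType,
    'text_feature': schema.CategoricalInputType,  # illustrative column
    'label': schema.IntegerLabelType,
}
```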
+### Flexible Schema Example
+
+Imagine that you have a dataset you would like to convert to TFRecords that looks like this:
+
+| split | x    | y  | label |
+|-------|------|----|-------|
+| TRAIN | 0.32 | 42 | 1     |
+
+First, define a schema map:
+
+```python
+from tfrecorder import schema
+
+schema_map = {
+    'split': schema.SplitKeyType,
+    'x': schema.FloatInputType,
+    'y': schema.IntegerInputType,
+    'label': schema.IntegerLabelType
+}
+```
+
+Now call TFRecorder with the specified schema_map:
+
+```python
+import pandas as pd
+import tfrecorder
+
+df = pd.read_csv(...)
+df.tensorflow.to_tfr(
+    output_dir='gs://my/bucket',
+    schema_map=schema_map,
+    runner='DataflowRunner',
+    project='my-project',
+    region='us-central1')
+```
+
+After calling TFRecorder's to_tfr() function, TFRecorder creates an Apache Beam pipeline, either locally or, in this case, using Google Cloud's Dataflow runner. The pipeline uses the schema map to identify the type you've associated with each data column, then processes your data with [TensorFlow Transform](https://www.tensorflow.org/tfx/transform/get_started) and TFRecorder's image processing functions to convert it into TFRecords.
+
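To make the result concrete, here is a hedged sketch of parsing one of the resulting records for this example; the feature names and dtypes are inferred from the schema map above, not from documented output:

```python
import tensorflow as tf

# Assumed feature spec: scaled inputs come back as floats, and the
# IntegerLabelType passes through as int64. This mirrors the schema
# map above rather than a documented contract.
feature_spec = {
    'x': tf.io.FixedLenFeature([], tf.float32),
    'y': tf.io.FixedLenFeature([], tf.float32),
    'label': tf.io.FixedLenFeature([], tf.int64),
}

files = tf.io.gfile.glob('gs://my/bucket/*.tfrecord*')
dataset = tf.data.TFRecordDataset(files).map(
    lambda record: tf.io.parse_single_example(record, feature_spec))
```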
 ## Contributing
 
-Pull requests are welcome.
+Pull requests are welcome. Please see our [code of conduct](docs/code-of-conduct.md) and [contributing guide](docs/contributing.md).

RELEASE.md

+12
@@ -0,0 +1,12 @@
+# Release 0.2
+
+## Breaking Changes
+* None known
+
+## Known Caveats
+* Python 3.8 is not compatible due to upstream requirements.
+
+## Major Features and Improvements
+* Added flexible schemas: TFRecorder now works with any input schema, while the 'Image CSV' schema remains the default.
+
+## Bug Fixes and Other Changes

requirements.txt

+2-2
@@ -9,6 +9,6 @@ numpy < 1.19.0
 pylint >= 2.5.3
 fire >= 0.3.1
 jupyter >= 1.0.0
-tensorflow >= 2.3.0
-pyarrow >= 0.17
+tensorflow >= 2.3.1
+pyarrow <0.18,>=0.17
 frozendict >= 1.2

setup.py

+5-2
@@ -20,6 +20,9 @@
 from setuptools import setup
 
 
+# Semantic versioning (PEP 440)
+VERSION = '1.0'
+
 REQUIRED_PACKAGES = [
     "apache-beam[gcp] >= 2.22.0",
     "avro >= 1.10.0",
@@ -35,14 +38,14 @@
     "pylint >= 2.5.3",
     "pytz >= 2020.1",
     "python-dateutil",
-    "tensorflow == 2.3.0",
+    "tensorflow == 2.3.1",
     "tensorflow_transform >= 0.22",
 ]
 
 
 setup(
     name='tfrecorder',
-    version='0.1.2',
+    version=VERSION,
     install_requires=REQUIRED_PACKAGES,
     packages=find_packages(),
     include_package_data=True,

tfrecorder/accessor.py

+10-5
@@ -27,6 +27,7 @@
 
 from tfrecorder import client
 from tfrecorder import constants
+from tfrecorder import schema
 
 
 @pd.api.extensions.register_dataframe_accessor('tensorflow')
@@ -40,6 +41,7 @@ def __init__(self, pandas_obj):
   def to_tfr(
       self,
       output_dir: str,
+      schema_map: Dict[str, schema.SchemaMap] = schema.image_csv_schema,
       runner: str = 'DirectRunner',
       project: Optional[str] = None,
      region: Optional[str] = None,
@@ -63,14 +65,16 @@ def to_tfr(
           num_shards=10)
 
     Args:
+      schema_map: A dict mapping column names to supported types.
       output_dir: Local directory or GCS location to save TFRecords to.
-      runner: Beam runner. Can be DirectRunner or DataFlowRunner.
-      project: GCP project name (Required if DataFlowRunner).
-      region: GCP region name (Required if DataFlowRunner).
-      tfrecorder_wheel: Path to the tfrecorder wheel DataFlow will run.
+        Note: GCS is required for DataflowRunner.
+      runner: Beam runner. Can be DirectRunner or DataflowRunner.
+      project: GCP project name (Required if DataflowRunner).
+      region: GCP region name (Required if DataflowRunner).
+      tfrecorder_wheel: Path to the tfrecorder wheel Dataflow will run.
         (create with 'python setup.py sdist' or
         'pip download tfrecorder --no-deps')
-      dataflow_options: Optional dictionary containing DataFlow options.
+      dataflow_options: Optional dictionary containing Dataflow options.
       job_label: User supplied description for the beam job name.
       compression: Can be 'gzip' or None for no compression.
       num_shards: Number of shards to divide the TFRecords into. Default is
@@ -85,6 +89,7 @@ def to_tfr(
     r = client.create_tfrecords(
         self._df,
         output_dir=output_dir,
+        schema_map=schema_map,
         runner=runner,
         project=project,
         region=region,

tfrecorder/beam_image.py

+5-1
@@ -60,7 +60,11 @@ def decode(b64_bytes, width, height, channels) -> Image:
 
 
 def load(image_uri):
-  """Loads an image."""
+  """Loads an image using Pillow.
+
+  Supported formats:
+  https://pillow.readthedocs.io/en/5.1.x/handbook/image-file-formats.html
+  """
 
   try:
     with tf.io.gfile.GFile(image_uri, 'rb') as f:
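The pattern the new docstring describes (read bytes through tf.io.gfile so local and gs:// paths both work, then decode with Pillow) looks roughly like this simplified sketch, which is not the module's exact code:

```python
import io

import tensorflow as tf
from PIL import Image


def load_image(image_uri: str) -> Image.Image:
    """Sketch of the load pattern in beam_image.load() above."""
    # tf.io.gfile handles local paths and gs:// URIs alike; Pillow
    # then decodes any of its supported formats.
    with tf.io.gfile.GFile(image_uri, 'rb') as f:
        return Image.open(io.BytesIO(f.read()))
```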
