When to Use It
- Set up new data tables in BigQuery datasets
- Create tables with specific schemas for data ingestion
- Build partitioned tables for performance optimization
- Generate table structures from sample data automatically
- Initialize database schemas as part of data pipeline setup
- Create empty tables that will be populated by other workflows
Inputs
Field | Type | Required | Description |
---|---|---|---|
Project | Select | Yes | Select the Google BigQuery project containing the dataset |
Dataset | Select | Yes | Select the dataset to create the table in |
Table ID | Text | Yes | Unique identifier for the table within the dataset |
Schema Mode | Select | Yes | How to define the table schema (Manual/Auto/Empty) |
Schema Definition | JSON | Conditional | Table schema when using Manual mode |
Sample Data | JSON | Conditional | Sample data for Auto schema generation |
Partition Field | Text | No | Field name for table partitioning (date/timestamp fields) |
Skip Error If Already There | Toggle | No | If enabled, won’t fail if table already exists (default: false) |
Schema Mode Options
Mode | Description | When to Use |
---|---|---|
Manual | Define exact schema with fields and types | Precise control over table structure |
Auto | Generate schema from sample data | Quick setup with representative data |
Empty | Create table without schema | Schema will be added later via inserts |
Manual Schema Format
When using Manual mode, provide the schema as a JSON array:
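A minimal sketch of what a Manual-mode schema array could look like, using BigQuery's standard field definition keys (name, type, mode); the field names here are illustrative, and the exact keys the node accepts may differ slightly:

```json
[
  {"name": "user_id", "type": "STRING", "mode": "REQUIRED"},
  {"name": "signup_date", "type": "DATE", "mode": "NULLABLE"},
  {"name": "lifetime_value", "type": "FLOAT", "mode": "NULLABLE"},
  {"name": "tags", "type": "STRING", "mode": "REPEATED"}
]
```

Each entry pairs a field name with one of the supported types listed below and an optional mode (REQUIRED, NULLABLE, or REPEATED).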
Sample Data Format (Auto Mode)
Provide representative data for automatic schema detection:
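A hedged example of sample data for Auto mode; whether the node expects a single record or an array of records is an assumption, so treat the shape as illustrative:

```json
[
  {
    "order_id": "ORD-1001",
    "customer_email": "jane@example.com",
    "amount": 49.99,
    "is_refunded": false,
    "created_at": "2024-10-17T10:30:00Z"
  }
]
```

From data like this, Auto mode would typically infer STRING, FLOAT, BOOLEAN, and TIMESTAMP columns.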
Supported Data Types
BigQuery Type | Description | Example Values |
---|---|---|
STRING | Text data | "John Doe", "Marketing" |
INTEGER | Whole numbers | 123, -456 |
FLOAT | Decimal numbers | 123.45, -67.89 |
BOOLEAN | True/false values | true, false |
TIMESTAMP | Date and time | "2024-10-17T10:30:00Z" |
DATE | Date only | "2024-10-17" |
TIME | Time only | "10:30:00" |
DATETIME | Date and time (no timezone) | "2024-10-17 10:30:00" |
JSON | JSON objects | {"key": "value"} |
ARRAY | Arrays of values | ["item1", "item2"] |
RECORD | Nested structures | Complex objects |
Field Modes
Mode | Description | Usage |
---|---|---|
REQUIRED | Field must have a value | Primary keys, essential data |
NULLABLE | Field can be empty | Optional information |
REPEATED | Field can have multiple values | Arrays, lists |
Output
Returns table creation confirmation and details:
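A hedged sketch of what the returned object might look like; all values are illustrative and the exact formatting may differ:

```json
{
  "table_id": "orders",
  "dataset_id": "analytics",
  "project_id": "my-gcp-project",
  "schema_mode": "Manual",
  "schema": [
    {"name": "order_id", "type": "STRING", "mode": "REQUIRED"},
    {"name": "created_at", "type": "TIMESTAMP", "mode": "NULLABLE"}
  ],
  "partition_field": "created_at",
  "creation_time": "2024-10-17T10:31:02Z",
  "exists_ok_used": false
}
```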
Output Fields:
Field | Description |
---|---|
table_id | The created table identifier |
dataset_id | Dataset containing the table |
project_id | BigQuery project |
schema_mode | How schema was defined |
schema | Final table schema |
partition_field | Partitioning field (if used) |
creation_time | When table was created |
exists_ok_used | Whether table already existed |
Credit Cost
- Cost per run: 1 credit
FAQs
When should I use Manual vs Auto vs Empty schema modes?
Manual Schema Mode:
Best for:
- Production tables with strict requirements
- Tables with specific data types or constraints
- When you know the exact schema upfront
- Tables that need optimal performance
Advantages:
- Complete control over field types
- Can specify required vs nullable fields
- Optimal storage and query performance
- Clear documentation of data structure
Example use cases:
- User profiles with known fields
- Financial data requiring precision
- API response tables with fixed structure
Auto Schema Mode:
Best for:
- Rapid prototyping and development
- Exploring new data sources
- When sample data represents the full structure
- Quick table setup for testing
Advantages:
- Fast setup with representative data
- Automatically detects appropriate types
- Good for iterative development
- Handles complex nested structures
Example use cases:
- Importing CSV or JSON files
- Processing API responses
- Data exploration and analysis
Empty Schema Mode:
Best for:
- Tables that will be populated later
- Dynamic schemas determined at runtime
- When the exact structure is unknown
- Streaming data with varying structures
Advantages:
- Maximum flexibility
- Schema evolves with data
- Works with dynamic data sources
- Good for unstructured data
Example use cases:
- Event logging tables
- Raw data ingestion
- Machine learning feature stores
How do I design effective table schemas for BigQuery?
Schema Design Best Practices:
Choose Appropriate Data Types: use the narrowest type that fits the data (for example, INTEGER rather than STRING for numeric values).
Optimize for Query Patterns:
- Frequently filtered fields: Consider partitioning
- Join keys: Use consistent types across tables
- Aggregated fields: Use appropriate numeric types
- Text search: Use STRING for full-text search
Use JSON or RECORD fields for:
- Related data that’s queried together
- Semi-structured API responses
- Configuration objects
- Metadata that varies by record
Keep as dedicated top-level fields:
- Frequently filtered fields
- Fields used in JOIN operations
- Simple aggregation targets
- Performance-critical queries
Field Naming Conventions (see the sketch after this list):
- Use snake_case: `user_id`, not `userId`
- Be descriptive: `click_timestamp`, not `ts`
- Consistent prefixes: `is_active`, `has_premium`
- Avoid reserved words: don’t use `order`, `group`, etc.
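A short, illustrative schema that follows these conventions (snake_case names, boolean prefixes, and a nested RECORD for related metadata); all names are hypothetical:

```json
[
  {"name": "user_id", "type": "STRING", "mode": "REQUIRED"},
  {"name": "is_active", "type": "BOOLEAN", "mode": "NULLABLE"},
  {"name": "click_timestamp", "type": "TIMESTAMP", "mode": "NULLABLE"},
  {
    "name": "device_metadata",
    "type": "RECORD",
    "mode": "NULLABLE",
    "fields": [
      {"name": "os", "type": "STRING", "mode": "NULLABLE"},
      {"name": "browser", "type": "STRING", "mode": "NULLABLE"}
    ]
  }
]
```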
What are the benefits of table partitioning and when should I use it?
Table Partitioning Benefits:
Performance Improvements:
- Faster queries: Only scan relevant partitions
- Lower costs: Pay only for data queried
- Better parallelization: Process partitions simultaneously
- Improved maintenance: Delete/update specific time ranges
- Reduced scanning: Eliminate irrelevant data from queries
- Storage efficiency: Automatic data organization
- Query slot optimization: Better resource utilization
Large Tables with Time Filters:
- Log data: Application logs, web analytics
- Transaction data: Sales, payments, user actions
- IoT data: Sensor readings, device telemetry
- Social media: Posts, interactions, messages
Hourly partitioning suits high-volume data, while monthly partitioning suits long-term historical data.
Choosing a Partition Field (a sketch follows this list):
- DATE or TIMESTAMP: Most common partition types
- Top-level field: Cannot be nested in a RECORD
- Frequently filtered: Used in WHERE clauses
- Immutable: Value doesn’t change after insert
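For instance, an events table (names illustrative) could carry a top-level DATE field, and that field's name is what you would enter in the Partition Field input:

```json
[
  {"name": "event_id", "type": "STRING", "mode": "REQUIRED"},
  {"name": "event_date", "type": "DATE", "mode": "REQUIRED"},
  {"name": "payload", "type": "JSON", "mode": "NULLABLE"}
]
```

Here `event_date` satisfies all four requirements above: it is a DATE, top-level, commonly filtered on, and unchanged after insert.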
How do I handle schema evolution and changes over time?
Schema Evolution Strategies:
Additive Changes (Safe):
- Add new columns: Always safe, existing queries unaffected
- Make required fields nullable: Allows missing data
- Add new nested fields: Expands JSON/RECORD structures
Safe evolution - add optional field:
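A hedged sketch of such an additive change, assuming a hypothetical two-field schema: the new NULLABLE `loyalty_tier` column is appended at the end, and existing rows simply read NULL for it:

```json
[
  {"name": "user_id", "type": "STRING", "mode": "REQUIRED"},
  {"name": "email", "type": "STRING", "mode": "NULLABLE"},
  {"name": "loyalty_tier", "type": "STRING", "mode": "NULLABLE"}
]
```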
Breaking Changes (Risky):
- Change column types: May break existing queries
- Remove columns: Existing queries will fail
- Rename columns: Applications need updates
- Change nullable to required: May reject existing data
One option for breaking changes is to create a new, versioned table rather than altering the existing one:
- Pros: Clear versioning, no data migration needed
- Cons: Multiple tables to maintain, complex queries
Handling Missing Fields: newly added NULLABLE columns read as NULL for rows written before the change, so downstream queries should tolerate missing values.
Best Practices:
- Start with nullable fields: Add as required later if needed
- Use JSON for flexible data: Schema-less nested structures
- Version your schemas: Document changes over time
- Test schema changes: Validate with sample data first
- Communicate changes: Update team on schema modifications
What are the table naming and organization best practices?
Table Naming Conventions:
Descriptive Structure: name tables after the data source and its processing stage.
Data Processing Stages (examples):
- `google_ads_raw`: Raw imported data
- `google_ads_cleaned`: Processed and validated
- `google_ads_aggregated`: Summarized for reporting
Character Rules:
- Use underscores: Not dashes or spaces
- Lowercase only: BigQuery is case-sensitive
- Start with a letter: Not numbers or symbols
- Max 1024 characters: Practically, keep under 64
- No reserved words: Avoid SQL keywords
Business Function Based: group related tables by team or domain, for example `google_ads_data`, `sales_pipeline`, `analytics_warehouse`. Tables can also be organized by business process or by time granularity.
Anti-Patterns to Avoid:
- Generic names: `table1`, `data`, `temp`
- Abbreviations: `usr_prof` instead of `user_profiles`
- Inconsistent naming: Mix of conventions
- Too long names: Hard to type and read
- Version in names: `table_v2` (use schema evolution instead)
How do I integrate table creation with data pipelines?
Common Pipeline Patterns: table creation typically runs at the start of an ETL pipeline, in dynamic table-creation flows, across multi-environment setups, and as part of automated schema management.
Error Handling Strategies: keep the operation idempotent and plan rollback procedures for partial failures:
- Always enable “Skip Error If Already There”
- Use consistent naming conventions
- Validate inputs before creation
- Handle partial failures gracefully
Integration Patterns:
With Data Sources:
- API to Table: Auto-generate schema from API responses
- File to Table: Detect schema from CSV/JSON files
- Database to Table: Mirror external database schemas
With Downstream Consumers:
- BI Tool Setup: Create fact and dimension tables
- ML Pipeline: Create feature store tables
- Reporting: Create aggregated summary tables
Monitoring:
- Track table creation success/failure
- Monitor schema evolution
- Alert on naming convention violations
- Log table usage patterns
What are the performance and cost optimization tips for BigQuery tables?
Performance Optimization:
Partitioning Strategy:
- Time-based partitioning: Use for time-series data
- Partition pruning: Always filter on partition field
- Avoid cross-partition queries: Stay within date ranges
- Monitor partition skew: Ensure even data distribution
Clustering Strategy:
- Cluster frequently filtered fields: Up to 4 clustering columns
- Order matters: Most selective fields first
- Use with partitioning: Cluster within partitions
- Monitor cluster effectiveness: Check query performance
Query Optimization:
- Select specific columns: Avoid SELECT *
- Use partition filters: Always include date ranges
- Denormalize when needed: Reduce JOINs
- Use approximate functions: APPROX_COUNT_DISTINCT instead of COUNT(DISTINCT)
Storage Cost Optimization:
- Partition old data: Automatic lifecycle management
- Use appropriate types: INTEGER vs STRING for numbers
- Compress JSON data: Smaller storage footprint
- Archive unused tables: Move to cheaper storage
Query Cost Optimization:
- Limit data scanned: Use WHERE clauses effectively
- Cache results: Reuse expensive computations
- Use materialized views: Pre-compute aggregations
- Monitor slot usage: Optimize concurrent queries
Monitoring and Alerts:
- Track query costs: Monitor bytes processed
- Set up billing alerts: Prevent cost overruns
- Monitor table growth: Track storage usage
- Analyze query patterns: Optimize frequent queries
Best Practices Summary:
- Partition time-series data on date/timestamp fields
- Cluster on frequently filtered fields (up to 4 columns)
- Use appropriate data types for storage efficiency
- Design schema for query patterns you’ll actually use
- Set up monitoring and alerting for costs and performance
- Archive or delete unused data regularly
- Test with representative data before production deployment