The Create Table node creates a new table in your BigQuery dataset with flexible schema definition options. You can define the schema manually, generate it automatically from sample data, or create an empty table for later population. This is an AI-powered node that can understand natural language instructions.

When to Use It

  • Set up new data tables in BigQuery datasets
  • Create tables with specific schemas for data ingestion
  • Build partitioned tables for performance optimization
  • Generate table structures from sample data automatically
  • Initialize database schemas as part of data pipeline setup
  • Create empty tables that will be populated by other workflows

Inputs

Field | Type | Required | Description
Project | Select | Yes | Select the Google BigQuery project containing the dataset
Dataset | Select | Yes | Select the dataset to create the table in
Table ID | Text | Yes | Unique identifier for the table within the dataset
Schema Mode | Select | Yes | How to define the table schema (Manual/Auto/Empty)
Schema Definition | JSON | Conditional | Table schema when using Manual mode
Sample Data | JSON | Conditional | Sample data for Auto schema generation
Partition Field | Text | No | Field name for table partitioning (date/timestamp fields)
Skip Error If Already There | Toggle | No | If enabled, won't fail if the table already exists (default: false)

Schema Mode Options

Mode | Description | When to Use
Manual | Define exact schema with fields and types | Precise control over table structure
Auto | Generate schema from sample data | Quick setup with representative data
Empty | Create table without schema | Schema will be added later via inserts

Manual Schema Format

When using Manual mode, provide the schema as a JSON array:
[
  {
    "name": "id",
    "type": "INTEGER",
    "mode": "REQUIRED"
  },
  {
    "name": "name",
    "type": "STRING",
    "mode": "NULLABLE"
  },
  {
    "name": "created_at",
    "type": "TIMESTAMP",
    "mode": "REQUIRED"
  },
  {
    "name": "metadata",
    "type": "JSON",
    "mode": "NULLABLE"
  }
]
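
For reference, a table matching this manual schema corresponds to roughly the following BigQuery DDL (the node issues the equivalent creation for you; the dataset and table names here are illustrative):
-- Illustrative DDL equivalent of the manual schema above.
-- INTEGER maps to INT64 in Standard SQL; REQUIRED maps to NOT NULL.
CREATE TABLE `analytics_data.user_profiles` (
  id         INT64 NOT NULL,
  name       STRING,
  created_at TIMESTAMP NOT NULL,
  metadata   JSON
);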

Sample Data Format (Auto Mode)

Provide representative data for automatic schema detection:
[
  {
    "id": 1,
    "name": "John Doe",
    "created_at": "2024-10-17T10:30:00Z",
    "metadata": {"role": "admin", "active": true}
  },
  {
    "id": 2,
    "name": "Jane Smith",
    "created_at": "2024-10-17T11:45:00Z",
    "metadata": {"role": "user", "active": false}
  }
]
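
From this sample, automatic detection would typically infer a schema along these lines (shown as DDL for illustration only; the exact types the node infers may differ, e.g. the nested object may be detected as JSON rather than a RECORD):
-- Sketch of a plausible auto-detected schema for the sample data above.
CREATE TABLE `analytics_data.user_profiles` (
  id         INT64,
  name       STRING,
  created_at TIMESTAMP,
  metadata   STRUCT<role STRING, active BOOL>  -- nested object detected as a RECORD
);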

Supported Data Types

BigQuery Type | Description | Example Values
STRING | Text data | "John Doe", "Marketing"
INTEGER | Whole numbers | 123, -456
FLOAT | Decimal numbers | 123.45, -67.89
BOOLEAN | True/false values | true, false
TIMESTAMP | Date and time | "2024-10-17T10:30:00Z"
DATE | Date only | "2024-10-17"
TIME | Time only | "10:30:00"
DATETIME | Date and time (no timezone) | "2024-10-17 10:30:00"
JSON | JSON objects | {"key": "value"}
ARRAY | Arrays of values | ["item1", "item2"]
RECORD | Nested structures | Complex objects

Field Modes

Mode | Description | Usage
REQUIRED | Field must have a value | Primary keys, essential data
NULLABLE | Field can be empty | Optional information
REPEATED | Field can have multiple values | Arrays, lists
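
In Standard SQL DDL, a REPEATED field is declared as an ARRAY column. A minimal sketch with an illustrative tags field:
-- REPEATED mode corresponds to an ARRAY column in Standard SQL.
CREATE TABLE `analytics_data.articles` (
  id   INT64 NOT NULL,
  tags ARRAY<STRING>  -- REPEATED STRING field
);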

Output

Returns table creation confirmation and details:
{
  "table_id": "user_profiles",
  "dataset_id": "analytics_data",
  "project_id": "my-project-123",
  "schema_mode": "Manual",
  "schema": [
    {
      "name": "id",
      "type": "INTEGER",
      "mode": "REQUIRED"
    }
  ],
  "partition_field": "created_at",
  "creation_time": "2024-10-17T10:30:00Z",
  "exists_ok_used": false
}

Output Fields:

Field | Description
table_id | The created table identifier
dataset_id | Dataset containing the table
project_id | BigQuery project
schema_mode | How the schema was defined
schema | Final table schema
partition_field | Partitioning field (if used)
creation_time | When the table was created
exists_ok_used | Whether the table already existed

Credit Cost

  • Cost per run: 1 credit

FAQs

Manual Schema Mode:
Best for:
  • Production tables with strict requirements
  • Tables with specific data types or constraints
  • When you know exact schema upfront
  • Tables that need optimal performance
Advantages:
  • Complete control over field types
  • Can specify required vs nullable fields
  • Optimal storage and query performance
  • Clear documentation of data structure
Example Use Cases:
  • User profiles with known fields
  • Financial data requiring precision
  • API response tables with fixed structure
Auto Schema Mode:
Best for:
  • Rapid prototyping and development
  • Exploring new data sources
  • When sample data represents full structure
  • Quick table setup for testing
Advantages:
  • Fast setup with representative data
  • Automatically detects appropriate types
  • Good for iterative development
  • Handles complex nested structures
Example Use Cases:
  • Importing CSV or JSON files
  • Processing API responses
  • Data exploration and analysis
Empty Schema Mode:
Best for:
  • Tables that will be populated later
  • Dynamic schemas determined at runtime
  • When exact structure is unknown
  • Streaming data with varying structures
Advantages:
  • Maximum flexibility
  • Schema evolves with data
  • Works with dynamic data sources
  • Good for unstructured data
Example Use Cases:
  • Event logging tables
  • Raw data ingestion
  • Machine learning feature stores
Schema Design Best Practices:
Choose Appropriate Data Types:
// Good - Specific types
{
  "user_id": "INTEGER",
  "email": "STRING", 
  "signup_date": "DATE",
  "last_login": "TIMESTAMP",
  "is_active": "BOOLEAN",
  "settings": "JSON"
}

// Avoid - Everything as STRING
{
  "user_id": "STRING",  // Should be INTEGER
  "signup_date": "STRING",  // Should be DATE
  "is_active": "STRING"  // Should be BOOLEAN
}
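Typed columns also keep queries simple and let BigQuery filter efficiently; with an all-STRING schema every comparison needs a cast. A hedged comparison (table and column names are illustrative):
-- With proper types: direct comparisons, no casting.
SELECT user_id
FROM `analytics_data.users`
WHERE signup_date >= DATE '2024-01-01'
  AND is_active = TRUE;

-- With everything stored as STRING: every filter needs a cast.
SELECT CAST(user_id AS INT64) AS user_id
FROM `analytics_data.users_as_strings`
WHERE CAST(signup_date AS DATE) >= DATE '2024-01-01'
  AND CAST(is_active AS BOOL) = TRUE;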
Optimize for Query Patterns:
  • Frequently filtered fields: Consider partitioning
  • Join keys: Use consistent types across tables
  • Aggregated fields: Use appropriate numeric types
  • Text search: Use STRING for full-text search
Nested vs Flat Structures:
Use RECORD/JSON for:
  • Related data that’s queried together
  • Semi-structured API responses
  • Configuration objects
  • Metadata that varies by record
Use flat columns for:
  • Frequently filtered fields
  • Fields used in JOIN operations
  • Simple aggregation targets
  • Performance-critical queries
Field Naming Conventions:
  • Use snake_case: user_id not userId
  • Be descriptive: click_timestamp not ts
  • Consistent prefixes: is_active, has_premium
  • Avoid reserved words: Don’t use order, group, etc.
Table Partitioning Benefits:
Performance Improvements:
  • Faster queries: Only scan relevant partitions
  • Lower costs: Pay only for data queried
  • Better parallelization: Process partitions simultaneously
  • Improved maintenance: Delete/update specific time ranges
Cost Optimization:
  • Reduced scanning: Eliminate irrelevant data from queries
  • Storage efficiency: Automatic data organization
  • Query slot optimization: Better resource utilization
When to Use Partitioning:
Time-Series Data:
// Perfect for partitioning
events_table (partitioned by event_date)
├── 2024-10-01: 1M records
├── 2024-10-02: 1.2M records
└── 2024-10-03: 0.8M records

// Query: SELECT * FROM events WHERE event_date = '2024-10-02'
// Only scans 1.2M records instead of 3M
Large Tables with Time Filters:
  • Log data: Application logs, web analytics
  • Transaction data: Sales, payments, user actions
  • IoT data: Sensor readings, device telemetry
  • Social media: Posts, interactions, messages
Partition Field Requirements:
  • DATE or TIMESTAMP: Most common partition types
  • Top-level field: Cannot be nested in RECORD
  • Frequently filtered: Used in WHERE clauses
  • Immutable: Value doesn’t change after insert
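As a concrete sketch, a daily-partitioned table like the one this node creates can be expressed in DDL roughly as follows (dataset, table, and column names are illustrative; the require_partition_filter option is shown for context, not something the node necessarily sets):
-- Daily partitioning on a top-level DATE column.
CREATE TABLE `analytics_data.events` (
  event_date DATE NOT NULL,
  user_id    INT64,
  payload    JSON
)
PARTITION BY event_date
OPTIONS (
  require_partition_filter = TRUE  -- forces queries to prune partitions
);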
Partition Strategies:
Daily Partitioning (most common):
WHERE created_date BETWEEN '2024-10-01' AND '2024-10-07'
-- Scans only 7 days of data
Hourly Partitioning (high-volume data):
WHERE event_timestamp >= '2024-10-17 10:00:00'
-- Scans only from specific hour
Monthly Partitioning (historical data):
WHERE month_year = '2024-10'
-- Scans only October 2024 data
Schema Evolution Strategies:
Additive Changes (Safe):
  • Add new columns: Always safe, existing queries unaffected
  • Make required fields nullable: Allows missing data
  • Add new nested fields: Expands JSON/RECORD structures
Original schema:
{
  "name": "STRING", 
  "age": "INTEGER"
}
Safe evolution - add optional field:
{
  "name": "STRING", 
  "age": "INTEGER", 
  "email": "STRING"
}
Breaking Changes (Risky):
  • Change column types: May break existing queries
  • Remove columns: Existing queries will fail
  • Rename columns: Applications need updates
  • Change nullable to required: May reject existing data
Schema Versioning Approaches:
Version in Table Names:
user_profiles_v1 → user_profiles_v2 → user_profiles_v3
  • Pros: Clear versioning, no data migration needed
  • Cons: Multiple tables to maintain, complex queries
Schema Migration Pattern:
1. Add new column as nullable
2. Backfill existing data
3. Update applications to use new column
4. Make column required (if needed)
5. Remove old column (optional)
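A hedged SQL sketch of steps 1-2 for adding an email column (table, column, and backfill-source names are illustrative):
-- Step 1: add the new column as nullable (additive, safe).
ALTER TABLE `analytics_data.user_profiles`
  ADD COLUMN email STRING;

-- Step 2: backfill existing rows from another table (illustrative source).
UPDATE `analytics_data.user_profiles` p
SET email = s.email
FROM `analytics_data.user_emails` s
WHERE p.id = s.user_id;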
Handling Missing Fields:
-- Use COALESCE for nullable fields
SELECT 
  name,
  COALESCE(email, 'no-email@example.com') as email
FROM user_profiles

-- Use conditional logic
SELECT 
  name,
  CASE 
    WHEN email IS NOT NULL THEN email
    ELSE 'Email not provided'
  END as email_status
FROM user_profiles
Best Practices:
  • Start with nullable fields: Add as required later if needed
  • Use JSON for flexible data: Schema-less nested structures
  • Version your schemas: Document changes over time
  • Test schema changes: Validate with sample data first
  • Communicate changes: Update team on schema modifications
Table Naming Conventions:
Descriptive Structure:
{data_type}_{purpose}_{timeframe}
user_profiles_current
order_transactions_daily
campaign_performance_hourly
Data Processing Stages:
{source}_{stage}
google_ads_raw
google_ads_cleaned
google_ads_aggregated
Examples:
  • google_ads_raw: Raw imported data
  • google_ads_cleaned: Processed and validated
  • google_ads_aggregated: Summarized for reporting
Temporal Organization:
{purpose}_{time_granularity}
events_daily
revenue_monthly
users_snapshot
Business Function Based:
{department}_{entity}_{purpose}
marketing_campaigns_performance
sales_leads_qualified
finance_transactions_reconciled
Character Rules:
  • Use underscores: Not dashes or spaces
  • Lowercase only: BigQuery is case-sensitive
  • Start with letter: Not numbers or symbols
  • Max 1024 characters: In practice, keep names under 64
  • No reserved words: Avoid SQL keywords
Organization Strategies:
By Data Source:
google_ads_data
├── campaigns
├── adgroups  
├── keywords
└── conversions
By Business Process:
sales_pipeline
├── leads_raw
├── leads_qualified
├── opportunities
└── closed_deals
By Time Granularity:
analytics_warehouse
├── events_hourly
├── sessions_daily
├── users_weekly
└── revenue_monthly
Anti-Patterns to Avoid:
  • Generic names: table1, data, temp
  • Abbreviations: usr_prof instead of user_profiles
  • Inconsistent naming: Mix of conventions
  • Too long names: Hard to type and read
  • Version in names: table_v2 (use schema evolution instead)
Common Pipeline Patterns:
ETL Pipeline Setup:
[Extract Data] → [Create Target Table] → [Transform & Load]
→ [Validate Data] → [Update Metadata]
Dynamic Table Creation:
[Analyze Data Source] → [AI Generate Schema] → [Create Table]
→ [Map Source Fields] → [Insert Data]
Multi-Environment Setup:
[Create Dev Table] → [Test Schema] → [Create Staging Table]
→ [Validate Pipeline] → [Create Prod Table]
Automated Schema Management:
[Detect Schema Changes] → [Create New Table Version]
→ [Migrate Data] → [Update References]
Error Handling Strategies:
Idempotent Operations:
  • Always enable “Skip Error If Already There”
  • Use consistent naming conventions
  • Validate inputs before creation
  • Handle partial failures gracefully
Schema Validation:
[Check Existing Schema] → [Compare with New Schema]
→ [Create Table if Different] → [Log Changes]
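To inspect an existing table's schema before deciding whether to create or change one, you can query INFORMATION_SCHEMA (project, dataset, and table names are illustrative):
-- List the current columns of a table for comparison against the new schema.
SELECT column_name, data_type, is_nullable
FROM `my-project-123.analytics_data.INFORMATION_SCHEMA.COLUMNS`
WHERE table_name = 'user_profiles'
ORDER BY ordinal_position;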
Rollback Procedures:
[Backup Existing Table] → [Create New Table]
→ [Test Migration] → [Rollback if Failed]
Integration Patterns:
With Data Sources:
  • API to Table: Auto-generate schema from API responses
  • File to Table: Detect schema from CSV/JSON files
  • Database to Table: Mirror external database schemas
With Analytics Tools:
  • BI Tool Setup: Create fact and dimension tables
  • ML Pipeline: Create feature store tables
  • Reporting: Create aggregated summary tables
Monitoring and Alerting:
  • Track table creation success/failure
  • Monitor schema evolution
  • Alert on naming convention violations
  • Log table usage patterns
Performance Optimization:
Partitioning Strategy:
  • Time-based partitioning: Use for time-series data
  • Partition pruning: Always filter on partition field
  • Avoid cross-partition queries: Stay within date ranges
  • Monitor partition skew: Ensure even data distribution
Clustering (Advanced):
  • Cluster frequently filtered fields: Up to 4 clustering columns
  • Order matters: Most selective fields first
  • Use with partitioning: Cluster within partitions
  • Monitor cluster effectiveness: Check query performance
Schema Design for Performance:
// Optimized schema
{
  "event_date": "DATE",        // Partition field
  "user_id": "INTEGER",        // Cluster field
  "event_type": "STRING",      // Cluster field  
  "properties": "JSON"         // Flexible data
}
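The corresponding DDL, combining the date partition with clustering columns (a sketch using the field names above):
-- Partition on the DATE column, then cluster rows within each partition.
CREATE TABLE `analytics_data.events` (
  event_date DATE NOT NULL,
  user_id    INT64,
  event_type STRING,
  properties JSON
)
PARTITION BY event_date
CLUSTER BY user_id, event_type;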
Query Optimization:
  • Select specific columns: Avoid SELECT *
  • Use partition filters: Always include date ranges
  • Denormalize when needed: Reduce JOINs
  • Use approximate functions: Prefer APPROX_COUNT_DISTINCT over COUNT(DISTINCT) when exact counts aren't required
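For example, on high-cardinality columns the approximate variant is usually far cheaper (table name is illustrative):
-- Exact distinct count: accurate, but slot- and memory-intensive at scale.
SELECT COUNT(DISTINCT user_id) AS exact_users
FROM `analytics_data.events`
WHERE event_date = '2024-10-17';

-- Approximate distinct count: small error, much cheaper on large tables.
SELECT APPROX_COUNT_DISTINCT(user_id) AS approx_users
FROM `analytics_data.events`
WHERE event_date = '2024-10-17';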
Cost Optimization:
Storage Costs:
  • Partition old data: Automatic lifecycle management
  • Use appropriate types: INTEGER vs STRING for numbers
  • Compress JSON data: Smaller storage footprint
  • Archive unused tables: Move to cheaper storage
Query Costs:
  • Limit data scanned: Use WHERE clauses effectively
  • Cache results: Reuse expensive computations
  • Use materialized views: Pre-compute aggregations (see the sketch after this list)
  • Monitor slot usage: Optimize concurrent queries
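A minimal materialized-view sketch that pre-computes a daily aggregate so dashboards don't rescan the base table (names are illustrative):
-- Pre-computed daily rollup; BigQuery keeps it refreshed automatically.
CREATE MATERIALIZED VIEW `analytics_data.daily_event_counts` AS
SELECT
  event_date,
  event_type,
  COUNT(*) AS event_count
FROM `analytics_data.events`
GROUP BY event_date, event_type;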
Data Lifecycle Management:
-- Set table expiration
CREATE TABLE dataset.temp_table
OPTIONS(
  expiration_timestamp=TIMESTAMP_ADD(CURRENT_TIMESTAMP(), INTERVAL 7 DAY)
);

-- Partition expiration
ALTER TABLE dataset.events_table
SET OPTIONS(
  partition_expiration_days=90
);
Monitoring and Alerts:
  • Track query costs: Monitor bytes processed
  • Set up billing alerts: Prevent cost overruns
  • Monitor table growth: Track storage usage
  • Analyze query patterns: Optimize frequent queries
Best Practices Summary:
  1. Partition time-series data on date/timestamp fields
  2. Cluster on frequently filtered fields (up to 4 columns)
  3. Use appropriate data types for storage efficiency
  4. Design schema for query patterns you’ll actually use
  5. Set up monitoring and alerting for costs and performance
  6. Archive or delete unused data regularly
  7. Test with representative data before production deployment