The Create Table node creates a new table in your BigQuery dataset with flexible schema definition options. You can define schemas manually, automatically from sample data, or create empty tables for later population. This is an AI-powered node that can understand natural language instructions.Documentation Index
Fetch the complete documentation index at: https://docs.markifact.com/llms.txt
Use this file to discover all available pages before exploring further.
When to Use It
- Set up new data tables in BigQuery datasets
- Create tables with specific schemas for data ingestion
- Build partitioned tables for performance optimization
- Generate table structures from sample data automatically
- Initialize database schemas as part of data pipeline setup
- Create empty tables that will be populated by other workflows
Inputs
| Field | Type | Required | Description |
|---|---|---|---|
| Project | Select | Yes | Select the Google BigQuery project containing the dataset |
| Dataset | Select | Yes | Select the dataset to create the table in |
| Table ID | Text | Yes | Unique identifier for the table within the dataset |
| Schema Mode | Select | Yes | How to define the table schema (Manual/Auto/Empty) |
| Schema Definition | JSON | Conditional | Table schema when using Manual mode |
| Sample Data | JSON | Conditional | Sample data for Auto schema generation |
| Partition Field | Text | No | Field name for table partitioning (date/timestamp fields) |
| Skip Error If Already There | Toggle | No | If enabled, won’t fail if table already exists (default: false) |
Schema Mode Options
| Mode | Description | When to Use |
|---|---|---|
| Manual | Define exact schema with fields and types | Precise control over table structure |
| Auto | Generate schema from sample data | Quick setup with representative data |
| Empty | Create table without schema | Schema will be added later via inserts |
Manual Schema Format
When using Manual mode, provide schema as JSON array:Sample Data Format (Auto Mode)
Provide representative data for automatic schema detection:Supported Data Types
| BigQuery Type | Description | Example Values |
|---|---|---|
| STRING | Text data | ”John Doe”, “Marketing” |
| INTEGER | Whole numbers | 123, -456 |
| FLOAT | Decimal numbers | 123.45, -67.89 |
| BOOLEAN | True/false values | true, false |
| TIMESTAMP | Date and time | ”2024-10-17T10:30:00Z” |
| DATE | Date only | ”2024-10-17” |
| TIME | Time only | ”10:30:00” |
| DATETIME | Date and time (no timezone) | “2024-10-17 10:30:00” |
| JSON | JSON objects | {"key": "value"} |
| ARRAY | Arrays of values | ["item1", "item2"] |
| RECORD | Nested structures | Complex objects |
Field Modes
| Mode | Description | Usage |
|---|---|---|
| REQUIRED | Field must have a value | Primary keys, essential data |
| NULLABLE | Field can be empty | Optional information |
| REPEATED | Field can have multiple values | Arrays, lists |
Output
Returns table creation confirmation and details:Output Fields:
| Field | Description |
|---|---|
| table_id | The created table identifier |
| dataset_id | Dataset containing the table |
| project_id | BigQuery project |
| schema_mode | How schema was defined |
| schema | Final table schema |
| partition_field | Partitioning field (if used) |
| creation_time | When table was created |
| exists_ok_used | Whether table already existed |
Credit Cost
- Cost per run: 1 credit
FAQs
When should I use Manual vs Auto vs Empty schema modes?
When should I use Manual vs Auto vs Empty schema modes?
Manual Schema Mode:
Best for:
- Production tables with strict requirements
- Tables with specific data types or constraints
- When you know exact schema upfront
- Tables that need optimal performance
- Complete control over field types
- Can specify required vs nullable fields
- Optimal storage and query performance
- Clear documentation of data structure
- User profiles with known fields
- Financial data requiring precision
- API response tables with fixed structure
- Rapid prototyping and development
- Exploring new data sources
- When sample data represents full structure
- Quick table setup for testing
- Fast setup with representative data
- Automatically detects appropriate types
- Good for iterative development
- Handles complex nested structures
- Importing CSV or JSON files
- Processing API responses
- Data exploration and analysis
- Tables that will be populated later
- Dynamic schemas determined at runtime
- When exact structure is unknown
- Streaming data with varying structures
- Maximum flexibility
- Schema evolves with data
- Works with dynamic data sources
- Good for unstructured data
- Event logging tables
- Raw data ingestion
- Machine learning feature stores
How do I design effective table schemas for BigQuery?
How do I design effective table schemas for BigQuery?
Schema Design Best Practices:Choose Appropriate Data Types:Optimize for Query Patterns:
- Frequently filtered fields: Consider partitioning
- Join keys: Use consistent types across tables
- Aggregated fields: Use appropriate numeric types
- Text search: Use STRING for full-text search
- Related data that’s queried together
- Semi-structured API responses
- Configuration objects
- Metadata that varies by record
- Frequently filtered fields
- Fields used in JOIN operations
- Simple aggregation targets
- Performance-critical queries
- Use snake_case:
user_idnotuserId - Be descriptive:
click_timestampnotts - Consistent prefixes:
is_active,has_premium - Avoid reserved words: Don’t use
order,group, etc.
What are the benefits of table partitioning and when should I use it?
What are the benefits of table partitioning and when should I use it?
Table Partitioning Benefits:Performance Improvements:Large Tables with Time Filters:Hourly Partitioning (high-volume data):Monthly Partitioning (historical data):
- Faster queries: Only scan relevant partitions
- Lower costs: Pay only for data queried
- Better parallelization: Process partitions simultaneously
- Improved maintenance: Delete/update specific time ranges
- Reduced scanning: Eliminate irrelevant data from queries
- Storage efficiency: Automatic data organization
- Query slot optimization: Better resource utilization
- Log data: Application logs, web analytics
- Transaction data: Sales, payments, user actions
- IoT data: Sensor readings, device telemetry
- Social media: Posts, interactions, messages
- DATE or TIMESTAMP: Most common partition types
- Top-level field: Cannot be nested in RECORD
- Frequently filtered: Used in WHERE clauses
- Immutable: Value doesn’t change after insert
How do I handle schema evolution and changes over time?
How do I handle schema evolution and changes over time?
Schema Evolution Strategies:Additive Changes (Safe):Safe evolution - add optional field:Breaking Changes (Risky):Handling Missing Fields:Best Practices:
- Add new columns: Always safe, existing queries unaffected
- Make required fields nullable: Allows missing data
- Add new nested fields: Expands JSON/RECORD structures
- Change column types: May break existing queries
- Remove columns: Existing queries will fail
- Rename columns: Applications need updates
- Change nullable to required: May reject existing data
- Pros: Clear versioning, no data migration needed
- Cons: Multiple tables to maintain, complex queries
- Start with nullable fields: Add as required later if needed
- Use JSON for flexible data: Schema-less nested structures
- Version your schemas: Document changes over time
- Test schema changes: Validate with sample data first
- Communicate changes: Update team on schema modifications
What are the table naming and organization best practices?
What are the table naming and organization best practices?
Table Naming Conventions:Descriptive Structure:Data Processing Stages:Examples:Business Function Based:Character Rules:By Business Process:By Time Granularity:Anti-Patterns to Avoid:
google_ads_raw: Raw imported datagoogle_ads_cleaned: Processed and validatedgoogle_ads_aggregated: Summarized for reporting
- Use underscores: Not dashes or spaces
- Lowercase only: BigQuery is case-sensitive
- Start with letter: Not numbers or symbols
- Max 1024 characters: Practically keep under 64
- No reserved words: Avoid SQL keywords
google_ads_data
sales_pipeline
analytics_warehouse
- Generic names:
table1,data,temp - Abbreviations:
usr_profinstead ofuser_profiles - Inconsistent naming: Mix of conventions
- Too long names: Hard to type and read
- Version in names:
table_v2(use schema evolution instead)
How do I integrate table creation with data pipelines?
How do I integrate table creation with data pipelines?
Common Pipeline Patterns:ETL Pipeline Setup:Dynamic Table Creation:Multi-Environment Setup:Automated Schema Management:Error Handling Strategies:Idempotent Operations:Rollback Procedures:Integration Patterns:With Data Sources:
- Always enable “Skip Error If Already There”
- Use consistent naming conventions
- Validate inputs before creation
- Handle partial failures gracefully
- API to Table: Auto-generate schema from API responses
- File to Table: Detect schema from CSV/JSON files
- Database to Table: Mirror external database schemas
- BI Tool Setup: Create fact and dimension tables
- ML Pipeline: Create feature store tables
- Reporting: Create aggregated summary tables
- Track table creation success/failure
- Monitor schema evolution
- Alert on naming convention violations
- Log table usage patterns
What are the performance and cost optimization tips for BigQuery tables?
What are the performance and cost optimization tips for BigQuery tables?
Performance Optimization:Partitioning Strategy:Query Optimization:Monitoring and Alerts:
- Time-based partitioning: Use for time-series data
- Partition pruning: Always filter on partition field
- Avoid cross-partition queries: Stay within date ranges
- Monitor partition skew: Ensure even data distribution
- Cluster frequently filtered fields: Up to 4 clustering columns
- Order matters: Most selective fields first
- Use with partitioning: Cluster within partitions
- Monitor cluster effectiveness: Check query performance
- Select specific columns: Avoid SELECT *
- Use partition filters: Always include date ranges
- Denormalize when needed: Reduce JOINs
- Use APPROXIMATE functions: COUNT(DISTINCT) vs APPROX_COUNT_DISTINCT
- Partition old data: Automatic lifecycle management
- Use appropriate types: INTEGER vs STRING for numbers
- Compress JSON data: Smaller storage footprint
- Archive unused tables: Move to cheaper storage
- Limit data scanned: Use WHERE clauses effectively
- Cache results: Reuse expensive computations
- Use materialized views: Pre-compute aggregations
- Monitor slot usage: Optimize concurrent queries
- Track query costs: Monitor bytes processed
- Set up billing alerts: Prevent cost overruns
- Monitor table growth: Track storage usage
- Analyze query patterns: Optimize frequent queries
- Partition time-series data on date/timestamp fields
- Cluster on frequently filtered fields (up to 4 columns)
- Use appropriate data types for storage efficiency
- Design schema for query patterns you’ll actually use
- Set up monitoring and alerting for costs and performance
- Archive or delete unused data regularly
- Test with representative data before production deployment

