When to Use It
- Set up new data warehousing projects in BigQuery
- Organize tables by business unit, data source, or project
- Create isolated environments for development, testing, and production
- Establish data governance boundaries with different access controls
- Build automated data pipeline setup workflows
- Initialize BigQuery infrastructure as part of larger workflows
Inputs
Field | Type | Required | Description |
---|---|---|---|
Project | Select | Yes | Select the Google BigQuery project to create the dataset in |
Dataset ID | Text | Yes | Unique identifier for the dataset (alphanumeric and underscores only) |
Location | Text | No | Geographic location for the dataset (e.g., US, EU, asia-southeast1) |
Description | Text | No | Optional description to document the dataset’s purpose |
Skip Error If Already There | Toggle | No | If enabled, won’t fail if dataset already exists (default: false) |
Dataset ID Requirements
- Characters: Letters, numbers, and underscores only
- Length: Up to 1024 characters
- Case sensitive: `MyDataset` and `mydataset` are different
- Uniqueness: Must be unique within the project
- No spaces: Use underscores instead of spaces
Good examples: `marketing_data`, `sales_2024`, `user_analytics`
Bad examples: `marketing data`, `sales-2024`, `user@analytics`
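If dataset IDs are generated dynamically (for example, from client or campaign names), a quick pre-check against the rules above avoids failed runs. A minimal sketch in Python; the function name is illustrative:

```python
import re

# Mirrors the rules above: letters, digits, and underscores, 1-1024 characters.
DATASET_ID_PATTERN = re.compile(r"^[A-Za-z0-9_]{1,1024}$")

def is_valid_dataset_id(dataset_id: str) -> bool:
    """Return True if the ID satisfies the character and length rules."""
    return bool(DATASET_ID_PATTERN.match(dataset_id))

# Quick check against the examples above.
for candidate in ["marketing_data", "sales_2024", "marketing data", "sales-2024"]:
    print(candidate, is_valid_dataset_id(candidate))
```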
Location Options
Location | Description | Use Case |
---|---|---|
US | Multi-region in United States | Default, best for US-based operations |
EU | Multi-region in European Union | GDPR compliance, EU operations |
asia-southeast1 | Singapore | Asia-Pacific operations |
us-central1 | Iowa, USA | Specific US region |
europe-west1 | Belgium | Specific EU region |
When choosing a location, consider:
- Data residency requirements
- Performance (closer to users/applications)
- Compliance regulations (GDPR, etc.)
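For reference, location is supplied once at creation time in the underlying BigQuery API and cannot be changed afterwards. A minimal sketch with the google-cloud-bigquery Python client, using placeholder project and dataset names:

```python
from google.cloud import bigquery

client = bigquery.Client(project="my-project")  # placeholder project ID

dataset = bigquery.Dataset("my-project.marketing_data")
dataset.location = "EU"  # fixed at creation; cannot be changed later
dataset.description = "Marketing analytics tables"

created = client.create_dataset(dataset)
print(created.dataset_id, created.location)
```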
Output
Returns dataset creation confirmation and details.
Output Fields:
Field | Description |
---|---|
dataset_id | The created dataset identifier |
project_id | BigQuery project containing the dataset |
location | Geographic location of the dataset |
creation_time | When the dataset was created |
description | Dataset description (if provided) |
exists_ok_used | Whether the dataset already existed |
Credit Cost
- Cost per run: 1 credit
FAQs
What happens if the dataset already exists?
Default Behavior (Skip Error If Already There = false):
- The operation will fail with an error
- Workflow will stop execution
- Useful for ensuring new dataset creation
Skip Enabled (Skip Error If Already There = true):
- Operation succeeds even if dataset exists
- No changes made to existing dataset
- `exists_ok_used: true` appears in the output
- Workflow continues normally
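BigQuery's client libraries expose this behavior as an `exists_ok` flag, and the `exists_ok_used` output field suggests the toggle maps to it. A minimal sketch of the equivalent call with the google-cloud-bigquery Python client (placeholder names):

```python
from google.cloud import bigquery

client = bigquery.Client(project="my-project")  # placeholder project ID

# exists_ok=True behaves like "Skip Error If Already There" enabled:
# the call succeeds and the existing dataset is returned unchanged.
dataset = client.create_dataset("my-project.marketing_data", exists_ok=True)

# With exists_ok=False (the default), an existing dataset raises
# google.api_core.exceptions.Conflict and the workflow stops.
print(dataset.dataset_id)
```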
How do I choose the right location for my dataset?
Consider These Factors:
Data Residency:
- GDPR compliance: Use EU locations for European user data
- Local regulations: Some countries require data to stay within borders
- Company policies: Internal data governance requirements
Performance:
- User proximity: Choose location closest to end users
- Application location: Co-locate with your applications
- Data sources: Near where your data originates
Cost:
- Multi-region: Higher availability, slightly higher cost
- Single region: Lower cost, regional availability
- Egress charges: Consider data export costs
Recommendations:
- Global business: US (multi-region) for flexibility
- EU operations: EU (multi-region) for compliance
- Asian markets: asia-southeast1 or other Asian regions
- Cost-sensitive: Specific single regions
What's the difference between dataset and table organization?
Dataset Level (Container):
- Purpose: High-level organization and access control
- Contains: Multiple related tables
- Access control: IAM permissions at dataset level
- Location: Fixed geographic location
- Billing: Costs roll up to dataset level
Table Level:
- Purpose: Actual data storage and schema definition
- Contains: Rows and columns of data
- Access control: Inherits from dataset (can be restricted further)
- Location: Same as parent dataset
- Billing: Storage and query costs
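To make the container relationship concrete, here is a small sketch with the google-cloud-bigquery Python client showing how a table is always addressed through its dataset; all names are hypothetical:

```python
from google.cloud import bigquery

# The dataset is the container; tables live inside it and inherit its location.
dataset_ref = bigquery.DatasetReference("my-project", "marketing_data")  # placeholder names
table_ref = dataset_ref.table("campaign_stats")                          # hypothetical table

# Every table is addressed as project.dataset.table, in queries and in the API.
print(table_ref.project, table_ref.dataset_id, table_ref.table_id)
query = "SELECT COUNT(*) AS row_count FROM `my-project.marketing_data.campaign_stats`"
```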
Can I modify dataset settings after creation?
Modifiable After Creation:
- Description: Can be updated anytime
- Access controls: IAM permissions can be changed
- Labels: Can add/modify/remove labels
- Default table expiration: Can be set or changed
Cannot Be Changed:
- Dataset ID: Cannot be renamed (must recreate)
- Location: Cannot be changed (must recreate)
- Project: Cannot move between projects
Plan Ahead:
- Plan dataset ID carefully: Include version numbers if needed
- Choose location wisely: Cannot be changed later
- Use descriptive names: Make purpose clear from the name
- Document thoroughly: Use descriptions and labels
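If you need to change a mutable property later, the update can be done through the API as well as the console. A minimal sketch with the google-cloud-bigquery Python client (names and values are placeholders):

```python
from google.cloud import bigquery

client = bigquery.Client(project="my-project")  # placeholder project ID

dataset = client.get_dataset("my-project.marketing_data")

# Mutable properties: description, labels, default table expiration, access controls.
dataset.description = "Marketing analytics tables (owned by the growth team)"
dataset.default_table_expiration_ms = 90 * 24 * 60 * 60 * 1000  # 90 days

# Only the listed fields are sent in the update request.
client.update_dataset(dataset, ["description", "default_table_expiration_ms"])
```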
How do I set up proper access controls for datasets?
BigQuery IAM Roles for Datasets:
Read Access:
- BigQuery Data Viewer: Read tables and run queries
- BigQuery User: Read + create temporary tables
Write Access:
- BigQuery Data Editor: Read + write + delete data
- BigQuery Admin: Full control including schema changes
Administrative Access:
- BigQuery Admin: Full dataset management
- BigQuery Resource Admin: Manage datasets and jobs
Security Best Practices:
- Principle of least privilege: Grant minimum necessary access
- Use service accounts: For automated workflows
- Regular audits: Review and update permissions
- Monitor usage: Track who accesses what data
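Dataset-level grants can also be managed programmatically through the dataset's access entries, which correspond to the basic reader/writer/owner roles rather than the finer-grained IAM roles above. A minimal sketch with the google-cloud-bigquery Python client (the principal is a placeholder):

```python
from google.cloud import bigquery

client = bigquery.Client(project="my-project")  # placeholder project ID
dataset = client.get_dataset("my-project.marketing_data")

# Append a read-only grant for a specific user; existing entries are preserved.
entries = list(dataset.access_entries)
entries.append(
    bigquery.AccessEntry(
        role="READER",                    # dataset-level basic role
        entity_type="userByEmail",
        entity_id="analyst@example.com",  # placeholder principal
    )
)
dataset.access_entries = entries
client.update_dataset(dataset, ["access_entries"])
```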
What naming conventions should I follow for datasets?
Recommended Naming Patterns:
Best Practices:
- Use underscores: Not dashes or spaces
- Be consistent: Follow same pattern across organization
- Include context: Make purpose clear
- Plan for growth: Consider future datasets
- Avoid abbreviations: Use clear, full words
- Include environment: Distinguish prod/staging/dev
Example Patterns:
- Agency: `client_name_data_type` (acme_google_ads)
- Enterprise: `dept_function_env` (marketing_analytics_prod)
- Startup: `data_source_purpose` (ads_performance, user_behavior)
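If a workflow builds dataset IDs from free-form inputs (client names, data sources), a small normalization helper keeps them inside one of the patterns above. A sketch; the pattern, function name, and inputs are all assumptions for illustration:

```python
import re

def build_dataset_id(client: str, data_type: str, env: str = "prod") -> str:
    """Assemble an ID like acme_inc_google_ads_prod from free-form inputs (hypothetical pattern)."""
    cleaned = []
    for part in (client, data_type, env):
        # Lowercase, replace runs of invalid characters with underscores.
        part = re.sub(r"[^A-Za-z0-9]+", "_", part.strip().lower()).strip("_")
        cleaned.append(part)
    return "_".join(cleaned)

print(build_dataset_id("Acme Inc.", "Google Ads"))          # acme_inc_google_ads_prod
print(build_dataset_id("Beta Co", "user behavior", "dev"))  # beta_co_user_behavior_dev
```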
How do I automate dataset creation in workflows?
Error Handling Strategies:
- Always enable “Skip Error If Already There” for recurring workflows
- Validate names before creation to avoid failures
- Plan rollback procedures for failed setups
- Monitor creation success and alert on failures
Common Automation Patterns:
- Create Dataset → Create Table → Insert Rows
- List Datasets → Conditional Logic → Create Dataset
- Create Dataset → Set IAM Permissions → Notify Stakeholders
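As a reference point for the first sequence (Create Dataset → Create Table → Insert Rows), here is a minimal end-to-end sketch using the google-cloud-bigquery Python client directly; every name and the schema are placeholders:

```python
from google.cloud import bigquery

client = bigquery.Client(project="my-project")  # placeholder project ID

# 1. Create the dataset, tolerating reruns (mirrors "Skip Error If Already There").
dataset = bigquery.Dataset("my-project.marketing_data")
dataset.location = "US"
dataset = client.create_dataset(dataset, exists_ok=True)

# 2. Create a table with a placeholder schema.
schema = [
    bigquery.SchemaField("campaign", "STRING"),
    bigquery.SchemaField("clicks", "INTEGER"),
]
table = bigquery.Table("my-project.marketing_data.campaign_stats", schema=schema)
table = client.create_table(table, exists_ok=True)

# 3. Insert rows via the streaming API; errors are reported per row.
errors = client.insert_rows_json(table, [{"campaign": "spring_launch", "clicks": 120}])
if errors:
    raise RuntimeError(f"Row insert failed: {errors}")
```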