ChatGPT prompts to obtain machine learning datasets

Posted by Youngchatgpt 
December 07, 2024 12:20AM
As machine learning develops, obtaining high-quality datasets becomes increasingly important. Datasets are a prerequisite for any machine learning project and are crucial for evaluating the accuracy and effectiveness of the final model. In this article, we will learn how to collect datasets for different machine learning applications using ChatGPT (OpenAI) template prompts, and how to collect these datasets in Python.

Steps to Generate Dataset Using ChatGPT
Step 1: Install OpenAI Library in Python
!pip install -q openai
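Note that the helper in Step 4 uses the legacy `openai.ChatCompletion` interface, which was removed in version 1.0 of the `openai` package. If you want to run the code exactly as shown, one option (a workaround sketch, assuming a notebook environment) is to pin an older release:
```
!pip install -q "openai<1.0"
```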
Step 2: Import OpenAI Library in Python
Python3

import openai
Step 3: Assign your API key to the OpenAI library
Python3

openai.api_key = "YOUR_API_KEY"
Step 4: Create a custom function to call ChatGPT API
Python3

def chat(message):
    # Send a single user message to the Chat Completions API and return the reply text.
    response = openai.ChatCompletion.create(
        model="gpt-3.5-turbo",
        messages=[
            {"role": "user", "content": f"{message}"},
        ]
    )
    return response['choices'][0]['message']['content']
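If you are on version 1.0 or later of the `openai` package, `openai.ChatCompletion` no longer exists. A minimal equivalent sketch using the newer client interface (assuming `OPENAI_API_KEY` is set in the environment) looks like this:
```
from openai import OpenAI

client = OpenAI()  # picks up OPENAI_API_KEY from the environment

def chat(message):
    # Same idea as above: send one user message and return the text reply.
    response = client.chat.completions.create(
        model="gpt-3.5-turbo",
        messages=[{"role": "user", "content": message}],
    )
    return response.choices[0].message.content
```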
Step 5: Call the function and pass in the prompt
res = chat('message')
print(res)
Prompts to Collect/Generate Machine Learning Datasets
Prompt 1:

Create a list of datasets that can be used to train a model on {topic}. Make sure the datasets are available in CSV format. The goal is to learn about {topic} using these datasets. Also, if possible, provide a link to each dataset. Create the list in tabular form with the following columns: Dataset name, dataset, URL, dataset description

Python3

prompt = '''
Create a list of datasets that can be used to train logistic regression models.
Ensure that the datasets are available in CSV format.
The objective is to use this dataset to learn about logistic regression models
and related nuances such as training the models. Also provide links to the dataset if possible.
Create the list in tabular form with the following columns:
Dataset name, dataset, URL, dataset description
'''
res = chat(prompt)
print(res)

Output:

Dataset name | Dataset | URL | Dataset description
--- | --- | --- | ---
Titanic - Machine Learning from Disaster | titanic.csv | [www.kaggle.com] | Contains data on Titanic passengers, including features such as age, gender, and cabin, and whether they survived.
Wine Quality | winequality-red.csv | [archive.ics.uci.edu] | Contains data on various physical and chemical properties of red wine and their associated quality ratings.
Bank Marketing | bank-additional-full.csv | [archive.ics.uci.edu] | Contains information on a bank's telemarketing campaigns, including the contact details of customers and whether they subscribed to a term deposit.
Breast Cancer Wisconsin (Diagnostic) | wdbc.csv | [archive.ics.uci.edu] | Contains data on various features extracted from digitized images of breast cancer biopsies, and whether the biopsies were benign or malignant.
Adult | adult.csv | [archive.ics.uci.edu] | Contains demographic data on individuals, and whether their income is above a certain threshold.
Heart Disease | heart.csv | [www.kaggle.com] | Contains data on various medical measurements taken on individuals, and whether they have heart disease.
Pima Indians Diabetes | pima-indians-diabetes.csv | [www.kaggle.com] | Contains data on various medical measurements taken on Pima Indian women, and whether they have diabetes.
Iris | iris.csv | [archive.ics.uci.edu] | Contains data on various measurements taken on Iris flowers, and what species they are.
Loan Prediction | train.csv | [datahack.analyticsvidhya.com] | Contains various demographic data on loan applicants, and whether their applications were approved.
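As a quick sanity check on such a list, you can try loading one of the suggested files directly in Python. The sketch below assumes the UCI Iris data is still served at its long-standing URL and that `pandas` is installed; swap in whichever link ChatGPT actually returned.
```
import pandas as pd

# Long-standing UCI mirror of the Iris data (assumed to still be available).
url = "https://archive.ics.uci.edu/ml/machine-learning-databases/iris/iris.data"
columns = ["sepal_length", "sepal_width", "petal_length", "petal_width", "species"]

# The raw file has no header row, so supply the column names ourselves.
iris = pd.read_csv(url, header=None, names=columns)
print(iris.head())
```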
Prompt 2:

Generate a dummy dataset to train and test {machine learning model name} for educational purposes.

Python3

res = chat('generate a dummy dataset to train and test a logistic regression model for educational purposes. Ensure that the dataset is available in csv format')
print(res)

Output:

Below is an example of a dummy dataset in CSV format for educational purposes:
```
Age, gender, income, education, employment status, marital status, loan approval
23, male, 25000, high school, unemployed, single, unapproved
32, female, 45000, bachelor's degree, working, married, unapproved
45, male, 120000, master's degree, working, married, approved
38, female, 60000, bachelor's degree, working, married, approved
26, male, 32000, college degree, working, unmarried, unapproved
29, female, 28000, high school, working, single, unapproved
41, male, 80000, doctorate, working, divorced, approved
54, male, 95000, master's degree, working, married, approved
```
This dataset contains demographic and financial information for 8 people and whether they were approved for a loan. The goal is to train a logistic regression model to predict loan approval from the other variables.
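Since the model returns the dataset as plain text, one way to use it is to parse the CSV block into a DataFrame and fit a model. The sketch below assumes `pandas` and `scikit-learn` are installed and uses a hard-coded copy of the generated rows (column names lightly normalized); in practice you would extract the CSV block from `res` instead.
```
from io import StringIO

import pandas as pd
from sklearn.linear_model import LogisticRegression

# A hard-coded copy of the generated CSV (in practice, extract this block from `res`).
csv_text = """age,gender,income,education,employment_status,marital_status,loan_approval
23,male,25000,high school,unemployed,single,unapproved
32,female,45000,bachelor's degree,working,married,unapproved
45,male,120000,master's degree,working,married,approved
38,female,60000,bachelor's degree,working,married,approved
26,male,32000,college degree,working,unmarried,unapproved
29,female,28000,high school,working,single,unapproved
41,male,80000,doctorate,working,divorced,approved
54,male,95000,master's degree,working,married,approved
"""

df = pd.read_csv(StringIO(csv_text))

# One-hot encode the categorical columns and turn the target into 0/1.
X = pd.get_dummies(df.drop(columns=["loan_approval"]))
y = (df["loan_approval"] == "approved").astype(int)

model = LogisticRegression(max_iter=1000)
model.fit(X, y)
print(model.score(X, y))  # training accuracy on this tiny toy set
```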
Prompt 3:

List datasets that can be used to practice {topic}. If possible, also include a link to each dataset and a description. Create the list in tabular format.

Python3

prompt = '''
List datasets that can be used to practice machine translation from English to Hindi.
If possible, also include a link to each dataset and a description.
Create the list in tabular format.
'''
res = chat(prompt)
print(res)

Output:

1. TED Talks Corpus: This dataset contains parallel transcripts of TED talks in English and Hindi. It is available in text format and can be downloaded from the official website: [www.ted.com]
2. United Nations Parallel Corpus: This corpus contains parallel texts of speeches delivered by United Nations delegates in Hindi and English. It is available in text format and can be downloaded from the official website: [conferences.unite.un.org]
3. OPUS Corpus: This corpus contains parallel texts in multiple languages such as Hindi and English. It includes data in a wide range of areas such as news, legal documents, and subtitles. It is available in text format and can be downloaded from the official website: [op]
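All three prompts in this article are templates with a {topic} placeholder, so it is easy to wrap them in a small helper and reuse them. The sketch below is only an illustration: the `dataset_prompt` function and the example topics are hypothetical, and it reuses the `chat()` helper defined in Step 4.
```
def dataset_prompt(topic):
    # Fill the {topic} placeholder from Prompt 3 with a concrete subject.
    return (
        f"List datasets that can be used to practice {topic}. "
        "If possible, also include a link to each dataset and a description. "
        "Create the list in tabular format."
    )

# Reuse the chat() helper defined earlier with different topics.
for topic in ["English to Hindi machine translation", "time series forecasting"]:
    print(chat(dataset_prompt(topic)))
```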